<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: Nutri-Grade labels

---
## Problem Statement
Singaporeans are living longer but spending more time in ill health. The top 3 chronic medical conditions that Singaporeans suffer from are hypertension, diabetes and hyperlipidemia.<br>

There are 3 main ways to prevent chronic illness:
- Physical Activity (Engage in at least 150-300 minutes of moderate-intensity aerobic activity in a week)
- Diet (Consume the recommended dietary allowances for sugar, saturated fat and salt)
- Healthy life choices (Avoid tobacco and excessive drinking)<br>

We will focus on diet. More than half of Singaporeans’ daily sugar intake comes from beverages. This is why the government introduced the Nutri-Grade labelling system, in the hope that Singaporeans will reduce their sugar intake by making healthier choices when deciding which drink to buy. However, Nutri-Grade labels only take into account sugar and saturated fat, so they do not give a holistic picture of how healthy a drink is. Is there a way to create a more comprehensive indicator of how healthy drinks are?

## Contents:
- [Import libraries](#Import-libraries) 
- [Data scraping](#Data-scraping)

# Import libraries

In [1]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

# Data scraping

Here, we use Selenium and Beautiful Soup to scrape the nutritional information of drinks from a supermarket website. The supermarket we have chosen is NTUC FairPrice. We collect the information from 4 categories of drinks:
- Coffee
- Tea
- Juice
- Drinks
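
The scraping cells that follow scroll a fixed number of times (30 to 80) with a 10-second pause after each scroll. As a minimal sketch of an alternative, the helper below stops as soon as the page height stops growing. The function name `scroll_until_stable` and its `pause` and `max_scrolls` parameters are hypothetical and not part of the original notebook; it only relies on the `execute_script` method that the notebook already calls on the driver.

```python
import time


def scroll_until_stable(driver, pause=10.0, max_scrolls=80):
    """Scroll to the bottom repeatedly until the page height stops changing.

    `driver` is any object with a Selenium-style `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        # Give lazily loaded products time to appear
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so assume the end of the listing
            break
        last_height = new_height
    return scrolls
```

One caveat with this approach: if the site is slow to respond, the height may be unchanged after one pause even though more products exist, so the pause should stay generous.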

In [8]:
# Initialize the Selenium WebDriver
driver = webdriver.Chrome()

# Navigate to the webpage
url = 'https://www.fairprice.com.sg/category/coffee'
driver.get(url)

# Wait for the page to load
time.sleep(10)

# Scroll to load more products
num_scrolls = 40
for _ in range(num_scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

# Extract links to individual product pages
product_links = []
soup = BeautifulSoup(driver.page_source, 'html.parser')
product_cards = soup.find_all('div', class_='sc-2a85da88-0 fhLHEV product-container')

# Assuming product_cards contains the relevant div elements with class 'sc-2a85da88-0'
for card in product_cards:
    a_tag = card.find('a', class_='sc-2a85da88-3')
    if a_tag:
        product_link = a_tag['href']
        product_links.append(product_link)

# Extract nutritional data for each product
all_nutritional_data = []

for product_link in product_links:
    product_url = f'https://www.fairprice.com.sg{product_link}'
    driver.get(product_url)
    time.sleep(4)

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    nutritional_data = {}

    # Extracting the drink's volume
    volume = soup.find('span', class_='sc-aa673588-1 sc-d5ac8310-3 kZssPC jGBApJ')
    if volume:
        nutritional_data['Drink Volume'] = volume.text

    # Extracting the drink's name
    drink_name = soup.find('span', class_='sc-aa673588-1 drdope')
    if drink_name:
        nutritional_data['Drink Name'] = drink_name.text

    nutrient_list = soup.find('ul', class_='sc-ad6d339b-0 lhIfvG')
    if nutrient_list:
        nutrient_rows = nutrient_list.find_all('li', class_='sc-ad6d339b-1 iVtCiL')
        # nutrient_rows can be empty; guard against an IndexError on nutrient_rows[0]
        if nutrient_rows:
            # The first row holds header attributes such as the serving size
            first_row = nutrient_rows[0]
            attributes_name = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU')
            attributes_value = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT')
            for attr_name, attr_value in zip(attributes_name, attributes_value):
                nutritional_data[attr_name.text.strip()] = attr_value.text.strip()
            # Each remaining row holds one nutrient name/value pair
            for row in nutrient_rows[1:]:
                name_tag = row.find('span', class_='sc-ilhmMj')
                value_tag = row.find('span', class_='sc-ilhmMj fQpIpK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT')
                # Skip rows missing either span rather than raising AttributeError
                if name_tag and value_tag:
                    nutritional_data[name_tag.text.strip()] = value_tag.text.strip()

    all_nutritional_data.append(nutritional_data)

# Close the Selenium WebDriver
driver.quit()

# Create a DataFrame from the nutritional data
df = pd.DataFrame(all_nutritional_data)

# Drop rows where every nutrient column is NaN, i.e. rows where only
# 'Drink Name' and/or 'Drink Volume' were scraped
df.dropna(subset=df.columns.difference(['Drink Name', 'Drink Volume']), how='all', inplace=True)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Display the cleaned DataFrame
df

Unnamed: 0,Drink Volume,Drink Name,Attributes,Energy,Protein,Total Fat,Carbohydrate,Calcium,Vitamin A,Vitamin B1,...,Vitamin D,Total Calorieso,Calories From Saturated Fat,Salt (Sodium),Serving,TotalSugers,Monosaturated Fat,Satutated Fat,Total Dietary Fiber,Enery
0,392g,Milkmaid Sweetened Condensed Milk - Full Cream...,Per Serving (20g),65kcal,1.4g,1.6g,11.2g,52mg,42µg,0.2mg,...,,,,,,,,,,
1,600 G,Killiney 3-in-1 White Coffee,Per Serving (1.1g),,1.1g,,,,,,...,,,,,,,,,,
2,200g,Nescafe Gold Original,Per 100ml,4kcal,0.3g,0.0g,0.7g,,,,...,,,,,,,,,,
3,450g,Nestle Coffeemate Creamer - Pouch,Per Serving (6g),33kcal,0.1g,2.1g,3.4g,,,,...,,,,,,,,,,
4,100 x 20g,Indocafe 3 in 1 Instant Coffee Mix,Per Serving (20g),90kcal,< 1g,2g,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,250 G,Cafe specialists Traditional Signature Ground ...,Per Serving (2.5g),50kcal,2.5g,,10g,,,,...,,,,,,,,,,
138,270 G,Mycofe Long Black O,Per Serving (18),,0.9g,0g,,,,,...,,,,,,,,,,
139,6 X 260G,UCC Blended Coffee Luxurious Cafe Au Lait,Per Serving (),,0.6g,,,,,,...,,,,,,,,,,
140,6 X 185G,UCC Black 100% Roasted Coffee Sugar Free,Per Serving (),,,,,,,,...,,,,,,,,,,
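
The scraped values above are strings that mix a number with a unit (`65kcal`, `52mg`, `< 1g`), so they need to be split before any numeric analysis. The sketch below shows one possible way to do that; the name `parse_amount` is hypothetical, and treating `< 1g` as the value 1 is an assumption rather than something the notebook specifies.

```python
import re

# Optional "<", a number, then an optional alphabetic unit (e.g. g, mg, µg, kcal)
_AMOUNT = re.compile(r"\s*(<)?\s*(\d+(?:\.\d+)?)\s*([A-Za-zµ]+)?\s*")


def parse_amount(text):
    """Split a scraped value such as '65kcal' into (65.0, 'kcal').

    Returns None for missing values (NaN) or text that does not match.
    Assumption: a '<' prefix is ignored, so '< 1g' is read as 1 g.
    """
    if not isinstance(text, str):
        return None
    match = _AMOUNT.fullmatch(text)
    if match is None:
        return None
    value = float(match.group(2))
    unit = (match.group(3) or "").lower()
    return value, unit


print(parse_amount("65kcal"))
print(parse_amount("< 1g"))
```

Values that do not fit the pattern return `None`, which keeps unparseable entries visible instead of silently coercing them.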


In [10]:
df.to_csv("../data/kopi.csv", index=False)

In [11]:
# Initialize the Selenium WebDriver
driver = webdriver.Chrome()

# Navigate to the webpage
url = 'https://www.fairprice.com.sg/search?query=Tea'
driver.get(url)

# Wait for the page to load
time.sleep(10)

# Scroll to load more products
num_scrolls = 40
for _ in range(num_scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

# Extract links to individual product pages
product_links = []
soup = BeautifulSoup(driver.page_source, 'html.parser')
product_cards = soup.find_all('div', class_='sc-2a85da88-0 fhLHEV product-container')

# Assuming product_cards contains the relevant div elements with class 'sc-2a85da88-0'
for card in product_cards:
    a_tag = card.find('a', class_='sc-2a85da88-3')
    if a_tag:
        product_link = a_tag['href']
        product_links.append(product_link)

# Extract nutritional data for each product
all_nutritional_data = []

for product_link in product_links:
    product_url = f'https://www.fairprice.com.sg{product_link}'
    driver.get(product_url)
    time.sleep(4)

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    nutritional_data = {}

    # Extracting the drink's volume
    volume = soup.find('span', class_='sc-aa673588-1 sc-d5ac8310-3 kZssPC jGBApJ')
    if volume:
        nutritional_data['Drink Volume'] = volume.text

    # Extracting the drink's name
    drink_name = soup.find('span', class_='sc-aa673588-1 drdope')
    if drink_name:
        nutritional_data['Drink Name'] = drink_name.text

    nutrient_list = soup.find('ul', class_='sc-ad6d339b-0 lhIfvG')
    if nutrient_list:
        nutrient_rows = nutrient_list.find_all('li', class_='sc-ad6d339b-1 iVtCiL')
        # nutrient_rows can be empty; guard against an IndexError on nutrient_rows[0]
        if nutrient_rows:
            # The first row holds header attributes such as the serving size
            first_row = nutrient_rows[0]
            attributes_name = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU')
            attributes_value = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT')
            for attr_name, attr_value in zip(attributes_name, attributes_value):
                nutritional_data[attr_name.text.strip()] = attr_value.text.strip()
            # Each remaining row holds one nutrient name/value pair
            for row in nutrient_rows[1:]:
                name_tag = row.find('span', class_='sc-ilhmMj')
                value_tag = row.find('span', class_='sc-ilhmMj fQpIpK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT')
                # Skip rows missing either span rather than raising AttributeError
                if name_tag and value_tag:
                    nutritional_data[name_tag.text.strip()] = value_tag.text.strip()

    all_nutritional_data.append(nutritional_data)

# Close the Selenium WebDriver
driver.quit()

# Create a DataFrame from the nutritional data
df = pd.DataFrame(all_nutritional_data)

# Drop rows where every nutrient column is NaN, i.e. rows where only
# 'Drink Name' and/or 'Drink Volume' were scraped
df.dropna(subset=df.columns.difference(['Drink Name', 'Drink Volume']), how='all', inplace=True)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Display the cleaned DataFrame
df

Unnamed: 0,Drink Volume,Drink Name,Attributes,Energy,Protein,Carbohydrate,Fat,Cholesterol,Sugars,Dietary Fibre,...,Calories From Fat,Salt,Total Energy,Total Fats,Saturated fat,Amount per,Saturated fats,Total Carboyhdrate,Dietary fiber,Carbohyrate
0,100 per pack,Lipton Yellow Label Tea Bags - International B...,Per Serving (100ml),0g,0g,0g,0g,,,,...,,,,,,,,,,
1,12 x 300ml,Authentic Tea House Ayataka No Sugar Japanese ...,Per Serving (100ml),0kcal,0g,0g,,0g,0g,0g,...,,,,,,,,,,
2,6 X 600G,Killiney Premium Milk Tea Family Bundle,Per Serving (null),,0.6g,,,,19.4g,,...,,,,,,,,,,
3,1.5L,Pokka Bottle Drink - Ice Lemon Tea,Per Serving (250ml),100kcal,0.0g,25.0g,,,,,...,,,,,,,,,,
4,6 x 250ml,Pokka Packet Drink - Ice Lemon Tea,Per Serving (250ml),100kcal,0.0g,25.0g,,,25.0g,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,400g,Cha Tra Mue Extra Gold Premium Tea Powder - Va...,Per Serving (2),,,,,300mg,,,...,,,,,20g,,,,25g,
88,1 X 30G,Nature's Superfoods Ceremonial Matcha Powder G...,Per Serving (1g),0.01kcal,1g,,,0mg,0g,0g,...,,,,,,,,,,
89,1 L,Minor Figures Organic Barista Chai Concentrate,Per Serving (),105kj,0g,6g,0g,,6g,,...,,0g,,,,,,,,
90,216 G,Kim's Duet Instant Chrysanthemum,Per Serving (8.6g),35g,,,,,,,...,,,,,,,,,,8.7g
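
The column headers above show that the same nutrient appears under several spellings (`Carbohydrate`, `Total Carboyhdrate`, `Carbohyrate`; `Saturated fat`, `Saturated fats`; `Dietary Fibre`, `Dietary fiber`). A rough sketch of how such variants could be mapped to one name is given below. The mapping `CANONICAL_NAMES` and the function `canonical_name` are hypothetical, and the choice of canonical label (for example, treating "Total Carbohydrate" as "Carbohydrate") is an assumption.

```python
# Hypothetical mapping from lowercased label variants to one canonical name.
# The variants are taken from the table header above; the targets are assumptions.
CANONICAL_NAMES = {
    "carbohydrate": "Carbohydrate",
    "total carboyhdrate": "Carbohydrate",
    "carbohyrate": "Carbohydrate",
    "saturated fat": "Saturated Fat",
    "saturated fats": "Saturated Fat",
    "dietary fibre": "Dietary Fibre",
    "dietary fiber": "Dietary Fibre",
}


def canonical_name(column):
    """Return the canonical label for a scraped column name.

    Whitespace is collapsed and case is ignored for the lookup;
    names without a mapping are returned unchanged.
    """
    key = " ".join(column.split()).lower()
    return CANONICAL_NAMES.get(key, column)


print(canonical_name("Total Carboyhdrate"))
print(canonical_name("Protein"))
```

The function could be passed to `df.rename(columns=canonical_name)`. After renaming, several columns may share one label, so their values would still need to be merged into a single column.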


In [12]:
df.to_csv("../data/tea.csv", index=False)

In [5]:
# Initialize the Selenium WebDriver
driver = webdriver.Chrome()

# Navigate to the webpage
url = 'https://www.fairprice.com.sg/search?query=Juice'
driver.get(url)

# Wait for the page to load
time.sleep(10)

# Scroll to load more products
num_scrolls = 30
for _ in range(num_scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

# Extract links to individual product pages
product_links = []
soup = BeautifulSoup(driver.page_source, 'html.parser')
product_cards = soup.find_all('div', class_='sc-2a85da88-0 fhLHEV product-container')

# Assuming product_cards contains the relevant div elements with class 'sc-2a85da88-0'
for card in product_cards:
    a_tag = card.find('a', class_='sc-2a85da88-3')
    if a_tag:
        product_link = a_tag['href']
        product_links.append(product_link)

# Extract nutritional data for each product
all_nutritional_data = []

for product_link in product_links:
    product_url = f'https://www.fairprice.com.sg{product_link}'
    driver.get(product_url)
    time.sleep(4)

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    nutritional_data = {}

    # Extracting the drink's volume
    volume = soup.find('span', class_='sc-aa673588-1 sc-d5ac8310-3 kZssPC jGBApJ')
    if volume:
        nutritional_data['Drink Volume'] = volume.text

    # Extracting the drink's name
    drink_name = soup.find('span', class_='sc-aa673588-1 drdope')
    if drink_name:
        nutritional_data['Drink Name'] = drink_name.text

    nutrient_list = soup.find('ul', class_='sc-ad6d339b-0 lhIfvG')
    if nutrient_list:
        nutrient_rows = nutrient_list.find_all('li', class_='sc-ad6d339b-1 iVtCiL')
        try:
            first_row = nutrient_rows[0]
            attributes_name = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU')
            attributes_value = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT')
            for attr_name, attr_value in zip(attributes_name, attributes_value):
                attribute_name = attr_name.text.strip()
                attribute_value = attr_value.text.strip()
                nutritional_data[attribute_name] = attribute_value
            for row in nutrient_rows[1:]:
                nutrient_name = row.find('span', class_='sc-ilhmMj').text.strip()
                nutrient_value = row.find('span', class_='sc-ilhmMj fQpIpK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT').text.strip()
                nutritional_data[nutrient_name] = nutrient_value
        except IndexError:
            continue

    all_nutritional_data.append(nutritional_data)

# Close the Selenium WebDriver
driver.quit()

# Create a DataFrame from the nutritional data
df = pd.DataFrame(all_nutritional_data)

# Drop rows where every nutrient column is NaN, i.e. rows where only
# 'Drink Name' and/or 'Drink Volume' were scraped
df.dropna(subset=df.columns.difference(['Drink Name', 'Drink Volume']), how='all', inplace=True)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Display the cleaned DataFrame
df

Unnamed: 0,Drink Volume,Drink Name,Attributes,Energy,Protein,Total Fat,Saturated Fat,Trans Fat,Cholesterol,Carbohydrate,...,sugars,Fats,saturates,Saturated Fats,fats,Saturaed Fat,Per Serving,Total Calories,Dietary Fiber,Proteins
0,1.89L,Sunkist Fruit Bottle Juice - Apple,Per Serving (100ml),42kcal,0.2g,0.0g,0.0g,0.0g,0mg,10.4g,...,,,,,,,,,,
1,6 x 250ml,Pokka Packet Drink - Carrot Juice,Per Serving (250g),103kcal,0g,0g,,,,25.8g,...,,,,,,,,,,
2,1.89L,Sunkist Fruit Bottle Juice - Orange,Per Serving (250ml),108kcal,0.5g,0.0g,0.0g,0.0g,0mg,26.3g,...,,,,,,,,,,
3,946ml,F&N NutriWell Reduced Sugar Drink - Chrysanthe...,Per Serving (250ml),65kcal,0g,0g,0g,,0mg,15.5g,...,,,,,,,,,,
4,1L,Marigold 100% Packet Juice - Tropical Fruits,Per Serving (250ml),118kcal,1.3g,0g,0g,,0mg,28.3g,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,3 X 200ML,Marks & Spencer Percy Pig Fruit Juice Drink,Per Serving (),46kcal,0.4g,,0.1g,,,10.3g,...,,,,,,,,,,
60,330 ML,Wolf + Wald Organic Sparkling Apple Juice,Per Serving (330),,0.1g,0.1g,,,,,...,,,,,,,,24kcal,0g,
61,250 ML,Muno Organic 100% Sea Buckthorn Juice,Per Serving (100),236kj,,,,,,,...,,0.65g,,,,,,,,1.1g
62,6 X 580ML,BOMY BOMY Grape Juice,Per Serving (),,0g,,,,,,...,,,,,,,,,,
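
The `Energy` column above mixes units: most rows report kilocalories (`42kcal`) while some report kilojoules (`236kj`). The sketch below shows one way to bring both onto a single scale, using the standard conversion 1 kcal = 4.184 kJ. The function name `energy_to_kcal` is hypothetical and not part of the original notebook.

```python
import re

KJ_PER_KCAL = 4.184  # 1 kcal = 4.184 kJ

# A number followed by an energy unit, in any letter case
_ENERGY = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(kcal|kj)\s*", re.IGNORECASE)


def energy_to_kcal(text):
    """Convert a scraped energy string to kilocalories.

    Returns None for missing values or strings without an energy unit.
    """
    if not isinstance(text, str):
        return None
    match = _ENERGY.fullmatch(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    return value if unit == "kcal" else value / KJ_PER_KCAL


print(energy_to_kcal("42kcal"))
print(round(energy_to_kcal("236kj"), 1))
```

Entries whose unit is not an energy unit return `None`, so they can be inspected by hand rather than converted incorrectly.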


In [6]:
df.to_csv("../data/juice.csv", index=False)

In [9]:
# Initialize the Selenium WebDriver
driver = webdriver.Chrome()

# Navigate to the webpage
url = 'https://www.fairprice.com.sg/search?query=Drinks'
driver.get(url)

# Wait for the page to load
time.sleep(10)

# Scroll to load more products
num_scrolls = 80
for _ in range(num_scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)

# Extract links to individual product pages
product_links = []
soup = BeautifulSoup(driver.page_source, 'html.parser')
product_cards = soup.find_all('div', class_='sc-2a85da88-0 fhLHEV product-container')

# Assuming product_cards contains the relevant div elements with class 'sc-2a85da88-0'
for card in product_cards:
    a_tag = card.find('a', class_='sc-2a85da88-3')
    if a_tag:
        product_link = a_tag['href']
        product_links.append(product_link)

# Extract nutritional data for each product
all_nutritional_data = []

for product_link in product_links:
    product_url = f'https://www.fairprice.com.sg{product_link}'
    driver.get(product_url)
    time.sleep(4)

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    nutritional_data = {}

    # Extracting the drink's volume
    volume = soup.find('span', class_='sc-aa673588-1 sc-d5ac8310-3 kZssPC jGBApJ')
    if volume:
        nutritional_data['Drink Volume'] = volume.text

    # Extracting the drink's name
    drink_name = soup.find('span', class_='sc-aa673588-1 drdope')
    if drink_name:
        nutritional_data['Drink Name'] = drink_name.text

    nutrient_list = soup.find('ul', class_='sc-ad6d339b-0 lhIfvG')
    if nutrient_list:
        nutrient_rows = nutrient_list.find_all('li', class_='sc-ad6d339b-1 iVtCiL')
        try:
            first_row = nutrient_rows[0]
            attributes_name = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU')
            attributes_value = first_row.find_all('span', class_='sc-kMjNwy dSxhfK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT')
            for attr_name, attr_value in zip(attributes_name, attributes_value):
                attribute_name = attr_name.text.strip()
                attribute_value = attr_value.text.strip()
                nutritional_data[attribute_name] = attribute_value
            for row in nutrient_rows[1:]:
                nutrient_name = row.find('span', class_='sc-ilhmMj').text.strip()
                nutrient_value = row.find('span', class_='sc-ilhmMj fQpIpK sc-khsqcC cWAEyU sc-ad6d339b-2 hKcaKT').text.strip()
                nutritional_data[nutrient_name] = nutrient_value
        except IndexError:
            continue

    all_nutritional_data.append(nutritional_data)

# Close the Selenium WebDriver
driver.quit()

# Create a DataFrame from the nutritional data
df = pd.DataFrame(all_nutritional_data)

# Drop rows where every nutrient column is NaN, i.e. rows where only
# 'Drink Name' and/or 'Drink Volume' were scraped
df.dropna(subset=df.columns.difference(['Drink Name', 'Drink Volume']), how='all', inplace=True)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Display the cleaned DataFrame
df

Unnamed: 0,Drink Volume,Drink Name,Attributes,Energy,Trans Fat,Protein,Total Fat,Saturated Fat,Cholesterol,Carbohydrate,...,Blueberry Extract,Pomegranate Extract,Sulfates,pH Value,Total Dissolved Solids,ats,Serving Size: 4.5 tbsp,Servings per container,Saturated Fats,Vit D
0,24 x 200ml (CTN),Milo Chocolate Malt Milk UHT Packet Drink,Per Serving,1.7g,1mg,1.7g,0.9g,0g,9.5g,8g,...,,,,,,,,,,
1,12 x 320ml (CTN),Coca-Cola Can Drink - Zero Sugar,Per Serving (100ml),0kcal,,0g,0g,,,0g,...,,,,,,,,,,
2,6 x 180ml,Coca-Cola Mini Can Drink - Zero Sugar,Per Serving (180ml),0kcal,,0g,0g,,,0g,...,,,,,,,,,,
3,24 x 325ml (CTN),100 Plus Isotonic Can Drink - Original,Per Serving (100ml),27kcal,,0g,0g,,,6.8g,...,,,,,,,,,,
4,24 x 200ml (CTN),Ribena Blackcurrant Fruit Packet Drink - Regular,Per Serving (100g),43kcal,0g,0g,0g,0g,,10.6g,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184,500 G,Health Paradise Organic Black Soya Milk Powder,Per Serving (),,0g,8g,4.5g,0.66g,0mg,,...,,,,,,,30g,16pax,,
185,6 X 1L,Minor Figures Barista Oat Milk,Per Serving (),410kJ,,0.4g,,,,19g,...,,,,,,,,,0.4g,
186,330 ML,Shozu Yuzu & Ume Sparkling Prebiotics,Per Serving (0g),146kj,,0g,,,,7g,...,,,,,,,,,,
187,500 G,Health Paradise Instant Soya Milk Powder (NSA),Per Serving (),,0g,6g,4.5g,0.66g,0mg,,...,,,,,,,30g,16pax,,0mcg


In [10]:
df.to_csv("../data/drinks.csv", index=False)
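
With all four categories saved, the tables can be stacked into one dataset for later analysis. The following is a minimal sketch, assuming a hypothetical helper `combine_categories` and a hypothetical `Category` column that records where each row came from. It uses small in-memory stand-ins rather than the real CSV files.

```python
import pandas as pd


def combine_categories(frames):
    """Stack per-category DataFrames and label each row with its category.

    `frames` maps a category name to its DataFrame. Columns missing from
    a category are filled with NaN by pd.concat.
    """
    labelled = [df.assign(Category=name) for name, df in frames.items()]
    return pd.concat(labelled, ignore_index=True)


# Tiny stand-ins for the scraped tables (not real scraped data)
demo = combine_categories({
    "coffee": pd.DataFrame({"Drink Name": ["A"], "Energy": ["4kcal"]}),
    "tea": pd.DataFrame({"Drink Name": ["B", "C"], "Sugars": ["0g", "25.0g"]}),
})
print(demo)
```

For the real data, each entry of `frames` could be loaded with `pd.read_csv` from the files written above (`../data/kopi.csv`, `../data/tea.csv`, `../data/juice.csv` and `../data/drinks.csv`).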