<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: Snack-O-Meter: a tool to inform consumers on consumption of biscuits

# Content Page

1. [Webscraping for data](01_webscraping.ipynb)
2. [Data cleaning](02_cleaning.ipynb)
3. [EDA](03_eda.ipynb)
4. [Data Modelling](04_modelling.ipynb)

## Problem Statement

The National Population Health Survey highlighted various actions to improve the health of Singaporeans. This project focuses on healthier eating.

Several measures have been implemented to encourage healthy eating. These include Nutri-Grade labelling for beverages, which focuses on sugar and saturated fat and has shifted consumption and reduced sugar intake. More recently, Singapore has shared that it is studying possible regulatory measures to reduce the sodium content of food dishes. However, the nutritional value of snacks has not been widely discussed, and this is the gap we aim to address.

While all snacks should be considered, biscuits are used as the initial proof of concept because of their popularity: 57% of respondents in a survey had purchased biscuits, ahead of other snacks (https://www.statista.com/statistics/1341575/singapore-most-bought-sweet-snacks-in-the-past-week/). The objective of the project is to build a user-friendly tool that indicates whether a biscuit is healthy, helping consumers make healthier choices.

## Webscraping

In [1]:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

A list of product-page URLs from the FairPrice website was manually collated in Excel and saved as a CSV file for the subsequent web scraping.

In [2]:
url_df = pd.read_csv("../url/url_cookie.csv")

In [3]:
url_df.head()

Unnamed: 0,url
0,https://www.fairprice.com.sg/product/beryl-s-c...
1,https://www.fairprice.com.sg/product/beryl-s-c...
2,https://www.fairprice.com.sg/product/beryl-s-c...
3,https://www.fairprice.com.sg/product/beryl-s-s...
4,https://www.fairprice.com.sg/product/beryl-s-c...


In [3]:
url_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     22 non-null     object
dtypes: object(1)
memory usage: 304.0+ bytes


In [4]:
# Convert the 'url' column to a plain Python list
url_list = url_df['url'].tolist()
    

In [5]:
len(url_list)

22

In [6]:
url_list

['https://www.fairprice.com.sg/product/beryl-s-chocolate-orange-cashew-nuts-cookies-95-g-90054476',
 'https://www.fairprice.com.sg/product/beryl-s-coconut-sable-with-macadamia-nuts-95-g-90056744',
 'https://www.fairprice.com.sg/product/beryl-s-cookies-chocolate-sable-180-g-90040649',
 'https://www.fairprice.com.sg/product/beryl-s-strawberry-sable-95-g-90053941',
 'https://www.fairprice.com.sg/product/beryl-s-cookies-exquisite-selection-tin-216-g-90040652',
 'https://www.fairprice.com.sg/product/chipsmore-cookies-multipack-original-224g-13162989',
 'https://www.fairprice.com.sg/product/chipsmore-cookies-multipack-double-chocolate-224g-13162990',
 'https://www.fairprice.com.sg/product/fuel10k-double-chocolate-high-protein-oat-cookie-50-g-90158283',
 'https://www.fairprice.com.sg/product/jules-destrooper-biscuits-belgian-chocolate-thins-100g-11093580',
 'https://www.fairprice.com.sg/product/julie-s-oat-25-cookies-hazelnuts-chocolate-chips-200g-13054342',
 'https://www.fairprice.com.sg/pro

In [7]:
# df accumulates one row per scraped product across every pass of the loop.
df = pd.DataFrame()

final_df = pd.DataFrame()

# Repeat the scrape of every URL until at least 31 rows have accumulated.
# df is not reset between passes, so repeated passes add duplicate rows,
# which are dropped in the next cell.
while len(final_df) < 31:

    for link in url_list:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, "html.parser")

        # Nutrition panel: the spans alternate between a label and its value.
        nutri_info = soup.find_all('span', mode="light")
        nutri_list = list(nutri_info)

        nutri_names = []
        nutri_values = []

        for i in range(len(nutri_list)):
            if i % 2 == 0:
                nutri_names.append(nutri_list[i])
            else:
                nutri_values.append(nutri_list[i])

        # Product name, taken from the text of the page's <title> tag.
        title_full = soup.title
        title = re.findall(r'\>([A-z0-9 -?%?]+)\<', str(title_full))

        # Pack size and average rating. These generated class names were
        # taken from the page at scraping time and may change.
        weight_info = soup.find_all('span', attrs={'class': 'sc-aa673588-1 sc-d5ac8310-3 kZssPC jGBApJ'})
        weight = re.findall(r'\>([A-z0-9 x?-?]+)\<', str(weight_info))

        avg_rating = soup.find_all('span', attrs={'class': 'sc-6fe931dc-4 gnxVUm pdp'})
        avg_rating_value = re.findall(r'\>([0-9.?]+)\<', str(avg_rating))

        # Strip the surrounding tags, keeping only the text between > and <.
        nutri_names_only = re.findall(r'\>([A-z0-9. ()]+)\s?\<', str(nutri_names))
        nutri_values_only = re.findall(r'\>([A-z0-9. ()]+)\s?\<', str(nutri_values))

        data_list = list(zip(nutri_names_only, nutri_values_only))

        try:
            # Build a one-row frame: nutrient names become the columns.
            df_interim = pd.DataFrame(data_list)
            # The nested list creates MultiIndex columns, which is why the
            # nutrient columns appear as tuples such as (Protein,) in the output.
            df_interim.columns = [['names', 'values']]
            df_interim = df_interim.set_index(['names'])
            df_transpose = df_interim.T
            df_transpose['product'] = title
            df_transpose['weight'] = weight
            df_transpose['rating'] = avg_rating_value
            df_transpose.reset_index(drop=True, inplace=True)
            df_transpose = df_transpose.set_index(['product'])

            df = pd.concat([df, df_transpose])
        except Exception:
            # Skip this product if its page gave no usable data
            # (for example, an empty data_list raises when columns are set).
            continue

    final_df = df.reset_index()
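
The scraping loop relies on the nutrition panel's `span` elements alternating between a label and its value. A minimal sketch of that pairing step, using list slicing, is given here. The sample strings stand in for the text of the scraped spans, and the helper name `pair_alternating` is hypothetical rather than part of the notebook.

```python
def pair_alternating(items):
    """Split [name0, value0, name1, value1, ...] into (name, value) pairs."""
    names = items[0::2]   # even positions hold the labels
    values = items[1::2]  # odd positions hold the values
    # zip stops at the shorter list, so a trailing unpaired label is dropped
    return list(zip(names, values))


# Illustrative input only; on the real page these come from the scraped spans.
sample = ["Attributes", "Per Serving (28g)", "Protein", "1.5g", "Sugars", "8.5g"]
pairs = pair_alternating(sample)
record = dict(pairs)  # one product's nutrients, keyed by nutrient name
print(record)
```

A dictionary per product can be passed straight to `pd.DataFrame([record])`, which yields ordinary string column names.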
    
    

In [8]:
final_df['type'] = "cookie"
final_df = final_df.drop_duplicates()
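
The `while` loop in the scraping cell repeats the whole scrape until enough rows have accumulated, which is why `drop_duplicates()` is needed afterwards, and it would not terminate if the site kept returning incomplete pages. One possible alternative, sketched here under assumptions, retries each URL a bounded number of times. `fetch_with_retry`, `max_attempts`, `delay_seconds`, and the `fake_fetch` stub are all hypothetical names; in the notebook, the fetch function would wrap `requests.get` and the parsing logic.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, delay_seconds=0.0):
    """Call fetch(url) until it returns a non-empty result or attempts run out.

    Returns the result, or None if every attempt failed or came back empty.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch(url)
        except Exception:
            result = None  # treat any error as a failed attempt
        if result:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before trying again
    return None


# Stub standing in for the real scraper: returns nothing twice, then succeeds.
calls = {"count": 0}


def fake_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        return []  # simulates a page where the nutrition spans were missing
    return [("Protein", "1.5g")]


rows = fetch_with_retry(fake_fetch, "https://example.com/product")
print(rows, calls["count"])
```

Because each URL is attempted independently, a product is added at most once and no duplicate rows are produced by the retries themselves.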

In [10]:
final_df.head()

names,product,"(Attributes,)","(Protein,)","(Carbohydrate,)","(Sugars,)","(Fats,)","(Energy,)",weight,rating,"(Total Fat,)","(Sodium,)","(Cholesterol,)","(Saturated Fat,)","(Trans Fat,)","(Monounsaturated Fat,)","(Polyunsaturated Fat,)","(Dietary Fibre,)","(Soluble Fibre,)","(Insoluble Fibre,)",type
0,Beryl's Cookies Exquisite Selection (Tin),Per Serving (1.7g),1.7g,11.7g,5.0g,4.0g,378kj,216 G,4.8,,,,,,,,,,,cookie
1,Chipsmore Cookies Multipack - Original,Per Serving (28g),1.5g,19.4g,8.5g,,137kcal,224g,4.3,6g,98mg,,,,,,,,,cookie
2,Munchy's Oat Krunch - Nutty Chocolate,Per Serving (120kcal),2g,17g,,,,15 per pack,4.6,5g,50mg,0g,2.5g,0g,,,,,,cookie
4,Chipsmore Cookies Multipack - Double Chocolate,Per Serving (28g),1.6g,19.4g,8.5g,,135kcal,224g,4.3,5.9g,94mg,,,,,,,,,cookie
5,Quaker Oats Cookies - Chocolate Chip,Per Serving (27g),2g,19g,9g,,126.4kcal,6 x 27g,4.3,4.7g,122.2mg,1.5mg,2.3g,0g,1.5g,0.7g,1.1g,0.2g,0.8g,cookie


In [11]:
final_df.to_csv("../data/cookie.csv")

The above steps are repeated for:

- crackers
<br>url_df = pd.read_csv("../url/url_crackers.csv")
<br>final_df['type'] = "crackers"
<br>final_df.to_csv("../data1/crackers.csv")
<br>

- cream
<br>url_df = pd.read_csv("../url/url_cream.csv")
<br>final_df['type'] = "cream"
<br>final_df.to_csv("../data1/cream.csv")
<br>

- wafer
<br>url_df = pd.read_csv("../url/url_wafer.csv")
<br>final_df['type'] = "wafer"
<br>final_df.to_csv("../data1/wafer.csv")
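
Since only the URL file, the `type` label, and the output file change between biscuit types, the repetition could be captured in a small lookup table. The sketch below copies the paths exactly as they appear in this notebook; the names `CATEGORIES` and `run_all` are hypothetical, and the `scrape` callable is a placeholder for the cookie workflow shown earlier.

```python
# (URL file, output file) per biscuit type, as listed in this notebook.
CATEGORIES = {
    "cookie": ("../url/url_cookie.csv", "../data/cookie.csv"),
    "crackers": ("../url/url_crackers.csv", "../data1/crackers.csv"),
    "cream": ("../url/url_cream.csv", "../data1/cream.csv"),
    "wafer": ("../url/url_wafer.csv", "../data1/wafer.csv"),
}


def run_all(scrape):
    """Call scrape(url_csv, biscuit_type, out_csv) once per category."""
    for biscuit_type, (url_csv, out_csv) in CATEGORIES.items():
        scrape(url_csv, biscuit_type, out_csv)


# Dry run with a stub that only records what would be scraped.
planned = []
run_all(lambda url_csv, biscuit_type, out_csv: planned.append((biscuit_type, out_csv)))
print(planned)
```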

[Click for data cleaning](02_cleaning.ipynb)