<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: Snack-O-Meter: a tool to inform consumers on consumption of biscuits

# Content Page

1. [Webscraping for data](01_webscraping.ipynb)
2. [Data cleaning](02_cleaning.ipynb)
3. [EDA](03_eda.ipynb)
4. [Data Modelling](04_modelling.ipynb)

## Data Cleaning 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Set the display option to show all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
cookie = pd.read_csv('../data/cookie.csv')
cracker = pd.read_csv('../data/crackers.csv')
cream = pd.read_csv('../data/cream.csv')
wafer = pd.read_csv('../data/wafer.csv')

In [4]:
cookie.head()

Unnamed: 0,Names,Per Serving,Total Fat,Sodium,Sugars,Type,Cholesterol,Carbohydrate,Protein,Saturated Fat,Trans Fat,Energy,Dietary Fibre,Salt,Calories,Fibre
0,Beryl's Chocolate Orange Cashew Nuts Cookies,25g,4.6g,,7.6g,cookie,,15.9g,2.6g,,,494kj,,,,
1,Beryl's Coconut Sable with Macadamia Nuts,25g,9.9g,,5.1g,cookie,,12.4g,1.4g,,,636kj,,,,
2,Beryl's Cookies Chocolate Sable,25g,7.2g,,9.2g,cookie,,16.3g,1.6g,,,563kj,,,,
3,Beryl's Strawberry Sable,25g,9.4g,,4.8g,cookie,,13.7g,1.4g,,,607kj,,,,
4,Beryl's Cookies Exquisite Selection (Tin),25g,4.0g,,5.0g,cookie,,11.7g,1.7g,,,378kj,,,,


In [5]:
cookie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Names          22 non-null     object
 1   Per Serving    22 non-null     object
 2   Total Fat      22 non-null     object
 3   Sodium         10 non-null     object
 4   Sugars         22 non-null     object
 5   Type           22 non-null     object
 6   Cholesterol    3 non-null      object
 7   Carbohydrate   22 non-null     object
 8   Protein        22 non-null     object
 9   Saturated Fat  11 non-null     object
 10  Trans Fat      3 non-null      object
 11  Energy         16 non-null     object
 12  Dietary Fibre  8 non-null      object
 13  Salt           7 non-null      object
 14  Calories       2 non-null      object
 15  Fibre          6 non-null      object
dtypes: object(16)
memory usage: 2.9+ KB


This project focuses on three key nutrients: Total Fat, Sodium and Sugars. All other columns are dropped from the subsequent dataframes, except 'Salt', which is kept because it can be converted to sodium.

In [5]:
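# Match the column name used by the other datasets, keep only the columns of interest, then lowercase the headers
cookie = cookie.rename(columns={'Names': 'Product'})
cookie = cookie[['Type', 'Product', 'Per Serving', 'Total Fat', 'Sodium', 'Sugars', 'Salt']]
cookie.columns = cookie.columns.str.lower()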
cookie.head(10)

Unnamed: 0,type,product,per serving,total fat,sodium,sugars,salt
0,cookie,Beryl's Chocolate Orange Cashew Nuts Cookies,25g,4.6g,,7.6g,
1,cookie,Beryl's Coconut Sable with Macadamia Nuts,25g,9.9g,,5.1g,
2,cookie,Beryl's Cookies Chocolate Sable,25g,7.2g,,9.2g,
3,cookie,Beryl's Strawberry Sable,25g,9.4g,,4.8g,
4,cookie,Beryl's Cookies Exquisite Selection (Tin),25g,4.0g,,5.0g,
5,cookie,Chipsmore Cookies Multipack - Original,28g,6g,98mg,8.5g,
6,cookie,Chipsmore Cookies Multipack - Double Chocolate,28g,5.9g,94mg,8.5g,
7,cookie,FUEL10K Double Chocolate High Protein Oat Cookie,100g,14.9g,,3.8g,0.70g
8,cookie,Jules Destrooper Biscuits - Belgian Chocolate ...,100g,25g,,51g,0.6g
9,cookie,Julie's Oat 25 Cookies - Hazelnuts &amp; Choco...,25g,6g,80mg,6.5g,


In [6]:
cracker.head()

Unnamed: 0,product,Attributes,Calories,Calories from Fat,Total Fat,Saturated Fat,Trans Fat,Cholesterol,Sodium,Carbohydrate,Dietary Fibre,Sugars,Protein,Energy,Iron,Calcium,Energy from Fat
0,Julie's Crackers - Butter,Per Serving (25g),140kcal,70g,8g,3.5g,0g,0mg,140mg,15g,0g,2g,2g,,,,
1,Julie's Veggie Crackers,Per Serving (23g),,,6g,2.5g,0g,0mg,140mg,14g,0g,1g,2g,120kcal,,,
2,Ritz Crackers Box - Original,Per Serving (19g),,,5g,,,,130mg,12g,,1g,1g,100kcal,,,
3,Hup Seng Crackers - Cream,Per Serving (31g),150,60,7g,4g,0g,0mg,180mg,,1g,3g,,,,,
4,Hup Seng Crackers - Sugar,Per Serving (22.5g),110kcal,25kcal,3g,1.5g,0g,0mg,140mg,19g,4g,4g,2g,,,,


In [7]:
cracker['Per Serving'] = cracker['Attributes'].str.extract(r'\(([^)]+)\)')
cracker['Type'] = 'cracker'
cracker = cracker[['Type', 'product', 'Per Serving', 'Total Fat', 'Sodium', 'Sugars']]
cracker.columns = cracker.columns.str.lower()
cracker.head()

Unnamed: 0,type,product,per serving,total fat,sodium,sugars
0,cracker,Julie's Crackers - Butter,25g,8g,140mg,2g
1,cracker,Julie's Veggie Crackers,23g,6g,140mg,1g
2,cracker,Ritz Crackers Box - Original,19g,5g,130mg,1g
3,cracker,Hup Seng Crackers - Cream,31g,7g,180mg,3g
4,cracker,Hup Seng Crackers - Sugar,22.5g,3g,140mg,4g
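
The serving size above comes from the `Attributes` text via the pattern `\(([^)]+)\)`, which captures whatever sits between the first pair of parentheses. A minimal sketch on made-up strings (the sample values are illustrative, not taken from the scraped file) shows what the pattern returns, and that a value without parentheses comes back as NaN rather than raising an error:

```python
import pandas as pd

# Illustrative values only; the last one has no parentheses on purpose.
attributes = pd.Series(["Per Serving (25g)", "Per Serving (22.5g)", "Per 100g"])

# expand=False returns a Series instead of a one-column DataFrame.
serving = attributes.str.extract(r"\(([^)]+)\)", expand=False)
print(serving.tolist())
```

Checking `serving.isna().sum()` after the extraction is a quick way to spot labels that do not follow the `Per Serving (...)` layout.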


In [8]:
cream.head()

Unnamed: 0,product,Per Serving,Total Fat,Sodium,Sugars,type,Attributes,Calories,Calories from Fat,Protein,Saturated Fat,Trans Fat,Cholesterol,Carbohydrate,Dietary Fibre,Energy,Salt,Monounsaturated Fat,Polyunsaturated Fat,Energy from Fat
0,Arnott's Tim Tam Biscuits - Chewy Caramel,100g,23.9g,193mg,46.8g,cream,Per Serving (100g),,,4.7g,13.7g,,,65.1g,,2080kJ,,,,
1,Cowhead Sandwich Crackers with Calcium - Cheese,24g,6g,140mg,4g,cream,Per Serving (24g),,,2g,3.5g,,,15g,140mg,0kJ,,,,
2,Glico Collon Biscuit Roll - Chocolate,46g,11g,12mg,17.9g,cream,Per Serving (46g),,,2.9g,5.6g,,10mg,31.1g,1.5g,235kcal,,,,99kcal
3,Jack 'n Jill Dewberry Sandwich Biscuits - Blue...,36g,9.9g,183mg,12.6g,cream,Per Serving (36g),,,2.1g,3.8g,0.0g,5mg,21.8g,0.7g,185kcal,,,,
4,Jack 'n Jill Dewberry Sandwich Biscuits - Stra...,36g,7.9g,100mg,12g,cream,Per Serving (36g),,,2.2g,4.4g,0g,6mg,23.8g,0.7g,175kcal,,,,


In [9]:
cream = cream[['type', 'product', 'Per Serving', 'Total Fat', 'Sodium', 'Sugars','Salt']]
cream.columns = cream.columns.str.lower()
cream.head()

Unnamed: 0,type,product,per serving,total fat,sodium,sugars,salt
0,cream,Arnott's Tim Tam Biscuits - Chewy Caramel,100g,23.9g,193mg,46.8g,
1,cream,Cowhead Sandwich Crackers with Calcium - Cheese,24g,6g,140mg,4g,
2,cream,Glico Collon Biscuit Roll - Chocolate,46g,11g,12mg,17.9g,
3,cream,Jack 'n Jill Dewberry Sandwich Biscuits - Blue...,36g,9.9g,183mg,12.6g,
4,cream,Jack 'n Jill Dewberry Sandwich Biscuits - Stra...,36g,7.9g,100mg,12g,


In [10]:
wafer.head()

Unnamed: 0,product,Attributes,Energy,Total Fat,Saturated Fat,Carbohydrate,of which Sugars,Protein,Salt,Per Serving,Trans Fat,Sodium,Sugars,Cholesterol,Dietary Fibre,Dietary Fiber,Iron,Calcium,Energy from Fat,Total Carbohydrates,Polyunsaturated Fat,Calories,Saturated fat,Trans fat,Monounsaturated Fat,OfWhichSaturates,ofwhichSugars,Fibre,Moisture,Ash,Potassium,type
0,Loacker Quadratini Crispy Wafers - Matcha-Gree...,Per Serving (100g),503kcal,24g,21g,62g,27g,8g,0.42g,100g,,,27g,,,,,,,,,,,,,,,,,,,wafer
1,Alor Durian Wafer 21g x 10,Per Serving (100g),541kcal,31.1g,,61.1g,,4.6g,,100g,0g,61mg,,,,,,,,,,,,,,,,,,,,wafer
2,Beryl's Coconut Rolls Original,Per Serving (75g),941kj,7.1g,,38.3g,,1.8g,,75g,,30mg,23.0g,,,,,,,,,,,,,,,,,,,wafer
3,Julie's Love Letter Wafer Biscuit Roll - Choco...,Per Serving (30g),,4g,2g,23g,,,,30g,0g,65mg,13g,0g,13g,,,,,,,,,,,,,,,,,wafer
4,Julie's Love Letter Wafer Biscuit Roll - Straw...,Per Serving (30g),,5g,2.5g,22g,,,,30g,0g,65mg,12g,0g,0g,,,,,,,,,,,,,,,,,wafer


In [11]:
wafer = wafer[['type', 'product', 'Per Serving', 'Total Fat', 'Sodium', 'Sugars', 'Salt']]
wafer.columns = wafer.columns.str.lower()
wafer.head()

Unnamed: 0,type,product,per serving,total fat,sodium,sugars,salt
0,wafer,Loacker Quadratini Crispy Wafers - Matcha-Gree...,100g,24g,,27g,0.42g
1,wafer,Alor Durian Wafer 21g x 10,100g,31.1g,61mg,,
2,wafer,Beryl's Coconut Rolls Original,75g,7.1g,30mg,23.0g,
3,wafer,Julie's Love Letter Wafer Biscuit Roll - Choco...,30g,4g,65mg,13g,
4,wafer,Julie's Love Letter Wafer Biscuit Roll - Straw...,30g,5g,65mg,12g,


In [12]:
# Stack the four datasets and rebuild a clean 0..n-1 index in one step
df = pd.concat([cookie, cracker, cream, wafer], ignore_index=True)
df.head(100)

Unnamed: 0,type,product,per serving,total fat,sodium,sugars,salt
0,cookie,Beryl's Chocolate Orange Cashew Nuts Cookies,25g,4.6g,,7.6g,
1,cookie,Beryl's Coconut Sable with Macadamia Nuts,25g,9.9g,,5.1g,
2,cookie,Beryl's Cookies Chocolate Sable,25g,7.2g,,9.2g,
3,cookie,Beryl's Strawberry Sable,25g,9.4g,,4.8g,
4,cookie,Beryl's Cookies Exquisite Selection (Tin),25g,4.0g,,5.0g,
5,cookie,Chipsmore Cookies Multipack - Original,28g,6g,98mg,8.5g,
6,cookie,Chipsmore Cookies Multipack - Double Chocolate,28g,5.9g,94mg,8.5g,
7,cookie,FUEL10K Double Chocolate High Protein Oat Cookie,100g,14.9g,,3.8g,0.70g
8,cookie,Jules Destrooper Biscuits - Belgian Chocolate ...,100g,25g,,51g,0.6g
9,cookie,Julie's Oat 25 Cookies - Hazelnuts &amp; Choco...,25g,6g,80mg,6.5g,


In [13]:
# Strip the unit suffixes and cast to float (all quantities end up in grams)
df['per_serving_g'] = df['per serving'].str.replace('g', '').astype(float)
df['total_fat_g'] = df['total fat'].str.replace('g', '').astype(float)
df['sugars_g'] = df['sugars'].str.replace('g', '').astype(float)
# Sodium is assumed to be reported in mg, so divide by 1000 to convert mg to g
df['sodium_g'] = df['sodium'].str.replace('mg', '').str.replace('g', '').astype(float) / 1000
df['salt_g'] = df['salt'].str.replace('g', '').astype(float)

df.head()

Unnamed: 0,type,product,per serving,total fat,sodium,sugars,salt,per_serving_g,total_fat_g,sugars_g,sodium_g,salt_g
0,cookie,Beryl's Chocolate Orange Cashew Nuts Cookies,25g,4.6g,,7.6g,,25.0,4.6,7.6,,
1,cookie,Beryl's Coconut Sable with Macadamia Nuts,25g,9.9g,,5.1g,,25.0,9.9,5.1,,
2,cookie,Beryl's Cookies Chocolate Sable,25g,7.2g,,9.2g,,25.0,7.2,9.2,,
3,cookie,Beryl's Strawberry Sable,25g,9.4g,,4.8g,,25.0,9.4,4.8,,
4,cookie,Beryl's Cookies Exquisite Selection (Tin),25g,4.0g,,5.0g,,25.0,4.0,5.0,,
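
Stripping `g` and `mg` with `str.replace` works as long as every sodium value is reported in milligrams. If a label reported sodium in grams, dividing by 1000 would understate it. One possible alternative, assuming each value is a number followed by `g` or `mg`, is a small helper that reads the unit before converting. The name `to_grams` is hypothetical and not part of this notebook:

```python
import math
import re

# Hypothetical helper: parse strings such as "98mg" or "0.70g" into grams.
_QUANTITY = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(mg|g)\s*$", re.IGNORECASE)

def to_grams(value):
    """Return the quantity in grams, or NaN if the value is missing or unparseable."""
    if not isinstance(value, str):
        return math.nan
    match = _QUANTITY.match(value)
    if match is None:
        return math.nan
    number = float(match.group(1))
    unit = match.group(2).lower()
    # Milligrams are converted to grams; grams pass through unchanged.
    return number / 1000 if unit == "mg" else number

print(to_grams("98mg"), to_grams("0.70g"), to_grams(None))
```

It could be applied column by column, for example `df['sodium'].map(to_grams)`. Values with no unit become NaN instead of being silently treated as grams.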


In [14]:
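# If sodium_g is null, derive it from salt_g: sodium_g = salt_g / 2.5 (https://www.healthdirect.gov.au/salt)

# Assign the result back instead of using inplace=True on a column selection, which newer pandas versions warn about
df['sodium_g'] = df['sodium_g'].fillna(df['salt_g'] / 2.5)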
df.head(10)

Unnamed: 0,type,product,per serving,total fat,sodium,sugars,salt,per_serving_g,total_fat_g,sugars_g,sodium_g,salt_g
0,cookie,Beryl's Chocolate Orange Cashew Nuts Cookies,25g,4.6g,,7.6g,,25.0,4.6,7.6,,
1,cookie,Beryl's Coconut Sable with Macadamia Nuts,25g,9.9g,,5.1g,,25.0,9.9,5.1,,
2,cookie,Beryl's Cookies Chocolate Sable,25g,7.2g,,9.2g,,25.0,7.2,9.2,,
3,cookie,Beryl's Strawberry Sable,25g,9.4g,,4.8g,,25.0,9.4,4.8,,
4,cookie,Beryl's Cookies Exquisite Selection (Tin),25g,4.0g,,5.0g,,25.0,4.0,5.0,,
5,cookie,Chipsmore Cookies Multipack - Original,28g,6g,98mg,8.5g,,28.0,6.0,8.5,0.098,
6,cookie,Chipsmore Cookies Multipack - Double Chocolate,28g,5.9g,94mg,8.5g,,28.0,5.9,8.5,0.094,
7,cookie,FUEL10K Double Chocolate High Protein Oat Cookie,100g,14.9g,,3.8g,0.70g,100.0,14.9,3.8,0.28,0.7
8,cookie,Jules Destrooper Biscuits - Belgian Chocolate ...,100g,25g,,51g,0.6g,100.0,25.0,51.0,0.24,0.6
9,cookie,Julie's Oat 25 Cookies - Hazelnuts &amp; Choco...,25g,6g,80mg,6.5g,,25.0,6.0,6.5,0.08,
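
The divisor of 2.5 reflects that sodium makes up roughly 40% of salt (sodium chloride) by weight. A small sketch on illustrative numbers shows that the fill only touches rows where sodium is missing, and that rows with neither sodium nor salt stay null:

```python
import numpy as np
import pandas as pd

# Illustrative values in grams; NaN marks a nutrient the label did not report.
demo = pd.DataFrame({
    "sodium_g": [0.098, np.nan, np.nan],
    "salt_g": [np.nan, 0.70, np.nan],
})

# Fill missing sodium from salt; existing sodium values are left unchanged.
demo["sodium_g"] = demo["sodium_g"].fillna(demo["salt_g"] / 2.5)
print(demo)
```

The second row matches the FUEL10K entry above, where 0.70 g of salt becomes 0.28 g of sodium.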


In [15]:
# Drop the raw text columns and salt_g, which are not needed for the subsequent EDA and modelling
df_clean = df.drop(['per serving','total fat','sodium','sugars','salt', 'salt_g'], axis=1)
df_clean.head()

Unnamed: 0,type,product,per_serving_g,total_fat_g,sugars_g,sodium_g
0,cookie,Beryl's Chocolate Orange Cashew Nuts Cookies,25.0,4.6,7.6,
1,cookie,Beryl's Coconut Sable with Macadamia Nuts,25.0,9.9,5.1,
2,cookie,Beryl's Cookies Chocolate Sable,25.0,7.2,9.2,
3,cookie,Beryl's Strawberry Sable,25.0,9.4,4.8,
4,cookie,Beryl's Cookies Exquisite Selection (Tin),25.0,4.0,5.0,


In [16]:
df_clean.shape

(88, 6)

In [17]:
df_clean.dtypes

type              object
product           object
per_serving_g    float64
total_fat_g      float64
sugars_g         float64
sodium_g         float64
dtype: object

In [18]:
# Normalise each nutrient by serving size (grams of nutrient per gram of product) so products are comparable
df_clean['total_fat_g_per_gram_of_serving'] = df_clean['total_fat_g'] / df_clean['per_serving_g']
df_clean['sugars_g_per_gram_of_serving'] = df_clean['sugars_g'] / df_clean['per_serving_g']
df_clean['sodium_g_per_gram_of_serving'] = df_clean['sodium_g'] / df_clean['per_serving_g']

In [19]:
df_clean.head(100)

Unnamed: 0,type,product,per_serving_g,total_fat_g,sugars_g,sodium_g,total_fat_g_per_gram_of_serving,sugars_g_per_gram_of_serving,sodium_g_per_gram_of_serving
0,cookie,Beryl's Chocolate Orange Cashew Nuts Cookies,25.0,4.6,7.6,,0.184,0.304,
1,cookie,Beryl's Coconut Sable with Macadamia Nuts,25.0,9.9,5.1,,0.396,0.204,
2,cookie,Beryl's Cookies Chocolate Sable,25.0,7.2,9.2,,0.288,0.368,
3,cookie,Beryl's Strawberry Sable,25.0,9.4,4.8,,0.376,0.192,
4,cookie,Beryl's Cookies Exquisite Selection (Tin),25.0,4.0,5.0,,0.16,0.2,
5,cookie,Chipsmore Cookies Multipack - Original,28.0,6.0,8.5,0.098,0.214286,0.303571,0.0035
6,cookie,Chipsmore Cookies Multipack - Double Chocolate,28.0,5.9,8.5,0.094,0.210714,0.303571,0.003357
7,cookie,FUEL10K Double Chocolate High Protein Oat Cookie,100.0,14.9,3.8,0.28,0.149,0.038,0.0028
8,cookie,Jules Destrooper Biscuits - Belgian Chocolate ...,100.0,25.0,51.0,0.24,0.25,0.51,0.0024
9,cookie,Julie's Oat 25 Cookies - Hazelnuts &amp; Choco...,25.0,6.0,6.5,0.08,0.24,0.26,0.0032
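
Dividing by serving size is what makes a 28 g serving comparable with a 100 g one. As a rough sketch, the same normalisation can be written as a loop over the nutrient columns. The two rows below reuse the Chipsmore Original and Jules Destrooper figures from the table above:

```python
import pandas as pd

# Two products with very different serving sizes (values from the table above).
demo = pd.DataFrame({
    "per_serving_g": [28.0, 100.0],
    "total_fat_g": [6.0, 25.0],
    "sugars_g": [8.5, 51.0],
})

# One derived column per nutrient: grams of nutrient per gram of product.
for nutrient in ["total_fat_g", "sugars_g"]:
    demo[f"{nutrient}_per_gram_of_serving"] = demo[nutrient] / demo["per_serving_g"]

print(demo.round(3))
```

Per serving, the second product has about four times the fat of the first. Per gram the gap is much smaller (0.25 against roughly 0.214).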


In [20]:
# Check for null values
df_clean.isnull().sum()

type                               0
product                            0
per_serving_g                      0
total_fat_g                        0
sugars_g                           1
sodium_g                           6
total_fat_g_per_gram_of_serving    0
sugars_g_per_gram_of_serving       1
sodium_g_per_gram_of_serving       6
dtype: int64

In [21]:
# Assume a null value means the nutrient is absent from the product, so impute nulls with 0
df_final = df_clean.fillna(0)
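
Filling with 0 treats "not reported on the label" the same as "none present". If that distinction matters later, one option is to record which values were imputed before filling. This is a sketch on illustrative data, and the column name `sodium_was_missing` is hypothetical:

```python
import numpy as np
import pandas as pd

# Illustrative data: product B has no sodium value on its label.
demo = pd.DataFrame({"product": ["A", "B"], "sodium_g": [0.08, np.nan]})

# Flag the missing entries first, then impute with 0.
demo["sodium_was_missing"] = demo["sodium_g"].isna()
demo = demo.fillna(0)
print(demo)
```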

In [22]:
df_final.isnull().sum()

type                               0
product                            0
per_serving_g                      0
total_fat_g                        0
sugars_g                           0
sodium_g                           0
total_fat_g_per_gram_of_serving    0
sugars_g_per_gram_of_serving       0
sodium_g_per_gram_of_serving       0
dtype: int64

In [36]:
df_final.to_csv("../data/final_df.csv",index=False)

[Next: EDA](03_eda.ipynb)