<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4 - Unveiling Chronic Disease in Singaporean Lifestyle

> Authors: Chung Yau, Gilbert, Han Kiong, Zheng Gang
---

**Problem Statement:**  
In Singapore, the increasing prevalence of chronic diseases presents a pressing public health concern, underscoring the need for proactive intervention strategies. 

How can we identify individuals at high risk for chronic diseases based on their behavioral habits? By doing so, we can enable early detection and provide recommendations, fostering a proactive approach to preventing various chronic diseases.

  
**Target Audience:**  
The product team at Synapxe, in preparation for the Healthier SG 2025 roadmap workshop. 

These are the notebooks for this project:  
 1. `01_Data_Collection_Food.ipynb`  
 2. `02_Data_Preprocessing.ipynb`   
 3. `03_FeatureEngineering_and_EDA.ipynb`
 4. `04_Data_Modelling.ipynb` 
 5. `05_Hyperparameter_Model Fitting_Evaluation.ipynb`
 6. `05a_Model_Pickling.ipynb`
 7. `06_Implementation_FoodRecommender.ipynb` 

 ---

# This Notebook: 01_Data_Collection
There are two sections to this project. We have built a classifier as well as a food recommender.   
  
**For the classifier:**

**Source:** Data sourced from the Behavioral Risk Factor Surveillance System (BRFSS), as detailed on the [CDC's BRFSS Questionnaires page](https://www.cdc.gov/brfss/questionnaires/index.htm).  
We chose this dataset because its inputs are comprehensive and its volume is substantial (combining the 2013 and 2015 surveys gives us more than 10k data points for model training). Note that we only include respondents with an Asian race profile, so that the data is more relevant to Singapore. 

**For the recommender:** 

The categories and recommended food nutrition profiles are derived from the following webpages:
- [HealthHub Dietary Allowances](https://www.healthhub.sg/live-healthy/recommended_dietary_allowances)
- [HealthHub Calorie Calculator](https://www.healthhub.sg/programmes/nutrition-hub/tools-and-resources#calorie-calculator)
- [HealthHub Protein Importance](https://www.healthhub.sg/live-healthy/why_protein_is_important#:~:text=For%20average%20Singaporean%20adults%20aged,1.2g%2Fkg%20bodyweight%20instead.)
- [HealthHub Getting the Fats Right](https://www.healthhub.sg/live-healthy/getting%20the%20fats%20right#:~:text=Fat%20should%20make%20up%20about,if%20one%20is%20not%20mindful.)
- [USDA National Agricultural Library](https://www.nal.usda.gov/programs/fnic#:~:text=How%20many%20calories%20are%20in,Facts%20label%20on%20food%20packages.)
- [Centrum Singapore - Healthy Diet](https://www.centrum.sg/expert-corner/health-blog/healthy-diet-do-you-follow-dietary-guidelines/)
- [HPB National Nutrition Survey 2022 Report](https://www.hpb.gov.sg/docs/default-source/pdf/nns-2022-report.pdf)
- [Signos - Sugar Intake for Type 2 Diabetics](https://www.signos.com/blog/how-much-sugar-should-a-type-2-diabetic-have-a-day)
- [HealthXchange - Diabetes Glycaemic Index](https://www.healthxchange.sg/diabetes/essential-guide-diabetes/diabetes-glycaemic-index-know)
- [NDTV Food - Dividing Calories in Each Meal](https://food.ndtv.com/food-drinks/how-to-divide-calories-in-each-meal-we-help-deconstruct-it-for-you-1750305#:~:text=NIN%20recommends%20dividing%20equal%20portion,the%20total%20calories%20you%20consume.)
- [Statistics Canada - Sodium Intake](https://www150.statcan.gc.ca/n1/pub/82-003-x/2006004/article/sodium/4148995-eng.htm)

The dishes are manually labelled with their cuisine types, and their nutrition values are either taken from [ObservableHQ - SG Hawker Food Nutrition](https://observablehq.com/@yizhe-ang/sg-hawker-food-nutrition) or scraped from the [HPB website](https://focos.hpb.gov.sg/eservices/ENCF/). The rest of this notebook focuses on the code used to scrape this information.


---
### **Step 1: Import Libraries**

In [2]:
from bs4 import BeautifulSoup
import pandas as pd 
import requests
import selenium
import time


from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


--- 
### **Step 2: Scraping**


Get information from HPB [Energy & Nutrient Composition of Food](https://focos.hpb.gov.sg/eservices/ENCF/)

In [3]:
url = 'https://focos.hpb.gov.sg/eservices/ENCF/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')

Create Datasets Table for Option Value and Food Group Name

In [7]:
#create list of food group:
food_group = []
for item in soup.find('select', {'name' : 'ddlFoodGroup'}).find_all('option'):
    food_group_choice = {
        'food_group_choice' : item.text,
        'value' : item.attrs['value']
    }
    food_group.append(food_group_choice)

food_group_df = pd.DataFrame(food_group)
food_group_df.head()

Unnamed: 0,food_group_choice,value
0,-- Select Food Group --,
1,BEVERAGES,83.0
2,CEREAL AND CEREAL PRODUCTS,73.0
3,EGG AND EGG PRODUCTS,78.0
4,FAST FOODS,87.0
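
To make the dropdown parsing above concrete, here is a minimal sketch that runs against a hand-written HTML fragment rather than the live page. The fragment and the helper name `dropdown_options` are assumptions for illustration; only the `ddlFoodGroup` name and the option values shown in the output above come from this notebook. It uses Python's built-in `html.parser`, so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page's dropdown (not the real page source)
SAMPLE_HTML = """
<select name="ddlFoodGroup">
  <option value="">-- Select Food Group --</option>
  <option value="83">BEVERAGES</option>
  <option value="73">CEREAL AND CEREAL PRODUCTS</option>
</select>
"""

def dropdown_options(html, select_name):
    """Return (label, value) pairs for every <option> in the named <select>."""
    soup = BeautifulSoup(html, "html.parser")
    select = soup.find("select", {"name": select_name})
    if select is None:
        # The dropdown is missing, so there is nothing to list
        return []
    return [(opt.text.strip(), opt.get("value", "")) for opt in select.find_all("option")]

options = dropdown_options(SAMPLE_HTML, "ddlFoodGroup")
print(options)
```

Returning an empty list when the dropdown is absent is one possible choice; the cell above would raise an `AttributeError` instead, which may be preferable if a missing dropdown should stop the run.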


Create Datasets Table for Option Value and Nutrition Name

In [6]:
#create list of nutrient
nutrient_group = []
for item in soup.find('select', {'name' : 'ddlNutrient'}).find_all('option'):
    nutrient_group_choice = {
        'nutrient_choice' : item.text,
        'value' : item.attrs['value']
    }
    nutrient_group.append(nutrient_group_choice)

nutrient_group_df = pd.DataFrame(nutrient_group)
nutrient_group_df.head()

Unnamed: 0,nutrient_choice,value
0,-Select-,All
1,ALA,199
2,B-Carotene,146
3,Calcium,163
4,Carbohydrate,141


Browsing the webpage with Selenium: we use Selenium to automate the browsing process and retrieve the information from the website.

In [8]:
#create function to extract information from website with Beautiful Soup
def scrape(webpage, food_group_name, nutrient_type_name):
    #instantiate Beautiful Soup with an explicit parser (same one used earlier)
    soup = BeautifulSoup(webpage, 'lxml')

    table_data = []
    for row in soup.find('table', class_ = 'gridviewlist').find_all('tr'):
        cells = row.find_all('td')
        #rows with fewer than 8 <td> cells (such as header rows) are skipped
        if len(cells) > 7:
            table_data.append({
                'food_name' : cells[1].text,
                'food_group' : cells[2].text,
                'food_sub_group': cells[3].text,
                'serving_measure': cells[4].text,
                'nutrient' : cells[5].text,
                f'{cells[5].text}_amount' : cells[6].text,
                'nutrient_unit' : cells[7].text
            })

    df = pd.DataFrame(table_data)

    #export file to csv
    df.to_csv(f'../data/{food_group_name}_{nutrient_type_name}.csv')
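
The row filter in `scrape` can be checked offline by separating the parsing from the CSV export. The sketch below does this on a hand-written table; the HTML, the helper name `parse_rows`, and the values in it are made up for illustration, and the column positions simply mirror the ones used in `scrape` above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the results table; the values are placeholders
SAMPLE_TABLE = """
<table class="gridviewlist">
  <tr><th>No</th><th>Food</th><th>Group</th><th>Sub group</th>
      <th>Serving</th><th>Nutrient</th><th>Amount</th><th>Unit</th></tr>
  <tr><td>1</td><td>Dish A</td><td>GROUP X</td><td>Sub group Y</td>
      <td>1 plate</td><td>Sodium</td><td>12.3</td><td>mg</td></tr>
</table>
"""

def parse_rows(html):
    """Return one dict per data row, using the same cell positions as scrape."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="gridviewlist")
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # The header row has <th> cells only, so it yields no <td> and is skipped
        if len(cells) > 7:
            nutrient = cells[5].text.strip()
            rows.append({
                "food_name": cells[1].text.strip(),
                "nutrient": nutrient,
                f"{nutrient}_amount": cells[6].text.strip(),
                "nutrient_unit": cells[7].text.strip(),
            })
    return rows

rows = parse_rows(SAMPLE_TABLE)
print(rows)
```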

In [9]:
#create function to browse webpage
def page_driver(food_group_value, food_group_name,nutrient_value,nutrient_group_name):
    driver = webdriver.Chrome()
    try:
        driver.get('https://focos.hpb.gov.sg/eservices/ENCF/')
        #Food Group dropdown selection
        select_food_group_element = driver.find_element(By.NAME, 'ddlFoodGroup')
        select_food_group = Select(select_food_group_element)
        #select type of Food Group from dropdown
        select_food_group.select_by_value(food_group_value) 

        #Nutrient dropdown selection
        select_nutrient_element = driver.find_element(By.NAME, 'ddlNutrient')
        select_nutrient = Select(select_nutrient_element)

        #select type of nutrient from dropdown
        select_nutrient.select_by_value(nutrient_value) 

        #trigger search button to display the table
        find_search = driver.find_element(By.XPATH,'//*[@id="btnSearch"]')
        find_search.click()

        #wait until the results table is present before reading the page
        WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CLASS_NAME,'gridviewlist')))

        #save browsed page
        page_source = driver.page_source
        #pause before closing the browser
        time.sleep(10)
    finally:
        #close the browser even if a step above fails
        driver.quit()
    return scrape(page_source, food_group_name = food_group_name, nutrient_type_name = nutrient_group_name)

Create Datasets of Food with the Nutrients

In [10]:
#mixed ethnic dishes, and sugar
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[34]['value'],
            nutrient_group_df.iloc[34]['nutrient_choice'])

In [11]:
#mixed ethnic dishes, and carbohydrate
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[4]['value'],
            nutrient_group_df.iloc[4]['nutrient_choice'])

In [12]:
#mixed ethnic dishes, and total fat
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[38]['value'],
            nutrient_group_df.iloc[38]['nutrient_choice'])

In [13]:
#mixed ethnic dishes, and cholesterol
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[6]['value'],
            nutrient_group_df.iloc[6]['nutrient_choice'])

In [14]:
#mixed ethnic dishes, and sodium
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[32]['value'],
            nutrient_group_df.iloc[32]['nutrient_choice'])

In [15]:
#mixed ethnic dishes, and protein
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[27]['value'],
            nutrient_group_df.iloc[27]['nutrient_choice'])

In [16]:
#mixed ethnic dishes, and calories
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[9]['value'],
            nutrient_group_df.iloc[9]['nutrient_choice'])

In [17]:
#mixed ethnic dishes, and glycemic index
page_driver(food_group_df.iloc[11]['value'],
            food_group_df.iloc[11]['food_group_choice'],
            nutrient_group_df.iloc[12]['value'],
            nutrient_group_df.iloc[12]['nutrient_choice'])

At the end of this section, we will have created 8 CSV files, which we need for the cleaning in the next section.  
The CSV files can be found in the data folder, named `{food_group_name}_{nutrient_type_name}.csv`.
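
The eight calls above pick dropdown entries by position (for example `iloc[11]` and `iloc[34]`), which would select a different option if the site ever reordered its lists. A hedged alternative is to look options up by label. The sketch below uses a toy stand-in for `nutrient_group_df` (its three codes are copied from the output shown earlier); the helper name `option_for` is hypothetical.

```python
import pandas as pd

# Toy stand-in for nutrient_group_df; codes copied from the output shown earlier
nutrient_options = pd.DataFrame({
    "nutrient_choice": ["-Select-", "ALA", "Calcium", "Carbohydrate"],
    "value": ["All", "199", "163", "141"],
})

def option_for(options, label, label_col="nutrient_choice"):
    """Return the dropdown value whose label matches exactly; fail loudly otherwise."""
    matches = options.loc[options[label_col] == label, "value"]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one option named {label!r}, found {len(matches)}")
    return matches.iloc[0]

wanted = ["Carbohydrate", "Calcium"]
selected = {name: option_for(nutrient_options, name) for name in wanted}
print(selected)
```

With the real frames, the same lookup could feed a single loop over the nutrient names that calls `page_driver` once per nutrient, instead of eight near-identical cells.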

---
### **Step 3: Cleaning**

Read CSV file saved in previous section

In [18]:
df1 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Protein.csv')
df2 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Total fat.csv')
df3 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Carbohydrate.csv')
df4 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Cholesterol.csv')
df5 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Sodium.csv')
df6 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Sugar.csv')
df7 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Glycemic index.csv')
df8 = pd.read_csv('../data/MIXED ETHNIC DISHES, ANALYZED IN SINGAPORE_Energy.csv')

Merge all datasets into one dataset

In [19]:
#combine all datasets by stacking them row-wise
df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8], axis = 0)

Check for null data

In [20]:
df.isnull().sum()

Unnamed: 0                 0
food_name                  0
food_group                 0
food_sub_group             0
serving_measure            0
nutrient                   0
Protein_amount           689
nutrient_unit              0
Total fat_amount         689
Carbohydrate_amount      689
Cholesterol_amount       689
Sodium_amount            689
Sugar_amount             689
Glycemic index_amount    763
Energy_amount            689
dtype: int64
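
Part of these null counts follows from how the files are combined: each CSV has its own `{nutrient}_amount` column, so stacking them row-wise leaves every other amount column empty in each row. The toy example below (made-up dishes and values) shows the effect, together with one possible way to collapse the stacked rows into one row per dish. This is a sketch, not the approach taken in this notebook.

```python
import pandas as pd

# Made-up values, for illustration only
protein = pd.DataFrame({"food_name": ["Dish A", "Dish B"],
                        "Protein_amount": [10.0, 20.0]})
sodium = pd.DataFrame({"food_name": ["Dish A", "Dish B"],
                       "Sodium_amount": [300.0, 400.0]})

# Row-wise stacking: each row only has a value in the column of its own file
stacked = pd.concat([protein, sodium], axis=0)
print(stacked.isnull().sum())

# One possible alternative: keep the first non-null value per column for each dish
combined = stacked.groupby("food_name", as_index=False).first()
print(combined)
```

Note that `drop_duplicates(subset=['food_name'])` keeps only the first row for each dish, so amounts that sit in later rows are discarded rather than merged.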

Drop duplicated food names

In [21]:
#drop duplicated food name
df.drop_duplicates(subset= ['food_name'], inplace = True)

In [23]:
#reset index
df = df.reset_index(drop = True)
df.shape

(575, 15)

We see that there is a lot of null data in the combined dataset. We will scrape again, but this time we search for each food by name and look up the relevant nutritional value, instead of reading it only from the table presented, as we found that pagination does not work on the website we intend to scrape. 

In [24]:
df.drop(columns=['nutrient_unit']).columns[6:]

Index(['Protein_amount', 'Total fat_amount', 'Carbohydrate_amount',
       'Cholesterol_amount', 'Sodium_amount', 'Sugar_amount',
       'Glycemic index_amount', 'Energy_amount'],
      dtype='object')

In [25]:
#convert column nutrient type to value for scraping purpose
def option_value(column):
    if 'Protein' in column:
        value = '135'
    elif "Total fat" in column: 
        value = "136"
    elif 'Carbohydrate' in column:
        value = '141'
    elif 'Cholesterol' in column:
        value = '140'
    elif 'Energy' in column:
        value = '134'
    elif 'Sodium' in column:
        value = '160'
    elif 'Glycemic index' in column:
        value = '219'
    elif 'Sugar' in column:
        value = "143"
    else:
        #fail with a clear message instead of an unbound variable error
        raise ValueError(f'No nutrient option value known for column: {column}')

    return value    
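
The same mapping can also be written as a dictionary lookup, which keeps the codes in one place. This is a minimal sketch: the codes are copied from `option_value` above, while the names `NUTRIENT_CODES` and `nutrient_code` are hypothetical, and it assumes column names follow the `{nutrient}_amount` pattern.

```python
# Option values copied from option_value above
NUTRIENT_CODES = {
    "Protein": "135",
    "Total fat": "136",
    "Carbohydrate": "141",
    "Cholesterol": "140",
    "Energy": "134",
    "Sodium": "160",
    "Glycemic index": "219",
    "Sugar": "143",
}

SUFFIX = "_amount"

def nutrient_code(column):
    """Map a column such as 'Sodium_amount' to its dropdown option value."""
    # Strip the suffix if present, then require an exact nutrient name
    name = column[:-len(SUFFIX)] if column.endswith(SUFFIX) else column
    if name not in NUTRIENT_CODES:
        raise KeyError(f"no option value known for column {column!r}")
    return NUTRIENT_CODES[name]

print(nutrient_code("Total fat_amount"))
```

Unlike the substring checks, this requires an exact nutrient name, so a column such as `nutrient_unit` raises an error instead of being matched by accident.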

In [26]:
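#scraping function
def scrape(webpage):
    soup = BeautifulSoup(webpage, 'lxml')
    row_data = []
    try:
        #after the loop, row_data holds the cells of the last row in the table
        for row in soup.find('table', class_ = 'gridviewlist').find_all('tr'):
            row_data = []
            cells = row.find_all('td')
            for item in cells:
                row_data.append(item.text.strip())
        value = row_data[4]
        print("Success")
        return value
    except (AttributeError, IndexError):
        #no results table on the page, or the last row has fewer than 5 cells
        print("No Data")
        return None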

In [27]:
#accessing web with Selenium
def page_driver(data, column_number, index): 
    #instantiate selenium
    driver = webdriver.Chrome()
    try:
        #fetch webpage from the given link
        driver.get('https://focos.hpb.gov.sg/eservices/ENCF/')

        #find Food Name textbox and type the food name into it
        find_food_name = driver.find_element(By.NAME, 'txtFoodName')
        find_food_name.send_keys(data['food_name'][index])

        #Nutrient dropdown selection
        select_nutrient_element = driver.find_element(By.NAME, 'ddlNutrient')
        select_nutrient = Select(select_nutrient_element)
        #select type of nutrient from dropdown
        select_nutrient.select_by_value(option_value(data.columns[column_number])) 

        #trigger search button to display the table
        find_search = driver.find_element(By.XPATH,'//*[@id="btnSearch"]')
        find_search.click()

        # WebDriverWait(driver,5).until(EC.presence_of_element_located((By.CLASS_NAME,'gridviewlist')))
        #save browsed page
        page_source = driver.page_source
    finally:
        #close the browser so repeated calls do not leave Chrome windows open
        driver.quit()

    return scrape(page_source)

In [33]:
df.drop(columns=['nutrient'], inplace= True)

In [None]:
# create dictionary of column name and its index
column_list = {'Protein_amount' : 5,
               'Total fat_amount' : 7,
               'Carbohydrate_amount' : 8,
               'Cholesterol_amount' : 9,
               'Sodium_amount': 10,
               'Sugar_amount': 11,
               'Glycemic index_amount' : 12,
               'Energy_amount' : 13
}

#run loop to fill in nan values by scraping value from source
for column_name, column_index in column_list.items():
    for index, row in df.iterrows():
        if pd.isna(row[column_name]):
            replacement_value = page_driver(df,column_index,index)

            df.at[index, column_name] = replacement_value
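
The fill logic in the loop above can be exercised without opening a browser by passing the lookup in as a function. The sketch below is an assumption-laden illustration: `fill_missing` and `fake_fetch` are hypothetical names, the data is made up, and `fake_fetch` stands in for the Selenium-based `page_driver`.

```python
import pandas as pd

def fill_missing(df, columns, fetch):
    """Fill NaN cells in `columns` using fetch(food_name, column); keep existing values."""
    filled = df.copy()
    for column in columns:
        # Visit only the rows where this column is missing
        missing_index = filled.index[filled[column].isna().to_numpy()]
        for index in missing_index:
            filled.at[index, column] = fetch(filled.at[index, "food_name"], column)
    return filled

calls = []

def fake_fetch(food_name, column):
    # Stand-in for the real scraper: records the request and returns a dummy value
    calls.append((food_name, column))
    return 1.0

# Made-up data, for illustration only
toy = pd.DataFrame({
    "food_name": ["Dish A", "Dish B"],
    "Protein_amount": [10.0, None],
    "Sodium_amount": [None, 400.0],
})

result = fill_missing(toy, ["Protein_amount", "Sodium_amount"], fake_fetch)
print(result)
print(calls)
```

If the lookup returns `None`, as `scrape` does when no table is found, the cell stays null, which is consistent with the null counts that remain after the loop.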

In [35]:
df.isnull().sum()

Unnamed: 0                 0
food_name                  0
food_group                 0
food_sub_group             0
serving_measure            0
Protein_amount            22
nutrient_unit              0
Total fat_amount          72
Carbohydrate_amount      484
Cholesterol_amount       499
Sodium_amount            514
Sugar_amount             520
Glycemic index_amount    545
Energy_amount            519
dtype: int64

In [36]:
df.shape

(575, 14)

---
### **Step 4: Export**

We now export the scraped data for manual labelling and cleaning, in preparation for developing the recommender.

In [37]:
df.to_csv('../data/food_data_sg.csv', index = False)