# Part 1 (Obtain and Scrub the data)

## Created by Konstantin Georgiev

### Email: dragonflareful@gmail.com

## Introduction

This research aims to provide more insight into the healthiness of French food products based on their nutritional value.

I've chosen to single out French products because France is one of the first countries to create a nutrition labelling system, which was introduced in 2015 and aims to decrease diet-related diseases and obesity across Europe. Products are labeled with grades ranging from __'A'__ to __'E'__, __'A'__ meaning that the product has excellent nutritional quality and __'E'__ meaning that it has very poor nutritional quality. I will be using this grading system to determine which nutrients have the most impact on the grade of a product. To put my findings in context, I will use two additional fast-food datasets to see how the nutrition values relate to one another.

I have chosen to divide the process into three separate parts:<br><br>
 __1. Loading and cleaning the three datasets.__<br><br>
 __2. Performing Exploratory Data Analysis using the cleaned data from part 1.__<br><br>
 __3. Performing simple modelling also using the cleaned data from part 1.__<br><br>

### Libraries

 - pandas - main library used for loading, cleaning and filtering the datasets
 - numpy - for math calculations, working with NaNs, filtering with conditions
 - matplotlib - for visualizations during exploration
 - nose.tools - for unit testing
 - scipy - for hypothesis testing
 - basemap - for plotting a map of the locations where the products were packaged
 - scikit-learn - preprocessing and logistic regression

### Datasets 

 - __Open Food Facts__ - https://www.kaggle.com/openfoodfacts/world-food-facts - provides information on food products such as ingredients, allergens, and most importantly various nutrition facts, which will be very useful in my case<br><br>
 
 - __Nutrition Facts for McDonalds Menu__ - https://www.kaggle.com/mcdonalds/nutrition-facts - provides detailed information on the amount of nutrients contained in each McDonalds product<br><br>
 
 - __Nutrition Facts for Starbucks Menu__ - https://www.kaggle.com/starbucks/starbucks-menu - provides detailed information on the amount of nutrients contained in each Starbucks product

### Problem statements

To research the nutrition values of French products and what affects them, I have chosen to compare them to fast-food products: McDonalds in terms of meat quality, and Starbucks in terms of beverage quality.
I will also be looking into other factors such as packaging, food additive count and whether the products contain ingredients from palm oil.
The main questions I'll be looking to answer are:
 - __Which nutrient has the biggest impact on the nutrition grade?__<br><br>
 - __How do French meat and beverages compare to McDonalds meat and Starbucks beverages in terms of nutrients?__<br><br>
 - __Do other factors like food packaging, additive count and palm oil in the ingredients have an impact on the nutrition value?__<br><br>

To answer these questions I have chosen to single out the three most popular and most essential nutrients contained in food: __carbohydrates__, __fat__ and __protein__.

### Project structure

I have chosen to divide my project into three separate notebooks to improve readability. The first part involves obtaining and cleaning the three datasets, as well as filtering them for EDA and finally exporting them. The second part is the core of the project and includes exploratory data analysis on the three cleaned datasets, visualizations and hypothesis testing. The final part is done just for fun: a simple logistic regression model, which will attempt to predict whether a French product contains additives or not.

I will start this research by loading and cleaning each dataset separately. The first thing we need to do is load the required packages.

In [None]:
import pandas as pd
import numpy as np

from nose.tools import *

## Step 1 - The Open Food Facts dataset

### Obtaining the dataset

Let's start by obtaining the dataset and checking whether it was correctly loaded into a `pandas` dataframe. It seems to be tab-separated, so I'm keeping that in mind.

In [None]:
world_food_data=pd.read_csv("../input/world-food-facts/en.openfoodfacts.org.products.tsv", sep="\t", low_memory=False)

In [None]:
assert_is_not_none(world_food_data)
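
As a small illustration of why the `sep="\t"` argument matters, here is a minimal sketch that parses a tiny tab-separated string. The sample rows are made up for the example and are not taken from the real dataset.

```python
import io
import pandas as pd

# Made-up sample rows for illustration only; not taken from the real dataset.
sample_tsv = "product_name\tfat_100g\nPlain yogurt\t3.5\nOlive oil\t100.0\n"

# With the tab separator, each field lands in its own column.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# With the default comma separator, each row collapses into a single column.
wrong_df = pd.read_csv(io.StringIO(sample_tsv))

print(sample_df.shape)
print(wrong_df.shape)
```

Comparing the two shapes is a quick way to notice a wrong separator before doing any cleaning.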

Next I'm going to get a view of what the dataframe looks like by printing the first few rows and its shape.

In [None]:
world_food_data.head()

In [None]:
print("Total {} observations on {} features".format(world_food_data.shape[0],world_food_data.shape[1]))

### Cleaning the dataset

We can see that the dataframe is pretty large and that's not ideal for exploration. So I'm going to pick the features I'll be using later on in part 2:
`["product_name","packaging","main_category","nutrition_grade_fr",`<br>`"nutrition-score-fr_100g","fat_100g","carbohydrates_100g","proteins_100g","additives_n",`<br>`"ingredients_from_palm_oil_n","first_packaging_code_geo"]`<br>
These columns are the main factors for exploration, which I set in my questions. Also, the last column represents the packaging coordinates of the products. I will keep that as well for confirmation of the product locations. After that, I'm going to rename some of the columns so that their names are more pythonic and accessible.<br>
Now there should be only 11 columns left in the dataframe.

In [None]:
cols_to_keep=["product_name","packaging","main_category","nutrition_grade_fr",
              "nutrition-score-fr_100g","fat_100g","carbohydrates_100g","proteins_100g",
               "additives_n","ingredients_from_palm_oil_n","first_packaging_code_geo"]
world_food_data=world_food_data[cols_to_keep]
world_food_data=world_food_data.rename(columns={"nutrition-score-fr_100g":"nutrition_score",
                                                "fat_100g":"fat_g",
                                               "carbohydrates_100g":"carbohydrates_g",
                                               "proteins_100g":"proteins_g"})

In [None]:
assert_equal(world_food_data.shape[1],11)

In [None]:
world_food_data.head()

Next I'm going to check out the values in some columns to get familiar with the data.

In [None]:
len(world_food_data[world_food_data.packaging.isnull()])

In [None]:
len(world_food_data[world_food_data.first_packaging_code_geo.isnull()])

In [None]:
world_food_data.additives_n.unique()

In [None]:
world_food_data.ingredients_from_palm_oil_n.unique()

In [None]:
len(world_food_data[world_food_data.ingredients_from_palm_oil_n==2])

Now the dataframe seems more readable, but there are a lot of null values in each column. Just dropping every row with a null value would result in a great loss of data, so before I do that I decided to fill in some of the columns.<br><br> First of all, it seems that a lot of the products have an unknown type of packaging and unknown packaging coordinates, so I'm just going to fill those values with the most common ones in the column.<br><br>
After that, I decided to fill the missing additive counts with the column mean, because there seem to be many products with different counts. That is not the case with the palm oil column, so I decided to fill its null values with zeroes, since only a very small number of products have 2 such ingredients and they are likely to disappear when I filter the data.<br><br>
After applying these changes, I will drop the remaining rows with NaNs to keep the data in the rest of the features accurate.

In [None]:
# Fill values: the most frequent entry for the two packaging columns, the mean for the additive count.
most_common_coords=world_food_data.first_packaging_code_geo.value_counts().index[0]
most_common_packaging=world_food_data.packaging.value_counts().index[0]
mean_additives=world_food_data.additives_n.mean()

# Assign the filled columns back directly; chained `.loc` assignment on a column may not update the dataframe.
world_food_data["additives_n"]=world_food_data["additives_n"].fillna(mean_additives)
world_food_data["ingredients_from_palm_oil_n"]=world_food_data["ingredients_from_palm_oil_n"].fillna(0)
world_food_data["first_packaging_code_geo"]=world_food_data["first_packaging_code_geo"].fillna(most_common_coords)
world_food_data["packaging"]=world_food_data["packaging"].fillna(most_common_packaging)

# Drop the rows that still have missing values in any of the remaining columns.
world_food_data=world_food_data.dropna()

In [None]:
print("Total {} observations on {} features".format(world_food_data.shape[0],world_food_data.shape[1]))

We can see that we still have 71091 observations to work with, which should be enough for this research.


In [None]:
assert_is_not_none(most_common_coords)
assert_is_not_none(most_common_packaging)
assert_is_not_none(mean_additives)
assert_false(world_food_data.isnull().any().any())
assert_equal(world_food_data.shape,(71091,11))
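
To make the filling strategy easier to follow, here is a minimal sketch on a toy dataframe. The values are invented, and passing a dictionary to `fillna` is just one possible way to express "one fill value per column"; it is not necessarily how the cell above has to be written.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "packaging": ["plastic", None, "plastic", "glass"],
    "additives_n": [0.0, 4.0, np.nan, 2.0],
    "ingredients_from_palm_oil_n": [np.nan, 1.0, 0.0, 0.0],
    "fat_g": [1.0, 2.0, 3.0, np.nan],
})

# One fill value per column: the mode for the categorical column,
# the mean for the additive count, zero for the palm oil count.
fill_values = {
    "packaging": toy["packaging"].mode()[0],
    "additives_n": toy["additives_n"].mean(),
    "ingredients_from_palm_oil_n": 0,
}
toy_filled = toy.fillna(value=fill_values)

# Rows that still contain NaN in any other column are dropped.
toy_clean = toy_filled.dropna()
print(toy_clean)
```

Only the row with a missing `fat_g` is dropped here, which mirrors the idea of keeping as many observations as possible while leaving the nutrient columns untouched.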

Let's take a look at the feature data types and some of the unique column values.

In [None]:
world_food_data.dtypes

In [None]:
world_food_data.nutrition_score.unique()

In [None]:
world_food_data.additives_n.unique()

In [None]:
world_food_data.ingredients_from_palm_oil_n.unique()

In [None]:
world_food_data.head()

There are a few changes I would like to make here.<br><br>
First of all, since I'll be extracting the French products from this dataframe, I'm going to remove the three-character prefix (such as `en:`) from the values in the `main_category` column.<br><br> Secondly, there's no need for the columns `additives_n`, `ingredients_from_palm_oil_n` and `nutrition_score` to be floating point, so I'm going to convert them into integers.<br><br>Then, I'd like to split the first packaging coordinates column `first_packaging_code_geo` into two separate columns, one for the latitude and one for the longitude, for easy plotting later on. I will also round these coordinates to two decimal places and drop the old column.<br><br>Finally, I'm going to reset the index, since I dropped a lot of rows in the previous steps.<br><br> The dataframe should now have 12 features.

In [None]:
world_food_data["main_category"]=world_food_data["main_category"].map(lambda x: str(x)[3:])
world_food_data[["additives_n","ingredients_from_palm_oil_n"]]=world_food_data[["additives_n","ingredients_from_palm_oil_n"]].astype(int)
# Split once on the comma; `n` must be passed by keyword in recent pandas versions.
world_food_data[["fp_lat","fp_lon"]]=world_food_data["first_packaging_code_geo"].str.split(",", n=1, expand=True)
world_food_data.fp_lat=world_food_data.fp_lat.astype(float).round(2)
world_food_data.fp_lon=world_food_data.fp_lon.astype(float).round(2)
world_food_data=world_food_data.drop(columns="first_packaging_code_geo")

world_food_data.nutrition_score=world_food_data.nutrition_score.astype(int)

world_food_data=world_food_data.reset_index(drop=True)

In [None]:
assert_equal(world_food_data.fp_lat.dtype,float)
assert_equal(world_food_data.fp_lon.dtype,float)
assert_equal(world_food_data.nutrition_score.dtype,int)
assert_equal(world_food_data.shape[1],12)
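
The prefix removal and the coordinate split can be tried out on a couple of rows first. The sketch below uses invented sample values in the same `xx:category` and `lat,lon` formats; the category names and coordinates are assumptions made for the example.

```python
import pandas as pd

# Invented sample values in the "xx:category" and "lat,lon" formats.
toy = pd.DataFrame({
    "main_category": ["en:beverages", "fr:fromages"],
    "first_packaging_code_geo": ["47.633333,-2.666667", "48.8566,2.3522"],
})

# Drop the first three characters (the two-letter prefix and the colon).
toy["main_category"] = toy["main_category"].str[3:]

# Split once on the comma; expand=True returns one column per piece.
toy[["fp_lat", "fp_lon"]] = toy["first_packaging_code_geo"].str.split(",", n=1, expand=True)

# The pieces are still strings, so convert them before rounding.
toy[["fp_lat", "fp_lon"]] = toy[["fp_lat", "fp_lon"]].astype(float).round(2)
toy = toy.drop(columns="first_packaging_code_geo")
print(toy)
```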

In [None]:
world_food_data.dtypes

In [None]:
world_food_data.head()

The dataframe seems much cleaner now and the data types are correct. Now there are just a few more things I would like to add.<br><br>I'm going to add a column called `contains_additives`, which will be:
 - 1 - if the additive count is > 0
 - 0 - if the additive count is = 0

This will be used later on for modelling. I also noticed that the `packaging` column contains string values starting with both uppercase and lowercase. So I'm going to convert all of the words into lowercase for correct filtering later on in the exploration.

In [None]:
world_food_data["contains_additives"]=pd.Series(np.where(world_food_data.additives_n>0,1,0)).astype(int)
world_food_data.packaging=world_food_data.packaging.str.lower()

In [None]:
assert_less(world_food_data.contains_additives.max(),2)
assert_greater_equal(world_food_data.contains_additives.min(),0)
assert_equal(world_food_data.shape[1],13)
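
A flag column like this can also be derived directly from a boolean comparison, which keeps the dataframe's own index and therefore does not depend on the index having been reset beforehand. The sketch below uses invented data with a non-default index to show the difference; the column name `flag_from_array` is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented toy data with a non-default index, as left behind by dropping rows.
toy = pd.DataFrame({"additives_n": [0, 3, 1]}, index=[2, 5, 9])

# The comparison returns a boolean Series on the same index as `toy`.
toy["contains_additives"] = (toy["additives_n"] > 0).astype(int)

# A Series built from a bare NumPy array gets a fresh 0..n-1 index instead,
# so assigning it aligns on labels and only label 2 finds a match here.
misaligned = pd.Series(np.where(toy["additives_n"] > 0, 1, 0))
toy["flag_from_array"] = misaligned
print(toy)
```

On a dataframe whose index was just reset, both approaches give the same result.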

There should now be 13 features, and the `contains_additives` column should only contain the values 0 and 1.

In [None]:
world_food_data.head()

Let's take a look at the `ingredients_from_palm_oil_n` column. It seems that there are still 59 products that contain 2 ingredients from palm oil. For simplicity, I'm going to replace these values with 1, so that the column simply indicates whether a French product contains such ingredients or not.

In [None]:
world_food_data["ingredients_from_palm_oil_n"].unique()

In [None]:
len(world_food_data[world_food_data.ingredients_from_palm_oil_n==2])

In [None]:
world_food_data.loc[world_food_data["ingredients_from_palm_oil_n"]==2,"ingredients_from_palm_oil_n"]=1

In [None]:
assert_greater_equal(world_food_data.ingredients_from_palm_oil_n.min(),0)
assert_less_equal(world_food_data.ingredients_from_palm_oil_n.max(),1)
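
Another way to turn a small count into a 0/1 indicator is to cap the values at 1. This is only a sketch of an alternative on invented counts, not a replacement for the cell above.

```python
import pandas as pd

# Invented counts of palm oil ingredients per product.
counts = pd.Series([0, 1, 2, 0, 2])

# clip caps every value above 1 at 1 and leaves 0 and 1 untouched.
has_palm_oil = counts.clip(upper=1)
print(has_palm_oil.tolist())
```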

This concludes the cleaning of the Open Food Facts dataset and now I'm going to move on to the fast food datasets.

## Step 2 - The Starbucks dataset

### Obtaining the dataset

The other two datasets are smaller and much easier to clean and obtain.<br><br>
First I will load the Starbucks dataset from the `.csv` file and check if it was loaded correctly. Then I will print the first few rows, similarly to what I did to the previous one.

In [None]:
starbucks_data=pd.read_csv("../input/starbucks-menu/starbucks_drinkMenu_expanded.csv")

In [None]:
assert_is_not_none(starbucks_data)

In [None]:
starbucks_data.head()

### Cleaning the dataset

The first change I'd like to make would be to rename the columns so that I can access them more easily.<br><br>
First, I'll filter out the brackets and spaces in the column names and convert these names to lowercase. After that, I'll save the ones I'll be using later on in the list `cols_to_keep`.<br><br>I'll also rename the nutrient columns and the `beverage` column to match the features in the `world_food_data` dataset for simplicity.

In [None]:
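# regex=False treats the brackets as literal characters rather than regular-expression syntax.
starbucks_data.columns=starbucks_data.columns.str.replace(")","",regex=False)
starbucks_data.columns=starbucks_data.columns.str.replace(" ","",regex=False)
starbucks_data.columns=starbucks_data.columns.str.replace("(","_",regex=False)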
starbucks_data.columns=starbucks_data.columns.str.lower()
cols_to_keep=["beverage_category", "beverage","beverage_prep","calories","totalfat_g","totalcarbohydrates_g",
               "protein_g"]
starbucks_data=starbucks_data[cols_to_keep]
starbucks_data=starbucks_data.rename(columns={"totalfat_g":"fat_g",
                                                "totalcarbohydrates_g":"carbohydrates_g",
                                              "protein_g":"proteins_g",
                                              "beverage":"product_name"
                                               })

In [None]:
starbucks_data.shape

The resulting dataset should have 242 observations and 7 features.

In [None]:
assert_equal(starbucks_data.shape,(242,7))
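
Since the same column-name rules are useful for more than one menu file, they could be wrapped in a small function. The helper name `clean_column_names` is hypothetical, and the header names below are invented examples in the same style as the menu files.

```python
import pandas as pd

def clean_column_names(columns):
    """Hypothetical helper: apply the same literal replacements in one place."""
    return (
        columns.str.replace(")", "", regex=False)
        .str.replace(" ", "", regex=False)
        .str.replace("(", "_", regex=False)
        .str.lower()
    )

# Invented header names in the same style as the menu files.
toy = pd.DataFrame(columns=[" Total Fat (g)", "Beverage_prep", " Protein (g) "])
toy.columns = clean_column_names(toy.columns)
print(list(toy.columns))
```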

In [None]:
starbucks_data.head()

Let's take a look at the data types.

In [None]:
starbucks_data.dtypes

It seems that the carbohydrates were rounded to integers here, but I'm going to convert them to float, so that they match the carbohydrates column data types in `world_food_data`.

In [None]:
starbucks_data.carbohydrates_g=starbucks_data.carbohydrates_g.astype(float)

In [None]:
assert_equal(starbucks_data.carbohydrates_g.dtype,float)

The `fat_g` column is also of type `object`, which is pretty odd. Let's take a look at why that is.

In [None]:
starbucks_data.fat_g.unique()

It seems that there is just a mistake in the data. As I'm not really sure what the value `3 2` is supposed to be, I'm just going to replace it with NaN and convert the data type to float.

In [None]:
# Replace the malformed entry with NaN so the column can be converted to float.
starbucks_data.fat_g=starbucks_data.fat_g.replace("3 2",np.nan)
starbucks_data.fat_g=starbucks_data.fat_g.astype(float)

In [None]:
assert_equal(starbucks_data.fat_g.dtype,float)
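
When the malformed values are not known in advance, `pd.to_numeric` with `errors="coerce"` is a possible alternative: it converts everything it can and turns the rest into NaN. A minimal sketch on invented values, where `"3 2"` stands in for the malformed entry:

```python
import pandas as pd

# Invented values; "3 2" stands in for the malformed entry.
fat = pd.Series(["0.1", "3 2", "1.5"])

# Anything that cannot be parsed as a number becomes NaN.
fat_numeric = pd.to_numeric(fat, errors="coerce")
print(fat_numeric.isna().sum())
```

The trade-off is that coercion silently hides every unparseable value, so it is worth counting the resulting NaNs, as the last line does.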

In [None]:
starbucks_data.dtypes

In [None]:
starbucks_data.head()

The Starbucks dataset seems ready for exploration now. So I'll move on to the last dataset - the __McDonalds__ dataset.

## Step 3 - The McDonalds dataset

### Obtaining the dataset

Like the Starbucks dataset, this one is also very simple to load.

In [None]:
mcd_menu_data=pd.read_csv("../input/nutrition-facts/menu.csv")

In [None]:
assert_is_not_none(mcd_menu_data)

### Cleaning the dataset

I'm going to apply the same column name cleaning rules as in the previous dataset here, as the columns are almost identical. I'll also rename the `item` column to `product_name`, again to match the French product name column.

In [None]:
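# regex=False treats the brackets as literal characters rather than regular-expression syntax.
mcd_menu_data.columns=mcd_menu_data.columns.str.replace(")","",regex=False)
mcd_menu_data.columns=mcd_menu_data.columns.str.replace(" ","",regex=False)
mcd_menu_data.columns=mcd_menu_data.columns.str.replace("(","_",regex=False)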
mcd_menu_data.columns=mcd_menu_data.columns.str.lower()
cols_to_keep=["category","item","calories","totalfat","carbohydrates","protein"]
mcd_menu_data=mcd_menu_data[cols_to_keep]
mcd_menu_data=mcd_menu_data.rename(columns={"totalfat":"fat_g",
                                            "item":"product_name",
                                           "carbohydrates":"carbohydrates_g",
                                           "protein":"proteins_g"
                                           })

In [None]:
mcd_menu_data.shape

The dataset should now have 260 observations and 6 features.

In [None]:
assert_equal(mcd_menu_data.shape,(260,6))

In [None]:
mcd_menu_data.head()

I'll analyze the data types of the columns again.

In [None]:
mcd_menu_data.dtypes

As in the previous dataset, the carbohydrates and the proteins columns are rounded to integers. I'll convert them to `float` to match the French product data types.

In [None]:
mcd_menu_data.carbohydrates_g=mcd_menu_data.carbohydrates_g.astype(float)
mcd_menu_data.proteins_g=mcd_menu_data.proteins_g.astype(float)

In [None]:
assert_equal(mcd_menu_data.carbohydrates_g.dtype,float)
assert_equal(mcd_menu_data.proteins_g.dtype,float)
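
Because the three datasets will be compared on the same nutrient columns, it can help to check their data types side by side. The sketch below uses invented one-row stand-ins for the three cleaned dataframes; the dictionary keys and values are assumptions for illustration.

```python
import pandas as pd

# Invented one-row stand-ins for the three cleaned dataframes.
frames = {
    "french": pd.DataFrame({"fat_g": [1.5], "carbohydrates_g": [10.0], "proteins_g": [2.0]}),
    "starbucks": pd.DataFrame({"fat_g": [0.1], "carbohydrates_g": [5.0], "proteins_g": [0.3]}),
    "mcdonalds": pd.DataFrame({"fat_g": [13.0], "carbohydrates_g": [31], "proteins_g": [17]}),
}
nutrient_cols = ["fat_g", "carbohydrates_g", "proteins_g"]

# Cast the shared nutrient columns to float in every frame.
frames = {
    name: df.astype({col: float for col in nutrient_cols})
    for name, df in frames.items()
}

# One column per dataset, one row per nutrient, for a quick visual check.
dtype_table = pd.DataFrame({name: df[nutrient_cols].dtypes for name, df in frames.items()})
print(dtype_table)
```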

In [None]:
mcd_menu_data.head()

Finally, let's confirm that the proteins and carbohydrates column values are correct.

In [None]:
mcd_menu_data.proteins_g.unique()

In [None]:
mcd_menu_data.carbohydrates_g.unique()

This dataset seems clean now.<br><br>
We are finally finished with the rigorous cleaning process. Now we can move on to exploring the data.<br><br>

## Step 4 - Export the cleaned datasets

As a final step, I'm just going to export the cleaned datasets into `.csv` files for easy usage in part 2, while also removing the index column, because it isn't needed in this case.

In [None]:
mcd_menu_data.to_csv("mcd_menu_scrubbed.csv",index=False)
starbucks_data.to_csv("star_menu_scrubbed.csv",index=False)
world_food_data.to_csv("world_food_scrubbed.csv",index=False)
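
To see what `index=False` changes, here is a minimal round-trip sketch that writes a toy dataframe to an in-memory buffer instead of a file. The product names and values are invented.

```python
import io
import pandas as pd

# Invented toy data standing in for one of the cleaned menus.
toy = pd.DataFrame({"product_name": ["Latte", "Mocha"], "fat_g": [3.5, 8.0]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without the index column, the data reads back with the same columns.
restored = pd.read_csv(buffer)
print(list(restored.columns))
```

With the default `index=True`, the old index would be written as an extra unnamed column and would need to be dropped again when loading the files in part 2.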