# Part 3 (Modelling)

## Created by Konstantin Georgiev
### Email: dragonflareful@gmail.com

Next, just for fun, I'll apply logistic regression to build a model that predicts whether a product contains additives.

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from nose.tools import *

## Step 1 - Obtaining the data

First, we'll load the cleaned French products dataset from __Part 1__.

In [None]:
world_food_data=pd.read_csv("../input/openfoodfactsclean/world_food_scrubbed.csv")

As we previously saw, the dataset should have 71091 observations and 13 features.

In [None]:
assert_is_not_none(world_food_data)
assert_equal(world_food_data.shape,(71091,13))

In [None]:
world_food_data.head()

## Step 2 - Preparing the data for modelling

I'm going to drop the columns which I'm not going to use for training and confirm that the number of features is __8__.

In [None]:
world_food_data.drop(columns=["product_name","packaging","additives_n","fp_lat","fp_lon"],inplace=True)

In [None]:
assert_equal(world_food_data.shape[1],8)

In [None]:
world_food_data.head()

After that, I'll prepare the data for modelling.<br>First, I'll call `pd.get_dummies()` from `pandas`, which one-hot encodes the categorical columns: each unique value becomes its own indicator column, while numeric columns are left unchanged. Next, I'll drop the `contains_additives` column from the features and use it as the __target__, since this is what we want to predict.
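
To make the encoding step concrete, here is a minimal sketch on a toy frame. The `brand` and `fat_g` columns and their values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy frame with hypothetical columns; only the target name matches the notebook.
toy = pd.DataFrame({
    "brand": ["A", "B", "A"],
    "fat_g": [1.5, 0.0, 3.2],
    "contains_additives": [1, 0, 1],
})

# Text columns are expanded into one indicator column per unique value;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Separate the features from the target, as in the cell that follows.
features = encoded.drop(columns=["contains_additives"])
target = toy["contains_additives"]
print(features.shape, target.shape)
```

This also suggests why the real feature count grows so much: every distinct value in a categorical column adds one column.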

In [None]:
world_food_data_for_modelling=pd.get_dummies(world_food_data)
world_food_data_features=world_food_data_for_modelling.drop(columns=["contains_additives"])
world_food_data_target=world_food_data.contains_additives

In [None]:
world_food_data_for_modelling.shape[1]

The feature matrix should have one column fewer than the full dataframe, and the target should be a single column.

In [None]:
assert_equal(world_food_data_for_modelling.shape[1],1935)
assert_equal(world_food_data_features.shape[1],1934)
assert_equal(world_food_data_target.shape,(71091,))

For preprocessing I will use scikit-learn's `MaxAbsScaler`, which scales each feature by its maximum absolute value.
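
As a small illustration of what this scaling does, the sketch below compares `MaxAbsScaler` with a manual column-wise division on a made-up array. The numbers are arbitrary and only serve to show the effect.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up data: three features on very different scales.
toy = np.array([[1.0, -10.0, 0.0],
                [2.0,   5.0, 0.0],
                [4.0,   0.0, 1.0]])

scaled = MaxAbsScaler().fit_transform(toy)

# The same result by hand: divide each column by its largest absolute value.
manual = toy / np.abs(toy).max(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

Every value ends up in the range [-1, 1] and zeros stay zeros, which is convenient for the mostly-zero indicator columns produced by the encoding step.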

In [None]:
scaler=MaxAbsScaler()
world_food_data_features_scaled = scaler.fit_transform(world_food_data_features)

In [None]:
assert_is_not_none(world_food_data_features_scaled)

In [None]:
print(world_food_data_features_scaled)

There are 71091 observations and 1934 features in total, so this takes a while to compute.

## Step 3 - Training and test split and creating the model

The next step is to split the data into training and test sets. I've decided to go with a __70/30__ split, because with an __80/20__ split the model would take a bit longer to train.
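
For intuition, here is a minimal sketch of a 70/30 split on placeholder data (100 rows of consecutive integers, not the real features). It also shows that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features, and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=42)

# The same seed gives exactly the same rows in each part.
X_train_again, _, _, _ = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```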

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    world_food_data_features_scaled, world_food_data_target, train_size = 0.7, test_size = 0.3, random_state = 42)

In [None]:
print("Training data shapes: Features:{}, Labels:{}".format(features_train.shape,target_train.shape))
print("Test data shapes: Features:{}, Labels:{}".format(features_test.shape,target_test.shape))

Now that we have our training and test sets, we can fit a simple logistic regression model with default parameters and try to predict whether the test products contain additives or not.

In [None]:
model=LogisticRegression()
model.fit(features_train,target_train)

In [None]:
assert_is_not_none(model)

## Step 4 - Scoring the model

The training is complete! Now we can see how well our model performed by printing the accuracy.

In [None]:
score = model.score(features_test,target_test)
print("Additives prediction accuracy: {:.2f}".format(score*100))

In [None]:
assert_greater(score,0.5)
assert_less_equal(score,1)

The model has an overall accuracy of around __70%__. This isn't that bad considering that we used a simple logistic regression model with scikit-learn's default settings (which already include L2 regularization) and no tuning.<br><br> This concludes our research. Now let's see what the results look like.
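
An accuracy figure is easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below shows one way to make that comparison. It runs on synthetic data from `make_classification`, so its numbers say nothing about the real dataset, and the baseline itself is an addition for illustration rather than part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary problem (assumed settings, not the real data).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.5, weights=[0.6, 0.4], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=42)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print("Baseline accuracy: {:.2f}".format(baseline_acc * 100))
print("Model accuracy: {:.2f}".format(model_acc * 100))
```

If the model's accuracy is only slightly above the baseline, most of the score comes from class imbalance rather than from what the model learned.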

## Answers to the problem statements

- __Which nutrient has the biggest impact on the nutrition grade?__<br><br>
__Answer:__ Based on grouping and hypothesis testing the result was __fat__.<br><br>
- __How do French meat and beverages compare to McDonald's meat and Starbucks beverages in terms of nutrients?__<br><br>
__Answer:__ The French beverages proved to be a lot healthier than Starbucks beverages, as they were mostly plant-based. The French meat proved to be less healthy than McDonald's meat in general.<br><br>
- __Do other factors, like food packaging, additive count and palm oil in the ingredients, have an impact on the nutrition value?__<br><br>
__Answer:__ There seemed to be some indicators, but mostly these factors followed more or less the same patterns when compared to the nutrition value. So the answer here is no.<br><br>

## Project conclusion

In conclusion, I've attempted an in-depth study of the nutrition levels of French products and what affects them, by exploring the __Open Food Facts__ dataset and comparing it to the __Starbucks__ and __McDonalds__ datasets in terms of nutrition value. I think the research was a success, as I managed to answer the questions I laid out at the beginning of the project. I also hope that this project was of interest to the reader and that it provided some helpful information on this topic.

## Resources

https://towardsdatascience.com/ - insights on Data Science and Machine Learning<br>
https://www.webmd.com/diet - food nutrition statistics<br>
https://www.foodnavigator.com/Article/2017/10/31/Nutri-Score-labelling-comes-into-force-in-France - information on the French nutrition scoring system<br>
https://www.healthline.com/health/food-nutrition/ - information on the most crucial nutrients for the human body

## Communication and contacts

If you have any questions, criticism or suggestions, feel free to email me at: dragonflareful@gmail.com<br>This project will be shared on my GitHub page: https://github.com/JadeBlue96 and it will be free for the public to fork, use and modify.<br>This project will also be submitted as a kernel at https://www.kaggle.com/openfoodfacts/world-food-facts/kernels.<br>I will also appreciate any type of feedback on this matter.

## Acknowledgements

Many thanks to Yordan Darakchiev (https://www.kaggle.com/iordan93) for his insights and examples on his Data Science course at SoftUni.