# Ideas
- Recommendation system that suggests alternative products with a better Nutri-Score and/or an organic label:
    - The user enters a food item
    - They discover the most nutritionally similar foods
    - The application suggests products with a better Nutri-Score and/or a specific label.
- If a product is not in the table (handled in a try/except), offer the user the option to add it.

In [11]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

import ipywidgets as widgets

In [12]:
df = pd.read_csv("en.openfoodfacts.org.products.csv", sep="\t", encoding="utf-8")
df.head(3)


Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,abbreviated_product_name,generic_name,...,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0,225,http://world-en.openfoodfacts.org/product/0000...,nutrinet-sante,1623855208,2021-06-16T14:53:28Z,1623855209,2021-06-16T14:53:29Z,jeunes pousses,,,...,,,,,,,,,,
1,3429145,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1630483911,2021-09-01T08:11:51Z,1630484064,2021-09-01T08:14:24Z,L.casei,,,...,,,,,,,,,,
2,17,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1529059080,2018-06-15T10:38:00Z,1561463718,2019-06-25T11:55:18Z,Vitória crackers,,,...,,,,,,,,,,


In [68]:
df.shape

(1955842, 186)

In [69]:
#list(df.columns)

In [70]:
# cast names to str; note that missing names become the literal string "nan"
df["product_name"] = df["product_name"].astype(str)

In [71]:
nutrition_table_cols = ["energy_100g", "fat_100g", "carbohydrates_100g", "sugars_100g", "proteins_100g", "salt_100g"]
nutrition_table = df[nutrition_table_cols].copy()
print(nutrition_table.shape)
nutrition_table.head(3)

(1955842, 6)


Unnamed: 0,energy_100g,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g
0,,,,,,
1,,1.4,9.8,9.8,2.7,0.1
2,1569.0,7.0,70.1,15.0,7.8,1.4


In [72]:
# drop rows with any missing value (could be improved with a smarter imputer)
nutrition_table = nutrition_table.dropna(axis=0, how="any")
print(nutrition_table.shape)
nutrition_table.head(3)

(1417166, 6)


Unnamed: 0,energy_100g,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g
2,1569.0,7.0,70.1,15.0,7.8,1.4
5,3661.0,15.1,2.6,1.0,15.7,2.1
6,936.0,8.2,29.0,22.0,5.1,4.6
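
The comment in the cell above leaves open whether an imputer could replace the row drop. Here is a minimal sketch of one option, assuming scikit-learn's `SimpleImputer` with a median strategy is acceptable for these columns. The toy frame `toy` and the choice of the median are illustrative assumptions, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for nutrition_table, reusing a few values displayed in this notebook.
toy = pd.DataFrame({
    "energy_100g": [np.nan, np.nan, 1569.0, 3661.0, 936.0],
    "fat_100g": [np.nan, 1.4, 7.0, 15.1, 8.2],
    "salt_100g": [np.nan, 0.1, 1.4, 2.1, 4.6],
})

# Rows with no nutrition data at all carry no signal, so drop them first.
toy = toy.dropna(axis=0, how="all")

# Fill the remaining gaps with each column's median (assumed strategy).
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(imputed)
```

Whether imputed nutrition values are trustworthy enough for recommendations is a separate question; dropping incomplete rows, as done above, is the more conservative choice.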


# Feature engineering 

Before starting the analysis, we need to adjust the dataset further. First, we add some features that could help when clustering the data.

We therefore add the feature __g_sum__, the __rounded sum of the fat, carbohydrate, protein and salt values__. It makes products with inconsistent entries easy to spot, since the components of 100 g of product cannot add up to more than 100 g.

We also add the feature __other_carbs__, which holds __all carbohydrates that are not sugars__ (carbohydrates minus sugars). This makes the split between sugars and other carbohydrates explicit for the model.

The last feature is __reconstructed_energy__. It estimates a product's energy __from its fat, carbohydrate and protein content__, using 37 kJ per gram of fat and 17 kJ per gram of protein or carbohydrate. Comparing it with the energy given in the dataset can reveal entries that are likely wrong.

In [73]:
# rounded sum of the mass-based nutrients (grams per 100 g)
nutrition_table["g_sum"] = (
    nutrition_table.fat_100g + nutrition_table.carbohydrates_100g
    + nutrition_table.proteins_100g + nutrition_table.salt_100g
).round()

nutrition_table["other_carbs"] = nutrition_table.carbohydrates_100g - nutrition_table.sugars_100g

# energy in kJ: 37 kJ/g for fat, 17 kJ/g for proteins and carbohydrates
nutrition_table["reconstructed_energy"] = (
    nutrition_table.fat_100g * 37
    + (nutrition_table.proteins_100g + nutrition_table.carbohydrates_100g) * 17
)
print(nutrition_table.shape)
nutrition_table.head(3)

(1417166, 9)


Unnamed: 0,energy_100g,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,g_sum,other_carbs,reconstructed_energy
2,1569.0,7.0,70.1,15.0,7.8,1.4,86.0,55.1,1583.3
5,3661.0,15.1,2.6,1.0,15.7,2.1,36.0,1.6,869.8
6,936.0,8.2,29.0,22.0,5.1,4.6,47.0,7.0,883.1
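
The feature description mentions comparing `reconstructed_energy` with the declared energy, but the notebook does not show that comparison. A minimal sketch of how it could look on the three rows displayed above; the 25 % threshold `TOLERANCE` and the column name `energy_gap` are assumptions chosen for illustration.

```python
import pandas as pd

# The three rows displayed above, with their original index labels.
rows = pd.DataFrame(
    {
        "energy_100g": [1569.0, 3661.0, 936.0],
        "reconstructed_energy": [1583.3, 869.8, 883.1],
    },
    index=[2, 5, 6],
)

TOLERANCE = 0.25  # assumed threshold: flag relative gaps above 25 %

# Relative gap between declared and reconstructed energy; the clip avoids
# dividing by zero for products whose reconstructed energy is 0.
denominator = rows["reconstructed_energy"].clip(lower=1.0)
rows["energy_gap"] = (rows["energy_100g"] - rows["reconstructed_energy"]).abs() / denominator

suspicious = rows[rows["energy_gap"] > TOLERANCE]
print(suspicious)
```

On these rows only index 5 is flagged (3661 kJ declared against roughly 870 kJ reconstructed). A large gap is a hint rather than proof of an error, since components such as alcohol or fibre also carry energy that the formula ignores.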


# Eliminating obvious error sources
Now that the new features are in place, we also exclude obviously wrong entries by deleting every product with:

- a feature (other than the energy features) above 100 g
- a feature with a negative value
- an energy value above 3700 kJ (the maximum a product can have per 100 g, reached only if it consists of 100% fat: 100 g × 37 kJ/g)
- more sugars than carbohydrates
- g_sum above 100 g
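
Before dropping rows, it can help to see how many products each rule rejects. The sketch below expresses the rules above as boolean masks on a toy frame; the data and the rule names in `violations` are made up for illustration, and the approach is an alternative to the notebook's loop rather than a description of it.

```python
import pandas as pd

# Toy rows: one valid product, then rows that break one or more rules.
toy = pd.DataFrame({
    "energy_100g":        [1569.0, 4000.0, 500.0, 900.0],
    "fat_100g":           [7.0,    10.0,   -1.0,  60.0],
    "carbohydrates_100g": [70.1,   20.0,   10.0,  50.0],
    "sugars_100g":        [15.0,   30.0,   5.0,   10.0],
    "proteins_100g":      [7.8,    5.0,    3.0,   5.0],
    "salt_100g":          [1.4,    0.5,    0.2,   0.1],
})
toy["g_sum"] = (toy.fat_100g + toy.carbohydrates_100g + toy.proteins_100g + toy.salt_100g).round()

gram_cols = ["fat_100g", "carbohydrates_100g", "sugars_100g", "proteins_100g", "salt_100g"]

# One boolean Series per rule, True where the row violates it.
violations = {
    "above_100g": (toy[gram_cols] > 100).any(axis=1),
    "negative": (toy[gram_cols + ["energy_100g"]] < 0).any(axis=1),
    "energy_above_3700kJ": toy.energy_100g > 3700,
    "sugars_above_carbs": toy.sugars_100g > toy.carbohydrates_100g,
    "g_sum_above_100g": toy.g_sum > 100,
}
counts = pd.Series({rule: int(mask.sum()) for rule, mask in violations.items()})
print(counts)

# Rows that pass every rule.
keep = ~pd.concat(violations, axis=1).any(axis=1)
print(toy[keep])
```

Keeping one mask per rule makes it easy to check which rule removes the most products before committing to the filter.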

In [74]:
for col in nutrition_table.columns:
    if col not in ["energy_100g", "reconstructed_energy"]:
        nutrition_table = nutrition_table.loc[nutrition_table[col] <= 100]
    nutrition_table = nutrition_table.loc[nutrition_table[col] >= 0]

nutrition_table = nutrition_table.loc[nutrition_table.energy_100g <= 3700]
nutrition_table = nutrition_table.loc[nutrition_table.carbohydrates_100g >= nutrition_table.sugars_100g]
nutrition_table = nutrition_table.loc[nutrition_table.g_sum <= 100]

# attach the descriptive columns with a single index-aligned join
nutrition_table = nutrition_table.join(df[["product_name", "nutriscore_grade", "url"]])

print(nutrition_table.shape)
nutrition_table.head(3)

(1392891, 12)


Unnamed: 0,energy_100g,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,g_sum,other_carbs,reconstructed_energy,product_name,nutriscore_grade,url
2,1569.0,7.0,70.1,15.0,7.8,1.4,86.0,55.1,1583.3,Vitória crackers,,http://world-en.openfoodfacts.org/product/0000...
5,3661.0,15.1,2.6,1.0,15.7,2.1,36.0,1.6,869.8,Hamburguesas de ternera 100%,,http://world-en.openfoodfacts.org/product/0000...
6,936.0,8.2,29.0,22.0,5.1,4.6,47.0,7.0,883.1,moutarde au moût de raisin,d,http://world-en.openfoodfacts.org/product/0000...


In [75]:
# drop rows still missing a value (here: Nutri-Score grade or URL), then renumber the index; to improve
nutrition_table = nutrition_table.dropna(axis=0, how="any")
nutrition_table = nutrition_table.reset_index(drop=True)
print(nutrition_table.shape)
nutrition_table.head(3)

(699106, 12)


Unnamed: 0,energy_100g,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,g_sum,other_carbs,reconstructed_energy,product_name,nutriscore_grade,url
0,936.0,8.2,29.0,22.0,5.1,4.6,47.0,7.0,883.1,moutarde au moût de raisin,d,http://world-en.openfoodfacts.org/product/0000...
1,134.0,0.3,5.3,3.9,0.9,0.42,7.0,1.4,116.5,Salade de carottes râpées,b,http://world-en.openfoodfacts.org/product/0000...
2,1594.0,22.0,27.3,21.9,4.6,0.1,54.0,5.4,1356.3,Tarte noix de coco,d,http://world-en.openfoodfacts.org/product/0000...


In [76]:
nutrition_table["product_name"] = nutrition_table["product_name"].str.lower()
nutrition_table["product_name"] = nutrition_table["product_name"].str.strip()

In [135]:
sample_of_product = nutrition_table["product_name"].head(10).tolist()

In [137]:
product = widgets.Combobox(
    placeholder='What do you like?',
    options=sample_of_product,
    description='Product:',
    ensure_option=True,
    disabled=False
)

display(product)

Combobox(value='', description='Product:', ensure_option=True, options=('moutarde au moût de raisin', 'salade …

In [138]:
product.value

'compote de poire'

# Recommendation system

In [139]:
num_attribs = [f for f in nutrition_table if f not in ["product_name", "nutriscore_grade", "url"]]
scaler = MinMaxScaler()
nutrition_table_scaled = nutrition_table.copy() 
nutrition_table_scaled[num_attribs] = scaler.fit_transform(nutrition_table_scaled[num_attribs])
print(nutrition_table_scaled.shape)
nutrition_table_scaled.head(3)

(699106, 12)


Unnamed: 0,energy_100g,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,g_sum,other_carbs,reconstructed_energy,product_name,nutriscore_grade,url
0,0.252973,0.082,0.29,0.22,0.051,0.046,0.47,0.07,0.23843,moutarde au moût de raisin,d,http://world-en.openfoodfacts.org/product/0000...
1,0.036216,0.003,0.053,0.039,0.009,0.0042,0.07,0.014,0.031454,salade de carottes râpées,b,http://world-en.openfoodfacts.org/product/0000...
2,0.430811,0.22,0.273,0.219,0.046,0.001,0.54,0.054,0.36619,tarte noix de coco,d,http://world-en.openfoodfacts.org/product/0000...
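
`MinMaxScaler` rescales each column to the range [0, 1] with `(x - min) / (max - min)`. In the table above, 936 kJ becomes 0.252973, which is what one would expect if the energy column spans 0 to 3700 kJ. That span is an assumption here, since the notebook does not print the fitted bounds. A small sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy energy column whose extremes match the bounds enforced earlier (0 and 3700 kJ).
energy = np.array([[0.0], [936.0], [3700.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(energy)

# Same result by hand: (x - min) / (max - min)
by_hand = (energy - energy.min()) / (energy.max() - energy.min())
print(scaled.ravel())
print(by_hand.ravel())
```

Putting every feature on the same scale keeps the energy columns, whose raw values run into the thousands, from dominating the Euclidean distance used by the nearest-neighbour search.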


In [140]:
food_input = product.value
matches = nutrition_table_scaled[nutrition_table_scaled["product_name"] == food_input]

# if several products share this name, keep only the one with the best Nutri-Score grade
if len(matches) > 1:
    print(f"Several products match '{food_input}'")
    food_output = matches.sort_values("nutriscore_grade").head(1)[num_attribs].values
elif len(matches) == 1:
    print(f"1 product matches '{food_input}'")
    food_output = matches[num_attribs].values
else:
    print(f"No product matches '{food_input}'")

Several products match 'compote de poire'
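
When no product matches the input exactly, the application could offer close spellings instead of stopping. A minimal sketch using the standard library's `difflib.get_close_matches`; the helper `suggest_names` and the short `known_names` list are hypothetical, and running this over every distinct product name in the table may be slow.

```python
import difflib

# Hypothetical subset of product names; the notebook would pass the
# distinct values of nutrition_table["product_name"] instead.
known_names = [
    "compote de poire",
    "sorbet citron",
    "tarte noix de coco",
    "salade de carottes râpées",
]

def suggest_names(query, names, n=3, cutoff=0.6):
    """Return up to n known names that closely resemble the query."""
    # Normalise the query the same way the product names were normalised.
    return difflib.get_close_matches(query.lower().strip(), names, n=n, cutoff=cutoff)

suggestions = suggest_names("Compote de poires ", known_names)
print(suggestions)
```

The closest suggestion could then be fed back into the lookup in place of the original input.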


In [141]:
K_NEIGHBOURS = 10
knn = NearestNeighbors(n_neighbors=K_NEIGHBOURS)
knn.fit(nutrition_table_scaled[num_attribs])

NearestNeighbors(n_neighbors=10)

In [142]:
distances, indices = knn.kneighbors(food_output)
# positional lookup; .copy() avoids modifying a view of nutrition_table
result_table = nutrition_table.iloc[indices[0]].copy()
result_table["euclidian_distance"] = distances[0]
result_table = result_table[['product_name', 'nutriscore_grade',
                             'euclidian_distance', 'url', 'energy_100g',
                             'fat_100g', 'carbohydrates_100g',
                             'sugars_100g', 'proteins_100g',
                             'salt_100g', 'g_sum', 'other_carbs',
                             'reconstructed_energy']].reset_index(drop=True)
result_table

Unnamed: 0,product_name,nutriscore_grade,euclidian_distance,url,energy_100g,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,g_sum,other_carbs,reconstructed_energy
0,compote de poire,a,0.0,http://world-en.openfoodfacts.org/product/0000...,657.0,0.0,36.0,27.0,0.6,0.0,37.0,9.0,622.2
1,confiture extra d'abricots,c,0.013499,http://world-en.openfoodfacts.org/product/5903...,634.0,0.0,36.0,27.0,0.0,0.0,36.0,9.0,612.0
2,velours de vinaigre aux agrumes,c,0.014992,http://world-en.openfoodfacts.org/product/8712...,640.0,0.5,36.0,27.0,1.0,0.4,38.0,9.0,647.5
3,original pancake syrup,c,0.016182,http://world-en.openfoodfacts.org/product/0876...,628.0,0.0,36.67,26.67,0.0,0.2925,37.0,10.0,623.39
4,sorbet citron,c,0.016587,http://world-en.openfoodfacts.org/product/3760...,616.0,0.0,35.9,27.1,0.0,0.0,36.0,8.8,610.3
5,confit d'échalotes cuites au chaudron à feu nu,c,0.016638,http://world-en.openfoodfacts.org/product/3330...,662.0,0.5,36.0,27.0,1.3,0.58,38.0,9.0,652.6
6,"musselman's, pie filling, lemon",c,0.01731,http://world-en.openfoodfacts.org/product/0037...,640.0,0.0,36.47,28.24,0.0,0.265,37.0,8.23,619.99
7,agridulce frasco 235 ml,c,0.017398,http://world-en.openfoodfacts.org/product/8412...,628.0,0.1,37.0,28.0,0.1,0.21,37.0,9.0,634.4
8,sorbet au citron,c,0.018295,http://world-en.openfoodfacts.org/product/8717...,607.0,0.001,35.9,27.1,0.001,0.001,36.0,8.8,610.354
9,sorbet au citron,c,0.018301,http://world-en.openfoodfacts.org/product/8718...,607.0,0.0,35.9,27.1,0.0,0.0,36.0,8.8,610.3
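
The remaining step from the ideas list, suggesting neighbours with a better Nutri-Score, is not implemented above. One way it could be sketched, assuming grades are the lowercase letters `a` (best) to `e` (worst); the helper `better_alternatives` is hypothetical and the rows are copied from the result table.

```python
import pandas as pd

# A few rows copied from the result table above (row 0 is the query itself).
neighbours = pd.DataFrame({
    "product_name": ["compote de poire", "confiture extra d'abricots", "sorbet citron"],
    "nutriscore_grade": ["a", "c", "c"],
    "euclidian_distance": [0.0, 0.013499, 0.016587],
})

def better_alternatives(neighbours, query_grade):
    """Keep neighbours with a strictly better grade, best grade first, then closest."""
    # Grades are single letters from 'a' to 'e', so plain string order ranks them.
    better = neighbours[neighbours["nutriscore_grade"] < query_grade]
    return better.sort_values(["nutriscore_grade", "euclidian_distance"])

print(better_alternatives(neighbours, "a"))  # empty: nothing beats grade 'a'
print(better_alternatives(neighbours, "d"))  # all three rows, grade 'a' first
```

For 'compote de poire', which already has grade a, no neighbour qualifies, so the application would have nothing better to suggest in this case.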
