In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
import os
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
from requests import get
from bs4 import BeautifulSoup
from pandas.util import hash_pandas_object

In [5]:
data = pd.read_csv('../data/en.openfoodfacts.org.products.csv', sep = '\t')

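Loading a tab-separated export of this size can trigger a pandas `DtypeWarning` when a column mixes types. The sketch below shows one way to make the load explicit. It runs on a tiny made-up stand-in rather than the real file, and the choice to read `code` as a string is an assumption, not something the notebook does.

```python
import io

import pandas as pd

# Tiny tab-separated stand-in for the real export (rows are made up).
sample_tsv = (
    "code\tproduct_name\tquantity\n"
    "0001\tMustard\t200 g\n"
    "0002\tCompote\t\n"
)

# dtype={"code": str} keeps leading zeros in barcodes (assumption: we want that).
# low_memory=False lets pandas infer each column's type from the whole file
# instead of chunk by chunk, which avoids mixed-type columns.
sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype={"code": str},
    low_memory=False,
)
print(sample.dtypes)
```

On the full file, `low_memory=False` trades extra memory for consistent type inference.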


In [14]:
# - fields that end with _t are dates in the UNIX timestamp format (number of seconds since Jan 1st 1970)
data_unixdate = data[data.columns[[x.endswith('_t') for x in data.columns]]]

# - fields that end with _datetime are dates in the iso8601 format: yyyy-mm-ddThh:mn:ssZ
data_dateiso8601 = data[data.columns[[x.endswith('_datetime') for x in data.columns]]]

# - fields that end with _tags are comma-separated lists of tags (e.g. categories_tags is the set of normalized tags computed from the categories field)
data_categorytags= data[data.columns[[x.endswith('_tags') for x in data.columns]]]

# - fields that end with _100g correspond to the amount of a nutriment (in g, or kJ for energy) for 100 g or 100 ml of product
data_100gnutri= data[data.columns[[x.endswith('_100g') for x in data.columns]]]

# - fields that end with _serving correspond to the amount of a nutriment (in g, or kJ for energy) for 1 serving of the product
data_servnutri= data[data.columns[[x.endswith('_serving') for x in data.columns]]]
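The two date conventions described in the comments above can be parsed into comparable timestamps. This is a minimal sketch on made-up sample values; the column names `created_t` and `created_datetime` come from the dataset, and everything else is illustrative.

```python
import pandas as pd

# Made-up sample: the same two instants in both encodings.
sample = pd.DataFrame({
    "created_t": [1489069957, 1545996210],
    "created_datetime": ["2017-03-09T14:32:37Z", "2018-12-28T11:23:30Z"],
})

# _t fields: seconds since Jan 1st 1970 (UNIX timestamp).
unix_parsed = pd.to_datetime(sample["created_t"], unit="s", utc=True)

# _datetime fields: ISO 8601 strings, yyyy-mm-ddThh:mn:ssZ.
iso_parsed = pd.to_datetime(sample["created_datetime"], utc=True)

# Both columns should describe the same instants.
print((unix_parsed == iso_parsed).all())
```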

## Missing Values

In [91]:
# Build a per-field summary of missing values: count and rounded percentage
missing_percentages = pd.DataFrame(index=pd.Index(data.columns, name='field'),
                                   columns=['Nb NA', '% NA'])
for col in data.columns:
    tot = data[col].size               # total number of rows
    miss = tot - data[col].count()     # count() ignores missing values
    percentage = round(100 * miss / tot)
    print("The field --%(col)s-- contained %(miss)d missing entries : %(miss)d/%(tot)d = %(percentage)d percentages of missing entries"\
          % {'col': str(col), 'miss': miss, 'tot': tot, 'percentage': percentage})
    missing_percentages.loc[col, "Nb NA"] = miss
    missing_percentages.loc[col, "% NA"] = percentage

missing_percentages.head(50)

The field --code-- contained 0 missing entries : 0/1017858 = 0 percentages of missing entries
The field --url-- contained 0 missing entries : 0/1017858 = 0 percentages of missing entries
The field --creator-- contained 3 missing entries : 3/1017858 = 0 percentages of missing entries
The field --created_t-- contained 0 missing entries : 0/1017858 = 0 percentages of missing entries
The field --created_datetime-- contained 1 missing entries : 1/1017858 = 0 percentages of missing entries
The field --last_modified_t-- contained 0 missing entries : 0/1017858 = 0 percentages of missing entries
The field --last_modified_datetime-- contained 0 missing entries : 0/1017858 = 0 percentages of missing entries
The field --product_name-- contained 49872 missing entries : 49872/1017858 = 5 percentages of missing entries
The field --generic_name-- contained 925571 missing entries : 925571/1017858 = 91 percentages of missing entries
The field --quantity-- contained 699415 missing entries : 699415/101785

| field | Nb NA | % NA |
|---|---|---|
| code | 0 | 0 |
| url | 0 | 0 |
| creator | 3 | 0 |
| created_t | 0 | 0 |
| created_datetime | 1 | 0 |
| last_modified_t | 0 | 0 |
| last_modified_datetime | 0 | 0 |
| product_name | 49872 | 5 |
| generic_name | 925571 | 91 |
| quantity | 699415 | 69 |
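The same summary can also be computed without an explicit loop. The sketch below is a vectorized equivalent run on a small made-up frame; the column labels `Nb NA` and `% NA` mirror the table above.

```python
import pandas as pd

# Made-up sample with a known number of missing values per column.
sample = pd.DataFrame({
    "code": ["1", "2", "3", "4"],
    "product_name": ["a", None, "c", "d"],
    "generic_name": [None, None, None, "x"],
})

na_mask = sample.isna()  # True wherever a value is missing
summary = pd.DataFrame({
    "Nb NA": na_mask.sum(),                              # count of missing values
    "% NA": (100 * na_mask.mean()).round().astype(int),  # rounded percentage
})
summary.index.name = "field"
print(summary)
```

Sorting with `summary.sort_values("% NA", ascending=False)` would list the sparsest fields first.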


## Investigating new categories for diet type

First, we look at the fields that could be used to create new diet-type categories.

In [59]:
# categories field: count the distinct non-null values, then preview the first 15
categories = data["categories"].dropna().unique()
print("The categories field contained %d not nan unique entries\n" % len(categories))
for category in categories[:15]:
    print(category)

The categories field contained 79016 not nan unique entries

Epicerie,Condiments,Sauces,Moutardes
fr:Xsf
Plats préparés, Légumes préparés, Carottes râpées, Carottes râpées assaisonnées
Tartes, Tartes sucrées, Tartes à la noix de coco
Aliments et boissons à base de végétaux, Aliments d'origine végétale, Aliments à base de fruits et de légumes, Desserts, Fruits et produits dérivés, Compotes, Compotes de poire
Viandes, Volailles, Poulets, Aiguillettes de poulet
Plats préparés, Légumes préparés, Entrées, Entrées froides, Macédoines de légumes préparées
Produits laitiers, Produits fermentés, Produits laitiers fermentés, Fromages, Fromages de France, Abondance
Viandes, Volailles, Poulets, Cuisses de poulet
Aliments et boissons à base de végétaux, Aliments d'origine végétale, Céréales et pommes de terre, Pains, Pains spéciaux, Pains Bagel
Aliments et boissons à base de végétaux, Aliments d'origine végétale, Céréales et pommes de terre, Pains, Baguettes
Produits de la mer, Poissons, Saumons, P

In [55]:
#pnns_groups_1 field
print("The pnns_groups_1 field contained %d not nan unique entries" % data[data.pnns_groups_1.notnull()].pnns_groups_1.unique().shape)
for pnns in data[data["pnns_groups_1"].notnull()]["pnns_groups_1"].unique():
    print(pnns)

The pnns_groups_1 field contained 14 not nan unique entries
unknown
Fat and sauces
Composite foods
Sugary snacks
Fruits and vegetables
Fish Meat Eggs
Milk and dairy products
Cereals and potatoes
sugary-snacks
Beverages
Salty snacks
fruits-and-vegetables
cereals-and-potatoes
salty-snacks
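Several of the 14 values above are the same group written two ways (for example `Sugary snacks` and `sugary-snacks`). A minimal normalization sketch, assuming that lowercasing and replacing hyphens with spaces is an acceptable way to merge them:

```python
import pandas as pd

# A few of the values printed above, plus a missing entry.
groups = pd.Series([
    "Sugary snacks", "sugary-snacks",
    "Fruits and vegetables", "fruits-and-vegetables",
    "unknown", None,
])

# Replace hyphens with spaces and lowercase so both spellings collapse
# into one label; missing values stay missing.
normalized = groups.str.replace("-", " ", regex=False).str.strip().str.lower()
print(normalized.dropna().unique())
```

This rule only merges case and hyphen variants. Pairs in `pnns_groups_2` such as `Pizza pies and quiche` and `Pizza pies and quiches` would still need an explicit mapping.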


In [60]:
#pnns_groups_2 field
print("The pnns_groups_2 field contained %d not nan unique entries" % data[data.pnns_groups_2.notnull()].pnns_groups_2.unique().shape)
for pnns in data[data["pnns_groups_2"].notnull()]["pnns_groups_2"].unique():
    print(pnns)

The pnns_groups_2 field contained 46 not nan unique entries
unknown
Dressings and sauces
One-dish meals
Biscuits and cakes
Fruits
Meat
Cheese
Bread
Fish and seafood
Dried fruits
Legumes
Vegetables
Dairy desserts
pastries
Fruit juices
Sweetened beverages
Unsweetened beverages
Pizza pies and quiche
Sweets
Alcoholic beverages
Salty and fatty products
Cereals
Appetizers
Fats
Processed meat
Nuts
Chocolate products
Milk and yogurt
Ice cream
Breakfast cereals
vegetables
Teas and herbal teas and coffees
Sandwiches
Soups
Potatoes
Artificially sweetened beverages
Plant-based milk substitutes
Offals
fruits
Eggs
Waters and flavored waters
Fruit nectars
cereals
Pizza pies and quiches
legumes
nuts


In [62]:
#labels_en
print("The labels_en field contained %d not nan unique entries" % data[data.labels_en.notnull()].labels_en.unique().shape)
for label in data[data["labels_en"].notnull()]['labels_en'].unique()[:50]:
    print(label)

The labels_en field contained 35106 not nan unique entries
fr:delois-france
Organic
Organic,EU Organic,FR-BIO-01
Organic,EU Organic,EU/non-EU Agriculture,FR-BIO-15
Organic,EU Organic,fr:ab-agriculture-biologique
Made in France
Low or no fat,Low fat,Organic,High fibres
fr:viande-francaise,Made in France
Green Dot
Contains GMOs
Nutriscore,Nutriscore Grade A
Made in Spain
Organic,EU Organic,FR-BIO-10,fr:ab-agriculture-biologique
Kosher,Contains GMOs
No artificial flavors,Contains GMOs
Gluten-free
Incorrect data on label
Not advised for specific people,PDO,Made in France,With sulfites,fr:Déconseillé aux femmes enceintes,fr:Origine France,fr:Triman
Sustainable farming,UTZ Certified,UTZ Certified Cacao
Do not freeze again
Low or no sugar,Contains a source of phenylalanine,Excessive consumption can have laxative effects,Green Dot,No sugar,With sweeteners
Vegetarian,No artificial colors,No flavors,No preservatives
Organic,EU Organic,NL-BIO-01
de:0000080176800
Vegetarian,Vegan
No preservatives,
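Because `labels_en` holds comma-separated labels, the unique-value count above mostly reflects label combinations rather than individual labels. The sketch below splits a few of the printed values into separate tags and looks up the `Vegetarian` and `Vegan` labels. It is an illustration on sample rows, not a step the notebook performs.

```python
import pandas as pd

# Three values taken from the output above, plus a missing entry.
labels = pd.Series([
    "Vegetarian,Vegan",
    "Organic,EU Organic,FR-BIO-01",
    "Vegetarian,No artificial colors,No flavors,No preservatives",
    None,
])

# Split on commas, trim whitespace, and drop empty fragments.
tag_lists = labels.fillna("").str.split(",").apply(
    lambda tags: [t.strip() for t in tags if t.strip()]
)

is_vegetarian = tag_lists.apply(lambda tags: "Vegetarian" in tags)
is_vegan = tag_lists.apply(lambda tags: "Vegan" in tags)

# Frequency of each individual label across products.
tag_counts = tag_lists.explode().value_counts()
print(tag_counts.head())
```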

In [66]:
#ingredients_text
print("The ingredients_text field contained %d not nan unique entries \n" % data[data.ingredients_text.notnull()].ingredients_text.unique().shape)
for label in data[data["ingredients_text"].notnull()]['ingredients_text'].unique()[:50]:
    print(label + '\n')

The ingredients_text field contained 393971 not nan unique entries 

eau graines de téguments de moutarde vinaigre de vin rouge sel vin rouge sucre   moût de raisin (6.2%) oignons colorants extraits de carotte et extrait de paprika huile de tournesol son de moutarde sel (cette _moutarde_ uniquement disponible chez courte paille)

antioxydant : érythorbate de sodium, colorant : caramel - origine UE), tomate 33,3%, MAYONNAISE 11,1% (huile de colza 78,9%, eau, jaunes d'OEUF 6%, vinaigre, MOUTARDE [eau, graines de MOUTARDE, sel, vinaigre, curcuma], sel, dextrose, stabilisateur : gomme de cellulose, conservateur : sorbate de potassium, colorant : ?-carotène, arôme)

Lait entier, sucre, amidon de maïs, cacao, Agar agar.

baguette Poite vin Pain baguette 50,6%: farine de BLÉ, eau, sel, levure, GLUTEN, farine de BLE maité, levure désactivée, acide ascorbique, Garniture FROMAGE mi-chèvre 46% (LAIT pasteurisé [95 0% LAIT de vache, 5 0% LAIT de chèvre], sel, FERMENTS LACTIQUES et daffinage, coagu

## Diet types used to create new diet categories

The most popular diets, according to https://www.medicalnewstoday.com/articles/5847.php

#### __The Atkins diet__ 
https://www.medicalnewstoday.com/articles/7379.php#foods_to_eat_and_avoid

When a person is on the Atkins diet, their body's metabolism switches from burning glucose, or sugar, as fuel to burning its own stored body fat. This switch is called ketosis. The diet focuses on controlling insulin levels in the body through a low carbohydrate intake.

_Foods to eat or drink include_:
* meats, including beef, pork, and bacon
* fatty fish and seafood
* eggs
* avocados
* low-carb vegetables, such as kale, broccoli and asparagus
* full-fat dairy products
* nuts and seeds
* healthy fats, such as extra-virgin olive oil, coconut oil, and avocado oil
* water
* coffee
* green tea

_Foods to avoid include_:
* sugar, such as soft drinks, cakes, and candy
* grains including wheat, spelt, and rice
* "diet" and "low-fat" foods, as they can be high in sugar
* legumes, such as lentils, beans, and chickpeas
* high-carb fruits (bananas, apples, grapes)
* high-carb vegetables (carrots)

#### __The Zone diet__ 
https://www.medicalnewstoday.com/articles/7382.php

Keeping insulin levels within the therapeutic zone means staying in "The Zone." Zone diet foods need to be taken in the right proportions to help to control insulin production.

Zone diet meal plans focus on eating:
* low-density carbohydrates
* dietary fat
* protein

People who follow the diet should balance carbohydrates, fats, and proteins in the following proportions:
* 40 percent carbohydrate
* 30 percent fat
* 30 percent protein

The idea is that by roughly balancing these three things in each meal, a person's health and weight will improve.

In the Zone diet, calorie intake does not have to go down, but what the person is eating has to change.
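To compare a product with the 40/30/30 split, the per-100 g nutrient amounts have to be turned into shares of calories. The sketch below assumes the dataset exposes `carbohydrates_100g`, `fat_100g` and `proteins_100g` columns (names not shown in this notebook) and uses the standard Atwater factors of 4, 9 and 4 kcal per gram. The function name `calorie_shares` is hypothetical.

```python
import pandas as pd

# Atwater factors (kcal per gram); the column names are assumed.
KCAL_PER_GRAM = {"carbohydrates_100g": 4, "fat_100g": 9, "proteins_100g": 4}

def calorie_shares(df):
    """Return the percentage of calories coming from each macronutrient."""
    kcal = pd.DataFrame(
        {col: df[col] * factor for col, factor in KCAL_PER_GRAM.items()}
    )
    total = kcal.sum(axis=1)
    # Rows with no calories at all get NaN shares instead of dividing by zero.
    return kcal.div(total.where(total > 0), axis=0) * 100

# Made-up products: one with a 40/30/30 split, one with no macronutrients.
sample = pd.DataFrame({
    "carbohydrates_100g": [30.0, 0.0],
    "fat_100g": [10.0, 0.0],
    "proteins_100g": [22.5, 0.0],
})
shares = calorie_shares(sample)
print(shares.round(1))
```

The Zone proportions apply to whole meals, so a per-product share is only a rough indicator.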

#### __The Ketogenic diet__
https://www.medicalnewstoday.com/articles/180858.php

Ketosis is a normal metabolic process. When the body does not have enough glucose for energy, it burns stored fats instead; this results in a build-up of acids called ketones within the body.

Some people encourage ketosis by following a diet called the ketogenic or low-carb diet. The aim of the diet is to try and burn unwanted fat by forcing the body to rely on fat for energy, rather than carbohydrates.

Facts on ketosis: 
* Ketosis occurs when the body does not have sufficient access to its primary fuel source, glucose.
* Ketosis describes a condition where fat stores are broken down to produce energy, which also produces ketones, a type of acid.
* As ketone levels rise, the acidity of the blood also increases, leading to ketoacidosis, a serious condition that can prove fatal.
* People with type 1 diabetes are more likely to develop ketoacidosis, for which emergency medical treatment is required to avoid or treat diabetic coma.
* Some people follow a ketogenic (low-carb) diet to try to lose weight by forcing the body to burn fat stores.

The diet itself can be regarded as a high-fat diet, with around:
* 75 percent of calories derived from fats
* 20 percent of calories derived from proteins
* 5 percent of calories derived from carbohydrates
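One possible way to use these proportions is to flag products whose calorie shares fall close to the target. The sketch below is hypothetical: the function `matches_profile`, the column names, and the 10-point tolerance are all assumptions chosen for illustration, and the input is assumed to already hold calorie percentages.

```python
import pandas as pd

# Target calorie shares from the list above (percent of calories).
KETO_TARGET = {"carbohydrate_pct": 5, "fat_pct": 75, "protein_pct": 20}

def matches_profile(shares, target, tolerance=10.0):
    """True where every macronutrient share is within `tolerance` points of the target."""
    target_row = pd.Series(target)
    deviation = (shares[target_row.index] - target_row).abs()
    # Missing shares compare as False, so incomplete rows never match.
    return (deviation <= tolerance).all(axis=1)

# Made-up calorie shares: one high-fat product, one balanced product.
shares = pd.DataFrame({
    "carbohydrate_pct": [4.0, 40.0],
    "fat_pct": [78.0, 30.0],
    "protein_pct": [18.0, 30.0],
})
print(matches_profile(shares, KETO_TARGET))
```

A fixed tolerance is a blunt criterion; the threshold would need to be tuned against real products.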

#### __The Vegetarian diet__
https://www.medicalnewstoday.com/articles/8749.php#benefits

A vegetarian does not eat meat or fish, but there are different types of vegetarians. Some consume eggs and dairy products, while the strictest kind, vegans, eat no animal products at all, including honey. Some people call themselves vegetarians but still consume fish.

Here, we will focus on lacto-ovo-vegetarians, people who do not consume meat, fish and related products, but who do eat eggs, dairy products, and honey.

#### __The Vegan diet__

https://www.medicalnewstoday.com/articles/149636.php

A vegan diet is part of a lifestyle that excludes the consumption or use of any products made from animals. Vegans do not eat animal products, including honey, eggs, gelatin, or dairy. A vegan diet is:
* low in saturated fat
* rich in nutrients
* reliant on non-animal sources for proteins, vitamins, and minerals
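The vegetarian and vegan definitions above could be approximated by searching `ingredients_text` for excluded ingredients. The sketch below is only an illustration: the keyword lists are hypothetical and far from complete, they are English-only while many ingredient lists in the data are in French, and the function name `diet_flags` is made up.

```python
import re

import pandas as pd

# Hypothetical, deliberately short keyword lists (assumptions, not a reference).
NON_VEGETARIAN = ["beef", "pork", "bacon", "chicken", "fish", "salmon"]
# Exclusions the text above lists for vegans.
NON_VEGAN_EXTRA = ["egg", "eggs", "milk", "cheese", "butter", "honey", "gelatin"]

def keyword_pattern(words):
    """Regex matching any of the words as a whole word."""
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

def diet_flags(ingredients):
    """Flag each ingredient list as compatible with a vegetarian or vegan diet."""
    text = ingredients.fillna("")
    has_meat = text.str.contains(keyword_pattern(NON_VEGETARIAN), case=False, regex=True)
    has_other_animal = text.str.contains(keyword_pattern(NON_VEGAN_EXTRA), case=False, regex=True)
    known = ingredients.notna()  # a missing list is never flagged as compatible
    return pd.DataFrame({
        "vegetarian": known & ~has_meat,
        "vegan": known & ~has_meat & ~has_other_animal,
    })

# Made-up English ingredient lists.
sample = pd.Series([
    "whole milk, sugar, corn starch, cocoa, agar agar",
    "water, mustard seeds, vinegar, salt",
    "chicken, salt",
    None,
])
flags = diet_flags(sample)
print(flags)
```

Keyword matching misses derived ingredients and additives, so the product labels (`Vegetarian`, `Vegan` in `labels_en`) remain a useful cross-check.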
