<div style="display: flex; background-color: #3F579F;">
    <h1 style="margin: auto; padding: 30px 30px 0px 30px;">Design an application for public health - Project 3</h1>
</div>
<div style="display: flex; background-color: #3F579F; margin: auto; padding: 5px 30px 0px 30px;">
    <span style="width: 100%; text-align: center; font-size:20px; font-weight: bold; float: left;">| Cleaning notebook |</span>
</div>
<div style="display: flex; background-color: #3F579F; margin: auto; padding: 10px 30px 30px 30px;">
    <span style="width: 100%; text-align: center; font-size:26px; float: left;">Data Scientist course - OpenClassrooms</span>
</div>

<div style="background-color: #506AB9;" >
    <h2 style="margin: auto; padding: 20px; color:#fff; ">1. Libraries and functions</h2>
</div>

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">1.1. Libraries</h3>
</div>

In [1]:
import io
import gc
from math import prod
import time
import pandas as pd

<div class="alert alert-block alert-info">
Because the dataset is very wide, all columns must be displayed in order to work with it
</div>

In [2]:
pd.set_option("display.max_columns", None) # show all columns
pd.set_option("display.max_colwidth", None) # do not truncate cell contents
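These options are global and stay active for every later cell. As a possible alternative, `pd.option_context` applies them only inside a `with` block. The sketch below uses a made-up `demo` frame that is not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the display options
demo = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

before = pd.get_option("display.max_columns")

# The options apply only inside the with-block and are restored afterwards
with pd.option_context("display.max_columns", None, "display.max_colwidth", None):
    inside = pd.get_option("display.max_columns")
    print(demo)

after = pd.get_option("display.max_columns")
```

Outside the block, `display.max_columns` is back to the value it had before.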

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">1.2. Functions declaration</h3>
</div>

In [3]:
def df_initial_analysis(df, name_df):
    """
    Initial analysis on the DataFrame.

    Args:
        df (pandas.DataFrame): DataFrame to analyze.
        name_df (str): DataFrame name.

    Returns:
        None. 
        Print the initial analysis on the DataFrame. 
    """
    
    # Calculating the memory usage based on dataframe.info()
    buf = io.StringIO()
    df.info(buf=buf)
    memory_usage = buf.getvalue().split('\n')[-2]
  
    if df.empty:
        print("The", name_df, "dataset is empty. Please verify the file.")
    else:
        empty_cols = [col for col in df.columns if df[col].isna().all()] # identifying empty columns
        df_rows_duplicates = df[df.duplicated()] #identifying full duplicates rows
        
        # Creating a dataset based on Type object and records by columns
        type_cols = df.dtypes.apply(lambda x: x.name).to_dict() 
        df_resume = pd.DataFrame(list(type_cols.items()), columns = ["Name", "Type"])
        df_resume["Records"] = list(df.count())
        
        print("\nInitial Analysis of", name_df, "dataset")
        print("--------------------------------------------------------------------------")
        print("- Dataset shape:                 ", df.shape[0], "rows and", df.shape[1], "columns")
        print("- Total of NaN values:           ", df.isna().sum().sum())
        print("- Percentage of NaN:             ", round((df.isna().sum().sum() / prod(df.shape)) * 100, 2), "%")
        print("- Total of full duplicates rows: ", df_rows_duplicates.shape[0])
        print("- Total of empty rows:           ", int(df.isna().all(axis="columns").sum()))
        print("- Total of empty columns:        ", len(empty_cols))
        if len(empty_cols) == 1:
            print("  + The empty column is:         ", empty_cols)
        elif len(empty_cols) > 1:
            print("  + The empty columns are:       ", empty_cols)
        
        print("\n- Type object and records by columns         (",memory_usage,")")
        print("--------------------------------------------------------------------------")
        print(df_resume.sort_values("Records", ascending=False))  
        
        # Release the temporary summary frame before returning
        del df_resume
        gc.collect()
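The main figures printed by `df_initial_analysis` can be checked by hand on a small frame. The sketch below recomputes them on a made-up `toy` frame (a hypothetical example, not taken from the dataset) with one fully empty column and one duplicated row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 rows x 3 columns, column "c" is empty, row 1 repeats row 0
toy = pd.DataFrame({
    "a": [1.0, 1.0, np.nan, 4.0],
    "b": ["x", "x", "y", None],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

total_nan = int(toy.isna().sum().sum())          # NaN cells in the whole frame
pct_nan = round(total_nan / toy.size * 100, 2)   # toy.size == rows * columns
empty_cols = [col for col in toy.columns if toy[col].isna().all()]
n_duplicates = int(toy.duplicated().sum())       # rows identical to an earlier row

print(total_nan, pct_nan, empty_cols, n_duplicates)
```

Here 6 of the 12 cells are missing, so the NaN percentage is 50 %.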
    

<div style="background-color: #506AB9;" >
    <h2 style="margin: auto; padding: 20px; color:#fff; ">2. Importing files</h2>
</div>

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">2.1. Importing and preparing files</h3>
</div>

<div class="alert alert-block alert-info">
Reading data in <b>chunks of 1 million rows</b> at a time
</div>

In [4]:
start = time.time()
chunks = pd.read_csv("datasets/en.openfoodfacts.org.products.csv", chunksize=1000000, sep="\t", encoding="UTF-8") # iterator of DataFrames
data = pd.concat(chunks)
end = time.time()
print("Read csv with chunks: ",(end-start),"sec")



Read csv with chunks:  100.97550177574158 sec
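To see what the chunked read does on a small scale, here is a minimal sketch. It reads a made-up tab-separated sample from memory instead of the real file. The `sample_tsv` content and the `dtype={"code": str}` argument are assumptions added for illustration, the latter to show how a barcode column could be kept as text.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for the real export
sample_tsv = "code\tproduct_name\tfat_100g\n" + "".join(
    f"{i}\tproduct {i}\t{i / 10}\n" for i in range(10)
)

# With chunksize set, read_csv returns an iterator of DataFrames instead of one frame
reader = pd.read_csv(io.StringIO(sample_tsv), sep="\t", chunksize=4, dtype={"code": str})
chunks = list(reader)

# pd.concat stitches the chunks back into a single DataFrame
demo = pd.concat(chunks)
print(len(chunks), demo.shape)
```

Ten rows read four at a time give three chunks, and the concatenated frame has all ten rows again.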


<div class="alert alert-block alert-info">
Running <b>the initial analysis</b>
</div>

<div class="alert alert-block alert-warning">
After analyzing the dataset, we can conclude the following:
<ol>
    <li>Almost 80% of the values in the dataset are missing</li>
    <li>There are 5 empty columns that we can delete</li>
    <li>The dataset uses a lot of memory (more than 2.4 GB)</li>
</ol>
</div>
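Regarding the memory footprint: `DataFrame.info()` reports only a lower bound when object columns are present (hence the `+` suffix in its output), while `memory_usage(deep=True)` gives a fuller figure. The sketch below measures a made-up frame. Converting a repetitive text column to `category` is shown as one possible way to shrink it. This is an assumption, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with a low-cardinality text column
demo = pd.DataFrame({"grade": ["a", "b", "c", "d", "e"] * 2000})

before = demo.memory_usage(deep=True).sum()   # deep=True includes the memory held by the string values
demo["grade"] = demo["grade"].astype("category")
after = demo.memory_usage(deep=True).sum()

print(f"{before} bytes -> {after} bytes")
```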

In [5]:
pd.set_option("display.max_rows", None) # show all rows
df_initial_analysis(data, "data")


Initial Analysis of data dataset
--------------------------------------------------------------------------
- Dataset shape:                  1760097 rows and 186 columns
- Total of NaN values:            260478039
- Percentage of NaN:              79.56 %
- Total of full duplicates rows:  1
- Total of empty rows:            0
- Total of empty columns:         5
  + The empty columns are:        ['cities', 'allergens_en', 'no_nutriments', 'ingredients_from_palm_oil', 'ingredients_that_may_be_from_palm_oil']

- Type object and records by columns         ( memory usage: 2.4+ GB )
--------------------------------------------------------------------------
                                           Name     Type  Records
0                                          code   object  1760097
6                        last_modified_datetime   object  1760097
63                                    states_en   object  1760097
62                                  states_tags   object  1760097
61      

In [6]:
pd.reset_option("display.max_rows") # restore the default row limit

<div style="background-color: #506AB9;" >
    <h2 style="margin: auto; padding: 20px; color:#fff; ">3. Cleaning dataset</h2>
</div>

<div style="background-color: #6D83C5;" >
    <h3 style="margin: auto; padding: 20px; color:#fff; ">3.1. Deleting NaN columns and rows, and duplicated rows</h3>
</div>

In [7]:
data = data.dropna(axis="columns", how="all").dropna(axis="rows", how="all")

In [8]:
data = data.drop_duplicates()
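The effect of these two cells can be illustrated on a small frame. This is a minimal sketch on a made-up `toy` frame (hypothetical data): `how="all"` removes a column or row only when every value in it is missing, and `drop_duplicates` keeps the first occurrence of each repeated row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" is entirely NaN, row 3 is entirely NaN, row 1 repeats row 0
toy = pd.DataFrame({
    "a": [1.0, 1.0, 2.0, np.nan],
    "b": ["x", "x", "y", None],
    "c": [np.nan] * 4,
})

cleaned = (
    toy.dropna(axis="columns", how="all")   # drops only columns where every value is missing
       .dropna(axis="rows", how="all")      # drops only rows where every value is missing
       .drop_duplicates()                   # keeps the first occurrence of each repeated row
)
print(cleaned)
```

Only rows 0 and 2 and columns `a` and `b` remain.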


In [53]:
# Inspect the columns whose names end in _t and _datetime
print(data.iloc[:,data.columns.str.endswith('_t')].head())
print('\n', data.iloc[:,data.columns.str.endswith('_datetime')].head())

    created_t  last_modified_t
0  1529059080       1561463718
1  1539464774       1539464817
2  1574175736       1574175737
3  1619501895       1619501897
4  1444572561       1444659212

        created_datetime last_modified_datetime
0  2018-06-15T10:38:00Z   2019-06-25T11:55:18Z
1  2018-10-13T21:06:14Z   2018-10-13T21:06:57Z
2  2019-11-19T15:02:16Z   2019-11-19T15:02:17Z
3  2021-04-27T05:38:15Z   2021-04-27T05:38:17Z
4  2015-10-11T14:09:21Z   2015-10-12T14:13:32Z
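The `_t` columns look like Unix timestamps in seconds and the `_datetime` columns like the same instants as ISO 8601 text. A minimal sketch to check this on the first two rows shown above (the `sample` frame simply copies those values; treating the two columns as redundant is an assumption to verify on the full data):

```python
import pandas as pd

# First two rows copied from the output above
sample = pd.DataFrame({
    "created_t": [1529059080, 1539464774],
    "created_datetime": ["2018-06-15T10:38:00Z", "2018-10-13T21:06:14Z"],
})

from_epoch = pd.to_datetime(sample["created_t"], unit="s", utc=True)   # seconds since 1970-01-01 UTC
from_text = pd.to_datetime(sample["created_datetime"], utc=True)       # ISO 8601 strings
same = bool((from_epoch == from_text).all())
print(same)
```

If the comparison also holds on the full dataset, one column of each pair could be dropped.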


In [None]:
OBJECT_COLUMNS = [ 
    "code", "url", "creator", "created_datetime", "last_modified_datetime", "product_name", "abbreviated_product_name", "generic_name", 
    "quantity", "packaging", "packaging_tags", "packaging_text", "brands", "brands_tags", "categories", "categories_tags", "categories_en", 
    "origins", "origins_tags", "origins_en", "manufacturing_places", "manufacturing_places_tags", "labels", "labels_tags", "labels_en", 
    "emb_codes", "emb_codes_tags", "first_packaging_code_geo", "cities_tags", "purchase_places", "stores", "countries", "countries_tags", 
    "countries_en", "ingredients_text", "allergens", "traces", "traces_tags", "traces_en", "serving_size", "additives", "additives_tags", 
    "additives_en", "ingredients_from_palm_oil_tags", "ingredients_that_may_be_from_palm_oil_tags", "nutriscore_grade", "pnns_groups_1", 
    "pnns_groups_2", "states", "states_tags", "states_en", "brand_owner", "ecoscore_grade_fr", "main_category", "main_category_en", "image_url", 
    "image_small_url", "image_ingredients_url", "image_ingredients_small_url", "image_nutrition_url", "image_nutrition_small_url", "-butyric-acid_100g", 
    "-capric-acid_100g"    
]

In [10]:
for col in data.columns:
    print(col, data[col].dtype)  # col is a column name (str), so the dtype is read from the Series

In [43]:
OBJECT_COLUMNS, INT64_COLUMNS, FLOAT64_COLUMNS, CATEGORY_COLUMNS, BOOL_COLUMNS = ([] for i in range(5))

In [20]:
import numpy as np

In [44]:
for col in data.columns:
    if data[col].dtypes == "object" :
        OBJECT_COLUMNS.append(data[col].name)
    elif data[col].dtypes == "int64" :
        INT64_COLUMNS.append(data[col].name)
    elif data[col].dtypes == "float64" :
        FLOAT64_COLUMNS.append(data[col].name)
    elif data[col].dtypes == "category" :
        CATEGORY_COLUMNS.append(data[col].name)
    elif data[col].dtypes == "bool" :
        BOOL_COLUMNS.append(data[col].name)
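The same grouping can be obtained without an explicit loop by using `DataFrame.select_dtypes`. This is a sketch on a made-up `demo` frame. The column names are borrowed from the dataset for readability, but the values and the `columns_by_dtype` dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one column per dtype of interest
demo = pd.DataFrame({
    "code": pd.Series(["a", "b"], dtype="object"),
    "created_t": pd.Series([1, 2], dtype="int64"),
    "fat_100g": pd.Series([0.5, 1.5], dtype="float64"),
    "nutriscore_grade": pd.Series(["x", "y"], dtype="category"),
    "flag": pd.Series([True, False], dtype="bool"),
})

dtype_names = ["object", "int64", "float64", "category", "bool"]

# One list of column names per dtype
columns_by_dtype = {
    name: demo.select_dtypes(include=name).columns.tolist() for name in dtype_names
}
print(columns_by_dtype)
```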

In [52]:
data.head(1)

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,abbreviated_product_name,generic_name,quantity,packaging,packaging_tags,packaging_text,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,origins_en,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,serving_quantity,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutriscore_score,nutriscore_grade,nova_group,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,brand_owner,ecoscore_score_fr,ecoscore_grade_fr,main_category,main_category_en,image_url,image_small_url,image_ingredients_url,image_ingredients_small_url,image_nutrition_url,image_nutrition_small_url,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_1
00g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,-soluble-fiber_100g,-insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0,17,http://world-en.openfoodfacts.org/product/0000000000017/vitoria-crackers,kiliweb,1529059080,2018-06-15T10:38:00Z,1561463718,2019-06-25T11:55:18Z,Vitória crackers,,,,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,unknown,unknown,"en:to-be-completed, en:nutrition-facts-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:categories-to-be-completed, en:brands-to-be-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:product-name-completed, en:photos-to-be-validated, en:photos-uploaded","en:to-be-completed,en:nutrition-facts-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:packaging-code-to-be-completed,en:characteristics-to-be-completed,en:categories-to-be-completed,en:brands-to-be-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:product-name-completed,en:photos-to-be-validated,en:photos-uploaded","To be completed,Nutrition facts completed,Ingredients to be completed,Expiration date to be completed,Packaging code to be completed,Characteristics to be completed,Categories to be completed,Brands to be completed,Packaging to be completed,Quantity to be completed,Product name completed,Photos to be validated,Photos uploaded",,,,,,https://static.openfoodfacts.org/images/products/000/000/000/0017/front_fr.4.400.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/front_fr.4.200.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/ingredients_fr.9.400.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/ingredients_fr.9.200.jpg,,,,375.0,1569.0,,7.0,3.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,70.1,15.0,,,,,,,,,,,,7.8,,,,1.4,0.56,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [51]:
data[~data['code'].isna()]

MemoryError: 
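Boolean indexing such as `data[mask]` builds a copy of every selected row, which is a likely cause of the `MemoryError` on a frame of this size. If the goal is only to know how many rows have a `code`, the check can run on that single column. This is a minimal sketch on a made-up `demo` frame (hypothetical values):

```python
import pandas as pd

# Hypothetical toy frame standing in for the full dataset
demo = pd.DataFrame({"code": ["001", None, "003"], "product_name": ["a", "b", None]})

# Works on one column only, so no copy of the whole frame is created
n_missing_code = int(demo["code"].isna().sum())
n_with_code = int(demo["code"].notna().sum())
print(n_missing_code, n_with_code)
```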

In [26]:
# With inplace=True, dropna returns None, so the two calls cannot be chained
data.dropna(axis="columns", how="all", inplace=True)
data.dropna(axis="rows", how="all", inplace=True)

In [None]:
data.drop_duplicates(inplace=True)

In [23]:
pd.set_option("display.max_rows", None) # show all rows
df_initial_analysis(data, "data")


Initial Analysis of data dataset
--------------------------------------------------------------------------
- Dataset shape:                  1760097 rows and 181 columns
- Total of NaN values:            251677554
- Percentage of NaN:              79.0 %
- Total of full duplicates rows:  1
- Total of empty rows:            0
- Total of empty columns:         0

- Type object and records by columns         ( memory usage: 2.4+ GB )
--------------------------------------------------------------------------
                                           Name     Type  Records
0                                          code   object  1760097
58                                    states_en   object  1760097
3                                     created_t    int64  1760097
4                              created_datetime   object  1760097
5                               last_modified_t    int64  1760097
6                        last_modified_datetime   object  1760097
57                        

In [24]:
pd.reset_option("display.max_rows") # restore the default row limit

In [19]:
data.tail(1)

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,abbreviated_product_name,generic_name,quantity,packaging,packaging_tags,packaging_text,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,origins_en,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,serving_quantity,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutriscore_score,nutriscore_grade,nova_group,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,brand_owner,ecoscore_score_fr,ecoscore_grade_fr,main_category,main_category_en,image_url,image_small_url,image_ingredients_url,image_ingredients_small_url,image_nutrition_url,image_nutrition_small_url,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_1
00g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,-soluble-fiber_100g,-insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
1760096,999999999999999,http://world-en.openfoodfacts.org/product/999999999999999/signal-toothpaste,openfoodfacts-contributors,1587222660,2020-04-18T15:11:00Z,1605558295,2020-11-16T20:24:55Z,Signal Toothpaste,,,,,,,,,"Non food products, Open Beauty Facts, Toothpaste","en:non-food-products,en:open-beauty-facts,en:toothpaste","Non food products,Open Beauty Facts,Toothpaste",,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,unknown,unknown,"en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:categories-completed, en:brands-to-be-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:product-name-completed, en:photos-to-be-validated, en:packaging-photo-to-be-selected, en:nutrition-photo-to-be-selected, en:ingredients-photo-to-be-selected, en:front-photo-selected, en:photos-uploaded","en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:packaging-code-to-be-completed,en:characteristics-to-be-completed,en:categories-completed,en:brands-to-be-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:product-name-completed,en:photos-to-be-validated,en:packaging-photo-to-be-selected,en:nutrition-photo-to-be-selected,en:ingredients-photo-to-be-selected,en:front-photo-selected,en:photos-uploaded","To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Packaging code to be completed,Characteristics to be completed,Categories completed,Brands to be completed,Packaging to be completed,Quantity to be completed,Product name completed,Photos to be validated,Packaging photo to be selected,Nutrition photo to be selected,Ingredients photo to be selected,Front photo selected,Photos 
uploaded",,,,en:toothpaste,Toothpaste,https://static.openfoodfacts.org/images/products/999/999/999/999999/front_en.3.400.jpg,https://static.openfoodfacts.org/images/products/999/999/999/999999/front_en.3.200.jpg,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [20]:
data.describe()

Unnamed: 0,created_t,last_modified_t,cities,allergens_en,serving_quantity,no_nutriments,additives_n,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,nutriscore_score,nova_group,ecoscore_score_fr,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-caproic-acid_100g,-caprylic-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,-soluble-fiber_100g,-insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprin
t_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
count,1760097.0,1760097.0,0.0,0.0,463383.0,0.0,687715.0,687715.0,0.0,687715.0,0.0,655001.0,601437.0,421369.0,127411.0,1343908.0,1397732.0,975.0,1389669.0,1346364.0,3.0,3.0,15.0,7.0,20.0,7.0,78.0,29.0,3.0,8.0,15.0,10.0,47404.0,47404.0,1981.0,474.0,117.0,167.0,508.0,236.0,73.0,10.0,5.0,90.0,53.0,2.0,15.0,8.0,5.0,8.0,264553.0,268350.0,1389200.0,1372059.0,139.0,77.0,79.0,769.0,36.0,89.0,499.0,3567.0,467580.0,3607.0,3337.0,1390642.0,43.0,49.0,23.0,1341585.0,1341581.0,19593.0,212152.0,82.0,9532.0,3324.0,1103.0,219778.0,23387.0,22483.0,23569.0,15823.0,10095.0,8447.0,12571.0,1034.0,5988.0,118.0,378.0,91844.0,685.0,269712.0,13664.0,264442.0,14579.0,10096.0,4237.0,4002.0,321.0,2441.0,184.0,273.0,2094.0,425.0,160.0,187.0,8792.0,348.0,11748.0,309.0,5904.0,5.0,455.0,11632.0,655006.0,9.0,4.0,1.0,41.0,1712.0,32.0,46.0,20.0
mean,1551419000.0,1586493000.0,,,2.397833e+16,,2.045231,0.020606,,0.069087,,9.18136,3.43141,48.649775,5.231541e+37,6535332.0,4.768839e+36,361.496513,14.59468,12.59576,16.13,43.133333,21.528927,34.779584,127.3718,61.471143,10.715106,1.621922,0.001388453,2.753821,36.328254,0.284041,9.912199,5.892412,19.193757,2.876665,1.557723,0.869834,31.505225,4.4643,2.335784,1.99959,20.214003,239.442832,15.100702,0.85,14.31266,0.270998,0.1123321,4.700688,0.113198,0.048768,28.87058,13.39107,12.76837,8.023649,24.856205,3.518905,12.566818,3.789968,177.351673,31.259291,2.999382,2.42137,4.274249,8.731743,50.690065,528.177396,1.032415,2.113855,0.8455806,2.551932e+19,0.143366,0.551107,0.24451,0.322169,0.209052,0.029634,0.814803,0.074783,0.035984,0.833251,3.150429,0.026503,0.01675669,0.324283,0.052885,1.323346,1.662096,0.442837,0.219256,0.18262,0.566562,0.007843,3.411179,0.034383,0.021842,0.006812,0.062423,0.065592,0.540816,0.411964,1.416566,2.37495,3.482168,6.580437,34.201934,19.889581,46.452507,15.148324,51.973455,1.22943,264.854873,622.308731,9.181426,12.777778,34.175,9.1,1.602241,0.068921,3.583438,0.02501,0.038885
std,50161010.0,28678870.0,,,1.632253e+19,,2.917752,0.143871,,0.301058,,8.889746,0.964089,25.487723,1.8673819999999998e+40,7499809000.0,5.637993999999999e+39,557.578883,856.2292,8618.263,27.600723,47.472659,23.577125,62.857012,558.484,159.428531,74.186338,5.341329,0.002179084,6.317014,117.53423,0.674459,16.760607,10.322462,660.861557,8.033928,8.623512,3.325248,385.747215,11.918692,13.365693,3.620243,29.146241,1999.953424,21.662545,0.919239,39.33337,0.739177,0.2172377,6.752442,28.921995,1.395429,655.537,20.8029,17.272701,12.210888,26.255916,11.463307,38.95603,13.908055,3345.391932,33.701753,7.06019,3.663791,4.912954,147.8976,304.359878,3670.893992,2.548613,136.0838,54.43208,3.572066e+21,26.91477,4.194392,15.597007,3.517324,3.783565,1.228638,14.361471,6.524062,0.866931,71.086933,294.663511,1.862765,0.7029247,3.86488,1.036049,13.804943,14.681357,7.815799,2.278376,5.366706,10.256265,0.424097,367.783911,1.078673,1.098167,0.112939,0.949801,1.204635,7.298232,3.2041,54.711221,25.32328,31.836022,2.252681,36.496234,33.041567,28.683851,5.932068,22.761526,0.808762,768.136959,6142.98454,8.889756,10.58038,15.620153,,9.685532,1.826776,1.461794,0.02661,0.125679
min,1328021000.0,1333873000.0,,,0.0,,0.0,0.0,,0.0,,-15.0,1.0,-23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,7.4,0.0,8.8e-05,9.2e-08,0.0,0.0,0.0,3.6e-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4e-05,0.0,0.0,0.2,0.0,0.0,6.1e-07,0.0,0.0,0.0,-1.0,-1.0,0.0,0.0,0.0,-0.1,0.0,0.0,0.0,0.0,-20.0,0.0,0.0,-500.0,0.0028,0.0,3.6e-05,0.0,0.0,0.0,-0.00034,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.00026,0.0,0.0,-6.896552,0.0,0.0,-2e-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8,0.0,0.19915,0.0,0.049,-15.0,-3.0,14.0,9.1,0.0,0.0,0.4,0.00135,0.004
25%,1519383000.0,1577285000.0,,,28.0,,0.0,0.0,,0.0,,2.0,3.0,32.0,400.0,101.0,418.0,42.0,0.7,0.1,0.195,16.2,0.023365,0.0985,0.0002065,0.001501,0.0,1.4e-05,0.00013268,0.0,6e-06,0.000176,0.1,0.0,0.421,0.062,0.09,0.06,1.1,0.437,0.00021,0.22125,4.77,1.385,1.08,0.525,9.777e-07,9e-06,0.00066,7.72725e-07,0.0,0.0,3.5,0.6,2.215,0.38,0.6,0.0,0.03025,0.01,2.65,4.0,0.0,0.0,1.0,1.3,1.4,0.4,0.019,0.07,0.028,0.0,0.0,3.2e-05,1e-06,0.0018,6e-06,0.0,0.0,0.000209,0.002927,0.000208,3.5e-05,3.4e-05,4e-07,6e-06,0.0007,0.0015,0.018075,0.08,0.0012,0.0,0.088,0.0,0.017,0.0009,0.0002,0.0,1.4e-05,5e-06,7e-06,5e-06,1.3e-05,0.016,0.039,6.0,0.0,0.0,18.0,12.0,32.0,0.6,0.0,111.0,2.0,3.0,26.0,9.1,0.053,8e-06,2.775,0.007525,0.007
50%,1562467000.0,1587667000.0,,,50.0,,1.0,0.0,,0.0,,10.0,4.0,43.0,976.0,263.0,1088.0,167.0,7.0,1.85,0.36,25.0,6.93,0.36,0.055,0.095,0.005,0.013,0.000265,0.000285,0.0648,0.002805,3.57,2.14,1.7,0.125,0.3,0.2,7.9,0.64,0.0036,0.8,7.3,25.865,5.26,0.85,0.00388,0.002985,0.03,0.05275,0.0,0.0,15.15,3.6,7.5,2.1,18.0,0.01,0.2615,0.12,24.5,16.0,1.6,2.0,3.0,6.0,3.9,2.6,0.022,0.56,0.224,0.0,0.0,0.000735,1e-06,0.006,2.5e-05,0.0,0.00072,0.000357,0.005,0.000571,6.5e-05,0.000105,1.5e-06,1.5e-05,0.001667,0.003605,0.0482,0.167,0.0085,0.036,0.182,0.001,0.059,0.00273,0.000429,0.001,7e-05,1.6e-05,1.2e-05,1.6e-05,2.3e-05,0.032,0.4,7.2,20.81,0.00315,50.0,15.0,52.0,1.575,100.0,333.0,10.0,13.0,37.0,9.1,0.069,2.3e-05,3.95,0.02,0.0085
75%,1587670000.0,1608681000.0,,,113.0,,3.0,0.0,,0.0,,16.0,4.0,69.0,1617.0,400.0,1674.0,418.0,21.43,7.14,24.18,61.0,48.5,36.45,2.9475,3.6,0.071,0.177,0.0020825,1.0225,0.718,0.022525,10.0,7.0,3.5,1.0,0.7,0.7,21.1,3.6,0.061,1.625,18.0,56.0,20.8,1.175,0.8315,0.0255,0.031,9.725,0.0,0.022,53.2,18.3,14.4,8.6,45.0,1.0,6.425,0.993,45.0,62.0,3.6,3.0,6.0,12.21,7.1,5.4,0.435,1.4,0.56,5.0,0.000107,0.06525,4e-06,0.015,8.9e-05,0.004,0.001,0.000844,0.008621,0.001316,0.000146,0.0002,4e-06,4.2e-05,0.004,0.0154,0.2045,0.3,0.056,0.11,0.333,0.00241,0.133,0.00577,0.001071,0.002,0.00045,4.7e-05,6.3e-05,5.2e-05,8.8e-05,0.046,0.4,7.57,60.0,26.25,64.0,15.0,70.0,1.59,279.5,614.2,16.0,22.0,45.175,9.1,0.1,9.6e-05,4.0,0.0272,0.01125
max,1619571000.0,1619571000.0,,,1.111111e+22,,49.0,3.0,,6.0,,40.0,4.0,125.0,6.665559e+42,8693855000000.0,6.665559e+42,3830.0,999999.0,9999999.0,48.0,97.0,50.0,170.0,2500.0,423.0,650.0,25.0,0.0039,18.0,457.0,2.1,459.0,204.0,29400.0,75.0,85.0,37.0,8700.0,153.0,112.0,12.0,71.0,19000.0,76.0,1.5,151.0,2.1,0.5,15.2,14800.0,141.0,762939.0,6880.0,99.8,55.0,100.0,112.0,225.0,120.0,74756.0,129.0,2020.0,61.0,46.0,173000.0,2000.0,25700.0,8.9,92500.0,37000.0,5e+23,11800.0,38.0,1480.0,99.5,100.0,430.0,987.0,975.0,69.0,8890.0,29600.0,166.666667,65.7895,100.0,50.0,150.0,247.0,875.0,50.0,930.0,650.0,120.0,44400.0,93.92,69.0,5.6,17.0,38.8,99.0,38.0,2500.0,500.0,400.0,14.0,100.0,100.0,100.0,100.0,100.0,2.183,13867.0,656298.6,40.0,25.0,48.7,9.1,62.1,54.0,7.3,0.15,0.572


In [None]:
data.head(1)

In [7]:
# Count the non-null values per column (data.count() avoids building a full boolean copy of the frame)
nb_not_null = data.count().to_frame(name="nb")
nb_not_null.sort_values(by="nb", ascending=True, inplace=True)
nb_not_null.T

Unnamed: 0,cities,no_nutriments,ingredients_from_palm_oil,ingredients_that_may_be_from_palm_oil,allergens_en,water-hardness_100g,-elaidic-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-lignoceric-acid_100g,glycemic-index_100g,additives,-dihomo-gamma-linolenic-acid_100g,chlorophyl_100g,-erucic-acid_100g,-stearic-acid_100g,-myristic-acid_100g,-nervonic-acid_100g,-cerotic-acid_100g,-mead-acid_100g,nutrition-score-uk_100g,-melissic-acid_100g,-gamma-linolenic-acid_100g,-capric-acid_100g,-butyric-acid_100g,-gondoic-acid_100g,-montanic-acid_100g,-lauric-acid_100g,-palmitic-acid_100g,carnitine_100g,nucleotides_100g,-behenic-acid_100g,beta-glucan_100g,-maltose_100g,choline_100g,casein_100g,inositol_100g,serum-proteins_100g,-oleic-acid_100g,-arachidonic-acid_100g,-glucose_100g,-arachidic-acid_100g,-fructose_100g,beta-carotene_100g,-maltodextrins_100g,omega-9-fat_100g,-eicosapentaenoic-acid_100g,silica_100g,-sucrose_100g,taurine_100g,-docosahexaenoic-acid_100g,chromium_100g,ph_100g,-linoleic-acid_100g,molybdenum_100g,collagen-meat-protein-ratio_100g,fluoride_100g,fruits-vegetables-nuts-dried_100g,bicarbonate_100g,caffeine_100g,carbon-footprint_100g,-alpha-linolenic-acid_100g,starch_100g,omega-6-fat_100g,chloride_100g,-lactose_100g,energy-from-fat_100g,biotin_100g,vitamin-k_100g,phylloquinone_100g,omega-3-fat_100g,iodine_100g,selenium_100g,vitamin-e_100g,-insoluble-fiber_100g,abbreviated_product_name,polyols_100g,-soluble-fiber_100g,packaging_text,manganese_100g,copper_100g,cocoa_100g,pantothenic-acid_100g,folates_100g,fruits-vegetables-nuts_100g,vitamin-d_100g,vitamin-b9_100g,zinc_100g,carbon-footprint-from-meat-or-fish_100g,fruits-vegetables-nuts-estimate_100g,vitamin-b12_100g,phosphorus_100g,ingredients_from_palm_oil_tags,magnesium_100g,vitamin-b6_100g,alcohol_100g,vitamin-b2_100g,vitamin-b1_100g,vitamin-pp_100g,ingredients_that_may_be_from_palm_oil_tags,monounsaturated-fat_100g,polyunsaturated-fat_100g,first_packaging_code_geo,cities_tags,origins_tags,origins_en,origins,
potassium_100g,traces,manufacturing_places_tags,manufacturing_places,emb_codes_tags,emb_codes,generic_name,traces_tags,traces_en,energy-kj_100g,purchase_places,allergens,vitamin-a_100g,stores,vitamin-c_100g,iron_100g,trans-fat_100g,cholesterol_100g,calcium_100g,packaging_tags,packaging,brand_owner,labels,labels_tags,labels_en,additives_tags,additives_en,ecoscore_score_fr,ecoscore_grade_fr,quantity,serving_quantity,fiber_100g,serving_size,nova_group,nutriscore_score,nutriscore_grade,nutrition-score-fr_100g,image_ingredients_url,image_ingredients_small_url,ingredients_text,ingredients_from_palm_oil_n,ingredients_that_may_be_from_palm_oil_n,additives_n,main_category_en,categories_en,categories_tags,main_category,categories,image_nutrition_small_url,image_nutrition_url,brands_tags,brands,image_url,image_small_url,sodium_100g,salt_100g,energy-kcal_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,product_name,pnns_groups_1,countries_en,countries_tags,countries,pnns_groups_2,creator,url,last_modified_t,created_datetime,states,states_tags,states_en,last_modified_datetime,created_t,code
nb,0,0,0,0,0,1,2,3,3,3,4,4,5,5,5,7,7,8,8,8,9,10,10,10,14,15,15,15,20,20,23,29,32,36,41,43,46,49,53,73,77,78,79,82,89,90,117,118,139,160,167,184,187,236,273,309,321,348,378,425,455,474,499,508,685,769,975,1034,1103,1712,1981,2094,2441,3324,3337,3339,3567,3607,3722,4002,4237,5904,5988,8447,8792,9532,10095,10096,11632,11748,12571,13664,13994,14579,15823,19593,22483,23387,23569,40211,47404,47404,69575,74984,76186,76186,76304,91844,94000,112691,112742,112911,112942,113378,116895,116895,127411,151112,181644,212152,218680,219778,264442,264553,268350,269712,282838,282863,289516,388121,388140,388140,398474,398474,421369,421369,455823,463383,467580,468298,601437,655001,655001,655006,675186,675186,687713,687715,687715,687715,841866,841866,841866,841866,841867,858019,858019,918811,918867,1315404,1315404,1341581,1341585,1343908,1346364,1372059,1389200,1389669,1390642,1397732,1682674,1743530,1754600,1754600,1754604,1759730,1760093,1760097,1760097,1760097,1760097,1760097,1760097,1760097,1760097,1760097


In [46]:
pd.set_option("display.max_rows", None)
df_initial_analysis(data, "data")


Initial Analysis of data dataset
--------------------------------------------------------------------------
- Dataset shape:                  1760097 rows and 181 columns
- Total of NaN values:            251677554
- Percentage of NaN:              79.0 %
- Total of full duplicates rows:  1
- Total of empty rows:            0
- Total of empty columns:         0

- Type object and records by columns         ( memory usage: 2.4+ GB )
--------------------------------------------------------------------------
                                           Name    Type  Records
0                                          code  object  1760097
58                                    states_en  object  1760097
3                                     created_t  object  1760097
4                              created_datetime  object  1760097
5                               last_modified_t  object  1760097
6                        last_modified_datetime  object  1760097
57                               

In [13]:
pd.reset_option("display.max_columns") # restore the default maximum number of displayed columns
pd.reset_option("max_colwidth") # restore the default column width

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,abbreviated_product_name,generic_name,...,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
count,1760097,1760097,1760093,1760097,1760097,1760097,1760097,1682674,3339,113378,...,11632,655006,9,4.0,1.0,41.0,1712.0,32,46.0,20.0
unique,1760093,1760093,13454,1510529,1510529,1383412,1383412,1106740,3308,82578,...,2588,56,7,4.0,1.0,28.0,448.0,19,27.0,12.0
top,7340011495437,http://world-en.openfoodfacts.org/product/7340...,kiliweb,1587662527,2020-04-23T17:22:07Z,1614416585,2021-02-27T09:03:05Z,Aceite de oliva virgen extra,"6x27,5cl tourtel twi",Pâtes alimentaires de qualité supérieure,...,3580,14,22,48.7,9.1,0.069,8.3e-06,4,0.02,0.0065
freq,2,2,989099,28,28,145,145,1252,7,283,...,329,34069,3,1.0,1.0,6.0,78.0,9,6.0,3.0


In [None]:
# copy the dataset, dropping the columns and rows that are entirely NaN
df_data_copy = df_data.dropna(axis="columns", how="all").dropna(axis="rows", how="all")
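A minimal sketch of what the two chained `dropna` calls do, on a small made-up frame (the `toy` data and the `toy_clean` name are hypothetical, not taken from the dataset). With `how="all"`, only columns and rows that are entirely NaN are removed; partially filled ones are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" and the last row are entirely NaN
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": ["x", "y", None],
    "c": [np.nan, np.nan, np.nan],
})

# Drop the all-NaN columns first, then the all-NaN rows
toy_clean = toy.dropna(axis="columns", how="all").dropna(axis="rows", how="all")

print(toy_clean.shape)
```

The second row survives because `b` is filled even though `a` is missing.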

In [57]:
# free the memory held by the full dataset (guarded, since `data` may already have been deleted)
if "data" in globals():
    del data
gc.collect()

In [10]:
pd.reset_option("display.max_rows") # restore the default maximum number of displayed rows
# free the memory held by the full dataset
del data
gc.collect()

<div class="alert alert-block alert-info">
Reading only the first 1000 rows <b>to get to know the dataset</b>
</div>

In [7]:
df_temp = pd.read_csv("datasets/en.openfoodfacts.org.products.csv", nrows=1000, sep="\t", encoding="UTF-8", dtype=str) # read every column as text

In [21]:
pd.set_option("display.max_columns", None)
df_temp.head(5)

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,abbreviated_product_name,generic_name,quantity,packaging,packaging_tags,packaging_text,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,origins_en,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,traces,traces_tags,traces_en,serving_size,serving_quantity,no_nutriments,additives_n,additives,additives_tags,additives_en,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutriscore_score,nutriscore_grade,nova_group,pnns_groups_1,pnns_groups_2,states,states_tags,states_en,brand_owner,ecoscore_score_fr,ecoscore_grade_fr,main_category,main_category_en,image_url,image_small_url,image_ingredients_url,image_ingredients_small_url,image_nutrition_url,image_nutrition_small_url,energy-kj_100g,energy-kcal_100g,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,-butyric-acid_100g,-caproic-acid_100g,-caprylic-acid_100g,-capric-acid_100g,-lauric-acid_100g,-myristic-acid_100g,-palmitic-acid_100g,-stearic-acid_100g,-arachidic-acid_100g,-behenic-acid_100g,-lignoceric-acid_100g,-cerotic-acid_100g,-montanic-acid_100g,-melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,-alpha-linolenic-acid_100g,-eicosapentaenoic-acid_100g,-docosahexaenoic-acid_100g,omega-6-fat_100g,-linoleic-acid_100g,-arachidonic-acid_100g,-gamma-linolenic-acid_100g,-dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,-oleic-acid_100g,-elaidic-acid_100g,-gondoic-acid_100g,-mead-acid_100g,-erucic-acid_100g,-nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,-sucrose_100g,-glucose_100g,-fructose_100g,-lactose_100g,-maltose_100g,-maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,-soluble-fiber_100g,-insoluble-fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0,17,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1529059080,2018-06-15T10:38:00Z,1561463718,2019-06-25T11:55:18Z,Vitória crackers,,,,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,unknown,unknown,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,375.0,1569.0,,7.0,3.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,70.1,15.0,,,,,,,,,,,,7.8,,,,1.4,0.56,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,31,http://world-en.openfoodfacts.org/product/0000...,isagoofy,1539464774,2018-10-13T21:06:14Z,1539464817,2018-10-13T21:06:57Z,Cacao,,,130 g,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,,,,,,,,,,,,,,,,unknown,unknown,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,3327986,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1574175736,2019-11-19T15:02:16Z,1574175737,2019-11-19T15:02:17Z,Filetes de pollo empanado,,,,,,,,,,,,,,,,,,,,,,,,,,,en:es,en:spain,Spain,,,,,,,,,,,,,,,,,,,,,,,unknown,unknown,"en:to-be-completed, en:nutrition-facts-to-be-c...","en:to-be-completed,en:nutrition-facts-to-be-co...","To be completed,Nutrition facts to be complete...",,,,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,4622327,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1619501895,2021-04-27T05:38:15Z,1619501897,2021-04-27T05:38:17Z,Hamburguesas de ternera 100%,,,,,,,,,,,,,,,,,,,,,,,,,,,en:es,en:spain,Spain,,,,,,,,,,,,,,,,,,,,,,,unknown,unknown,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,,,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,874.9,3661.0,,15.1,6.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.6,1.0,,,,,,,,,,,,15.7,,,,2.1,0.84,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,100,http://world-en.openfoodfacts.org/product/0000...,del51,1444572561,2015-10-11T14:09:21Z,1444659212,2015-10-12T14:13:32Z,moutarde au moût de raisin,,,100g,,,,courte paille,courte-paille,"Epicerie, Condiments, Sauces, Moutardes","en:groceries,en:condiments,en:sauces,en:mustards","Groceries,Condiments,Sauces,Mustards",,,,,,Delois france,fr:delois-france,fr:delois-france,,,,,,,courte paille,France,en:france,France,eau graines de téguments de moutarde vinaigre ...,en:mustard,,,,,,,,0.0,,,,0.0,,,0.0,,,18.0,d,,Fat and sauces,Dressings and sauces,"en:to-be-completed, en:nutrition-facts-complet...","en:to-be-completed,en:nutrition-facts-complete...","To be completed,Nutrition facts completed,Ingr...",,60.0,b,en:mustards,Mustards,https://static.openfoodfacts.org/images/produc...,https://static.openfoodfacts.org/images/produc...,,,,,936.0,,936.0,,8.2,2.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,29.0,22.0,,,,,,,,,0.0,,,5.1,,,,4.6,1.84,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,18.0,,,,,,,,


In [30]:
pd.reset_option("display.max_columns") # restore the default maximum number of displayed columns

In [31]:
pd.set_option("display.max_rows", None)
df_temp.dtypes

code                                          object
url                                           object
creator                                       object
created_t                                     object
created_datetime                              object
last_modified_t                               object
last_modified_datetime                        object
product_name                                  object
abbreviated_product_name                      object
generic_name                                  object
quantity                                      object
packaging                                     object
packaging_tags                                object
packaging_text                                object
brands                                        object
brands_tags                                   object
categories                                    object
categories_tags                               object
categories_en                                 

In [32]:
pd.reset_option("display.max_rows") # restore the default maximum number of displayed rows
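The set/reset pairs used in this notebook can also be written with `pd.option_context`, which restores the previous value automatically when the block ends. The sketch below is only an illustration: `df_demo` is a hypothetical stand-in for `df_temp`, and the `before`, `inside` and `after` names exist only to show that the option is restored.

```python
import pandas as pd

# Hypothetical small frame standing in for df_temp
df_demo = pd.DataFrame({"code": ["17", "31"], "product_name": ["Cacao", None]})

before = pd.get_option("display.max_rows")

# Inside the block all rows are displayed; the option is restored on exit,
# even if the body raises an exception
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")
    print(df_demo.dtypes)

after = pd.get_option("display.max_rows")
```

This avoids leaving the display options changed when a cell is skipped or run out of order.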

In [None]:
<div class="alert alert-block alert-info">
<b>Nutri-Score data:</b>
    <ul>
      <li>Coffee</li>
      <li>Tea</li>
      <li>Milk</li>
    </ul>    
</div>

<div class="alert alert-block alert-warning">
<b>Example:</b> read data in chunks of 1 million rows at a time
</div>
