COLAB read LINK : https://colab.research.google.com/drive/1Tuw093YVweBhka548iaQ7pAgxDwi7vG4?usp=sharing

# ***-1 -*** **OpenFoodFacts : About this Project**

## *A -* Introduction

The aim of this project is to create an application that can use the data of the **OpenFoodFacts** open database.

"*Open Food Facts gathers information and data on food products from around the world.*"
*https://world.openfoodfacts.org/*

In this project, we will clean and explore the OpenFoodFacts database to evaluate the feasibility of our application.

This project is divided into two notebooks:
* In the first notebook, we will clean and filter the OpenFoodFacts data, which consists of around 2 million samples.
* In the second notebook, we will statistically explore the filtered dataset in order to evaluate the feasibility of our application.

In addition, we will focus on creating tools to automate the exploration of any dataset. This will enable us to explore other datasets quickly and efficiently in the future.

This project is part of my ***OpenClassrooms-CentraleSupelec Machine Learning Engineer*** curriculum.

## *B -* How to Read this Project

### *a -* Notebooks

This project is divided into two notebooks:

 * **Cleaning** (*first notebook*): chapters 1, 2, 3 & 4
 * **Exploration** (*second notebook*): chapters 5 & 6

### *b -* Chapters

Each notebook is organized in chapters:

* ***-1 -*** **About this project** is *what you are reading now*. This is the *README*.
* ***0 -*** **Environment**: sets up the necessary environment to run this notebook. In this part, we will also develop the toolbox to automate the exploration of the dataset. *Disclaimer: This part is not really about Data Science, but more about code and automated processing. It is not necessary to read this part unless you have a good knowledge of Python and are interested in how I developed these functions. Feel free to skip it.*

*First Notebook*
* ***1 -*** **Dataset Description**: describes the raw *data* of *OpenFoodFacts*
* ***2 -*** **Application Concept**: a short brief presenting our application concept.
* ***3 -*** **Dataset Cleaning**: cleans the database in order to obtain usable data for our data exploration and application.
* ***4 -*** **Dataset Cleaning Conclusions**

*Second Notebook*
* ***5 -*** **Exploratory Analysis**: statistically explores the filtered dataset.
* ***6 -*** **Exploratory Analysis Conclusions**

## *C -* Mounting Google Drive

In order to load the data, which has been downloaded from OpenFoodFacts and uploaded to my personal drive, we need to mount the Google Drive instance.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive') # link to be updated

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# ***0 -*** **Environment**

## *A -* Importing Libraries

Let's import the libraries that will be used in this project.

In [2]:
import pandas as pd
!pip install sweetviz
import sweetviz as sv
import missingno
import plotly.express as px
import plotly.graph_objects as go
import re
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import numpy as np
from wordcloud import WordCloud
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.decomposition import PCA
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
import plotly.figure_factory as ff
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection



## *B -* Utilities

### *a -* Dataset Versioning

I developed the **Dataset()** class during this project as a simple dataset versioning tool.

*You can skip this part if you are not interested in code.*

In [3]:
class Dataset():


  def __init__(self, dataframe:pd.DataFrame):

    init_name = 'original'
    self.versions_index = {init_name:0}
    self.versions = {0:self.dict_constructor(dataframe=dataframe, step=init_name, index=0)}
    self.current_version = 0

    print(f'Version {self.current_version}: "{init_name}" initialized') 


  def last_index(self):

    return list(self.versions_index.values())[-1]


  def add_index(self, step, index):

    if step in list(self.versions_index.keys()) or step=='current':
      print('\nKey already in index, choose another key.\n')
      return False

    else:
      self.versions_index[step] = index
      return True


  def dict_constructor(self, dataframe, step, index):

    if index == 0:
      samples_diff = dataframe.shape[0]
      features_diff = dataframe.shape[1]
      index_diff = dataframe.index
      columns_diff = dataframe.columns

    else:
      last_version = self.versions.get(index-1)
      samples_diff = last_version.get('samples') - dataframe.shape[0]
      features_diff = last_version.get('features') - dataframe.shape[1]
      index_diff = last_version.get('index').difference(dataframe.index)
      columns_diff = last_version.get('columns').difference(dataframe.columns)

    dataframe_dict = {
      'name': step,
      'dataframe': dataframe,
      'samples': dataframe.shape[0],
      'samples_diff': samples_diff,
      'features': dataframe.shape[1],
      'features_diff': features_diff,
      'index': dataframe.index,
      'index_diff': index_diff,
      'columns': dataframe.columns,
      'columns_diff': columns_diff
    }

    return dataframe_dict


  def save_version(self, updated_dataframe:pd.DataFrame, step:str):

    next_version = self.last_index() + 1
    key_pass = self.add_index(step=step, index=next_version)
    if key_pass:
      self.versions[next_version] = self.dict_constructor(dataframe=updated_dataframe, step=step, index=next_version)
      self.current_version = next_version
      print(f'\nVersion {next_version}: "{step}" saved\n')    


  def parse_step(self, step):

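    if isinstance(step, str):
      version_index = self.versions_index.get(step)
      version_name = step
    elif isinstance(step, int):
      version_index = step
      version_name = self.versions.get(version_index).get('name')
    else:
      # any other type would otherwise leave both variables undefined
      raise TypeError('step must be a version name (str) or a version index (int)')
    return version_index, version_name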


  def get_version(self, step):

    version_index, version_name = self.parse_step(step)

    return self.versions.get(version_index)


  def delete_version(self, step):

    version_index, version_name = self.parse_step(step)
    del self.versions_index[version_name]
    del self.versions[version_index]

    print(f'\nVersion {version_index}: "{version_name}" deleted\n')


  def pull_features(self, features, step=0):

    return self.get_version(self.current_version).get('dataframe').join(self.get_version(step).get('dataframe')[features], how='inner')


  def get(self, item='dataframe', step='current'):

    if step == 'current': version = self.versions.get(self.current_version)
    elif step == 'latter': version = self.versions.get(self.current_version-1)
    else: version = self.get_version(step)

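    # dict.get() returns None instead of raising KeyError, so test membership explicitly
    if version is None or item not in version:
      print('\nKey error, try generating the item first.\n')
      return None
    return version.get(item)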


  def num_report(self, fig_fill_min:float):

    dataframe = self.get()

    samples = dataframe.shape[0]
    report = dataframe.select_dtypes([int, float, 'datetime']).describe().transpose()
    report['fill_%'] = ((report['count'] / samples) * 100).astype(float).round(2)
    report['nans'] = dataframe.isna().sum()
    report['nans_%'] = ((report['nans'] / samples) * 100).astype(float).round(2)

    for i in report.index:
      zeroes_count = dataframe[i][dataframe[i] == 0].shape[0]
      report.loc[i, 'zeroes'] = zeroes_count
      report.loc[i, 'zeroes_%'] = round(((zeroes_count / samples) * 100), 2)

    report = report[['count', 'fill_%', 'nans', 'nans_%','zeroes', 'zeroes_%','mean', 'std', 'min', '25%', '50%', '75%', 'max']]
    report = report.sort_values(by='count', ascending=False).rename_axis(mapper='feature', axis=0)
    report_df = report[report['fill_%'] >= fig_fill_min][['fill_%', 'nans_%', 'zeroes_%']].transpose()
    report_fig = go.Figure(data=[go.Bar(name=str(report_df.index[index]), x=list(report_df.columns.values), y=list(report_df.iloc[index,:].values)) for index in range(report_df.shape[0])])

    if fig_fill_min == 0: title = 'numerical features characteristics'
    else: title = f'numerical features characteristics (fill >= {fig_fill_min}%: {report_df.shape[1]})'

    report_fig.update_layout(title=title)
    report_fig.show()

    self.versions[self.current_version]['numericals'] = report


  def cat_report(self, fig_fill_min:float):

    dataframe = self.get()

    samples = dataframe.shape[0]
    report = dataframe.select_dtypes('object').describe().transpose()
    report['fill_%'] = ((report['count'] / samples) * 100).astype(float).round(2)
    report['uniques_%'] = ((report['unique'] / samples) * 100).astype(float).round(2)
    report['nans'] = dataframe.isna().sum()
    report['nans_%'] = ((report['nans'] / samples) * 100).astype(float).round(2)
    report = report[['count', 'fill_%', 'unique', 'uniques_%', 'nans', 'nans_%', 'top', 'freq']]
    report = report.sort_values(by='count', ascending=False).rename_axis(mapper='feature', axis=0)
    report_df = report[report['fill_%'] >= fig_fill_min][['fill_%', 'nans_%', 'uniques_%']].transpose()
    report_fig = go.Figure(data=[go.Bar(name=str(report_df.index[index]), x=list(report_df.columns.values), y=list(report_df.iloc[index,:].values)) for index in range(report_df.shape[0])])

    if fig_fill_min == 0: title = 'categorical features characteristics'
    else: title = f'categorical features characteristics (fill >= {fig_fill_min}%: {report_df.shape[1]})'

    report_fig.update_layout(title=title)
    report_fig.show()

    self.versions[self.current_version]['categoricals'] = report


  def report(self, fig_fill_min=0):

    dataframe = self.get()

    if self.current_version > 0:

      version_old = self.versions.get(self.current_version-1)
      samples_old, features_old = version_old.get('samples'), version_old.get('features')
      samples_diff = samples_old - dataframe.shape[0]
      samples_percent = round((samples_diff / samples_old) * 100, 2)
      features_diff = features_old - dataframe.shape[1]
      features_percent = round((features_diff / features_old) * 100, 2)
      print(f'\nSamples dropped: {samples_diff}/{samples_old} ({samples_percent}%)\nFeatures dropped: {features_diff}/{features_old} ({features_percent}%)\n')

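    # fig 1: numerical features (the report is stored under the 'numericals' key)
    self.num_report(fig_fill_min)
    # fig 2: categorical features (the report is stored under the 'categoricals' key)
    self.cat_report(fig_fill_min)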

  
  def help(self):

    print('This is the help.')
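
To illustrate the bookkeeping performed by `dict_constructor`, here is a minimal sketch that compares two versions of a small, made-up dataframe (the toy data and the cleaning steps are assumptions for illustration, not part of the project). It relies on `pandas.Index.difference` to find the rows and columns dropped between two versions, as the class does.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
original = pd.DataFrame({
    "product_name": ["a", "b", "c"],
    "sugars_100g": [1.0, None, 3.0],
    "empty": [None, None, None],
})

# A "cleaned" version: drop the empty column, then the rows with missing values.
cleaned = original.drop(columns=["empty"]).dropna()

# Same comparisons as in Dataset.dict_constructor.
samples_diff = original.shape[0] - cleaned.shape[0]
features_diff = original.shape[1] - cleaned.shape[1]
index_diff = original.index.difference(cleaned.index)
columns_diff = original.columns.difference(cleaned.columns)

print(samples_diff, features_diff)
print(list(index_diff), list(columns_diff))
```

Storing these differences with each version makes it possible to report what a cleaning step removed without keeping track of it by hand.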


### *b -* Data Summarization

These are wrapper functions that I have developed to explore the dataset.

The dataset has close to 2 million rows. To process this data efficiently, we need to rely on pandas *vectorization*; otherwise we risk running into performance issues.

*You can skip this part if you are not interested in code.*
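
As a rough illustration of what vectorization means here, the sketch below counts the zeroes of a small, made-up column in two ways: with a Python loop, and with a single column-wide comparison. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy column, for illustration only.
values = pd.Series([0.0, 1.5, 0.0, None, 2.0])

# Row-by-row loop: simple, but slow on millions of rows.
zeroes_loop = 0
for value in values:
    if value == 0:
        zeroes_loop += 1

# Vectorized equivalent: the comparison is applied to the whole column at once.
zeroes_vectorized = int((values == 0).sum())

print(zeroes_loop, zeroes_vectorized)
```

Both approaches give the same count; on a column with millions of rows, the vectorized form is generally expected to be much faster.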

In [4]:
def knn_optimizer(model, X_train:pd.DataFrame, y_train:pd.DataFrame, X_val:pd.DataFrame, y_val:pd.DataFrame, metric, range=range(1,10)):  

  best_id, best_neighbors, best_score = 0, 0, None

  for id, neighbors in enumerate(range):

    knn = model(n_neighbors=neighbors)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_val)

    if metric == 'accuracy':

      score = accuracy_score(y_val, predictions)
      score = round(score * 100, 2)
      print(f'\nPass {id}: {neighbors} neighbor(s), {metric}: {score}')

      if best_score is None or score > best_score:
        best_neighbors, best_score, best_id = neighbors, score, id
      
    if metric == 'MSE':

      score = mean_squared_error(y_val, predictions)
      score = round(score, 2)
      print(f'\nPass {id}: {neighbors} neighbor(s), {metric}: {score}')

      if best_score is None or score < best_score:
        best_neighbors, best_score, best_id = neighbors, score, id

  print(f'\nBest pass {best_id}: {best_neighbors} neighbor(s), {metric}: {best_score}')
  
  return model(n_neighbors=best_neighbors).fit(X_train, y_train)


def eta_squared(x,y):
    moyenne_y = y.mean()
    classes = []
    for classe in x.unique():
        yi_classe = y[x==classe]
        classes.append({'ni': len(yi_classe),
                        'moyenne_classe': yi_classe.mean()})
    SCT = sum([(yj-moyenne_y)**2 for yj in y])
    SCE = sum([c['ni']*(c['moyenne_classe']-moyenne_y)**2 for c in classes])
    return SCE/SCT


def dist_plot(dataframe:pd.DataFrame, feature:str, by=None, bin_size=0.5):

  subsets = list()
  labels = list()
  if by is not None:
    labels = list(set(dataframe[by].values))
    labels.sort()
    for filter in labels:
      subsets.append(dataframe[dataframe[by]==filter][feature].values)
  else:
    labels = [feature]
    subsets = [dataframe[feature].values]

  fig = ff.create_distplot(subsets, group_labels=labels, bin_size=bin_size,
                          curve_type='normal', show_rug=False
                          )
  fig.update_layout(title_text=f'{feature} vs normal distribution', height=750)
  fig.show()


def heatmap(matrix:pd.DataFrame, title='', extra=None):

  if extra is not None:
    extra = extra.values
  fig = ff.create_annotated_heatmap(matrix.values, x=matrix.columns.to_list(), y=matrix.index.to_list(), annotation_text=extra)
  fig.update_layout(title=title)
  fig.show()


def pie_plot(dataframe:pd.DataFrame, feature:str):

  fig_df = pd.DataFrame(pd.Series((','.join(dataframe[feature].to_list())).split(',')).value_counts(), columns=['population']).rename_axis(mapper='tag', axis=0)
  fig = px.pie(fig_df.reset_index(), names='tag', values='population', title=f'{feature} population')
  fig.show()


def bar_plot(dataframe:pd.DataFrame, feature:str):

  fig_df = pd.DataFrame(pd.Series((','.join(dataframe[feature].to_list())).split(',')).value_counts(), columns=['population']).rename_axis(mapper='tag', axis=0)
  fig = px.bar(fig_df.reset_index(), x='tag', y='population', title=f'{feature} population')
  fig.show()


def box_plots(x_data,y_data):

  colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)',
            'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
      
  fig = go.Figure()

  for xd, yd, cls in zip(x_data, y_data, colors):
          fig.add_trace(go.Box(
              y=yd,
              name=xd,
              boxpoints='outliers',
              jitter=0.5,
              whiskerwidth=0.2,
              fillcolor=cls,
              marker_size=2,
              line_width=1)
          )

  fig.update_layout(
      yaxis=dict(
          autorange=True,
          showgrid=True,
          zeroline=True,
          dtick=5,
          gridcolor='rgb(255, 255, 255)',
          gridwidth=1,
          zerolinecolor='rgb(255, 255, 255)',
          zerolinewidth=2,
      ),
      margin=dict(
          l=40,
          r=30,
          b=80,
          t=100,
      ),
      paper_bgcolor='rgb(243, 243, 243)',
      plot_bgcolor='rgb(243, 243, 243)',
      showlegend=False
  )

  return fig

  
def batch_box_plots(dataframe:pd.DataFrame, by=None, title=''):

  dataframe_num = dataframe.select_dtypes([int,float])

  if by is None:
    x_data = dataframe_num.columns.to_list()
    y_data = [dataframe_num[feature].values for feature in x_data]
    fig = go.Figure()

    fig = box_plots(x_data,y_data)
    fig.update_layout(title=title)
    fig.show()

  else:

    for feature in dataframe_num.columns:

      filters = list(set(dataframe[by].values))
      filters.sort()
      x_data = filters
      y_data = [dataframe[dataframe[by]==filter][feature].values for filter in filters]

      fig = box_plots(x_data,y_data)
      fig_title = title + f' ({feature})'
      fig.update_layout(title=fig_title)
      fig.show()


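# Note: this second definition overrides the dist_plot defined earlier in this cell; here `by` is required.
def dist_plot(dataframe:pd.DataFrame, feature:str, by:str, bin_size=1):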

  subsets = list()
  labels = list(set(dataframe[by].values))
  for filter in labels:
    subsets.append(dataframe[dataframe[by]==filter][feature].values)
  fig = ff.create_distplot(subsets, group_labels=labels, bin_size=bin_size,
                          curve_type='normal', show_rug=False)

  fig.update_layout(title_text=f'{feature} vs normal distribution', height=750)
  fig.show()


def sum_dtypes(dataframe:pd.DataFrame):

  dtypes = dataframe.dtypes.value_counts()
  dtypes.index = dtypes.index.astype(str)
  dtypes = pd.DataFrame(data=dtypes, columns=['population']).rename_axis(mapper='dtype', axis=0)
  dtypes_fig = px.pie(dtypes.reset_index(), names='dtype', values='population', title="dtypes repartition")
  dtypes_fig.show()

  return dtypes


def sum_nans(dataframe:pd.DataFrame):

  samples, features = dataframe.shape[0], dataframe.shape[1]
  nans = dataframe.isna().sum()
  nans = pd.DataFrame(data=nans, columns=['nans']).rename_axis(mapper='feature', axis=0).sort_values(by='nans', ascending=False)
  nans['nans_%'] = ((nans['nans'] / dataframe.shape[0]) * 100).round(2)

  return nans


def sum_uniques(dataframe:pd.DataFrame):

  samples, features = dataframe.shape[0], dataframe.shape[1]
  uniques = dataframe.nunique()
  uniques = pd.DataFrame(data=uniques, columns=['uniques']).rename_axis(mapper='feature', axis=0).sort_values(by='uniques', ascending=False)
  uniques['uniques_%'] = ((uniques['uniques'] / dataframe.shape[0]) * 100).round(2)

  return uniques


def join(series):

  return series.to_list()


def sample(*series):

  df = pd.DataFrame()

  for serie in series:
    uniques = serie.unique()
    if len(uniques) >= 10:
      sample = pd.Series(uniques).sample(10)
      df[serie.name] = sample.values
    else:
      sample = serie.sample(10)
      df[serie.name] = sample.values
    df[f'{serie.name}_index'] = sample.index
  
  df = df.reset_index().drop('index', axis=1).rename_axis(mapper='sample', axis=0)
  
  return df


def filter_tags(dataframe:pd.DataFrame, filters:dict):

  dataframe_features = dataframe.columns.tolist()
  features_df = pd.DataFrame(dataframe.columns, columns=['features'], index=dataframe.columns).rename_axis(mapper='index', axis=0)
  features_df['dtype'] = dataframe.dtypes.astype(str).values
  features_df['cat'] = features_df['dtype'].str.contains('object')
  features_df['num'] = features_df['dtype'].str.contains('float64')
  features_df['startswith'] = features_df['features'].str.split('_').str[0]
  features_df['splits'] = features_df['features'].str.count('_')
  features_df['processed'] = features_df['features']

  for filter in filters:
    if filter == 'endswith':
      for tag in filters[filter]:
        features_df[f'...{tag}'] = features_df['features'].str.endswith(tag)
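        # regex=True is required for the '$' anchor: recent pandas versions treat the pattern as a literal by default
        features_df['processed'] = features_df['processed'].str.replace(tag + r'$', '', regex=True)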

  filters_endswith = {f'...{filter}':sum for filter in filters['endswith']}
  misc = {feature:sum for feature in ['cat','num']}
  # dataframe that groups the features by their 'startswith' prefix to find the unique names
  features_filtered_df = features_df.groupby(by='startswith').agg({**misc, **filters_endswith, **{'splits': max, 'features': join, 'processed':join}}).rename_axis(mapper='index', axis=0)
  features_filtered_df['startswith_filtered'] = features_filtered_df.index
  features_filtered_df['total'] = features_filtered_df['cat'] + features_filtered_df['num']
  features_filtered_df_cols = features_filtered_df.columns.to_list()
  features_filtered_df = features_filtered_df[[features_filtered_df_cols[-1]]+features_filtered_df_cols[:-1]]
  features_filtered_df['processed'] = features_filtered_df['processed'].apply(lambda cell: set(cell))
  features_filtered_df = features_filtered_df.sort_values(by='splits', ascending=False)
  # cross-check against the feature list of the dataframe
  features_names = [name for names in features_filtered_df['processed'].to_list() for name in names]
  features_final = list()

  for filter in filters:
    if filter == 'endswith':
      for feature_name in features_names:
          for tag in filters[filter] + ['']:
            temp_feature_name = f'{feature_name}{tag}'
            if temp_feature_name in dataframe_features:
              features_final.append(temp_feature_name)
              break

  print(f'\n{len(dataframe_features) - len(features_final)} features dropped\n')

  return features_final, features_filtered_df.drop('startswith_filtered', axis=1)


def filter_cat_feature(dataframe:pd.DataFrame, by:str, minimum_coverage:float):

  #filter top features with minimum cov and plot top features and others
  feature = dataframe[by].astype(str)
  features_df = pd.DataFrame(pd.Series((','.join(feature.to_list())).split(',')).value_counts(), columns=['population']).rename_axis(mapper='tag', axis=0)
  features_df['population_%'] = round((features_df['population'] / features_df['population'].values.sum()) * 100, 2)
  features_df['cumulative_uniques_%'] = features_df['population_%'].values.cumsum()
  features_n = features_df.shape[0]
  top_features_n = 0
  
  if minimum_coverage == 100:
    top_features = features_df.index.to_list()
    others = None

  else:
    for feature_index, coverage in enumerate(features_df['cumulative_uniques_%'].to_list()):
      if coverage >= minimum_coverage:
        top_features_n = feature_index +1
        break
    top_features = features_df.index.to_list()[:top_features_n]
    others = features_df[top_features_n:]
    
  top_features_df = features_df

  if others is not None:
    top_features_df = features_df.copy().head(top_features_n)
    top_features_df.loc['others',:] = [others['population'].sum(), others['population_%'].sum(), others['cumulative_uniques_%'].to_list()[-1]]
    top_features = top_features + ['others']
  # details
  filtered_percent = round((top_features_n / features_n) * 100, 2)
  print(f'\nMinimum coverage: {minimum_coverage}%\nFiltered "{by}": {top_features_n}/{features_n} ({filtered_percent}%)\nSelected: {top_features}\n')
  # fig 1
  if top_features_n > 0: top_string = f' (top {top_features_n} and others)'
  else: top_string = ''
  # filters dataframe with each feature to aggregate stats into top_features_df
  for feature in top_features:
    if feature == 'others':
        filter_df = others
    else:
      filter_df = dataframe.copy()
      filter_df['/filter'] = dataframe[by].str.contains(feature)
      filter_df = filter_df[filter_df['/filter'] == True].drop('/filter', axis=1)
    top_features_df.loc[feature, 'size'] = filter_df.shape[0] * filter_df.shape[1]
    top_features_df.loc[feature, 'nans'] = filter_df.isna().sum().sum()
    top_features_df.loc[feature, 'unique'] = filter_df.nunique().sum().sum()

  top_features_df['fill'] = top_features_df['size'] - top_features_df['nans']
  top_features_df['nans_%'] = ((top_features_df['nans'] / top_features_df['size']) * 100).round(2)
  top_features_df['fill_%'] = 100 - top_features_df['nans_%']
  top_features_df['uniques_%'] = ((top_features_df['unique'] / top_features_df['size']) * 100).round(2)
  top_features_df = top_features_df[['population', 'population_%', 'cumulative_uniques_%', 'fill', 'fill_%', 'nans', 'nans_%', 'unique', 'uniques_%', 'size']]
  top_features_fig = top_features_df[['population_%', 'fill_%', 'nans_%', 'uniques_%']].transpose()
  top_features_fig = go.Figure(data=[go.Bar(name=str(top_features_fig.index[index]), x=list(top_features_fig.columns.values), y=list(top_features_fig.iloc[index,:].values)) for index in range(top_features_fig.shape[0])])
  top_features_fig.update_layout(title=f'"{by}" characteristics per category' + top_string) #width=1200, height=600, 
  top_features_fig.show()

  return top_features_df
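
The `eta_squared` helper above computes the correlation ratio between a categorical feature and a numerical one (between-class sum of squares divided by total sum of squares) with Python loops. As a possible alternative, here is a vectorized sketch based on `groupby`. The function name `eta_squared_vectorized` and the toy data are assumptions for illustration, and the sketch assumes there are no missing values in either series.

```python
import pandas as pd

def eta_squared_vectorized(x: pd.Series, y: pd.Series) -> float:
    """Correlation ratio between a categorical x and a numerical y (no NaNs assumed)."""
    overall_mean = y.mean()
    # Total sum of squares.
    total_ss = ((y - overall_mean) ** 2).sum()
    # Between-class sum of squares: class size * squared deviation of the class mean.
    groups = y.groupby(x).agg(["count", "mean"])
    between_ss = (groups["count"] * (groups["mean"] - overall_mean) ** 2).sum()
    return between_ss / total_ss

# Hypothetical toy data: y is fully determined by x in the first case, unrelated in the second.
y = pd.Series([1.0, 1.0, 3.0, 3.0])
eta_dependent = eta_squared_vectorized(pd.Series(["a", "a", "b", "b"]), y)
eta_independent = eta_squared_vectorized(pd.Series(["a", "b", "a", "b"]), y)
print(eta_dependent, eta_independent)
```

A value close to 1 means that the categories explain most of the variance of the numerical feature, while a value close to 0 means they explain almost none of it.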


### *c -* PCA

These are functions from the *OpenClassRooms* course on *Dimensionality Reduction*.

*You can skip this part if you are not interested in code.*

In [5]:
def display_circles(pcs, n_comp, pca, axis_ranks, labels=None, label_rotation=0, lims=None):
    for d1, d2 in axis_ranks: # We display the first 3 factorial planes, i.e. the first 6 components
        if d2 < n_comp:

            # initialize the figure
            fig, ax = plt.subplots(figsize=(14,12))

            # determine the plot limits
            if lims is not None :
                xmin, xmax, ymin, ymax = lims
            elif pcs.shape[1] < 30 :
                xmin, xmax, ymin, ymax = -1, 1, -1, 1
            else :
                xmin, xmax, ymin, ymax = min(pcs[d1,:]), max(pcs[d1,:]), min(pcs[d2,:]), max(pcs[d2,:])

            # draw the arrows
            # if there are 30 arrows or more, the triangle at their tip is not drawn
            if pcs.shape[1] < 30 :
                plt.quiver(np.zeros(pcs.shape[1]), np.zeros(pcs.shape[1]),
                   pcs[d1,:], pcs[d2,:], 
                   angles='xy', scale_units='xy', scale=1, color="grey")
                # (see the docs: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.quiver.html)
            else:
                lines = [[[0,0],[x,y]] for x,y in pcs[[d1,d2]].T]
                ax.add_collection(LineCollection(lines, axes=ax, alpha=.1, color='black'))
            
            # display the variable names
            if labels is not None:  
                for i,(x, y) in enumerate(pcs[[d1,d2]].T):
                    if x >= xmin and x <= xmax and y >= ymin and y <= ymax :
                        plt.text(x, y, labels[i], fontsize='14', ha='center', va='center', rotation=label_rotation, color="blue", alpha=0.5)
            
            # draw the circle
            circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
            plt.gca().add_artist(circle)

            # set the plot limits
            plt.xlim(xmin, xmax)
            plt.ylim(ymin, ymax)
        
            # draw the horizontal and vertical lines
            plt.plot([-1, 1], [0, 0], color='grey', ls='--')
            plt.plot([0, 0], [-1, 1], color='grey', ls='--')

            # axis labels, with the percentage of explained inertia
            plt.xlabel('F{} ({}%)'.format(d1+1, round(100*pca.explained_variance_ratio_[d1],1)))
            plt.ylabel('F{} ({}%)'.format(d2+1, round(100*pca.explained_variance_ratio_[d2],1)))

            plt.title("Correlation circle (F{} and F{})".format(d1+1, d2+1))
            plt.show(block=False)
        
def display_factorial_planes(X_projected, n_comp, pca, axis_ranks, labels=None, alpha=1, illustrative_var=None):
    for d1,d2 in axis_ranks:
        if d2 < n_comp:
 
            # initialize the figure
            fig = plt.figure(figsize=(14,12))
        
            # plot the points
            if illustrative_var is None:
                plt.scatter(X_projected[:, d1], X_projected[:, d2], alpha=alpha)
            else:
                illustrative_var = np.array(illustrative_var)
                for value in np.unique(illustrative_var):
                    selected = np.where(illustrative_var == value)
                    plt.scatter(X_projected[selected, d1], X_projected[selected, d2], alpha=alpha, label=value)
                plt.legend()

            # display the point labels
            if labels is not None:
                for i,(x,y) in enumerate(X_projected[:,[d1,d2]]):
                    plt.text(x, y, labels[i],
                              fontsize='14', ha='center',va='center') 
                
            # determine the plot limits
            boundary = np.max(np.abs(X_projected[:, [d1,d2]])) * 1.1
            plt.xlim([-boundary,boundary])
            plt.ylim([-boundary,boundary])
        
            # draw the horizontal and vertical lines
            plt.plot([-100, 100], [0, 0], color='grey', ls='--')
            plt.plot([0, 0], [-100, 100], color='grey', ls='--')

            # axis labels, with the percentage of explained inertia
            plt.xlabel('F{} ({}%)'.format(d1+1, round(100*pca.explained_variance_ratio_[d1],1)))
            plt.ylabel('F{} ({}%)'.format(d2+1, round(100*pca.explained_variance_ratio_[d2],1)))

            plt.title("Projection of individuals (on F{} and F{})".format(d1+1, d2+1))
            plt.show(block=False)

def display_scree_plot(pca):
    scree = pca.explained_variance_ratio_*100
    plt.figure(figsize=(14,12))
    plt.bar(np.arange(len(scree))+1, scree)
    plt.plot(np.arange(len(scree))+1, scree.cumsum(),c="red",marker='o')
    plt.xlabel("rank of the inertia axis")
    plt.ylabel("percentage of inertia")
    plt.title("Scree plot")


## *C -* Loading the Data

The data download page of OpenFoodFacts can be found at https://fr.openfoodfacts.org/data.

We will use the .csv file, which is tab-separated. Let's import it using the *pandas read_csv()* function.

We then initialize our *Dataset()* with the resulting dataframe.

In [110]:
root_path = './gdrive/MyDrive/Openclassrooms/P2'
dataframe = pd.read_csv(f'{root_path}/dataset.csv', sep='\t', encoding="utf-8", low_memory=True)
dataset = Dataset(dataframe)


Columns (0,8,13,19,20,21,22,23,27,28,29,31,52,55,64) have mixed types.Specify dtype option on import or set low_memory=False.



Version 0: "original" initialized
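
The warning above suggests specifying the `dtype` option on import. As a hedged sketch of what that could look like, the example below reads a made-up two-row, tab-separated extract twice: once with inferred types and once with the `code` column declared as text. The leading zeroes in the toy codes are an assumption for illustration; which dtypes to declare for the real file would need to be checked against the data dictionary.

```python
import io
import pandas as pd

# Hypothetical two-row extract, tab-separated like the file loaded above.
raw = "code\tproduct_name\n0017\tVitória crackers\n0031\tCacao\n"

# Without a dtype, pandas infers integers and the leading zeroes are lost.
inferred = pd.read_csv(io.StringIO(raw), sep="\t")

# Declaring the column as text keeps the codes exactly as written.
typed = pd.read_csv(io.StringIO(raw), sep="\t", dtype={"code": str})

print(inferred["code"].tolist())
print(typed["code"].tolist())
```

Declaring the types up front also removes the need for pandas to guess them, which is what triggers the mixed-types warning.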


# ***1 -*** **Dataset Description**

Let us take a look at the main dataset characteristics to get acquainted with the dataset.

## *A -* Dictionary

The *data dictionary* can be found at: https://world.openfoodfacts.org/data/data-fields.txt

## *B -* Shape

In [111]:
samples, features = dataset.get('samples'), dataset.get('features')
print(f'The dataset is composed of {samples} samples (rows), and {features} features (columns).')

The dataset is composed of 1993128 samples (rows), and 186 features (columns).


## *C -* Head 

Let's take a look at the head (the first rows) of the dataset.

We notice that the dataset is composed of:
* *Metadata*: information about the database entries, such as the pseudonyms of the contributors or the creation dates,
* *Nutritional data*: the legal nutritional information found on the back of food products, such as sugars, proteins, fats or salt per 100 grams,
* *Scores*, such as the NOVA group or Nutri-Score, which try to describe the quality of the food products,
* *Miscellaneous data*: additional information about the product, such as food categories or origins.

In [112]:
dataframe.head()

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,abbreviated_product_name,generic_name,quantity,packaging,packaging_tags,packaging_text,brands,brands_tags,categories,categories_tags,categories_en,origins,origins_tags,origins_en,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_en,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_en,ingredients_text,allergens,allergens_en,...,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-dried_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,carbon-footprint-from-meat-or-fish_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0,225,http://world-en.openfoodfacts.org/product/0000...,nutrinet-sante,1623855208,2021-06-16T14:53:28Z,1623855209,2021-06-16T14:53:29Z,jeunes pousses,,,,,,,endives,endives,,,,,,,,,,,,,,,,,,,en:france,en:france,France,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,3429145,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1630483911,2021-09-01T08:11:51Z,1630484064,2021-09-01T08:14:24Z,L.casei,,,,,,,,,,,,Spain,en:spain,Spain,,,,,,,,,,,,,Spain,en:spain,Spain,"Leche semidesnatada, azucar 6.9% leche desnata...",,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,17,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1529059080,2018-06-15T10:38:00Z,1561463718,2019-06-25T11:55:18Z,Vitória crackers,,,,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,31,http://world-en.openfoodfacts.org/product/0000...,isagoofy,1539464774,2018-10-13T21:06:14Z,1539464817,2018-10-13T21:06:57Z,Cacao,,,130 g,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,3327986,http://world-en.openfoodfacts.org/product/0000...,kiliweb,1574175736,2019-11-19T15:02:16Z,1624390765,2021-06-22T19:39:25Z,Filetes de pollo empanado,,,,,,,,,,,,,,,,,,,,,,,,,,,Espagne,en:spain,Spain,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


## *D -* Dtypes

* Around two thirds of the *features* (columns) of the *dataset* are *numeric*, while a third are *categorical*.
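
The *sum_dtypes()* helper is defined in the Environment chapter and is not shown here. As a rough sketch, assuming it simply counts the columns per dtype, something similar can be obtained with plain pandas. The `count_dtypes` name and the toy dataframe are hypothetical and only serve the illustration:

```python
import pandas as pd

def count_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Count how many columns share each dtype (hypothetical stand-in for sum_dtypes)."""
    counts = df.dtypes.astype(str).value_counts()
    return counts.rename_axis('dtype').to_frame('population')

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    'energy_100g': [1594.0, 946.0],
    'fat_100g': [22.0, 0.0],
    'created_t': [1623855208, 1630483911],
    'product_name': ['Cacao', 'L.casei'],
})
dtype_counts = count_dtypes(toy)
print(dtype_counts)
```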

In [113]:
dtypes = sum_dtypes(dataframe)

In [114]:
dtypes

Unnamed: 0_level_0,population
dtype,Unnamed: 1_level_1
float64,124
object,60
int64,2


## *E -* NaNs

* A good part of the dataset seems to be empty. We will need to get rid of these empty features.
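
For readers who skipped the Environment chapter, here is a minimal sketch of what a NaN summary could look like, assuming *sum_nans()* returns a per-feature count and percentage sorted from the emptiest feature down. The `count_nans` name and the toy dataframe are hypothetical:

```python
import numpy as np
import pandas as pd

def count_nans(df: pd.DataFrame) -> pd.DataFrame:
    """Per-feature NaN count and percentage, emptiest features first (hypothetical stand-in for sum_nans)."""
    nans = df.isna().sum()
    report = pd.DataFrame({
        'nans': nans,
        'nans_%': (100 * nans / len(df)).round(2),
    })
    return report.rename_axis('feature').sort_values('nans', ascending=False)

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    'code': ['17', '31', '225', '3327986'],
    'quantity': [np.nan, '130 g', np.nan, np.nan],
    'cities': [np.nan] * 4,
})
nan_report = count_nans(toy)
print(nan_report)
print(f"Average share of missing values per feature: {nan_report['nans_%'].mean():.2f} %")
```

Since every feature has the same number of rows, the mean of the per-feature percentages is also the share of empty cells in the whole table.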

In [115]:
nans = sum_nans(dataframe)
nans.head()

Unnamed: 0_level_0,nans,nans_%
feature,Unnamed: 1_level_1,Unnamed: 2_level_1
allergens_en,1993128,100.0
ingredients_from_palm_oil,1993128,100.0
ingredients_that_may_be_from_palm_oil,1993128,100.0
cities,1993128,100.0
additives,1993128,100.0


In [116]:
average_nans = round(nans['nans_%'].mean(), 2)
print(f'The dataset is empty at {average_nans} %')

The dataset is empty at 79.83 %


## *F -* Uniques

In [117]:
uniques = sum_uniques(dataframe)
uniques.head()

Unnamed: 0_level_0,uniques,uniques_%
feature,Unnamed: 1_level_1,Unnamed: 2_level_1
url,1993115,100.0
code,1992835,99.99
created_t,1740157,87.31
created_datetime,1740157,87.31
last_modified_t,1555554,78.05


In [118]:
average_uniques = round(uniques['uniques_%'].mean(), 2)
print(f'The dataset contains {average_uniques} % of unique values') 

The dataset contains 5.61 % of unique values


## *G -* Numericals

Now, let's take a look at the numerical features report. The graph below represents the features that are at least 25% filled.
* We notice that the "created_t" and "last_modified_t" features are 100% filled. This is probably because the database generates its date metadata automatically.
* Following these two features are the main nutritional values (from "energy_100g" to "sodium_100g"), which are partially filled. We can assume that this is normal, as not all foods contain all the nutritional groups.
* We then have some miscellaneous data, and scores such as the nutriscore or the nova group. A 25% fill rate is a pretty low limit, so we will assume that the features below it are too empty to be fully usable. We might reevaluate this statement once we have filtered the dataset.
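
The columns of the report shown below (count, fill_%, nans, zeroes, then the usual descriptive statistics) suggest a pandas *describe()* extended with fill statistics. The sketch below is an assumption about how such a report could be built, not the actual code of the toolbox. `numeric_report` and the toy dataframe are hypothetical:

```python
import numpy as np
import pandas as pd

def numeric_report(df: pd.DataFrame, min_fill: float = 0.0) -> pd.DataFrame:
    """describe() extended with fill, NaN and zero statistics (hypothetical helper)."""
    num = df.select_dtypes('number')
    report = num.describe().T
    report['fill_%'] = (100 * report['count'] / len(df)).round(2)
    report['nans'] = num.isna().sum()
    report['zeroes_%'] = (100 * num.eq(0).sum() / len(df)).round(2)
    # Keep only the features that reach the minimum fill rate
    report = report[report['fill_%'] >= min_fill]
    return report.sort_values('fill_%', ascending=False)

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    'created_t': [1529059080, 1539464774, 1574175736, 1623855208],
    'proteins_100g': [4.6, np.nan, 0.0, -500.0],
    'water-hardness_100g': [np.nan, np.nan, np.nan, 9100.0],
})
num_report = numeric_report(toy, min_fill=25)
print(num_report[['count', 'fill_%', 'nans', 'zeroes_%', 'min', 'max']])
```

Reading the min and max columns next to the fill rate makes suspicious values, such as a negative protein content, stand out early.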

In [119]:
dataset.num_report(25)

In [120]:
dataset.get('numericals').head()

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
created_t,1993128.0,100.0,0,0.0,0.0,0.0,1560244000.0,53025160.0,1328021000.0,1524226000.0,1571413000.0,1603559000.0,1634518000.0
last_modified_t,1993128.0,100.0,0,0.0,0.0,0.0,1595542000.0,31270700.0,1333873000.0,1582762000.0,1599730000.0,1619865000.0,1634519000.0
energy_100g,1581036.0,79.32,412092,20.68,42223.0,2.12,4.215944e+36,5.3010959999999996e+39,0.0,418.0,1079.0,1674.0,6.665559e+42
proteins_100g,1574356.0,78.99,418772,21.01,197264.0,9.9,8.788473,62.80052,-500.0,1.3,6.0,12.4,73000.0
fat_100g,1573119.0,78.93,420009,21.07,241320.0,12.11,69924790.0,87702490000.0,0.0,0.8,7.0,21.2,110000000000000.0


## *H -* Categoricals

What about the categorical features?
* We notice that most of the highly filled features are meta data, but there are also miscellaneous features such as the PNNS groups, the countries or the product names (more than 95% filled), which might be useful to filter our dataset later.
* Then, we have image URLs, which could be useful to fill in the database entries using technologies such as Optical Character Recognition (OCR), but which are not of interest to us at this time.
* There are also brand tags, categories and ingredients, which are partially filled and would require Natural Language Processing (NLP).

In [121]:
dataset.cat_report(25)

In [122]:
dataset.get('categoricals').head()

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
code,1993128,100.0,1992835,99.99,0,0.0,73124005572,2
states_tags,1993128,100.0,6075,0.3,0,0.0,"en:to-be-completed,en:nutrition-facts-complete...",205699
created_datetime,1993128,100.0,1740157,87.31,0,0.0,2020-04-23T17:22:07Z,28
last_modified_datetime,1993128,100.0,1555554,78.05,0,0.0,2021-09-02T17:25:05Z,216
states,1993128,100.0,6075,0.3,0,0.0,"en:to-be-completed, en:nutrition-facts-complet...",205699


# ***2 -*** **Application Concept**

So, what can we make of this data?
* Do you have a sweet tooth? Because I have one. The problem is that I am also very health-conscious and I don't have the time to make my own sweets. I am also a very curious individual who loves to discover new products.
* What about creating an app that suggests healthier alternatives to your everyday snacks? This is what we are going for. I present to you: *BetterSnack* (this is just an idea).


# ***3 -*** **Dataset Cleaning**

We can make several points after finishing the dataset description:
* The dataset is only partially filled and we need to get rid of the empty features.
* Also, in the latter part of the description, we cannot help but notice that some features seem to be duplicates, such as "countries", "countries_en" and "countries_tags", which have different suffixes but the exact same characteristics.
* We also need to get rid of these redundant features.

## *A -* Features Filtering

We first clean the features (columns) of the dataset. 

### *a -* Redundants

In [123]:
sample(dataframe['countries'], dataframe['countries_en'], dataframe['countries_tags'])

Unnamed: 0_level_0,countries,countries_index,countries_en,countries_en_index,countries_tags,countries_tags_index
sample,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,中華民国,6190,"Argentina,China",3437,"en:france,en:portugal",556
1,"Egypt, Turkey",9936,"Chile,France,Spain",3448,"en:austria,en:germany,en:luxembourg",1061
2,"Francja,Polska,Szwajcaria, en:belgium, en:swit...",7187,Bosnia and Herzegovina,409,en:central-african-republic,1217
3,"France,Belgium,Luxembourg",7010,"Morocco,Spain,fr:francia",3004,"en:belgium,en:denmark,en:france,en:germany,en:...",3834
4,"Denmark,Finland,Germany,Norway,Sweden,Switzerland",6627,"Argentina,Chile,Paraguay,Uruguay",3402,"en:france,en:mauritius,en:south-africa",2970
5,"Austria,France, en:germany",5774,"Estonia,Poland",3921,"en:france,en:philippines",652
6,"Austria,France,Germany,Serbia",10544,"Germany,Portugal,Spain,Switzerland",995,"en:italy,en:spain,fr:francia",2452
7,"Brazil,United States",1545,"France,Qatar",551,"en:france,en:germany,en:netherlands,en:portuga...",2146
8,"France, Guyane, m",10540,"Indonesia,Spain",512,"en:france,en:guadeloupe,en:hungary,en:morocco,...",3755
9,"Guadeloupe, en:australia, en:france",6525,"Algeria,ثم-تع",2982,"en:bulgaria,en:czech-republic,en:denmark,en:es...",3808


It seems pretty clear that these features represent the exact same information. We will remove the duplicate features using their names and the *filter_tags()* function I have developed:
* This function detects the root names of the features, strips the unwanted tag suffixes and keeps one feature per root (order of preference: the suffixes to keep should come first).
* For example, for countries, and for all features in the dataset, we assume that if there is a "tags" suffix, it is the most formalized one and the easiest to process.
* In the report generated by the *filter_tags()* function, we can see that 7 feature names start with "ingredients", of which 3 are categorical features and 4 are numerical features.
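
To make the idea concrete, here is a deliberately simplified sketch of suffix-based deduplication. It is not the actual *filter_tags()* implementation (which lives in the Environment chapter and also builds a report). `dedupe_by_suffix` is a hypothetical name, and the sketch assumes that the suffix list is ordered from most to least preferred:

```python
def dedupe_by_suffix(features, suffixes):
    """Keep one feature per root name; `suffixes` is ordered from most to least preferred."""
    def split(name):
        # Return (root, rank); the lower the rank, the more preferred the suffix.
        for rank, suffix in enumerate(suffixes):
            if name.endswith(suffix):
                return name[: -len(suffix)], rank
        return name, len(suffixes)  # no known suffix: least preferred

    best = {}
    for name in features:
        root, rank = split(name)
        if root not in best or rank < best[root][0]:
            best[root] = (rank, name)
    return [name for _, name in best.values()]

features = ['countries', 'countries_en', 'countries_tags',
            'created_t', 'created_datetime', 'product_name']
kept = dedupe_by_suffix(features, suffixes=['_tags', '_t', '_datetime', '_en'])
print(kept)  # ['countries_tags', 'created_t', 'product_name']
```

With this ordering, "countries_tags" wins over "countries" and "countries_en", which matches the assumption that the "tags" version is the most formalized one.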

In [124]:
features_final, features_report = filter_tags(dataframe=dataframe, filters={'endswith':['_tags','_t','_url','_datetime','_en','_fr','_100g','_n','_text','_small']}) # for details, refer to #Environment/Utilities/filter_tags()
dataframe = dataframe[features_final]
features_report.head()


32 features dropped



Unnamed: 0_level_0,total,cat,num,..._tags,..._t,..._url,..._datetime,..._en,..._fr,..._100g,..._n,..._text,..._small,splits,features,processed
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
ingredients,7,3,4,2,0,0,0,0,0,0,2,1,0,7,"[ingredients_text, ingredients_from_palm_oil_n...","{ingredients_that_may_be_from_palm_oil, ingred..."
first,1,1,0,0,0,0,0,0,0,0,0,0,0,3,[first_packaging_code_geo],{first_packaging_code_geo}
image,6,6,0,0,0,6,0,0,0,0,0,0,0,3,"[image_url, image_small_url, image_ingredients...","{image_nutrition, image, image_ingredients}"
abbreviated,1,1,0,0,0,0,0,0,0,0,0,0,0,2,[abbreviated_product_name],{abbreviated_product_name}
pnns,2,2,0,0,0,0,0,0,0,0,0,0,0,2,"[pnns_groups_1, pnns_groups_2]","{pnns_groups_2, pnns_groups_1}"


The *filter_tags()* function works well, but there are still *duplicates* in the dataset:
* We know for a fact that sodium and salt, as well as energy and kilocalories, represent the same data at different ratios.
* We choose to get rid of these duplicates.

*Sources:*

*https://www.nal.usda.gov/legacy/fnic/what-difference-between-calories-and-kilocalories*

*https://www.hsph.harvard.edu/nutritionsource/salt-and-sodium/*
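
The two ratios can be checked quickly: by labelling convention salt is sodium multiplied by 2.5, and 1 kcal equals 4.184 kJ. The sketch below uses toy values and assumes that "energy_100g" is expressed in kJ. On the real data, the same comparison would tell us how many rows actually respect these ratios:

```python
import numpy as np
import pandas as pd

SALT_PER_SODIUM = 2.5   # labelling convention: salt (g) = sodium (g) x 2.5
KJ_PER_KCAL = 4.184     # 1 kcal = 4.184 kJ

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    'sodium_100g': [0.04, 0.4, 1.0],
    'salt_100g': [0.1, 1.0, 2.5],
    'energy-kcal_100g': [100.0, 250.0, 400.0],
    'energy_100g': [418.4, 1046.0, 1673.6],  # assumed to be in kJ
})

# isclose() tolerates the small rounding differences of floating-point products
salt_matches = np.isclose(toy['salt_100g'], toy['sodium_100g'] * SALT_PER_SODIUM)
energy_matches = np.isclose(toy['energy_100g'], toy['energy-kcal_100g'] * KJ_PER_KCAL)
print(salt_matches.all(), energy_matches.all())
```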

In [125]:
dataframe = dataframe.drop(['energy-kcal_100g', 'sodium_100g'], axis=1)
dataset.save_version(dataframe, 'duplicates')


Version 1: "duplicates" saved



#### *Dataset report*

In [126]:
dataset.report(25)


Samples dropped: 0/1993128 (0.0%)
Features dropped: 34/186 (18.28%)



In [127]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
last_modified_t,1993128.0,100.00,0,0.00,0.0,0.00,1.595542e+09,3.127070e+07,1.333873e+09,1.582762e+09,1.599730e+09,1.619865e+09,1.634519e+09
created_t,1993128.0,100.00,0,0.00,0.0,0.00,1.560244e+09,5.302516e+07,1.328021e+09,1.524226e+09,1.571413e+09,1.603559e+09,1.634518e+09
energy_100g,1581036.0,79.32,412092,20.68,42223.0,2.12,4.215944e+36,5.301096e+39,0.000000e+00,4.180000e+02,1.079000e+03,1.674000e+03,6.665559e+42
proteins_100g,1574356.0,78.99,418772,21.01,197264.0,9.90,8.788473e+00,6.280052e+01,-5.000000e+02,1.300000e+00,6.000000e+00,1.240000e+01,7.300000e+04
fat_100g,1573119.0,78.93,420009,21.07,241320.0,12.11,6.992479e+07,8.770249e+10,0.000000e+00,8.000000e-01,7.000000e+00,2.120000e+01,1.100000e+14
...,...,...,...,...,...,...,...,...,...,...,...,...,...
glycemic-index_100g,4.0,0.00,1993124,100.00,0.0,0.00,3.417500e+01,1.562015e+01,1.400000e+01,2.600000e+01,3.700000e+01,4.517500e+01,4.870000e+01
-elaidic-acid_100g,2.0,0.00,1993126,100.00,0.0,0.00,8.500000e-01,9.192388e-01,2.000000e-01,5.250000e-01,8.500000e-01,1.175000e+00,1.500000e+00
water-hardness_100g,1.0,0.00,1993127,100.00,0.0,0.00,9.100000e+03,,9.100000e+03,9.100000e+03,9.100000e+03,9.100000e+03,9.100000e+03
no_nutriments,0.0,0.00,1993128,100.00,0.0,0.00,,,,,,,


In [128]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
url,1993128,100.0,1993115,100.0,0,0.0,http://world-en.openfoodfacts.org/product/6703...,2
states_tags,1993128,100.0,6075,0.3,0,0.0,"en:to-be-completed,en:nutrition-facts-complete...",205699
code,1993128,100.0,1992835,99.99,0,0.0,73124005572,2
creator,1993124,100.0,14517,0.73,4,0.0,kiliweb,1161906
pnns_groups_2,1992885,99.99,42,0.0,243,0.01,unknown,1233742
pnns_groups_1,1992883,99.99,12,0.0,245,0.01,unknown,1233742
countries_tags,1987006,99.69,4242,0.21,6122,0.31,en:france,772112
product_name,1911433,95.9,1240330,62.23,81695,4.1,Aceite de oliva virgen extra,1339
image_url,1529027,76.71,1528868,76.71,464101,23.29,https://images.openfoodfacts.org/images/produc...,49
image_nutrition_url,1036879,52.02,1036842,52.02,956249,47.98,https://images.openfoodfacts.org/images/produc...,7


### *b -* NaNs

The *database* contains a lot of *NaNs* (***#1.E***).

On a first pass, we choose to **keep** the *features* in which at least 50% of the *values* are filled.

In [129]:
min_fill = 50  # minimum share of filled values (%) required to keep a feature
nans = sum_nans(dataframe)
features_selected = nans[nans['nans_%'] <= 100 - min_fill].index.to_list()  # fill_% = 100 - nans_%
dataframe = dataframe[features_selected]
dataset.save_version(dataframe, 'nans')


Version 2: "nans" saved



#### *Dataset report*

In [130]:
dataset.report()


Samples dropped: 0/1993128 (0.0%)
Features dropped: 132/152 (86.84%)



In [131]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
created_t,1993128.0,100.0,0,0.0,0.0,0.0,1560244000.0,53025160.0,1328021000.0,1524226000.0,1571413000.0,1603559000.0,1634518000.0
last_modified_t,1993128.0,100.0,0,0.0,0.0,0.0,1595542000.0,31270700.0,1333873000.0,1582762000.0,1599730000.0,1619865000.0,1634519000.0
energy_100g,1581036.0,79.32,412092,20.68,42223.0,2.12,4.215944e+36,5.3010959999999996e+39,0.0,418.0,1079.0,1674.0,6.665559e+42
proteins_100g,1574356.0,78.99,418772,21.01,197264.0,9.9,8.788473,62.80052,-500.0,1.3,6.0,12.4,73000.0
fat_100g,1573119.0,78.93,420009,21.07,241320.0,12.11,69924790.0,87702490000.0,0.0,0.8,7.0,21.2,110000000000000.0
carbohydrates_100g,1572773.0,78.91,420355,21.09,123702.0,6.21,28.85503,632.2009,-1.0,3.5,15.1,53.0,762939.0
sugars_100g,1557152.0,78.13,435976,21.87,236658.0,11.87,64219820.0,80137260000.0,-1.0,0.6,3.57,17.65,100000000000000.0
saturated-fat_100g,1530001.0,76.76,463127,23.24,326738.0,16.39,5.105034,16.54633,0.0,0.1,1.8,7.09,16700.0
salt_100g,1489738.0,74.74,503390,25.26,226882.0,11.38,2.125962,93.88736,0.0,0.076,0.5714286,1.4,75000.0


In [132]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
states_tags,1993128,100.0,6075,0.3,0,0.0,"en:to-be-completed,en:nutrition-facts-complete...",205699
code,1993128,100.0,1992835,99.99,0,0.0,73124005572,2
url,1993128,100.0,1993115,100.0,0,0.0,http://world-en.openfoodfacts.org/product/6703...,2
creator,1993124,100.0,14517,0.73,4,0.0,kiliweb,1161906
pnns_groups_2,1992885,99.99,42,0.0,243,0.01,unknown,1233742
pnns_groups_1,1992883,99.99,12,0.0,245,0.01,unknown,1233742
countries_tags,1987006,99.69,4242,0.21,6122,0.31,en:france,772112
product_name,1911433,95.9,1240330,62.23,81695,4.1,Aceite de oliva virgen extra,1339
image_url,1529027,76.71,1528868,76.71,464101,23.29,https://images.openfoodfacts.org/images/produc...,49
image_nutrition_url,1036879,52.02,1036842,52.02,956249,47.98,https://images.openfoodfacts.org/images/produc...,7


### *c -* Meta Data

As we have seen before, there are some *meta data* features that we do not need to develop our application concept. We **remove** them.

In [133]:
meta_features = ['last_modified_t', 'created_t','url', 'image_url', 'image_nutrition_url', 'code', 'states_tags', 'creator']
dataframe = dataframe.drop(meta_features, axis=1)
dataset.save_version(dataframe, 'meta_data')


Version 3: "meta_data" saved



#### *Dataset report*

In [134]:
dataset.report()


Samples dropped: 0/1993128 (0.0%)
Features dropped: 8/20 (40.0%)



In [135]:
dataset.get('numericals')

In [136]:
dataset.get('categoricals')

### *d -* Others

The "product_name" and "brands_tags" features won't be of use to us either, so we **remove** them.

In [137]:
dataframe = dataframe.drop(['brands_tags','product_name'], axis=1)

## *B -* Samples Filtering

Now that we have selected our *features*, let's take a look at the *samples* (rows) of the dataset:
* We have seen that the database entries are far from perfect and that some features still contain many NaNs. We will now find out whether filtering the dataset yields better-quality data.
* We will ensure that the data quality is good enough to start the actual data exploration.

### *a -* Countries tags

Let's take a look at the countries tags:
* There are more than 4000 unique values in the countries tags.
* The UN recognizes 195 independent states around the world (193 member states and 2 observer states), which means that many entries are not standardized.
* We apply a simple regex to the countries tags to remove the language codes (e.g. "en:") and filter the dataset on the most frequent values.
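
As a small illustration of the regex used in the cells below, here is how the pattern behaves on a few tag strings. The `strip_language_codes` wrapper is only there for the example:

```python
import re

# Two letters followed by a colon, e.g. "en:" or "fr:"
LANGUAGE_PREFIX = re.compile(r'[a-zA-Z]{2}:')

def strip_language_codes(tags: str) -> str:
    """Remove every two-letter language prefix from a comma-separated tag string."""
    return LANGUAGE_PREFIX.sub('', tags)

examples = ['en:france', 'en:belgium,en:france', 'en:italy,en:spain,fr:francia']
cleaned = [strip_language_codes(tags) for tags in examples]
print(cleaned)  # ['france', 'belgium,france', 'italy,spain,francia']
```

Note that the pattern is not anchored to the start of a tag, so it would also remove the two letters preceding any other colon. The last example also shows that removing the prefix does not translate the tag: "francia" and "france" remain two different values.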

In [138]:
countries_name = 'countries_tags'
pd.DataFrame(dataset.get('categoricals').loc[countries_name]).rename_axis(mapper='stats', axis=0)

Unnamed: 0_level_0,countries_tags
stats,Unnamed: 1_level_1
count,1987006
fill_%,99.69
unique,4242
uniques_%,0.21
nans,6122
nans_%,0.31
top,en:france
freq,772112


In [139]:
countries_processed_name = 'countries'
dataframe[countries_processed_name] = dataframe[countries_name].astype(str).apply(lambda cell: re.sub(r'[a-zA-Z]{2}:', '', cell))
sample(dataframe[countries_name], dataframe[countries_processed_name])

Unnamed: 0_level_0,countries_tags,countries_tags_index,countries,countries_index
sample,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,"en:argentina,en:brazil,en:chile,en:colombia,en...",3198,"algeria,tunisia",2993
1,en:xk,2641,null-australia,4176
2,"en:belgium,en:france,en:morocco,en:reunion,en:...",1497,"belgium,france,germany,hong-kong,spain",3457
3,"en:france,en:germany,en:reunion,en:spain",752,"croatia,france,serbia",1747
4,"en:germany,en:togo",2217,"austria,portugal,spain,francia",3715
5,"en:bolivia,en:canada,en:chile,en:colombia,en:c...",3369,"czech-republic,france,romania,slovakia",2753
6,en:democratic-republic-of-the-congo,432,"austria,hungary",1201
7,"en:france,en:germany,en:italy",411,"bulgaria,czech-republic,romania",1128
8,"en:australia,en:belgium,en:france,en:switzerla...",2482,"denmark,germany,romania,spain",1889
9,"en:belgium,en:denmark,en:france,en:spain,en:un...",2533,"andorra,belgium,france,luxembourg,spain",1693


* The samples now look better, but there is still an issue: there is often more than one country per food product.
* To resolve this issue, we need to "unzip" the countries present in each row and count them. We will then be able to decide what to do with these countries tags.
* We will use the *filter_cat_feature()* function I developed in the ***B -*** **Utilities** chapter of this notebook.
* The *filter_cat_feature()* function counts each comma-separated tag in the feature (here, the "countries" feature we have created), selects the most frequent modalities until they cover at least the share of the population set by the *minimum_coverage* argument (we set it to 90%) and groups the remaining modalities in the "others" category.
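
The counting logic described above can be sketched as follows. This is a simplified illustration under my own assumptions, not the actual *filter_cat_feature()* code: `select_by_coverage` is a hypothetical name, and the coverage is computed here on tag occurrences:

```python
import pandas as pd

def select_by_coverage(values: pd.Series, minimum_coverage: float, sep: str = ','):
    """Return the most frequent tags covering at least `minimum_coverage` % of all tag occurrences."""
    # "Unzip" the comma-separated tags: one row per tag
    tags = values.dropna().str.split(sep).explode().str.strip()
    shares = tags.value_counts(normalize=True) * 100
    cumulative = shares.cumsum()
    # Keep every tag up to and including the one that reaches the threshold
    n_selected = int((cumulative < minimum_coverage).sum()) + 1
    selected = shares.index[:n_selected].to_list()
    if n_selected < len(shares):
        selected.append('others')
    return selected, shares.round(2)

# Toy data, not taken from the real dataset
countries = pd.Series(['france', 'france', 'france', 'france', 'france,belgium',
                       'spain', 'spain', 'italy', 'germany,spain', None])
selected, shares = select_by_coverage(countries, minimum_coverage=70)
print(selected)  # ['france', 'spain', 'others']
```

In this toy example, "france" (5 of 11 tags) and "spain" (3 of 11 tags) together exceed 70% of the occurrences, so the three remaining countries are grouped under "others".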

In [140]:
countries_report = filter_cat_feature(dataframe=dataframe, by=countries_processed_name, minimum_coverage=90)


Minimum coverage: 90%
Filtered "countries": 9/511 (1.76%)
Selected: ['france', 'united-states', 'spain', 'italy', 'germany', 'switzerland', 'belgium', 'united-kingdom', 'canada', 'others']



In [141]:
# countries_report # uncomment this line to see the DETAILED feature REPORT

* More than half of the food products in the database are sold in France (40% of the population) or in the United States (18% of the population). This should not surprise us, as the OpenFoodFacts project originated in France.
* We decide to use the French data: although the features are better filled for some other countries, France is by far the biggest population of the dataset, which compensates for its missing data. Note that the filter below is an exact match, so it keeps only the products whose sole listed country is France.

In [142]:
dataframe = dataframe[dataframe[countries_processed_name] == 'france'].drop([countries_name, countries_processed_name], axis=1)
dataset.save_version(dataframe, 'countries')


Version 4: "countries" saved



#### *Dataset report*

In [143]:
dataset.report()


Samples dropped: 1221016/1993128 (61.26%)
Features dropped: 3/12 (25.0%)



In [144]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
energy_100g,593719.0,76.9,178393,23.1,15662.0,2.03,1167.312826,10523.477639,0.0,469.0,1088.0,1669.0,8010000.0
saturated-fat_100g,592227.0,76.7,179885,23.3,95140.0,12.32,5.423562,8.457686,0.0,0.2,2.0,8.0,2000.0
sugars_100g,591444.0,76.6,180668,23.4,75225.0,9.74,13.582444,40.682762,-1.0,0.6,3.2,19.0,27000.0
proteins_100g,589846.0,76.39,182266,23.61,51834.0,6.71,9.151071,95.573005,0.0,1.5,6.3,13.0,73000.0
fat_100g,588172.0,76.18,183940,23.82,61486.0,7.96,14.269239,41.692285,0.0,1.0,8.0,22.1,29000.0
carbohydrates_100g,588122.0,76.17,183990,23.83,42719.0,5.53,26.947048,251.845671,-1.0,2.3,13.4,51.0,192000.0
salt_100g,571796.0,74.06,200316,25.94,82210.0,10.65,1.280061,19.261548,0.0,0.06,0.55,1.31,14000.0


In [145]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pnns_groups_2,772071,99.99,41,0.01,41,0.01,unknown,465939
pnns_groups_1,772070,99.99,12,0.0,42,0.01,unknown,465939


### *b -* PNNS groups 1

There are other highly filled features, such as the "pnns_groups" features. These features group food products with similar characteristics, as defined by the French public health agency (Santé publique France) for the PNNS (Programme National Nutrition Santé). Let's take a look at them to see what we can make of them.

*Source:*
https://www.mangerbouger.fr/PNNS/Le-PNNS/Qu-est-ce-que-le-PNNS

In [146]:
groups_1_name = 'pnns_groups_1'
sample(dataframe[groups_1_name]) # comment this line out to hide the SAMPLES

Unnamed: 0_level_0,pnns_groups_1,pnns_groups_1_index
sample,Unnamed: 1_level_1,Unnamed: 2_level_1
0,sugary-snacks,12
1,Composite foods,2
2,Cereals and potatoes,8
3,Fruits and vegetables,4
4,Beverages,6
5,,11
6,Sugary snacks,3
7,Salty snacks,9
8,unknown,0
9,Alcoholic beverages,10


In [147]:
pd.DataFrame(dataset.get('categoricals').loc[groups_1_name]).rename_axis(mapper='stats', axis=0) # comment this line out to hide the DETAILED REPORT

Unnamed: 0_level_0,pnns_groups_1
stats,Unnamed: 1_level_1
count,772070
fill_%,99.99
unique,12
uniques_%,0
nans,42
nans_%,0.01
top,unknown
freq,465939


* The PNNS groups 1 feature seems to be almost completely filled, with a fill rate close to 100%. Let's break the feature down using our *filter_cat_feature()* function.

In [148]:
groups_1_report = filter_cat_feature(dataframe=dataframe, by=groups_1_name, minimum_coverage=90)


Minimum coverage: 90%
Filtered "pnns_groups_1": 7/13 (53.85%)
Selected: ['unknown', 'Sugary snacks', 'Fish Meat Eggs', 'Milk and dairy products', 'Composite foods', 'Cereals and potatoes', 'Fruits and vegetables', 'others']



In [149]:
# groups_1_report # uncomment this line to see the DETAILED REPORT

Actually, the "pnns_groups_1" feature is not filled correctly:
* PNNS groups 1 is mainly filled with the "unknown" class, which represents more than 60% of the population and gives us no information.
* We cannot infer the PNNS group from the nutritional values or any other feature.
* We will filter the "unknown" class out.

In [150]:
dataframe = dataframe[~dataframe[groups_1_name].str.contains('unknown', na=True)] # na=True: rows with a NaN group are dropped as well
dataset.save_version(dataframe, 'pnns_groups')


Version 5: "pnns_groups" saved



In [151]:
groups_1_report = filter_cat_feature(dataframe=dataframe, by=groups_1_name, minimum_coverage=90)


Minimum coverage: 90%
Filtered "pnns_groups_1": 8/11 (72.73%)
Selected: ['Sugary snacks', 'Fish Meat Eggs', 'Milk and dairy products', 'Composite foods', 'Cereals and potatoes', 'Fruits and vegetables', 'Beverages', 'Fat and sauces', 'others']



In [152]:
# groups_1_report # uncomment this line to see the DETAILED REPORT

Now that the "unknown" class has been filtered out, we can clearly see that the main PNNS groups present in the dataset are "Sugary snacks", "Fish Meat Eggs" and "Milk and dairy products".
* Since our application focuses on suggesting healthier snacks, we can't complain about this output.
* We filter the database to **select** the sugary snacks only.
* We remove the "pnns_groups_1" feature, which now has the same value for our whole dataset.

In [153]:
dataframe = dataframe[dataframe['pnns_groups_1'] == 'Sugary snacks'].drop('pnns_groups_1', axis=1)
dataset.save_version(dataframe, 'sugary_snacks')


Version 6: "sugary_snacks" saved



#### *Dataset report*

In [154]:
dataset.report()


Samples dropped: 245181/306131 (80.09%)
Features dropped: 1/9 (11.11%)



In [155]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
energy_100g,51221.0,84.04,9729,15.96,371.0,0.61,1670.617051,597.713856,0.0,1335.0,1724.0,2090.0,19305.0
saturated-fat_100g,51218.0,84.03,9732,15.97,7255.0,11.9,8.362626,8.62262,0.0,0.3,6.1,14.0,400.0
sugars_100g,51166.0,83.95,9784,16.05,979.0,1.61,38.119972,20.419296,0.0,24.0,36.1,52.0,105.0
proteins_100g,51072.0,83.79,9878,16.21,2992.0,4.91,5.164273,4.036653,0.0,1.9,5.5,7.1,100.0
carbohydrates_100g,50967.0,83.62,9983,16.38,127.0,0.21,55.662358,17.686448,0.0,46.0,55.0,65.0,105.0
fat_100g,50945.0,83.58,10005,16.42,4298.0,7.05,17.16143,13.964851,0.0,1.2,17.0,27.0,100.0
salt_100g,50288.0,82.51,10662,17.49,7020.0,11.52,0.430323,1.442337,0.0,0.03,0.2,0.6,98.0


In [156]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pnns_groups_2,60950,100.0,4,0.01,0,0.0,Sweets,26062


### *c -* PNNS groups 2

Let's see what we can learn about the PNNS groups 2.

In [159]:
groups_2_name = 'pnns_groups_2'

In [160]:
pd.DataFrame(dataset.get('categoricals').loc[groups_2_name]).rename_axis(mapper='stats', axis=0)

Unnamed: 0_level_0,pnns_groups_2
stats,Unnamed: 1_level_1
count,60950
fill_%,100
unique,4
uniques_%,0.01
nans,0
nans_%,0
top,Sweets
freq,26062


In [161]:
groups_2_report = filter_cat_feature(dataframe=dataframe, by=groups_2_name, minimum_coverage=100)


Minimum coverage: 100%
Filtered "pnns_groups_2": 0/4 (0.0%)
Selected: ['Sweets', 'Biscuits and cakes', 'Chocolate products', 'Pastries']



In [162]:
# groups_2_report # uncomment this line to see the DETAILED REPORT

* It seems that PNNS groups 2 are always filled when PNNS groups 1 are filled.
* This is good news for us, as we will be able to offer more precise suggestions to our users. We will leave these samples untouched for the moment.

#### *Dataset report*

In [163]:
dataset.report()


Samples dropped: 245181/306131 (80.09%)
Features dropped: 1/9 (11.11%)



In [164]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
energy_100g,51221.0,84.04,9729,15.96,371.0,0.61,1670.617051,597.713856,0.0,1335.0,1724.0,2090.0,19305.0
saturated-fat_100g,51218.0,84.03,9732,15.97,7255.0,11.9,8.362626,8.62262,0.0,0.3,6.1,14.0,400.0
sugars_100g,51166.0,83.95,9784,16.05,979.0,1.61,38.119972,20.419296,0.0,24.0,36.1,52.0,105.0
proteins_100g,51072.0,83.79,9878,16.21,2992.0,4.91,5.164273,4.036653,0.0,1.9,5.5,7.1,100.0
carbohydrates_100g,50967.0,83.62,9983,16.38,127.0,0.21,55.662358,17.686448,0.0,46.0,55.0,65.0,105.0
fat_100g,50945.0,83.58,10005,16.42,4298.0,7.05,17.16143,13.964851,0.0,1.2,17.0,27.0,100.0
salt_100g,50288.0,82.51,10662,17.49,7020.0,11.52,0.430323,1.442337,0.0,0.03,0.2,0.6,98.0


In [165]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pnns_groups_2,60950,100.0,4,0.01,0,0.0,Sweets,26062


### *d -* Nutritional values

Let's now take a look at the nutritional values, i.e. the data of the nutrition facts label.
* All of the nutritional values have a share of NaNs of around 16%. This should not surprise us, as not every food product contains every nutritional group.
* We consider this normal.

*https://www.fda.gov/food/new-nutrition-facts-label/how-understand-and-use-nutrition-facts-label*


In [166]:
dataframe_num = dataframe.select_dtypes([int, float])
columns_100 = [column for column in dataframe_num.columns.to_list() if column.endswith('_100g')]
columns_100.remove('energy_100g') # energy per 100g can be higher than 100

In [167]:
sum_nans(dataframe_num)

Unnamed: 0_level_0,nans,nans_%
feature,Unnamed: 1_level_1,Unnamed: 2_level_1
salt_100g,10662,17.49
fat_100g,10005,16.42
carbohydrates_100g,9983,16.38
proteins_100g,9878,16.21
sugars_100g,9784,16.05
saturated-fat_100g,9732,15.97
energy_100g,9729,15.96


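* Some rows are fully empty, while others are only partially filled.
* We get rid of the rows that have too few filled features using the *thresh* argument of pandas *dropna()*, which sets the minimum number of non-NaN values a row needs to be kept.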
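
Here is a toy example of how the *thresh* argument behaves. The values are made up and the frame has 4 features instead of 7, so that 75% corresponds to 3 filled features:

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    'energy_100g': [1594.0, np.nan, 946.0, 1674.0],
    'fat_100g': [22.0, np.nan, np.nan, np.nan],
    'sugars_100g': [21.9, np.nan, 56.0, 93.3],
    'salt_100g': [0.1, 0.2, 0.0, np.nan],
})

threshold_percent = 75
threshold = round(toy.shape[1] * threshold_percent / 100)  # 3 of the 4 features
# A row is kept only if it has at least `threshold` non-NaN values
kept = toy.dropna(thresh=threshold)
print(kept.index.to_list())  # [0, 2]
```

Row 0 (4 filled values) and row 2 (3 filled values) are kept, while row 1 (1 filled value) and row 3 (2 filled values) are dropped.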

In [168]:
threshold_percent = 75
threshold = round(dataframe_num.shape[1] * threshold_percent/100)
print(f'Rows filled at {threshold_percent}%, or less than {threshold}/{dataframe_num.shape[1]} filled features will be removed')

Rows filled at 75%, or less than 5/7 filled features will be removed


In [169]:
dataframe_num = dataframe_num.dropna(thresh=threshold)
dataframe_num.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0


In [170]:
sum_nans(dataframe_num)

Unnamed: 0_level_0,nans,nans_%
feature,Unnamed: 1_level_1,Unnamed: 2_level_1
salt_100g,1208,2.37
carbohydrates_100g,155,0.3
fat_100g,155,0.3
saturated-fat_100g,108,0.21
energy_100g,87,0.17
sugars_100g,32,0.06
proteins_100g,30,0.06


However, we should still check the quality of this data and try to get rid of the outliers. 
* We start by eliminating rows where any value per 100 grams is higher than 100, which is theoretically impossible.
* We create the "any>100" filter to flag the rows in which any per-100-gram value is higher than 100.
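
As a side note, the check described above can also be expressed with vectorized comparisons instead of a row-wise function. The sketch below runs on a made-up toy frame (the values and the `gram_columns` name are illustrative assumptions) and shows that a comparison against NaN evaluates to False, so missing values never flag a row:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "fat_100g": [10.0, 120.0, np.nan],
    "sugars_100g": [30.0, 20.0, 101.0],
    "energy_100g": [1500.0, 900.0, 400.0],  # energy is not a mass in grams, so it may exceed 100
})
gram_columns = ["fat_100g", "sugars_100g"]  # energy is excluded on purpose

# Element-wise comparison, then "any" across the columns of each row.
# NaN > 100 is False, so no fillna is needed here.
any_over_100 = (toy[gram_columns] > 100).any(axis=1)
valid = toy[~any_over_100]
print(any_over_100.tolist())
```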

In [171]:
dataframe_num['any>100'] = dataframe_num[columns_100].apply(lambda row: any(row.fillna(0).values > 100), axis=1)
dataframe_num.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,any>100
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0,False
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0,False
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0,False
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0,False
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0,False


* We **filter out** the outliers, keeping only the valid rows.

In [172]:
outliers = len(dataframe_num[dataframe_num['any>100']])
print(f'\n>100 : {outliers} samples\n')
dataframe_num = dataframe_num[~dataframe_num['any>100']].drop('any>100', axis=1)
dataframe_num.head()


>100 : 4 samples



Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0


Now that we have eliminated all the rows containing per-100-gram values higher than 100, let's dig deeper into cleaning the nutritional values.
* Having learnt about the nutrition facts label, we know that saturated fat is a subset of fat, which means that the saturated fat content cannot be higher than the fat content.
* Likewise, sugars are a subset of carbohydrates, so the sugar content cannot be higher than the carbohydrate content.
* We also know that the sum of the nutritional groups, excluding the subsets (saturated fat and sugars) and energy, must be less than or equal to 100 grams.
* We create filters which apply these conditions to the dataset.

In [173]:
dataframe_num['fat>=saturated'] = dataframe_num['fat_100g'].fillna(0) >= dataframe_num['saturated-fat_100g'].fillna(0)
dataframe_num['carbs>=sugars'] = dataframe_num['carbohydrates_100g'].fillna(0) >= dataframe_num['sugars_100g'].fillna(0)
dataframe_num['total_100g'] = dataframe_num['carbohydrates_100g'].fillna(0) + dataframe_num['salt_100g'].fillna(0) + dataframe_num['proteins_100g'].fillna(0) + dataframe_num['fat_100g'].fillna(0)
dataframe_num['total<=100'] = dataframe_num['total_100g'] <= 100
dataframe_num.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,fat>=saturated,carbs>=sugars,total_100g,total<=100
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0,True,True,54.0,True
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0,True,True,91.3,True
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0,True,True,57.8,True
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0,True,True,93.3,True
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0,True,True,94.41,True


* We **filter out** the rows that don't match the filters.

In [174]:
dataframe_num = dataframe_num[dataframe_num['fat>=saturated'] & dataframe_num['carbs>=sugars'] & dataframe_num['total<=100']].drop(['fat>=saturated', 'carbs>=sugars', 'total<=100', 'total_100g'], axis=1)
dataframe_num.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0


In [175]:
sum_nans(dataframe_num)

Unnamed: 0_level_0,nans,nans_%
feature,Unnamed: 1_level_1,Unnamed: 2_level_1
salt_100g,1201,2.38
saturated-fat_100g,105,0.21
energy_100g,83,0.16
sugars_100g,31,0.06
proteins_100g,23,0.05
fat_100g,10,0.02
carbohydrates_100g,3,0.01


Some secondary data is still missing; we will try to handle this in the next chapter. 
* Let's join the categorical features back onto the numerical dataframe.

In [176]:
dataframe = dataframe_num.join(dataframe.select_dtypes('object'), how='left')
dataframe.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,pnns_groups_2
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0,Biscuits and cakes
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0,Sweets
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0,Sweets
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0,Sweets
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0,Sweets


In [177]:
dataset.save_version(dataframe, 'nutritional_values')


Version 7: "nutritional_values" saved



#### *Dataset report*

In [178]:
dataset.report()


Samples dropped: 10573/60950 (17.35%)
Features dropped: 0/8 (0.0%)



In [179]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
carbohydrates_100g,50374.0,99.99,3,0.01,120.0,0.24,55.521533,17.547006,0.0,46.0,55.0,65.0,100.0
fat_100g,50367.0,99.98,10,0.02,4232.0,8.4,17.145693,13.918258,0.0,1.2,17.0,27.0,97.0
proteins_100g,50354.0,99.95,23,0.05,2929.0,5.81,5.154817,3.957629,0.0,2.0,5.5,7.1,85.0
sugars_100g,50346.0,99.94,31,0.06,807.0,1.6,38.085338,20.15095,0.0,24.2,36.1,52.0,100.0
energy_100g,50294.0,99.84,83,0.16,194.0,0.39,1674.34037,581.875118,0.0,1339.0,1728.0,2090.0,19305.0
saturated-fat_100g,50272.0,99.79,105,0.21,6812.0,13.52,8.395266,8.354746,0.0,0.4,6.3,14.1,90.0
salt_100g,49176.0,97.62,1201,2.38,6450.0,12.8,0.399494,0.796125,0.0,0.03,0.2,0.6,67.0


In [180]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pnns_groups_2,50377,100.0,4,0.01,0,0.0,Biscuits and cakes,21183


## *D -* Missing values replacement

Let us remember that the goal of the dataset exploration is to develop the idea of a suggestion engine that recommends healthier snacks to French consumers. 
* We have now filtered the dataset to get rid of the redundant features and to keep the French sugary snack products which present theoretically correct nutritional values.
* Before trying to create our own score (and our own definition) for product healthiness, we should look into existing attempts at evaluating a product's healthiness.
* The nutriscore and the Nova score are both attempts at determining whether a product is healthy. We got rid of them in an earlier part of this notebook (*3.A.a -* Nans).
* However, knowing that the nutriscore calculation is based on the nutritional values, we might be able to infer it using the nutritional values at our disposal and machine learning algorithms. This could help us create a Proof Of Concept (POC).

### *a -* Nutritional values

Although we do not have the mathematical function used to calculate the nutriscore, we know that it uses data that we described as secondary in the last chapter (such as saturated fat). 
* We will try to impute this data using a KNN imputer.
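
For readers unfamiliar with KNN imputation, here is a minimal sketch on made-up toy data. It is a variation, not the method used in this notebook: it standardizes the features before imputing and maps the result back to the original units, on the assumption that a large-scale feature such as energy would otherwise dominate the neighbor distances. The toy values and `n_neighbors=2` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: two "light" products, two "rich" ones,
# and one light product with a missing energy value.
toy = pd.DataFrame({
    "sugars_100g": [10.0, 12.0, 50.0, 52.0, 11.0],
    "fat_100g": [5.0, np.nan, 30.0, 31.0, 6.0],
    "energy_100g": [400.0, 420.0, 2000.0, 2050.0, np.nan],
})

# StandardScaler disregards NaN when fitting and keeps them when transforming.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Each missing value is replaced by the mean of that feature among the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Map the imputed data back to the original units.
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    index=toy.index,
    columns=toy.columns,
)
print(imputed.round(2))
```

The missing fat value of the second row is filled from the other light products, so it lands close to their fat content rather than near the dataset mean.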

In [181]:
dataframe_num = dataframe.select_dtypes([int, float])
dataframe_num.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0


In [182]:
pd.DataFrame({'Standard Deviation': [dataframe_num[column].std() for column in dataframe_num.columns]}, index=dataframe_num.columns)

Unnamed: 0,Standard Deviation
salt_100g,0.796125
saturated-fat_100g,8.354746
sugars_100g,20.15095
carbohydrates_100g,17.547006
fat_100g,13.918258
proteins_100g,3.957629
energy_100g,581.875118


In [183]:
knn_imputer = KNNImputer()
dataframe_num = pd.DataFrame(knn_imputer.fit_transform(dataframe_num), index=dataframe_num.index, columns=dataframe_num.columns)

In [184]:
sum_nans(dataframe_num)

Unnamed: 0_level_0,nans,nans_%
feature,Unnamed: 1_level_1,Unnamed: 2_level_1
salt_100g,0,0.0
saturated-fat_100g,0,0.0
sugars_100g,0,0.0
carbohydrates_100g,0,0.0
fat_100g,0,0.0
proteins_100g,0,0.0
energy_100g,0,0.0


* The KNN imputer can output values that don't fit our requirements, as it imputes from the nearest neighbors. 
* Let's apply our filters again on its results to ensure the quality of our data.
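
Since the same checks are applied more than once, they could be bundled into a single reusable function. The sketch below is one possible way to do so. The name `nutrition_mask` is hypothetical and the toy values are made up. The total is computed from carbohydrates, salt, proteins and fat, as in the pre-imputation filter of In [173].

```python
import numpy as np
import pandas as pd

def nutrition_mask(df: pd.DataFrame) -> pd.Series:
    """Return True for the rows that pass every plausibility check (hypothetical helper)."""
    filled = df.fillna(0)  # a missing value counts as 0 g, as in the notebook's filters
    gram_columns = [c for c in df.columns if c.endswith("_100g") and c != "energy_100g"]
    in_range = (filled[gram_columns] <= 100).all(axis=1)
    fat_ok = filled["fat_100g"] >= filled["saturated-fat_100g"]
    carbs_ok = filled["carbohydrates_100g"] >= filled["sugars_100g"]
    total = filled[["carbohydrates_100g", "salt_100g", "proteins_100g", "fat_100g"]].sum(axis=1)
    return in_range & fat_ok & carbs_ok & (total <= 100)

# Toy rows, made up for illustration: valid, saturated fat > fat, total > 100, valid with a NaN.
toy = pd.DataFrame({
    "salt_100g": [0.1, 0.0, 1.0, np.nan],
    "saturated-fat_100g": [15.5, 10.0, 5.0, 2.0],
    "sugars_100g": [21.9, 5.0, 40.0, 30.0],
    "carbohydrates_100g": [27.3, 10.0, 70.0, 50.0],
    "fat_100g": [22.0, 5.0, 30.0, 10.0],
    "proteins_100g": [4.6, 1.0, 10.0, 5.0],
    "energy_100g": [1594.0, 500.0, 2000.0, 1200.0],
})

mask = nutrition_mask(toy)
clean = toy[mask]
print(mask.tolist())
```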

In [185]:
# Range check: no per-100g value (energy excluded) may exceed 100
dataframe_num['any>100'] = dataframe_num[columns_100].apply(lambda row: any(row.fillna(0).values > 100), axis=1)
dataframe_num = dataframe_num[~dataframe_num['any>100']].drop('any>100', axis=1)
# Subset checks: saturated fat <= fat and sugars <= carbohydrates
dataframe_num['fat>=saturated'] = dataframe_num['fat_100g'].fillna(0) >= dataframe_num['saturated-fat_100g'].fillna(0)
dataframe_num['carbs>=sugars'] = dataframe_num['carbohydrates_100g'].fillna(0) >= dataframe_num['sugars_100g'].fillna(0)
# Total check (note: this sum uses sugars_100g, whereas the filter in In [173] used carbohydrates_100g)
dataframe_num['total_100g'] = dataframe_num['sugars_100g'].fillna(0) + dataframe_num['salt_100g'].fillna(0) + dataframe_num['proteins_100g'].fillna(0) + dataframe_num['fat_100g'].fillna(0)
dataframe_num['total<=100'] = dataframe_num['total_100g'] <= 100
# Keep only the rows passing every check, then join the categorical features back
dataframe_num = dataframe_num[dataframe_num['fat>=saturated'] & dataframe_num['carbs>=sugars'] & dataframe_num['total<=100']].drop(['fat>=saturated', 'carbs>=sugars', 'total<=100', 'total_100g'], axis=1)
dataframe = dataframe_num.join(dataframe.select_dtypes('object'), how='left')

In [186]:
pd.DataFrame({'Standard Deviation': [dataframe_num[column].std() for column in dataframe_num.columns]}, index=dataframe_num.columns)

Unnamed: 0,Standard Deviation
salt_100g,0.789483
saturated-fat_100g,8.353887
sugars_100g,20.148161
carbohydrates_100g,17.540927
fat_100g,13.91786
proteins_100g,3.957424
energy_100g,581.835072


* After the imputation of the missing values, the standard deviations of the dataset's numerical features are very similar to what they were before.

In [187]:
dataframe = dataframe_num.join(dataframe.select_dtypes('object'), how='left')
dataframe.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,pnns_groups_2
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0,Biscuits and cakes
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0,Sweets
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0,Sweets
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0,Sweets
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0,Sweets


In [188]:
dataset.save_version(dataframe,'knn_imputer')


Version 8: "knn_imputer" saved



#### *Dataset report*

In [189]:
dataset.report()


Samples dropped: 13/50377 (0.03%)
Features dropped: 0/8 (0.0%)



In [190]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
salt_100g,50364.0,100.0,0,0.0,6582.0,13.07,0.394403,0.789483,0.0,0.03,0.2,0.6,67.0
saturated-fat_100g,50364.0,100.0,0,0.0,6811.0,13.52,8.39567,8.353887,0.0,0.4,6.3,14.1,90.0
sugars_100g,50364.0,100.0,0,0.0,807.0,1.6,38.077569,20.148161,0.0,24.2,36.1,52.0,100.0
carbohydrates_100g,50364.0,100.0,0,0.0,118.0,0.23,55.519407,17.540927,0.0,46.0,55.0,65.0,100.0
fat_100g,50364.0,100.0,0,0.0,4225.0,8.39,17.147697,13.91786,0.0,1.2,17.0,27.0,97.0
proteins_100g,50364.0,100.0,0,0.0,2926.0,5.81,5.155066,3.957424,0.0,2.0,5.5,7.1,85.0
energy_100g,50364.0,100.0,0,0.0,194.0,0.39,1674.408029,581.835072,0.0,1339.0,1728.0,2091.25,19305.0


In [191]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pnns_groups_2,50364,100.0,4,0.01,0,0.0,Biscuits and cakes,21183


### *b -* Nutriscore score

Now that we have cleaned the nutritional values, we will try to determine the nutriscores in order to be able to make predictions when consumers use the application.
* First, let's recover the nutriscore. There are two features representing it:

In [192]:
nutriscore_features = features_report.loc['nutriscore']['features']
nutriscore_features

['nutriscore_score', 'nutriscore_grade']

In [193]:
sample(dataset.get(step=0)['nutriscore_score'], dataset.get(step=0)['nutriscore_grade'])

Unnamed: 0_level_0,nutriscore_score,nutriscore_score_index,nutriscore_grade,nutriscore_grade_index
sample,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,30.0,43,,1573076
1,-14.0,46,b,1105810
2,4.0,5,,439717
3,17.0,9,,430598
4,12.0,23,a,828245
5,7.0,33,,1259254
6,19.0,26,,1821811
7,28.0,20,c,732521
8,-15.0,56,,1055093
9,-6.0,37,e,1124664


* The "nutriscore_grade" feature is a categorical feature that ranges from A to E.
* The "nutriscore_score" feature is a numerical feature: the lower the "nutriscore_score" is (it can be lower than 0), the healthier the product is.

We choose to infer the "nutriscore_score", which will allow us to fill in the "nutriscore_grade" easily.

In [194]:
target = 'nutriscore_score'
dataframe = dataset.pull_features(target)
dataframe.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,pnns_groups_2,nutriscore_score
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0,Biscuits and cakes,14.0
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0,Sweets,2.0
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0,Sweets,11.0
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0,Sweets,14.0
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0,Sweets,18.0


We split the data into 3 subsets:

* A training set which represents 80 % of the samples where the "nutriscore_score" is filled.
* A validation set which represents 20 % of the samples where the "nutriscore_score" is filled.
* A test set made of the samples with missing "nutriscore_score" values, which we try to predict.


In [195]:
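# Rows with a known target are used for training/validation; rows without it are the ones to predict.
# .copy() makes both subsets independent, so the later assignment to knn_test[target] is safe.
knn_train, knn_test = dataframe[~dataframe[target].isna()].copy(), dataframe[dataframe[target].isna()].copy()
knn_X_train, knn_X_val, knn_y_train, knn_y_val = train_test_split(knn_train.select_dtypes([int,float]).drop(target, axis=1), knn_train[target], test_size=0.2)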

In [196]:
standard_scaler = StandardScaler()
knn_X_train = standard_scaler.fit_transform(knn_X_train)
knn_X_val = standard_scaler.transform(knn_X_val)
knn_X_test = standard_scaler.transform(knn_test.select_dtypes([int,float]).drop(target, axis=1))

* Using the *knn_optimizer()* function I developed in *0.B.b - Utilities*, we select the best KNN Regressor model to be used on the test set.
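
The actual implementation of *knn_optimizer()* lives in *0.B.b - Utilities* and is not reproduced here. As a rough idea of what such a search can look like, the sketch below tries an increasing number of neighbors and stops at the first pass that no longer improves the validation MSE. The function `select_k`, its `max_k` parameter and the synthetic data are all hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

def select_k(X_train, y_train, X_val, y_val, max_k=30):
    """Hypothetical stand-in: try k = 1, 2, ... and keep the model with the best validation MSE."""
    best_model, best_mse = None, np.inf
    for k in range(1, max_k + 1):
        model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        mse = mean_squared_error(y_val, model.predict(X_val))
        if mse >= best_mse:
            break  # stop at the first pass that does not improve on the best score
        best_model, best_mse = model, mse
    return best_model, best_mse

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

model, mse = select_k(X[:160], y[:160], X[160:], y[160:])
print(model.n_neighbors, round(mse, 3))
```

Stopping at the first non-improving pass keeps the search short, at the cost of possibly missing a better value of k further along.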

In [197]:
knn_regressor = knn_optimizer(KNeighborsRegressor,knn_X_train,knn_y_train,knn_X_val,knn_y_val,'MSE')


Pass 0: 1 neighbor(s), MSE: 4.9

Pass 1: 2 neighbor(s), MSE: 3.94

Pass 2: 3 neighbor(s), MSE: 3.61

Pass 3: 4 neighbor(s), MSE: 3.5

Pass 4: 5 neighbor(s), MSE: 3.42

Pass 5: 6 neighbor(s), MSE: 3.34

Pass 6: 7 neighbor(s), MSE: 3.32

Pass 7: 8 neighbor(s), MSE: 3.31

Pass 8: 9 neighbor(s), MSE: 3.32

Best pass 7: 8 neighbor(s), MSE: 3.31


* We predict the "nutriscore_score" values of the test set:

In [198]:
knn_test[target] = knn_regressor.predict(knn_X_test)
knn_test.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,pnns_groups_2,nutriscore_score
1154,1.06,3.12,4.3,36.8,6.5,32.5,1515.0,Biscuits and cakes,12.0
1587,0.026,0.0,57.0,57.0,0.1,1.26,992.0,Sweets,10.625
2210,0.07,26.0,26.0,31.0,42.0,8.7,2339.0,Chocolate products,21.125
3196,0.1,0.1,0.1,0.1,0.1,0.1,741.0,Sweets,1.75
3747,0.85,2.7,5.2,42.7,5.6,11.5,1138.0,Biscuits and cakes,6.5


In [199]:
dataframe = pd.concat([knn_train, knn_test], axis=0)
sum_nans(dataframe)

Unnamed: 0_level_0,nans,nans_%
feature,Unnamed: 1_level_1,Unnamed: 2_level_1
salt_100g,0,0.0
saturated-fat_100g,0,0.0
sugars_100g,0,0.0
carbohydrates_100g,0,0.0
fat_100g,0,0.0
proteins_100g,0,0.0
energy_100g,0,0.0
pnns_groups_2,0,0.0
nutriscore_score,0,0.0


In [200]:
dataset.save_version(dataframe, 'knn_regressor')


Version 9: "knn_regressor" saved



#### *Dataset report*

In [201]:
dataset.report()


Samples dropped: 0/50364 (0.0%)
Features dropped: -1/8 (-12.5%)



In [202]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
salt_100g,50364.0,100.0,0,0.0,6582.0,13.07,0.394403,0.789483,0.0,0.03,0.2,0.6,67.0
saturated-fat_100g,50364.0,100.0,0,0.0,6811.0,13.52,8.39567,8.353887,0.0,0.4,6.3,14.1,90.0
sugars_100g,50364.0,100.0,0,0.0,807.0,1.6,38.077569,20.148161,0.0,24.2,36.1,52.0,100.0
carbohydrates_100g,50364.0,100.0,0,0.0,118.0,0.23,55.519407,17.540927,0.0,46.0,55.0,65.0,100.0
fat_100g,50364.0,100.0,0,0.0,4225.0,8.39,17.147697,13.91786,0.0,1.2,17.0,27.0,97.0
proteins_100g,50364.0,100.0,0,0.0,2926.0,5.81,5.155066,3.957424,0.0,2.0,5.5,7.1,85.0
energy_100g,50364.0,100.0,0,0.0,194.0,0.39,1674.408029,581.835072,0.0,1339.0,1728.0,2091.25,19305.0
nutriscore_score,50364.0,100.0,0,0.0,247.0,0.49,16.879577,6.738111,-9.0,12.0,17.0,22.0,37.0


In [203]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pnns_groups_2,50364,100.0,4,0.01,0,0.0,Biscuits and cakes,21183


### *c -* Nutriscore Grade

Now that we have predicted the scores, it is pretty easy to convert them to grades (categories).

* We convert the scores to grades using a simple function that applies the official ranges to the score.
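
The same mapping can also be written as a small named function, which makes the boundaries easier to read and to test. This is only a sketch: `score_to_grade`, `GRADE_BOUNDS` and `GRADES` are hypothetical names, and the boundaries are the ones used in this notebook's conversion.

```python
from bisect import bisect_right

# Lower bounds of grades b, c, d and e; any score below the first bound is an "a".
GRADE_BOUNDS = [0, 3, 11, 19]
GRADES = "abcde"

def score_to_grade(score: float) -> str:
    """Map a nutriscore_score to its letter grade using half-open ranges [low, high)."""
    return GRADES[bisect_right(GRADE_BOUNDS, score)]

grades = [score_to_grade(s) for s in (-1, 2.0, 11.0, 18.375, 22.875)]
print(grades)
```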

In [204]:
dataframe['nutriscore_grade'] = dataframe['nutriscore_score'].apply(lambda score: 'a' if score < 0 else 'b' if score < 3 else 'c' if score < 11 else 'd' if score < 19 else 'e')
dataframe.head()

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,pnns_groups_2,nutriscore_score,nutriscore_grade
26,0.1,15.5,21.9,27.3,22.0,4.6,1594.0,Biscuits and cakes,14.0,d
179,0.2,5.8,12.8,42.8,39.6,8.7,21.0,Sweets,2.0,b
381,0.0,0.0,56.0,56.8,0.0,1.0,946.0,Sweets,11.0,d
438,0.0,0.0,93.3,93.3,0.0,0.0,1674.0,Sweets,14.0,d
470,0.1,3.53,81.67,87.86,6.42,0.03,1720.0,Sweets,18.0,d


In [205]:
dataframe

Unnamed: 0,salt_100g,saturated-fat_100g,sugars_100g,carbohydrates_100g,fat_100g,proteins_100g,energy_100g,pnns_groups_2,nutriscore_score,nutriscore_grade
26,0.1000,15.50,21.90,27.30,22.00,4.60,1594.0,Biscuits and cakes,14.000,d
179,0.2000,5.80,12.80,42.80,39.60,8.70,21.0,Sweets,2.000,b
381,0.0000,0.00,56.00,56.80,0.00,1.00,946.0,Sweets,11.000,d
438,0.0000,0.00,93.30,93.30,0.00,0.00,1674.0,Sweets,14.000,d
470,0.1000,3.53,81.67,87.86,6.42,0.03,1720.0,Sweets,18.000,d
...,...,...,...,...,...,...,...,...,...,...
1991111,0.4100,16.00,33.00,60.00,31.00,5.40,2276.0,Biscuits and cakes,22.875,e
1991985,0.0002,9.90,37.00,39.00,41.00,9.68,2427.0,Biscuits and cakes,21.250,e
1992406,0.3900,5.09,40.00,74.00,12.00,2.90,1684.6,Biscuits and cakes,18.375,d
1992473,0.1940,0.10,55.00,79.00,0.50,4.20,1393.0,Sweets,14.000,d


In [206]:
dataset.save_version(dataframe, 'knn_regressor')


Key already in index, choose another key.



#### *Dataset report*

In [207]:
dataset.report()


Samples dropped: 0/50364 (0.0%)
Features dropped: -2/8 (-25.0%)



In [208]:
dataset.get('numericals')

Unnamed: 0_level_0,count,fill_%,nans,nans_%,zeroes,zeroes_%,mean,std,min,25%,50%,75%,max
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
salt_100g,50364.0,100.0,0,0.0,6582.0,13.07,0.394403,0.789483,0.0,0.03,0.2,0.6,67.0
saturated-fat_100g,50364.0,100.0,0,0.0,6811.0,13.52,8.39567,8.353887,0.0,0.4,6.3,14.1,90.0
sugars_100g,50364.0,100.0,0,0.0,807.0,1.6,38.077569,20.148161,0.0,24.2,36.1,52.0,100.0
carbohydrates_100g,50364.0,100.0,0,0.0,118.0,0.23,55.519407,17.540927,0.0,46.0,55.0,65.0,100.0
fat_100g,50364.0,100.0,0,0.0,4225.0,8.39,17.147697,13.91786,0.0,1.2,17.0,27.0,97.0
proteins_100g,50364.0,100.0,0,0.0,2926.0,5.81,5.155066,3.957424,0.0,2.0,5.5,7.1,85.0
energy_100g,50364.0,100.0,0,0.0,194.0,0.39,1674.408029,581.835072,0.0,1339.0,1728.0,2091.25,19305.0
nutriscore_score,50364.0,100.0,0,0.0,247.0,0.49,16.879577,6.738111,-9.0,12.0,17.0,22.0,37.0


In [209]:
dataset.get('categoricals')

Unnamed: 0_level_0,count,fill_%,unique,uniques_%,nans,nans_%,top,freq
feature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
pnns_groups_2,50364,100.0,4,0.01,0,0.0,Biscuits and cakes,21183
nutriscore_grade,50364,100.0,5,0.01,0,0.0,e,21751


### *File saving*

We save our processed dataframe as a *csv* file.

In [210]:
dataframe.to_csv(f'{root_path}/dataset-processed.csv')
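
When the file is read back (for instance in the second notebook), the original row index can be restored from the first column. The sketch below illustrates this round trip on a made-up two-row frame, with an in-memory buffer standing in for the file on disk:

```python
import io
import pandas as pd

# Toy frame, made up for illustration, with a non-contiguous index like the real dataset.
toy = pd.DataFrame(
    {"sugars_100g": [21.9, 12.8], "nutriscore_grade": ["d", "b"]},
    index=[26, 179],
)

buffer = io.StringIO()  # stands in for the csv file on disk
toy.to_csv(buffer)      # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 turns the first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```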

# ***4 -*** **Dataset Cleaning Conclusions**

We have successfully:
* Cleaned,
* Filtered,
* Imputed &
* Ensured the quality of our dataset.

We are ready to proceed to the Exploratory Analysis (*second notebook*).