# Open Food Facts Course Project - Cleaning, manipulating and visualizing

First, let's import the libraries used in this project and make matplotlib display graphs inline in the Notebook.

In [1]:
from os import path # For filepath manipulation
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn

The next step is to load the Open Food Facts CSV file, which is in fact a TSV file (cells are separated with tabs).
Loading this file takes a considerable amount of time, as the file is 1 GB in size. The first thing to do is therefore to clean up the data, remove what is not needed, and save the result as a new CSV file.

This new CSV file is used as the data source in this Notebook, loading only the first **MAX_ENTRY_TO_LOAD** rows while coding.

**Don't forget to set MAX_ENTRY_TO_LOAD = None when coding is finished.** Otherwise, only a subset of the data will be processed.

* Note that this Notebook checks whether the cleaned data file exists and creates it otherwise. This process relies on three functions: the first one loads the original TSV file, the second one cleans up the original data and the third one dumps the cleaned data into a new CSV file.

The following global constants can be adapted to suit your needs:

In [2]:
# Filename of the original TSV file
ORIGINAL_TSV_FILENAME = path.join('data','OpenFoodFacts.tsv')

# Filename of the cleaned data built in this Notebook
CLEANED_CSV_FILENAME = path.join('data','OpenFoodFacts-cleaned.csv')

# If set to True, the original data file loading process is forced, even if the
# cleaned CSV file exists. Should be set to **True** when coding is finished
FORCE_LOAD_ORIGINAL_FILE = False

# Maximum fraction of NaN values accepted in a column (0.4 means 40%).
# At or above this value, the column is dropped.
MAX_NAN_PERCENT_VALUE = 0.4

# Number of rows loaded from the cleaned CSV file. Useful while coding, this value should be
# set to **None** when coding is finished.
MAX_ENTRY_TO_LOAD = 10000

# List of columns that will be removed from the Dataset (not used in this project)
COLUMNS_TO_DROP = [
    'creator', 'brands', 'brands_tags', 'categories','main_category', 'countries',
    'countries_tags', 'additives', 'additives_tags', 'categories_tags', 'states',
    'states_en', 'states_tags', 'url', 'quantity', 'packaging_tags', 'packaging',
    'created_t', 'last_modified_t', 'pnns_groups_1', 'pnns_groups_2', 'image_url',
    'image_small_url'
]


## A. Importing and cleaning the data

The data source is imported with the pandas **read_csv** function, using the parameter **sep="\t"** because the file content is tab-separated.

* Note that I've set the **low_memory** option to False in order to avoid mixed-dtype warnings when loading the file. The number of columns is quite large, so specifying the dtype of each column explicitly would be too costly.
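
As a minimal sketch of the call described above, the snippet below reads a small in-memory tab-separated sample instead of the real file. The sample rows and the name `sample_tsv` are made up for illustration and are not taken from the Open Food Facts data.

```python
import io
import pandas as pd

# Made-up, tab-separated sample standing in for the real TSV file (assumption)
sample_tsv = (
    "code\tproduct_name\tcountries_en\n"
    "0001\tApple juice\tFrance\n"
    "0002\tOat biscuits\tSwitzerland\n"
)

# sep="\t" tells pandas that cells are separated by tabs, not commas.
# low_memory=False processes the file in one pass instead of in chunks,
# which avoids mixed-dtype warnings on large files.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", low_memory=False)

print(sample_df.shape)
print(sample_df.columns.tolist())
```

Without `sep="\t"`, pandas would look for commas and load each line as a single cell.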


### Some function definitions

#### Data loading function.

In [13]:
def loadOriginalTsvFile(filename):
    # Use the filename received as parameter rather than the global constant
    print("Loading data from file",filename)
    print("Please wait...")
    df = pd.read_csv(filename,sep="\t",low_memory=False)
    print("Loading process terminated.")
    return df


#### Function to dump the cleaned data into a new CSV file

In [14]:
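def dumpCleanedCsvFile(df,filename):
    # Use the filename received as parameter rather than the global constant
    print("Dumping the cleaned Dataframe into file",filename)
    print("Please wait...")
    # The index is written as the first column of the CSV file
    df.to_csv(filename)
    print("Dumping process terminated")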

#### Cleanup function that processes the Dataframe returned by the **loadOriginalTsvFile()** function

* Note: This function modifies the Dataframe received as parameter (using inplace = True when suitable)

This function performs cleanup actions on the whole dataset. Later in this Notebook, more cleaning actions are added as we explore the content of the Open Food Facts database.

Here is a list of the cleaning actions done here:

* Drop the unused columns defined in the global parameter COLUMNS_TO_DROP
* Drop columns where the fraction of null values is at or above MAX_NAN_PERCENT_VALUE
* Drop rows where the **product_name** or **countries_en** column is empty
* Drop rows with duplicates in the **product_name** column
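
The effect of the last three steps can be illustrated on a toy Dataframe. This is only a sketch: the rows, the `mostly_empty` column and the `max_nan_fraction` variable are made up for illustration, with the threshold mirroring the value of MAX_NAN_PERCENT_VALUE.

```python
import numpy as np
import pandas as pd

# Hypothetical threshold, mirroring MAX_NAN_PERCENT_VALUE (assumption)
max_nan_fraction = 0.4

# Made-up rows, not taken from the Open Food Facts data
toy = pd.DataFrame({
    "product_name": ["Apple juice", "Oat biscuits", None, "Apple juice", "Dark chocolate"],
    "countries_en": ["France", "Switzerland", "France", "France", None],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Fraction of missing values per column: isnull() gives booleans, mean() averages them
nan_fraction = toy.isnull().mean()

# Keep only the columns whose NaN fraction is below the threshold
kept = toy.loc[:, nan_fraction < max_nan_fraction]

# Drop rows missing either key column, then keep the first row for each product name
kept = kept.dropna(subset=["product_name", "countries_en"])
kept = kept.drop_duplicates(subset=["product_name"])

print(nan_fraction)
print(kept)
```

Here `mostly_empty` is dropped because four of its five values are missing, and only the first "Apple juice" row survives the duplicate removal.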

In [15]:
def cleanOriginalData(df):

    print("Cleaning the dataframe")
    print("Please wait...")
    # Drop unused columns
    df.drop(columns=COLUMNS_TO_DROP,inplace=True)
    
    # Drop columns where the fraction of NaN values is too high.
    # Done in place so that the caller's Dataframe is really modified.
    nan_fraction = df.isnull().mean(axis=0)
    sparse_columns = nan_fraction[nan_fraction >= MAX_NAN_PERCENT_VALUE].index
    df.drop(columns=sparse_columns,inplace=True)
    
    # Drop rows with empty product_name or countries_en
    df.dropna(subset=['product_name','countries_en'],inplace=True)
    
    # Drop duplicated rows in column product_name
    df.drop_duplicates(subset=['product_name'],inplace=True)

    print("Cleaning process terminated")

    

### Loading process

Now that our loading functions are defined, the logic below skips the long processing of the original file while coding, as long as the cleaned CSV file already exists.

**Do not forget to set the global constants to production values when coding is finished.**

In [17]:
if FORCE_LOAD_ORIGINAL_FILE or not path.exists(CLEANED_CSV_FILENAME):
    df = loadOriginalTsvFile(ORIGINAL_TSV_FILENAME)
    cleanOriginalData(df)
    dumpCleanedCsvFile(df, CLEANED_CSV_FILENAME)
else:
    print("Cleaned CSV file found. Original data file processing is skipped")

Cleaned CSV file found. Original data file processing is skipped
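
The same load-or-rebuild logic can be wrapped in a small helper, sketched below on a toy Dataframe and a temporary directory. The names `load_or_build` and `build_toy` are hypothetical and not part of this Notebook; they only illustrate the pattern.

```python
import os
import tempfile
import pandas as pd

def load_or_build(cache_path, build, force=False):
    """Return the cached Dataframe if present, otherwise build and cache it.

    Hypothetical helper: `build` is any zero-argument function returning a Dataframe.
    """
    if force or not os.path.exists(cache_path):
        frame = build()
        frame.to_csv(cache_path)
        return frame
    return pd.read_csv(cache_path, index_col=0)

calls = []

def build_toy():
    # Stand-in for the expensive load and cleanup of the original file
    calls.append(1)
    return pd.DataFrame({"product_name": ["Apple juice", "Oat biscuits"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "toy-cleaned.csv")
    first = load_or_build(cache_file, build_toy)              # cache missing: build and write
    second = load_or_build(cache_file, build_toy)             # cache present: read the file
    third = load_or_build(cache_file, build_toy, force=True)  # forced rebuild

print("Number of rebuilds:", len(calls))
```

The `force` argument plays the role of FORCE_LOAD_ORIGINAL_FILE: it triggers a rebuild even when the cached file is present.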


In [20]:
if MAX_ENTRY_TO_LOAD is not None:
    print("Loading the first",MAX_ENTRY_TO_LOAD,"rows from ",CLEANED_CSV_FILENAME)
else:
    print("Loading data from ",CLEANED_CSV_FILENAME)

print("Please wait...")
df = pd.read_csv(CLEANED_CSV_FILENAME,low_memory=False, nrows=MAX_ENTRY_TO_LOAD, index_col=0)
print("Dataframe loaded")

print('Number of rows   :',format(df.shape[0]))
print('Number of columns:',format(df.shape[1]))


Loading the first 10000 rows from  data/OpenFoodFacts-cleaned.csv
Please wait...
Dataframe loaded
Number of rows   : 10000
Number of columns: 140
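
The roles of **index_col=0** and **nrows** in the call above can be sketched on a toy round trip. The Dataframe and its index values are made up for illustration.

```python
import io
import pandas as pd

# Made-up data with a non-contiguous index, as left behind after dropping rows
toy = pd.DataFrame(
    {"product_name": ["Apple juice", "Oat biscuits", "Dark chocolate"]},
    index=[10, 42, 77],
)

# to_csv() writes the index as the first, unnamed column
csv_text = toy.to_csv()

# index_col=0 restores that column as the index; nrows=None reads every row
full = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=None)

# A small nrows value loads only the first rows, which is handy while coding
subset = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=2)

print(full.index.tolist())
print(len(subset))
```

Without `index_col=0`, the saved index would come back as an extra data column named `Unnamed: 0`.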
