## 0. Notebook Parameters

---

### Notebook Settings

In [8]:
"""Google Colab settings"""
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

'Google Colab settings'

time: 2.86 ms (started: 2021-03-01 17:45:00 +01:00)


In [9]:
"""Jupyter settings"""
# Enable autoreload
%load_ext autoreload
%autoreload 2

# Disable Jedi-based tab completion
%config Completer.use_jedi = False

# Measure Runtime
# !pip install ipython-autotime
%load_ext autotime

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 8.12 ms (started: 2021-03-01 17:45:00 +01:00)


### Imported Packages

#### Packages Usually Needed

In [10]:
"""Packages for manipulation of vectors, arrays, dataframes"""
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None) # Change display settings of pandas

"""Packages for cleaning dataset"""
import json
import string
import unicodedata

"""Packages for data visualization"""
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

time: 1.84 ms (started: 2021-03-01 17:45:00 +01:00)


#### Packages Specific to the Notebook

In [11]:
# Natural language processing: n-gram ranking
# (unicodedata and matplotlib are already imported in the previous cell)
import re
import nltk
from nltk.corpus import stopwords

# Extra words to ignore in the analysis, on top of the NLTK stop word lists
ADDITIONAL_STOPWORDS = ['covfefe']

time: 642 µs (started: 2021-03-01 17:45:00 +01:00)


## 1. Computational Extraction of N-grams

The method follows the tutorial [From DataFrame to N-Grams](https://towardsdatascience.com/from-dataframe-to-n-grams-e34e29df3460):
> A quick-start guide to creating and visualizing n-gram ranking using `nltk` for natural language processing.
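
As a rough illustration of what "n-gram ranking" means, the sketch below slides a window of size `n` over a token list and counts the resulting tuples. It uses only the standard library rather than `nltk`, and the helper names `ngrams` and `rank_ngrams` as well as the sample sentence are made up for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    # zip over n shifted copies of the list yields every window of size n
    return list(zip(*(tokens[i:] for i in range(n))))

def rank_ngrams(tokens, n, top=3):
    """Return the `top` most frequent n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(top)

# Illustrative tokens only, not taken from the dataset
tokens = "poulet fermier eleve en plein air poulet fermier origine france".split()
bigram_ranking = rank_ngrams(tokens, 2)
print(bigram_ranking)
```

The notebook itself relies on `nltk.ngrams` and `pandas.Series.value_counts` for the same two steps.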

### Import the Data

In [12]:
# Load the full cleaned dataset
file = '../raw_data/ocr_labeled.csv'
off_df_base = pd.read_csv(file)

time: 3.86 s (started: 2021-03-01 17:45:00 +01:00)


In [13]:
# Work on a deep copy so that the CSV does not have to be reloaded
df = off_df_base.copy()  # Reset the working DataFrame
# df = df[:10000].copy()  # Optional: work on a sample of the dataset

time: 76.4 ms (started: 2021-03-01 17:45:04 +01:00)


In [14]:
# Brief look at the dataset
print(f"""Shape of the dataset: {df.shape}
""")
print(f"""Column types of the dataset: 
{df.dtypes}
""")
print(f"""Head of the dataset:""")
display(df.head(2))

Shape of the dataset: (434896, 6)

Column types of the dataset: 
barcode           int64
clean_text       object
fr_text          object
source           object
pnns_groups_1    object
pnns_groups_2    object
dtype: object

Head of the dataset:


Unnamed: 0,barcode,clean_text,fr_text,source,pnns_groups_1,pnns_groups_2
0,3199660476748,ne eleve abattu en bretagne les eleveurs de bretagne decoupe de podlet noir ferfier labe caracteristiques certifiees fermiereleve en plein air duree delevage jours minimum de cereales alimente avec de vegetaux mineraux et vitamines do mation voir etiquette poidsprix a conserver entre oc et c a consommer cuit a caeurdate limite de co le rheu certifie par certisimmeuble le millepertuis les landes dap produit frais classe a pour toute reclamation sadresser a fermiers dargoat bp ploufragan decoupe et conditionne par ldc bretagne bp quintin homologation n la ce abattoir agree nfr ce fltplt ferlr sat elbretpf origine france yo volaise prixtrg poids net prix a payer kg lot r consommer jusqu au r conserver entre o c et c expedie le loc bretagne lanfains e a au trit aste wwwconsignesdetrfr,"NE\nELEVE\nABATTU\nEN BRETAGNE\nLES ÉLEVEURS\nDE BRETAGNE\nDécoupe de\nPodlet noir\nferfier\nlabe\nCaractéristiques certifiées: Fermier-élevé en plein air. Durée d'élevage 81 jours minimum.\n75% de céréales.\nAlimenté avec 100% de végétaux, minéraux et vitamines do mation : voir étiquette poids/prix\nA conserver entre O°C et +4°C -A consommer cuit à caeur-Date limite de co 35650 Le Rheu\nCertifié par: CERTIS-Immeuble Le Millepertuis- Les Landes d'Ap\nProduit frais\nClasse A\nPour toute réclamation, s'adresser à\nFERMIERS D'ARGOAT: BP77-22440 PLOUFRAGAN\nDécoupé et conditionné par: LDC Bretagne BP 256-2280 QUINTIN\nHomologation\nN° LA/02/75\n099 002\nCE\n2\nAbattoir agréé n'FR 22.099.002 CE\n2FLT.PLT FER.LR SAT\nEL.BRET.PF 2\nORIGINE France\nyo\nVOLAISE\nPrixtrg\nPoids net\nPrix a payer\n0,240kg\nLot\n005809 19 45\nR consommer jusqu au\nR conserver entre O C et +4 C\n10/04/18\nExpedie Le 30/03/18\nLOC BRETAGNE 22 LANFAINS\n22.039.002\ne01\n172225-0/196A\n3 19966014\nAU TRIT ASTE\nwww.CONSIGNESDETRFR\n",/319/966/047/6748/1.json,fish meat eggs,meat
1,3199660219192,ker chant local decoupes de poulet conditionne par ldc bretagne lanfains rais classe a origine france prodtre et c offre speciale ne eleve prepare local dans notre region cuisdej plt sat kerchant fx origine france volaille francaise prixkg paids net prin apayer kg to a consommer jusqu au r conserver entre d c et c expedie le loc bretagne lanfains autrit pens,"Ker\nchant\n100% LOCAL\nDecoupes de\nPOULET\nConditionné par LDC Bretagne\nLanfains (22) 18\nrais Classe A\nOrigine FRANCE. Prodtre 0 et +4""C\nOFFRE\nSPECIALE\nNE ELEVE PREPARE\n100% LOCAL\nDANS NOTRE REGION\n1CUIS.DEJ. PLT SAT\nKERCHANT FX 1\nORIGINE France\nVOLAILLE\nFRANÇAISE\nPrix/kg\nPaids net\nPrin apayer\n1,000kg\nto\n005806444\nA consommer jusqu au\n27/09/18\nR conserver entre D C et 4 C\nExpedie Le\nLOC BRETAGNE 22 LANFAINS\n2.09.02\n256-106-0/1818\nAUTRIT\nPENS\n3 199660112 19192\n",/319/966/021/9192/1.json,fish meat eggs,meat


time: 11.9 ms (started: 2021-03-01 17:45:04 +01:00)


### Basic Cleaning

In [15]:
def basic_clean(text):
    """
    Clean a raw text and return a list of tokens.

    The text is normalized to ASCII, lowercased, and stripped of digits
    and punctuation. Every word that is not a stop word is then lemmatized.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    # A set gives fast membership tests; the name avoids shadowing
    # the imported `stopwords` module
    stop_words = set(nltk.corpus.stopwords.words('english')
                     + nltk.corpus.stopwords.words('french')
                     + ADDITIONAL_STOPWORDS)
    words = (unicodedata.normalize('NFKD', text)
             .encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'\d+', '', words)  # Remove numbers
    words = re.sub(r'[^\w\s]', '', words).split()  # Remove punctuation, then tokenize
    return [wnl.lemmatize(word) for word in words if word not in stop_words]

time: 850 µs (started: 2021-03-01 17:45:04 +01:00)
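
To make the normalization steps of `basic_clean` easier to follow, here is a minimal sketch that applies only the Unicode, digit and punctuation handling to one phrase from the dataset. The helper name `normalize_text` is hypothetical, and stop word removal and lemmatization are deliberately left out so that the example runs without any NLTK data.

```python
import re
import unicodedata

def normalize_text(text):
    """Hypothetical helper: normalization steps only, no stop words, no lemmatizer."""
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the combining marks
    ascii_text = (unicodedata.normalize('NFKD', text)
                  .encode('ascii', 'ignore')
                  .decode('ascii')
                  .lower())
    no_digits = re.sub(r'\d+', '', ascii_text)    # Remove numbers
    no_punct = re.sub(r'[^\w\s]', '', no_digits)  # Remove punctuation
    return no_punct.split()

sample = "Découpé et conditionné par: LDC Bretagne BP 256"
tokens = normalize_text(sample)
print(tokens)
```

Accents are removed rather than kept, so `Découpé` becomes `decoupe`; this matches the accent-free tokens visible in the `clean_text` column.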


In [16]:
# Apply the function to the whole `fr_text` column (the list is converted to a single string)
words = basic_clean(''.join(str(df['fr_text'].tolist())))

time: 2min 8s (started: 2021-03-01 17:45:04 +01:00)


In [17]:
# Check that the token list keeps repeated occurrences (it is not deduplicated)
from collections import Counter

x = 'avant'
d = Counter(words)
print('{} has occurred {} times'.format(x, d[x]))

avant has occurred 62337 times
time: 2.36 s (started: 2021-03-01 17:47:13 +01:00)


### Loop over the Function of Extraction

In [18]:
# Create a list of categories
categories_1 = list(df.pnns_groups_1.unique())

time: 20.9 ms (started: 2021-03-01 17:47:15 +01:00)


In [19]:
# Tokenize the texts of the whole column (same content as `words` above)
words_global = basic_clean(''.join(str(df['fr_text'].tolist())))

# Create a dict of tokenized texts per category
words_categories_1 = {k: basic_clean(''.join(
    str(df['fr_text'][df['pnns_groups_1'] == k].tolist()))
                                    ) for k in categories_1
                     }

time: 4min 28s (started: 2021-03-01 17:47:15 +01:00)


In [20]:
from functools import reduce

# Initialize the list that will store the output of every iteration
main_df_list = []

# Loop over the n-gram sizes (1 to 49)
for ng in range(1, 50):
    
    # Create an empty list to store all the dataframes before merging them
    df_list = []
    
    # Count the n-grams over the whole corpus (all categories together)
    n_grams_global = (pd.Series(nltk.ngrams(words, ng)).value_counts())
    df_list.append(
        pd.DataFrame({'pattern': n_grams_global.index, 
                      'global_occurences': n_grams_global.values,
                     })
    )
    
    # Get the n-grams for each category:
    for cat_k in words_categories_1.keys():
        n_grams_cat = (pd.Series(nltk.ngrams(words_categories_1[cat_k], ng)).value_counts())
        df_list.append(
            pd.DataFrame({'pattern': n_grams_cat.index,
                          cat_k: n_grams_cat.values,
                         })
        )
    
    # Merge the dataframes on 'pattern' (inner join: only patterns present in every dataframe are kept)
    n_df = reduce(lambda left, right: pd.merge(left, right, on='pattern'), df_list)
    
    # Label the rows with the n-gram size
    n_gram_size = str(ng) + '-grams'
    n_df.insert(0, 'n_gram_size', n_gram_size)
    
    # Append the dataframe of this n-gram
    main_df_list.append(n_df)

# Concatenate the dataframes of all n-gram sizes after the loop
ngrams_df = pd.concat(main_df_list, ignore_index=True)

KeyboardInterrupt: 

time: 35min 46s (started: 2021-03-01 17:51:44 +01:00)
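
The logic of one loop iteration can be sketched in plain Python: count the n-grams globally and per category, then keep only the patterns found in every category, which is what the default inner join of `pd.merge` does. This is an analogue built on `collections.Counter`, not the notebook's pandas code; `ngram_counts`, `shared_ngram_table` and the toy token lists are hypothetical names and data.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (tuples of n consecutive tokens) in a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def shared_ngram_table(global_tokens, tokens_by_category, n):
    """Build rows of global and per-category counts for shared n-grams."""
    global_counts = ngram_counts(global_tokens, n)
    category_counts = {cat: ngram_counts(toks, n)
                       for cat, toks in tokens_by_category.items()}
    rows = []
    for pattern, total in global_counts.most_common():
        # Inner-join behaviour: skip patterns missing from any category
        if all(pattern in counts for counts in category_counts.values()):
            # Key spelled as in the notebook's column name
            row = {'pattern': pattern, 'global_occurences': total}
            row.update({cat: counts[pattern]
                        for cat, counts in category_counts.items()})
            rows.append(row)
    return rows

# Toy data for illustration only
toy_categories = {
    'meat': "origine france poulet origine france".split(),
    'beverages': "jus origine france".split(),
}
toy_global = toy_categories['meat'] + toy_categories['beverages']
table = shared_ngram_table(toy_global, toy_categories, 2)
print(table)
```

Because of the inner join, a pattern that is frequent in one category but absent from another never reaches the final table, so long n-grams are likely to produce very few rows.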


In [None]:
len(ngrams_df)

In [100]:
ngrams_df[420:425]

| | n_gram_size | pattern | global_occurences | fish meat eggs | sugary snacks | cereals and potatoes | milk and dairy products | fat and sauces | fruits and vegetables | salty snacks | beverages | composite foods |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 420 | 2-grams | (code, barresnfigurant) | 35 | 5 | 5 | 4 | 2 | 3 | 3 | 3 | 2 | 8 |
| 421 | 2-grams | (recyclernconsigne, pouvant) | 33 | 5 | 1 | 2 | 9 | 1 | 4 | 2 | 1 | 8 |
| 422 | 2-grams | (tsa, 91431n91343) | 31 | 7 | 5 | 2 | 2 | 2 | 2 | 4 | 2 | 5 |
| 423 | 2-grams | (equilibrez, bougezn) | 28 | 4 | 8 | 5 | 4 | 1 | 2 | 1 | 1 | 2 |
| 424 | 2-grams | (consommateurs, tsa) | 28 | 6 | 6 | 1 | 2 | 2 | 2 | 1 | 4 | 4 |


time: 12.2 ms (started: 2021-03-01 15:04:22 +01:00)
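
Tokens such as `barresnfigurant` or `bougezn` in the rows above look like two words fused by a stray `n`. A likely cause, stated here as an assumption rather than a verified diagnosis, is that `str(df['fr_text'].tolist())` yields the `repr` of the list, in which every newline is written as the two characters `\` and `n`; the punctuation regex then drops the backslash and keeps the `n`. The sketch below reproduces the effect on two made-up strings and compares it with joining the texts with a space; `tokenize` is a hypothetical stand-in for the regex step of `basic_clean`.

```python
import re

def tokenize(raw):
    """Hypothetical helper: lowercase, strip punctuation, split on whitespace."""
    return re.sub(r'[^\w\s]', '', raw.lower()).split()

# Made-up texts; the first one contains a real newline character
texts = ["Code barres\nfigurant", "sur emballage"]

# As in the notebook: str() of a list escapes the newline as backslash + n
via_repr = tokenize(''.join(str(texts)))

# Alternative: join the raw strings, so the newline stays whitespace
via_join = tokenize(' '.join(texts))

print(via_repr)
print(via_join)
```

Switching the notebook to the second form would change every count reported above, so the recorded outputs would have to be regenerated.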
