## 0. Notebook Parameters

---

### Notebook Settings

In [1]:
"""Google Colab settings"""
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

'Google Colab settings'

In [1]:
"""Jupyter settings"""
# Enable autoreload
%load_ext autoreload
%autoreload 2

# Disable Jedi-based autocompletion in IPython
%config Completer.use_jedi = False

# Measure Runtime
# !pip install ipython-autotime
%load_ext autotime

time: 469 µs (started: 2021-03-02 15:37:32 +01:00)


### Imported Packages

#### Packages Usually Needed

In [3]:
"""Packages for manipulation of vectors, arrays, dataframes"""
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None) # Change display settings of pandas

"""Packages for data visualization"""
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

time: 1.4 ms (started: 2021-03-02 15:37:46 +01:00)


#### Packages Specific to the Notebook

In [4]:
# natural language processing: n-gram ranking
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['covfefe', 'dont', 'e', 'g', 'kj', 'kcal',]

time: 329 ms (started: 2021-03-02 06:14:19 +01:00)


## 1. Calculate Frequencies of N-grams

### Import the Datasets

In [4]:
# Load the datasets of ngrams

# Datasets with more ngrams (noisier)
df_ngrams_ext_1_base = pd.read_csv('../raw_data/ngrams_extracted_1.csv')
df_ngrams_ext_2_base = pd.read_csv('../raw_data/ngrams_extracted_2.csv')

# Datasets with fewer ngrams
# filtered with additional stopwords: ['e', 'g', 'kj', 'kcal', 'dont']
df_ngrams_red_1_base = pd.read_csv('../raw_data/ngrams_extracted_reduced_1.csv')
df_ngrams_red_2_base = pd.read_csv('../raw_data/ngrams_extracted_reduced_2.csv')

time: 160 ms (started: 2021-03-02 15:44:43 +01:00)
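The notebook does not show how the reduced files were produced, only that they were filtered with additional stopwords. A minimal sketch of one plausible approach, assuming an n-gram is dropped whenever any of its tokens is a stopword, could look like this. The helper `drop_stopword_ngrams` and the toy rows are hypothetical and only illustrate the idea.

```python
import ast
import pandas as pd

# Stopwords quoted in the comment above the reduced datasets
EXTRA_STOPWORDS = {'e', 'g', 'kj', 'kcal', 'dont'}


def drop_stopword_ngrams(df, stopwords):
    """Keep only rows whose n-gram contains none of the stopwords (hypothetical helper)."""
    def is_clean(pattern):
        # The CSV stores each n-gram as the string form of a tuple, e.g. "('g',)"
        tokens = ast.literal_eval(pattern)
        return not any(token in stopwords for token in tokens)

    return df[df['pattern'].apply(is_clean)].reset_index(drop=True)


# Toy data for illustration only (not taken from the real files)
toy = pd.DataFrame({
    'pattern': ["('g',)", "('sucre',)", "('sel', 'kcal')", "('sucre', 'sel')"],
    'global_occurences': [10, 7, 3, 2],
})

filtered = drop_stopword_ngrams(toy, EXTRA_STOPWORDS)
print(filtered)
```

Whether the original filtering removed whole n-grams or only the stopword tokens inside them is not stated, so this is one interpretation among several.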


In [5]:
# Deep copy of the dataframes to avoid reloading them
ngrams_ext_1 = df_ngrams_ext_1_base.copy()
ngrams_ext_2 = df_ngrams_ext_2_base.copy()

time: 3.43 ms (started: 2021-03-02 15:44:43 +01:00)


In [6]:
# Deep copy of the dataframes to avoid reloading them
ngrams_red_1 = df_ngrams_red_1_base.copy()
ngrams_red_2 = df_ngrams_red_2_base.copy()

time: 3.24 ms (started: 2021-03-02 15:44:44 +01:00)


In [7]:
# Brief look at the dataset
print(f"""Shape of the dataset: {ngrams_ext_1.shape}
""")
print(f"""Column types of the dataset:
{ngrams_ext_1.dtypes}
""")
print("Head of the dataset:")
display(ngrams_ext_1.head())

Shape of the dataset: (38365, 12)

Column types of the dataset:
n_gram_size                object
pattern                    object
global_occurences           int64
fish meat eggs              int64
sugary snacks               int64
cereals and potatoes        int64
milk and dairy products     int64
fat and sauces              int64
fruits and vegetables       int64
salty snacks                int64
beverages                   int64
composite foods             int64
dtype: object

Head of the dataset:


| | n_gram_size | pattern | global_occurences | fish meat eggs | sugary snacks | cereals and potatoes | milk and dairy products | fat and sauces | fruits and vegetables | salty snacks | beverages | composite foods |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-grams | ('g',) | 1218496 | 157052 | 240149 | 146200 | 145675 | 56621 | 108858 | 98180 | 64496 | 201265 |
| 1 | 1-grams | ('dont',) | 272296 | 39244 | 48506 | 27833 | 36105 | 16854 | 22478 | 21304 | 18611 | 41361 |
| 2 | 1-grams | ('sucre',) | 268175 | 23256 | 85227 | 23156 | 34745 | 11009 | 23536 | 14603 | 19897 | 32746 |
| 3 | 1-grams | ('sel',) | 262603 | 44526 | 38631 | 22952 | 22495 | 14217 | 16096 | 25072 | 8107 | 70507 |
| 4 | 1-grams | ('kcal',) | 194316 | 23664 | 35394 | 22496 | 23819 | 8900 | 17202 | 14781 | 15996 | 32064 |


time: 24.6 ms (started: 2021-03-02 15:45:04 +01:00)
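Both `n_gram_size` and `pattern` are loaded as `object` (string) columns: the n-gram itself comes back as the text `"('g',)"` rather than as a tuple. If later steps need real tuples or a numeric size, a sketch along these lines may help. The toy rows, the `'2-grams'` label (assumed by analogy with `'1-grams'`), and the new column names `pattern_tuple` and `n` are illustrative assumptions.

```python
import ast
import pandas as pd

# Toy rows mimicking the layout shown above (values are illustrative only)
toy = pd.DataFrame({
    'n_gram_size': ['1-grams', '2-grams'],
    'pattern': ["('sucre',)", "('sucre', 'sel')"],
})

# Turn the string representation back into a real Python tuple
toy['pattern_tuple'] = toy['pattern'].apply(ast.literal_eval)

# Derive an integer n-gram size from the label, e.g. '2-grams' -> 2
toy['n'] = (
    toy['n_gram_size']
    .str.extract(r'^(\d+)-grams$', expand=False)
    .astype(int)
)

print(toy[['pattern_tuple', 'n']])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a file.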


### Concatenate the Datasets

In [8]:
# Concatenate the extended datasets
ngrams_ext = pd.concat([ngrams_ext_1, ngrams_ext_2], ignore_index=True)

time: 6.29 ms (started: 2021-03-02 15:51:48 +01:00)
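The `ignore_index=True` argument matters here because each input file carries its own index starting at 0. A small sketch with made-up frames (`part_1` and `part_2` are hypothetical stand-ins for the two files) shows the difference:

```python
import pandas as pd

# Two toy frames standing in for the two CSV chunks (illustrative data only)
part_1 = pd.DataFrame({'pattern': ["('g',)", "('sel',)"]})
part_2 = pd.DataFrame({'pattern': ["('sucre',)"]})

# Default behaviour: the original row labels are kept, so labels repeat
kept = pd.concat([part_1, part_2])

# With ignore_index=True the result gets a fresh 0..n-1 index
renumbered = pd.concat([part_1, part_2], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

A unique index avoids surprises when selecting rows by label with `.loc` after the concatenation.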


In [11]:
# Save ngrams_ext dataset to csv
ngrams_ext.to_csv('../raw_data/ngrams_extracted.csv', index=False)

time: 187 ms (started: 2021-03-02 15:54:52 +01:00)


In [9]:
print(f"""
- length of concatenated df: {len(ngrams_ext)},
- length of ngrams_ext_1: {len(ngrams_ext_1)},
- length of ngrams_ext_2: {len(ngrams_ext_2)},
""")


- length of concatenated df: 55296,
- length of ngrams_ext_1: 38365,
- length of ngrams_ext_2: 16931,

time: 934 µs (started: 2021-03-02 15:53:16 +01:00)


In [10]:
# Concatenate the reduced datasets
ngrams_red = pd.concat([ngrams_red_1, ngrams_red_2], ignore_index=True)

time: 5.32 ms (started: 2021-03-02 15:53:54 +01:00)


In [12]:
# Save ngrams_red dataset to csv
ngrams_red.to_csv('../raw_data/ngrams_extracted_reduced.csv', index=False)

time: 116 ms (started: 2021-03-02 15:57:44 +01:00)


### Calculate Frequencies