## Preparing EC Tables

This notebook reads three description tables from the database and preprocesses them for the EC codes:
 - The three tables (main food description, additional food description, and ingredient) are combined to collect as much description text as possible for each EC code. The `food_code` column in the main and additional food description tables and the `ingredient_code` column in the ingredient table all hold EC codes. The additional description table and the ingredient table may contain several records for the same code. For column specifications, refer to the IRI/FNDDS/PPC data dictionary.
 - The text is then minimally preprocessed by lowercasing it and replacing special characters with spaces.




In [None]:
import pandas as pd
import warnings
import re

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Part 1: Create the full EC table
First, keep the `food_code` and `main_food_description` columns from the main table, with the `wweia_category_description` text appended to the main description, and the `Food_code` and `Additional_food_description` columns from the additional food description table.

In [None]:
# Read in the main food description table, use food description and category description columns
mainfooddesc = pd.read_csv('./raw_data/mainfooddesc1718.csv', dtype=str)
mainfooddesc['main_food_description'] = mainfooddesc['main_food_description'] + ' ' + mainfooddesc['wweia_category_description']
mainfooddesc = mainfooddesc[['food_code', 'main_food_description']]
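
One detail worth checking in the cell above: even with `dtype=str`, empty fields are read as NaN, and concatenating strings with `+` propagates NaN. If any row lacks a `wweia_category_description`, its main description would be lost as well. The sketch below uses hypothetical toy rows (not taken from FNDDS) to show the behaviour and one possible guard with `fillna('')`. Whether the real table has such rows is not known from this notebook.

```python
import pandas as pd

# Hypothetical toy rows; the real table is read from mainfooddesc1718.csv
toy = pd.DataFrame({
    'food_code': ['1', '2'],
    'main_food_description': ['Milk, whole', 'Yogurt, plain'],
    'wweia_category_description': ['Milk, whole', None],
})

# Plain concatenation: a missing category makes the whole result missing
naive = toy['main_food_description'] + ' ' + toy['wweia_category_description']

# Guarded concatenation: treat a missing category as an empty string
guarded = (
    toy['main_food_description'] + ' '
    + toy['wweia_category_description'].fillna('')
).str.strip()
```

A quick `mainfooddesc['wweia_category_description'].isna().sum()` before the concatenation would show whether the guard is needed at all.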

In [None]:
mainfooddesc.head(5)

In [None]:
# Read in the additional food description table and align the key column name with the main table
addfooddesc = pd.read_csv('./raw_data/addfooddesc1516.csv', dtype=str)
addfooddesc = addfooddesc[['Food_code', 'Additional_food_description']]
addfooddesc = addfooddesc.rename(columns={'Food_code': 'food_code'})

Next, the additional descriptions are grouped by food code and joined into a single string per code. The result is then left joined to the main table, so that each food code ends up with one combined description column (renamed to `ec_description` at the end of this step).

In [None]:
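# Join all additional descriptions of a food code into one string; dropna() avoids a TypeError on missing values
addfooddesc['Additional_food_description'] = addfooddesc.groupby('food_code')['Additional_food_description'].transform(lambda x: ' '.join(x.dropna()))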
addfooddesc = addfooddesc.drop_duplicates()
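
To make the grouping step concrete, here is a minimal sketch of the same `groupby`/`transform` pattern on hypothetical toy rows. `transform` broadcasts the joined string back to every row of its group, so the rows within a group become identical and `drop_duplicates` collapses them to one.

```python
import pandas as pd

# Hypothetical toy rows: food code '1' has two additional descriptions
toy = pd.DataFrame({
    'food_code': ['1', '1', '2'],
    'Additional_food_description': ['cow milk', 'whole milk', 'plain yogurt'],
})

# Broadcast the joined text to every row of its group
toy['Additional_food_description'] = (
    toy.groupby('food_code')['Additional_food_description']
       .transform(lambda x: ' '.join(x))
)

# Rows within a group are now identical, so they collapse to one per code
toy = toy.drop_duplicates().reset_index(drop=True)
```

An alternative with the same result on this toy data would be `groupby('food_code')['Additional_food_description'].agg(' '.join).reset_index()`, which skips the intermediate duplicated rows.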

In [None]:
# Each food code now has a single row holding all of its additional descriptions
addfooddesc.head(5)

In [None]:
# Left join the additional food description to the main table, concatenate all descriptions to one col per food code
main_df = pd.merge(mainfooddesc, addfooddesc, on='food_code', how='left')
main_df['Additional_food_description'] = main_df['Additional_food_description'].fillna('')
main_df['main_food_description'] = main_df['main_food_description'] + ' ' + main_df['Additional_food_description']
main_df.drop('Additional_food_description', axis=1, inplace=True)
main_df = main_df.rename(columns={'food_code': 'ec_code', 'main_food_description': 'ec_description'})
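
The left join keeps every food code from the main table, including codes that have no additional description. The toy example below (hypothetical rows) sketches why the `fillna('')` call matters: without it, the unmatched row would get NaN and lose its main description in the concatenation.

```python
import pandas as pd

# Hypothetical toy tables: only food code '1' has an additional description
main = pd.DataFrame({
    'food_code': ['1', '2'],
    'main_food_description': ['Milk whole', 'Yogurt plain'],
})
extra = pd.DataFrame({
    'food_code': ['1'],
    'Additional_food_description': ['cow milk'],
})

# how='left' keeps both main rows; code '2' gets NaN in the new column
merged = pd.merge(main, extra, on='food_code', how='left')
unmatched = int(merged['Additional_food_description'].isna().sum())

# Replace NaN with '' so the concatenation keeps the main description
merged['Additional_food_description'] = merged['Additional_food_description'].fillna('')
merged['main_food_description'] = (
    merged['main_food_description'] + ' ' + merged['Additional_food_description']
)
```

Note that the unmatched row ends with a trailing space (`'Yogurt plain '`), since the separator is added unconditionally.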

In [None]:
main_df.head()

In [None]:
# The text for this food code is now longer because it includes the additional description
main_df.loc[main_df['ec_code'] == '11111000', 'ec_description'].tolist()

Finally, the ingredient table is concatenated to the updated main table, and duplicate EC codes are dropped so that each code keeps only its first record.

In [None]:
# Read in the ingredient table 
fnddsingred = pd.read_csv('./raw_data/fnddsingred1516.csv', dtype=str)
fnddsingred = fnddsingred[['ingredient_code', 'ingredient_description']]
fnddsingred = fnddsingred.rename(columns={'ingredient_description': 'ec_description', 'ingredient_code': 'ec_code'})

In [None]:
fnddsingred.head()

In [None]:
def clean_text(txt):
    """Lowercase the text and replace each run of non-alphanumeric characters with one space."""
    return re.sub(r'[^A-Za-z0-9]+', ' ', str(txt).lower())
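
A short usage sketch of the cleaning function, repeated here so the example is self-contained. The input strings are made up for illustration. Two behaviours are worth noticing: punctuation at the end of a string leaves a trailing space, and a missing value is turned into the literal text `'nan'` by `str()`.

```python
import re

def clean_text(txt):
    # Same logic as the notebook cell: lowercase, then collapse
    # every run of non-alphanumeric characters into one space
    return re.sub(r'[^A-Za-z0-9]+', ' ', str(txt).lower())

# Hypothetical description text
example = clean_text('Milk, whole (3.25% fat)')

# A missing value (NaN) becomes the string 'nan'
missing = clean_text(float('nan'))
```

Here `example` is `'milk whole 3 25 fat '` and `missing` is `'nan'`. If either matters downstream, a `.strip()` on the result or a `fillna('')` before cleaning would be possible adjustments.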

In [None]:
# Concatenate the ingredient table to the main table
# Clean the text so that only lowercase letters, digits, and single spaces remain
ec_cleaned = pd.concat([main_df, fnddsingred], axis=0)
ec_cleaned['ec_description'] = ec_cleaned['ec_description'].apply(clean_text)
# Some codes have descriptions that differ across years, and the ingredient table repeats codes.
# For now, only the first record per EC code is kept.
ec_cleaned = ec_cleaned.drop_duplicates('ec_code')
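
Because `drop_duplicates('ec_code')` keeps the first occurrence by default, the order of the concatenation decides which description survives. The sketch below uses hypothetical toy rows to show that a main-table record wins over an ingredient record with the same code, and that the first of several ingredient records is kept.

```python
import pandas as pd

# Hypothetical toy tables with overlapping and repeated codes
main = pd.DataFrame({
    'ec_code': ['1', '2'],
    'ec_description': ['milk whole', 'yogurt plain'],
})
ingred = pd.DataFrame({
    'ec_code': ['2', '3', '3'],
    'ec_description': ['yogurt', 'sugar', 'sugar white'],
})

# Main rows come first, so they take precedence on duplicate codes
combined = pd.concat([main, ingred], axis=0)
deduped = combined.drop_duplicates('ec_code')
```

In this toy case the description `'sugar white'` is discarded. If the extra text is wanted, grouping by `ec_code` and joining the descriptions, as done for the additional descriptions, would be one alternative.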

### Part 2: Keep ECs that appear in the PPC table

In [None]:
ppc = pd.read_csv('./raw_data/ppc20152016.csv', dtype=str)
valid_ec = set(ppc['ec'].tolist())

In [None]:
# Negative EC codes need no explicit filtering here because they never appear in the EC table in the first place
ec_cleaned = ec_cleaned.loc[ec_cleaned['ec_code'].isin(valid_ec)]
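
The filter above is a plain membership test, which can be sketched on toy data as follows. All codes and descriptions here are hypothetical, including the negative code `'-9'`, which stands in for a PPC entry with no counterpart in the EC table.

```python
import pandas as pd

# Hypothetical EC table and PPC column
ec = pd.DataFrame({
    'ec_code': ['11111000', '1001', '2047'],
    'ec_description': ['milk whole', 'butter salted', 'salt table'],
})
ppc_toy = pd.DataFrame({'ec': ['11111000', '2047', '-9', '11111000']})

# Repeated PPC entries collapse in the set
valid = set(ppc_toy['ec'])

# Keep only EC rows whose code appears in the PPC table
kept = ec.loc[ec['ec_code'].isin(valid)]

# PPC codes with no description in the EC table
unmatched = valid - set(ec['ec_code'])
```

Comparing `len(valid)` with `len(kept)` gives a quick count of how many PPC codes have no description. In this toy case, `unmatched` contains only `'-9'`.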

In [None]:
len(valid_ec)

In [None]:
ec_cleaned.head()

In [None]:
ec_cleaned.to_csv('ec_cleaned.csv', index=False)