## Preparing UPC Table

This notebook takes the UPC table exported from the database, restricted to records from 2015 and 2016. We keep only the `upc_code` and `upc_description` columns, and only the records whose UPC appears in the PPC table. The descriptions are then cleaned as follows:
- Keep only digits and letters, converted to lowercase.
- Remove the UPC code suffix from the description.

In [None]:
import pandas as pd
import warnings 
import re

warnings.filterwarnings('ignore')

### Part 1: Keep only the useful UPCs 

In [None]:
# The exported table contains only records from 2015 and 2016
upc = pd.read_csv('./raw_data/pd_pos_all1516.csv', dtype=str)

In [None]:
upc.columns

In [None]:
upc.sample(3)

We only care about UPC records that appear in PPC, the target table.

In [None]:
ppc = pd.read_csv('./raw_data/ppc20152016.csv', dtype=str)

In [None]:
ppc.sample(3)

In [None]:
ppc = ppc.loc[(ppc['ec'] != '-70') & (ppc['ec'] != '-90')]

Here we remove all PPC records matched to EC code `-90` (low sale) or `-70` (no sale), because they are technically not part of the challenge. PPC only covers items in the top 95% of sales, so these matches are excluded from the public and private test sets as well.

However, we keep `-80` (cannot determine the actual product) and `-99` (no acceptable matches) unchanged, because these reasons for not assigning an FNDDS code are important to distinguish.
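The same exclusion can be written with `isin`, which scales more easily if further codes ever need to be dropped. The sketch below uses a made-up toy table (the UPC values and the EC code `1234` are invented for illustration) and is only meant to show that `-80` and `-99` survive the filter.

```python
import pandas as pd

# Hypothetical toy PPC rows; the real table is read from ./raw_data/ppc20152016.csv.
ppc_toy = pd.DataFrame({
    'upc': ['001', '002', '003', '004', '005'],
    'ec': ['1234', '-70', '-80', '-90', '-99'],
})

# EC codes that mark records outside the challenge (no sale, low sale).
excluded_ec = ['-70', '-90']

# Same rows as chaining (ec != '-70') & (ec != '-90').
ppc_kept = ppc_toy.loc[~ppc_toy['ec'].isin(excluded_ec)]

print(ppc_kept['ec'].tolist())
```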

In [None]:
valid_upc = set(ppc['upc'].tolist())

In [None]:
# Keep only the UPCs that appear in PPC table.
upc = upc.loc[upc['upc'].isin(valid_upc)]
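After this filter it may be worth checking whether every PPC code actually has a row in the UPC table. The following is a minimal sketch on invented toy data, not a step the notebook performs; the name `missing_in_upc` is hypothetical.

```python
import pandas as pd

# Hypothetical toy tables standing in for the real UPC and PPC exports.
upc_toy = pd.DataFrame({
    'upc': ['001', '002', '003'],
    'upcdesc': ['A-001', 'B-002', 'C-003'],
})
ppc_toy = pd.DataFrame({'upc': ['002', '003', '003', '009']})

# Same filtering pattern as above: keep UPC rows whose code appears in PPC.
valid_upc = set(ppc_toy['upc'])
upc_kept = upc_toy.loc[upc_toy['upc'].isin(valid_upc)]

# PPC codes that have no description row at all in the UPC table.
missing_in_upc = valid_upc - set(upc_toy['upc'])

print(len(upc_kept), sorted(missing_in_upc))
```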

### Part 2: Preprocess UPC table

In [None]:
# This is the description field before cleaning
upc.iloc[0]['upcdesc']

In [None]:
# Every UPC description ends with the UPC code as a suffix, which needs to be removed
upc['upc_description'] = upc['upcdesc'].str.split('-').str[0]
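One thing to keep in mind: `split('-').str[0]` keeps only the text before the first hyphen, so a description that itself contains a hyphen would be cut short. Whether the real data has such descriptions is not known here; the sketch below uses invented strings to compare that behaviour with `rsplit('-', n=1)`, which removes only the last hyphen-separated segment.

```python
import pandas as pd

# Invented examples; the second description contains a hyphen of its own.
desc = pd.Series([
    'WHOLE MILK 1GAL-0001111040101',
    'COCA-COLA CLASSIC 12OZ-0004900000634',
])

# Text before the FIRST hyphen (the approach used in the cell above).
first_segment = desc.str.split('-').str[0]

# Everything before the LAST hyphen, i.e. only the trailing code is dropped.
without_suffix = desc.str.rsplit('-', n=1).str[0]

print(first_segment.tolist())
print(without_suffix.tolist())
```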

In [None]:
# More text columns could be appended to the description

# upc['deptid'] = upc['deptid'].str.split('-').str[1]
# upc['aisle'] = upc['aisle'].str.split('-').str[1]

# column_list = ['flavor', 'deptid', 'aisle', 'category', 'brand', 'manufacturer', 'parent']
# for column in column_list:
#     upc['upc_description'] = upc['upc_description'] + ' ' + upc[column]
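If the extra columns are ever combined this way, note that adding a string to a missing value yields a missing value in pandas, so any row with an empty `brand` or `flavor` would lose its whole description. A minimal sketch on invented data, assuming missing cells should simply contribute nothing:

```python
import pandas as pd

# Invented rows; None stands in for the NaN that read_csv produces for empty cells.
toy = pd.DataFrame({
    'upc_description': ['WHOLE MILK', 'COLA'],
    'brand': ['DAIRYCO', None],
    'flavor': [None, 'CHERRY'],
})

# Naive concatenation: the second row becomes missing because brand is missing.
naive = toy['upc_description'] + ' ' + toy['brand']

# Filling missing cells with '' first keeps every description intact.
combined = toy['upc_description']
for column in ['brand', 'flavor']:
    combined = combined + ' ' + toy[column].fillna('')

print(naive.isna().tolist())
print(combined.tolist())
```

The extra spaces left behind by empty cells are harmless if the text is later passed through a cleaner that collapses runs of non-alphanumeric characters.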

In [None]:
# Lowercase the text and replace each run of non-alphanumeric characters with one space
def clean_text(txt):
    return re.sub(r'[^A-Za-z0-9]+', ' ', str(txt).lower())
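To make the behaviour of `clean_text` concrete, the sketch below runs it on an invented description and on a missing value. It also shows two edge cases: trailing punctuation leaves a trailing space, and `NaN` becomes the literal string `nan`. The variant `clean_text_strict` is hypothetical and not part of the notebook; it is one possible way to handle both cases.

```python
import re

def clean_text(txt):
    # Same logic as the notebook cell above.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

sample = clean_text('Coca-Cola Classic, 12 OZ!')
missing = clean_text(float('nan'))

def clean_text_strict(txt):
    # Hypothetical variant: map missing values to '' and strip outer spaces.
    if txt is None or txt != txt:  # NaN is the only value not equal to itself
        return ''
    return re.sub(r'[^a-z0-9]+', ' ', str(txt).lower()).strip()

print(repr(sample))   # note the trailing space
print(repr(missing))  # the string 'nan', not an empty value
print(repr(clean_text_strict('Coca-Cola Classic, 12 OZ!')))
```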

In [None]:
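# Keep only the code and description for now.
# .copy() makes an independent frame, so the assignment below does not write to a slice of `upc`
upc_cleaned = upc[['upc', 'upc_description']].copy()
upc_cleaned['upc_description'] = upc_cleaned['upc_description'].apply(clean_text)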

In [None]:
# This is the description field after cleaning
upc_cleaned.iloc[0]['upc_description']

In [None]:
# Rename the column for consistency
upc_cleaned = upc_cleaned.rename(columns={'upc': 'upc_code'})

In [None]:
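# Some UPCs have different descriptions across the two years; only one row per UPC is kept for now.
# drop_duplicates keeps the first row it sees for each upc_code (keep='first'),
# so the 2015 record is dropped only if the 2016 row comes first in the input file
upc_cleaned = upc_cleaned.drop_duplicates('upc_code')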
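Because `drop_duplicates` keeps the first occurrence by default, which year survives depends on row order. The sketch below illustrates this on invented rows and shows one way to make the choice explicit. It assumes a `year` column is available, which is an assumption: the notebook never shows which column of the export carries the year.

```python
import pandas as pd

# Invented rows; 'year' is a hypothetical column name.
toy = pd.DataFrame({
    'upc_code': ['001', '002', '001'],
    'upc_description': ['whole milk 1 gal', 'cola 12 oz', 'milk 1 gal'],
    'year': ['2015', '2016', '2016'],
})

# Default keep='first': the 2015 row wins here because it comes first.
first_seen = toy.drop_duplicates('upc_code')

# Sorting newest-first before deduplicating keeps the 2016 row regardless of input order.
latest = (
    toy.sort_values('year', ascending=False, kind='stable')
       .drop_duplicates('upc_code')
       .sort_values('upc_code')
)

print(first_seen['upc_description'].tolist())
print(latest['upc_description'].tolist())
```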

In [None]:
# Output the table
upc_cleaned.to_csv('upc_cleaned.csv', index=False)

In [None]:
# Just to have a look at the data
upc_cleaned.head()