## Linear Model Baseline

This notebook takes the cleaned UPC and EC tables and uses the `recordlinkage` package to fit a very simple logistic regression model. The model is preliminary at this stage and is only meant to demonstrate how the task can be done with minimal data processing and time.

`recordlinkage` is a convenient package for matching records. The link cannot be opened inside ADRF, but feel free to explore it outside ADRF at: https://pypi.org/project/recordlinkage/

In [None]:
import pandas as pd
import recordlinkage
from recordlinkage.index import Full
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import re
import warnings 

warnings.filterwarnings('ignore')

In [None]:
# The UPC and EC tables are also limited to the years 2015 and 2016.
upc = pd.read_csv('upc_cleaned.csv', dtype=str)
ec = pd.read_csv('ec_cleaned.csv', dtype=str)
ppc = pd.read_csv('./raw_data/ppc20152016.csv', dtype=str)

Some UPC codes in the PPC table are not in the UPC table, so they are filtered out here. We made sure that no such cases occur in the public and private test sets.

In [None]:
# Drop PPC rows whose UPC code never appears in the UPC table.
# Do not apply the same filter on the EC side, because there are custom ec_code values that are not in the EC table.
ppc = ppc[ppc['upc'].isin(upc['upc_code'])]

There are around 200K unique UPC records and around 3K EC records, so comparing every pair (roughly 600 million candidates) is very expensive without blocking. One option is to block on category, or on shared words, to reduce the number of candidate links.
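
The shared-word blocking idea can be sketched in plain Python as follows. This is only an illustration under assumptions: the function names `tokenize` and `block_by_shared_token` and the toy descriptions are hypothetical and not part of this notebook, and a real run would probably need to drop very common words first, since they would pair almost everything with everything.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a description and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def block_by_shared_token(upc_desc, ec_desc):
    """Return the (upc, ec) pairs whose descriptions share at least one token."""
    # Inverted index: token -> set of EC codes whose description contains it.
    token_to_ec = defaultdict(set)
    for ec_code, desc in ec_desc.items():
        for tok in tokenize(desc):
            token_to_ec[tok].add(ec_code)

    # A UPC is paired only with the EC codes reachable through one of its tokens.
    pairs = set()
    for upc_code, desc in upc_desc.items():
        for tok in tokenize(desc):
            for ec_code in token_to_ec.get(tok, ()):
                pairs.add((upc_code, ec_code))
    return pairs


# Toy data, made up for illustration only.
upc_desc = {"0001": "Whole Milk 1 gal", "0002": "Cheddar Cheese block"}
ec_desc = {"E10": "milk, whole", "E20": "cheese, cheddar", "E30": "apple juice"}

pairs = block_by_shared_token(upc_desc, ec_desc)
full_size = len(upc_desc) * len(ec_desc)
print(f"{len(pairs)} candidate pairs instead of {full_size}")
```

On the toy data, only the pairs that share a word survive, so the classifier would score 2 pairs rather than all 6.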

For simplicity and demonstration purposes, here we only sample a small part of the PPC table to create a subset of the data.

In [None]:
ppc_clipped = ppc.sample(500, random_state=2498)

In [None]:
# Clip the UPC and EC tables accordingly.
upc_set = set(ppc_clipped['upc'].tolist())
ec_set = set(ppc_clipped['ec'].tolist())

upc_clipped = upc[upc['upc_code'].isin(upc_set)]
ec_clipped = ec[ec['ec_code'].isin(ec_set)]

upc_clipped = upc_clipped.set_index('upc_code')
ec_clipped = ec_clipped.set_index('ec_code')

In [None]:
%%time
# Use the recordlinkage package to build every candidate link (full index) and compute Jaro-Winkler scores
indexer = recordlinkage.Index()
indexer.add(Full())
candidate_links = indexer.index(upc_clipped, ec_clipped)

# Compute the Jaro-Winkler score between the two food descriptions of each candidate pair
comparer = recordlinkage.Compare()
comparer.string('upc_description', 'ec_description', method='jarowinkler', label='score')
raw_data = comparer.compute(candidate_links, upc_clipped, ec_clipped)
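
For intuition about what the `score` column measures, here is a minimal pure-Python sketch of the textbook Jaro-Winkler similarity. It is not the code `recordlinkage` runs; the library's own implementation may differ in details, and the parameter names `p` and `max_prefix` are assumptions chosen for this illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1], from matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because the score rewards shared prefixes and nearby character matches, it reacts to spelling variants but not to word order, which is one reason the notebook also adds TF-IDF features.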

In [None]:
# Map the description back to the table
raw_data['upc'], raw_data['ec'] = zip(*raw_data.index)
raw_data['upc_description'] = raw_data['upc'].map(upc_clipped['upc_description'].to_dict())
raw_data['ec_description'] = raw_data['ec'].map(ec_clipped['ec_description'].to_dict())
raw_data = raw_data.fillna('no acceptable match')

In [None]:
raw_data.head()

In [None]:
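# Apply TF-IDF to turn the concatenated descriptions into vectors (top 100 terms only)
vectorizer = TfidfVectorizer(max_features=100)
raw_data['desc'] = raw_data['upc_description'] + " " + raw_data['ec_description']
vectorizer.fit(raw_data['desc'])
# toarray() returns a plain ndarray, which pandas handles more predictably than the matrix from todense()
tfidf_vector = vectorizer.transform(raw_data['desc']).toarray()
vector_df = pd.DataFrame(tfidf_vector, index=candidate_links)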
vector_df['score'] = raw_data['score']

In [None]:
vector_df.head()

In [None]:
%%time
# Split off a training set: 20% of the candidate pairs are used for training.
# The held-out split is discarded, because predictions are made on the whole dataset below.
X_train, _ = train_test_split(vector_df, test_size=0.8, random_state=2498)

In [None]:
# Build the ground-truth links and keep those that fall inside the training pairs.
true_linkage = pd.MultiIndex.from_arrays([ppc_clipped['upc'].tolist(), ppc_clipped['ec'].tolist()])
X_train_linkage = X_train.index
match_index = X_train_linkage.intersection(true_linkage)

In [None]:
%%time
logrg = recordlinkage.LogisticRegressionClassifier(C=10)
logrg.fit(X_train, match_index)

In [None]:
# Make predictions on the whole dataset. 
result = logrg.prob(vector_df)

In [None]:
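# Keep the top 5 matches for each UPC
# Name the probability Series first so the column is called 'confidence' after reset_index()
result = result.rename('confidence').groupby(level=0).nlargest(5).reset_index(level=0, drop=True).reset_index()
result = result.rename(columns={'upc_code': 'upc', 'ec_code': 'ec'})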

In [None]:
result.head()

In [None]:
result[['upc', 'ec']].to_csv('result/logistic_regression_submission.csv', index=False)

In [None]:
ppc_clipped[['upc', 'ec']].to_csv('result/logistic_regression_ground_truth.csv', index=False)
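
The two files written above can be compared with a simple top-k hit rate. The sketch below is an assumption about how one might sanity-check the baseline, not the scoring rule of any official evaluation (none is stated in this notebook); the function name `top_k_hit_rate` and the toy pairs are hypothetical.

```python
from collections import defaultdict


def top_k_hit_rate(predicted_pairs, true_pairs, k=5):
    """Fraction of true (upc, ec) pairs whose ec is among the first k predictions for that upc."""
    # Collect up to k predicted EC codes per UPC, in the order they are given.
    candidates = defaultdict(list)
    for upc, ec in predicted_pairs:
        if len(candidates[upc]) < k:
            candidates[upc].append(ec)

    true_pairs = list(true_pairs)
    if not true_pairs:
        return 0.0
    hits = sum(1 for upc, ec in true_pairs if ec in candidates.get(upc, []))
    return hits / len(true_pairs)


# Toy data, made up for illustration only.
predicted = [("0001", "E10"), ("0001", "E30"), ("0002", "E30")]
truth = [("0001", "E10"), ("0002", "E20")]

hit_rate = top_k_hit_rate(predicted, truth, k=5)
print(hit_rate)
```

With the real files, the pairs could be read from the rows of `result/logistic_regression_submission.csv` and `result/logistic_regression_ground_truth.csv`.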