### Map Instacart products to nutrient database
##### Authors: Reshma

* Instacart dataset:
    * Instacart product file: 49,688 unique products, including non-food items
    
* Nutrient database:
    * Cleaned USDA file V1
    * Cleaned NutritionValue extract obtained by web scraping
    
* Similarity measures:
    * Jaccard similarity on name tokens (computed as Jaccard distance with NLTK)
    * spaCy vector similarity using the `en_core_web_lg` model
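
As a quick illustration of the first measure, the sketch below computes token-level Jaccard similarity and distance for two product names that appear in this notebook's results. The helper names `jaccard_similarity` and `jaccard_distance` are illustrative and not part of the notebook, which relies on `nltk.jaccard_distance` instead.

```python
def jaccard_similarity(name_a, name_b):
    """Share of whitespace-separated tokens common to both names (0.0 to 1.0)."""
    tokens_a, tokens_b = set(name_a.split()), set(name_b.split())
    union = tokens_a | tokens_b
    if not union:  # assumption: two empty names count as having no overlap
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def jaccard_distance(name_a, name_b):
    """Distance is the complement of similarity, so lower means a closer match."""
    return 1.0 - jaccard_similarity(name_a, name_b)


sim = jaccard_similarity("chocolate sandwich cookies", "cookies chocolate wafers")
dist = jaccard_distance("chocolate sandwich cookies", "cookies chocolate wafers")
print(sim, dist)  # 2 shared tokens out of 4 distinct tokens -> 0.5 0.5
```

Since the notebook ranks by distance, the best candidates are the ones with the smallest values.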

In [40]:
import pandas as pd
import numpy as np
import nltk
from tqdm import tqdm
import time
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output

import spacy
nlp = spacy.load('en_core_web_lg')

In [41]:
path1 = 'C:/Users/SSK/Documents/UC - MScA/Courses/Capstone/Datasets/Nutrition/Final Dataset/'
path2 = 'C:/Users/SSK/Documents/UC - MScA/Courses/Capstone/Datasets/instacart_online_grocery_shopping_2017_05_01/instacart_2017_05_01/'

#### Read nutrient database products

In [42]:
nutri_web  = pd.read_csv(path1+'NutritionValueExtract.csv')
nutri_usda = pd.read_csv(path1+'Nutrition Data Consolidated with additional attributesV1.csv')

In [43]:
# Combine product names from both nutrient sources; np.unique de-duplicates and sorts them
db_products = np.unique(list(nutri_web['Product Name'].values) + list(nutri_usda['name'].values))

In [32]:
# Map each lowercased name to its spaCy Doc, parsed once so get_spacy_sim can reuse it
db_products = {prod.lower():nlp(prod.lower()) for prod in db_products}

#### Read instacart products

In [44]:
insta_ = pd.read_csv(path2+'products.csv')

In [45]:
insta_products = list(insta_['product_name'].values)
insta_df = pd.DataFrame(insta_products, columns=['Product Name'])
insta_df['Product Name'] = insta_df['Product Name'].apply(lambda x : x.lower())

In [48]:
# Work on the first 1,000 products only; .copy() avoids assigning new columns to a view
insta_df = insta_df.head(1000).copy()

#### Functions

In [49]:
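def get_jd_sim(insta_prod):
    """Return the 3 database product names with the smallest Jaccard distance to insta_prod."""
    # Token-level Jaccard distance between the Instacart name and every database name
    sim_arr = {db_prod:nltk.jaccard_distance(set(insta_prod.split()), set(db_prod.split())) for db_prod in db_products}
    sel_dict = sorted(sim_arr.items(), key=lambda x:x[1])[:3] # ascending sort: lower Jaccard distance is a closer match
    return [i[0] for i in sel_dict]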

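def get_spacy_sim(insta_prod):
    """Return the 3 database product names with the highest spaCy similarity to insta_prod."""
    # Requires db_products to be the {name: Doc} dict built above, not the raw name array
    prod_1 = nlp(insta_prod.lower())
    sim_arr = {db_k:prod_1.similarity(db_v) for db_k, db_v in db_products.items()}
    sel_dict = sorted(sim_arr.items(), key=lambda x:-x[1])[:3] # descending sort: higher spaCy similarity is a closer match
    return [i[0] for i in sel_dict]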
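
By default, spaCy's `Doc.similarity` is the cosine similarity of the two documents' vectors, and a document vector is the average of its token vectors. The sketch below reproduces that idea in plain Python, mainly to show that word order does not change the score. The vectors in `toy_vectors` are made up, and all helper names are hypothetical; the real `en_core_web_lg` vectors have 300 dimensions.

```python
import math

# Made-up 3-d word vectors for illustration only; these are not real spaCy embeddings.
toy_vectors = {
    "green": [0.9, 0.1, 0.0],
    "chili": [0.2, 0.8, 0.1],
    "sauce": [0.1, 0.7, 0.3],
    "salt":  [0.0, 0.1, 0.9],
}


def phrase_vector(phrase, word_vectors):
    """Average the vectors of the known words in a phrase (None if no word is known)."""
    known = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not known:
        return None
    return [sum(component) / len(known) for component in zip(*known)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def phrase_similarity(phrase_a, phrase_b, word_vectors=toy_vectors):
    """Cosine similarity of the two averaged phrase vectors."""
    vec_a = phrase_vector(phrase_a, word_vectors)
    vec_b = phrase_vector(phrase_b, word_vectors)
    if vec_a is None or vec_b is None:
        return 0.0  # assumption: a phrase with no known words scores zero
    return cosine_similarity(vec_a, vec_b)


same_words = phrase_similarity("green chili sauce", "sauce chili green")
different = phrase_similarity("green chili sauce", "salt")
print(round(same_words, 6), same_words > different)
```

Because averaging discards word order, a database name that lists the same words in a different order can still score close to 1.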

#### Run Jaccard similarity

In [50]:
start_time = time.time()
insta_df['JD Match1'], insta_df['JD Match2'], insta_df['JD Match3'] = zip(
    *insta_df['Product Name'].apply(get_jd_sim))
print("total time in minutes:", (time.time() - start_time)/60.0)

total time in minutes: 5.08179192940394
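
`get_jd_sim` re-splits every database name and fully sorts all distances for each Instacart product. One possible way to cut that work, shown here as a sketch and not as the notebook's method, is to tokenize the database names once and pick the three smallest distances with `heapq.nsmallest`. The names `build_token_index`, `token_jaccard_distance` and `top_k_jaccard` are hypothetical, and the empty-set handling is an assumption (`nltk.jaccard_distance` would raise a division error for two empty sets).

```python
import heapq


def build_token_index(names):
    """Tokenize every database name once, up front."""
    return {name: frozenset(name.lower().split()) for name in names}


def token_jaccard_distance(tokens_a, tokens_b):
    """Jaccard distance between two token sets (assumed 0.0 when both are empty)."""
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def top_k_jaccard(query, token_index, k=3):
    """Return the k database names closest to the query, smallest distance first."""
    query_tokens = frozenset(query.lower().split())
    return heapq.nsmallest(
        k,
        token_index,
        key=lambda name: token_jaccard_distance(query_tokens, token_index[name]),
    )


# Small hand-picked sample of database names, for illustration only
sample_index = build_token_index([
    "oolong tea",
    "table salt",
    "tea instant lemon unsweetened",
    "barbecue sauce",
])
matches = top_k_jaccard("robust golden unsweetened oolong tea", sample_index, k=2)
print(matches)
```

With the index built once, the call inside `apply` would become `top_k_jaccard(product, token_index)`; whether this is noticeably faster on the full database has not been measured here.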


#### Run spaCy similarity

In [38]:
start_time = time.time()
insta_df['SP Match1'], insta_df['SP Match2'], insta_df['SP Match3'] = zip(
    *insta_df['Product Name'].apply(get_spacy_sim))
print("total time in minutes:", (time.time() - start_time)/60.0)

total time in minutes: 21.57397082646688
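
Both timings cover only the 1,000-product sample. A rough back-of-the-envelope projection to all 49,688 products is sketched below. It assumes that run time grows linearly with the number of products and that both runs used the same machine, neither of which the notebook verifies; `extrapolate_hours` is a hypothetical helper.

```python
N_SAMPLE = 1000    # rows kept by insta_df.head(1000)
N_TOTAL = 49688    # unique products in the Instacart product file

# Run times reported above for the 1,000-product sample, in minutes
measured_minutes = {
    "jaccard": 5.08179192940394,
    "spacy": 21.57397082646688,
}


def extrapolate_hours(minutes_for_sample, n_sample=N_SAMPLE, n_total=N_TOTAL):
    """Scale a measured run time linearly with the number of products."""
    return minutes_for_sample * (n_total / n_sample) / 60.0


estimated_hours = {name: extrapolate_hours(m) for name, m in measured_minutes.items()}
for name, hours in estimated_hours.items():
    print(f"{name}: about {hours:.1f} hours for all {N_TOTAL} products")
```

Under these assumptions the full catalogue would take roughly 4 hours with Jaccard and roughly 18 hours with spaCy.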


In [39]:
insta_df

index,Product Name,JD Match1,JD Match2,JD Match3,SP Match1,SP Match2,SP Match3
0,chocolate sandwich cookies,cookies chocolate wafers,cookies chocolate chip sandwich with creme fil...,cookies chocolate sandwich with creme filling ...,cookies chocolate chip sandwich with creme fil...,cookies chocolate sandwich with creme filling ...,cookies chocolate sandwich with extra creme fi...
1,all-seasons salt,table salt,"butter, without salt",salt pork cooked,peanuts all types dry-roasted with salt,peanuts all types oil-roasted without salt,malt-o-meal original plain prepared with water...
2,robust golden unsweetened oolong tea,oolong tea,tea instant decaffeinated unsweetened,tea instant lemon unsweetened,tea instant decaffeinated unsweetened,oolong tea,tea iced instant black unsweetened
3,smart ones classic favorites mini rigatoni wit...,vodka sauce with tomatoes and cream,veal with cream sauce,pasta with cream sauce ready-to-heat,vodka sauce with tomatoes and cream,pasta with cream sauce home recipe,swedish meatballs with cream or white sauce
4,green chile anytime sauce,enchilada sauce green,sauce hot chile sriracha tuong ot sriracha,sauce chili peppers hot immature green canned,enchilada with beans green-chile or enchilada ...,enchilada with chicken green-chile or enchilad...,enchilada with meat green-chile or enchilada s...
...,...,...,...,...,...,...,...
995,honey cinnamon nut-thins crackers,cinnamon,cinnamon buns frosted (includes honey buns),"cinnamon buns, frosted (includes honey buns)",nuts almonds honey roasted unblanched,honey butter,biscuit cinnamon-raisin
996,mini double chocolate ice cream bars,ice cream soda chocolate,light ice cream cone chocolate,soft serve chocolate ice cream,ice cream bar or stick chocolate ice cream cho...,ice cream bar or stick rich chocolate ice crea...,ice cream bar or stick chocolate covered
997,hot chopped green chili,hot green chili peppers,green chili peppers,sauce chili peppers hot immature green canned,hot green chili peppers,sauce chili peppers hot immature green canned,sauce peppers hot chili mature red canned
998,original organic ville bbq sauce,sauce barbecue bulls-eye original,sauce barbecue kraft original,sauce barbecue kc masterpiece original,sauce barbecue kraft original,sauce barbecue kc masterpiece original,barbecue sauce
