## Task F

How would your recommendation change if you use word vectors (the spaCy package would be the easiest to use with pretrained word vectors) instead of plain vanilla bag-of-words cosine similarity? One way to analyze the difference would be to consider the % of reviews that mention a preferred attribute. E.g., if you recommend a product, what % of its reviews mention an attribute specified by the customer?
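
The mention rate described above can be sketched in a few lines. This is a minimal illustration, not the assignment's reference solution: the helper name `mention_rate`, the whole-word matching rule, and the toy reviews are all assumptions made for the example.

```python
import re

def mention_rate(reviews_by_product, attributes):
    """Return {product: % of its reviews that mention at least one attribute}."""
    # Whole-word, case-insensitive match, so "thick" does not match "thickness".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in attributes) + r")\b",
        re.IGNORECASE,
    )
    rates = {}
    for product, reviews in reviews_by_product.items():
        if not reviews:
            rates[product] = 0.0
            continue
        hits = sum(1 for review in reviews if pattern.search(review))
        rates[product] = 100.0 * hits / len(reviews)
    return rates

# Toy data (hypothetical, not taken from reviews.json)
toy_reviews = {
    "Beer A": [
        "Dark and sweet with a thick body.",
        "Light and crisp.",
        "Very sweet finish.",
        "Hoppy.",
    ],
    "Beer B": ["Citrus and pine.", "Thin but pleasant."],
}
rates = mention_rate(toy_reviews, ["dark", "sweet", "thick"])
print(rates)
```

Computing this rate for the product recommended by each approach gives a single number per approach that can be compared directly.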

Do you see any difference between the bag-of-words and word-vector approaches? This article may be useful: https://medium.com/swlh/word-embeddings-versus-bag-of-words-the-curious-case-of-recommender-systems-6ac1604d4424?source=friends_link&sk=d746da9f094d1222a35519387afc6338

Note that the article doesn’t claim that bag-of-words will always be better than word embeddings for recommender systems. It lays out the conditions under which that is likely to be the case. Depending on the attributes you use, you may or may not see the same effect.
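
For comparison with the word-vector cells below, here is a rough sketch of the bag-of-words baseline: cosine similarity over raw word counts. The tokenizer regex, the function names, and the example sentences are assumptions for illustration; the earlier tasks may have built the bag-of-words vectors differently.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Count lowercase word occurrences (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def bow_cosine(text_a, text_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = bow_vector(text_a), bow_vector(text_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = "dark sweet thick"
with_overlap = bow_cosine(query, "a dark beer with a sweet finish")
no_overlap = bow_cosine(query, "roasty malty syrupy stout")  # no shared words
print(with_overlap, no_overlap)
```

Bag-of-words similarity is nonzero only when the review literally contains a query word, which ties it closely to the mention rate. Word vectors can also score a review highly for related words that never mention the attribute itself.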

In [1]:
import spacy
!python -m spacy download en_core_web_md
import en_core_web_md

nlp = en_core_web_md.load()

Collecting en-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
You should consider upgrading via the '/Users/Amanda/opt/anaconda3/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [2]:
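import json

import pandas as pd

# Load the scraped reviews: {beer name: [{'review': ...}, ...]}
with open("reviews.json") as file:
    data = json.load(file)

def cosine_similarity(review):
    """Return the spaCy similarity between a review and the attribute query."""
    # Relies on the global processed_attr, which is defined in a later cell (In [4]).
    processed_review = nlp(review)
    return processed_review.similarity(processed_attr)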
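
By default, spaCy's `Doc.similarity` is the cosine similarity between the two documents' vectors, and a document's vector is the average of its token vectors. The sketch below reproduces that calculation by hand. The 3-dimensional `toy_vectors` are made up for illustration and are not values from `en_core_web_md`.

```python
import math

# Hypothetical 3-d embeddings, invented for this example
toy_vectors = {
    "dark": [0.9, 0.1, 0.0],
    "roasty": [0.8, 0.2, 0.1],
    "sweet": [0.1, 0.9, 0.0],
    "crisp": [0.0, 0.1, 0.9],
}

def mean_vector(tokens):
    """Average the vectors of the known tokens (the document vector)."""
    vectors = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = mean_vector(["dark", "sweet"])
close = cosine(query, mean_vector(["roasty", "sweet"]))  # related wording
far = cosine(query, mean_vector(["crisp"]))              # unrelated wording
print(close, far)
```

Because every token contributes to the average, a long review's vector is diluted by words unrelated to the query, which is worth keeping in mind when reading the scores below.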

In [3]:
# Read the attribute query (first line of the file) and drop the trailing newline.
with open("attributes.txt", "r") as attributes_file:
    attr = attributes_file.readline().strip()

In [4]:
processed_attr = nlp(attr)

product_names = []
product_reviews = []
similarity_scores = []
dark_score = []
sweet_score = []
thick_score = []

# Parse each attribute word once instead of once per review.
# Assumes the attribute line lists the words in the order: dark, sweet, thick.
attrs = attr.split()
attr_docs = [nlp(word) for word in attrs]

for beer_name, entries in data.items():
    for entry in entries:
        # One 'review' field can hold several newline-separated reviews.
        for review in entry['review'].split('\n'):
            review_doc = nlp(review)  # parse once, reuse for all four scores
            product_names.append(beer_name)
            product_reviews.append(review)
            # Same value as cosine_similarity(review), without re-parsing.
            similarity_scores.append(review_doc.similarity(processed_attr))

            dark_score.append(review_doc.similarity(attr_docs[0]))
            sweet_score.append(review_doc.similarity(attr_docs[1]))
            thick_score.append(review_doc.similarity(attr_docs[2]))


In [5]:
output = pd.DataFrame()

output['product_name'] = product_names
output['product_review'] = product_reviews
output['similarity_score'] = similarity_scores
    
output.to_csv('cosine_similarity.csv', index=False)
output

| | product_name | product_review | similarity_score |
|---|---|---|---|
| 0 | Kentucky Brunch Brand Stout | 2020 vintage drank 10/22/21 Incredible smell ... | 0.716884 |
| 1 | Kentucky Brunch Brand Stout | 2020 vintage acquired during the pandemic. It ... | 0.592988 |
| 2 | Kentucky Brunch Brand Stout | Long time waiting to tick this one, and I have... | 0.664378 |
| 3 | Kentucky Brunch Brand Stout | This review is for the 2019 batch. It was bott... | 0.664211 |
| 4 | Kentucky Brunch Brand Stout | Supreme maple OD! Soooo easy drinking & well-t... | 0.579176 |
| ... | ... | ... | ... |
| 6217 | Dragonsaddle | For me, I have experienced all of the hoof hea... | 0.611540 |
| 6218 | Dragonsaddle | fresh can drank on 11/5/2016 very hazy golde... | 0.694231 |
| 6219 | Dragonsaddle | 1 PINT can Served in an oversized stemless win... | 0.725859 |
| 6220 | Dragonsaddle | L: Pours a dark gold with a nice head that sta... | 0.703627 |
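
The notebook stops at per-review scores. One plausible way to turn them into a recommendation is to average the scores per product and rank the averages. The sketch below assumes that aggregation rule; `rank_products`, `top_n`, and the toy scores are hypothetical and not part of the original analysis.

```python
from collections import defaultdict
from statistics import mean

def rank_products(product_names, similarity_scores, top_n=3):
    """Rank products by the mean similarity of their reviews (highest first)."""
    scores_by_product = defaultdict(list)
    for name, score in zip(product_names, similarity_scores):
        scores_by_product[name].append(score)
    averages = {name: mean(scores) for name, scores in scores_by_product.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Toy values (hypothetical), shaped like the parallel lists built in In [4]
toy_names = ["Beer A", "Beer A", "Beer B", "Beer B", "Beer C"]
toy_scores = [0.70, 0.60, 0.50, 0.40, 0.68]
ranking = rank_products(toy_names, toy_scores, top_n=2)
print(ranking)
```

With the notebook's `output` DataFrame, `output.groupby('product_name')['similarity_score'].mean().sort_values(ascending=False)` would give the same ordering.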


In [6]:

# Per-review similarity to each individual attribute word
df_eachattr = pd.DataFrame()

df_eachattr['product_name'] = product_names
df_eachattr['dark score'] = dark_score
df_eachattr['sweet score'] = sweet_score
df_eachattr['thick score'] = thick_score

df_eachattr

| | product_name | dark score | sweet score | thick score |
|---|---|---|---|---|
| 0 | Kentucky Brunch Brand Stout | 0.589891 | 0.641570 | 0.517301 |
| 1 | Kentucky Brunch Brand Stout | 0.497966 | 0.540066 | 0.409405 |
| 2 | Kentucky Brunch Brand Stout | 0.551649 | 0.591737 | 0.477392 |
| 3 | Kentucky Brunch Brand Stout | 0.549468 | 0.582425 | 0.487953 |
| 4 | Kentucky Brunch Brand Stout | 0.485375 | 0.539801 | 0.389077 |
| ... | ... | ... | ... | ... |
| 6217 | Dragonsaddle | 0.510733 | 0.565588 | 0.416673 |
| 6218 | Dragonsaddle | 0.575380 | 0.640527 | 0.478688 |
| 6219 | Dragonsaddle | 0.588389 | 0.652832 | 0.529169 |
| 6220 | Dragonsaddle | 0.588706 | 0.634226 | 0.494157 |
