# Assignment 1
## Team
- **Abhinav Sharma (ass2575)**
- **Archit Patel (ajp4737)**
- **Vishal Gupta (vg22846)**
- **Vivek Mehendiratta (vm24395)** 
- **Yashaswini Kalva (yk8348)**

In [6]:
import string

import numpy as np
import pandas as pd
import spacy
from numpy.linalg import norm
from nltk import word_tokenize, Counter
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# load the medium English pipeline (includes word vectors, used in Task F)
nlp = spacy.load('en_core_web_md')

In [7]:
data = pd.read_csv('beer_reviews.csv')
data = data.iloc[:, 1:]
data.shape

(6227, 3)

In [8]:
data.head()

Unnamed: 0,product_name,product_review,user_rating
0,Kentucky Brunch Brand Stout,"Long time waiting to tick this one, and I have...",4.56
1,Kentucky Brunch Brand Stout,This review is for the 2019 batch. It was bott...,5.0
2,Kentucky Brunch Brand Stout,Supreme maple OD! Soooo easy drinking & well-t...,5.0
3,Kentucky Brunch Brand Stout,I have now had 4 different years of KBBS and c...,5.0
4,Kentucky Brunch Brand Stout,2020 Bottle. Absolutely bonkers Maple Syrup o...,5.0


In [9]:
data.isna().sum()

product_name      0
product_review    1
user_rating       0
dtype: int64

In [10]:
data = data.dropna()
data.reset_index(drop=True, inplace=True)
data.shape

(6226, 3)

In [11]:
# function to lemmatize all words in a review
def lemmatization(text):
    text = nlp(text)
    # replace every token with its lemma and rebuild a space-separated string
    text_lemma = [word.lemma_ for word in text]
    return " ".join(text_lemma)

In [12]:
# stripping the review
data['product_review'] = data['product_review'].astype(str).str.strip()

# lemmatizing the words
data['product_review'] = data['product_review'].map(lemmatization)

# Task B

### Cleaning the data

In [13]:
# creating the text from comments 
text = ' '.join(data['product_review'])
text = text.lower()

# getting stopword list from nltk
stopwords_list = stopwords.words('english')
digit_list = list(string.digits)
punctuation_list = list(string.punctuation)

# tokenizing words
text_token = word_tokenize(text)
text_token_counter = Counter(text_token)

# creating dataframe for frequency table
text_token_df = pd.DataFrame(data=None, columns=['words', 'frequency'])
text_token_df['words'] = text_token_counter.keys()
text_token_df['frequency'] = text_token_counter.values()

# sorting the text token df and getting ranks
text_token_df.sort_values('frequency', inplace=True, ascending=False)
text_token_df.reset_index(drop=True, inplace=True)
text_token_df['rank'] = text_token_df['frequency'].rank(method='min', ascending=False).astype(int)

# removing stop words, punctuation, and single-digit tokens from the word list
mask = ~(text_token_df['words'].isin(stopwords_list) | 
          text_token_df['words'].isin(digit_list) |
          text_token_df['words'].isin(punctuation_list))
token_cleaned_df = text_token_df[mask]

token_cleaned_df

Unnamed: 0,words,frequency,rank
14,beer,5655,15
21,taste,3848,22
22,head,3838,23
23,pour,3739,24
28,chocolate,2916,29
...,...,...,...
13875,redfruit,1,7422
13876,apotheosis,1,7422
13877,jerky,1,7422
13878,ping,1,7422


In [14]:
# token_cleaned_df.head(100).to_csv('beer_review_tokens.csv', index=False)

### Creating words to attributes mapping

In [15]:
# attribute list from the problem statement
attributes_to_word_dict = {
    'aggressive': ['boldly', 'assertive', 'aroma', 'taste'],
    'balanced': ['malt', 'hops', 'malt', 'sweetness', 'hop', 'bitterness', 'balance'],
    'complex': ['multidimensional', 'flavors', 'sensations', 'palate'],
    'crisp':  ['carbonated', 'effervescent'],
    'fruity': ['flavors', 'fruits'],
    'hoppy': ['herbal', 'earthy', 'spicy', 'citric', 'citrus', 'aromas', 'flavors', 'hop'],
    'malty': ['grainy', 'caramel', 'sweet', 'dry'],
    'robust': ['rich', 'bodied']}

# adding more attributes from term frequency analysis
attributes_to_word_dict['aggressive'].append('sour')
attributes_to_word_dict['balanced'].append('bitter')
attributes_to_word_dict['fruity'].append('grapefruit')
attributes_to_word_dict['fruity'].append('pineapple')
attributes_to_word_dict['fruity'].append('mango')
attributes_to_word_dict['fruity'].append('coconut')
attributes_to_word_dict['fruity'].append('tropical')
attributes_to_word_dict['hoppy'].append('maple')
attributes_to_word_dict['malty'].append('bourbon') 
attributes_to_word_dict['malty'].append('vanilla') 
attributes_to_word_dict['malty'].append('oak') 

# lemmatizing keys and values
for k, v in attributes_to_word_dict.items():
    attributes_to_word_dict[k] = [lemmatization(w) for w in v] + [lemmatization(k)]
attributes_to_word_dict

{'aggressive': ['boldly', 'assertive', 'aroma', 'taste', 'sour', 'aggressive'],
 'balanced': ['malt',
  'hop',
  'malt',
  'sweetness',
  'hop',
  'bitterness',
  'balance',
  'bitter',
  'balanced'],
 'complex': ['multidimensional', 'flavor', 'sensation', 'palate', 'complex'],
 'crisp': ['carbonate', 'effervescent', 'crisp'],
 'fruity': ['flavor',
  'fruit',
  'grapefruit',
  'pineapple',
  'mango',
  'coconut',
  'tropical',
  'fruity'],
 'hoppy': ['herbal',
  'earthy',
  'spicy',
  'citric',
  'citrus',
  'aroma',
  'flavor',
  'hop',
  'maple',
  'hoppy'],
 'malty': ['grainy',
  'caramel',
  'sweet',
  'dry',
  'bourbon',
  'vanilla',
  'oak',
  'malty'],
 'robust': ['rich', 'body', 'robust']}

# Task C

Three customer attributes:
* Aggressive
* Balanced
* Hoppy

In [16]:
input_dict = {
    'aggressive': 1,
    'balanced': 1,
    'complex': 0,
    'crisp':  0,
    'fruity': 0,
    'hoppy': 1,
    'malty': 0,
    'robust': 0}

input_attributes = pd.DataFrame(input_dict, index=[0])
input_attributes


Unnamed: 0,aggressive,balanced,complex,crisp,fruity,hoppy,malty,robust
0,1,1,0,0,0,1,0,0


### Attribute identification from reviews - TF matrix

In [17]:
# creating attribute-wise columns to hold occurrence counts in a dataframe
attribute_occurence_df = pd.DataFrame(np.zeros((data.shape[0], len(attributes_to_word_dict.keys()))))
attribute_occurence_df.columns = attributes_to_word_dict.keys()

# joining the review dataframe with the attribute occurrence dataframe
attribute_occurence_df = pd.concat([data, attribute_occurence_df], axis=1)

# populating the occurrence columns
for c in attribute_occurence_df.iloc[:, 3:]:
    model_list = list(attributes_to_word_dict[c])
    print('calculating tf in product_review for attributes - ', c, model_list)
    
    attribute_occurence_df[c] = attribute_occurence_df['product_review'].str.\
        findall('( ' + '|'.join(model_list) + ' )').map(lambda lst: len(lst))

calculating tf in product_review for attributes -  aggressive ['boldly', 'assertive', 'aroma', 'taste', 'sour', 'aggressive']
calculating tf in product_review for attributes -  balanced ['malt', 'hop', 'malt', 'sweetness', 'hop', 'bitterness', 'balance', 'bitter', 'balanced']
calculating tf in product_review for attributes -  complex ['multidimensional', 'flavor', 'sensation', 'palate', 'complex']
calculating tf in product_review for attributes -  crisp ['carbonate', 'effervescent', 'crisp']
calculating tf in product_review for attributes -  fruity ['flavor', 'fruit', 'grapefruit', 'pineapple', 'mango', 'coconut', 'tropical', 'fruity']
calculating tf in product_review for attributes -  hoppy ['herbal', 'earthy', 'spicy', 'citric', 'citrus', 'aroma', 'flavor', 'hop', 'maple', 'hoppy']
calculating tf in product_review for attributes -  malty ['grainy', 'caramel', 'sweet', 'dry', 'bourbon', 'vanilla', 'oak', 'malty']
calculating tf in product_review for attributes -  robust ['rich', 'body
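A note on the matching pattern used in the cell above: in `'( ' + '|'.join(model_list) + ' )'`, the leading and trailing spaces attach only to the first and last alternatives, so the middle words match as substrings. The sketch below illustrates this on a made-up sentence and compares it with a hypothetical word-boundary variant (`strict`), which is not what the notebook ran. The counts reported in this notebook come from the original pattern.

```python
import re

# Toy word list and review text (both invented for illustration)
words = ['herbal', 'hop', 'hoppy']
review = "a hoppy beer with hops and a herbal hop finish"

# Pattern as built in the notebook: spaces bind only to 'herbal' and 'hoppy'
loose = '( ' + '|'.join(words) + ' )'

# Hypothetical stricter variant: whole-word matches only
strict = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'

loose_hits = re.findall(loose, review)
strict_hits = re.findall(strict, review)

print(len(loose_hits), loose_hits)    # 'hop' is also found inside 'hoppy' and 'hops'
print(len(strict_hits), strict_hits)  # only whole words are counted
```

Because the reviews are lemmatized first, plural forms such as "hops" are mostly reduced to "hop" already, so the practical difference is mainly in words like "hoppy" that contain another keyword.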

In [18]:
attribute_occurence_df = attribute_occurence_df[
    (attribute_occurence_df.iloc[:, 3:].sum(axis = 1) != 0)]
attribute_occurence_df.head()

Unnamed: 0,product_name,product_review,user_rating,aggressive,balanced,complex,crisp,fruity,hoppy,malty,robust
0,Kentucky Brunch Brand Stout,"long time wait to tick this one , and I have t...",4.56,0,0,0,0,0,1,0,0
1,Kentucky Brunch Brand Stout,this review be for the 2019 batch . it be bott...,5.0,1,0,1,0,1,2,1,0
2,Kentucky Brunch Brand Stout,Supreme maple OD ! soooo easy drinking & well ...,5.0,0,0,0,0,0,1,0,0
3,Kentucky Brunch Brand Stout,I have now have 4 different year of KBBS and c...,5.0,1,0,0,0,0,1,1,0
4,Kentucky Brunch Brand Stout,2020 bottle . absolutely bonker Maple Syrup ...,5.0,0,0,0,0,0,0,1,0


### Getting cosine similarity

In [19]:
def cosineSimilarity(review_att, cust_input):
    # cosine of the angle between a review's attribute counts and the customer vector
    return np.dot(review_att, cust_input) / (norm(review_att) * norm(cust_input))

# score every review; the result aligns on the index, so no sorting is needed here
attribute_occurence_df['similarity_score'] = attribute_occurence_df.iloc[:, 3:].apply(
    lambda x: cosineSimilarity(x.values, input_attributes.values[0]), axis=1)
attribute_occurence_df.head()

Unnamed: 0,product_name,product_review,user_rating,aggressive,balanced,complex,crisp,fruity,hoppy,malty,robust,similarity_score
0,Kentucky Brunch Brand Stout,"long time wait to tick this one , and I have t...",4.56,0,0,0,0,0,1,0,0,0.57735
1,Kentucky Brunch Brand Stout,this review be for the 2019 batch . it be bott...,5.0,1,0,1,0,1,2,1,0,0.612372
2,Kentucky Brunch Brand Stout,Supreme maple OD ! soooo easy drinking & well ...,5.0,0,0,0,0,0,1,0,0,0.57735
3,Kentucky Brunch Brand Stout,I have now have 4 different year of KBBS and c...,5.0,1,0,0,0,0,1,1,0,0.666667
4,Kentucky Brunch Brand Stout,2020 bottle . absolutely bonker Maple Syrup ...,5.0,0,0,0,0,0,0,1,0,0.0
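
The scores in the table above can be checked by hand. The sketch below recomputes the cosine similarity for rows 0, 1, 3, and 4 from their attribute counts and the customer vector `[1, 1, 0, 0, 0, 1, 0, 0]`. The helper name `cosine_similarity` and the `rows` dictionary are introduced only for this illustration.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(review_att, cust_input):
    # dot product divided by the product of the two vector lengths
    return np.dot(review_att, cust_input) / (norm(review_att) * norm(cust_input))

# order: aggressive, balanced, complex, crisp, fruity, hoppy, malty, robust
customer = np.array([1, 1, 0, 0, 0, 1, 0, 0])

# attribute counts copied from the first rows of the table above
rows = {
    0: [0, 0, 0, 0, 0, 1, 0, 0],
    1: [1, 0, 1, 0, 1, 2, 1, 0],
    3: [1, 0, 0, 0, 0, 1, 1, 0],
    4: [0, 0, 0, 0, 0, 0, 1, 0],
}

scores = {i: float(cosine_similarity(np.array(v), customer)) for i, v in rows.items()}
for i, s in scores.items():
    print(i, round(s, 6))
```

Row 0 gives 1 / (1 · √3) ≈ 0.57735, row 1 gives 3 / (√8 · √3) ≈ 0.612372, row 3 gives 2 / (√3 · √3) ≈ 0.666667, and row 4 shares no attribute with the customer, so its score is 0.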


In [20]:
attribute_occurence_df.sort_values('similarity_score', ascending=False)

Unnamed: 0,product_name,product_review,user_rating,aggressive,balanced,complex,crisp,fruity,hoppy,malty,robust,similarity_score
967,Ann,"from a bomber . pour golden . smell tart , cit...",4.79,1,1,0,0,0,1,0,0,1.0
4175,Art,"B3 , hoppypocket , ca n't thank you enough . t...",4.49,1,1,0,0,0,1,0,0,1.0
5781,Emerald Grouper,"pour a cleanse , golden orange with a thin , w...",4.80,1,1,0,0,0,1,0,0,1.0
1863,Sip Of Sunshine,"dark gold appearance , clear not hazy smell of...",3.58,1,1,0,0,0,1,0,0,1.0
1865,Sip Of Sunshine,not much more to add . this beer not only tast...,4.61,1,1,0,0,0,1,0,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5139,Clare's Thirsty Ale,tan head . black color . 2018 vintage . rasp...,4.50,0,0,0,0,0,0,1,0,0.0
5143,Clare's Thirsty Ale,thank Ryan for share a mini growler of this li...,4.69,0,0,0,0,0,0,3,2,0.0
5924,Cellarman Barrel Aged Saison,"tart granny smith apple , lemon , yogurt , lim...",4.44,0,0,0,1,0,0,2,0,0.0
3464,Kaggen! Stormaktsporter,2018 bottle in a BIF courtesy of mtkatl l : ...,4.57,0,0,0,0,2,0,1,0,0.0


In [21]:
output_df = attribute_occurence_df[['product_name', 'product_review', 'similarity_score']]
output_df.to_csv('customer_review_similarity.csv', index=False)

In [24]:
avg_review_similarity = output_df.groupby('product_name')['similarity_score'].mean().reset_index()
avg_review_similarity.sort_values('similarity_score', ascending=False)

Unnamed: 0,product_name,similarity_score
128,Hop JuJu,0.770763
89,Double Sunshine,0.752851
21,Bad Boy,0.741428
193,Pseudo Sue - Double Dry-Hopped,0.737662
248,Zombie Dust,0.727463
...,...,...
195,Resolute - Coconut,0.248326
34,Black Tuesday,0.235395
28,Beer Geek Vanilla Shake - Bourbon Barrel-Aged,0.226207
238,Vanilla Bean Assassin,0.223602


# Task D

In [25]:
senti_analyzer = SentimentIntensityAnalyzer()

def review_sentiment(review):
    score = senti_analyzer.polarity_scores(review)
    return score['compound']

attribute_occurence_df['sentiment_score'] = attribute_occurence_df['product_review'].map(review_sentiment)

In [26]:
attribute_occurence_df.head()

Unnamed: 0,product_name,product_review,user_rating,aggressive,balanced,complex,crisp,fruity,hoppy,malty,robust,similarity_score,sentiment_score
0,Kentucky Brunch Brand Stout,"long time wait to tick this one , and I have t...",4.56,0,0,0,0,0,1,0,0,0.57735,0.6369
1,Kentucky Brunch Brand Stout,this review be for the 2019 batch . it be bott...,5.0,1,0,1,0,1,2,1,0,0.612372,0.8194
2,Kentucky Brunch Brand Stout,Supreme maple OD ! soooo easy drinking & well ...,5.0,0,0,0,0,0,1,0,0,0.57735,0.9018
3,Kentucky Brunch Brand Stout,I have now have 4 different year of KBBS and c...,5.0,1,0,0,0,0,1,1,0,0.666667,0.8689
4,Kentucky Brunch Brand Stout,2020 bottle . absolutely bonker Maple Syrup ...,5.0,0,0,0,0,0,0,1,0,0.0,-0.5487


# Task E

In [27]:
evaluation_df = attribute_occurence_df.groupby('product_name')[['similarity_score', 'sentiment_score']].mean().\
reset_index()
evaluation_df['evaluation_score'] = evaluation_df['similarity_score'] + evaluation_df['sentiment_score']
evaluation_df.sort_values('evaluation_score', ascending=False, inplace=True)
recommendation = evaluation_df.head(3)['product_name'].values
recommendation

array(['Double Stack', 'Pliny The Younger', 'Keene Idea'], dtype=object)

# Task F

In [28]:
cust_pref = 'aggressive balanced hoppy'
cust_pref_doc = nlp(cust_pref)  # parse the preference string once instead of once per review

def spacy_similarity(review, cust_pref_doc):
    # similarity between the review's and the preferences' document vectors
    review_doc = nlp(review)
    return review_doc.similarity(cust_pref_doc)

attribute_occurence_df['spacy_similarity_score'] = attribute_occurence_df['product_review'].map(
    lambda x: spacy_similarity(x, cust_pref_doc))

In [29]:
evaluation_df_w2v = attribute_occurence_df.groupby('product_name')[['spacy_similarity_score', 'sentiment_score']].mean().\
reset_index()
evaluation_df_w2v['evaluation_score_w2v'] = evaluation_df_w2v['spacy_similarity_score'] + evaluation_df_w2v['sentiment_score']
evaluation_df_w2v.sort_values('evaluation_score_w2v', ascending=False, inplace=True)
recommendation_w2v = evaluation_df_w2v.head(3)['product_name'].values
recommendation_w2v

array(['Mexican Cake - Maple Bourbon Barrel-Aged',
       'Genealogy Of Morals - Bourbon Barrel-Aged',
       'Mother Of All Storms'], dtype=object)

**Comparison between the two methods using the number of reviews that mention each attribute**

In [30]:
cust_attributes = list(input_attributes.T[(input_attributes.T == 1)].dropna().index)

mask1 = attribute_occurence_df['product_name'].isin(recommendation)
# copy the slice so the assignment below does not trigger SettingWithCopyWarning
bow_reco_check = attribute_occurence_df.loc[mask1].copy()
# convert counts to 0/1 flags: does the review mention the attribute at all?
bow_reco_check[cust_attributes] = bow_reco_check[cust_attributes].astype(bool).astype(int)
bow_reco_check.groupby('product_name')[cust_attributes].sum()

product_name,aggressive,balanced,hoppy
Double Stack,20,16,24
Keene Idea,9,21,21
Pliny The Younger,15,20,18


In [31]:
mask2 = attribute_occurence_df['product_name'].isin(recommendation_w2v)

# copy the slice so the assignment below does not trigger SettingWithCopyWarning
spacy_reco_check = attribute_occurence_df.loc[mask2].copy()
spacy_reco_check[cust_attributes] = spacy_reco_check[cust_attributes].astype(bool).astype(int)
spacy_reco_check.groupby('product_name')[cust_attributes].sum()

product_name,aggressive,balanced,hoppy
Genealogy Of Morals - Bourbon Barrel-Aged,16,11,14
Mexican Cake - Maple Bourbon Barrel-Aged,9,12,19
Mother Of All Storms,19,22,15


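Summed over the three recommended beers, the bag-of-words recommendations have more reviews that mention balanced (57 vs. 45) and hoppy (63 vs. 48) than the spaCy word-vector recommendations, and the same number that mention aggressive (44 vs. 44). These are raw counts, not shares of each beer's reviews. Bag of words works well in this particular case, but the result may not generalize.

Also, the spaCy word vectors are trained on a general corpus, which may not be relevant to the beer domain and these reviews.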
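
Since beers can have different numbers of reviews, raw counts and percentages can tell different stories. The sketch below shows one way to report both on a toy dataframe. The beer names, column values, and variable names (`toy`, `flags`, `counts`, `share`) are invented for illustration and are not taken from the review data.

```python
import pandas as pd

# Toy data: per-review attribute counts for two hypothetical beers
toy = pd.DataFrame({
    'product_name': ['Beer A', 'Beer A', 'Beer A', 'Beer A', 'Beer B', 'Beer B'],
    'aggressive':   [2, 0, 1, 0, 0, 1],
    'hoppy':        [1, 1, 0, 0, 3, 2],
})
attrs = ['aggressive', 'hoppy']

# 0/1 flag per review: does it mention the attribute at all?
flags = toy[attrs].astype(bool).astype(int)

# number of reviews mentioning each attribute, per beer
counts = flags.groupby(toy['product_name']).sum()

# percentage of each beer's reviews mentioning the attribute
share = flags.groupby(toy['product_name']).mean() * 100

print(counts)
print(share)
```

In this toy case Beer A has more aggressive mentions than Beer B (2 vs. 1), yet the same 50% share, because Beer A has twice as many reviews.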

# Task G

In [32]:
evaluation_df_rating = attribute_occurence_df.groupby('product_name')['user_rating'].mean().reset_index()
evaluation_df_rating.sort_values('user_rating', ascending=False, inplace=True)
recommendation_rating = evaluation_df_rating['product_name'].head(3).values
recommendation_rating

array(['SR-71', 'Chemtrailmix', 'Blessed'], dtype=object)

In [33]:
mask3 = attribute_occurence_df['product_name'].isin(recommendation_rating)

# copy the slice so the assignment below does not trigger SettingWithCopyWarning
rating_reco_check = attribute_occurence_df.loc[mask3].copy()
rating_reco_check[cust_attributes] = rating_reco_check[cust_attributes].astype(bool).astype(int)
rating_reco_check.groupby('product_name')[cust_attributes].sum()

product_name,aggressive,balanced,hoppy
Blessed,13,11,11
Chemtrailmix,13,11,9
SR-71,11,8,12


An individual rating reflects only that reviewer's opinion and may not reflect how others feel about the beer. A rating expresses overall preference and is independent of the attributes the customer asked for. One reviewer may rate a beer higher because of a particular attribute, while another may rate it lower because of the same attribute. That is why both the bag-of-words and spaCy recommendations cover the requested attributes better than the rating-only recommendations: summed over the three recommended beers, the rating method yields 37 aggressive, 30 balanced, and 32 hoppy mentions, fewer than either of the other methods.