# Purpose
This notebook handles calculating which whiskies are similar to each other.

The first step is to aggregate the dataframe to the whisky level. This involves:

- Taking the mean of the ratings for each whisky.
- Concatenating the text of all reviews for each whisky.

While doing this we will also clean the text and lemmatize it so that it's ready for the next step.

Next we will calculate similarities.

To do this we train a word2vec model on all of the whisky reviews.

Afterwards we use Word Mover's Distance to score each whisky against all other whiskies.

The output is a table in which each row contains two whisky ids and the similarity between them.

## Load Libraries and Data

In [406]:
import multiprocessing as mp
import re
from collections import defaultdict

import pandas as pd
from gensim import corpora
from gensim.parsing.preprocessing import (preprocess_string, remove_stopwords,
                                          strip_multiple_whitespaces,
                                          strip_numeric, strip_punctuation,
                                          strip_short, strip_tags)
from gensim.similarities import WmdSimilarity
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from pyemd import emd

In [400]:
reviews = pd.read_parquet('data/review_cats.parquet')



## Roll up to Whisky Level And Preprocess
We can join back to other tables for most information. What we need here is the whisky itemnumber, the aggregated review score, and the aggregated nose, taste, and finish text.

### Roll Up

In [401]:
# custom aggregation that joins all strings in the group, separated by a space
# so that the last word of one review doesn't fuse with the first word of the next
def words(col):
    return ' '.join(col)

# custom aggregation to hold the Reddit review IDs and Reddit whisky IDs we've joined,
# in case we ever want to go backwards
def distinctlist(col):
    return list(set(col))

# create rolled up dataframe
whisky = (reviews[['rating','style','nose','taste','finish', 'RedditWhiskyID', 'reviewID', 'Name', 'itemnumber']]
         .groupby(['Name','itemnumber'])
         .agg({
             'RedditWhiskyID': distinctlist,
             'reviewID': list,
             'rating': ['mean','std'],
             'style': pd.Series.mode,
             'nose': words,
             'taste': words,
             'finish': words
             }
            )
)

# collapse multiindex on columns
whisky.columns = ['_'.join(col).strip() for col in whisky.columns.values]

# rename columns
whisky = whisky.rename({'RedditWhiskyID_distinctlist': 'RedditWhiskyIDs',
              'reviewID_list': 'reviewIDs',
              'style_mode':'style',
              'nose_words':'nose',
              'taste_words':'taste',
              'finish_words':'finish'
             }, axis='columns')

whisky.head()

Name,itemnumber,RedditWhiskyIDs,reviewIDs,rating_mean,rating_std,style,nose,taste,finish
12 YO KNAPPOGUE CASTLE IRISH SINGLE MALT WHISKEY,619320,"[5277, 5278, 5279]","[20457, 20458, 20459, 20460, 20461, 20462, 204...",75.1,8.020114,Ireland,"n crisp apple, lots of peach, vanilla, honey. ...","p its fresh and fruity again, lots of peaches ...",f fruity and malty. the malt dies off quickly ...
1792 SINGLE BARREL KENTUCKY STRAIGHT BOURBON WHISKY,496729,"[24, 20, 21, 23]","[60, 61, 62, 63, 65, 66]",72.166667,9.042492,Bourbon,"sweet corn, oak, cotton candy/birthday cake, ...","seaweed, corn, mint, brown sugar crackers, br...","edamame, ginger, wheat, werther's, corn mediu..."
1792 SMALL BATCH KENTUCKY STRAIGHT BOURBON,208918,[25],"[67, 68, 69, 70, 71, 72, 73, 74, 75, 76]",76.2,4.263541,Bourbon,this is the 1792 i remember. custard and bana...,"again, similar to my recollections. hotter th...","dry. wood char, barrel flavour. yeasty? herba..."
601 BOURBON,634519,[29],[97],58.0,,Bourbon,"grain funk, milled corn, herbal, wet dirt. sm...","young sharp corn graininess, astringent, copp...","short, medium warmth, canned corn and oak not..."
ABERFELDY 12 YEAR OLD,255281,"[36, 37]","[113, 114, 115, 116, 117, 118, 119, 120, 121, ...",76.931034,6.181372,Highland,"•\t slight salty tones, but also a bit of swee...","•\t fairly sweet, a fair bit of burn, plums, a...",let's start with the arran sauternes cask •\...


### Preprocess Reviews
Here we need to do some data cleaning on the text, but the important part is the lemmatization. Here's a summary of what we are doing:

- Make everything lowercase
- Replace weird apostrophes and quotes with normal ones
- Remove any html tags that might have ended up in here
- Tag each word's part of speech, then lemmatize
- Remove all punctuation
- Remove words less than 3 letters
- Strip duplicated whitespace
- Remove all numbers
- Remove stopwords such as the, and, etc


Gensim has some nifty preprocessing functions.
Here we define a list of filters we want to run on the data, then apply it to our reviews.
The built-in functions have some limitations, so we will need to define a couple of our own as well:

- The standard filters don't catch some curly apostrophes and quotation marks, so we will replace those.
- Gensim's lemmatizer relies on dependencies that only support Python 2, so we'll use nltk instead.
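
As a rough illustration of how a filter list is applied, the sketch below mimics the chaining in plain Python: each filter takes a string and returns a string, and the result is split on whitespace at the end. The helper names (`apply_filters`, `drop_punctuation`, `drop_short`) are made up for this example and are not gensim functions.

```python
import re
import string

def drop_punctuation(s):
    # replace every punctuation character with a space
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', s)

def drop_short(s, minsize=3):
    # keep only words with at least `minsize` characters
    return ' '.join(w for w in s.split() if len(w) >= minsize)

def apply_filters(text, filters):
    # run the filters in the order given, then split into tokens
    for f in filters:
        text = f(text)
    return text.split()

FILTERS = [str.lower, drop_punctuation, drop_short]
tokens = apply_filters("Crisp APPLE, lots of peach; a bit of oak.", FILTERS)
print(tokens)
```

The order of the filters matters: here punctuation is stripped before short words are dropped, so a fragment left behind by a removed apostrophe or comma is also cleaned up.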

In [408]:
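# work on a copy so the rolled-up dataframe keeps the raw text
whiskyp = whisky.copy()

# The standard filters aren't catching the curly quotation marks and apostrophes in some reviews,
# so add a function that replaces them with their plain equivalents:
def fix_apostrophes(s):
    return s.replace('“', '"').replace('”', '"').replace('’', "'")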

# gensim has a lemmatizer but it requires some libraries that haven't been updated to python 3,
# and it doesn't handle part-of-speech tags as well.
# here we define a lemmatize function that tags each word first and converts the nltk tag to a wordnet tag.
def lemmatize_text(text):
    
    lemmatizer = WordNetLemmatizer() 

    lemmatized_output = " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_tokenize(text)])
    return lemmatized_output

def get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# List all of the functions we want to apply
CUSTOM_FILTERS = [lambda x: x.lower(),         # make everything lowercase
                  fix_apostrophes,             # replace curly apostrophes and quotes with plain ones
                  strip_tags,                  # remove any html tags that might have ended up in here
                  lemmatize_text,              # tag part of speech then lemmatize
                  strip_punctuation,           # remove all punctuation
                  strip_short,                 # remove words shorter than 3 letters
                  strip_multiple_whitespaces,  # strip duplicated whitespace
                  strip_numeric,               # remove all numbers
                  remove_stopwords             # remove stopwords such as the, and, etc
                 ]

# process columns
for column in ['nose','taste','finish']:
    whiskyp[column] = whiskyp[column].apply(lambda text: preprocess_string(text, CUSTOM_FILTERS))

# take a look
whiskyp.head()

Name,itemnumber,RedditWhiskyIDs,reviewIDs,rating_mean,rating_std,style,nose,taste,finish
12 YO KNAPPOGUE CASTLE IRISH SINGLE MALT WHISKEY,619320,"[5277, 5278, 5279]","[20457, 20458, 20459, 20460, 20461, 20462, 204...",75.1,8.020114,Ireland,"[crisp, apple, lot, peach, vanilla, honey, fru...","[fresh, fruity, lot, peach, fresh, peach, sour...","[fruity, malty, malt, quickly, peach, note, ha..."
1792 SINGLE BARREL KENTUCKY STRAIGHT BOURBON WHISKY,496729,"[24, 20, 21, 23]","[60, 61, 62, 63, 65, 66]",72.166667,9.042492,Bourbon,"[sweet, corn, oak, cotton, candy, birthday, ca...","[seaweed, corn, mint, brown, sugar, cracker, b...","[edamame, ginger, wheat, werther, corn, medium..."
1792 SMALL BATCH KENTUCKY STRAIGHT BOURBON,208918,[25],"[67, 68, 69, 70, 71, 72, 73, 74, 75, 76]",76.2,4.263541,Bourbon,"[remember, custard, banana, lot, custard, bana...","[similar, recollection, hotter, expect, thinne...","[dry, wood, char, barrel, flavour, yeasty, her..."
601 BOURBON,634519,[29],[97],58.0,,Bourbon,"[grain, funk, corn, herbal, wet, dirt, smell, ...","[young, sharp, corn, graininess, astringent, c...","[short, medium, warmth, corn, oak, note, herb,..."
ABERFELDY 12 YEAR OLD,255281,"[36, 37]","[113, 114, 115, 116, 117, 118, 119, 120, 121, ...",76.931034,6.181372,Highland,"[slight, salty, tone, bit, sweet, surprised, s...","[fairly, sweet, fair, bit, burn, plum, vaguely...","[let, start, arran, sauterne, cask, medium, di..."


### Save to File

In [409]:
# style can hold more than one value when a whisky's mode is tied, so cast it to str before writing to parquet
whiskyp['style'] = whiskyp['style'].astype('str')
whiskyp.to_parquet('data/whisky_processed.parquet')

## Similarity

In [313]:
whiskyp = pd.read_parquet('data/whisky_processed.parquet')



### Build Document Texts
Before training our model we need lists of tokens from the reviews. Here we create a function to do this and then use it on the nose, taste, and finish columns.

At this point we will also remove all of the region/style names to make sure that we aren't inadvertently letting the model cluster whiskies based on region alone.

Some odd or misspelled words have also made their way in, so let's keep only English words here. This should also help get rid of distillery names that we didn't catch.
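
To make the filtering concrete, here is a minimal sketch on toy data using only the standard library. It covers the frequency and region filters; the English-word check is left out because it needs WordNet. The function name `filter_tokens` and the sample documents are invented for illustration.

```python
from collections import Counter

def filter_tokens(docs, banned, min_count=2):
    # count every token across all documents
    counts = Counter(token for doc in docs for token in doc)
    banned = set(banned)
    # keep tokens that occur at least `min_count` times and are not banned
    return [[t for t in doc if counts[t] >= min_count and t not in banned]
            for doc in docs]

docs = [['peat', 'smoke', 'islay', 'brine'],
        ['smoke', 'peat', 'vanilla'],
        ['vanilla', 'islay', 'honey']]
filtered = filter_tokens(docs, banned=['islay'])
print(filtered)
```

Here `brine` and `honey` are dropped because they occur only once in the whole dataset, and `islay` is dropped because it is a region name.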

In [379]:
# Get list of regions to filter out of text.
# The length restriction is to handle whiskies where people have marked it as two different regions.
regions = [item.lower() for item in whiskyp['style'].unique() if len(item) < 15]

# Add some more words not caught by this:
regions = regions + ['irish', 'whisky']

In [380]:
# takes a column and returns a list of tokens that occur more than once in the dataset
def createtextlist(column):

    # convert to list of lists
    texts = [list(document) for document in column]

    # remove words that only occur once since they won't help find similarities
    frequency = defaultdict(int)
    # calculate frequencies
    for text in texts:
        for token in text:
            frequency[token] += 1

    # filter out regions, tokens that occur only once, and non-English words
    texts = [
        [token for token in text
         if token not in regions and frequency[token] > 1 and is_word(token)]
        for text in texts
    ]

    return texts

# a token counts as an English word if WordNet has at least one synset for it
def is_word(word):
    return bool(wordnet.synsets(word))

In [381]:
# generate lists of tokens for each column
nose_texts   = createtextlist(whiskyp.nose  )
taste_texts  = createtextlist(whiskyp.taste )
finish_texts = createtextlist(whiskyp.finish)

### Word2Vec
Word2Vec learns a vector for each word from the words that appear around it, so words used in similar contexts end up close together. These vectors will be used to calculate our Word Mover's Distance later on.

Since we want as much data as possible for training, we will train on the full dataset.

In [384]:
from gensim.models import Word2Vec

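# Train on the full dataset of nose, taste, and finish texts; more training data should yield better vectors
texts = nose_texts + taste_texts + finish_texts

# Train Word2Vec with 100-dimensional vectors
# (the `size` argument is named `vector_size` from gensim 4.0 onwards)
model = Word2Vec(texts, size=100)

# L2-normalize the word vectors in place; replace=True discards the original vectors to save memory,
# after which the model can no longer be trained further
model.init_sims(replace=True)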

In [388]:
# Test model
word = 'coffee'
model.wv.most_similar(positive=word)

[('espresso', 0.7701224684715271),
 ('milk', 0.7517192363739014),
 ('dark', 0.705661416053772),
 ('nib', 0.7011394500732422),
 ('mexican', 0.6960846185684204),
 ('leather', 0.6845758557319641),
 ('cacao', 0.6811543703079224),
 ('fudge', 0.6733429431915283),
 ('milky', 0.6686280369758606),
 ('bean', 0.6618437767028809)]
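
The scores returned by `most_similar` are cosine similarities between word vectors. As a small illustration, the sketch below ranks words the same way using plain Python; the three-dimensional vectors are invented for the example and do not come from the trained model, and `toy_most_similar` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors, made up for illustration only
toy_vectors = {
    'coffee':   [0.9, 0.1, 0.3],
    'espresso': [0.8, 0.2, 0.4],
    'grass':    [0.1, 0.9, 0.0],
}

def toy_most_similar(word, vectors, topn=2):
    # score every other word against the query word, highest first
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar('coffee', toy_vectors)
print(ranking)
```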

### Reduce Words with TF-IDF

In [389]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# whiskyt is a second name for the same dataframe (not a copy),
# so the TF-IDF columns added below also appear on whiskyp
whiskyt = whiskyp

def fit_tfidf(df, columnname):
    vec = TfidfVectorizer(lowercase = False, max_df=0.7, min_df=0.04)
    # change column to string as that's the required input
    string = df[columnname].apply(lambda x : ' '.join(x))
    # fit
    tfidf = vec.fit_transform(string)
    features = vec.get_feature_names()
    
    return tfidf, features

nose_tfidf  , nose_features   = fit_tfidf(whiskyp, 'nose')
taste_tfidf , taste_features  = fit_tfidf(whiskyp, 'taste')
finish_tfidf, finish_features = fit_tfidf(whiskyp, 'finish')

In [390]:
# to get top values for each row:
def top_tfidf_features(row, features, top_n=None, include_values=False):
    # sparse array to dense
    row = row.toarray()[0]
    if top_n:
        topn = np.argsort(row)[::-1][:top_n]
    else:
        topn = np.argsort(row)[::-1]
        
    if include_values:
        tfidfed_row = [(features[i], row[i]) for i in topn if row[i] > 0]
    else:
        tfidfed_row = [features[i] for i in topn if row[i] > 0]
    return tfidfed_row

In [391]:
# add a positional index column to use as input for apply
whiskyt['index_col'] = range(0, len(whiskyt))
topcount = 50

# keep the top `topcount` TF-IDF terms of each whisky, per column
whiskyt['nose_tfidf'] = whiskyt['index_col'].apply(
    lambda i: top_tfidf_features(nose_tfidf[i], nose_features, top_n=topcount))

whiskyt['taste_tfidf'] = whiskyt['index_col'].apply(
    lambda i: top_tfidf_features(taste_tfidf[i], taste_features, top_n=topcount))

whiskyt['finish_tfidf'] = whiskyt['index_col'].apply(
    lambda i: top_tfidf_features(finish_tfidf[i], finish_features, top_n=topcount))

In [392]:
whiskyt.head()

Name,itemnumber,RedditWhiskyIDs,reviewIDs,rating_mean,rating_std,style,nose,taste,finish,index_col,nose_tfidf,taste_tfidf,finish_tfidf
12 YO KNAPPOGUE CASTLE IRISH SINGLE MALT WHISKEY,619320,"[5277, 5278, 5279]","[20457, 20458, 20459, 20460, 20461, 20462, 204...",75.1,8.020114,Ireland,"[crisp, apple, lot, peach, vanilla, honey, fru...","[fresh, fruity, lot, peach, fresh, peach, sour...","[fruity, malty, malt, quickly, peach, note, ha...",0,"[peach, expect, grass, dry, tropical, signatur...","[peach, banana, oakiness, whiskey, like, marzi...","[vanilla, malt, hang, peach, cereal, banana, b..."
1792 SINGLE BARREL KENTUCKY STRAIGHT BOURBON WHISKY,496729,"[24, 20, 21, 23]","[60, 61, 62, 63, 65, 66]",72.166667,9.042492,Bourbon,"[sweet, corn, oak, cotton, candy, birthday, ca...","[seaweed, corn, mint, brown, sugar, cracker, b...","[edamame, ginger, wheat, werther, corn, medium...",1,"[whip, furniture, heat, spice, cream, polish, ...","[punchy, hot, acetone, bold, extra, unripe, sk...","[warm, linger, spice, orchard, sap, tree, spir..."
1792 SMALL BATCH KENTUCKY STRAIGHT BOURBON,208918,[25],"[67, 68, 69, 70, 71, 72, 73, 74, 75, 76]",76.2,4.263541,Bourbon,"[remember, custard, banana, lot, custard, bana...","[similar, recollection, hotter, expect, thinne...","[dry, wood, char, barrel, flavour, yeasty, her...",2,"[batch, proof, small, soda, rye, banana, mute,...","[glencairn, rest, little, minute, product, bou...","[tingle, spice, wood, clove, barrel, dry, cinn..."
601 BOURBON,634519,[29],[97],58.0,,Bourbon,"[grain, funk, corn, herbal, wet, dirt, smell, ...","[young, sharp, corn, graininess, astringent, c...","[short, medium, warmth, corn, oak, note, herb,...",3,"[dirt, funk, wet, herbal, corn, white, bourbon...","[copper, astringent, sharp, corn, young, touch]","[dirt, herb, corn, warmth, note]"
ABERFELDY 12 YEAR OLD,255281,"[36, 37]","[113, 114, 115, 116, 117, 118, 119, 120, 121, ...",76.931034,6.181372,Highland,"[slight, salty, tone, bit, sweet, surprised, s...","[fairly, sweet, fair, bit, burn, plum, vaguely...","[let, start, arran, sauterne, cask, medium, di...",4,"[peat, smoke, hint, floral, grass, sherry, dry...","[peat, chocolate, fairly, citrus, smoke, toffe...","[smooth, burn, little, note, alcohol, smoke, s..."
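
To show what this step does, here is a small standard-library sketch of the same idea: drop terms that appear in too many or too few documents, weight the rest by term frequency times inverse document frequency, and keep the top terms per document. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default, and skips the per-row L2 normalization because that does not change the ranking within a row. The function name `tfidf_top_terms` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, top_n=2, min_df=0.0, max_df=1.0):
    n = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # keep terms whose share of documents lies within [min_df, max_df]
    vocab = {t for t, c in df.items() if min_df <= c / n <= max_df}
    results = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        # term frequency times smoothed inverse document frequency
        weights = {t: count * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, count in tf.items()}
        # highest weight first, ties broken alphabetically
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        results.append(ranked[:top_n])
    return results

docs = [['peach', 'vanilla', 'oak', 'peach'],
        ['smoke', 'oak', 'peat'],
        ['vanilla', 'oak', 'honey']]
top = tfidf_top_terms(docs, top_n=2, max_df=0.7)
print(top)
```

In this toy data `oak` appears in every document, so `max_df=0.7` removes it, which is the same reason very common tasting words are dropped from the whisky reviews.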


### Word Mover's Distance Using TF-IDF

In [393]:
# We need to regenerate texts based on our TF-IDF analysis here
nose_tfidf_texts   = createtextlist(whiskyt.nose_tfidf)
taste_tfidf_texts  = createtextlist(whiskyt.taste_tfidf)
finish_tfidf_texts = createtextlist(whiskyt.finish_tfidf)

In [410]:
num = 404       # row position of the whisky to query
num_best = 10   # number of matches to return

# test a query
query = nose_tfidf_texts[num]
instance = WmdSimilarity(nose_tfidf_texts, model, num_best=num_best)
sims = instance[query]

# Print the query and the retrieved documents, together with their similarities.
names = whiskyp.reset_index().Name
print('Query:')
print(names.iloc[num])

# each result is a (document index, similarity) pair
for doc_index, similarity in sims:
    print(similarity)
    print(names.iloc[doc_index])

Query:
WHYTE & MACKAY SPECIAL BLEND SCOTCH WHISKY
1.0
WHYTE & MACKAY SPECIAL BLEND SCOTCH WHISKY
0.9298136610313349
GRANT'S FAMILY RESERVE SCOTCH WHISKY
0.9272752442268795
BELL'S ORIGINAL SCOTCH WHISKY
0.89695661688482
TULLAMORE DEW IRISH WHISKEY
0.8435193254242457
WILD TURKEY 81 PROOF KENTUCKY STRAIGHT BOURBON
0.8318970191997332
JIM BEAM DEVIL'S CUT
0.7866503268710434
THE FAMOUS GROUSE SCOTCH WHISKY
0.7700688197163718
BUSHMILLS IRISH WHISKEY
0.757945780192364
JEFFERSON'S RESERVE BOURBON
0.7535194479138625
JAMESON IRISH WHISKEY
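
Word Mover's Distance measures how far the word vectors of one document have to travel to match those of another. Computing it exactly needs an optimal-transport solver, so the sketch below uses a simpler relaxed variant instead, in which each word moves entirely to its nearest word in the other document. This gives a lower bound on the true distance and is not what gensim computes. The final step converts distance to similarity as `1 / (1 + distance)`, which is our understanding of what `WmdSimilarity` does and matches the score of 1.0 for the query against itself above. The toy vectors and the names `relaxed_wmd` and `to_similarity` are invented for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relaxed_wmd(doc_a, doc_b, vectors):
    # every word carries equal weight and moves to its nearest word in the other document
    def one_way(src, dst):
        return sum(min(euclidean(vectors[s], vectors[d]) for d in dst)
                   for s in src) / len(src)
    # the larger of the two one-way costs is the tighter lower bound
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))

def to_similarity(distance):
    # distance 0 maps to similarity 1; larger distances approach 0
    return 1.0 / (1.0 + distance)

# toy 2-dimensional vectors, made up for illustration only
toy_vectors = {'coffee': [1.0, 0.0], 'espresso': [0.9, 0.1],
               'grass': [0.0, 1.0], 'hay': [0.1, 0.9]}

same  = to_similarity(relaxed_wmd(['coffee', 'grass'], ['coffee', 'grass'], toy_vectors))
close = to_similarity(relaxed_wmd(['coffee', 'grass'], ['espresso', 'hay'], toy_vectors))
far   = to_similarity(relaxed_wmd(['coffee', 'espresso'], ['grass', 'hay'], toy_vectors))
print(same, close, far)
```

Documents that share no words can still score as similar when their words sit close together in the vector space, which is why this approach suits tasting notes written in different vocabularies.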


In [395]:
# get similarities for one whisky and one column
def getsimilarities(texts, row_index, model):
    query = texts[row_index] # Get description from text list
    instance = WmdSimilarity(texts, model) # Query object
    sims = instance[query]
    return sims

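# get similarities for one whisky, averaged across all columns
# (olddf is not used; the texts and model are read from module-level variables)
def getsimilarityresults(olddf, num):
    # similarity of whisky `num` to every whisky, per column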
    nose_sims   = getsimilarities(nose_tfidf_texts  , num, model)
    taste_sims  = getsimilarities(taste_tfidf_texts , num, model)
    finish_sims = getsimilarities(finish_tfidf_texts, num, model)

    # combine into neat dataframe
    df = pd.DataFrame({'nose_sim': nose_sims, 'taste_sim': taste_sims, 'finish_sim':finish_sims})

    # add needed columns
    df['sim'] = df.mean(axis=1)
    df['itemnumber'] = whiskyp.reset_index().iloc[num].itemnumber
    df = pd.concat([df, whiskyp.reset_index().rename({'itemnumber':'itemnumber2'},axis=1)['itemnumber2']], axis=1)

    return df

# Function to multiprocess an entire dataframe
def getsimilarityresults_dataframe(df):

    # one result dataframe per whisky; callbacks run in the parent process
    collected = []

    # call function for each whisky with multiprocessing
    pool = mp.Pool(mp.cpu_count())

    for num in range(df.shape[0]):
        pool.apply_async(getsimilarityresults, args=(df, num), callback=collected.append)
    pool.close()
    pool.join()

    # concatenate once at the end instead of appending row by row
    return pd.concat(collected, ignore_index=True, sort=False)

In [None]:
similarities = getsimilarityresults_dataframe(whiskyp)
similarities.to_parquet('data/similarities2.parquet')

### Save to File

In [333]:
(whiskyt.drop(['nose','taste','finish'], axis=1)
        .rename({'nose_tfidf':'nose','taste_tfidf':'taste','finish_tfidf':'finish'}, axis=1)
        .to_parquet('data/whisky_tfidf.parquet')
)