<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Libraries-and-Data" data-toc-modified-id="Load-Libraries-and-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Libraries and Data</a></span></li><li><span><a href="#Initial-Grouping" data-toc-modified-id="Initial-Grouping-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Initial Grouping</a></span></li><li><span><a href="#Extract-Keywords" data-toc-modified-id="Extract-Keywords-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Extract Keywords</a></span><ul class="toc-item"><li><span><a href="#Find-Keywords-from-Distillery-Names-(And-Other-Important-Terms)" data-toc-modified-id="Find-Keywords-from-Distillery-Names-(And-Other-Important-Terms)-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Find Keywords from Distillery Names (And Other Important Terms)</a></span></li><li><span><a href="#Extract-Keywords" data-toc-modified-id="Extract-Keywords-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Extract Keywords</a></span></li></ul></li><li><span><a href="#Extract-Age" data-toc-modified-id="Extract-Age-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Extract Age</a></span></li><li><span><a href="#Join-Datasets" data-toc-modified-id="Join-Datasets-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Join Datasets</a></span><ul class="toc-item"><li><span><a href="#Join" data-toc-modified-id="Join-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Join</a></span></li><li><span><a href="#Fuzzy-Match" data-toc-modified-id="Fuzzy-Match-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Fuzzy Match</a></span></li><li><span><a href="#Add-Age" data-toc-modified-id="Add-Age-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Add Age</a></span></li><li><span><a href="#Filter-NonMatching" data-toc-modified-id="Filter-NonMatching-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Filter NonMatching</a></span></li></ul></li><li><span><a href="#Save-to-File" 
data-toc-modified-id="Save-to-File-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Save to File</a></span><ul class="toc-item"><li><span><a href="#Additional-Investigation" data-toc-modified-id="Additional-Investigation-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Additional Investigation</a></span></li></ul></li></ul></div>

# Notebook Purpose
This notebook combines the Reddit reviews with the LCBO product data.
The difficulty is that the two sources name the same whisky differently.
To make the join, we first build a list of key phrases and extract them from the names; whiskies with different key phrases do not match. Then we extract the age of each whisky and compare that as well. Lastly, where a review still has several candidate products, we compute a fuzzy matching score and keep the highest-ranked candidate.

## Load Libraries and Data

In [40]:
import praw
import pandas as pd
import re

import requests
import time
import sys
import pdb
from fuzzywuzzy import fuzz
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
import multiprocessing as mp

In [41]:
reviews = pd.read_parquet('data/db_reviews_split.parquet')

In [42]:
lcbo = pd.read_parquet('data/lcbo_whisky.parquet').drop_duplicates()

## Initial Grouping
LCBO lists some whiskies more than once because of different bottle sizes or materials. That does not matter for reviews, so we keep a single entry per whisky.

In [43]:
# we don't care whether a product is in a plastic (PET) bottle for review purposes, so strip that marker from the name:
lcbo['itemname'] = lcbo['itemname'].str.replace(r'\(PET\)', '', case=False, regex=True).str.strip()

# add count to see how many of the same whisky name we have
lcbo['count'] = lcbo.groupby('itemname')['itemnumber'].transform('count')

# add a metric to see how far from 750 a bottle is (we want to drop duplicate products of different sizes)
lcbo = lcbo.assign(sizedelta = abs(lcbo['productsize'] - 750))

# keep only the entry closest to 750; method="first" breaks ties by row order
# (price is not considered, so a tie is not guaranteed to keep the non-PET bottle)
lcbo['rank'] = lcbo.groupby("itemname")['sizedelta'].rank(method="first", ascending=True)
lcbo = lcbo[(lcbo['rank'] == 1)]

# drop the added columns since we don't need them anymore
lcbo = lcbo.drop(['count','sizedelta','rank'], axis='columns')

# while we are here we need to fix the name of a specific whisky:
lcbo.loc[lcbo.itemname.str.contains('GLENFARCLAS12'),'itemname'] = "GLENFARCLAS 12-YEAR-OLD HIGHLAND SINGLE MALT SCOTCH"
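
The de-duplication above can be sketched on a toy table. The rows below are made up for illustration (they are not LCBO data), and only the column names `itemname`, `itemnumber` and `productsize` are taken from the notebook. The sketch also shows how `method="first"` resolves a tie: the row that appears first wins.

```python
import pandas as pd

# Hypothetical rows: three sizes of A, one of B, and a 700/800 tie for C
toy = pd.DataFrame({
    "itemname": ["WHISKY A", "WHISKY A", "WHISKY A", "WHISKY B", "WHISKY C", "WHISKY C"],
    "itemnumber": [1, 2, 3, 4, 5, 6],
    "productsize": [375, 750, 1140, 1750, 700, 800],
})

# Distance of each bottle from the standard 750 size
toy["sizedelta"] = (toy["productsize"] - 750).abs()

# Rank within each name; ties are broken by order of appearance
toy["rank"] = toy.groupby("itemname")["sizedelta"].rank(method="first", ascending=True)

kept = toy[toy["rank"] == 1]
print(kept["itemnumber"].tolist())  # [2, 4, 5]
```

If the tie should instead favour a particular bottle, one option would be to sort the frame on the deciding column before ranking.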

Some review names would not produce matching keywords, so we rename a few whiskies in the reviews table up front:

In [44]:
reviews.loc[reviews.whisky.str.contains('Jim Beam Legent'),'whisky'] = "Legent"
# regex=False so the "." in "6.1" is matched literally rather than as a wildcard
reviews.loc[reviews.whisky.str.contains('Bruichladdich Black Art 6.1', regex=False),'whisky'] = "Black Art 6.1"
reviews.loc[reviews.whisky.str.contains('Last Straw Darker Side of the Moonshine'),'whisky'] = "Darker Side"
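
A note on `str.contains`: pandas treats the pattern as a regular expression by default, so a `.` in the pattern matches any character. The two names below are hypothetical and only illustrate the difference that `regex=False` makes.

```python
import pandas as pd

names = pd.Series(["Bruichladdich Black Art 6.1", "Bruichladdich Black Art 601"])

# Default behaviour: the pattern is a regex, so "." matches any single character
as_regex = names.str.contains("Black Art 6.1").tolist()

# regex=False: the pattern is matched as a literal substring
as_literal = names.str.contains("Black Art 6.1", regex=False).tolist()

print(as_regex)    # [True, True]
print(as_literal)  # [True, False]
```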

## Extract Keywords

### Find Keywords from Distillery Names (And Other Important Terms)

In [45]:
def find_nonwords(sentence):
    # lower-cased tokens that WordNet does not recognise as English words
    return [word.lower() for word in nltk.word_tokenize(sentence) if not is_word(word)]
    
def is_word(word):
    # a token counts as a word if WordNet has at least one synset for it
    return bool(wordnet.synsets(word))
    
def contains_digit(word):
    return any(char.isdigit() for char in word)
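
To see what these helpers are meant to produce without downloading the NLTK corpora, here is a minimal sketch. It swaps WordNet for a tiny hard-coded vocabulary (`ENGLISH_WORDS`, a hypothetical stand-in) and `nltk.word_tokenize` for a simple regex tokenizer, so results on real names may differ from the notebook's.

```python
import re

# Hypothetical stand-in for the WordNet lookup used in the notebook
ENGLISH_WORDS = {"old", "single", "malt", "scotch", "year", "whisky"}

def is_word(word):
    return word.lower() in ENGLISH_WORDS

def contains_digit(word):
    return any(char.isdigit() for char in word)

def find_nonwords(sentence):
    # Simple tokenizer: runs of word characters and hyphens
    tokens = re.findall(r"[\w-]+", sentence)
    return [token.lower() for token in tokens if not is_word(token)]

names = [
    "GLENFARCLAS 12-YEAR-OLD SINGLE MALT SCOTCH",
    "LAGAVULIN 16 YEAR OLD",
]

# Non-dictionary tokens without digits are the keyword candidates
candidates = sorted({
    word
    for name in names
    for word in find_nonwords(name)
    if not contains_digit(word)
})
print(candidates)  # ['glenfarclas', 'lagavulin']
```

The distillery names survive because they are not dictionary words, while ages such as `16` and `12-year-old` are dropped by the digit filter.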

In [46]:
# Find all words in whisky names that are not english words
keywords = lcbo.apply(lambda row: find_nonwords(row['itemname']), axis='columns')

# Turn into one list without duplicates
keywords = list(keywords.apply(pd.Series).stack().unique())

# Filter out any token that contains a digit (ages, batch numbers, etc.)
keywords = [word for word in keywords if not contains_digit(word)]

# Filter out stopwords
stopWords = set(stopwords.words('english'))
keywords = [word for word in keywords if word not in stopWords]

# Filter out punctuation
keywords = [word for word in keywords if re.match(r'^\w+$', word) is not None]

# Filter out words that aren't applicable:
# These are either: generic descriptors or whisky regions
filterlist = ['peated', 'campbeltown', 'speyside', 'yo', 'st', 'oaked', 'wheated', 'ol', 'bbq', 'exper']

keywords = [word for word in keywords if word not in filterlist]

newwords = [
            # brands
            '101','1792', '6.1', 'gibsons', 'signature', 'ballantines',
            'makers', 'bakers','blantons','mcclelland','bookers','patricks','gentleman','jack',
            'prichards','stranahans','dewars',
            'jeffersons','liquormens', 
            'barrelling','cattos','blantons','founders',
            'walkers','teachers','bells','royal','grants','o.f.c.', 'century',
            # bigrams need to both be matched:
            ('jack','daniels'), ('knob','creek'),('crown','royal'),('canadian','club'),
            ('highland','park'), ('forty','creek'),('proof', 'whisky'), ('canadian','rockies'),
            'owl', 'jefferson', 'teacher',
            'sazerac', 'caribou', 'wiser', 'walker', 'grouse', 'alberta', 'grant', 'bell', 
            'dewar',  'rittenhouse', 'revel', 'roses', 
            'writers', 'writer', 'rogue',  'colonel', 'weller', 'booker', 'mist', 'challenge',
            'redbreast','jts', 'casg','burns', '601',
            # qualities
            'organic','vintage','quiet', 'classic', 'select', #'proof', 'rare', 
            # region (careful with these)
             'canada', #'islay', 'canadian',
            # locations
            'virginia','dublin','shetland','trafalgar','caribbean','windsor',  'halifax',
            # names
            'patrick', 'gretzky', 'cody', 'charlotte','tucker','prescott',
            # animals
            'bull', 'dog', 'turkey', 'monkey', 'beast', 'fox', 'buffalo','crow','horse', 
            # colors
            'red', 'blue', 'yellow', 'green', 'black', 'brown', 'white', 'gold', 'silver', #'copper',
            'golden','blacker', 'golder', 'redder', 'darker',
            'dark',
            # type
            'rye',
            # barrels
            'cognac', 'sherry', 'amarone', 'champagne', #'stout', messes up caskmates
            'brandy', 'madeira', 'bordeaux', 'sauternes', 'burgundy',
            'sassicaia', 'tokaji', 'rum',
            # barrel count
            'triple', 'double', #'single',
            # woods
            'cedar', 'heartwood', 'springwood', 'virgin', 'redwood', 'wood', 'cork', 'cask', 'new',
            # game of thrones
            'stark', 'tully',
            # flavours
            'apple', 'vanilla', 'peach', 'honey', 'maple', 'spiced', 'toasted', 'seasoned',
            # other
            #'small',
            'irishman', 'rebel', 'compass',   
            'stalk', 'centennial', 'forester', 'powers', 'temple', 
            'antiquity', 'feathery', 'few',  'burnside',   'larceny', 'tango', 'king',
            'moray', 'twelve', 'reunion',   'maestri', #'reserve', 
            'sexton', 'ezra', 'bastille',  'orphan', 'founder',  'wedding', 'shoe',
            'caramel', 'moonshine', 'cooper',  'benchmark',
            'smws','valinch', 'hermitage','home',    'traditional', 'bush', 'art','diamond', 
            'alpha', 'dawn', 'dusk', 'surf', 'elements', 'growth', 'bere', 
            'cuvee', 'infinity', 'octomore', 'resurrection',
            'waves', 'river', 'silk' ,'signal', 'winter', 'snow', 'ice', 'fire', 
            'harvest', 'blenders', 'chairman','ellington', 'kirkland',
            'mcadam', 'glacier', 'skate', 'pike', 'ileach',
            'macaloney', 'cured', 'grain',  'sour', 'tornado',
            'hedonism', 'evolution', 'cross', 'glasgow','indian',
            'heritage',  'devil', 'brooks', 'alba', 'major', 'naked', 'eades', 'light',  'entrapment',  'oyo',
            'palm', 'lochnagar', 'willett', 'north', 'dissertation', 'last', 'legacy'
           ]
keywords = keywords + newwords

### Extract Keywords

In [47]:
# Function to extract keywords from text
def extract_keywords(text, keywords):
    from nltk import ngrams
    # normalise possessives so that e.g. "maker's" matches the keyword "makers"
    text = text.lower().replace("mcclelland's","mcclelland")
    text = text.replace("hayden's","hayden")
    text = text.replace("'s","s")
    # tokenize the text once instead of once per keyword
    tokens = nltk.word_tokenize(text)
    result = []
    for k in keywords:
        if isinstance(k, tuple):
            # bigram: lower each word in the tuple and turn into a string
            k = " ".join(word.lower() for word in k)
        else:
            # lower the word
            k = k.lower()
        ktokens = tuple(nltk.word_tokenize(k))
        # the keyword matches if its tokens appear consecutively in the text
        if any(gram == ktokens for gram in ngrams(tokens, len(ktokens))):
            result.append(k.replace(' ','_'))
    return " ".join(sorted(result))

# Function to multiprocess an entire dataframe
def extract_keywords_dataframe(df, columnname, keywords):
    # list to hold one result dict per unique name (filled by collect_result)
    global results
    results = []
    
    # select only the column we want and make unique to save some time
    dfnames = df[columnname].unique()
    pool = mp.Pool(mp.cpu_count())
    
    # call function for each name
    for name in dfnames:
        pool.apply_async(extract_keywords_row, args=(columnname, name, keywords), callback=collect_result)
    pool.close()
    pool.join()
    
    # turn the collected dicts into a dataframe
    results_df = pd.DataFrame(results, columns=[columnname,'keywords'])
    
    # join back on original dataframe
    return (df.set_index(columnname)
              .join(results_df.set_index(columnname))
              .reset_index()
              .rename({'index':columnname}, axis='columns')
           )
    
# Function to be run in multiprocess on each item
def extract_keywords_row(columnname, text, keywords):
    newitem = {}
    newitem[columnname] = text
    newitem['keywords'] = extract_keywords(text, keywords)
    return newitem
    
# Function to collect results from multiprocess
def collect_result(result):
    # DataFrame.append was removed in pandas 2.0, so collect the row dicts in a plain list
    global results
    results.append(result)
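
The matching idea in `extract_keywords` can be sketched without NLTK. The version below is a simplification under stated assumptions: it splits on whitespace instead of using `nltk.word_tokenize`, and `ngrams` and `keyword_signature` are hypothetical helper names. The point it illustrates is that two differently worded names get the same signature string when they contain the same keywords, and that a tuple keyword only counts when both words appear next to each other.

```python
def ngrams(tokens, n):
    # All runs of n consecutive tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def keyword_signature(text, keywords):
    # Lower-case and fold possessives, then split on whitespace
    tokens = text.lower().replace("'s", "s").split()
    found = []
    for k in keywords:
        parts = tuple(w.lower() for w in k) if isinstance(k, tuple) else (k.lower(),)
        if parts in ngrams(tokens, len(parts)):
            found.append("_".join(parts))
    return " ".join(sorted(found))

keywords = ["jack", ("jack", "daniels"), "honey", "lagavulin"]

lcbo_sig = keyword_signature("JACK DANIEL'S TENNESSEE HONEY", keywords)
reddit_sig = keyword_signature("Jack Daniel's Honey", keywords)
other_sig = keyword_signature("Jack Daniel's Old No. 7", keywords)

print(lcbo_sig)   # honey jack jack_daniels
print(other_sig)  # jack jack_daniels
```

Because the join later in the notebook is an exact match on this string, `lcbo_sig` and `reddit_sig` would pair up while `other_sig` would not.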

Add keywords to LCBO data:

In [48]:
results = None

In [49]:
lcbo = extract_keywords_dataframe(lcbo, 'itemname', keywords)
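
`extract_keywords_dataframe` works on the unique names only and then joins the keywords back onto every row. A single-process sketch of that pattern, with made-up rows and a hypothetical `toy_keywords` function standing in for the real extraction:

```python
import pandas as pd

reviews_toy = pd.DataFrame({
    "whisky": ["Lagavulin 16", "Ardbeg 10", "Lagavulin 16"],
    "rating": [90, 85, 88],
})

calls = []

def toy_keywords(name):
    # Stand-in for the expensive extraction; records each call
    calls.append(name)
    return name.split()[0].lower()

# Compute keywords once per unique name
unique_names = reviews_toy["whisky"].unique()
lookup = pd.DataFrame({
    "whisky": unique_names,
    "keywords": [toy_keywords(name) for name in unique_names],
})

# Join the per-name result back onto every review row
joined = (reviews_toy.set_index("whisky")
                     .join(lookup.set_index("whisky"))
                     .reset_index())

print(len(calls), len(joined))  # 2 3
```

Three review rows need only two extraction calls, which is where the time saving comes from when many reviews share a whisky name.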

And to review data:

In [50]:
reviews = extract_keywords_dataframe(reviews, 'whisky', keywords)
print(reviews.shape)
reviews = reviews[reviews['keywords'] !='']
print(reviews.shape)

(31056, 15)
(26277, 15)


Save to file

In [52]:
reviews.to_parquet('db_reviews_keywords.parquet')

In [None]:
reviews = pd.read_parquet('db_reviews_keywords.parquet')

## Extract Age

In [53]:
def extract_age(sentence):
    # remove "#<number>" and "NO. <number>" so they are not mistaken for an age
    sentence = re.sub(r"#\d*", '', sentence)
    sentence = re.sub(r"NO\. \d*", '', sentence)
    # grab full words that are 1 or 2 digits only, optionally followed by yo, year, y or -year-old
    reg = r'^(\d\d?)(?:yo|year|y|-year-old)?$'
    words = nltk.word_tokenize(sentence)
    # but only if the word "batch" is not present (this check is case-sensitive)
    if 'batch' in words:
        return None
    return " ".join(sorted(re.match(reg, word, re.IGNORECASE).group(1)
                           for word in words if re.match(reg, word, re.IGNORECASE)))
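
A few worked cases make the age rule easier to follow. The sketch below mirrors `extract_age` but splits on whitespace rather than calling `nltk.word_tokenize` (an assumption made to keep it dependency-free), so `extract_age_simple` is a hypothetical stand-in and may tokenize punctuation differently from the notebook's function.

```python
import re

AGE_TOKEN = re.compile(r"^(\d\d?)(?:yo|year|y|-year-old)?$", re.IGNORECASE)

def extract_age_simple(name):
    # Drop "#<number>" and "NO. <number>" so they are not read as ages
    name = re.sub(r"#\d*", "", name)
    name = re.sub(r"NO\. \d*", "", name)
    tokens = name.split()
    # Names that mention a (lower-case) batch get no age at all
    if "batch" in tokens:
        return None
    ages = [AGE_TOKEN.match(t).group(1) for t in tokens if AGE_TOKEN.match(t)]
    return " ".join(sorted(ages))

print(extract_age_simple("GLENFARCLAS 12-YEAR-OLD HIGHLAND SINGLE MALT SCOTCH"))  # 12
print(extract_age_simple("Glenfarclas 12yo"))                                     # 12
print(extract_age_simple("Lagavulin 16"))                                         # 16
print(repr(extract_age_simple("BENCHMARK OLD NO. 8 BRAND KENTUCKY STRAIGHT BOURBON")))  # ''
print(extract_age_simple("Booker's batch 2019-01"))                               # None
```

A name with no age statement yields an empty string, so two no-age-statement names compare as equal in the age filter, while a `None` on either side does not.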

## Join Datasets

### Join

Assign a unique ID to each whisky in the reviews table:

In [54]:
reviews = reviews.assign(RedditWhiskyID = reviews['whisky'].astype('category').cat.codes)

Inner join with the LCBO data on keywords:

In [55]:
reviews = (reviews.set_index('keywords')
                  .join(lcbo.set_index('keywords'), how='inner')
                  .reset_index()
                  .rename({'index':'keywords'}, axis='columns')
          )

In [56]:
reviews.shape

(40209, 61)
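
The row count grows after the join because several reviews and several products can share one keyword string, and an inner join returns every combination. A toy illustration with made-up rows, using `merge` on a column rather than the index join above:

```python
import pandas as pd

reviews_toy = pd.DataFrame({
    "whisky": ["Lagavulin 16", "Lagavulin 8", "Mystery Dram"],
    "keywords": ["lagavulin", "lagavulin", "mystery"],
})
lcbo_toy = pd.DataFrame({
    "itemname": ["LAGAVULIN 16 YEAR OLD", "LAGAVULIN 8 YEAR OLD"],
    "keywords": ["lagavulin", "lagavulin"],
})

# Every review is paired with every product that has the same keywords
pairs = reviews_toy.merge(lcbo_toy, on="keywords", how="inner")

print(len(pairs))                        # 4
print(sorted(pairs["whisky"].unique()))  # ['Lagavulin 16', 'Lagavulin 8']
```

Two reviews times two products gives four candidate pairs, and the review with no keyword partner disappears. The age and fuzzy-score steps that follow are what narrow these candidates down.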

### Fuzzy Match

In [57]:
# Calculate the fuzzy match score with token_set_ratio (stored as fuzztset), which yielded the best results
reviews = reviews.rename({'whisky':'RedditWhiskyName','itemname':'Name'},axis='columns')
reviews['fuzztset']    = reviews.apply(lambda row: fuzz.token_set_ratio(row['RedditWhiskyName'],row['Name']), axis='columns')
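
To show why a token-set score suits this data without requiring fuzzywuzzy, here is a simplified stand-in built on the standard library's `difflib.SequenceMatcher`. `token_set_similarity` is a hypothetical helper written in the spirit of token-set matching; it is not fuzzywuzzy's implementation and its scores will not match `fuzz.token_set_ratio` exactly.

```python
from difflib import SequenceMatcher

def token_set_similarity(a, b):
    # Compare the shared tokens against each name's shared-plus-remaining tokens
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    with_rest_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    with_rest_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    candidates = [(common, with_rest_a), (common, with_rest_b), (with_rest_a, with_rest_b)]
    best = max(SequenceMatcher(None, x, y).ratio() for x, y in candidates)
    return round(100 * best)

short_vs_long = token_set_similarity(
    "Benchmark", "BENCHMARK OLD NO. 8 BRAND KENTUCKY STRAIGHT BOURBON"
)
unrelated = token_set_similarity("Lagavulin 16", "ARDBEG 10 YEAR OLD")

print(short_vs_long)  # 100
```

When every token of the short Reddit name appears in the long LCBO name the score is 100, regardless of how many extra words the LCBO name carries. Unrelated names share no tokens and score low.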

In [58]:
# Add rank column per Reddit whisky name: rank 1 is the highest fuzztset score (ties share a rank)
fuzzfilter = reviews
fuzzfilter["rank"] = fuzzfilter.groupby("RedditWhiskyName")["fuzztset"].rank(method="dense", ascending=False)

### Add Age

In [59]:
# Add Age columns
fuzzfilter['RedditAge'] = fuzzfilter.apply(lambda row: extract_age(row['RedditWhiskyName']), axis='columns')
fuzzfilter['LcboAge']   = fuzzfilter.apply(lambda row: extract_age(row['Name'])            , axis='columns')

### Filter NonMatching

In [60]:
# Filter out values where age does not match
print(fuzzfilter.shape)
fuzzfilter = fuzzfilter[fuzzfilter['RedditAge'] == fuzzfilter['LcboAge']]
print(fuzzfilter.shape)

(40209, 65)
(15695, 65)


In [61]:
# Set the threshold at a score of 59 (out of 100), chosen by trial and error.
matches = fuzzfilter[(fuzzfilter['rank'] == 1) & (fuzzfilter['fuzztset'] >= 59)]
print(matches.shape)

(10732, 65)
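
The rank was computed before the age filter, so a review whose top-scoring candidate has a different age ends up with no match at all. The toy table below uses hypothetical scores to compare that order with ranking after the age filter. It is only a sketch of the two orderings, not a claim about which one suits the real data better.

```python
import pandas as pd

cands = pd.DataFrame({
    "RedditWhiskyName": ["Lagavulin 16", "Lagavulin 16", "Ardbeg 10"],
    "Name": ["LAGAVULIN 8 YEAR OLD", "LAGAVULIN 16 YEAR OLD", "ARDBEG 10 YEAR OLD"],
    "fuzztset": [95, 90, 100],   # hypothetical scores
    "RedditAge": ["16", "16", "10"],
    "LcboAge": ["8", "16", "10"],
})

def add_rank(df):
    df = df.copy()
    df["rank"] = df.groupby("RedditWhiskyName")["fuzztset"].rank(method="dense", ascending=False)
    return df

same_age = cands["RedditAge"] == cands["LcboAge"]

# Order used in the cells above: rank all candidates, then filter on age
ranked = add_rank(cands)
rank_then_filter = ranked[same_age & (ranked["rank"] == 1) & (ranked["fuzztset"] >= 59)]

# Alternative order: filter on age first, then rank the survivors
survivors = add_rank(cands[same_age])
filter_then_rank = survivors[(survivors["rank"] == 1) & (survivors["fuzztset"] >= 59)]

print(rank_then_filter["Name"].tolist())  # ['ARDBEG 10 YEAR OLD']
print(filter_then_rank["Name"].tolist())  # ['LAGAVULIN 16 YEAR OLD', 'ARDBEG 10 YEAR OLD']
```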


In [62]:
# save to csv to view results
matches[['RedditWhiskyName','Name','fuzztset']].to_csv('fuzztest.csv')

Check how many we've matched up:

In [66]:
pd.DataFrame(matches.groupby('Name')['reviewID'].count()).shape

(421, 1)

In [None]:
lcbo.shape

In [67]:
100*421/573

73.47294938917976

About 73% of the LCBO whiskies (421 of 573) have at least one matched review, which is pretty good.

## Save to File

In [68]:
matches.to_parquet('data/matches.parquet')

### Additional Investigation

Look at the products that were not matched to figure out why:

In [None]:
lcbomatches = pd.DataFrame(matches.groupby('Name')['reviewID'].count())
lcbomatches['matched'] = True
lcbomatches = lcbomatches.drop('reviewID', axis='columns')

lcbomatches = lcbo.set_index('itemname').join(lcbomatches)
lcbomatches[lcbomatches['matched'].isna()]
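
The same "which products have no match" question can also be asked with a left merge and pandas' `indicator=True` option, which adds a `_merge` column saying where each row came from. A small sketch on made-up names (`lcbo_toy` and `matched_names` are hypothetical):

```python
import pandas as pd

lcbo_toy = pd.DataFrame({"itemname": ["WHISKY A", "WHISKY B", "WHISKY C"]})
matched_names = pd.DataFrame({"Name": ["WHISKY A", "WHISKY C", "WHISKY C"]})

# Left merge keeps every product; _merge marks those without a partner
flagged = lcbo_toy.merge(
    matched_names.drop_duplicates(),
    left_on="itemname",
    right_on="Name",
    how="left",
    indicator=True,
)
unmatched = flagged.loc[flagged["_merge"] == "left_only", "itemname"].tolist()

print(unmatched)  # ['WHISKY B']
```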

In [None]:
name = "BENCHMARK OLD NO. 8 BRAND KENTUCKY STRAIGHT BOURBON"
redditname = "Benchmark"
print(extract_keywords(name, keywords))
print(extract_keywords(redditname, keywords))
print(fuzz.token_set_ratio(name,redditname))
print(extract_age(name))

In [None]:
re.sub(r"#\d*",'', name)

In [None]:
fuzz.token_set_ratio(name,redditname)

In [None]:
rawreviews = pd.read_parquet('data/db_reviews.parquet')

In [None]:
rawreviews[rawreviews['whisky'].str.contains('Darker Side')]
#rawreviews[rawreviews['whisky'].str.contains('Dalmore') & rawreviews['whisky'].str.contains('Wood')]

In [None]:
# Renames applied to the reddit reviews (original -> new):
# Jim Beam Legent -> Legent
# Bruichladdich Black Art 6.1 -> Black Art 6.1
# Last Straw Darker Side of the Moonshine -> Darker Side
