Also, gensim is a great library for this task. I would look at Word Mover's Distance too; it's an amazing little algorithm. You'll likely be doing what is called "document similarity classification", where the whisky profiles are the documents and you're trying to group them either on profile content or on user preferences. The latter is where it switches from a clustering problem to a classification problem in some respects.
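
To make "document similarity" concrete, here is a minimal sketch that compares whisky profiles as bags of words using cosine similarity. This is a deliberately simple stand-in, not Word Mover's Distance and not gensim's API; the function name `cosine_similarity` and the three token lists are hypothetical examples.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # dot product over the terms the two profiles share
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# hypothetical tasting-note tokens, not taken from the review data
profile_a = ['honey', 'vanilla', 'oak', 'peach']
profile_b = ['vanilla', 'oak', 'caramel', 'smoke']
profile_c = ['peat', 'smoke', 'brine', 'iodine']

sim_ab = cosine_similarity(profile_a, profile_b)  # two of four terms shared
sim_ac = cosine_similarity(profile_a, profile_c)  # no terms shared
print(sim_ab, sim_ac)
```

Unlike this count-based measure, Word Mover's Distance works on word embeddings, so it can treat "peat" and "smoke" as close even when the exact words differ.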

## Load Libraries and Data

In [143]:
import pandas as pd
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, \
     strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
import re

In [144]:
#reviews = pd.read_parquet('data/matches_cleanstyles.parquet')
reviews = pd.read_parquet('data/review_cats.parquet')

## Roll up to Whisky Level

We can join back on other tables to get most information. What we are looking for here is the whisky itemnumber, the aggregated review score, and the aggregated nose, taste, and finish words.

In [145]:
# define a custom aggregation function that returns a concatenated version of all strings in the group by
# (join on a space so the last word of one review doesn't fuse with the first word of the next)
def words(col):
    return ' '.join(col)

# and a custom aggregation to hold the Reddit review IDs and Reddit whisky IDs we've joined,
# in case we ever want to go backwards
def distinctlist(col):
    return list(set(col))

# create rolled up dataframe
whisky = (reviews[['rating','style','nose','taste','finish', 'RedditWhiskyID', 'reviewID', 'Name', 'itemnumber']]
         .groupby(['Name','itemnumber'])
         .agg({
             'RedditWhiskyID': distinctlist,
             'reviewID': list,
             'rating': ['mean','std'],
             'style': pd.Series.mode,
             'nose': words,
             'taste': words,
             'finish': words
             }
            )
)

# collapse multiindex on columns
whisky.columns = ['_'.join(col).strip() for col in whisky.columns.values]

# rename columns
whisky = whisky.rename(columns={
    'RedditWhiskyID_distinctlist': 'RedditWhiskyIDs',
    'reviewID_list': 'reviewIDs',
    'style_mode': 'style',
    'nose_words': 'nose',
    'taste_words': 'taste',
    'finish_words': 'finish',
})

whisky

Name,itemnumber,RedditWhiskyIDs,reviewIDs,rating_mean,rating_std,style,nose,taste,finish
12 YO KNAPPOGUE CASTLE IRISH SINGLE MALT WHISKEY,619320,"[5277, 5278, 5279]","[20457, 20458, 20459, 20460, 20461, 20462, 204...",75.100000,8.020114,Ireland,"n crisp apple, lots of peach, vanilla, honey. ...","p its fresh and fruity again, lots of peaches ...",f fruity and malty. the malt dies off quickly ...
1792 SINGLE BARREL KENTUCKY STRAIGHT BOURBON WHISKY,496729,"[24, 20, 21, 23]","[60, 61, 62, 63, 65, 66]",72.166667,9.042492,Bourbon,"sweet corn, oak, cotton candy/birthday cake, ...","seaweed, corn, mint, brown sugar crackers, br...","edamame, ginger, wheat, werther's, corn mediu..."
1792 SMALL BATCH KENTUCKY STRAIGHT BOURBON,208918,[25],"[67, 68, 69, 70, 71, 72, 73, 74, 75, 76]",76.200000,4.263541,Bourbon,this is the 1792 i remember. custard and bana...,"again, similar to my recollections. hotter th...","dry. wood char, barrel flavour. yeasty? herba..."
601 BOURBON,634519,[29],[97],58.000000,,Bourbon,"grain funk, milled corn, herbal, wet dirt. sm...","young sharp corn graininess, astringent, copp...","short, medium warmth, canned corn and oak not..."
ABERFELDY 12 YEAR OLD,255281,"[36, 37]","[113, 114, 115, 116, 117, 118, 119, 120, 121, ...",76.931034,6.181372,Highland,"•\t slight salty tones, but also a bit of swee...","•\t fairly sweet, a fair bit of burn, plums, a...",let's start with the arran sauternes cask •\...
ABERFELDY 21 YEAR OLD HIGHLAND SINGLE MALT SCOTCH WHISKY,400085,[48],"[159, 160, 161, 162, 163, 164, 165, 166, 167, ...",83.700000,6.254776,Highland,"blood oranges, honey, floral, oak, caramel, ...","hazelnut shells, oak, orange ice cream, wax,...","oak, dark chocolate, orange zest, nutty, lic..."
ABERLOUR 10 YEAR OLD SINGLE MALT SCOTCH WHISKY,482885,"[50, 51, 52]","[170, 171, 172, 173, 174, 175, 176, 177, 178, ...",79.090909,7.006085,Speyside,rather subtle . sherry spiciness. dried fr...,light and creamy mouthfeel. very smooth. l...,medium short . spicey and earthy initially....
ABERLOUR 12 YEAR OLD SINGLE MALT SCOTCH WHISKY,352104,"[56, 60]","[265, 31674]",84.500000,13.435029,Speyside,"ripe blood orange, strawberry preserves, coco...","very thick on the tongue, creamy, vanilla cus...","medium length ~ golden raisins, cinnamon, a t..."
ABERLOUR A'BUNADH SCOTCH WHISKY,573352,"[96, 97, 134, 91, 94, 95]","[31372, 440, 441, 442, 444, 445, 446, 601, 602...",87.086957,4.766484,Speyside,"almonds, dark cherries, fig, chocolate plum,...","sweet fruity start, like rum, some tart lemon...",herbal that continues to dry. shorter in len...
ALBERTA PREMIUM DARK HORSE WHISKY,298083,[156],"[741, 742, 743, 744, 745, 746, 747, 748, 749, ...",84.240000,7.980601,Canada,"rye spice, french vanilla, fresh rain. it's a...",sweet spicy and above all interesting. there ...,longish with some beautiful clean rye spice n...
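
The flattened column names used in the rename step (`rating_mean`, `nose_words`, and so on) come from joining the two levels of the MultiIndex that `agg` produces. A small sketch on a toy frame shows the mechanics; the frame `toy` and its values are made up for illustration, and `words` here joins on a space, which is an assumption about the intended behaviour.

```python
import pandas as pd


def words(col):
    # concatenate all review strings in the group, separated by a space
    return ' '.join(col)


# toy data standing in for the reviews table (hypothetical values)
toy = pd.DataFrame({
    'Name': ['A', 'A', 'B'],
    'itemnumber': [1, 1, 2],
    'rating': [80, 90, 70],
    'nose': ['honey oak', 'vanilla', 'peat smoke'],
})

rolled = (toy.groupby(['Name', 'itemnumber'])
             .agg({'rating': ['mean', 'std'], 'nose': words}))

# columns are now pairs such as ('rating', 'mean') and ('nose', 'words');
# joining each pair with '_' gives flat names
rolled.columns = ['_'.join(col).strip() for col in rolled.columns.values]

print(list(rolled.columns))
print(rolled.loc[('A', 1), 'nose_words'])
```

A whisky with a single review, like `B` here, gets a missing `rating_std`, which matches the `NaN` shown for 601 BOURBON above.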


## Preprocess Reviews

Gensim has some nifty preprocessing functions. Here we define a list of filters we want to run on the data, then apply them in order to our reviews.

In [146]:
# work on a copy so the raw rolled-up text in `whisky` is left untouched
whiskyp = whisky.copy()

# The standard filters aren't catching the curly quotation marks and apostrophes in some reviews, so add a new function:
def fix_apostrophes(s):
    return re.sub('[“”’]', '', s)

# List all of the functions we want to apply
CUSTOM_FILTERS = [lambda x: x.lower(), 
                  fix_apostrophes,
                  strip_tags,
                  strip_punctuation,
                  strip_short,
                  strip_multiple_whitespaces,
                  strip_numeric,
                  remove_stopwords
                 ]

# preprocess columns: each review string becomes a list of cleaned tokens
for column in ['nose','taste','finish']:
    whiskyp[column] = whiskyp[column].apply(lambda text: preprocess_string(text, CUSTOM_FILTERS))

# take a look
whiskyp

Name,itemnumber,RedditWhiskyIDs,reviewIDs,rating_mean,rating_std,style,nose,taste,finish
12 YO KNAPPOGUE CASTLE IRISH SINGLE MALT WHISKEY,619320,"[5277, 5278, 5279]","[20457, 20458, 20459, 20460, 20461, 20462, 204...",75.100000,8.020114,Ireland,"[crisp, apple, lots, peach, vanilla, honey, fr...","[fresh, fruity, lots, peaches, fresh, peach, s...","[fruity, malty, malt, dies, quickly, peach, no..."
1792 SINGLE BARREL KENTUCKY STRAIGHT BOURBON WHISKY,496729,"[24, 20, 21, 23]","[60, 61, 62, 63, 65, 66]",72.166667,9.042492,Bourbon,"[sweet, corn, oak, cotton, candy, birthday, ca...","[seaweed, corn, mint, brown, sugar, crackers, ...","[edamame, ginger, wheat, werther, corn, medium..."
1792 SMALL BATCH KENTUCKY STRAIGHT BOURBON,208918,[25],"[67, 68, 69, 70, 71, 72, 73, 74, 75, 76]",76.200000,4.263541,Bourbon,"[remember, custard, bananas, lots, custard, ba...","[similar, recollections, hotter, expected, thi...","[dry, wood, char, barrel, flavour, yeasty, her..."
601 BOURBON,634519,[29],[97],58.000000,,Bourbon,"[grain, funk, milled, corn, herbal, wet, dirt,...","[young, sharp, corn, graininess, astringent, c...","[short, medium, warmth, canned, corn, oak, not..."
ABERFELDY 12 YEAR OLD,255281,"[36, 37]","[113, 114, 115, 116, 117, 118, 119, 120, 121, ...",76.931034,6.181372,Highland,"[slight, salty, tones, bit, sweet, surprised, ...","[fairly, sweet, fair, bit, burn, plums, vaguel...","[let, start, arran, sauternes, cask, medium, d..."
ABERFELDY 21 YEAR OLD HIGHLAND SINGLE MALT SCOTCH WHISKY,400085,[48],"[159, 160, 161, 162, 163, 164, 165, 166, 167, ...",83.700000,6.254776,Highland,"[blood, oranges, honey, floral, oak, caramel, ...","[hazelnut, shells, oak, orange, ice, cream, wa...","[oak, dark, chocolate, orange, zest, nutty, li..."
ABERLOUR 10 YEAR OLD SINGLE MALT SCOTCH WHISKY,482885,"[50, 51, 52]","[170, 171, 172, 173, 174, 175, 176, 177, 178, ...",79.090909,7.006085,Speyside,"[subtle, sherry, spiciness, dried, fruit, stra...","[light, creamy, mouthfeel, smooth, lemons, qui...","[medium, short, spicey, earthy, initially, bit..."
ABERLOUR 12 YEAR OLD SINGLE MALT SCOTCH WHISKY,352104,"[56, 60]","[265, 31674]",84.500000,13.435029,Speyside,"[ripe, blood, orange, strawberry, preserves, c...","[tongue, creamy, vanilla, custard, leather, or...","[medium, length, golden, raisins, cinnamon, to..."
ABERLOUR A'BUNADH SCOTCH WHISKY,573352,"[96, 97, 134, 91, 94, 95]","[31372, 440, 441, 442, 444, 445, 446, 601, 602...",87.086957,4.766484,Speyside,"[almonds, dark, cherries, fig, chocolate, plum...","[sweet, fruity, start, like, rum, tart, lemon,...","[herbal, continues, dry, shorter, length, medi..."
ALBERTA PREMIUM DARK HORSE WHISKY,298083,[156],"[741, 742, 743, 744, 745, 746, 747, 748, 749, ...",84.240000,7.980601,Canada,"[rye, spice, french, vanilla, fresh, rain, inc...","[sweet, spicy, interesting, notes, butterscotc...","[longish, beautiful, clean, rye, spice, notes,..."


In [150]:
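# lemmatize depends on the Pattern library and, as far as I know, is only available in gensim releases before 4.0
from gensim.utils import lemmatize

test = whiskyp.iloc[0].taste

# lemmatize expects a string, so join the token list back together first
lemmatize(' '.join(test))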


ImportError: Pattern library is not installed. Pattern library is needed in order to use lemmatize function