Also, gensim is a great library for this task. I would look at Word Mover's Distance too; it's an amazing little algorithm. You'll likely be doing what is called "document similarity classification", where the whisky profiles are the documents and you're trying to group them by profile content or by user preferences. The latter is where it switches from a clustering problem to a classification problem in some respects.
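To make "document similarity" concrete, here is a minimal sketch that scores two tokenized tasting notes with plain bag-of-words cosine similarity. This is a deliberately simpler stand-in, not Word Mover's Distance (which also needs word embeddings), and the token lists and the `cosine_similarity` helper are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # only terms present in both documents contribute to the dot product
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# hypothetical, already-tokenized tasting notes
bourbon_a = ["corn", "oak", "vanilla", "caramel"]
bourbon_b = ["corn", "oak", "brown", "sugar"]
islay = ["peat", "smoke", "brine", "iodine"]

print(cosine_similarity(bourbon_a, bourbon_b))  # two shared terms out of four
print(cosine_similarity(bourbon_a, islay))      # no shared terms
```

Exact word overlap is a blunt measure: "smoke" and "smoky" count as unrelated here, which is the gap that lemmatizing and embedding-based distances try to close.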

## Load Libraries and Data

In [69]:
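import re

import nltk  # needed for the nltk.word_tokenize and nltk.pos_tag calls below
import pandas as pd
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, \
     strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
# note: the tokenizer, tagger and WordNet data must be fetched once with nltk.download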

In [71]:
reviews = pd.read_parquet('data/review_cats.parquet')

## Roll up to Whisky Level

We can join back to the other tables to recover most information. What we need here is one row per whisky: the itemnumber, the aggregated review score, and the aggregated nose, taste, and finish words.

In [74]:
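# define a custom aggregation function that returns a concatenated version of all strings in the group by,
# separated by a space so the last word of one review doesn't fuse with the first word of the next
def words(col):
    return ' '.join(col)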

# and a custom aggregation that keeps the distinct Reddit whisky IDs we've joined,
# in case we ever want to go backwards (review IDs are kept with a plain list below)
def distinctlist(col):
    return list(set(col))

# create rolled up dataframe
whisky = (reviews[['rating','style','nose','taste','finish', 'RedditWhiskyID', 'reviewID', 'Name', 'itemnumber']]
         .groupby(['Name','itemnumber'])
         .agg({
             'RedditWhiskyID': distinctlist,
             'reviewID': list,
             'rating': ['mean','std'],
             'style': pd.Series.mode,
             'nose': words,
             'taste': words,
             'finish': words
             }
            )
)

# collapse multiindex on columns
whisky.columns = ['_'.join(col).strip() for col in whisky.columns.values]

# rename columns
whisky = whisky.rename({'RedditWhiskyID_distinctlist': 'RedditWhiskyIDs',
              'reviewID_list': 'reviewIDs',
              'style_mode':'style',
              'nose_words':'nose',
              'taste_words':'taste',
              'finish_words':'finish'
             }, axis='columns')



whisky.head()

| Name | itemnumber | RedditWhiskyIDs | reviewIDs | rating_mean | rating_std | style | nose | taste | finish |
|---|---|---|---|---|---|---|---|---|---|
| 12 YO KNAPPOGUE CASTLE IRISH SINGLE MALT WHISKEY | 619320 | [5277, 5278, 5279] | [20457, 20458, 20459, 20460, 20461, 20462, 204... | 75.1 | 8.020114 | Ireland | n crisp apple, lots of peach, vanilla, honey. ... | p its fresh and fruity again, lots of peaches ... | f fruity and malty. the malt dies off quickly ... |
| 1792 SINGLE BARREL KENTUCKY STRAIGHT BOURBON WHISKY | 496729 | [24, 20, 21, 23] | [60, 61, 62, 63, 65, 66] | 72.166667 | 9.042492 | Bourbon | sweet corn, oak, cotton candy/birthday cake, ... | seaweed, corn, mint, brown sugar crackers, br... | edamame, ginger, wheat, werther's, corn mediu... |
| 1792 SMALL BATCH KENTUCKY STRAIGHT BOURBON | 208918 | [25] | [67, 68, 69, 70, 71, 72, 73, 74, 75, 76] | 76.2 | 4.263541 | Bourbon | this is the 1792 i remember. custard and bana... | again, similar to my recollections. hotter th... | dry. wood char, barrel flavour. yeasty? herba... |
| 601 BOURBON | 634519 | [29] | [97] | 58.0 | NaN | Bourbon | grain funk, milled corn, herbal, wet dirt. sm... | young sharp corn graininess, astringent, copp... | short, medium warmth, canned corn and oak not... |
| ABERFELDY 12 YEAR OLD | 255281 | [36, 37] | [113, 114, 115, 116, 117, 118, 119, 120, 121, ... | 76.931034 | 6.181372 | Highland | •\t slight salty tones, but also a bit of swee... | •\t fairly sweet, a fair bit of burn, plums, a... | let's start with the arran sauternes cask •\... |
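The roll-up relies on two pandas behaviours that are easy to miss: mixing a list of aggregations with a custom function gives two-level column labels, and a custom function is labelled by its Python name. A small sketch on made-up data (the `toy` frame and its values are hypothetical, and the space separator in `words` is an assumption about the intended result):

```python
import pandas as pd

# hypothetical stand-in for the reviews table
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "rating": [80, 90, 58],
    "nose": ["sweet corn", "oak", "grain funk"],
})

def words(col):
    # join with a space so neighbouring reviews stay separate words
    return " ".join(col)

rolled = toy.groupby("Name").agg({"rating": ["mean", "std"], "nose": words})

# two-level labels such as ('rating', 'mean') and ('nose', 'words')
print(rolled.columns.tolist())

# collapse each label pair into a single string, e.g. 'rating_mean'
rolled.columns = ["_".join(col).strip() for col in rolled.columns.values]
print(rolled)
```

A whisky with a single review gets `NaN` for `rating_std`, because the sample standard deviation is undefined for one observation. That is why a row like 601 BOURBON shows no value there.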


## Preprocess Reviews

Gensim has some nifty preprocessing functions.
Here we define a list of filters to run on the data, then apply it to our reviews.
The built-in filters have some limitations, so we also define a couple of our own:

- The standard filters don't catch some curly apostrophes and quotation marks, so we remove those ourselves.
- Gensim's lemmatizer depends on libraries that were never ported to Python 3, so we'll use nltk instead.
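The first limitation is easy to reproduce with the standard library alone. The sketch below assumes the stock punctuation filter is built on `string.punctuation`, which only covers ASCII characters; `strip_ascii_punctuation` is a hypothetical stand-in written for this illustration, not gensim's own function.

```python
import re
import string

# Assumption: mirrors a punctuation filter built on string.punctuation (ASCII only)
ascii_punct = re.compile("[%s]+" % re.escape(string.punctuation))

def strip_ascii_punctuation(s):
    # replace each run of ASCII punctuation with a single space
    return ascii_punct.sub(" ", s)

def fix_apostrophes(s):
    # remove curly double quotes and the curly apostrophe
    return re.sub("“|”|’", "", s)

sample = "werther’s “original” caramel, it's sweet!"

without_fix = strip_ascii_punctuation(sample)
with_fix = strip_ascii_punctuation(fix_apostrophes(sample))

print(without_fix.split())  # curly characters are still attached to the words
print(with_fix.split())     # curly characters are gone
```

Running the custom fix before the punctuation filter means the curly characters never reach the tokens.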

In [None]:
# work on a copy so the rolled-up frame is left untouched
whiskyp = whisky.copy()

# The standard filters aren't catching the curly quotation marks and apostrophes in some reviews, so add a function that removes them:
def fix_apostrophes(s):
    return re.sub('“|”|’', '', s)

# gensim has a lemmatizer but it requires some libraries that haven't been updated to python 3,
# and it doesn't handle part-of-speech tags as well.
# here we define a lemmatize function that tags words first and converts nltk tags to wordnet tags.
def lemmatize_text(text):
    
    lemmatizer = WordNetLemmatizer() 

    lemmatized_output = " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(text)])
    return lemmatized_output

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# List all of the functions we want to apply
CUSTOM_FILTERS = [lambda x: x.lower(),         # make everything lowercase
                  fix_apostrophes,             # remove curly apostrophes and quotes
                  strip_tags,                  # remove any html tags that might have ended up in here
                  lemmatize_text,              # tag part of speech, then lemmatize
                  strip_punctuation,           # replace punctuation with spaces
                  strip_short,                 # remove words shorter than 3 characters
                  strip_multiple_whitespaces,  # collapse repeated whitespace
                  strip_numeric,               # remove all digits
                  remove_stopwords             # remove stopwords such as the, and, etc
                 ]

# process columns: each cell becomes a list of cleaned tokens
for column in ['nose','taste','finish']:
    whiskyp[column] = whiskyp[column].apply(lambda text: preprocess_string(text, CUSTOM_FILTERS))

# take a look
whiskyp.head()
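The order of the filter list matters, because each filter sees the output of the one before it. The sketch below uses a hypothetical `apply_filters` helper and two simplified stand-in filters (none of these are gensim functions) to show why punctuation is stripped before short words are dropped:

```python
import re

def apply_filters(text, filters):
    # hypothetical pipeline: run each filter in order, then split on whitespace
    for f in filters:
        text = f(text)
    return text.split()

def drop_punctuation(s):
    # replace anything that is not a word character or whitespace with a space
    return re.sub(r"[^\w\s]", " ", s)

def drop_short(s, minsize=3):
    # keep only words with at least `minsize` characters
    return " ".join(w for w in s.split() if len(w) >= minsize)

sample = "It's oaky"

punct_first = apply_filters(sample, [str.lower, drop_punctuation, drop_short])
short_first = apply_filters(sample, [str.lower, drop_short, drop_punctuation])

print(punct_first)  # ['oaky']
print(short_first)  # ['it', 's', 'oaky']
```

With the short-word filter first, "it's" is four characters long and survives, only to be split into two leftover fragments by the punctuation filter afterwards.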

## Save to File

In [67]:
# convert style to a plain string column before writing
# (pd.Series.mode can return several values when there is a tie)
whiskyp['style'] = whiskyp['style'].astype('str')

In [68]:
whiskyp.to_parquet('data/whisky_processed.parquet')

