# Preprocessing Scotch Notes

This notebook was used for the initial preprocessing of the scotch data.

The initial functions and classes were designed here and later moved to the whiskynlp scripts.

Keywords were initially ranked by number of occurrences. When rewriting the graph functions, more attention was paid to keyword extraction methods, and an eigencentrality-based method was chosen.

In [1]:
import pandas as pd
import numpy as np
from hashlib import md5
import json

## Adding IDs and removing duplicates

In [2]:
# Creating an ID for each whisky
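def hashEl(name, url):
    """
    Builds an ID from the MD5 hashes of the name and the URL.
    Each is hashed individually, then max/min order the two digests so the
    result is the same irrespective of which gets input first.
    """
    h1 = md5(name.encode()).hexdigest()
    h2 = md5(url.encode()).hexdigest()
    # Order-independent concatenation of the two digests (64 hex characters)
    h3 = max(h1, h2) + min(h1, h2)
    # Note: h4 (an MD5 of h3) is computed but unused; h3 is what gets returned
    h4 = md5(h3.encode()).hexdigest()
    return h3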
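
The de-duplication step itself is not shown in this notebook. As a rough sketch of how an order-independent ID can drive it, the example below uses plain dicts rather than a DataFrame; `symmetric_id`, `drop_duplicates` and the rows are made up for illustration and are not part of the original code.

```python
from hashlib import md5


def symmetric_id(a: str, b: str) -> str:
    """Order-independent ID: the larger digest first, then the smaller one."""
    h1 = md5(a.encode()).hexdigest()
    h2 = md5(b.encode()).hexdigest()
    return max(h1, h2) + min(h1, h2)


def drop_duplicates(records):
    """Keep the first record seen for each ID (records are plain dicts)."""
    seen = {}
    for rec in records:
        rec_id = symmetric_id(rec["Name"], rec["URL"])
        # setdefault leaves an existing entry untouched, so the first one wins
        seen.setdefault(rec_id, {**rec, "ID": rec_id})
    return list(seen.values())


# Made-up rows for illustration only
rows = [
    {"Name": "Example Malt", "URL": "https://example.com/a"},
    {"Name": "Example Malt", "URL": "https://example.com/a"},
    {"Name": "Other Malt", "URL": "https://example.com/b"},
]
unique_rows = drop_duplicates(rows)
print(len(rows), "->", len(unique_rows))
```

Because the two digests are ordered before concatenation, swapping the arguments gives the same ID.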

In [2]:
df = pd.read_csv("scotch-no-dupes.csv")
df.head()

Unnamed: 0,ID,Type,Name,Description,Nose,Palate,Finish,Price,Size,Abv,URL
0,495334d7384f4c9a933a156cb57639770cd9c8bca00ac7...,blended malt scotch,Monkey Shoulder Blended Malt Scotch Whisky,Monkey Shoulder Scotch is a superb blended mal...,"An elegant, stylish nose of marmalade, Crema C...","Very malty, creamy delivery with a suggestion ...","Medium length, spicy oak and a hint of pepperm...",25.94,70.0,40.0,https://www.masterofmalt.com/whiskies/monkey-s...
1,e193fa8dee0bb9422054efd5dfb7f2c2628815243b584b...,blended malt scotch,Johnnie Walker Green Label 15 Year Old,"One of those harder-to-find whiskies, Johnnie ...",,,,38.95,70.0,43.0,https://www.masterofmalt.com/whiskies/johnnie-...
2,d3ba34da7b98276f2da2ab313ff9e6cdc5d476d253a05e...,blended malt scotch,The Naked Grouse,An interesting addition to the Famous Grouse r...,"Smooth and oily with notes of cherry compote, ...","Sherried and thick with notes of sultanas, sti...","Medium, with notes of cocoa, oak and just a so...",26.49,70.0,40.0,https://www.masterofmalt.com/whiskies/naked-gr...
3,b74f75c04d65218b3b094130583f53b58b58d331f47d44...,blended malt scotch,Scallywag,Big Peat's gone and got himself a trusty sidek...,Sweetness jumps up like an excited puppy. Icin...,"The sweetness surprisingly retreats, revealing...",A pinch of oak spice joins the vanilla and she...,38.75,70.0,46.0,https://www.masterofmalt.com/whiskies/douglas-...
4,a6f33f754e0dbb178b5fef3ff0d031f470918b9477703c...,blended malt scotch,Monkey Shoulder Smokey Monkey,A peaty variant of the excellent Monkey Should...,"Honeydew melon, flamed orange peel, a touch of...","Vanilla sits at the core, its earthy notes bol...",Toffee Crisp bars and the last wafts of drying...,27.44,70.0,40.0,https://www.masterofmalt.com/whiskies/monkey-s...


## Extracting to graph
One issue with whisky tasting notes is that a document rarely contains any word more than once, so traditional frequency-based techniques don't necessarily extract the right keywords.
We create a network graph of keywords, where each word is a node and each edge is a co-occurrence of two words within an individual tasting note, weighted by the number of co-occurrences.
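
As a minimal sketch of this idea (not the notebook's own graph functions, which appear further down), co-occurrence weights can be counted with the standard library. The notes are made up, and unlike the notebook code this version counts each word at most once per note.

```python
from collections import Counter
from itertools import combinations

# Made-up, already tokenised notes for illustration
notes = [
    ["smoke", "peat", "vanilla"],
    ["peat", "smoke", "honey"],
    ["vanilla", "honey"],
]

edge_weights = Counter()
for tokens in notes:
    # Each unordered pair of distinct words in a note is one co-occurrence;
    # sorting makes (a, b) and (b, a) land on the same key
    for a, b in combinations(sorted(set(tokens)), 2):
        edge_weights[(a, b)] += 1

for (a, b), weight in edge_weights.most_common():
    print(f"{a} -- {b}: {weight}")
```

Here `peat` and `smoke` appear together in two notes, so that edge gets weight 2 while every other edge has weight 1.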

Another issue is that words have different meanings in this domain, such as `peat`. In everyday English, peat is a fuel and a synonym might be grass; in whisky, peat and grass are about as far from each other as possible.

The WordNet lemmatizer fails to lemmatize many words, so we need to build our own lemmatizer.

### Functions for making corpus

In [4]:
# Make Corpus Function

# Removing Punctuation
import string
punct = string.punctuation+'’'

def makeCorpus(lst):
    out = ''
    for el in lst:
        out = out + el + '  '
    return out

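def makeList(df, col):
    """
    Extracts the column `col` of `df` as a list of lower-cased strings
    with punctuation removed. Assumes a 0..n-1 integer index.
    """
    out = []
    # Iterate over every row (the original range stopped one row short)
    for row in range(len(df.index)):
        # Extracting cell
        row_str = df[col][row].lower()
        # Removing punctuation
        row_str = row_str.translate(str.maketrans('', '', punct))
        out.append(row_str)
    return out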

### Lemmatizer

In [5]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

# Extracting stop words
from nltk.corpus import stopwords
whisky_stopwords = ["nose", "palate", "finish", "doesnt", "eye", "touch", "note", "hint", "good","linger","lingers","alongside","mar"]
swords = set(stopwords.words("english") + whisky_stopwords)


class WhiskyLemmatizer(WordNetLemmatizer):
    '''
    An extension on the WordNet Lemmatizer with added context for whisky
    '''
        
    def __init__(self):
        super().__init__()
        # Whisky-specific overrides; also used as a cache for looked-up lemmas
        self.whisky_words = {
            "peated": "peat",
            "peaty": "peat",
            "smokey": "smoke",
            "smoky": "smoke",
            "sherried": "sherry"
        }
    
    def lemmatize(self, word):
        # Caches lemmatized words to avoid lookups
        if word in self.whisky_words:
            out = self.whisky_words[word]
        else:
            tag = self.tag(word)
            out = super().lemmatize(word, pos=tag)
            self.whisky_words[word] = out
        return out
    
    def lemmatizeList(self, lst):
        return [self.lemmatize(w) for w in lst]
    
    def whiskySub(self, word):
        if word in self.whisky_words:
            return self.whisky_words[word]
        else:
            return word
    
    def tag(self, word):
        # First letter of the POS tag, mapped to a WordNet part of speech
        tag = pos_tag([word])[0][1][0].lower()
        if tag == "v":
            return "v"
        elif tag == "j":
            return "a"
        else:
            return "n"
        
        
    def __repr__(self):
        return "<WhiskyLemmatizer>"
    
lemmatizer = WhiskyLemmatizer()


def tokenFilter(corpus):
    tokens = lemmatizer.lemmatizeList(word_tokenize(corpus))
    filtered = [w for w in tokens if w not in swords]
    return filtered


### Graph making functions

In [6]:
def makeNodes(corpus):
    filtered = tokenFilter(corpus)
    nodes_dict = {}
    for word in filtered:
        if word in nodes_dict:
            nodes_dict[word] += 1
        else:
            nodes_dict[word] = 1
    
    node_names = list(nodes_dict.keys())
    nodes = []
    for name in node_names:
        nodes.append({
            "name":name,
            "n_occ":int(nodes_dict[name])
        })
    return nodes, node_names
        

def incrementEdge(edges, from_idx, to_idx):
    from_s, to_s = str(from_idx), str(to_idx)
    if from_s in edges:
        if to_s in edges[from_s]:
            edges[from_s][to_s] += 1
        else:
            edges[from_s][to_s] = 1
    else:
        edges[from_s] = {to_s : 1}
    return


def makeNoteEdges(note, nodes, edges):
    descs = tokenFilter(note)
    # Keep only descriptors that are nodes, as indices into the node list
    node_idxs = [nodes.index(w) for w in descs if w in nodes]
    n_descs = len(node_idxs)
    for note_idx1 in range(n_descs - 1):
        for note_idx2 in range(note_idx1+1, n_descs):
            node1 = node_idxs[note_idx1]
            node2 = node_idxs[note_idx2]
            # Store each undirected edge once, from the lower to the higher index
            from_idx = min(node1, node2)
            to_idx = max(node1, node2)
            incrementEdge(edges, from_idx, to_idx)
    return

        
def initialMakeEdges(lst, nodes):
    edges = {}
    for note in lst:
        makeNoteEdges(note, nodes, edges)
    return edges

def makeEdges(lst, names, verbose=False):
    init_edges = initialMakeEdges(lst, names)
    edges = []
    for start in init_edges.keys():
        for end in init_edges[start].keys():
            start_int, end_int = int(start), int(end)
            edge = {
                "from": start_int,
                "to": end_int,
                "weight": init_edges[start][end],
            }
            if verbose:
                # Add english description to edge.
                desc = {
                    "from": names[start_int],
                    "to": names[end_int]
                }
                edge["english"] = desc
            edges.append(edge)
    return edges
    

def makeGraph(corpus_list, verbose_edges=False):
    corpus = makeCorpus(corpus_list)
    
    nodes, names = makeNodes(corpus)
    edges = makeEdges(corpus_list, names, verbose_edges)
    
    graph = {
        "nodes": nodes,
        "edges": edges,
        "node-names": names
    }
    return graph

### Graph Analysis functions

In [7]:
def sortNodesByNOcc(graph):
    """
    Sorts nodes by n occ : requires each node to know its n occ.  
    This should be the case as its n occ is included when making graph
    """
    # Extracting nodes to data frame
    nodes = pd.DataFrame(graph["nodes"])
    
    # Renaming columns and sorting graph
    nodes.columns = ["Descriptor", "N Occ"]
    nodes = nodes.sort_values("N Occ",ascending=False)
    nodes = nodes.reset_index()
    return nodes[["Descriptor", "N Occ"]]

def getUnrepresentedCount(corpus_list, nodes):
    """
    Counts the tasting notes that contain no descriptor present on the graph
    """
    descriptors = list(nodes["Descriptor"])
    
    n_unrepresented = 0
    
    for t_note in corpus_list:
        desc = tokenFilter(t_note)
        matches = [w in descriptors for w in desc]
        if True not in matches:
            n_unrepresented += 1
    
    return n_unrepresented

### Functions to add words to lemmatizer


In [8]:
def addToLemmatizer(nodes, depth):
    """
    Adds a rough stemming of each of the first `depth` nodes to the lemmatizer.
    """
    # getting unlemmatized common words
    list_descriptors = list(nodes["Descriptor"])
    describers_d = list_descriptors[:depth]
    idx1 = 0
    while idx1 < len(describers_d):
        idx2 = idx1 + 1
        word1 = describers_d[idx1]
        while idx2 < len(describers_d):
            word2 = describers_d[idx2]
            if word2[:len(word1)] == word1:
                describers_d.pop(idx2)
                break
            idx2 += 1
        idx1 += 1


    for word in describers_d:
        unlemma = [w for w in list_descriptors if w[:len(word)]==word]
        for unlemma_word in unlemma:
            lemmatizer.whisky_words[unlemma_word] = word

## Tasting Note Analysis
We use the following process to analyse the tasting notes:

- Make preliminary graph based on hardcoded lemmatizer

- Based on N Occ, take each term and add a rough stem to the lemmatizer

- Remake the graph with improved lemmatizer
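
The "rough stem" step can be illustrated with a simplified prefix match. This is a sketch under stated assumptions, not the notebook's `addToLemmatizer`: `rough_stems` is a hypothetical helper, the descriptors are made up, and the list is assumed to be sorted from most to least frequent.

```python
def rough_stems(descriptors, depth):
    """Map each descriptor to the most frequent top-`depth` word that prefixes it.

    `descriptors` is assumed to be sorted from most to least frequent.
    """
    stems = descriptors[:depth]
    mapping = {}
    for word in descriptors:
        for stem in stems:
            if word.startswith(stem):
                mapping[word] = stem
                break  # stems are frequency-ordered, so the first match wins
    return mapping


# Made-up frequency-ordered descriptors for illustration
descriptors = ["oak", "spice", "fruit", "oaky", "spiced", "fruity", "fruits", "honey"]
mapping = rough_stems(descriptors, depth=3)
print(mapping)
```

Prefix matching is crude: any word that merely starts with a common word is merged into it, related or not, which is why the resulting dictionary is worth inspecting before the graph is remade.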

### Tasting Note Analysis : Nose

In [9]:

# Extracting list of tasting notes from the data frame
nose_list = makeList(df.dropna().reset_index(), "Nose")

# Making a preliminary graph and extracting words
nose_graph_prelim = makeGraph(nose_list, verbose_edges=True)
nose_occ = sortNodesByNOcc(nose_graph_prelim)

In [10]:
## Noticed that when using len_deg we got some interesting
## results. Looking instead at the short words, they all seem
## weird, so adding them to the stopwords
for word in list(nose_occ["Descriptor"]):
    if len(word) < 3:
        print(word)
        
for word in list(nose_occ["Descriptor"]):
    if len(word) < 3:
        swords.add(word)
        
# Improving lemmatizer based on rough stems of the 200 most common words
addToLemmatizer(nose_occ, 200)

–
au
de
le
go
px
u
se
9
‘
18
ol
10
ba
oh
ii
15
n
12
30


In [11]:
# Remaking nose graph
nose_graph = makeGraph(nose_list, verbose_edges=True)
nose_n_occ = sortNodesByNOcc(nose_graph)

Now that we have a list of lemmatized nose descriptors (minus stopwords), we can apply the same process to palate and finish.

### Tasting Note Analysis : Palate

In [13]:
# Extracting list of tasting notes from the data frame
palate_list = makeList(df.dropna().reset_index(), "Palate")

# Making a preliminary graph and extracting words
palate_graph_prelim = makeGraph(palate_list, verbose_edges=True)
palate_occ = sortNodesByNOcc(palate_graph_prelim)

# Improving lemmatizer based on rough stems of the 200 most common words
addToLemmatizer(palate_occ, 200)

In [14]:
for word in list(palate_occ["Descriptor"]):
    if len(word) < 3:
        print(word)
for word in list(palate_occ["Descriptor"]):
    if len(word) < 3:
        swords.add(word)

mr
8
90
42
ra
el
“
”


In [15]:
# Remaking palate graph
palate_graph = makeGraph(palate_list, verbose_edges=True)
palate_n_occ = sortNodesByNOcc(palate_graph)

### Tasting Note Analysis : Finish

In [16]:
# Extracting list of tasting notes from the data frame
finish_list = makeList(df.dropna().reset_index(), "Finish")

# Making a preliminary graph and extracting words
finish_graph_prelim = makeGraph(finish_list, verbose_edges=True)
finish_occ = sortNodesByNOcc(finish_graph_prelim)

In [17]:
## Noticed that when using len_deg we got some interesting
## results. Looking instead at the short words, they all seem
## weird, so printing them here
for word in list(finish_occ["Descriptor"]):
    if len(word) < 3:
        print(word)

In [18]:
addToLemmatizer(finish_occ, 200)

In [19]:
# Remaking finish graph
finish_graph = makeGraph(finish_list, verbose_edges=True)
finish_n_occ = sortNodesByNOcc(finish_graph)

### Saving lemmatizer dictionary to json
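
The cell for this step is not included in the notebook. A minimal sketch, assuming the mapping is a plain `str -> str` dict like `lemmatizer.whisky_words`; the file name `whisky_lemmas.json` is hypothetical and the file is written to a temporary directory here.

```python
import json
import os
import tempfile

# Stand-in for lemmatizer.whisky_words (a plain str -> str dict)
whisky_words = {"peated": "peat", "peaty": "peat", "smoky": "smoke"}

# Hypothetical file name; a temporary directory keeps the example side-effect free
path = os.path.join(tempfile.mkdtemp(), "whisky_lemmas.json")

with open(path, "w", encoding="utf-8") as json_out:
    # ensure_ascii=False keeps any non-ASCII tokens readable in the file
    json.dump(whisky_words, json_out, indent=2, sort_keys=True, ensure_ascii=False)

# Round trip: the loaded mapping could seed a new lemmatizer's dictionary
with open(path, encoding="utf-8") as json_in:
    restored = json.load(json_in)

print(restored == whisky_words)
```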


### Tasting Note Analysis : Remaking graphs with updated lemmatizer

In [45]:
with open("graphs/nose.prim.json") as json_in:
    nose_graph = json.load(json_in)
with open("graphs/palate.prim.json") as json_in:
    palate_graph = json.load(json_in)
with open("graphs/finish.prim.json") as json_in:
    finish_graph = json.load(json_in)

Based on a quick look, 100 words for each graph seems to give very good coverage of the dataset.
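
One way to put a number on coverage is the fraction of notes that contain at least one of the top-N descriptors. This is a sketch with made-up data; `coverage` is a hypothetical helper, not a function from the notebook or from whiskynlp.

```python
def coverage(notes, descriptors, top_n):
    """Fraction of notes containing at least one of the top_n descriptors."""
    keep = set(descriptors[:top_n])
    covered = sum(1 for tokens in notes if keep.intersection(tokens))
    return covered / len(notes)


# Made-up tokenised notes and frequency-ordered descriptors
notes = [
    ["smoke", "peat"],
    ["vanilla", "honey"],
    ["heather"],
    ["smoke", "honey"],
]
descriptors = ["smoke", "honey", "peat", "vanilla", "heather"]

for n in (1, 2, 5):
    print(n, coverage(notes, descriptors, n))
```

Plotting this value against N would show where adding further descriptors stops improving coverage.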

## Graph Visualisations

It proved very difficult to find a 'good' graph visualisation with this many nodes.

In [30]:
import networkx as nx

In [46]:
def graph2NX(g, title):
    G = nx.Graph(name=title)
    nodes = g["nodes"]
    for idx, node in enumerate(nodes):
        G.add_node(node["name"], name=node["name"], nocc=node["n_occ"])
    
    edges = []
    for edge in g["edges"]:
        edges.append(
            (nodes[edge["from"]]["name"], nodes[edge["to"]]["name"], edge["weight"])
        )
    G.add_weighted_edges_from(edges)
    
    
    return G

In [115]:
from pyvis.network import Network
def graph2PV(g, notebook=False, dims=(1000,700)):
    """
    Function to convert a graph dict into a pyvis Network for display
    """
    
    net = Network(width=str(dims[0])+'px',height=str(dims[1])+'px',notebook=notebook)
    
    g_nodes = g["nodes"]
    g_edges = g["edges"]
    
    nodes = np.arange(len(g["nodes"]))
    
    props = list(zip(*[(node["name"], node["n_occ"]) for node in g["nodes"]]))
    names = list(props[0])
    n_occ = list(props[1])
    
    net.add_nodes(nodes, value=n_occ,
                         title=names)
    
    #net.add_edges([(edge["from"], edge["to"], edge["weight"]) for edge in g_edges])
    net.edges = [{'width':edge["weight"],'from':edge["from"],'to':edge["to"]} for edge in g_edges]
    
    net.show_buttons(filter_=["nodes", "edges", "physics"])
    
    net.set_options('{"layout":true}')
    
    return net
    

In [116]:
nose_vis = graph2PV(nose_graph)
nose_vis.show("nose_vis.html")

In [99]:
palate_vis = graph2PV(palate_graph)
palate_vis.show("palate_vis.html")

In [100]:
finish_vis = graph2PV(finish_graph)
finish_vis.show("finish_vis.html")

In [27]:
nose_nx = graph2NX(nose_graph, "Nose")
palate_nx = graph2NX(palate_graph, "Palate")
finish_nx = graph2NX(finish_graph, "Finish")

# Testing Eigencentrality RAKE
Testing what eigencentrality RAKE produces
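
For intuition about what an eigencentrality ranking does, here is a small pure-Python power iteration on a weighted co-occurrence graph. It is only a sketch: the internals of `GraphKE` are not shown in this notebook, so this is not necessarily how it computes its scores, and `eigen_centrality` and the edge weights are made up for illustration.

```python
def eigen_centrality(weights, n_iter=100, tol=1e-9):
    """Power iteration on an undirected weighted graph.

    `weights` maps (node_a, node_b) -> edge weight.
    Returns node -> score, normalised to unit Euclidean length.
    """
    adj = {}
    for (a, b), w in weights.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w

    scores = {node: 1.0 for node in adj}
    for _ in range(n_iter):
        # New score = own old score + weighted sum of neighbours' scores.
        # Adding the old score shifts the matrix by the identity, which keeps
        # the same eigenvectors but stops the iteration from oscillating.
        new = {
            node: scores[node] + sum(w * scores[nb] for nb, w in nbrs.items())
            for node, nbrs in adj.items()
        }
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {node: v / norm for node, v in new.items()}
        delta = sum(abs(new[n] - scores[n]) for n in adj)
        scores = new
        if delta < tol:
            break
    return scores


# Made-up co-occurrence weights for illustration
weights = {
    ("smoke", "peat"): 3,
    ("smoke", "oak"): 2,
    ("peat", "oak"): 1,
    ("oak", "vanilla"): 1,
}
scores = eigen_centrality(weights)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A word scores highly when it co-occurs often with other high-scoring words, which is why this kind of ranking can differ from a plain occurrence count.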

In [3]:
from whiskynlp.GraphKeywordExtraction import GraphKE

In [6]:
KE = GraphKE()
KE.keywordExtract(df, "Finish", 100)

Building Corpus
Building Graph
Ranking Nodes


[('long', 0.3315870919125884),
 ('spice', 0.3107726573888594),
 ('oak', 0.2976290779752006),
 ('sweet', 0.27160653062623674),
 ('fruit', 0.2688196361323793),
 ('smoke', 0.23526105087172744),
 ('dry', 0.2247697418298472),
 ('pepper', 0.190878219669847),
 ('chocolate', 0.1631098714699105),
 ('malt', 0.1461320618887879),
 ('vanilla', 0.14155751532720273),
 ('spicy', 0.1208973475380097),
 ('honey', 0.11884194886265252),
 ('length', 0.10956742803932186),
 ('orange', 0.10500881137730003),
 ('warm', 0.09997421144076435),
 ('nut', 0.09666299139694369),
 ('sugar', 0.09556504325312482),
 ('peat', 0.09460690635633463),
 ('cinnamon', 0.09201346296830785),
 ('dark', 0.09106126239482776),
 ('barley', 0.09094528628362741),
 ('toast', 0.09084237055062981),
 ('black', 0.08773798720701322),
 ('rich', 0.08753694505120678),
 ('cream', 0.08654084327040498),
 ('butter', 0.08622728758188528),
 ('last', 0.08621590632135873),
 ('salt', 0.08586422355151505),
 ('apple', 0.08263846507215322),
 ('caramel', 0.08054