# Topic Modeling of Reviews

## Data Preparation

First, we load the CSV and drop empty or very short descriptions (fewer than 40 characters), so that trivial inputs such as "i like" do not end up in the model.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("winemag-data_first150k.csv", nrows=10000)

df = df[["description"]]
df = df[(df["description"].str.strip() != "") & (df["description"].str.len() >= 40)]

print("Total: ", len(df["description"]))
print(df["description"][:5])

Total:  10000
0    This tremendous 100% varietal wine hails from ...
1    Ripe aromas of fig, blackberry and cassis are ...
2    Mac Watson honors the memory of a wine once ma...
3    This spent 20 months in 30% new French oak, an...
4    This is the top wine from La Bégude, named aft...
Name: description, dtype: object
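
As a small illustration of the filter above, here is a sketch on a made-up DataFrame. The sample rows and the names `sample`, `not_blank`, `long_enough` and `kept` are invented for this example and are not part of the dataset or the notebook. It only shows that blank strings and short fragments are dropped while a full-length description survives.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only
sample = pd.DataFrame({"description": [
    "",
    "   ",
    "i like",
    "Ripe aromas of fig, blackberry and cassis on a layered palate.",
]})

# Same two conditions as in the cell above
not_blank = sample["description"].str.strip() != ""
long_enough = sample["description"].str.len() >= 40
kept = sample[not_blank & long_enough]

print(len(kept), "of", len(sample), "rows kept")
```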


We need nltk for its list of English stop words: function words such as "from", "the" and "that", which we ignore so that they do not distort the result.

If you have not downloaded the nltk data yet, uncomment the download line and fetch the stopwords corpus first. The tokenizer used further down also needs nltk's punkt tokenizer data.

In [2]:
import nltk

# uncomment if you need this
# nltk.download() 

stop_words = nltk.corpus.stopwords.words('english')

print(stop_words[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


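Now we remove the stop words and a list of banned words from the reviews, because they would only add noise to the algorithm. We also strip accents, because some reviews use them and some don't. Note that the comparison is case-sensitive, so capitalized words such as "This" survive this step.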

In [3]:
import unidecode

# banned words, we don't want to cluster around those
banned_words = ['wine']

text = [" ".join([unidecode.unidecode(word) for word in text.split(" ")
                  if word not in stop_words
                  and unidecode.unidecode(word) not in banned_words
                 ]) for text in df["description"]]

text[:5]

['This tremendous 100% varietal hails Oakville aged three years oak. Juicy red-cherry fruit compelling hint caramel greet palate, framed elegant, fine tannins subtle minty tone background. Balanced rewarding start finish, years ahead develop nuance. Enjoy 2022-2030.',
 'Ripe aromas fig, blackberry cassis softened sweetened slathering oaky chocolate vanilla. This full, layered, intense cushioned palate, rich flavors chocolaty black fruits baking spices. A toasty, everlasting finish heady ideally balanced. Drink 2023.',
 'Mac Watson honors memory made mother tremendously delicious, balanced complex botrytised white. Dark gold color, layers toasted hazelnut, pear compote orange peel flavors, reveling succulence 122 g/L residual sugar.',
 "This spent 20 months 30% new French oak, incorporates fruit Ponzi's Aurora, Abetina Madrona vineyards, among others. Aromatic, dense toasty, deftly blends aromas flavors toast, cigar box, blackberry, black cherry, coffee graphite. Tannins polished fine s

Next we tokenize and stem the words, so that we keep their essential meaning and do not count plurals or verb inflections of a word as distinct words when they are probably talking about the same thing.

Those words are saved in a vocabulary for later.

In [4]:
from nltk.stem.snowball import SnowballStemmer
import re

stemmer = SnowballStemmer("english")

def tokenize(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

def stem(tokens):
    return [stemmer.stem(t) for t in tokens]

totalvocab_stemmed = []
totalvocab_tokenized = []
for i in text:
    allwords_tokenized = tokenize(i)
    totalvocab_tokenized.extend(allwords_tokenized)
    
    allwords_stemmed = stem(allwords_tokenized)
    totalvocab_stemmed.extend(allwords_stemmed)
    

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print(vocab_frame.head())

there are 263552 items in vocab_frame
              words
this           this
tremend  tremendous
variet     varietal
hail          hails
oakvill    oakville


## Fitting the Model

Now we send our text to a TfidfVectorizer, which uses IDF to down-weight very common words so that the algorithm focuses on the more distinctive ones. We also build n-grams of up to three words, so that something like "black cherry" can be treated as a single term instead of two.

Then we send it to a Latent Dirichlet Allocation model, which should group the texts around clusters, discovering topics automatically for us.

In [5]:
%%time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def tokenize_and_stem(text):
    return stem(tokenize(text))


vectorizer = TfidfVectorizer(max_df=0.6, max_features=200000,
                             min_df=5, stop_words=stop_words,
                             use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

X = vectorizer.fit_transform(text).todense()
print(X.shape)

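# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
terms = vectorizer.get_feature_names()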
print(terms[90:100])

K = 7
model = LatentDirichletAllocation(n_components=K, max_iter=100, random_state=8)
model.fit(X)

model

(10000, 12207)
["'s medium", "'s medium bodi", "'s medium full", "'s medium-bodi", "'s miner", "'s much", "'s nice", "'s nice balanc", "'s nose", "'s nose palat"]




CPU times: user 5min 38s, sys: 1.68 s, total: 5min 40s
Wall time: 5min 41s
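
The two ideas behind the vectorizer, n-grams and IDF weighting, can be sketched in plain Python. This is a simplified illustration and not what `TfidfVectorizer` does internally: the toy documents are invented, the helper `ngrams` is hypothetical, and the formula `ln((1 + n) / (1 + df)) + 1` is the smoothed IDF that scikit-learn documents as its default, as far as I know.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Hypothetical, already stemmed toy documents
docs = [
    ["black", "cherri", "flavor"],
    ["black", "pepper", "flavor"],
    ["ripe", "black", "cherri"],
]

doc_terms = [ngrams(d) for d in docs]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for terms_in_doc in doc_terms for term in set(terms_in_doc))

n_docs = len(docs)
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

for term in ("black", "black cherri", "pepper"):
    print(term, round(idf[term], 3))
```

A term that appears in every document ("black") gets the lowest weight, while the bigram "black cherri" is kept as its own term with a higher weight.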


## Visualization

Here we use pyLDAvis, which gives us an off-the-shelf visualization for LDA. Topics that account for more of the reviews are drawn as bigger circles.

You can click on any circle to see the most relevant words for that particular topic, and you can move the relevance slider to the left (lower λ) to favor words that are specific to that topic rather than common across all topics.

This is a great visualization because we can see three things:
- The biggest topics, which are the ones we should care about most
- Which topics overlap and which talk about completely distinct subjects
- Which words are in each topic, giving us a hint of what it is about

In [7]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(model, X, vectorizer)



Of course we don't only want to see the chart above; we want to be able to actually read the reviews in each topic. So we build a dataframe with the original texts plus the cluster number and score assigned to each one. The cluster is the topic with the highest weight for that review, and the score is that weight. Reviews whose score is below 0.6 are moved to an extra "Others" cluster (index K), because we only want the most focused ones.

In [8]:
result = model.transform(X)
clusters = [ np.argmax(score) for score in result ]
scores = [ max(score) for score in result ]

df["cluster"] = clusters
df["score"] = scores

df.loc[(df["score"] < 0.6), "cluster"] = K

df[["description","cluster","score"]].head()

|   | description | cluster | score |
|---|---|---|---|
| 0 | This tremendous 100% varietal wine hails from ... | 4 | 0.872462 |
| 1 | Ripe aromas of fig, blackberry and cassis are ... | 7 | 0.567521 |
| 2 | Mac Watson honors the memory of a wine once ma... | 4 | 0.82293 |
| 3 | This spent 20 months in 30% new French oak, an... | 4 | 0.642081 |
| 4 | This is the top wine from La Bégude, named aft... | 5 | 0.867355 |
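
The assignment step can be checked on a tiny hand-made example. The matrix below is invented and stands in for the output of `model.transform(X)`; `doc_topic` and the three-topic setup are assumptions for illustration only.

```python
import numpy as np

K = 3  # number of topics in this toy example

# Hypothetical document-topic weights, one row per review, rows sum to 1
doc_topic = np.array([
    [0.80, 0.10, 0.10],   # clearly topic 0
    [0.40, 0.35, 0.25],   # no dominant topic
    [0.05, 0.05, 0.90],   # clearly topic 2
])

clusters = doc_topic.argmax(axis=1)   # index of the strongest topic
scores = doc_topic.max(axis=1)        # weight of that topic

# Reviews without a dominant topic go to the extra "Others" cluster K
clusters = np.where(scores < 0.6, K, clusters)

print(clusters.tolist(), scores.tolist())
```

The second review ends up in cluster 3 ("Others") because its best score, 0.40, is below the 0.6 threshold, just like row 1 in the table above.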


Then we get the main words for each cluster, some examples and the sizes of the clusters

In [9]:
# copied from https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/sklearn.py
def _get_topic_term_dists(lda_model):
    return _row_norm(lda_model.components_)

def _row_norm(dists):
    # row normalization function required
    # for doc_topic_dists and topic_term_dists
    return dists / dists.sum(axis=1)[:, None]

topic_term_dists = _get_topic_term_dists(model)

# for each topic, sort the term indices by descending probability
order_centroids = topic_term_dists.argsort()[:, ::-1]

cluster_words = [
    ", ".join([
        vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0] for ind in order_centroids[i, :6]
    ]) for i in range(K)
]
cluster_words.append("Others")

print(cluster_words[0])

cluster_examples = [
    df[df["cluster"] == i]["description"].values.tolist() for i in range(len(cluster_words))
]

# a cluster can be empty after the 0.6 threshold, so guard the lookup
if cluster_examples[0]:
    print(cluster_examples[0][0])

cluster_sizes = [ len(df[df["cluster"] == i]) for i in range(len(cluster_words)) ]

print(cluster_sizes)

rubbery, flavors, raspberry, resiny, saucy, plum
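
The row normalization and top-term lookup used above can be sketched on a small made-up matrix. The values in `components` and the term list are invented; in the notebook they would come from `model.components_` and the vectorizer's feature names.

```python
import numpy as np

# Hypothetical vocabulary and unnormalized topic-term weights (2 topics x 4 terms)
toy_terms = ["cherri", "oak", "citrus", "tannin"]
components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.2, 0.3, 9.0, 0.5],
])

# Row-normalize so each topic becomes a probability distribution over terms
topic_term = components / components.sum(axis=1, keepdims=True)

# Sort term indices by descending probability and keep the top 2 per topic
top_idx = topic_term.argsort(axis=1)[:, ::-1][:, :2]
top_terms = [[toy_terms[j] for j in row] for row in top_idx]

print(top_terms)
```

Each row of `topic_term` sums to 1, and the first entries of each sorted row are the terms that best describe that topic.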

And we print everything for you to read

In [10]:
# the dataframe only has a "description" column, there is no "text" column
data = { 'text': df["description"].values.tolist() }

frame = pd.DataFrame(data, index = [clusters], columns = ['text'])

print("Top terms per cluster:")
print()

for i in range(len(cluster_sizes)):
    print("Cluster %d total:" % i, cluster_sizes[i])
    print() #add whitespace
    
    print(cluster_words[i])
    print() #add whitespace
    
    print("Cluster %d examples:" % i)
    for example in cluster_examples[i][:15]:
        print('- %s' % example)
    print() #add whitespace
    print() #add whitespace

One problem with unsupervised clustering is that you have to choose the number of clusters beforehand. Previously we chose a small number, which may create topics that are actually talking about two different things.

On the other hand, if we choose too many topics, we might split those topics into smaller ones but also create several that are actually talking about the same thing.

We can visualize this effect on the chart. Take a look with K = 30:

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
K2 = 30

model2 = LatentDirichletAllocation(n_components=K2, max_iter=100, random_state=0)
model2.fit(X)

pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(model2, X, vectorizer)
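
Besides inspecting the chart, one possible heuristic for comparing values of K is the model's perplexity (lower is better). The sketch below is not part of the original analysis: it uses random synthetic counts (`X_toy`) and a few arbitrary candidate values so that it runs in seconds. Perplexity measured on the training data often keeps improving as K grows, so a held-out set would be a fairer comparison.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term counts, invented for illustration only
rng = np.random.RandomState(0)
X_toy = rng.poisson(1.0, size=(60, 20))

perplexities = {}
for k in (2, 5, 10):
    lda = LatentDirichletAllocation(n_components=k, max_iter=10, random_state=0)
    lda.fit(X_toy)
    # perplexity() evaluates the fitted model on the given matrix
    perplexities[k] = lda.perplexity(X_toy)

for k, value in perplexities.items():
    print("K = %d, perplexity = %.2f" % (k, value))
```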

## Credits

I only managed to create this notebook thanks to those two awesome tutorials, which I basically copied for this topic modelling work:

1. Document Clustering with Python - http://brandonrose.org/clustering
2. Modern NLP in Python - https://www.youtube.com/watch?v=6zm9NC9uRkk

And thanks to the [@datasciencepython](https://web.telegram.org/#/im?p=@datasciencepython) group on Telegram for pointing me in the right direction.