# Topic Modeling

## Introduction
The goal of topic modeling is to find the latent topics present in a corpus. Each document in the corpus is made up of one or more topics. One approach to topic modeling is known as Latent Dirichlet Allocation (LDA). In LDA, documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens. LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.
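
To make the effect of the Dirichlet prior more concrete, here is a minimal sketch that draws topic mixtures from a symmetric Dirichlet distribution using only the standard library. The concentration values (0.1 and 10.0), the topic count of 5, and the helper names are illustrative assumptions, not settings used elsewhere in this notebook.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    # A Dirichlet draw can be built from independent Gamma(alpha, 1) draws
    # that are normalized so they sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(mixes):
    """Average share held by the single largest topic in each mixture."""
    return sum(max(m) for m in mixes) / len(mixes)

rng = random.Random(0)
num_topics = 5  # illustrative value

# Small alpha: most of the mass tends to land on one or two topics.
sparse_mixes = [sample_dirichlet(0.1, num_topics, rng) for _ in range(1000)]
# Large alpha: the mass tends to spread evenly across all topics.
even_mixes = [sample_dirichlet(10.0, num_topics, rng) for _ in range(1000)]

print(round(mean_largest_share(sparse_mixes), 2))
print(round(mean_largest_share(even_mixes), 2))
```

With a small concentration value, the largest topic usually dominates each sampled mixture, which is the "handful of topics per document" behavior described above.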

## Problem Statement

## Input
1. Document-term matrix, with the additional stop words removed, produced by the EDA notebook
2. The number of topics to find in the reviews

## Output
1. Once the topic modeling technique is applied, interpret the results and check whether the mix of words in each topic makes sense. If it does not, try changing the number of topics, the terms in the document-term matrix, or the model parameters, or try a different model.


MIT License

Copyright (c) 2022 UFO Software, LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

In [1]:
import os
from os.path import exists
import spacy
from spacy import displacy
import pandas as pd
import itertools as it
import tqdm as notebook_tqdm
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
from gensim import matutils, models

import pyLDAvis
import pyLDAvis.gensim_models
import warnings
import pickle
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from spacy.util import minibatch
from bokeh.plotting import figure, output_notebook, show
import numpy as np
from collections import Counter
import string
import re
import scipy.sparse



In [2]:
# setup the directory structure
# use a temp directory to keep intermediate results
parent_dir = '/run/user/1000/gvfs/smb-share:server=titan.local,share=data_sets/strains'
temp_dir = parent_dir+'/temp'

## Topic Modeling - Attempt #1
Use all the text from the initial data cleaning

## Document Term Matrix (DTM)

In [3]:
# read in the document term matrix
data = pd.read_parquet(temp_dir+'/dtm_stop.parquet')
data

strain,_all,a_couch_locker,a_life_saver,a_roller_coaster,aa,aaa,aaaa,aaaaa,aaaaaaaaaaaaaa,aaaaaaaaaaaaaqaaaaaaaa,...,zzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,ıf,łēčtrpart,ʻohana,ʻono,δthc,⅛th,⅛thweight
1024,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24k-gold,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3-kings,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3x-crazy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
501st-og,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
yoda-og,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yogi-diesel,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yumboldt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yummy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Term Document Matrix (TDM)
The TDM is the transpose of the DTM: terms become rows and documents (strains) become columns.

In [4]:
# transpose the dtm to create a term document matrix
tdm = data.transpose()
tdm.head()

strain,1024,24k-gold,3-kings,3x-crazy,501st-og,5th-element,707-headband,8-ball-kush,818-og,91-krypt,...,wookies,wsu,xxx-og,y-griega,yeti-og,yoda-og,yogi-diesel,yumboldt,yummy,zeus-og
_all,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a_couch_locker,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
a_life_saver,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a_roller_coaster,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aa,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
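
By default, `Sparse2Corpus` treats each column of the sparse matrix as one document (`documents_columns=True`), which is why the DTM is transposed first. The sketch below illustrates the resulting bag-of-words layout, one list of `(term_id, count)` pairs per document, using plain Python lists. The function name and the toy matrix are hypothetical, and this is not gensim's implementation; the exact numeric types gensim yields may differ.

```python
def term_document_to_bow(term_doc_counts):
    """Convert a term-by-document count matrix into one bag-of-words list per document."""
    num_docs = len(term_doc_counts[0])
    corpus = []
    for doc_idx in range(num_docs):
        # Keep only the terms that actually occur in this document.
        bow = [(term_id, row[doc_idx])
               for term_id, row in enumerate(term_doc_counts)
               if row[doc_idx] > 0]
        corpus.append(bow)
    return corpus

# Hypothetical toy matrix: 3 terms (rows) x 2 documents (columns)
toy_tdm = [
    [2, 0],  # term 0
    [0, 1],  # term 1
    [1, 3],  # term 2
]
toy_corpus = term_document_to_bow(toy_tdm)
print(toy_corpus)  # [[(0, 2), (2, 1)], [(1, 1), (2, 3)]]
```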

In [6]:
# Gensim also requires a dictionary that maps each term's column index in the matrix to the term itself
with open(temp_dir+"/cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = {v: k for k, v in cv.vocabulary_.items()}
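
The fitted `CountVectorizer` stores its vocabulary as a term-to-index mapping, while the LDA model needs the reverse direction to print readable topics. A minimal sketch of that inversion, with a hypothetical toy vocabulary standing in for `cv.vocabulary_`:

```python
# Hypothetical stand-in for cv.vocabulary_ (term -> column index)
toy_vocabulary = {"pain": 2, "indica": 0, "sativa": 1}

# Invert it so that a column index looks up its term
toy_id2word = {idx: term for term, idx in toy_vocabulary.items()}

print(toy_id2word[0])  # indica
```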

In [7]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.007*"sativa" + 0.006*"strong" + 0.006*"effect" + 0.006*"happy" + 0.006*"time" + 0.006*"head" + 0.005*"amazing" + 0.005*"sweet" + 0.005*"flavor" + 0.005*"anxiety"'),
 (1,
  '0.009*"pain" + 0.008*"indica" + 0.007*"effect" + 0.007*"sleep" + 0.006*"help" + 0.006*"anxiety" + 0.006*"strong" + 0.006*"relax" + 0.005*"relaxed" + 0.005*"head"')]
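
One rough way to judge whether topics are distinct is to measure how many top words they share. The sketch below parses the two topic strings printed above and computes their Jaccard overlap. The helper names are hypothetical, and Jaccard overlap is only one possible heuristic, not a method used elsewhere in this notebook.

```python
import re
from itertools import combinations

def top_words(topic_string):
    """Extract the quoted words from a gensim-style topic string."""
    return re.findall(r'"([^"]+)"', topic_string)

def jaccard(words_a, words_b):
    """Share of words common to both lists, relative to all words in either."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Topic strings copied from the 2-topic output above
topics = {
    0: '0.007*"sativa" + 0.006*"strong" + 0.006*"effect" + 0.006*"happy" + 0.006*"time" + 0.006*"head" + 0.005*"amazing" + 0.005*"sweet" + 0.005*"flavor" + 0.005*"anxiety"',
    1: '0.009*"pain" + 0.008*"indica" + 0.007*"effect" + 0.007*"sleep" + 0.006*"help" + 0.006*"anxiety" + 0.006*"strong" + 0.006*"relax" + 0.005*"relaxed" + 0.005*"head"',
}

overlaps = {
    (i, j): jaccard(top_words(topics[i]), top_words(topics[j]))
    for i, j in combinations(sorted(topics), 2)
}
print(overlaps)  # {(0, 1): 0.25}
```

A high overlap suggests the topics are hard to tell apart by their top words alone, which is one signal for revisiting the number of topics or the vocabulary.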

In [8]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.007*"pain" + 0.007*"effect" + 0.006*"anxiety" + 0.006*"sativa" + 0.006*"time" + 0.006*"strong" + 0.006*"head" + 0.006*"help" + 0.005*"happy" + 0.005*"amazing"'),
 (1,
  '0.007*"sweet" + 0.007*"indica" + 0.007*"effect" + 0.007*"strong" + 0.006*"flavor" + 0.006*"pain" + 0.005*"relax" + 0.005*"sleep" + 0.005*"relaxed" + 0.005*"time"'),
 (2,
  '0.006*"strong" + 0.006*"indica" + 0.006*"effect" + 0.005*"sleep" + 0.005*"happy" + 0.005*"head" + 0.005*"sweet" + 0.005*"amazing" + 0.005*"pain" + 0.004*"flavor"')]

In [9]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.005*"sweet" + 0.005*"effect" + 0.005*"amazing" + 0.005*"flavor" + 0.005*"happy" + 0.004*"head" + 0.004*"smooth" + 0.004*"indica" + 0.004*"strong" + 0.004*"pretty"'),
 (1,
  '0.007*"effect" + 0.007*"strong" + 0.006*"pain" + 0.006*"time" + 0.006*"head" + 0.006*"sativa" + 0.005*"happy" + 0.005*"indica" + 0.005*"anxiety" + 0.005*"amazing"'),
 (2,
  '0.007*"sativa" + 0.007*"strong" + 0.006*"head" + 0.006*"time" + 0.006*"happy" + 0.006*"effect" + 0.005*"flavor" + 0.005*"sweet" + 0.005*"amazing" + 0.004*"anxiety"'),
 (3,
  '0.010*"pain" + 0.007*"anxiety" + 0.007*"effect" + 0.007*"help" + 0.006*"sleep" + 0.006*"indica" + 0.006*"relax" + 0.005*"cbd" + 0.005*"relaxed" + 0.005*"time"')]

In [10]:
data_clean = pd.read_parquet(temp_dir+'/tri_grams.parquet')
data_clean

Unnamed: 0,tri_review
0,it a good even head and body high good for str...
1,you can change the name give it no name call i...
2,I be skeptical_about this strain after try thr...
3,this strain be always a favorite the top_favor...
4,I have m and this strain be suggest for I to h...
...,...
1051,superdank I finally_find my medicine I can_nt ...
1052,this strain provide a nice head high where you...
1053,this strain be excellent for relieve my migrai...
1054,really like this one nice body high great for ...


## Topic Modeling - Attempt #2
Use only nouns, proper nouns and adjectives

In [11]:
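# Re-apply the additional stop words, since we are recreating the document-term matrix.
# Note: the assignment below sets cls.Defaults.stop_words to this custom set, replacing the default set rather than adding to it.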
custom_stop_words = set(['effect', 'strong', 'day', 'hit', 'amazing', 'favorite', 'little', 'one'])
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words
!python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')

Collecting en-core-web-trf==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0-py3-none-any.whl (460.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 460.3/460.3 MB 4.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [12]:
# Let's create a function to pull out nouns, proper nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns, proper nouns and adjectives.'''
    keep_pos = {'NOUN', 'PROPN', 'ADJ'}
    doc = nlp(text)
    # keep tokens with a wanted part of speech that are not stop words
    kept = [token.text for token in doc if token.pos_ in keep_pos and not token.is_stop]
    return ' '.join(kept)
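
The filtering logic can be tried without loading a spaCy model by working on tokens that already carry a part-of-speech tag. This is a minimal sketch under that assumption: the `(text, pos)` pairs below are hand-tagged and hypothetical, and `filter_tagged` is an illustrative helper, not part of spaCy.

```python
KEEP_POS = {"NOUN", "PROPN", "ADJ"}

def filter_tagged(tagged_tokens, stop_words=frozenset()):
    """Keep nouns, proper nouns and adjectives that are not stop words."""
    kept = [text for text, pos in tagged_tokens
            if pos in KEEP_POS and text not in stop_words]
    return " ".join(kept)

# Hypothetical hand-tagged tokens for one lemmatized review fragment
tagged = [
    ("this", "DET"), ("strain", "NOUN"), ("be", "AUX"), ("excellent", "ADJ"),
    ("for", "ADP"), ("relieve", "VERB"), ("my", "PRON"), ("migraine", "NOUN"),
]

print(filter_tagged(tagged))                         # strain excellent migraine
print(filter_tagged(tagged, stop_words={"strain"}))  # excellent migraine
```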

In [13]:
# Apply the nouns_adj function to the reviews to keep only nouns, proper nouns and adjectives
nouns_adj_file = temp_dir+'/data_nouns_adj.parquet'
if not exists(nouns_adj_file):
    data_nouns_adj = pd.DataFrame(data_clean.tri_review.apply(nouns_adj))
    data_nouns_adj.to_parquet(nouns_adj_file)
else:
    data_nouns_adj = pd.read_parquet(nouns_adj_file)
    
data_nouns_adj

Unnamed: 0,tri_review
0,good even head body high good stress nice high...
1,name name schnauzerganjkosher tangie k gold we...
2,skeptical_about strain kings amazed stress top...
3,strain top_favorite in_fact potency strain dep...
4,strain muscle_spasm first second_time before_b...
...,...
1051,superdank medicine active imagination depressi...
1052,strain nice head high thought forefront high n...
1053,strain excellent migraine chance strain enough...
1054,nice body high great every_day light head high...
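
The cell above only runs the slow part-of-speech pass when no saved result exists. The same compute-once pattern can be sketched as a small reusable helper. This version uses `pickle` and a temporary directory so it runs anywhere; the helper name, file name, and stand-in computation are hypothetical, and the notebook itself stores the result as parquet.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the cached result at `path`, computing and saving it first if needed."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_step():
    # Stand-in for the slow part-of-speech filtering pass
    calls.append(1)
    return ["strain excellent migraine"]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "nouns_adj.pkl")
    first = load_or_compute(cache_file, expensive_step)   # computes and saves
    second = load_or_compute(cache_file, expensive_step)  # loads from disk

print(len(calls))  # 1
```

One caveat with this pattern is that a stale cache file is reused silently, so the file should be deleted whenever the upstream cleaning or stop-word list changes.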


In [14]:
# Create a new document-term matrix using only nouns, proper nouns and adjectives; max_df=.8 drops terms that appear in more than 80% of the reviews
cvna = CountVectorizer(max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.tri_review)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,a_couch_locker,a_life_saver,a_roller_coaster,aa,aaa,aampf,aarch,aaron,ab,abad,...,zzzzzzgreat,zzzzzznighty,zzzzzzzz,zzzzzzzzi,zzzzzzzzzzz,łēčtrpart,ʻohana,δthc,⅛th,⅛thweight
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1051,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1052,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1053,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1054,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
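
With a float value, `max_df` makes `CountVectorizer` ignore terms whose document frequency is strictly higher than that fraction of the documents. The sketch below reproduces that rule on whitespace-split toy reviews. The function name and toy data are hypothetical, and scikit-learn's own tokenization (lowercasing, token pattern) is not replicated here.

```python
from collections import Counter

def drop_common_terms(documents, max_df=0.8):
    """Remove terms whose document frequency is strictly above max_df."""
    tokenized = [doc.split() for doc in documents]
    # Count each term once per document, not once per occurrence.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    limit = max_df * len(documents)
    return [[t for t in tokens if doc_freq[t] <= limit] for tokens in tokenized]

# Hypothetical toy reviews: "strain" appears in every one of them
toy_docs = [
    "strain pain sleep",
    "strain happy sweet",
    "strain pain heavy",
    "strain sweet lemon",
    "strain happy pain",
]
filtered = drop_common_terms(toy_docs)
print(filtered[0])  # ['pain', 'sleep']
```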


In [15]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [16]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.010*"pain" + 0.009*"indica" + 0.008*"time" + 0.008*"head" + 0.007*"sweet" + 0.007*"flavor" + 0.007*"relaxed" + 0.007*"anxiety" + 0.007*"happy" + 0.006*"smooth"'),
 (1,
  '0.012*"sativa" + 0.009*"anxiety" + 0.008*"pain" + 0.008*"happy" + 0.008*"time" + 0.007*"head" + 0.006*"cbd" + 0.006*"sweet" + 0.005*"uplifting" + 0.005*"flavor"')]

In [17]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.008*"happy" + 0.008*"sativa" + 0.007*"time" + 0.006*"flavor" + 0.006*"head" + 0.006*"sweet" + 0.005*"smooth" + 0.005*"hybrid" + 0.005*"lemon" + 0.004*"uplifting"'),
 (1,
  '0.012*"sativa" + 0.009*"happy" + 0.008*"time" + 0.008*"head" + 0.007*"anxiety" + 0.007*"sweet" + 0.007*"flavor" + 0.006*"uplifting" + 0.006*"smooth" + 0.005*"pain"'),
 (2,
  '0.013*"pain" + 0.011*"indica" + 0.009*"anxiety" + 0.008*"relaxed" + 0.008*"time" + 0.007*"head" + 0.007*"sweet" + 0.006*"flavor" + 0.006*"sleep" + 0.006*"smooth"')]

In [18]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.012*"sativa" + 0.008*"time" + 0.008*"happy" + 0.007*"sweet" + 0.007*"head" + 0.006*"pain" + 0.006*"anxiety" + 0.006*"flavor" + 0.005*"uplifting" + 0.005*"smooth"'),
 (1,
  '0.011*"sativa" + 0.009*"happy" + 0.008*"time" + 0.008*"head" + 0.007*"flavor" + 0.007*"anxiety" + 0.006*"sweet" + 0.006*"smooth" + 0.005*"pain" + 0.005*"stuff"'),
 (2,
  '0.012*"pain" + 0.010*"anxiety" + 0.008*"head" + 0.008*"time" + 0.007*"cbd" + 0.007*"happy" + 0.006*"sweet" + 0.006*"relaxed" + 0.006*"flavor" + 0.005*"smooth"'),
 (3,
  '0.013*"indica" + 0.011*"pain" + 0.009*"relaxed" + 0.008*"sweet" + 0.007*"sleep" + 0.007*"time" + 0.007*"head" + 0.007*"flavor" + 0.007*"anxiety" + 0.007*"heavy"')]

In [19]:
# Our final 4 topic model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.005*"acdc" + 0.003*"orange_crush" + 0.002*"mk_ultra" + 0.002*"cbd" + 0.002*"sour_tsunami" + 0.002*"harletsu" + 0.001*"panama_red" + 0.001*"remedy" + 0.001*"blueberry_headband" + 0.001*"voodoo"'),
 (1,
  '0.009*"sativa" + 0.009*"happy" + 0.008*"time" + 0.008*"head" + 0.007*"sweet" + 0.007*"flavor" + 0.006*"anxiety" + 0.006*"pain" + 0.006*"relaxed" + 0.006*"smooth"'),
 (2,
  '0.013*"pain" + 0.009*"anxiety" + 0.009*"indica" + 0.008*"time" + 0.008*"head" + 0.007*"relaxed" + 0.007*"flavor" + 0.007*"sweet" + 0.006*"happy" + 0.006*"smooth"'),
 (3,
  '0.020*"sativa" + 0.007*"haze" + 0.006*"sweet" + 0.006*"energetic" + 0.006*"uplifting" + 0.005*"mango" + 0.005*"pineapple" + 0.005*"strawberry" + 0.004*"happy" + 0.004*"cherry_pie"')]

In [20]:
# Our final 3 topic model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.011*"indica" + 0.010*"pain" + 0.008*"sweet" + 0.008*"relaxed" + 0.008*"time" + 0.007*"head" + 0.007*"flavor" + 0.007*"anxiety" + 0.006*"smooth" + 0.006*"heavy"'),
 (1,
  '0.012*"sativa" + 0.009*"happy" + 0.009*"anxiety" + 0.009*"time" + 0.008*"head" + 0.008*"pain" + 0.006*"flavor" + 0.006*"uplifting" + 0.006*"sweet" + 0.005*"smooth"'),
 (2,
  '0.006*"gelato" + 0.004*"bubble_gum" + 0.003*"bubblegum" + 0.003*"chernobyl" + 0.002*"orange_crush" + 0.002*"blue_diesel" + 0.002*"key_lime_pie" + 0.002*"pink_kush" + 0.002*"nebula" + 0.002*"sweet_tooth"')]

In [21]:
# Final 2 topic model
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.014*"pain" + 0.011*"indica" + 0.009*"anxiety" + 0.008*"relaxed" + 0.007*"sleep" + 0.007*"time" + 0.007*"head" + 0.006*"heavy" + 0.006*"flavor" + 0.006*"relaxing"'),
 (1,
  '0.011*"sativa" + 0.009*"happy" + 0.009*"time" + 0.008*"head" + 0.008*"sweet" + 0.007*"flavor" + 0.006*"anxiety" + 0.006*"smooth" + 0.006*"hybrid" + 0.005*"uplifting"')]

## Identify Topics in Each Document
The 2-topic model makes the most sense: one topic is led by "pain" and "indica", the other by "sativa" and "happy".

In [22]:
# Final 2 topic model
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.016*"pain" + 0.012*"indica" + 0.010*"anxiety" + 0.008*"sleep" + 0.007*"relaxed" + 0.007*"time" + 0.007*"cbd" + 0.006*"head" + 0.006*"heavy" + 0.006*"relaxing"'),
 (1,
  '0.009*"sativa" + 0.009*"happy" + 0.008*"time" + 0.008*"head" + 0.008*"sweet" + 0.007*"flavor" + 0.007*"anxiety" + 0.006*"smooth" + 0.006*"pain" + 0.006*"relaxed"')]
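
To assign a topic to each document, one option is to take the most probable topic from that document's topic mixture. gensim's `get_document_topics` returns such a mixture as a list of `(topic_id, probability)` pairs. The sketch below assumes that format; the probabilities are invented for illustration and are not results from this model.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical per-document topic mixtures, invented for illustration
example_doc_topics = [
    [(0, 0.91), (1, 0.09)],
    [(0, 0.22), (1, 0.78)],
    [(0, 0.52), (1, 0.48)],
]

labels = [dominant_topic(dt)[0] for dt in example_doc_topics]
print(labels)  # [0, 1, 0]
```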

## The Topics Discussed in the Reviews are Indica and Sativa
Hybrid strains make up about half of the reviews, but hybrid does not get its own topic. A hybrid is a cross between indica and sativa, so it is difficult to distinguish from the other two since it holds properties of both.
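
If hybrids blend both topics, their reviews may show up as documents whose two topic probabilities are close together. A minimal sketch of flagging such documents follows; the 0.2 margin, the function name, and the example probabilities are all assumptions made for illustration, not values derived from this data.

```python
def describe_mixture(doc_topics, margin=0.2):
    """Label a document by its leading topic, or 'mixed' when the top two are close."""
    ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return "mixed"
    return f"topic {ranked[0][0]}"

# Hypothetical topic mixtures in (topic_id, probability) form
print(describe_mixture([(0, 0.91), (1, 0.09)]))  # topic 0
print(describe_mixture([(0, 0.52), (1, 0.48)]))  # mixed
```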