# Topic Modeling with Gensim
In this notebook we will extract topics from our collection of questions using Latent Dirichlet Allocation (LDA) and the Gensim package.

Gensim markets itself as "topic modelling for humans", and it's really fast.

According to [NLP for Hackers](https://nlpforhackers.io/topic-modeling/) topic modeling is:
 - Dimensionality Reduction - We reduce dimensionality by representing a text in its topic space instead of its word space.
 - Unsupervised Learning - Topic modeling is similar to clustering.
 - A Form of Tagging - Topic modeling applies multiple tags to a text. (Similar to the tags applied to this kernel above!)
 
Topic modeling is useful in many situations, including our task of text classification.
 
According to the [gensim documentation](https://radimrehurek.com/gensim/tut2.html#transformation-interface), Latent Semantic Indexing (LSI, also known as Latent Semantic Analysis or LSA) is a form of dimensionality reduction in which documents are transformed into a latent space of lower dimensionality.

LDA is a probabilistic extension of LSA (aka multinomial PCA).  LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).
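
To make the "topics are distributions over words, documents are mixtures of topics" idea concrete, here is a minimal sketch with made-up numbers. The vocabulary, the two topics, and the mixture weights are all hypothetical and are not taken from the model trained below.

```python
# Toy illustration (made-up numbers): two topics over a four-word vocabulary.
vocab = ["dog", "cat", "stock", "market"]

# Each topic is a probability distribution over the vocabulary (each row sums to 1).
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # hypothetical "pets" topic
    [0.05, 0.05, 0.40, 0.50],  # hypothetical "finance" topic
]

# A document is a soft mixture of topics (weights sum to 1).
doc_topic = [0.75, 0.25]

# Word probabilities for the document:
# P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)
doc_word = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
    for w in range(len(vocab))
]

for word, p in zip(vocab, doc_word):
    print(f"{word}: {p:.4f}")
```

Because the document leans toward the first topic, "dog" and "cat" end up with most of the probability mass, while the finance words still receive a small share.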

In [1]:
!pip install pyLDAvis



In [2]:
# import packages
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import spacy
import nltk
import re

from gensim import corpora, models, similarities
import pyLDAvis.gensim  # note: newer pyLDAvis releases name this module pyLDAvis.gensim_models

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

np.random.seed(27)

In [3]:
# setting up default plotting parameters
%matplotlib inline

plt.rcParams['figure.figsize'] = [15.0, 7.0]
plt.rcParams.update({'font.size': 22,})

sns.set_palette('Set2')
sns.set_style('white')
sns.set_context('talk', font_scale=0.8)

In [4]:
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
train = pd.read_csv('../SUDHEER/train.csv')
test = pd.read_csv('../SUDHEER/test.csv')
train.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [6]:
contractions = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

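# Regex alternation takes the first alternative that matches, so try longer keys first
# (otherwise "can't've" would be expanded as "can't" + "'ve"); escape keys to match them literally
c_re = re.compile('(%s)' % '|'.join(re.escape(k) for k in sorted(contractions, key=len, reverse=True)))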

def expandContractions(text, c_re=c_re):
    def replace(match):
        return contractions[match.group(0)]
    return c_re.sub(replace, text)
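
Python's regex alternation tries alternatives from left to right, so the order of the keys in the pattern affects which contraction is matched. The sketch below illustrates this with a hypothetical two-entry map (`small_map`, `build_pattern`, and `expand` are names invented for this example).

```python
import re

# Hypothetical two-entry subset of a contraction map, for illustration only.
small_map = {"can't": "cannot", "can't've": "cannot have"}

def build_pattern(keys):
    # Escape each key so every character is matched literally.
    return re.compile("|".join(re.escape(k) for k in keys))

def expand(text, pattern):
    # Replace each matched contraction with its expansion.
    return pattern.sub(lambda m: small_map[m.group(0)], text)

# Insertion order: "can't" is tried first and wins, leaving "'ve" behind.
naive = build_pattern(small_map.keys())
# Longest-first order: the more specific key gets the chance to match.
longest_first = build_pattern(sorted(small_map, key=len, reverse=True))

text = "I can't've done that"
print(expand(text, naive))          # I cannot've done that
print(expand(text, longest_first))  # I cannot have done that
```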

In [7]:
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_tags, strip_punctuation, strip_numeric, stem_text
from gensim.parsing.preprocessing import strip_multiple_whitespaces, strip_non_alphanum, remove_stopwords, strip_short

CUSTOM_FILTERS = [lambda x: x.lower(),         # lowercase
                  strip_tags,                  # remove html tags
                  strip_punctuation,           # replace punctuation with space
                  strip_multiple_whitespaces,  # remove repeating whitespaces
                  strip_non_alphanum,          # remove non-alphanumeric characters
                  strip_numeric,               # remove numbers
                  remove_stopwords,            # remove stopwords
                  strip_short,                 # remove words less than minsize=3 characters long
                  stem_text,                   # reduce words to their stems (Porter stemmer)
                 ]

def gensim_preprocess(docs):
    # clean text
    docs = [expandContractions(doc) for doc in docs]
    docs = [preprocess_string(text, CUSTOM_FILTERS) for text in docs]
    # create the bigram and trigram models
    bigram = models.Phrases(docs, min_count=1, threshold=1)
    trigram = models.Phrases(bigram[docs], min_count=1, threshold=1)  
    # Phraser is a lighter, faster wrapper around a trained Phrases model
    bigram_mod = models.phrases.Phraser(bigram)
    trigram_mod = models.phrases.Phraser(trigram)
    # apply to docs
    docs = trigram_mod[bigram_mod[docs]]
    #docs = [' '.join(text) for text in docs]
    return docs

train_clean = gensim_preprocess(train.question_text)
train_clean[43]

['download_microsoft', 'word', 'window', 'hungarian']
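
Tokens such as `download_microsoft` come from the phrase models, which join adjacent tokens whose co-occurrence score exceeds `threshold`. The following is a simplified sketch of the count-based score from Mikolov et al. that Gensim's default scorer is modeled on, as I understand it. The toy corpus and the `phrase_score` helper are assumptions for illustration, and Gensim's actual implementation differs in details.

```python
from collections import Counter

# Tiny made-up corpus of already-tokenized documents.
docs = [
    ["download", "microsoft", "word"],
    ["download", "microsoft", "office"],
    ["microsoft", "word", "window"],
    ["learn", "word", "game"],
]

# Count single tokens and adjacent token pairs.
unigrams = Counter(tok for doc in docs for tok in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

def phrase_score(a, b, min_count=1):
    # (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    return (bigrams[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])

for pair in bigrams:
    print(pair, round(phrase_score(*pair), 2))
```

In this toy corpus only ("download", "microsoft") scores above 1, so with `threshold=1` it would be the only pair merged. Low values of `min_count` and `threshold`, as used above, make merging fairly aggressive.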

In [8]:
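# Create a Dictionary from our n-gram texts: maps each token to an integer id and tracks document frequencies
train_dictionary = corpora.Dictionary(train_clean)

# filter out extremes
# NOTE: no_below is an absolute document count, not a fraction, so 0.1 filters nothing;
# to drop tokens appearing in fewer than 1% of documents, pass e.g. int(0.01 * len(train)) instead
train_dictionary.filter_extremes(no_below=0.1, # keep tokens appearing in at least this many documents
                                     no_above=0.7, # filter tokens appearing in >70% of documents
                                     keep_n=100000) # after above filters keep only the 100000 most frequent tokens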

# Convert each document to a bag-of-words: a list of (token_id, count) pairs
train_corpus = [train_dictionary.doc2bow(text) for text in train_clean]
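
The two frequency arguments of `filter_extremes` use different units: `no_below` is an absolute number of documents, while `no_above` is a fraction of the corpus. The pure-Python sketch below mimics that behaviour on a toy corpus. `filter_extremes_sketch` is a hypothetical helper, and Gensim's own rounding and tie-breaking may differ.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = [
        tok for tok, df in doc_freq.items()
        if df >= no_below               # absolute document count
        and df <= no_above * num_docs   # fraction of the corpus
    ]
    # Keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda tok: doc_freq[tok], reverse=True)
    return set(kept[:keep_n])

docs = [["cat", "dog"], ["cat", "fish"], ["cat", "dog", "bird"], ["cat"]]
print(sorted(filter_extremes_sketch(docs, no_below=2, no_above=0.7)))    # ['dog']
print(sorted(filter_extremes_sketch(docs, no_below=0.1, no_above=0.7)))  # ['bird', 'dog', 'fish']
```

With `no_below=0.1` every token passes the lower bound, so only the overly common "cat" is removed.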

In [9]:
# view human readable output
[[(train_dictionary[token_id], freq) for token_id, freq in cp] for cp in train_corpus[:1]]

[[('nation', 1), ('provinc', 1)]]
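
A bag-of-words vector simply records which token ids occur in a document and how often. As a rough sketch of the idea (not Gensim's implementation; `build_token2id` and `doc2bow_sketch` are hypothetical helpers):

```python
from collections import Counter

def build_token2id(docs):
    # Assign each new token the next free integer id.
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow_sketch(doc, token2id):
    # Count known tokens and return sorted (token_id, count) pairs;
    # tokens missing from the vocabulary are silently dropped.
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["nation", "provinc"], ["provinc", "nation", "provinc"]]
token2id = build_token2id(docs)
print(doc2bow_sketch(docs[1], token2id))             # [(0, 1), (1, 2)]
print(doc2bow_sketch(["unknown", "nation"], token2id))  # [(0, 1)]
```

Dropping unknown tokens is also why the first question above keeps only two tokens: anything removed from the dictionary no longer appears in the corpus.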

In [10]:
# initialize tfidf model
tfidfi = models.TfidfModel(train_corpus)
# apply transformation to entire corpus
train_tfidf = tfidfi[train_corpus]
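
TF-IDF reweights the raw counts so that tokens appearing in many documents contribute less. The sketch below assumes the common scheme of term frequency times log2(N / document frequency) followed by L2 normalization, which I believe matches `TfidfModel`'s defaults. `tfidf_sketch` is a hypothetical helper rather than part of Gensim.

```python
import math

def tfidf_sketch(bow_corpus):
    """Apply tf * log2(N / df) weighting with L2 normalization to a bag-of-words corpus."""
    num_docs = len(bow_corpus)
    # Document frequency of each token id.
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for bow in bow_corpus:
        vec = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in vec))
        if norm == 0:
            # Every token occurs in all documents, so nothing is informative.
            weighted.append([])
        else:
            # Drop zero-weight tokens and scale the rest to unit length.
            weighted.append([(tid, w / norm) for tid, w in vec if w > 0])
    return weighted

bow_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1)]]
for vec in tfidf_sketch(bow_corpus):
    print(vec)
```

Token 0 occurs in every document, so its weight is zero and it disappears from all three vectors.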

In [11]:
# https://radimrehurek.com/gensim/tut2.html#transformation-interface
# LDA on tfidf
%time train_lda = models.LdaMulticore(train_tfidf, num_topics=10, id2word=train_dictionary, passes=2, workers=6)

Wall time: 6min 33s


In [12]:
train_lda.show_topics()

[(0,
  '0.017*"best" + 0.015*"best_wai" + 0.010*"peopl" + 0.008*"women" + 0.008*"human" + 0.008*"onlin" + 0.008*"colleg" + 0.007*"month" + 0.006*"design" + 0.006*"market"'),
 (1,
  '0.014*"import" + 0.012*"exampl" + 0.010*"us" + 0.007*"movi" + 0.007*"histori" + 0.007*"number" + 0.007*"school" + 0.006*"websit" + 0.006*"valu" + 0.006*"possibl"'),
 (2,
  '0.020*"differ" + 0.014*"quora" + 0.010*"known" + 0.009*"engin" + 0.009*"univers" + 0.008*"world" + 0.007*"class" + 0.006*"product" + 0.006*"experi" + 0.006*"futur"'),
 (3,
  '0.015*"like" + 0.012*"work" + 0.010*"know" + 0.010*"monei" + 0.008*"china" + 0.008*"busi" + 0.008*"relationship" + 0.008*"form" + 0.007*"person" + 0.007*"free"'),
 (4,
  '0.009*"learn" + 0.009*"develop" + 0.008*"book" + 0.008*"girl" + 0.007*"year_old" + 0.007*"prepar" + 0.007*"sex" + 0.006*"cost" + 0.006*"exist" + 0.006*"app"'),
 (5,
  '0.009*"time" + 0.008*"caus" + 0.008*"dai" + 0.007*"benefit" + 0.007*"anim" + 0.007*"bad" + 0.006*"languag" + 0.006*"home" + 0.006*"

In [13]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(train_lda, train_corpus, train_dictionary)
vis

From the visualization above we can see that several topics overlap significantly.
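
One rough way to put a number on topic overlap is the Jaccard similarity between the top-word sets of two topics. This is only a proxy and is not how pyLDAvis computes its inter-topic distances. The word lists below are made up for illustration; with a trained model they could be taken from `show_topics`.

```python
from itertools import combinations

def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical top words per topic (not actual model output).
top_words = {
    0: ["best", "peopl", "women", "onlin"],
    1: ["best", "peopl", "school", "movi"],
    2: ["china", "busi", "monei", "free"],
}

for i, j in combinations(top_words, 2):
    print(f"topics {i} and {j}: {jaccard(top_words[i], top_words[j]):.2f}")
```

Pairs with a high score share much of their vocabulary and are candidates for merging, which is one motivation for tuning the number of topics.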

In [None]:
# using coherence score to find optimal number of topics
# ref: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between successive numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=6, passes=2)
        model_list.append(model)
        coherencemodel = models.CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

model_list, coherence_values = compute_coherence_values(dictionary=train_dictionary,
                                                        corpus=train_tfidf,
                                                        texts=train_clean,
                                                        start=2,
                                                        limit=262,
                                                        step=20)


In [None]:
coherence_values

In [None]:
# Show graph
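limit, start, step = 262, 2, 20
x = list(range(start, limit, step))
# pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.lineplot(x=x, y=coherence_values)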
sns.despine(left=True, bottom=True)
plt.title('Training LDA Coherence Scores', fontsize=30)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score")
plt.show()
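
Besides reading the plot, the candidate with the highest coherence can be picked programmatically. A minimal sketch follows, where `best_num_topics` is a hypothetical helper and the scores are made up rather than results from this notebook.

```python
def best_num_topics(topic_counts, coherence_values):
    """Return the topic count whose model scored the highest coherence."""
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return topic_counts[best_index]

# Made-up scores for illustration only; these are not results from this notebook.
example_counts = [2, 22, 42, 62]
example_scores = [0.31, 0.42, 0.39, 0.40]
print(best_num_topics(example_counts, example_scores))  # 22
```

With the variables defined above, the call would be `best_num_topics(list(range(start, limit, step)), coherence_values)`. The highest score is not always the best choice in practice, since a smaller model with nearly the same coherence is often easier to interpret.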

In [None]:
# LDA on tfidf
%time train_lda = models.LdaMulticore(train_tfidf, num_topics=180, id2word=train_dictionary, passes=2, workers=6)

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(train_lda, train_tfidf, train_dictionary)
vis