# Gensim - Topic Modelling for Humans

## 1. Introduction

Topic modelling is a technique for discovering the topics present in a large collection of documents. It helps us understand the hidden semantic structure of text. Gensim is a Python library that provides implementations of topic modelling algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA, also known as Latent Semantic Indexing or LSI), along with other Natural Language Processing features.

This spotlight highlights the idea behind Gensim and gives a brief overview of its features that are most relevant to topic modelling. The code can also serve as a starting template: swap in your own dataset to begin building topic models.

## 2. Idea behind Gensim

In Information Retrieval and Natural Language Processing, a Vector Space Model (VSM) is commonly used to represent documents in a high-dimensional vector space in order to study the similarity between documents, answer queries, or perform other tasks. The larger the collection and its vocabulary, the higher the dimension of the vector space, which makes it challenging to represent the documents in a scalable way. The idea behind Gensim is to represent **large** volumes of text using their underlying topics. This has two main benefits: the representation takes less space, and it reduces noise. **The implementation is based on the fundamental principle of document streaming: data is processed document by document instead of storing the whole corpus in main memory, which also reduces the loading and storing of sparse matrices to and from disk.** For more information, please refer to this paper by the inventors of Gensim: https://radimrehurek.com/lrec2010_final.pdf
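The streaming idea can be illustrated with a plain Python generator. The sketch below is not Gensim's implementation: `stream_documents` and `count_tokens` are hypothetical helpers, and an in-memory `io.StringIO` object stands in for a large corpus file on disk. The point is only that a single document is held in memory at any time.

```python
import io
from collections import Counter

def stream_documents(handle):
    """Yield one tokenized document per line, without reading the whole file."""
    for line in handle:
        tokens = line.lower().split()
        if tokens:
            yield tokens

def count_tokens(doc_stream):
    """Consume the stream once, keeping only running counts in memory."""
    counts = Counter()
    for tokens in doc_stream:
        counts.update(tokens)
    return counts

# An in-memory stand-in for a corpus file with one document per line.
corpus_file = io.StringIO("Trump wins election\nClinton email investigation\nelection fraud\n")
token_counts = count_tokens(stream_documents(corpus_file))
print(token_counts["election"])
```

Because `stream_documents` is a generator, the same pattern works unchanged for a file that is far larger than the available memory.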

## 3. Installation

At the time of writing, Gensim requires Python 2.7 or Python 3.5 and above. Gensim can be obtained in one of the following ways.
1. Anaconda   
    conda install -c conda-forge gensim
2. Pip  
    pip install gensim

Additional packages required for this spotlight are given below.  
1. conda install pattern OR pip install pattern
2. conda install pyldavis OR pip install pyldavis

## 4. Dataset

For the purpose of this article, we will use the **Fake News** dataset, which is available on both Kaggle and Gensim's own data repository. It is a collection of fake news articles that were crawled from different websites. The idea is to apply topic modelling to this dataset to identify the most common topics in the titles of fake news articles. This is interesting, isn't it? More information about the dataset is available at https://github.com/RaRe-Technologies/gensim-data. Now, let's dive into Gensim!

## 5. Data Prep

For any Natural Language Processing or Information Retrieval task, preparing the data is extremely important. The principle is simple: garbage in, garbage out. Let's use some of the preprocessing methods available in Gensim to prepare the data so that we can appreciate the results returned by Gensim.

### 5.1 Reading the Data and Sanity Check

In [2]:
import pandas as pd
import gensim.downloader as api

# Download fake news dataset from Gensim storage
dataset = api.load('fake-news')
data = pd.DataFrame(dataset)

In [3]:
data

|  | uuid | ord_in_thread | author | published | title | text | language | crawled | site_url | country | domain_rank | thread_title | spam_score | main_img_url | replies_count | participants_count | likes | comments | shares | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6a175f46bcd24d39b3e962ad0f29936721db70db | 0 | Barracuda Brigade | 2016-10-26T21:41:00.000+03:00 | Muslims BUSTED: They Stole Millions In Gov’t B... | Print They should pay all the back all the mon... | english | 2016-10-27T01:49:27.168+03:00 | 100percentfedup.com | US | 25689 | Muslims BUSTED: They Stole Millions In Gov’t B... | 0 | http://bb4sp.com/wp-content/uploads/2016/10/Fu... | 0 | 1 | 0 | 0 | 0 | bias |
| 1 | 2bdc29d12605ef9cf3f09f9875040a7113be5d5b | 0 | reasoning with facts | 2016-10-29T08:47:11.259+03:00 | Re: Why Did Attorney General Loretta Lynch Ple... | Why Did Attorney General Loretta Lynch Plead T... | english | 2016-10-29T08:47:11.259+03:00 | 100percentfedup.com | US | 25689 | Re: Why Did Attorney General Loretta Lynch Ple... | 0 | http://bb4sp.com/wp-content/uploads/2016/10/Fu... | 0 | 1 | 0 | 0 | 0 | bias |
| 2 | c70e149fdd53de5e61c29281100b9de0ed268bc3 | 0 | Barracuda Brigade | 2016-10-31T01:41:49.479+02:00 | BREAKING: Weiner Cooperating With FBI On Hilla... | Red State : \nFox News Sunday reported this mo... | english | 2016-10-31T01:41:49.479+02:00 | 100percentfedup.com | US | 25689 | BREAKING: Weiner Cooperating With FBI On Hilla... | 0 | http://bb4sp.com/wp-content/uploads/2016/10/Fu... | 0 | 1 | 0 | 0 | 0 | bias |
| 3 | 7cf7c15731ac2a116dd7f629bd57ea468ed70284 | 0 | Fed Up | 2016-11-01T05:22:00.000+02:00 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | Email Kayla Mueller was a prisoner and torture... | english | 2016-11-01T15:46:26.304+02:00 | 100percentfedup.com | US | 25689 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | 0.068 | http://100percentfedup.com/wp-content/uploads/... | 0 | 0 | 0 | 0 | 0 | bias |
| 4 | 0206b54719c7e241ffe0ad4315b808290dbe6c0f | 0 | Fed Up | 2016-11-01T21:56:00.000+02:00 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ... | english | 2016-11-01T23:59:42.266+02:00 | 100percentfedup.com | US | 25689 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | 0.865 | http://100percentfedup.com/wp-content/uploads/... | 0 | 0 | 0 | 0 | 0 | bias |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12994 | f1b5d0e44803f48732bde854a9fdf95837219b12 | 2 | replaceme | 2016-10-26T23:58:00.000+03:00 |  | It DOES allow you to put a dog face on top of ... | english | 2016-10-27T00:37:46.194+03:00 | zerohedge.com | US | 2435 | Snapchat To Raise Up To $4 Billion In IPO, Val... | 0 |  | 40 | 32 | 0 | 0 | 0 | bs |
| 12995 | 36011ceba3647e1bea78299b68b6fb705a1fc1ad | 3 | Freedumb | 2016-10-27T00:02:00.000+03:00 |  | Wait till you see what happens to the valuatio... | english | 2016-10-27T00:37:46.220+03:00 | zerohedge.com | US | 2435 | Snapchat To Raise Up To $4 Billion In IPO, Val... | 0 |  | 40 | 32 | 0 | 0 | 0 | bs |
| 12996 | 6995d1aa9ac99926106489b14b5530e85358059a | 4 | major major maj... | 2016-10-27T00:06:00.000+03:00 |  | I'm waiting for the one that puts a pussy on m... | english | 2016-10-27T00:37:46.244+03:00 | zerohedge.com | US | 2435 | Snapchat To Raise Up To $4 Billion In IPO, Val... | 0 |  | 40 | 32 | 0 | 0 | 0 | bs |
| 12997 | 7de8ae90eee164eb756db6c8a3772288e11d7a94 | 5 | beemasters | 2016-10-27T00:09:00.000+03:00 |  | $4 Billion even after they are known to be kee... | english | 2016-10-27T00:37:46.247+03:00 | zerohedge.com | US | 2435 | Snapchat To Raise Up To $4 Billion In IPO, Val... | 0 |  | 40 | 32 | 0 | 0 | 0 | bs |


In [4]:
# Check how many records are null
data.isnull().sum()

uuid                  0
ord_in_thread         0
author                0
published             0
title                 0
text                  0
language              0
crawled               0
site_url              0
country               0
domain_rank           0
thread_title          0
spam_score            0
main_img_url          0
replies_count         0
participants_count    0
likes                 0
comments              0
shares                0
type                  0
dtype: int64

### 5.2 Basic Preprocessing

The preprocessing module in Gensim provides a suite of methods for parsing and preprocessing strings. The methods demonstrated in this spotlight include:  
1. strip_tags - Remove HTML tags
2. strip_punctuation - Replace punctuation characters with spaces
3. remove_stopwords - Remove stopwords
4. strip_multiple_whitespaces - Collapse repeated whitespace
5. strip_numeric - Remove digits
6. strip_short - Remove words shorter than a given length, default = 3

Gensim allows us to create a pipeline of these processing tasks and apply them using **preprocess_string** instead of individually calling these methods on the data.
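Conceptually, such a pipeline passes the text through each filter in order and then splits the result on whitespace. The sketch below mimics that behaviour with the standard library only. The filters `drop_punctuation` and `drop_short` and the helper `apply_filters` are hypothetical, simplified stand-ins, not Gensim's own functions.

```python
import re

def drop_punctuation(text):
    """Stand-in filter: replace every non-word, non-space character with a space."""
    return re.sub(r"[^\w\s]", " ", text)

def drop_short(text, minsize=3):
    """Stand-in filter: keep only words with at least `minsize` characters."""
    return " ".join(word for word in text.split() if len(word) >= minsize)

def apply_filters(text, filters):
    """Apply each filter in order, then split the cleaned string into tokens."""
    for text_filter in filters:
        text = text_filter(text)
    return text.split()

example_tokens = apply_filters("BREAKING: FBI is on it!", [str.lower, drop_punctuation, drop_short])
print(example_tokens)
```

The order of the filters matters: lowercasing first means that later filters, such as a stopword filter, only need to handle lowercase text.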

In [5]:
from gensim.parsing.preprocessing import (
    preprocess_string,
    remove_stopwords,
    strip_multiple_whitespaces,
    strip_numeric,
    strip_punctuation,
    strip_short,
    strip_tags,
)
from gensim.utils import tokenize

# Create a pipeline of preprocessing tasks; the filters are applied in the order listed
custom_filters = [lambda x: x.lower(), strip_tags, strip_punctuation, remove_stopwords,
                  strip_multiple_whitespaces, strip_numeric, strip_short]

# Tokenize each news title
tokens = [tokenize(str(row)) for row in data['title']]

# Execute the preprocessing pipeline
clean_data = [preprocess_string(" ".join(row), custom_filters) for row in tokens]


Now let's check what the data looks like after preprocessing.

In [6]:
clean_data

[['muslims', 'busted', 'stole', 'millions', 'gov', 'benefits'],
 ['attorney', 'general', 'loretta', 'lynch', 'plead', 'fifth'],
 ['breaking',
  'weiner',
  'cooperating',
  'fbi',
  'hillary',
  'email',
  'investigation'],
 ['pin',
  'drop',
  'speech',
  'father',
  'daughter',
  'kidnapped',
  'killed',
  'isis',
  'voted',
  'donald',
  'trump',
  'percentfedup',
  'com'],
 ['fantastic',
  'trump',
  'point',
  'plan',
  'reform',
  'healthcare',
  'begins',
  'bombshell',
  'percentfedup',
  'com'],
 ['hillary', 'goes', 'absolutely', 'berserk', 'protester', 'rally', 'video'],
 ['breaking',
  'nypd',
  'ready',
  'arrests',
  'weiner',
  'case',
  'hillary',
  'visited',
  'pedophile',
  'island',
  'times',
  'money',
  'laundering',
  'underage',
  'sex',
  'pay',
  'play',
  'proof',
  'inappropriate',
  'handling',
  'classified',
  'information',
  'percentfedup',
  'com'],
 ['wow',
  'whistleblower',
  'tells',
  'chilling',
  'story',
  'massive',
  'voter',
  'fraud',
  'tr

### 5.3 Bigrams  

Certain words frequently occur together in a sentence. Let's build a simple bigram model to account for these. The same framework can be extended to trigrams and higher-order n-grams.

In [8]:
from gensim.models.phrases import Phrases, Phraser

# Train a bigram model.
# min_count is the minimum number of occurrences of a pair of words in the corpus
# for the pair to be considered as a bigram.

bigram = Phrases(clean_data, min_count=5)
bigram_mod = Phraser(bigram)
bigram = [bigram_mod[title] for title in clean_data]
bigram

[['muslims', 'busted', 'stole', 'millions', 'gov', 'benefits'],
 ['attorney_general', 'loretta_lynch', 'plead', 'fifth'],
 ['breaking',
  'weiner',
  'cooperating',
  'fbi',
  'hillary',
  'email_investigation'],
 ['pin',
  'drop',
  'speech',
  'father',
  'daughter',
  'kidnapped',
  'killed',
  'isis',
  'voted',
  'donald_trump',
  'percentfedup_com'],
 ['fantastic',
  'trump',
  'point',
  'plan',
  'reform',
  'healthcare',
  'begins',
  'bombshell',
  'percentfedup_com'],
 ['hillary', 'goes', 'absolutely', 'berserk', 'protester', 'rally', 'video'],
 ['breaking',
  'nypd',
  'ready',
  'arrests',
  'weiner',
  'case',
  'hillary',
  'visited',
  'pedophile',
  'island',
  'times',
  'money',
  'laundering',
  'underage',
  'sex',
  'pay_play',
  'proof',
  'inappropriate',
  'handling',
  'classified_information',
  'percentfedup_com'],
 ['wow',
  'whistleblower',
  'tells',
  'chilling',
  'story',
  'massive_voter',
  'fraud',
  'trump',
  'campaign',
  'readies',
  'lawsuit',


Observe the bigrams in the above output!  
attorney general -> attorney_general  
loretta lynch -> loretta_lynch  
donald trump -> donald_trump
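To give a feel for how a pair of words qualifies as a bigram, the sketch below scores adjacent word pairs with the formula that, to my understanding, Gensim's default scorer is based on: (pair count - min_count) * vocabulary size / (count of first word * count of second word). The function `bigram_scores` and the toy titles are hypothetical, and Gensim's own bookkeeping differs in details such as what counts towards the vocabulary size and the threshold a score must exceed.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs: (pair_count - min_count) * vocab_size / (count_a * count_b)."""
    unigrams = Counter(word for sent in sentences for word in sent)
    pairs = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in pairs.items()
    }

# Made-up tokenized titles, used only for illustration.
toy_titles = [
    ["donald", "trump", "rally"],
    ["donald", "trump", "speech"],
    ["trump", "rally", "video"],
    ["donald", "trump", "video"],
]
scores = bigram_scores(toy_titles, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

Pairs that occur together often relative to how often each word occurs on its own receive the highest scores, which is why names such as "donald trump" are merged first.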

### 5.4 Lemmatization  

Lemmatization converts a word to its base form by taking the context into account. Gensim uses the pattern package to perform lemmatization. By default, only words tagged as NN (noun), VB (verb), JJ (adjective) or RB (adverb) are kept. You can filter the lemmas by passing the required parts of speech as a parameter to the lemmatize method. Note that, to my knowledge, lemmatize was removed in Gensim 4.0, so the cells below assume an older Gensim release.  

For each word, the method returns the base form and the part of speech. We will discard the part of speech as we don't require it.
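The list comprehension in the next cell decodes each returned item and splits it on "/", which reflects a byte-string format of the form lemma/POS. The sketch below isolates that parsing step. The helper `strip_pos_tags` is hypothetical and the sample values are made up for illustration rather than produced by Gensim.

```python
# Hypothetical sample in the lemma/POS byte-string format described above.
sample_output = [b"muslim/NN", b"steal/VB", b"million/NN"]

def strip_pos_tags(tagged_words):
    """Decode each byte string and keep only the part before the slash."""
    return [word.decode("utf-8").split("/")[0] for word in tagged_words]

lemmas = strip_pos_tags(sample_output)
print(lemmas)
```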

In [14]:
from gensim.utils import lemmatize

lemmatized_data = [[wd.decode('utf-8').split('/')[0] for wd in lemmatize(" ".join(title))] for title in bigram]

In [15]:
lemmatized_data

[['muslim', 'busted', 'steal', 'million', 'gov', 'benefit'],
 ['loretta_lynch', 'plead', 'fifth'],
 ['breaking', 'weiner', 'cooperate', 'fbi', 'hillary'],
 ['pin',
  'drop',
  'speech',
  'father',
  'daughter',
  'kidnap',
  'killed',
  'isis',
  'vote',
  'donald_trump'],
 ['fantastic',
  'trump',
  'point',
  'plan',
  'reform',
  'healthcare',
  'begin',
  'bombshell'],
 ['hillary', 'go', 'absolutely', 'berserk', 'protester', 'rally', 'video'],
 ['breaking',
  'nypd',
  'ready',
  'arrest',
  'weiner',
  'case',
  'hillary',
  'visit',
  'pedophile',
  'island',
  'time',
  'money',
  'laundering',
  'underage',
  'sex',
  'pay_play',
  'proof',
  'inappropriate',
  'handling'],
 ['whistleblower',
  'tell',
  'chill',
  'story',
  'massive_voter',
  'fraud',
  'trump',
  'campaign',
  'ready',
  'lawsuit',
  'sec',
  'election',
  'critical',
  'district',
  'com'],
 ['break',
  'clinton',
  'clear',
  'coordinate',
  'minute',
  'trick',
  'energize',
  'hillary',
  'base'],
 ['ev

## 6. Mapping Strings

To train an LDA model on the given dataset, we convert the strings to a "bag-of-words" document representation. Gensim represents each article as a vector, and each element of the vector represents a question-answer pair. For example, in the bag-of-words approach, the question is "How many times does this word occur in the document?" and the answer might be "twice". Gensim refers to these questions by unique integer ids. Therefore, before applying the doc2bow method, which converts a tokenized text to this representation, we must map each word to an id using a dictionary object.

In [16]:
from gensim import corpora

# Map each word to a unique id
dictionary = corpora.Dictionary(lemmatized_data)


In [17]:
dictionary.token2id

{'benefit': 0,
 'busted': 1,
 'gov': 2,
 'million': 3,
 'muslim': 4,
 'steal': 5,
 'fifth': 6,
 'loretta_lynch': 7,
 'plead': 8,
 'breaking': 9,
 'cooperate': 10,
 'fbi': 11,
 'hillary': 12,
 'weiner': 13,
 'daughter': 14,
 'donald_trump': 15,
 'drop': 16,
 'father': 17,
 'isis': 18,
 'kidnap': 19,
 'killed': 20,
 'pin': 21,
 'speech': 22,
 'vote': 23,
 'begin': 24,
 'bombshell': 25,
 'fantastic': 26,
 'healthcare': 27,
 'plan': 28,
 'point': 29,
 'reform': 30,
 'trump': 31,
 'absolutely': 32,
 'berserk': 33,
 'go': 34,
 'protester': 35,
 'rally': 36,
 'video': 37,
 'arrest': 38,
 'case': 39,
 'handling': 40,
 'inappropriate': 41,
 'island': 42,
 'laundering': 43,
 'money': 44,
 'nypd': 45,
 'pay_play': 46,
 'pedophile': 47,
 'proof': 48,
 'ready': 49,
 'sex': 50,
 'time': 51,
 'underage': 52,
 'visit': 53,
 'campaign': 54,
 'chill': 55,
 'com': 56,
 'critical': 57,
 'district': 58,
 'election': 59,
 'fraud': 60,
 'lawsuit': 61,
 'massive_voter': 62,
 'sec': 63,
 'story': 64,
 'tell': 

In [18]:
# Bag of words
tf = [dictionary.doc2bow(text) for text in lemmatized_data]
tf


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
 [(6, 1), (7, 1), (8, 1)],
 [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1)],
 [(14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1)],
 [(24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)],
 [(12, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1)],
 [(9, 1),
  (12, 1),
  (13, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1)],
 [(31, 1),
  (49, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1)],
 [(12, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1)],
 [(12, 1),
  (31, 2),
  (56, 1),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1),
  (82, 1),
  (83, 1),
  (84, 1)],
 [(1

The above output gives the number of times each word occurs in each article. For example, (0, 1) in the first article means that the word with id=0 (the word "benefit" in the dictionary object) occurs once.
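To make the id mapping and the counting concrete, here is a simplified pure-Python stand-in for the two steps. The helpers `build_token_ids` and `to_bag_of_words` are hypothetical and are not Gensim's implementation. New tokens are assigned ids document by document in sorted order, mirroring the ordering seen in the token2id output above.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign ids to unseen tokens, document by document, in sorted order within each document."""
    token2id = {}
    for doc in documents:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the mapping."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

# The first two lemmatized titles from the output shown earlier.
toy_docs = [
    ["muslim", "busted", "steal", "million", "gov", "benefit"],
    ["loretta_lynch", "plead", "fifth"],
]
toy_ids = build_token_ids(toy_docs)
toy_bow = [to_bag_of_words(doc, toy_ids) for doc in toy_docs]
print(toy_bow)
```

Tokens that are missing from the mapping are silently skipped in this sketch, so a new document can be converted without changing the existing ids.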

## 7. Latent Dirichlet Allocation (LDA)

Now that our data is clean and represented as bags of words, let's train a probabilistic topic model using LDA. Gensim provides two interfaces for training an LDA model:
1. LdaModel
2. LdaMulticore  


The main difference between these two is the speed of training. LdaMulticore uses multiple CPU cores to parallelize and speed up model training.  


### 7.1 Model Training

LDA in Gensim allows various parameters that can be tuned to obtain better results. For the purpose of this demo, we will be using a few basic parameters which include:
1. num_topics - The number of topics to be extracted from the data.
2. passes - The number of passes through the corpus during training.
3. per_word_topics - If True, the model also computes, for each word, the list of its most likely topics together with their weights.

In [19]:
import gensim

# Train the model
lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=tf, id2word=dictionary, num_topics=5, random_state=100, \
                                                    passes=10, per_word_topics=True)

In [20]:
import pprint

pp = pprint.PrettyPrinter(indent=4)
# List the topics and words
pp.pprint(lda_model.print_topics())

[   (   0,
        '0.020*"hillary" + 0.015*"trump" + 0.009*"hillary_clinton" + '
        '0.008*"clinton" + 0.007*"obama" + 0.007*"video" + 0.007*"election" + '
        '0.006*"russia" + 0.005*"world" + 0.005*"russian"'),
    (   1,
        '0.030*"trump" + 0.015*"new" + 0.007*"russia" + 0.007*"comment" + '
        '0.004*"video" + 0.004*"say" + 0.004*"hillary_clinton" + 0.004*"man" + '
        '0.004*"clinton" + 0.004*"putin"'),
    (   2,
        '0.009*"election" + 0.008*"trump" + 0.008*"hillary" + 0.006*"vote" + '
        '0.006*"video" + 0.005*"black" + 0.004*"donald_trump" + 0.004*"russia" '
        '+ 0.004*"comment" + 0.004*"day"'),
    (   3,
        '0.015*"clinton" + 0.011*"hillary" + 0.006*"trump" + 0.005*"comment" + '
        '0.005*"fbi" + 0.005*"election" + 0.005*"video" + 0.005*"say" + '
        '0.005*"get" + 0.004*"hillary_clinton"'),
    (   4,
        '0.005*"war" + 0.005*"trump" + 0.004*"year" + 0.004*"veteran" + '
        '0.003*"donald_trump" + 0.003*"muslim" + 

The above output shows the words that contribute to each topic along with their contribution weight to that topic. For example, in topic 0, the highest contribution is by the word "hillary" as it has the highest weight of 0.020.

### 7.2 Model Evaluation

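One way of evaluating our topic model is by eyeballing the top N words in each topic and inferring the topic from these words. A more systematic way is to use intrinsic metrics such as perplexity and coherence. Perplexity captures how well a model predicts unseen data and is derived from the normalized log-likelihood of a held-out test set. Note that log_perplexity returns the per-word likelihood bound rather than the perplexity itself, which is why the value printed below is negative, and that this demo computes it on the training corpus rather than on a held-out set. A coherence measure scores a single topic by measuring the degree of semantic similarity between the high-scoring words in the topic.

The lower the perplexity (equivalently, the closer the bound is to zero), the smaller the error on unseen data and the better the model.  
The higher the coherence value, the better the model.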

In [21]:
from gensim.models import CoherenceModel

print('Perplexity: ', lda_model.log_perplexity(tf))

coherence_model_lda = CoherenceModel(model=lda_model, texts=lemmatized_data, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)


Perplexity:  -8.904785856162505
Coherence Score:  0.3018179591219838
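The value labelled "Perplexity" above is negative because, to my understanding, log_perplexity returns a per-word likelihood bound, and Gensim's own log messages derive the perplexity estimate as 2 raised to the negative of that bound. Under that assumption, the conversion is a one-liner. The helper `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)

reported_bound = -8.904785856162505  # value printed by the cell above
perplexity_estimate = bound_to_perplexity(reported_bound)
print(f"Perplexity estimate: {perplexity_estimate:.1f}")
```

A bound closer to zero therefore corresponds to a lower perplexity.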


### 7.3 Model Visualization

pyLDAvis integrated with gensim provides a simple tool for visualizing the topic distribution. This is shown below.

In [22]:
import pyLDAvis
import pyLDAvis.gensim
import warnings
warnings.filterwarnings('ignore')

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, tf, dictionary)
vis


This is an interactive chart that helps you visualize the topics and the words that contribute to each topic. The circles on the left represent topics and the size of the circle indicates the importance of the topic. Scattered circles indicate a good topic model as the overlap between the words in multiple topics will be less. When you hover over a circle, the word distribution on the right updates accordingly. 

### 7.4 Using the Model to Answer Questions

Let's consider the subject line of the first article.  
What is the word to id mapping for all the words in this text?  
What is the topic distribution for this text?  
What topics do the words belong to?  
What is the probability that these words belong to those topics?  

These four questions help us understand the structure of the results returned by the model and can be used further to identify the dominant topics.

In [23]:
# Query the model once for the first title; with per_word_topics=True it returns three lists
doc_topics, word_topics, phi_values = lda_model[tf[0]]

print("Sentence 0: word to word_id mapping")
pp.pprint([(dictionary[word_id], word_id) for word_id, _ in tf[0]])
print("Sentence 0: topic distribution for the document")
pp.pprint(doc_topics)
print("Sentence 0: topics to which each word belongs to")
pp.pprint(word_topics)
print("Sentence 0: probability of the word belonging to that topic")
pp.pprint(phi_values)


Sentence 0: word to word_id mapping
[   ('benefit', 0),
    ('busted', 1),
    ('gov', 2),
    ('million', 3),
    ('muslim', 4),
    ('steal', 5)]
Sentence 0: topic distribution for the document
[   (0, 0.02886411),
    (1, 0.028912798),
    (2, 0.8833071),
    (3, 0.029377636),
    (4, 0.029538406)]
Sentence 0: topics to which each word belongs to
[(0, [2]), (1, [2]), (2, [2]), (3, [2]), (4, [2]), (5, [2])]
Sentence 0: probability of the word belonging to that topic
[   (0, [(2, 0.998628)]),
    (1, [(2, 0.9979007)]),
    (2, [(2, 0.99533165)]),
    (3, [(2, 0.9980914)]),
    (4, [(2, 0.9922692)]),
    (5, [(2, 0.99900335)])]


What is the dominant topic for each title?

In [24]:
def dominant_topic():
    t = []
    d = []
    for row in lda_model[tf]:
        distribution = row[0]
        distribution = sorted(distribution, key=lambda x: (x[1]), reverse=True)
        t.append(distribution[0][0])
        d.append(distribution[0][1])
    return t, d

data["dominant_topic"], data["topic_distr"] = dominant_topic()
data[["title", "dominant_topic", "topic_distr"]]


|  | title | dominant_topic | topic_distr |
| --- | --- | --- | --- |
| 0 | Muslims BUSTED: They Stole Millions In Gov’t B... | 2 | 0.883324 |
| 1 | Re: Why Did Attorney General Loretta Lynch Ple... | 0 | 0.795077 |
| 2 | BREAKING: Weiner Cooperating With FBI On Hilla... | 0 | 0.865264 |
| 3 | PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe... | 4 | 0.925329 |
| 4 | FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal... | 0 | 0.734122 |
| ... | ... | ... | ... |
| 12994 |  | 0 | 0.200000 |
| 12995 |  | 0 | 0.200000 |
| 12996 |  | 0 | 0.200000 |
| 12997 |  | 0 | 0.200000 |
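The rows with empty titles in the output above receive a score of 0.2, which is what a uniform distribution over five topics looks like, so the dominant topic of 0 reported for them is arbitrary. One possible way to handle this, sketched below with a hypothetical helper `pick_dominant_topic`, is to report no dominant topic when all topics are equally likely. The tolerance value is an assumption chosen for illustration.

```python
import math

def pick_dominant_topic(distribution, tolerance=1e-6):
    """Return (topic_id, probability), or None when all topics are equally likely."""
    probabilities = [prob for _, prob in distribution]
    if all(math.isclose(p, probabilities[0], abs_tol=tolerance) for p in probabilities):
        return None  # no topic stands out, e.g. for an empty title
    return max(distribution, key=lambda pair: pair[1])

# Topic distribution of the first title, copied from the output shown earlier.
first_title = [(0, 0.02886411), (1, 0.028912798), (2, 0.8833071), (3, 0.029377636), (4, 0.029538406)]
# Uniform distribution, as reported for the rows with empty titles.
empty_title = [(topic_id, 0.2) for topic_id in range(5)]

print(pick_dominant_topic(first_title))
print(pick_dominant_topic(empty_title))
```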


What are the top titles corresponding to each dominant topic?

In [26]:
data.loc[data.groupby(["dominant_topic"], sort=False)["topic_distr"].idxmax()][["title", "dominant_topic", "topic_distr"]]


|  | title | dominant_topic | topic_distr |
| --- | --- | --- | --- |
| 1648 | Comment on 10 Nutritious Foods That Help Reduc... | 2 | 0.961175 |
| 1712 | Comment on Breaking: CDC Blocks Testimony of T... | 0 | 0.966181 |
| 12203 | Veterans Day is typical .01% rogue state inver... | 4 | 0.971743 |
| 3220 | Comment on Guess what happened in Chiraq this ... | 1 | 0.961066 |
| 1646 | Comment on The Science is “Overwhelming At Thi... | 3 | 0.966373 |


## 8. Latent Semantic Indexing (LSI)

LSI is another algorithm for topic modelling. It is based on Singular Value Decomposition (SVD) and ranks the words in each topic by weight.

### 8.1 Model Training

Similar to LDA, there are various parameters in LSI that can be tuned for better performance. For this spotlight, a basic combination is used that sets the number of topics (num_topics) and the number of power iterations (power_iters).

In [30]:
lsi_model = gensim.models.LsiModel(corpus=tf, id2word=dictionary, num_topics=5, onepass=False, power_iters=10)

In [31]:
pp.pprint(lsi_model.print_topics())

[   (   0,
        '0.866*"trump" + 0.288*"hillary" + 0.182*"clinton" + 0.151*"video" + '
        '0.095*"new" + 0.084*"election" + 0.072*"fbi" + 0.069*"say" + '
        '0.068*"obama" + 0.064*"vote"'),
    (   1,
        '0.839*"hillary" + -0.411*"trump" + 0.166*"clinton" + 0.144*"fbi" + '
        '0.098*"email" + 0.075*"wikileaks" + 0.075*"comment" + 0.075*"new" + '
        '0.060*"video" + 0.059*"investigation"'),
    (   2,
        '-0.797*"clinton" + 0.373*"hillary" + -0.198*"new" + -0.186*"fbi" + '
        '0.171*"trump" + -0.139*"email" + -0.133*"comment" + '
        '-0.118*"hillary_clinton" + -0.094*"election" + -0.078*"russia"'),
    (   3,
        '0.506*"video" + 0.448*"election" + -0.378*"clinton" + '
        '0.350*"hillary_clinton" + 0.229*"comment" + 0.202*"new" + '
        '-0.149*"hillary" + -0.143*"trump" + 0.138*"donald_trump" + '
        '0.120*"russia"'),
    (   4,
        '-0.797*"video" + 0.398*"election" + 0.309*"hillary_clinton" + '
        '0.224*"comment" +

In the topics and weights listed above, we see that LSI weights can be negative and that the words of each topic are sorted in descending order of the absolute value of their weights.
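Ranking by absolute value is easy to reproduce by hand. The sketch below takes a few word and weight pairs from topic 1 of the output above, shuffles them deliberately, and sorts them by magnitude. The variable names are illustrative only.

```python
# (word, weight) pairs taken from topic 1 of the LSI output above, deliberately shuffled.
topic_terms = [("clinton", 0.166), ("trump", -0.411), ("email", 0.098), ("hillary", 0.839), ("fbi", 0.144)]

# Sort by magnitude so that strongly negative weights rank next to strongly positive ones.
ranked_terms = sorted(topic_terms, key=lambda pair: abs(pair[1]), reverse=True)
print(ranked_terms)
```

The sign is kept in the result, so a word such as "trump" still shows that it pulls in the opposite direction to "hillary" within this topic.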

### 8.2 Model Evaluation

In [32]:
coherence_model_lsi = CoherenceModel(model=lsi_model, texts=lemmatized_data, dictionary=dictionary)
coherence_lsi = coherence_model_lsi.get_coherence()
print('Coherence Score: ', coherence_lsi)


Coherence Score:  0.2939126990601599


Between LDA and LSI, the one with the higher coherence score is the more suitable model for the dataset. Here, LDA (0.302) scores slightly higher than LSI (0.294).

## 9. Word Embeddings

Besides topic modelling, Gensim is also known for its implementation of word embeddings. Two popular word embedding models, word2vec and fasttext, are demonstrated in this spotlight.

### 9.1 Training Word Embeddings

Let's train simple models for both Word2Vec and Fasttext. The performances of these models can be improved by tuning the parameters of these methods. The focus of this spotlight is on the fundamental differences between these two word embedding techniques.

In [33]:
w2v_model = gensim.models.Word2Vec(lemmatized_data, min_count=1, batch_words=10)

In [34]:
from gensim.models.fasttext import FastText as FT_gensim

ft_model = FT_gensim(batch_words=10)
ft_model.build_vocab(lemmatized_data)
ft_model.train(lemmatized_data, epochs=ft_model.epochs, total_examples=ft_model.corpus_count, total_words=ft_model.corpus_total_words)

### 9.2 Word Vectors

The model builds a vector for each word. Given below is an example of the vector representation for the word "trump".

In [35]:
w2v_model.wv["trump"]

array([ 0.03704221,  0.26223484,  0.5330106 , -0.14901142, -0.10175788,
        0.08923507,  0.1228729 , -0.15682848,  0.19681828, -0.00817361,
        0.14887741, -0.05675298,  0.13274507,  0.20282826,  0.25858253,
       -0.41782632,  0.5389828 ,  0.13513501, -0.06769281, -0.41418415,
       -0.19506016, -0.04955445,  0.1302631 ,  0.02961428, -0.45758066,
        0.14477146, -0.28899857, -0.2873996 , -0.2217404 ,  0.16677028,
       -0.05388457,  0.49362534,  0.18638776, -0.29804763, -0.19510925,
        0.12890863, -0.3368247 ,  0.12754917, -0.31146038, -0.18916108,
       -0.24787469,  0.34011048,  0.15712136, -0.09241911,  0.39905995,
        0.30940917,  0.14408623, -0.02301322, -0.08759511,  0.14877196,
        0.12030753, -0.4000204 , -0.1602061 ,  0.01488817,  0.1548028 ,
        0.04141129, -0.4412525 , -0.00731504, -0.1314852 ,  0.03975075,
        0.29879013, -0.09214485,  0.10002936, -0.2599106 , -0.06259269,
        0.33538496,  0.02545279, -0.14477026,  0.1261843 ,  0.04

In [36]:
ft_model.wv["trump"]

array([ 0.24225868, -0.8030525 ,  0.4455241 , -0.19918093, -0.7206647 ,
        0.64833   , -0.15629227, -0.628177  , -0.47809982, -0.41554454,
       -0.41788575, -0.10848226, -0.42819637, -0.48572192, -0.89252186,
        0.72376674,  0.7838862 ,  0.10343214,  1.0110719 ,  0.9133532 ,
       -0.17534487,  0.3811551 ,  0.02586271,  0.20308414, -0.2893501 ,
        0.15144713,  0.7427336 ,  0.48654917,  0.4766399 , -0.23880534,
        0.77245235, -0.99571526, -0.33386505, -0.18729465, -0.5720045 ,
       -0.25380906,  0.4046728 , -0.305097  ,  0.26856214,  0.3186764 ,
        0.05690456,  0.05181462,  0.02231599, -0.27731642, -0.23136793,
        0.44405043,  0.4749755 ,  0.50868213,  0.15517242,  0.06556841,
        0.3528776 ,  0.32602713, -0.9625345 , -0.60101986, -0.8019427 ,
       -0.9866762 ,  0.9891996 ,  0.26641473, -0.25233093,  0.60967803,
        0.9936777 ,  0.008481  , -0.5573574 , -0.37820923,  0.58550274,
        0.28202236,  0.46265486, -0.73450816,  0.30136672, -0.67

### 9.3 Word Similarity

From the results of LDA, we know what words define a topic. Now, we will consider one such word from LDA and find words that are similar to it. Let's take the word "election" from topic 0.

In [37]:
w2v_model.wv.most_similar("election")

[('trump', 0.999671459197998),
 ('hillary', 0.9996004104614258),
 ('world', 0.9995781183242798),
 ('video', 0.9995728731155396),
 ('comment', 0.9995511770248413),
 ('clinton', 0.9995421767234802),
 ('russia', 0.9995238780975342),
 ('say', 0.9995136857032776),
 ('new', 0.9995099306106567),
 ('obama', 0.9994674324989319)]

In [38]:
ft_model.wv.most_similar("election")

[('rig_election', 0.999997615814209),
 ('stolen_election', 0.9999969601631165),
 ('days_election', 0.9999967217445374),
 ('post_election', 0.9999967217445374),
 ('collection', 0.9999964237213135),
 ('stake_election', 0.9999962449073792),
 ('election_night', 0.9999960660934448),
 ('rigged_election', 0.9999960660934448),
 ('steal_election', 0.9999960660934448),
 ('election_fraud', 0.999995231628418)]

The results differ significantly between word2vec and fasttext. Word2vec learns one vector per word purely from the contexts in which the word appears, whereas fasttext additionally treats each word as an aggregation of its subwords (character n-grams). Therefore, with fasttext you see words that are close to "election" in spelling, such as "collection".
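The subword idea can be illustrated without training a model. The sketch below extracts character n-grams of length 3 to 6 from a word wrapped in boundary markers, which matches fasttext's usual defaults as far as I know. The helpers `char_ngrams` and `ngram_overlap` are hypothetical, and the Jaccard overlap is only an illustrative measure, not the similarity that fasttext computes.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `<word>`, including boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

def ngram_overlap(word_a, word_b):
    """Jaccard overlap between the n-gram sets of two words (0 = disjoint, 1 = identical)."""
    grams_a, grams_b = char_ngrams(word_a), char_ngrams(word_b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(ngram_overlap("election", "collection"))
print(ngram_overlap("election", "trump"))
```

Words that share long stretches of characters share many n-grams, and since their vectors are built from the same pieces they tend to end up close together.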

In [39]:
w2v_model.wv.similarity(w1="calibration", w2="election")

0.21384692

In [40]:
ft_model.wv.similarity(w1="calibration", w2="election")

0.9999904

In [41]:
w2v_model.wv.doesnt_match(["election", "infection", "destruction"])

'infection'

In [42]:
ft_model.wv.doesnt_match(["election", "infection", "destruction"])

'destruction'

Gensim's similarity function computes the similarity score between two words using cosine similarity. Semantically, "calibration" and "election" are unrelated. Fasttext nevertheless computes a high similarity score because the two words share subwords, for example the ending "tion".
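For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements it with the standard library. The function name and the three-dimensional vectors are made up for illustration and are unrelated to the trained models.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

# Made-up three-dimensional vectors, used only to illustrate the formula.
parallel_score = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal_score = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(parallel_score, orthogonal_score)
```

Because the score depends only on the angle between the vectors, two vectors pointing in the same direction score close to 1 regardless of their lengths.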

The doesnt_match method, used in the cells above, determines the word that matches the least with the other words in a list.

In [43]:
"trumps" in w2v_model.wv.vocab

False

In [44]:
"trumps" in ft_model.wv.vocab

False

In [45]:
ft_model.wv.similarity(w1="trumps", w2="trump")

0.9999936

We can use the membership operator "in" to check whether a word exists in the vocabulary. The word "trumps" does not exist in either vocabulary. If we try to compute the similarity between "trump" and "trumps" using the word2vec model, it raises a KeyError because its vocabulary doesn't contain the word "trumps". Fasttext, however, builds a vector for the unseen word from its subwords, computes the similarity, and even assigns a high score because "trump" and "trumps" share most of their character n-grams.

## 10. Conclusion

We began this spotlight with an introduction to Gensim and moved on to text preprocessing in Gensim. We focused on topic modelling and covered several aspects of it. We also discovered word embeddings, another important feature in Gensim. I hope this spotlight serves as a beginner's guide and a template to use Gensim for your IR and NLP tasks. 

## 11. Gensim Documentation

This spotlight was written by referring to Gensim's official documentation which can be found at: https://radimrehurek.com/gensim/auto_examples/index.html