# <font color='green'> Topic Modeling </font>

Topic modeling is an unsupervised machine learning technique that scans a set of documents, detects word and phrase patterns within them, and automatically clusters the word groups and similar expressions that best characterize those documents.

## <font color='blue'> Why Do It? </font>
To extract information from a large amount of text:
* You can use this information to build an index and then provide a query interface on top of it
* In effect, your own mini Google search engine

## <font color='blue'> What is Gensim? </font>
Gensim is an open-source Python library for topic modeling and document similarity queries.

## <font color='blue'> Steps involved </font>
* Get a corpus
* Find words in each document of the corpus that are most likely to represent the document (find topics) - store these in a dictionary
* Use this dictionary to convert each document into a vector that represents how relevant each topic is to the document
* This is our model
* Now, you can use it to query topics and find relevant documents (this is one use case).

# <font color='green'> An Example </font>

## <font color='blue'> A Game of Corpora (Step 1) </font>
Get a corpus

In [1]:
# Let's use the abc corpus
import pprint
import nltk
nltk.download('abc')
# Treat each of the first 20 sentences as one document
corpus = nltk.corpus.abc.sents()[:20]

[nltk_data] Downloading package abc to
[nltk_data]     C:\Users\prgzz\AppData\Roaming\nltk_data...
[nltk_data]   Package abc is already up-to-date!


In [14]:
print(corpus[0])

['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', 'The', 'Prime', 'Minister', 'has', 'denied', 'he', 'knew', 'AWB', 'was', 'paying', 'kickbacks', 'to', 'Iraq', 'despite', 'writing', 'to', 'the', 'wheat', 'exporter', 'asking', 'to', 'be', 'kept', 'fully', 'informed', 'on', 'Iraq', 'wheat', 'sales', '.']


## <font color='red'> A Clash Of Words (Step 2) </font>
Prune the corpus

In [13]:
# Remove some common words and punctuation, and lowercase everything
stoplist = set("""for a of the and to in . , $ " ' """.split())
pruned_corpus = [[word.lower() for word in document if word.lower() not in stoplist]
                 for document in corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in pruned_corpus:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
pruned_corpus = [[token for token in text if frequency[token] > 1] for text in pruned_corpus]
print(pruned_corpus[0])

['awb', 'kickbacks', 'prime', 'minister', 'has', 'he', 'knew', 'awb', 'was', 'paying', 'kickbacks', 'iraq', 'despite', 'wheat', 'on', 'iraq', 'wheat', 'sales']


## <font color='indigo'> A Storm Of Vectors (Step 3) </font>
Create a dictionary and convert the corpus to a list of vectors. Gensim will be used now. Install it using: `pip install gensim`

In [15]:
# gensim imports
from gensim import corpora
from gensim import models

# Create a dictionary that maps each remaining word to an integer id
# (in this example, these words play the role of our topics)
dictionary = corpora.Dictionary(pruned_corpus)
# There are 65 of these
pprint.pprint(dictionary.token2id)

{',"': 31,
 '20': 63,
 '2002': 27,
 'about': 36,
 'actually': 55,
 'an': 42,
 'another': 57,
 'at': 32,
 'average': 58,
 'awb': 0,
 'been': 14,
 'but': 37,
 'by': 15,
 'cole': 16,
 'despite': 1,
 'do': 38,
 'email': 43,
 'from': 17,
 'geary': 44,
 'get': 49,
 'government': 23,
 'grain': 47,
 'had': 45,
 'has': 2,
 'have': 18,
 'he': 3,
 'howard': 19,
 'i': 40,
 'inquiry': 20,
 'into': 21,
 'iraq': 4,
 'it': 41,
 'kickbacks': 5,
 'knew': 6,
 'letters': 22,
 'may': 33,
 'minister': 7,
 'mr': 24,
 'much': 59,
 'not': 39,
 'on': 8,
 'one': 25,
 'over': 60,
 'paying': 9,
 'payments': 30,
 'pretty': 56,
 'prices': 50,
 'prime': 10,
 'producer': 54,
 's': 28,
 'said': 34,
 'sales': 11,
 'says': 29,
 'support': 48,
 'that': 46,
 'their': 51,
 'they': 52,
 'think': 53,
 'this': 35,
 'tonne': 64,
 'too': 61,
 'was': 12,
 'wheat': 13,
 'will': 62,
 'with': 26}
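
The mapping above can be pictured with a small pure-Python sketch. `build_token2id` is a hypothetical helper, not part of Gensim, and handing out ids in sorted order within each document is an assumption, chosen because it reproduces the ids printed above for the first document.

```python
def build_token2id(documents):
    """Assign consecutive integer ids to tokens, document by document."""
    token2id = {}
    for document in documents:
        # Assumption: unseen tokens of a document receive ids in sorted order
        for token in sorted(set(document)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# First pruned document, copied from the Step 2 output
first_doc = ['awb', 'kickbacks', 'prime', 'minister', 'has', 'he', 'knew', 'awb',
             'was', 'paying', 'kickbacks', 'iraq', 'despite', 'wheat', 'on',
             'iraq', 'wheat', 'sales']

sketch_ids = build_token2id([first_doc])
print(sketch_ids)
```

Running it on the first document alone yields 14 ids (0 to 13), which line up with the entries for `awb` through `wheat` in the dictionary above.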


In [12]:
# Use this dictionary to convert our pruned corpus into a list of vectors
# Our list of vectors corresponds to a bag-of-words representation
bow_corpus = [dictionary.doc2bow(text) for text in pruned_corpus]
# Each tuple is (word id, number of times that word occurs in this document)
# This is a sparse vector: think of each document as a 65-dimensional vector, where each
# dimension holds the count of the associated word in the document (ids not listed count as 0).
print(bow_corpus[0])

[(0, 2), (1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2)]
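
To make the "65-dimensional vector" picture concrete, here is a minimal sketch that expands the sparse output above into a dense list. `to_dense` is a hypothetical helper written for illustration; the input is copied from the printed output.

```python
def to_dense(bow, num_features):
    """Expand a sparse list of (word id, count) pairs into a dense list of counts."""
    dense = [0] * num_features
    for token_id, count in bow:
        dense[token_id] = count
    return dense


# Sparse bag-of-words for the first document, copied from the output above
first_bow = [(0, 2), (1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1),
             (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2)]

dense_doc = to_dense(first_bow, 65)
print(len(dense_doc))
print(dense_doc[:14])
```

Every position from 14 onward stays at 0, which is why the sparse form is the more economical one to store.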


## <font color='orange'> A Feast For Modeling (Step 4) </font>
Create a model (TF-IDF here)

In [16]:
# Let's create a TF-IDF model out of these vectors

# Model training
tfidf = models.TfidfModel(bow_corpus)
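
For intuition about what the model computes, here is a rough pure-Python sketch of TF-IDF weighting. It assumes the common scheme of raw term count times `log2(N / document frequency)`, followed by scaling each document to unit length; that this matches Gensim's default settings is an assumption, and `tfidf_weights` is a hypothetical helper, not a Gensim function.

```python
import math
from collections import Counter


def tfidf_weights(bow_docs):
    """Return one unit-length TF-IDF vector (as a dict) per bag-of-words document."""
    num_docs = len(bow_docs)
    # Document frequency: in how many documents does each word id occur?
    doc_freq = Counter(token_id for doc in bow_docs for token_id, _ in doc)
    weighted = []
    for doc in bow_docs:
        vec = {t: tf * math.log2(num_docs / doc_freq[t]) for t, tf in doc}
        # Words that occur in every document get weight 0 and are dropped
        vec = {t: w for t, w in vec.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: w / norm for t, w in vec.items()} if norm else {})
    return weighted


# Hypothetical toy corpus: word 0 occurs in every document
toy_bow = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1)]]
toy_weights = tfidf_weights(toy_bow)
print(toy_weights)
```

The point of the weighting is visible in the toy data: word 0 appears everywhere, so it carries no information and disappears, while the rarer words keep all the weight.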

## <font color='brown'> A Dance With Queries (Step 5) </font>
We're done building the model. Let's query it with a topic.

In [28]:
# Create a similarity matrix (an index)
from gensim import similarities
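# This one uses cosine similarity to compare two vectors
# num_features is the vocabulary size (65 here); take it from the dictionary rather than hard-coding it
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))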

# Our query - think of this as a Google search
query = 'iraq minister'.split()
query_bow = dictionary.doc2bow(query)
sims = index[tfidf[query_bow]]

# Print how relevant each document is to the query
for doc_ind, sim in enumerate(sims):
    print("%i: %s" % (doc_ind, sim))

0: 0.38083693
1: 0.14157072
2: 0.19464979
3: 0.19936569
4: 0.0
5: 0.19307509
6: 0.28589374
7: 0.15641421
8: 0.0
9: 0.0
10: 0.0
11: 0.0
12: 0.0
13: 0.0
14: 0.0
15: 0.0
16: 0.0
17: 0.0
18: 0.0
19: 0.0
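
The scores above are cosine similarities. As a minimal sketch of that measure on sparse vectors (stored here as dicts from word id to weight; `cosine` is a hypothetical helper, not the Gensim implementation):

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {word id: weight} dicts."""
    dot = sum(weight * v[token_id] for token_id, weight in u.items() if token_id in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine({0: 1.0}, {0: 2.0})  # vectors point the same way
no_overlap = cosine({0: 1.0}, {1: 1.0})      # no shared word ids
print(same_direction, no_overlap)
```

A document that shares no word with the query has a dot product of zero, which is why so many documents above score exactly 0.0.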


In [27]:
# Let's look at the two most relevant documents (indices 0 and 6 in the scores above)
print(corpus[0])
print(corpus[6])

['PM', 'denies', 'knowledge', 'of', 'AWB', 'kickbacks', 'The', 'Prime', 'Minister', 'has', 'denied', 'he', 'knew', 'AWB', 'was', 'paying', 'kickbacks', 'to', 'Iraq', 'despite', 'writing', 'to', 'the', 'wheat', 'exporter', 'asking', 'to', 'be', 'kept', 'fully', 'informed', 'on', 'Iraq', 'wheat', 'sales', '.']
['But', 'the', 'Prime', 'Minister', 'says', 'letters', 'show', 'he', 'was', 'inquiring', 'about', 'the', 'future', 'of', 'wheat', 'sales', 'in', 'Iraq', 'and', 'do', 'not', 'prove', 'the', 'Government', 'knew', 'of', 'the', 'payments', '.']
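
Rather than reading the best matches off the printed scores by eye, the documents can be ranked. A small sketch, with the scores copied from the Step 5 output (`scores` and `top_two` are illustrative names):

```python
# Similarity scores copied from the Step 5 output, one per document
scores = [0.38083693, 0.14157072, 0.19464979, 0.19936569, 0.0,
          0.19307509, 0.28589374, 0.15641421, 0.0, 0.0,
          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Pair each score with its document index and sort from most to least similar
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
top_two = ranked[:2]
print(top_two)
```

The top two indices are 0 and 6, the documents printed above.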


# <font color='green'> Other Ways To Model </font>

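* We just looked at the **TF-IDF** model, which reweights individual words rather than learning latent topics.
* Other popular approaches, which do learn latent topics: **LSA** (called LSI in Gensim) and **LDA**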
* Link: https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html (**Section: Available transformations**)

# <font color='green'> References </font>
* Based on: https://radimrehurek.com/gensim/auto_examples/ (**the official Gensim tutorials, which include 4 good core guides**)
* A nice reference in general: https://monkeylearn.com/blog/introduction-to-topic-modeling