### Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text corpus. It uncovers hidden patterns among the words of the corpus in an unsupervised manner. A topic is defined as “a repeating pattern of co-occurring terms in a corpus”, so topic modeling can be described as a method for finding the groups of words (i.e., topics) that best represent the information in a collection of documents.

Surfacing these hidden patterns in a body of text can, in turn, support better decision making.

Topic modeling differs from rule-based text mining approaches that rely on regular expressions or dictionary-based keyword search. It is an unsupervised approach for finding and observing groups of words (called “topics”) in large collections of text.

A good topic model might produce “health”, “doctor”, “patient”, “hospital” for a Healthcare topic, and “farm”, “crops”, “wheat” for a Farming topic.

Topic models are very useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. For example, the New York Times uses topic models to improve its article recommendation engine. In recruitment, topic models are used to extract latent features from job descriptions and match them to the right candidates. They are also used to organize large datasets of emails, customer reviews, and social media profiles.

There are many approaches for obtaining topics from text, such as term frequency-inverse document frequency (TF-IDF) and non-negative matrix factorization (NMF). Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique, and it is the one we discuss in this article.
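
As a point of comparison before turning to LDA, here is a minimal TF-IDF sketch in plain Python. The toy documents and the helper name `tf_idf` are illustrative assumptions, and the weighting used (relative term frequency times `log(N / df)`) is only one of several common variants.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy documents (not the corpus used later).
docs = [
    ["wheat", "farm", "crops", "farm"],
    ["doctor", "patient", "hospital"],
    ["farm", "doctor"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (count / len(doc)) * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    # Highest-weighted terms first
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

Terms that appear in few documents receive higher weights, so in the first toy document "wheat" outranks "farm" even though "farm" occurs twice.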

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
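
The generative story above can be sketched in a few lines of plain Python. The two topics, their word probabilities, and the 80/20 topic mixture below are made-up values for illustration; real LDA draws these distributions from Dirichlet priors and then infers them from the data rather than fixing them by hand.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "healthcare": {"doctor": 0.4, "patient": 0.3, "hospital": 0.3},
    "farming": {"farm": 0.4, "crops": 0.3, "wheat": 0.3},
}

def generate_document(topic_mixture, n_words, rng):
    """For each word: pick a topic from the mixture, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
document = generate_document({"healthcare": 0.8, "farming": 0.2}, n_words=10, rng=rng)
print(document)
```

LDA works in the opposite direction: it only sees documents like the one printed here and tries to recover the topics and the mixtures that could have produced them.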



In [12]:
# let's see how it works with the following sentences.

doc1 = "I have big exam tomorrow and I need to study hard to get a good grade."
doc2 = "My wife likes to go out with me but I prefer staying at home and studying."
doc3 = "Kids are playing football in the field and they seem to have fun"
doc4 = "Sometimes I feel depressed while driving and it's hard to focus on the road."
doc5 = "I usually prefer reading at home but my wife prefers watching a TV."

In [13]:
# array of documents aka corpus
corpus = [doc1, doc2, doc3, doc4, doc5]

In [3]:
# Now let's prepare our corpus for LDA. We'll use the same functions we wrote before:
# tokenize each sentence (dropping punctuation), then remove stopwords.
# The dictionary and the bag-of-words corpus are built in a later cell.

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

stop_words=set(stopwords.words('english'))

token_list=[]
for sentence in corpus:
    token_list.append(tokenizer.tokenize(sentence.lower()))


def remove_stopwords(words):

    filtered_words = []
    for word in words:
        if word not in stop_words:
            filtered_words.append(word)
            
    return filtered_words

    
tokenized_data=[]

for tokens in token_list:
    # punctuation was already dropped by the tokenizer; now remove the stopwords
    tokenized_data.append(remove_stopwords(tokens))


tokenized_data
# now here are the tokens for each sentence inside the corpus

[['big', 'exam', 'tomorrow', 'need', 'study', 'hard', 'get', 'good', 'grade'],
 ['wife', 'likes', 'go', 'prefer', 'staying', 'home', 'studying'],
 ['kids', 'playing', 'football', 'field', 'seem', 'fun'],
 ['sometimes', 'feel', 'depressed', 'driving', 'hard', 'focus', 'road'],
 ['usually', 'prefer', 'reading', 'home', 'wife', 'prefers', 'watching', 'tv']]

The LDA model discovers the different topics that the documents represent and how much of each topic is present in a document. 

Python provides many great libraries for text mining; “gensim” is one such clean and well-designed library for handling text data. It is scalable, robust, and efficient. The following code shows how to convert a corpus into a document-term matrix and train an LDA model on it.

In [18]:
# first import corpora and models module from gensim package
from gensim import corpora, models

# Build a Dictionary that maps each word to a numeric id
dictionary = corpora.Dictionary(tokenized_data)
 
# Transform the collection of texts to a numerical form
numerical_corpus = [dictionary.doc2bow(text) for text in tokenized_data]

# Build the LDA model
# We are asking LDA to find 10 topics in the data
# (training is randomized, so results vary between runs unless random_state is set)
lda_model = models.LdaModel(corpus=numerical_corpus, num_topics=10, id2word=dictionary)

for idx in range(10):
    # Print the 5 highest-weighted words for each of the 10 topics
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 5))



Topic #0: 0.030*"prefer" + 0.030*"wife" + 0.030*"hard" + 0.030*"reading" + 0.030*"playing"
Topic #1: 0.118*"seem" + 0.118*"kids" + 0.118*"fun" + 0.118*"football" + 0.118*"field"
Topic #2: 0.030*"prefer" + 0.030*"wife" + 0.030*"hard" + 0.030*"depressed" + 0.030*"reading"
Topic #3: 0.030*"prefer" + 0.030*"wife" + 0.030*"home" + 0.030*"go" + 0.030*"fun"
Topic #4: 0.109*"hard" + 0.057*"good" + 0.057*"exam" + 0.057*"study" + 0.057*"get"
Topic #5: 0.107*"home" + 0.107*"staying" + 0.107*"studying" + 0.107*"likes" + 0.107*"wife"
Topic #6: 0.030*"prefer" + 0.030*"hard" + 0.030*"wife" + 0.030*"focus" + 0.030*"home"
Topic #7: 0.030*"wife" + 0.030*"prefer" + 0.030*"playing" + 0.030*"football" + 0.030*"sometimes"
Topic #8: 0.097*"watching" + 0.097*"prefer" + 0.097*"home" + 0.097*"tv" + 0.097*"prefers"
Topic #9: 0.030*"wife" + 0.030*"prefer" + 0.030*"playing" + 0.030*"hard" + 0.030*"home"
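
One way to read this output: several topics (#0, #2, #3, #6, #7, #9) give every listed word the same weight of 0.030. The sketch below counts the distinct tokens in the `tokenized_data` output shown earlier and compares that weight with a uniform share of the vocabulary. The conclusion drawn from it is an interpretation, not something gensim reports.

```python
# Tokens copied from the tokenized_data output shown earlier.
tokenized_data = [
    ['big', 'exam', 'tomorrow', 'need', 'study', 'hard', 'get', 'good', 'grade'],
    ['wife', 'likes', 'go', 'prefer', 'staying', 'home', 'studying'],
    ['kids', 'playing', 'football', 'field', 'seem', 'fun'],
    ['sometimes', 'feel', 'depressed', 'driving', 'hard', 'focus', 'road'],
    ['usually', 'prefer', 'reading', 'home', 'wife', 'prefers', 'watching', 'tv'],
]

# Distinct words across all documents
vocabulary = {word for doc in tokenized_data for word in doc}

# Weight each word would get if a topic spread its probability evenly
uniform_weight = 1 / len(vocabulary)

print(len(vocabulary), round(uniform_weight, 3))  # 33 0.03
```

With 33 distinct words, an even spread gives each word about 0.030, which matches the flat topics. This suggests that those topics captured little structure, which is plausible when asking for 10 topics from only five short documents.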


Since we trained our LDA model on the five simple sentences, we can now use it to detect the topics of a new sentence or text: we first prepare the text and then pass it to the model. Let's try to predict the topic of a new sentence.

In [19]:
new_sentence="My wife plans to go out tonight"
lda_model.get_document_topics(dictionary.doc2bow(new_sentence.split()))
# dictionary.doc2bow expects a list of tokens,
# so, without applying any cleaning, we just split the sentence on whitespace;
# tokens that are not in the dictionary are ignored

[(0, 0.033333335),
 (1, 0.033333335),
 (2, 0.033333335),
 (3, 0.033333335),
 (4, 0.033333335),
 (5, 0.6999944),
 (6, 0.033333335),
 (7, 0.033333335),
 (8, 0.033338886),
 (9, 0.033333335)]
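
To see which words of the new sentence actually reach the model, here is a plain-Python stand-in for the bag-of-words step. The helper `simple_doc2bow` and the id assignment are illustrative assumptions and need not match gensim's internal ids; the sketch only mirrors the default behaviour of dropping tokens that are not in the dictionary.

```python
from collections import Counter

# Tokens copied from the tokenized_data output shown earlier.
tokenized_data = [
    ['big', 'exam', 'tomorrow', 'need', 'study', 'hard', 'get', 'good', 'grade'],
    ['wife', 'likes', 'go', 'prefer', 'staying', 'home', 'studying'],
    ['kids', 'playing', 'football', 'field', 'seem', 'fun'],
    ['sometimes', 'feel', 'depressed', 'driving', 'hard', 'focus', 'road'],
    ['usually', 'prefer', 'reading', 'home', 'wife', 'prefers', 'watching', 'tv'],
]

# Hypothetical word -> id mapping built from the training vocabulary
vocabulary = {word for doc in tokenized_data for word in doc}
token2id = {word: idx for idx, word in enumerate(sorted(vocabulary))}

def simple_doc2bow(tokens, token2id):
    """Count known tokens as (id, count) pairs; unknown tokens are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

new_sentence = "My wife plans to go out tonight"
tokens = new_sentence.split()

known = [t for t in tokens if t in token2id]
print(known)  # ['wife', 'go']
print(simple_doc2bow(tokens, token2id))
```

Only "wife" and "go" are known to the dictionary, and both come from the second training sentence, which is consistent with topic 5 receiving most of the probability.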

In [20]:
# we can also index the LDA model with the bag-of-words vector and get the same output
lda_model[dictionary.doc2bow(new_sentence.split())]

[(0, 0.033333335),
 (1, 0.033333335),
 (2, 0.033333335),
 (3, 0.033333335),
 (4, 0.033333335),
 (5, 0.6999944),
 (6, 0.033333335),
 (7, 0.033333335),
 (8, 0.033338886),
 (9, 0.033333335)]

As you can see, topic 5 (repeated below) is the most relevant topic for this sentence, with a probability of about 0.70.

Topic #5: 0.107*"home" + 0.107*"staying" + 0.107*"studying" + 0.107*"likes" + 0.107*"wife"
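
Rather than reading the list by eye, the dominant topic can be picked programmatically. This small sketch works on the `(topic_id, probability)` pairs copied from the output above; the variable names are illustrative.

```python
# (topic_id, probability) pairs copied from the output above.
topic_distribution = [
    (0, 0.033333335), (1, 0.033333335), (2, 0.033333335), (3, 0.033333335),
    (4, 0.033333335), (5, 0.6999944), (6, 0.033333335), (7, 0.033333335),
    (8, 0.033338886), (9, 0.033333335),
]

# Pick the pair with the highest probability
best_topic, best_prob = max(topic_distribution, key=lambda pair: pair[1])

print(best_topic, round(best_prob, 3))  # 5 0.7
```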

In [21]:

print(corpus[:3])

['I have big exam tomorrow and I need to study hard to get a good grade.', 'My wife likes to go out with me but I prefer staying at home and studying.', 'Kids are playing football in the field and they seem to have fun']


In [38]:
# We can also use the LDA model to find similarities between documents.
# Gensim offers a simple way of performing similarity queries using topic models.

from gensim import similarities

new_sentence = "We are going to play soccer with the kids"

# Index the training documents by their topic distributions
lda_index = similarities.MatrixSimilarity(lda_model[numerical_corpus])

# bag-of-words vector of the target sentence
bow = dictionary.doc2bow(new_sentence.split())

# Query the index; the result gets its own name so that it does not
# shadow the imported `similarities` module
similarity_scores = lda_index[lda_model[bow]]

# document similarity scores:
print(similarity_scores)


[0.33695588 0.355996   0.36244625 0.35599533 0.35112408]


In [39]:
# as you can see, the highest score is 0.36244625 and it belongs to doc3
# doc3 = "Kids are playing football in the field and they seem to have fun"
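
`MatrixSimilarity` scores documents by cosine similarity between their vectors, here the topic distributions. The sketch below shows that computation in plain Python; the three-topic vectors and document names are made-up values for illustration, not the output of the model above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical topic distributions over three topics
query = [0.7, 0.2, 0.1]
documents = {
    "doc_a": [0.6, 0.3, 0.1],
    "doc_b": [0.1, 0.1, 0.8],
}

scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
best = max(scores, key=scores.get)

print(best)  # doc_a
```

In the output above, the five scores all fall between roughly 0.34 and 0.36, so the ranking is a weak signal: of the words in the query sentence, only "kids" appears in the training vocabulary.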