<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/master/7-topics-in-brief/02_topic_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic Modelling

Topic modeling is one of the most common applications of NLP in industry. Topic models work on many kinds of text, from news articles to tweets, and support outputs ranging from word clouds to graphs of connected topics and documents. They are used extensively for document clustering and for organizing large collections of text data, and they’re also useful for text classification.

One way to summarize a corpus is to bring out the words that best describe it, such as its most common words. A visualization of these words is called a word cloud. The key to a good word cloud is to remove stop words. If we take any English text corpus and list the k most frequent words, we won’t get any meaningful insights, because most of those words will be stop words (the, is, are, am, etc.). After appropriate preprocessing, the word cloud may yield meaningful insights, depending on the document collection.
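The effect of stop word removal on a top-k word list can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's preprocessing: the `STOP_WORDS` set is a small hand-picked assumption (a real pipeline would more likely use a full list such as NLTK's), and `top_k_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop word list, for illustration only.
STOP_WORDS = {"i", "to", "and", "a", "for", "are", "my", "at",
              "this", "on", "of", "the", "is", "am"}

docs = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and salad for breakfast.",
    "Puppies and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

def top_k_words(texts, k, stop_words=frozenset()):
    """Return the k most frequent lowercase words, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(k)

raw_top = top_k_words(docs, 2)                # no filtering
clean_top = top_k_words(docs, 2, STOP_WORDS)  # stop words removed

print("Without stop word removal:", raw_top)
print("With stop word removal:   ", clean_top)
```

On this small corpus, the unfiltered list is dominated by words such as "and" and "a", while the filtered list surfaces content words such as "broccoli" and "cute".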

Topic modeling operationalizes this intuition. It tries to identify groups of “key” words (called “topics”) present in a text corpus without prior knowledge about it, unlike rule-based text mining approaches that rely on regular expressions or dictionary-based keyword searches.

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/7-3.png?raw=1' width='800'/>

Topic modeling generally refers to a collection of unsupervised statistical learning methods to discover latent topics in a large collection of text documents. Some of the popular topic modeling algorithms are latent Dirichlet allocation (LDA), latent semantic analysis (LSA), and probabilistic latent semantic analysis (PLSA). In practice, the technique that’s most commonly used is LDA.

Let’s start with a toy corpus. Say we have a collection of documents, D1 to D5, and each document consists of a single sentence:

- D1: I like to eat broccoli and bananas.
- D2: I ate a banana and salad for breakfast.
- D3: Puppies and kittens are cute.
- D4: My sister adopted a kitten yesterday.
- D5: Look at this cute hamster munching on a piece of broccoli.

Learning a topic model on this collection using LDA may produce an output like this:

- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching
- Topic B: 20% puppies, 20% kittens, 20% cute, 15% hamster

- Documents 1 and 2: 100% Topic A
- Documents 3 and 4: 100% Topic B
- Document 5: 60% Topic A, 40% Topic B

Thus, each topic is nothing but a probability distribution over keywords, and each document is a mixture of topics, again described by a probability distribution.
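To make the two levels of mixing concrete, the toy output above can be written down as plain Python dictionaries. The sketch below assumes the standard LDA mixture view, in which the probability of a word in a document is the topic-weighted sum of that word's probability in each topic. Only the top keywords listed above are included, and any word not listed for a topic is treated as having probability 0 there, which is a simplification. The helper name `word_probability` is hypothetical.

```python
# Topic-word distributions, restricted to the keywords listed above.
# Assumption: words not listed for a topic contribute probability 0.
topics = {
    "A": {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.10, "munching": 0.10},
    "B": {"puppies": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.15},
}

# Document-topic mixture for Document 5 from the toy output.
doc5_mixture = {"A": 0.6, "B": 0.4}

def word_probability(word, doc_mixture, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_mixture.items()
    )

p_broccoli = word_probability("broccoli", doc5_mixture, topics)
p_cute = word_probability("cute", doc5_mixture, topics)

print(f"P(broccoli | D5) = {p_broccoli:.2f}")
print(f"P(cute | D5)     = {p_cute:.2f}")
```

Under these assumptions, "broccoli" gets 0.6 × 0.30 = 0.18 and "cute" gets 0.4 × 0.20 = 0.08 in Document 5, which matches the intuition that the document is mostly about food but partly about pets.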

A topic model only gives a collection of keywords per topic. What exactly each topic represents and what it should be named is typically left to human interpretation in an LDA model. Here, we might look at Topic A and say, “it is about food.” Likewise, for Topic B, we might say, “it is about pets.”


## Setup

```python
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')

from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
```

```
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
```


## Training topic model

Here, we’ll use an LDA implementation from the Python library gensim and the CMU Book Summary Dataset.