### Topic Modeling
In machine learning and natural language processing, a topic model is a
type of statistical model for discovering the abstract "topics" that
occur in a collection of documents.

### Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is one example of a topic model. It is
used to assign the text in a document to a particular topic.

There are two key assumptions with LDA:
1. Documents that have similar words usually have the same topic.
2. Documents that have groups of words frequently occurring together usually have the same topic.


### Data
We use a publicly available dataset of product reviews, loaded from `Reviews.csv`. The sample rows below are reviews of food products.

In [15]:
import pandas as pd
import numpy as np

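# Show the full text of each cell; None means "no limit"
# (passing -1 is deprecated since pandas 1.0).
pd.set_option('display.max_colwidth', None)

# Read only the first 30,000 rows to go easy on the laptop.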
reviews = pd.read_csv('~/PycharmProjects/macai/nlp/data/Reviews.csv',
                      nrows=30000)
reviews.dropna(inplace=True)

In [16]:
reviews.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all","This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch."
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal."


#### Useless columns
We build our LDA model upon the `Text` column, so let's get rid of
other columns.

In [17]:
reviews = pd.DataFrame(data=reviews['Text'])
reviews.head()

Unnamed: 0,Text
0,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
1,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."
2,"This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch."
3,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
4,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal."


#### Time to vectorize!
LDA works on numeric vectors, so let's convert the words to counts.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

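# max_df=0.8: ignore words that appear in more than 80% of the documents.
# min_df=2: ignore words that appear in fewer than 2 documents.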
vector = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = vector.fit_transform(reviews['Text'].values)
doc_term_matrix

<29999x17708 sparse matrix of type '<class 'numpy.int64'>'
	with 901558 stored elements in Compressed Sparse Row format>
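
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a made-up four-document corpus (the sentences and the names `toy_docs`, `unfiltered`, and `filtered` are illustrative assumptions, not part of the reviews data). It fits one vectorizer without filtering and one with the same thresholds used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the reviews data.
toy_docs = [
    "good coffee strong coffee",
    "good tea mild flavor",
    "good dog food",
    "good cat food rare",
]

# No document-frequency filtering: every token is kept.
unfiltered = CountVectorizer().fit(toy_docs)

# max_df=0.8 drops words found in more than 80% of the documents;
# min_df=2 drops words found in fewer than 2 documents.
filtered = CountVectorizer(max_df=0.8, min_df=2).fit(toy_docs)

print(sorted(unfiltered.vocabulary_))
print(sorted(filtered.vocabulary_))
```

In this toy corpus "good" occurs in every document and most other words occur in only one, so only a word shared by some, but not all, documents survives the filter.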

Note that text preprocessing steps such as lemmatization could give a
more accurate result, but for the sake of simplicity we skip them here.

Let's look at some of the words in our vocabulary.

In [19]:
import random

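# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() returns an array, so convert it to a list first.
print(random.sample(vector.get_feature_names_out().tolist(), 10))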

['hiked', 'peek', 'enhanced', 'overtly', 'apparatus', 'savers', 'spends', 'knobby', 'vegetative', 'salting']


#### Creating model
LDA learns, for each topic, a probability distribution over the words in
the vocabulary. So let's compute that.

In [20]:
from sklearn.decomposition import LatentDirichletAllocation

# n_components specifies the number of topics
LDA = LatentDirichletAllocation(n_components=5, random_state=1)
LDA.fit(doc_term_matrix)
print(LDA.__dir__())

['n_components', 'doc_topic_prior', 'topic_word_prior', 'learning_method', 'learning_decay', 'learning_offset', 'max_iter', 'batch_size', 'evaluate_every', 'total_samples', 'perp_tol', 'mean_change_tol', 'max_doc_update_iter', 'n_jobs', 'verbose', 'random_state', 'random_state_', 'n_batch_iter_', 'n_iter_', 'doc_topic_prior_', 'topic_word_prior_', 'components_', 'exp_dirichlet_component_', 'bound_', '__module__', '__doc__', '__init__', '_check_params', '_init_latent_vars', '_e_step', '_em_step', '_check_non_neg_array', 'partial_fit', 'fit', '_unnormalized_transform', 'transform', '_approx_bound', 'score', '_perplexity_precomp_distr', 'perplexity', '_get_param_names', 'get_params', 'set_params', '__repr__', '__getstate__', '__setstate__', '_get_tags', '__dict__', '__weakref__', '__hash__', '__str__', '__getattribute__', '__setattr__', '__delattr__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__new__', '__reduce_ex__', '__reduce__', '__subclasshook__', '__init_subclass_

It's good practice to list the attributes of objects we are not very
familiar with. Calling `__dir__()` works; the built-in `dir(LDA)` is the
more idiomatic equivalent and returns the names sorted.

Now let's see what `components_` is.

In [21]:
print(len(LDA.components_[0]))

17708


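So each topic has one weight for every word in the vocabulary. These
weights are unnormalized pseudo-counts; dividing a topic's row by its sum
gives that topic's probability distribution over words.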
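
The rows of `components_` are not guaranteed to sum to 1. As a minimal sketch of turning such weights into per-topic word probabilities, the example below normalizes a small made-up array (`toy_components` is a hypothetical stand-in for `LDA.components_`, with 2 topics and 4 words).

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary,
# standing in for LDA.components_.
toy_components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [1.0, 1.0, 6.0, 2.0],
])

# Divide each row by its own sum so that every topic sums to 1.
topic_word_dist = toy_components / toy_components.sum(axis=1, keepdims=True)

print(topic_word_dist)
print(topic_word_dist.sum(axis=1))
```

Normalizing does not change the ranking of words within a topic, so the top-word lists further down are the same either way.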

Now let's see which words are the top words (highest weight) in the
last topic:

In [22]:
last_topic = LDA.components_[-1]
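# Indices of the ten words with the highest weights in the last topic.
# argsort() sorts ascending, so the strongest word is printed last.
top_ten_words_indices = last_topic.argsort()[-10:]
# get_feature_names() was removed in scikit-learn 1.2; look the names up once.
feature_names = vector.get_feature_names_out()
for i in top_ten_words_indices:
    print(feature_names[i])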

love
price
like
just
good
br
chips
product
amazon
great
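
Because `argsort()` sorts in ascending order, the list above ends with the strongest word. A small sketch of the same slicing with the order reversed, using made-up weights (`toy_vocab` and `toy_weights` are hypothetical, not values from the fitted model):

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
toy_vocab = np.array(["amazon", "chips", "great", "price", "product"])
toy_weights = np.array([3.0, 1.5, 9.0, 0.5, 2.0])

k = 3
# Last k indices of the ascending sort: the top k, weakest first.
ascending = toy_weights.argsort()[-k:]
# Reverse them so the strongest word comes first.
descending = ascending[::-1]

print(toy_vocab[descending].tolist())
```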


Now we might have a sense of what this topic is about.

How about the top words in every topic?

In [23]:
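# get_feature_names() was removed in scikit-learn 1.2; look the names up once.
feature_names = vector.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print('Top 10 words in topic ', index)
    # Ascending order: the strongest word is last in each list.
    print(feature_names[topic.argsort()[-10:]].tolist())
    print('\n\n')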

Top 10 words in topic  0
['mix', 'great', 'sugar', 'flavor', 'just', 'chocolate', 'good', 'taste', 'like', 'br']



Top 10 words in topic  1
['great', 'cups', 'taste', 'good', 'flavor', 'like', 'cup', 'tea', 'br', 'coffee']



Top 10 words in topic  2
['treat', 'love', 'cat', 'loves', 'like', 'dogs', 'treats', 'br', 'dog', 'food']



Top 10 words in topic  3
['flavor', 'sugar', 'taste', 'just', 'juice', 'product', 'drink', 'like', 'tea', 'br']



Top 10 words in topic  4
['love', 'price', 'like', 'just', 'good', 'br', 'chips', 'product', 'amazon', 'great']





#### The power of unsupervised learning!
Now we want to know the probability of each topic being associated with
each document.

In [24]:
topic_probabilities = LDA.transform(doc_term_matrix)
# Note the shape: (documents, topics)
topic_probabilities.shape

(29999, 5)

But we only want the most likely topic for each document, so we assign
each review the index of its highest probability.

In [25]:
reviews['Topic'] = topic_probabilities.argmax(axis=1)
reviews.head(30)

Unnamed: 0,Text,Topic
0,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.,2
1,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo"".",4
2,"This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.",0
3,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.,3
4,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal.",4
5,"I got a wild hair for taffy and ordered this five pound bag. The taffy was all very enjoyable with many flavors: watermelon, root beer, melon, peppermint, grape, etc. My only complaint is there was a bit too much red/black licorice-flavored pieces (just not my particular favorites). Between me, my kids, and my husband, this lasted only two weeks! I would recommend this brand of taffy -- it was a delightful treat.",3
6,"This saltwater taffy had great flavors and was very soft and chewy. Each candy was individually wrapped well. None of the candies were stuck together, which did happen in the expensive version, Fralinger's. Would highly recommend this candy! I served it at a beach-themed party and everyone loved it!",4
7,This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!,4
8,Right now I'm mostly just sprouting this so my cats can eat the grass. They love it. I rotate it around with Wheatgrass and Rye too,2
9,This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.,2
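
Taking the `argmax` discards how confident the assignment is. As a hedged sketch on a made-up document-topic matrix (`toy_topic_probs` is hypothetical, standing in for `topic_probabilities`), keeping the winning probability next to the topic index makes near-ties visible:

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 3 topics.
toy_topic_probs = np.array([
    [0.70, 0.10, 0.20],
    [0.34, 0.33, 0.33],
    [0.05, 0.05, 0.90],
])

# Index of the most likely topic for each document (row).
dominant_topic = toy_topic_probs.argmax(axis=1)
# Probability of that topic; a low value signals a mixed document.
dominant_prob = toy_topic_probs.max(axis=1)

print(dominant_topic.tolist())
print(dominant_prob.tolist())
```

The second toy document is labeled topic 0 even though all three topics are almost equally likely, which a hard label alone would hide.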


#### Data Cleaning matters!
As you can see, tokens like `br` (most likely left over from HTML
`<br />` line-break tags in the raw reviews) carry no meaning and say
nothing about a text's topic. Because we did not clean the data, they
still rank among the top words of every topic and distort the model.
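
One possible cleanup, assuming the `br` token comes from HTML `<br />` tags in the raw text (the notebook does not confirm this), is to strip tags before vectorizing. The helper `strip_html_tags` and the sample string below are hypothetical illustrations.

```python
import re

# Hypothetical raw review text containing HTML line breaks.
raw = "Great taffy.<br /><br />Delivery was very quick."


def strip_html_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Collapse the runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", no_tags).strip()


cleaned = strip_html_tags(raw)
print(cleaned)
```

Applied to the reviews, this could look like `reviews['Text'].map(strip_html_tags)` before calling `fit_transform`. A simple regex like this is only a rough filter; a proper HTML parser is more robust for messy markup.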