<h1>Topic Modelling with LDA (Latent Dirichlet Allocation)</h1>
<ul>
    <li>This task deals with assigning topics to unlabeled text documents</li>
    <li>Applications: categorisation of newspaper articles</li>
    <li>It is a probabilistic model</li>
    <li>It tries to find groups of words that frequently appear together in documents belonging to the same topic</li>
    <li>The input to LDA is a bag-of-words matrix</li>
    <li>LDA decomposes it into a document-to-topic matrix and a word-to-topic matrix</li>
    <li>We must define the number of topics beforehand, as it is a hyperparameter of LDA</li>
</ul>
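To make the bag-of-words input concrete, here is a minimal pure-Python sketch that builds a count matrix by hand. The toy corpus and the names `toy_docs`, `vocabulary` and `bag_of_words` are invented for illustration; the notebook itself uses scikit-learn's CountVectorizer for this step.

```python
from collections import Counter

# Toy corpus invented for illustration (not taken from movie_data.csv).
toy_docs = [
    "war history documentary",
    "family drama family",
    "war drama",
]

# The vocabulary is the sorted set of all distinct tokens.
vocabulary = sorted({token for doc in toy_docs for token in doc.split()})

# One row per document, one column per vocabulary word, entries are raw counts.
bag_of_words = []
for doc in toy_docs:
    counts = Counter(doc.split())
    bag_of_words.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bag_of_words:
    print(row)
```

With 3 documents and 5 vocabulary words this gives a 3 x 5 count matrix. LDA with k topics would decompose a matrix like this into a 3 x k document-to-topic matrix and a k x 5 matrix relating topics to words.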

<h2>LDA with Scikit-learn</h2>
<ul>
    <li>Class name: LatentDirichletAllocation</li>
    <li>Module Name: sklearn.decomposition</li>
</ul>

In [1]:
import pandas as pd

movie_dataset = "../movie_data.csv"
df = pd.read_csv(movie_dataset, encoding="utf-8")
print(df.head(3))

                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0


Remove stop words from the dataset and build the bag-of-words matrix.
max_df=0.1 excludes words that occur in more than 10% of the documents, since words that appear across many documents carry little topic-specific information.
max_features=5000 limits the vocabulary to the 5000 most frequently occurring words.

Both values are hyperparameters that can be tuned, and changing them will change the resulting topics.
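The effect of the two settings can be sketched in plain Python. This is a simplified illustration under stated assumptions, not CountVectorizer's actual implementation: the helper `select_vocabulary`, the toy documents, and the thresholds (0.5 and 2 instead of 0.1 and 5000) are hypothetical, and tokenisation is a bare whitespace split.

```python
from collections import Counter

# Toy documents invented for illustration.
toy_reviews = [
    "great film great cast",
    "boring film weak plot",
    "film about war history",
    "war documentary film",
]

def select_vocabulary(docs, max_df=0.5, max_features=2):
    """Keep words whose document frequency is at most max_df,
    then keep the max_features most frequent of those."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()   # number of documents containing each word
    term_freq = Counter()  # total occurrences of each word in the corpus
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Drop words that appear in too large a share of the documents.
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]
    # Rank the survivors by corpus frequency (ties broken alphabetically).
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

print(select_vocabulary(toy_reviews))
```

Here "film" occurs in every document and is dropped by the document-frequency filter, while "great" and "war" survive as the two most frequent remaining words.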

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english', max_df=0.1, max_features=5000)

X = count.fit_transform(df['review'].values)
#print(count.vocabulary_)

<p>X is now a bag-of-words matrix. We pass it to the LDA estimator so that it can infer 10 different topics from the documents.</p>

In [3]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, random_state=123, learning_method='batch')
# learning_method='batch' uses all training data in each update; it is slower than 'online' but can give more accurate results
# n_components=10 means infer 10 topics from the documents

X_topics = lda.fit_transform(X)
# components_ has shape (n_topics, n_features): one importance weight per vocabulary word (here 5000) for each of the 10 topics
print(lda.components_.shape)


(10, 5000)
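The rows of components_ are unnormalised weights. The scikit-learn documentation notes that they can be viewed as pseudo-counts and converted into a word distribution per topic by dividing each row by its sum. A minimal sketch with made-up numbers (the names `toy_components` and `normalize_rows` are hypothetical, and the values do not come from the fitted model):

```python
# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary (made-up values).
toy_vocabulary = ["war", "history", "family", "mother"]
toy_components = [
    [6.0, 3.0, 0.5, 0.5],  # topic 0
    [1.0, 1.0, 5.0, 3.0],  # topic 1
]

def normalize_rows(matrix):
    """Divide each row by its sum so that every row sums to 1."""
    return [[value / sum(row) for value in row] for row in matrix]

topic_word_probs = normalize_rows(toy_components)
for topic_idx, row in enumerate(topic_word_probs):
    print(topic_idx, [round(p, 2) for p in row])
```

On the fitted model the same normalisation could be written with NumPy as `lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]`.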


<p>Let's print the 5 most important words for each of the 10 inferred topics</p>

In [4]:
n_top_words = 5
feature_names = count.get_feature_names_out()
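for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {topic_idx + 1}:')
    # argsort() is ascending, so take the last n_top_words indices in reverse order
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print(' '.join(feature_names[i] for i in top_indices))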

Topic 1:
horror worst script effects budget
Topic 2:
dvd watched video music guy
Topic 3:
war american series history documentary
Topic 4:
game killer murder thriller crime
Topic 5:
kids comedy episode series school
Topic 6:
family woman mother beautiful feel
Topic 7:
role performance comedy john plays
Topic 8:
action horror john effects dr
Topic 9:
book version original read music
Topic 10:
action wife father police james
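The slice used to pick the top words is easy to misread, so here is a small pure-Python sketch of what it does. The words and weights are made up, and `sorted(range(...))` stands in for NumPy's argsort().

```python
# Made-up weights for one topic over a 5-word vocabulary.
toy_words = ["plot", "war", "cast", "history", "music"]
toy_weights = [0.2, 3.1, 0.7, 5.0, 1.4]
n_top = 2

# Plain-Python stand-in for argsort(): indices ordered by ascending weight.
ascending = sorted(range(len(toy_weights)), key=lambda i: toy_weights[i])

# Start at the last index and step backwards, stopping before position -n_top - 1.
top_indices = ascending[:-n_top - 1:-1]
top_words = [toy_words[i] for i in top_indices]
print(top_words)
```

The result lists the highest-weighted word first, which is why the first word printed for each topic is its most important one.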


The code above prints the 10 topics identified by the LDA estimator. LDA does not assign names to the topics; it only gives each of the 5000 vocabulary words a weight per topic, so the labels have to be inferred by reading the top words.

Based on the 5 most important words of each topic, a plausible interpretation is:
1. Bad Horror
2. Unidentifiable
3. War Documentary
4. Thriller
5. Comedy
6. Family Drama
7. Comedy
8. Action Horror
9. Movies based on books
10. Action

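To check that the labels are sensible, we can look at the reviews that score highest for one topic. The cell below prints the first 300 characters of the top three reviews for Topic 3 (column index 2 of X_topics).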

In [9]:
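# Column index 2 of X_topics holds the weights for Topic 3; sort review indices by that weight, highest first
topic_3 = X_topics[:, 2].argsort()[::-1]
for iter_idx, movie_idx in enumerate(topic_3[:3]):
    print(f'Movie number #{iter_idx + 1}:')
    print(df['review'][movie_idx][:300], '...')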

Movie number #1:
The Drug Years actually suffers from one of those aspects to mini-series or other kinds of TV documentaries run over and over again for a couple of weeks on TV. It's actually not long enough, in a way. All of the major bases in the decades are covered, and they're all interesting to note as views in ...
Movie number #2:
Not a balanced point of view. The director shouldn't express her opinion as truth. The movie has some criticism of Fujimori but it always gives him and his family the last words. So few critics of Fujimori were provided that it seems the only reason they were included was to be able to say the movie ...
Movie number #3:
This series, produced at probably the most propitious time following the events of the second World War, is on a scale of value that stands far above any individual's presumption to criticize.<br /><br />The timing of World at War's production in 1974, amounting to some three decades after the event ...
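The ranking step used above can be sketched without NumPy as follows. The matrix values and the helper name `top_documents` are hypothetical; in the notebook the same role is played by sorting one column of X_topics.

```python
# Made-up document-to-topic weights: 4 reviews (rows) x 3 topics (columns).
toy_doc_topics = [
    [0.10, 0.20, 0.70],
    [0.80, 0.10, 0.10],
    [0.05, 0.05, 0.90],
    [0.30, 0.40, 0.30],
]

def top_documents(doc_topics, topic_col, n=3):
    """Return the indices of the n documents with the largest weight for one topic."""
    column = [row[topic_col] for row in doc_topics]
    ranked = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    return ranked[:n]

print(top_documents(toy_doc_topics, topic_col=2, n=2))
```

Remember that column indices are zero-based, so the topic printed as "Topic k" corresponds to column k - 1.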


In [None]:
The top three reviews for Topic 3 (column index 2 of X_topics) all appear to discuss documentaries, which is consistent with the "War Documentary" label chosen above.