In [1]:
import pandas as pd
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,This picture's following will only grow as tim...,1
1,John Candy. Need we say more? He is the main r...,0
2,This amazing documentary gives us a glimpse in...,1
3,"Well, sadly, I can't help but feeling a little...",1
4,"That's right. A movie written, directed and pr...",0


#### CountVectorizer to create the bag-of-words matrix as input to the LDA

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# max_df=0.1 excludes words that occur in more than 10 percent of the documents, since they are too frequent to be informative
# max_features=5000 keeps only the 5,000 most frequently occurring remaining words, limiting the dimensionality of the dataset to improve the inference performed by LDA
count = CountVectorizer(stop_words='english', max_df=0.1, max_features=5000)
X = count.fit_transform(df['review'].values)

In [6]:
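# X is a SciPy sparse matrix; toarray() converts it to a dense array, which is fine for a quick look but memory-hungry for large corpora
print(X.toarray())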

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


#### Fit a LatentDirichletAllocation estimator

In [7]:
from sklearn.decomposition import LatentDirichletAllocation
# n_components=10 infers ten topics
# learning_method can be 'batch' (uses all training data in each update) or 'online' (uses mini-batches of the training data)
lda = LatentDirichletAllocation(n_components=10, random_state=123, learning_method='batch')
X_topics = lda.fit_transform(X)
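
`fit_transform` returns the document-topic matrix: one row per document, one column per topic. The sketch below illustrates this on a tiny made-up corpus; `toy_docs`, `toy_lda`, and the choice of two topics are assumptions for illustration and are not part of the analysis above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "police car chase murder police",
    "murder detective police case",
    "family mother father children",
    "mother daughter family love",
]
toy_X = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=123,
                                    learning_method='batch')
toy_topics = toy_lda.fit_transform(toy_X)

print(toy_topics.shape)        # (number of documents, number of topics)
print(toy_topics.sum(axis=1))  # each row is a distribution over topics

# Rank documents by their weight on the first topic, highest first
ranked = toy_topics[:, 0].argsort()[::-1]
print(ranked)
```

Sorting one column of this matrix in descending order is a simple way to pull out the documents most strongly associated with a given topic.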

In [9]:
lda.components_.shape

(10, 5000)

In [11]:
# each of the 10 rows holds one topic's importance weights for the 5,000 vocabulary words
for i,j in enumerate(lda.components_):
    print(i,j)

0 [ 88.13189387 101.61940638 353.39259471 ... 378.99023995 243.98333763
  31.47435473]
1 [29.8829067  12.15152038 50.04699935 ...  0.10000444  0.1000031
  2.99470342]
2 [2.86746897e+01 1.56450064e+02 1.33723206e+02 ... 1.00011140e-01
 1.00011088e-01 4.39233822e+00]
3 [ 0.10455942 22.39439154 63.28077671 ...  0.10000985  0.10001133
 12.20144048]
4 [ 54.48391306 223.88472394  38.73832192 ... 784.38104035 534.90771222
  26.14218539]
5 [ 2.6416837  14.55096918  6.5787204  ... 12.92864946 44.40887665
  2.70855062]
6 [1.04749174e+00 5.47558452e+00 1.15354145e+02 ... 1.00007758e-01
 1.00008324e-01 1.00296595e-01]
7 [1.44204315e-01 1.56920225e+01 8.51591043e+00 ... 1.00009999e-01
 1.00010215e-01 1.95272084e+02]
8 [ 3.78863506 48.91360211 65.71883362 ...  0.10001269  0.10001536
  5.13209602]
9 [ 0.10002246 18.86771514 93.65049261 ...  0.10001437  0.1000141
  0.58195101]
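
The rows of `components_` are unnormalized weights (they can be read as pseudocounts), so they are hard to compare across topics. Dividing each row by its sum turns it into a distribution over words, as in the sketch below. The toy corpus and the variable names are assumptions for illustration. Values close to 0.1 in the output above are likely words that were almost never assigned to that topic, which would be consistent with scikit-learn's default topic-word prior of `1 / n_components`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "horror house blood horror",
    "blood house scream",
    "comedy fun laugh comedy",
    "laugh fun joke",
]
toy_X = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=123,
                                    learning_method='batch').fit(toy_X)

# Normalize each topic's weights so that every row sums to 1
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(topic_word.shape)
print(topic_word.sum(axis=1))
```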


In [12]:
# print the five most important words for each of the 10 topics
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {(topic_idx + 1)}:')
    print(' '.join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1:
worst minutes script awful stupid
Topic 2:
family mother father children girl
Topic 3:
american dvd music war tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex blood girl
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original effects fi
Topic 10:
action fight fun guy kids
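
The slice `topic.argsort()[:-n_top_words - 1:-1]` used above can look cryptic. `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end for `n_top_words` steps, yielding the indices of the largest weights first. A small sketch with made-up weights:

```python
import numpy as np

# Hypothetical weights for a six-word vocabulary
weights = np.array([0.1, 5.0, 2.5, 9.0, 0.3, 7.2])
n_top = 3

# Indices of the n_top largest weights, largest first
top_idx = weights.argsort()[:-n_top - 1:-1]
print(top_idx)  # [3 5 1]

# Equivalent, and arguably easier to read: reverse first, then truncate
same_idx = weights.argsort()[::-1][:n_top]
print(same_idx)  # [3 5 1]
```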


#### Based on reading the five most important words for each topic, you may guess that the LDA identified the following topics:

1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedy movie reviews
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies


#### Let's verify with some examples: `Action` (topic 10)

In [29]:
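# column index 9 holds the weights for topic 10 (zero-based indexing); sort the reviews by that weight, highest first
action = X_topics[:, 9].argsort()[::-1]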

for iter_idx, movie_idx in enumerate(action[:5]):
    print(f'\nAction movie #{(iter_idx + 1)}:')
    print(df['review'][movie_idx][:300], '...')


Action movie #1:
This is an above average Jackie Chan flick, due to the fantastic finale and great humor, however other then that it's nothing special. All the characters are pretty cool, and the film is entertaining throughout, plus Jackie Chan is simply amazing in this!. Jackie and Wai-Man Chan had fantastic chemi ...

Action movie #2:
This is the ultimate Kung Fu movie! This is the only Kung Fu movie! This is the only Kung Fu movie I have ever seen! I am giving this movie way too much credit! My best guess for the reason for making this movie is that someone wanted to show off someone else's martial arts abilities, but realized t ...

Action movie #3:
OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop ...

Action movie #4:
To start off,