### Data

In [1]:
import pandas as pd
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,This picture's following will only grow as tim...,1
1,John Candy. Need we say more? He is the main r...,0
2,This amazing documentary gives us a glimpse in...,1
3,"Well, sadly, I can't help but feeling a little...",1
4,"That's right. A movie written, directed and pr...",0


In [2]:
# Remove very common words (stopwords) as a basic preprocessing step.
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes membership tests fast

def remove_stopwords(sentence):
    # Split on single spaces and drop exact (case-sensitive) stopword matches.
    words = sentence.split(' ')
    words = [w for w in words if w not in stop]
    return ' '.join(words)

df.review = df.review.apply(remove_stopwords)

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,This picture's following grow time goes by. Be...,1
1,John Candy. Need say more? He main reason see ...,0
2,This amazing documentary gives us glimpse live...,1
3,"Well, sadly, I can't help feeling little bit d...",1
4,"That's right. A movie written, directed produc...",0
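
The output above still contains capitalized stopwords such as "This" and "He", because the membership test is case-sensitive while the NLTK list is lowercase. A minimal sketch of a case-insensitive variant follows. It assumes a tiny hypothetical stopword set (`STOP_DEMO`) in place of the NLTK list so that it runs without any downloads, and `remove_stopwords_ci` is an illustrative name, not part of the notebook.

```python
# Hypothetical mini stopword set; the notebook uses NLTK's English list instead.
STOP_DEMO = {"this", "he", "is", "the", "to", "we", "will", "only", "as"}

def remove_stopwords_ci(sentence, stop=STOP_DEMO):
    """Drop stopwords, comparing in lowercase so 'This' matches 'this'."""
    kept = [w for w in sentence.split(' ') if w.lower() not in stop]
    return ' '.join(kept)

print(remove_stopwords_ci("This picture's following will only grow"))
```

Words with attached punctuation (for example "more?") would still not match the stopword list, so tokenizing before removing stopwords may be the more robust order.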


In [4]:
docs = df['review'].to_list()

In [5]:
print(len(docs))
print(docs[0][:500])

50000
This picture's following grow time goes by. Better best picture nominees 97 rewards repeated viewings. I've seen three times I know. Anderson compared great American directors (Altman, Scorcese, Tarantino) may influences chances are, films, he'll considered part short list himself.<br /><br />One last note: Julianne Moore's "Amber Waves" resonate memory long 90's movie characters faded. THE best performance year -in four categories.
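
The sample above still contains `<br /><br />` markup, which a `\w+` tokenizer later turns into the token `br`. A small sketch of stripping such tags before tokenizing is shown here. It assumes the reviews only contain simple tags (a regular expression is not a full HTML parser), and `strip_html_tags` is a hypothetical helper.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>.
TAG_RE = re.compile(r'<[^>]+>')

def strip_html_tags(text):
    """Replace tag-like spans with a space, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(' ', text)
    return re.sub(r'\s+', ' ', no_tags).strip()

print(strip_html_tags("part short list himself.<br /><br />One last note"))
```

If this fits the data, it could be applied with `df.review.apply(strip_html_tags)` before the stopword step.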


### Pre-process and vectorize the documents

    Tokenize (split the documents into tokens).

    Lemmatize the tokens.

    Compute bigrams.

    Compute a bag-of-words representation of the data.


In [6]:
# remove numeric tokens and tokens that are only a single character

# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

In [7]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In [8]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
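
As a rough illustration of what the cell above does, the sketch below finds adjacent token pairs that occur often and appends them as `a_b` tokens. It uses plain frequency counting, which is a simplification and not the scoring that gensim's `Phrases` applies. `frequent_bigrams`, `append_bigrams`, and the toy documents are hypothetical.

```python
from collections import Counter

def frequent_bigrams(docs, min_count=2):
    """Return the set of adjacent token pairs seen at least min_count times."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}

def append_bigrams(doc, bigrams):
    """Return doc plus an 'a_b' token for each frequent adjacent pair in it."""
    joined = ['_'.join(pair) for pair in zip(doc, doc[1:]) if pair in bigrams]
    return doc + joined

toy_docs = [
    ['john', 'candy', 'funny'],
    ['john', 'candy', 'movie'],
    ['funny', 'movie'],
]
pairs = frequent_bigrams(toy_docs, min_count=2)
print(append_bigrams(toy_docs[0], pairs))
```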

In [9]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
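
To make the two thresholds concrete, here is a pure-Python sketch of document-frequency filtering on toy data. It only mirrors the `no_below` and `no_above` conditions and ignores other `filter_extremes` options such as `keep_n`. `filter_by_doc_freq` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {
        tok for tok, freq in doc_freq.items()
        if freq >= no_below and freq <= no_above * n_docs
    }

toy_docs = [
    ['movie', 'funny', 'plot'],
    ['movie', 'plot', 'bad'],
    ['movie', 'funny', 'actor'],
    ['movie', 'scary'],
]
print(sorted(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)))
```

In this toy corpus, "movie" is dropped for being in every document, and the words that occur in a single document are dropped for being too rare.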

In [10]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]


In [11]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 20426
Number of documents: 50000
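
Each entry of `corpus` is a list of `(token_id, count)` pairs. A minimal stand-in for that representation, assuming a hypothetical vocabulary built in order of first appearance (gensim assigns its own ids), could look like this:

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in vocab."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

toy_docs = [['good', 'movie', 'good'], ['bad', 'movie']]
vocab = build_vocab(toy_docs)
print(to_bow(toy_docs[0], vocab))
```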


In [12]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 8
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [13]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.9698.
[([(0.01765403, 'like'),
   (0.013317758, 'bad'),
   (0.012545579, 'this'),
   (0.01196481, 'even'),
   (0.009842883, 'good'),
   (0.009719115, 'really'),
   (0.008870149, 'make'),
   (0.008292448, 'that'),
   (0.007922411, 'would'),
   (0.007807308, 'acting'),
   (0.0075955866, 'thing'),
   (0.007400202, 'plot'),
   (0.007286482, 'there'),
   (0.0066941604, 'and'),
   (0.0066787475, 'could'),
   (0.006493284, 'get'),
   (0.0060703927, 'people'),
   (0.005998627, 'time'),
   (0.0059191193, 'look'),
   (0.0058302013, 'ever')],
  -1.1440975713414971),
 ([(0.014232025, 'show'),
   (0.012577127, 'time'),
   (0.011757986, 'good'),
   (0.011099146, 'see'),
   (0.010892286, 'great'),
   (0.010610615, 'like'),
   (0.010390559, 'this'),
   (0.008534627, 'first'),
   (0.007922086, 'really'),
   (0.0073557813, 'year'),
   (0.007038468, 'funny'),
   (0.00684613, 'well'),
   (0.0064623747, 'would'),
   (0.006290001, 'watch'),
   (0.0061408584, 'still'),
   (0.00608

#### Based on the five most important words for each topic, you might guess that the LDA model identified the following topics:

    # Random Category 1
    Comedy movies
    Biographical movies
    Horror movies
    # Random Category 2
    Family movies
    Historical movies
    Action movies
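
Reading off the leading words per topic can be done programmatically. The sketch below assumes the structure printed by `pprint(top_topics)`, that is, a list of `(word_list, coherence)` entries whose word lists hold `(weight, word)` pairs sorted by descending weight. `top_words` is a hypothetical helper, and the demo data is a hand-copied excerpt of the printed output.

```python
def top_words(top_topics, n=5):
    """Return the first n words of each (word_list, coherence) entry."""
    return [[word for _, word in topic[:n]] for topic, _ in top_topics]

# Excerpt of the first two topics from the printed output.
demo_top_topics = [
    ([(0.01765403, 'like'), (0.013317758, 'bad'), (0.012545579, 'this'),
      (0.01196481, 'even'), (0.009842883, 'good'), (0.009719115, 'really')],
     -1.1440975713414971),
    ([(0.014232025, 'show'), (0.012577127, 'time'), (0.011757986, 'good'),
      (0.011099146, 'see'), (0.010892286, 'great'), (0.010610615, 'like')],
     None),  # coherence for this topic is not visible in the truncated output
]

for words in top_words(demo_top_topics):
    print(', '.join(words))
```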

#### If you are familiar with the subject of the reviews in this dataset, you can see that the topics above make a lot of sense. However, they are not without flaws: some topics overlap substantially, others are hard to interpret, and most contain at least a few terms that seem out of place.