# Tutorial 7 (Part II)
# Topic Modelling
## Preparing the IMDb movie review data for text processing
## Obtaining the IMDb movie review dataset

The IMDb movie review dataset can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, `cd` into the download directory, and execute

`tar -zxf aclImdb_v1.tar.gz`

B) If you are working with Windows, download an archiver such as [7-Zip](http://www.7-zip.org) to extract the files from the downloaded archive.

In [1]:
import warnings
warnings.filterwarnings('ignore') # We can suppress the warnings

**Optional code to download and unzip the dataset via Python:**

In [2]:
import os
import sys
import tarfile
import time


source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'


def reporthook(count, block_size, total_size):
    """Print download progress; urlretrieve calls this after each block."""
    global start_time
    if count == 0:
        start_time = time.time()
        return
    # Guard against a zero duration if the second call follows immediately
    duration = max(time.time() - start_time, 1e-6)
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    # The last block may be partial, so cap the percentage at 100
    percent = min(count * block_size * 100. / total_size, 100.)
    sys.stdout.write("\r%d%% | %d MB | %.2f MB/s | %d sec elapsed" %
                    (percent, progress_size / (1024.**2), speed, duration))
    sys.stdout.flush()


if not os.path.isdir('aclImdb') and not os.path.isfile(target):
    
    if (sys.version_info < (3, 0)):
        import urllib
        urllib.urlretrieve(source, target, reporthook)
    
    else:
        import urllib.request
        urllib.request.urlretrieve(source, target, reporthook)

In [3]:
if not os.path.isdir('aclImdb'):
    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

## Preprocessing the movie dataset into a more convenient format

In [4]:
pip install pyprind

Collecting pyprind
  Obtaining dependency information for pyprind from https://files.pythonhosted.org/packages/ab/b3/1f12ebc5009c65b607509393ad98240728b4401bc3593868fb161fdd3760/PyPrind-2.11.3-py2.py3-none-any.whl.metadata
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl.metadata (1.1 kB)
Using cached PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3
Note: you may need to restart the kernel to use updated packages.


In [8]:
# Alternatively, install pyprind from the Anaconda prompt:
# conda install -c conda-forge pyprind
import os

import pandas as pd
import pyprind

# Change `basepath` to the directory of the unzipped movie dataset
# (archive: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)
basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)  # the dataset contains 50,000 reviews
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # Collect plain rows and build a single DataFrame at the end,
            # instead of creating a one-row DataFrame per review
            rows.append([txt, labels[l]])
            pbar.update()

df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:03:01


In [9]:
df.head()

                                              review  sentiment
0  I went and saw this movie last night after bei...          1
1  Actor turned director Bill Paxton follows up h...          1
2  As a recreational golfer with some knowledge o...          1
3  I saw this film in a sneak preview, and it is ...          1
4  Bill Paxton has taken the true story of the 19...          1


In [10]:
df.columns = ['review', 'sentiment']

Shuffling the DataFrame:

In [11]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

In [12]:
df.head()

                                                  review  sentiment
11841  In 1974, the teenager Martha Moxley (Maggie Gr...          1
19602  OK... so... I really like Kris Kristofferson a...          0
45519  ***SPOILER*** Do not read this, if you think a...          0
25747  hi for all the people who have seen this wonde...          1
42642  I recently bought the DVD, forgetting just how...          0
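
As a small illustration of the shuffling step, the sketch below applies the same `np.random.permutation` and `reindex` pattern to a toy DataFrame. The toy data and the variable names `toy`, `shuffled`, and `renumbered` are made up for this example, and the final `reset_index(drop=True)` call is an optional extra that the tutorial code does not use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the review DataFrame (hypothetical data)
toy = pd.DataFrame({'review': ['a', 'b', 'c', 'd', 'e'],
                    'sentiment': [1, 0, 1, 0, 1]})

np.random.seed(0)
shuffled = toy.reindex(np.random.permutation(toy.index))

# Same rows in a different order; each row keeps its original index label
print(shuffled)

# Optional: renumber the rows 0..n-1 after shuffling
renumbered = shuffled.reset_index(drop=True)
print(renumbered.index.tolist())
```

This is why the shuffled `df.head()` output shows index labels such as 11841 rather than 0 to 4: `reindex` reorders the rows but does not relabel them.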


Optional: Saving the assembled data as a CSV file:

In [13]:
df.to_csv('movie_data.csv', index = False, encoding = 'utf-8')

In [14]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding = 'utf-8')
df.head(5), df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


(                                              review  sentiment
 0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
 1  OK... so... I really like Kris Kristofferson a...          0
 2  ***SPOILER*** Do not read this, if you think a...          0
 3  hi for all the people who have seen this wonde...          1
 4  I recently bought the DVD, forgetting just how...          0,
 None)

In [15]:
## @Readers: this cell shrinks the dataset to 500 reviews.
##
## It was originally meant to run only on the Travis Continuous
## Integration platform, to prevent timeout errors while testing
## the notebook. Because the `if 'TRAVIS' in os.environ:` guard
## below is commented out, the cell always runs here, so all later
## results are based on the 500-review subset.

# if 'TRAVIS' in os.environ:
# Note: this overwrites movie_data.csv. Since `index=False` is not passed,
# the row index is saved as an extra 'Unnamed: 0' column.
df.loc[:500].to_csv('movie_data.csv')
df = pd.read_csv('movie_data.csv', nrows = 500)
print('SMALL DATA SUBSET CREATED FOR TESTING')

SMALL DATA SUBSET CREATED FOR TESTING


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  500 non-null    int64 
 1   review      500 non-null    object
 2   sentiment   500 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 11.8+ KB
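
The extra `Unnamed: 0` column in the output above appears because the subset was written with `to_csv` without `index=False`: pandas then stores the row index as an unnamed first column, and `read_csv` loads it back as ordinary data. A minimal sketch with an in-memory buffer, using a hypothetical two-row DataFrame:

```python
import io

import pandas as pd

toy = pd.DataFrame({'review': ['good', 'bad'], 'sentiment': [1, 0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)      # ['Unnamed: 0', 'review', 'sentiment']

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)   # ['review', 'sentiment']
```

The extra column is harmless for the steps that follow, since only `df['review']` is passed to the vectorizer.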


In [17]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df = .1,
                        max_features = 500)
X = count.fit_transform(df['review'].values)

In [18]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components = 10,
                                random_state = 123,
                                learning_method = 'batch')
X_topics = lda.fit_transform(X)

In [19]:
lda.components_.shape

(10, 500)
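
The shape `(10, 500)` means one row per topic and one column per vocabulary term. The sketch below, on a hypothetical 6x4 count matrix, shows how the two LDA outputs relate: `fit_transform` returns one row of topic weights per document, while `components_` holds unnormalized word weights per topic that can be turned into distributions by dividing each row by its sum. All names ending in `_toy` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 6 documents, 4 vocabulary terms
X_toy = np.array([[3, 2, 0, 0],
                  [4, 1, 0, 1],
                  [2, 3, 1, 0],
                  [0, 0, 3, 2],
                  [0, 1, 4, 3],
                  [1, 0, 2, 4]])

lda_toy = LatentDirichletAllocation(n_components=2,
                                    random_state=0,
                                    learning_method='batch')
doc_topics_toy = lda_toy.fit_transform(X_toy)

print(lda_toy.components_.shape)   # (2, 4): topics x terms
print(doc_topics_toy.shape)        # (6, 2): documents x topics

# Each document's topic weights form a distribution (rows sum to 1)
print(doc_topics_toy.sum(axis=1))

# Normalizing each row of components_ gives a per-topic word distribution
topic_word_toy = lda_toy.components_ / lda_toy.components_.sum(axis=1, keepdims=True)
print(topic_word_toy.sum(axis=1))
```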

In [20]:
n_top_words = 5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))

Topic 1:
worst team hard waste project
Topic 2:
action series fans fan enjoy
Topic 3:
ship making woman original animation
Topic 4:
war horror soldiers later family
Topic 5:
women men family woman shows
Topic 6:
book david eyes mr patty
Topic 7:
french performance car gives different
Topic 8:
tv effects course horror shows
Topic 9:
thrust dead house brother short
Topic 10:
play script game dead picture


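Based on the five most important words for each topic, we may guess that the LDA identified the following topics. Note that these labels only partly match the top words printed above, which come from the 500-review subset (for example, topic 6 is led by "book" and "david"), so treat them as a rough guide rather than a description of this particular run: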
    
1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedies
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies

To check whether the categories make sense based on the reviews, let's print the three reviews with the highest weight for the horror movie category (category 6 at index position 5):

In [21]:
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
I have read a lot of books in my short lifetime but this is by far the WORST!!! I just got done reading this worthless piece of trash and when I finished it I threw it across the room! I hated it and let me state the reasons! 1.The soldier dies. Why would the author make the soldier die?! Why couldn ...

Horror movie #2:
Have you seen The Graduate? It was hailed as the movie of its generation. But A River Runs Through It is the story about all generations. Long before Dustin Hoffman's character got all wrapped up in the traps of modern suburbia, Norman Maclean and his brother Paul were facing the same crushing press ...

Horror movie #3:
Ring! Ring! Have-been horror directors hotline, how may we help you? UmyeahPronto! I mean hello, my name is Rugge err, call me by my initials R.D! Okay Mr. R.D, what seems to be the problem? Well the reviews on my latest movie "Dial: Help" were all negative and harsh and, frankly, I myself feel l ...


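Using the preceding code example, we printed the first 300 characters of the three reviews with the highest weight for topic 6. Even though we don't know which exact movie each review belongs to, we would expect them to sound like reviews of horror movies. In the output shown here, which comes from the 500-review subset, only review #3 clearly does: review #1 is about a book and review #2 discusses *A River Runs Through It*, which suggests that the horror label does not fit this topic well for such a small sample.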
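
The ranking step can be sketched on a hypothetical document-topic matrix (all values and names below are made up). `argsort()[::-1]` orders documents by their weight for one topic, strongest first, and `argmax(axis=1)` gives each document's dominant topic. Since `argsort` returns row positions, the sketch uses `.iloc` to fetch the matching text. The tutorial's `df['review'][movie_idx]` is a label lookup, which gives the same result there only because `df` has a default 0..n-1 index after `read_csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic matrix: 4 documents, 3 topics (rows sum to 1)
X_topics_toy = np.array([[0.70, 0.20, 0.10],
                         [0.10, 0.30, 0.60],
                         [0.20, 0.10, 0.70],
                         [0.35, 0.60, 0.05]])

# Documents ranked by their weight for topic index 2, strongest first
ranked = X_topics_toy[:, 2].argsort()[::-1]
print(ranked.tolist())               # [2, 1, 0, 3]

# Dominant topic index for every document
dominant = X_topics_toy.argmax(axis=1)
print(dominant.tolist())             # [0, 2, 2, 1]

# Position-based lookup stays correct even with a non-default index
toy_reviews = pd.Series(['r0', 'r1', 'r2', 'r3'], index=[10, 11, 12, 13])
top_review = toy_reviews.iloc[ranked[0]]
print(top_review)                    # r2
```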

## Reference:
* *Python Machine Learning, 2nd Edition* by [Sebastian Raschka](https://sebastianraschka.com), Packt Publishing Ltd., 2017
* Code License: [MIT License](https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/LICENSE.txt)