It is a good practice to use a python progress bar to track the progress of the code when working with large datafiles. 

We are working with a movie review dataset with 50,000 reviews to do a sentiment analysis (also called opinion mining).

In [3]:
%pip install pyprind

Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pyprind 
import pandas as pd
import numpy as np
import os

In [11]:
# level of pwd that holds all data
level1 = '/home/rat42/Downloads/aclImdb'

labels = {'pos': 1, 'neg':0}

progress = pyprind.ProgBar(50000)

# Initialize a dataframe to hold the texts from reviews
df = pd.DataFrame()

for level2 in ('test','train'):
    for level3 in ('pos','neg'):
        path = os.path.join(level1,level2,level3)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path,file), 'r',
                      encoding='utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt,labels[level3]]],ignore_index=True)
            progress.update()
df.columns = ['review', 'sentiment']

  df = df.append([[txt,labels[level3]]],ignore_index=True)
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:31


In [12]:
df.head

<bound method NDFrame.head of                                                   review  sentiment
0      I went and saw this movie last night after bei...          1
1      Actor turned director Bill Paxton follows up h...          1
2      As a recreational golfer with some knowledge o...          1
3      I saw this film in a sneak preview, and it is ...          1
4      Bill Paxton has taken the true story of the 19...          1
...                                                  ...        ...
49995  Towards the end of the movie, I felt it was to...          0
49996  This is the kind of movie that my enemies cont...          0
49997  I saw 'Descent' last night at the Stockholm Fi...          0
49998  Some films that you pick up for a pound turn o...          0
49999  This is one of the dumbest films, I've ever se...          0

[50000 rows x 2 columns]>

- sentiment (class lables) has been encoded as 1 (pos) and 0 (neg)
- class labels are sorted
- use np.random to shuffle and save the shuffled df as a csv file

In [24]:
np.random.seed(0)

df=df.reindex(np.random.permutation(df.index))

df.to_csv('movie_data.csv', index='False', encoding='utf-8')

df.head

In [25]:
df = pd.read_csv('movie_data.csv',encoding='utf-8')
df.head(5)

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,review,sentiment
0,11841,38268,28180,This film stands head and shoulders above the ...,1
1,19602,14722,17579,I remember all the hype around this movie when...,0
2,45519,46153,32503,The material in this documentary is so powerfu...,1
3,25747,1823,31803,Kusturika made it again. Another masterpiece. ...,1
4,42642,37573,25995,What would you expect from a film titled 'Surv...,1


In [26]:
df.shape

(50000, 5)

### Bag of words representation

- __raw term frequencies__: rf(t,d)- number of times a term _t_ occurs in a document _d_
- __term frequency-inverse document frequency (ft-idf)__: tf-idf: used to downweight frequently occuring words that don't contain discriminatory information between different class labels

$tf-idf(t,d) = tf(t,d) \times (idf(t,d)+1)$

$idf(t,d) = log\frac{1+n_d}{1+df(d,t)}$

- $n_d$ = no. of documents, df(d,t) = no. of docs that contain _t_

#### Data cleaning
- removing unwanted punctuation, HTML markups and non-letter characters

In [33]:
import re

def preprocessor(txt):
    txt = re.sub('<[^>]*>', '', txt) # remove HTML markup
    emo_icons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', txt)
    txt = (re.sub('[\W]+', ' ', txt.lower()) + ''.join(emo_icons)
          .replace('-',''))
    return txt

In [35]:
#preprocessor("<a>This :) is :( a test :-)!")

df['review'] = df['review'].apply(preprocessor)

In [42]:
# %pip install nltk
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [57]:
def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop=stopwords.words('english')


[nltk_data] Downloading package stopwords to /home/rat42/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [58]:

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

In [64]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, 
                      preprocessor=None)

param_grid = [{'vect__ngram_range': [(1,1)],
              'vect__stop_words': [stop,None],
              'vect__tokenizer':[tokenizer, tokenizer_porter],
              'clf__penalty':['l1','l2'],
              'clf__C':[1.0,10.0,100.0]},
             {'vect__ngram_range':[(1,1)],
             'vect__stop_words':[stop,None],
             'vect__tokenizer':[tokenizer, tokenizer_porter],
             'vect__use_idf':[False],
             'vect__norm':[None],
             'clf__penalty':['l1','l2'],
             'clf__C':[1.0,10.0,100.0]}  
             ]

In [65]:
lr_tfidf = Pipeline([('vect',tfidf'), ('clf',
                LogisticRegression(random_state=0, solver='liblinear'))])

In [67]:
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,scoring='accuracy',
                          cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train,y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


120 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
120 fits failed with the following error:
Traceback (most recent call last):
  File "/home/rat42/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/rat42/anaconda3/lib/python3.10/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/rat42/anaconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 416, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/rat42/anaconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 370, in _fit
    X, fitted_

In [68]:
print("Optimal parameter set: %s" % gs_lr_tfidf.best_params_)

Optimal parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7fa873c591b0>}


In [69]:
print("Cross-validation accuracy: %.3f" % gs_lr_tfidf.best_score_)

Cross-validation accuracy: 0.892


## Topic Modeling with Latent Dirichlet Allocation

- assigning topics to unlabeled text documents
- clustering task (a subcategory of unsupervised learning)

- stop_word='english' will sotp preposition, conjuction, etc from being regarded as feature vectors
- max_df=.1 sets the maximum document frequency of a word. It will exclude words that appear in more than 10% of the instances
- max_features limits the no. of words to reduce dimensionality

In [72]:
X = count.fit_transform(df['review'].values)

In [73]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=123,
                               learning_method='batch')
X_topics = lda.fit_transform(X)

- n_components=10 creates 10 topics 


In [81]:
print(lda.components_.shape)
print(lda.components_)

feature_names = count.get_feature_names_out()
print(feature_names)

(10, 5000)
[[9.18962864e+01 1.00964233e+02 3.47367273e+02 ... 3.56820252e+02
  2.31714844e+02 3.20082566e+01]
 [2.78946164e+01 9.46809301e+00 4.93799582e+01 ... 1.00004938e-01
  1.00003007e-01 4.61089338e+00]
 [1.69851807e+01 1.62522592e+02 1.31723467e+02 ... 1.00010104e-01
  1.00011222e-01 4.35160033e+00]
 ...
 [1.94356500e+00 1.37369359e+01 1.02039671e+01 ... 1.00009470e-01
  1.00010687e-01 1.95232368e+02]
 [8.44902368e+00 2.84799397e+01 6.56077733e+01 ... 1.00011605e-01
  1.00013185e-01 3.92715426e-01]
 [1.00019267e-01 3.02612929e+01 9.68148893e+01 ... 1.00013709e-01
  1.00015377e-01 4.55974126e-01]]
['00' '000' '100' ... 'zombie' 'zombies' 'zone']


In [82]:
n_top_words = 5

feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:"%(topic_idx+1))
    print(" ".join([feature_names[i] for i in topic.argsort()
                   [:-n_top_words -1: -1]]))

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music history
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex blood gore
Topic 7:
role performance comedy actor plays
Topic 8:
series episode war episodes tv
Topic 9:
book version original read effects
Topic 10:
action fight guy guys cool


__The topics identified are__
1. Generally bad movies
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies

Let us see how this works. Let us print the reviews of some movies in some topic category

In [86]:
war = X_topics[:,2].argsort()[::-1]
for i, movie_i in enumerate(war[:3]):
    print ("\n War movie #%d:" %(i+1))
    print(df['review'][movie_i][:300])


 War movie #1:
In the 1980s in wrestling the world was simple. Hulk Hogan would take on Roddy Piper, or Bobby Heenan's cronies or Ted DiBiase and come out victorious more often than not. Occasionally he would get an ally like Randy Savage in 1988, but mostly it was all about Hulk Hogan vs Bobby Heenan, and that's 

 War movie #2:
Hollywood Hotel was the last movie musical that Busby Berkeley directed for Warner Bros. His directing style had changed or evolved to the point that this film does not contain his signature overhead shots or huge production numbers with thousands of extras. By the last few years of the Thirties, sw

 War movie #3:
This is a movie about the music that is currently being played in Istanbul. Istanbul was the center of the two Old World superpowers, the Byzantine Empire and the Ottoman Empire. Today, it is a megalopolis of almost 10 million. So it is to no ones surprise that a lot of music is being played in Ista
