### Bag of Words (BOW) Model for Sentiment Analysis

* IMDB movie review dataset : https://ai.stanford.edu/~amaas/data/sentiment/
* Deep model : https://arxiv.org/pdf/1512.08183.pdf
* In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets.
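
The labeling rule above can be sketched in a few lines. This is only an illustration: it assumes the dataset's `[id]_[rating].txt` file naming, and `polarity_from_filename` is a hypothetical helper that the notebook itself does not define.

```python
import os

def polarity_from_filename(filename):
    """Map a review file name such as '123_8.txt' to a polarity label.

    Hypothetical helper; assumes the '[id]_[rating].txt' naming convention.
    Returns 1 for ratings >= 7, 0 for ratings <= 4, and None otherwise.
    """
    stem, _ = os.path.splitext(filename)
    rating = int(stem.split('_')[1])
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return None  # neutral ratings are excluded from the labeled sets

print(polarity_from_filename('0_9.txt'))   # 1
print(polarity_from_filename('17_2.txt'))  # 0
```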

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WordPunctTokenizer


import time
from datetime import timedelta

In [2]:
def elapsed_time(start_time, end_time):
    """Return the duration between two time.time() stamps, rounded to whole seconds."""
    elapsed_time_secs = end_time - start_time
    return timedelta(seconds=round(elapsed_time_secs))

In [3]:
train_inpath = '../data/IMDB/aclImdb/train/'
test_inpath = '../data/IMDB/aclImdb/test/'
outpath = '../data/IMDB/'

def imdb_reviews_to_csv(inpath, outpath, csv_file_name):
    """Collect the pos/neg review files under inpath into one shuffled CSV."""
    indices = []
    text = []
    rating = []

    # Positive reviews are labeled "1", negative reviews "0".
    for folder, label in (("pos", "1"), ("neg", "0")):
        # row_number restarts at 0 for each class, so it is unique only within a class.
        for i, filename in enumerate(os.listdir(inpath + folder)):
            # Read as UTF-8; decoding as ISO-8859-1 garbles non-ASCII characters.
            with open(inpath + folder + "/" + filename, 'r', encoding="utf-8") as f:
                data = f.read()
            indices.append(i)
            text.append(data)
            rating.append(label)

    dataset = list(zip(indices, text, rating))
    np.random.shuffle(dataset)
    df = pd.DataFrame(data=dataset, columns=['row_number', 'review', 'polarity'])
    df.to_csv(outpath + csv_file_name, index=False, header=True)

In [4]:
#imdb_reviews_to_csv(train_inpath, outpath, 'imdb_train.csv')

In [5]:
#imdb_reviews_to_csv(test_inpath, outpath, 'imdb_test.csv')

In [4]:
imdb_train = pd.read_csv('../data/IMDB/imdb_train.csv')

In [5]:
imdb_train.shape

(25000, 3)

In [6]:
imdb_train.columns

Index(['row_number', 'review', 'polarity'], dtype='object')

In [7]:
pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 500)

In [8]:
imdb_train.head()

Unnamed: 0,row_number,review,polarity
0,7601,"It is hard to describe this film and one wants to tried hard not to dismiss it too quickly because you have a feeling that this might just be the perfect film for some 12 years old girl...<br /><br />This film has a nice concept-the modern version of Sleeping Beauty with a twist. It has some rather dreamy shots and some nice sketches of the young boy relationship with his single working mother and his schoolmate... a nice start you might say, but then it got a bit greedy, very greedy, it tri...",0
1,777,"1st watched 8/3/2003 - 2 out of 10(Dir-Brad Sykes): Mindless 3-D movie about flesh-eating zombies in a 3 story within a movie chronicle. And yes, we get to see zombies eating human flesh parts in 3D!! Wow, not!! That has been done time and time again in 2D in a zombie movie but what usually makes a zombie movie better is the underlying story not the actual flesh-eating. That's what made the original zombie classics good. The flesh-eating was just thrown in as an extra. We're actually bored t...",0
2,12110,"Maaan, where do i start with this god awful movie. Bad bad bad story telling. I do not know what the director was thinking when he made this movie. Namaste London was quite an enjoyable movie to be honest..even the soundtrack was good. But in this one..oh my good..for a movie which is supposed to be a musical one..the songs are soooo bad. AR Rahman should have been the music director. <br /><br />Given two great actors a much better job should have been done by the director. Even though the ...",0
3,11782,"Time and time again, it seems that the comedic actors of Hollywood are surprising me with their talents as dramatic performers: first it was Robin Williams {'One Hour Photo (2002)'}, then it was Jim Carrey {'Eternal Sunshine (2004)' being one example}, then Will Ferrell {'Stranger than Fiction (2006)'} and now Adam Sandler. Yes, that's absolutely right: the guy who has based an entire career on making brainless, goofball comedies {I'm not complaining; I've always been a fan} has finally gi...",1
4,10787,"The Matador is better upon reflection because at the time one is watching it, it seems so light. The humor is always medium-gauge, never unfunny but never gut-busting. The story is a very simple thread. The characteristics of the plot are often recycled features, namely the unscrupulous bad guy in need of a pal and the straight-laced glass-wearing good guy in need of security in life team up and learn from each other and somehow complement each other's lifestyles. I also find the bullfightin...",1


#### The code below uses about 7 GB of RAM and nearly saturates both CPU cores. Generators can reduce this memory usage.

In [10]:
from nltk.parse.corenlp import CoreNLPParser
# Requires a Stanford CoreNLP server already running at this URL.
st_parser = CoreNLPParser(url='http://localhost:9000')

#start_time = time.time()

#stem_corpus = []
#for i in range(1000):
#    review = str(imdb_train['review'][i]).lower()
#    review = re.sub('[^\d\w]', ' ', review)
#    tokens = st_parser.tokenize(review)
#    #print([word for word in tokens])
#    ps = PorterStemmer()
#    review = [ps.stem(word) for word in tokens if not word in set(stopwords.words('english'))]
#    review = ' '.join(review)
#    stem_corpus.append(review)
    
#end_time = time.time()
#print("Time taken to execute: ", elapsed_time(start_time, end_time))    

### Time taken to execute:  0:04:35

#### Generator to reduce resource usage

In [11]:
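wp_tokenizer = WordPunctTokenizer()
ps = PorterStemmer()
# Build the stopword set once; rebuilding it for every token is very slow.
stop_words = set(stopwords.words('english'))

def normalize_text(reviews):
    """Yield each review lowercased, cleaned, stopword-filtered and stemmed."""
    for raw_review in reviews:
        review = str(raw_review).lower()
        # Substitute on the lowercased text so the stopword filter sees lowercase tokens.
        review = re.sub(r'[^\w]', ' ', review)
        tokens = wp_tokenizer.tokenize(review)
        review = [ps.stem(word) for word in tokens if word not in stop_words]
        yield ' '.join(review)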
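
To illustrate why a generator helps, here is a minimal sketch that does not depend on NLTK. The tiny `STOP_WORDS` set and the `normalize_lazily` function are hypothetical stand-ins for the NLTK stopword list and the function above, and stemming is left out. The point is that each review is processed only when it is requested, so no intermediate list of all cleaned reviews has to exist at once.

```python
import re

# Hypothetical stand-in for the NLTK stopword list, built once up front.
STOP_WORDS = {"the", "a", "is", "this", "it", "and"}

def normalize_lazily(reviews):
    """Yield one cleaned review at a time instead of building a full list."""
    for raw in reviews:
        text = re.sub(r"[^\w]", " ", str(raw).lower())
        yield " ".join(w for w in text.split() if w not in STOP_WORDS)

reviews = ["This movie is GREAT!", "A dull, dull film..."]
cleaned = normalize_lazily(reviews)  # no review has been processed yet
first = next(cleaned)                # only the first review is processed here
print(first)                         # movie great
```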

In [12]:
#start_time = time.time()

#stem_corpus = []
#for review in normalize_text(imdb_train['review'][0:1000]):
#    stem_corpus.append(review) 
    
#end_time = time.time()
#print("Time taken to execute: ", elapsed_time(start_time, end_time))

### Time taken to execute:  0:05:42

In [12]:
start_time = time.time()

stem_corpus = []
for review in normalize_text(imdb_train['review']):
    stem_corpus.append(review) 
    
end_time = time.time()
print("Time taken to execute: ", elapsed_time(start_time, end_time))

### Time taken to execute:  0:28:26

Time taken to execute:  0:28:26


In [14]:
import pickle

# Save to disk
filename = '../data/IMDB/imdb_stemmed_corpus.sav'
with open(filename, 'wb') as f:
    pickle.dump(stem_corpus, f)

In [None]:
# Load from disk (the path matches the file written by the save cell)
filename = '../data/IMDB/imdb_stemmed_corpus.sav'
with open(filename, 'rb') as f:
    stem_corpus = pickle.load(f)
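
A save/load round trip can be checked on a small list before committing to the full corpus. This is a minimal sketch that uses a temporary directory and a hypothetical file name rather than the notebook's paths:

```python
import os
import pickle
import tempfile

corpus = ["movi great", "dull dull film"]  # toy stand-in for stem_corpus

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_corpus.sav")  # hypothetical file name
    # Context managers close the file handles even if pickling fails.
    with open(path, "wb") as f:
        pickle.dump(corpus, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == corpus)  # True
```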

In [16]:
print(stem_corpus[0:3])

['I rent I bit weari 80 nbc program appar I save lot money I noth actor credit good job show flaw premis br br We charact unlik He full flaw enlighten complet jerk good day yet reason anybodi care while creat american sitcom center around complet bullhead jackass revolutionari full potenti met within show most support charact fulli flesh charact rather sad punch bag want empathi audienc punch bag As sitcom one made normal audienc relat negat lead charact extent see bitting harm peopl stay there reason ani normal peopl would simpli left abus keep without real reason even realli unbeliev one given joanna cassidi special 2 part abort episod major problem show fall apart To simpli believ peopl put guy told heart gold mesh realiti situat If anyth even dramedi thi badli plot conceiv execut premis moment overal met fate deserv someon gut go make good idea execut haphazard look like weirdli script version jerri springer show someon abus tyrant suppos root told A show like requir deft touch act

In [17]:
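from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once, not once per token

def lemma_normalize_text(reviews):
    """Yield each review lowercased, cleaned, stopword-filtered and lemmatized."""
    for raw_review in reviews:
        review = str(raw_review).lower()
        # \w already covers digits, so [^\w] strips everything except letters, digits and underscores.
        review = re.sub(r'[^\w]', ' ', review)
        # st_parser needs the CoreNLP server to be running.
        tokens = st_parser.tokenize(review)
        review = [lem.lemmatize(word) for word in tokens if word not in stop_words]
        yield ' '.join(review)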

In [18]:
#start_time = time.time()

#lemma_corpus = []
#for review in lemma_normalize_text(imdb_train['review']):
#    lemma_corpus.append(review) 
    
#end_time = time.time()
#print("Time taken to execute: ", elapsed_time(start_time, end_time))

#### Time taken to execute:  0:44:44

In [19]:
# Save to disk
#filename = 'C:/Users/thisi/Workspace/AI_dataset/IMDB/imdb_lemma_corpus.sav'
#pickle.dump(lemma_corpus, open(filename, 'wb'))

In [20]:
# load from disk
filename = 'C:/Users/thisi/Workspace/AI_dataset/IMDB/imdb_lemma_corpus.sav'
with open(filename, 'rb') as f:
    lemma_corpus = pickle.load(f)

In [21]:
print(lemma_corpus[0:3])

['I rented I bit weary 80 NBC programming apparently I saved lot money I nothing actor credit good job show flawed premise br br We character unlikable He full flaw enlightened complete jerk good day Yet reason anybody care While creating American sitcom centered around complete bullheaded jackass revolutionary full potential met within show Most supporting character fully fleshed character rather sad punching bag want empathy audience punching bag As sitcom one made normal audience relate negate lead character extent see Bittinger harming people stay There reason Any normal people would simply left abuse Keeping without real reason even really unbelievable one given Joanna Cassidy special 2 part abortion episode major problem show fall apart To simply believe people put guy told heart gold mesh reality situation If anything even dramedy This badly plotted conceived executed premise moment overall met fate deserved Someone gut go make good idea execution haphazard look like weirdly scr

In [22]:
len(lemma_corpus)

25000

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

start_time = time.time()
cv = CountVectorizer(ngram_range=(1, 1))
X = cv.fit_transform(lemma_corpus).toarray()
y = imdb_train['polarity'].values
end_time = time.time()

print("Time taken to execute: ", elapsed_time(start_time, end_time))

Time taken to execute:  0:00:18
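
To make the bag-of-words representation concrete, the following pure-Python sketch builds the same kind of document-term count matrix by hand for a toy corpus. It only illustrates the idea and is not how `CountVectorizer` is implemented; `docs` is made-up data.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]  # made-up toy corpus

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_terms = sorted({token for doc in docs for token in doc.split()})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# One row per document, one column per vocabulary term; cells hold counts.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(term, 0) for term in vocabulary])

print(vocabulary)  # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(matrix)      # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Word order is discarded: each review is reduced to how often each vocabulary term occurs in it.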


In [29]:
filename = './models/IMDB/imdb_count_vectorizer.pkl'
with open(filename, 'wb') as f:
    pickle.dump(cv, f)

In [24]:
X.shape

(25000, 70340)
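
A matrix of this shape is large once it is made dense. As a back-of-the-envelope check, assuming each cell is an 8-byte integer (the `CountVectorizer` default dtype is `np.int64`):

```python
n_docs, n_terms = 25000, 70340  # the shape printed by X.shape
bytes_per_cell = 8              # assumption: int64 cells

dense_bytes = n_docs * n_terms * bytes_per_cell
print(dense_bytes)                    # 14068000000
print(round(dense_bytes / 2**30, 1))  # 13.1 (GiB)
```

Keeping the sparse matrix that `fit_transform` returns, that is, skipping `.toarray()`, avoids this allocation, and `SGDClassifier.fit` accepts sparse input.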

In [25]:
#cv.vocabulary_  # A mapping of terms to feature indices.

In [26]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

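lm = SGDClassifier()
lm = lm.fit(X, y)
# Predictions are made on the same data used for fitting, so this is
# training accuracy, not an estimate of performance on unseen reviews.
y_pred = lm.predict(X)
accu = accuracy_score(y, y_pred)
print(accu)
## Got 0.93864 accuracy on TRAIN set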



0.93864
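
Training accuracy alone says little about generalization. The sketch below shows one way a held-out evaluation could look, under stated assumptions: the toy documents and labels are made up, `random_state=0` is added only for repeatability, and the vectorizer is fitted on the training documents only and then reused to transform the test documents. It is an illustration of the workflow, not the notebook's actual test-set result.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Made-up stand-ins for the normalized train and test corpora.
train_docs = ["great movie love", "wonderful act great story", "love wonderful film",
              "bad movie hate", "awful act bad story", "hate awful film"]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ["great wonderful story", "bad awful movie"]
test_labels = [1, 0]

# Fit the vocabulary on training data only, then reuse it for the test data.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_docs)  # stays sparse
X_test = vectorizer.transform(test_docs)

clf = SGDClassifier(random_state=0)  # random_state is an assumption for repeatability
clf.fit(X_train, train_labels)

train_acc = accuracy_score(train_labels, clf.predict(X_train))
test_acc = accuracy_score(test_labels, clf.predict(X_test))
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

With the real data, the same pattern would apply to the corpus built from `imdb_test.csv`, using the already fitted `cv` to transform it.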


In [28]:
filename = './models/IMDB/imdb_linear_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(lm, f)