- In this assignment, we are given movie reviews labelled as positive ('pos') or negative ('neg').
- We have to train a model on them and then predict the labels of unseen reviews.
- We will use nltk and Naive Bayes for this assignment.

In [2]:
import pandas as pd, matplotlib.pyplot as plt, numpy as np

#### 1. Import data

In [4]:
df = pd.read_csv("./Train.csv")
df.head()

| | review | label |
|---|---|---|
| 0 | mature intelligent and highly charged melodram... | pos |
| 1 | http://video.google.com/videoplay?docid=211772... | pos |
| 2 | Title: Opera (1987) Director: Dario Argento Ca... | pos |
| 3 | I think a lot of people just wrote this off as... | pos |
| 4 | This is a story of two dogs and a cat looking ... | pos |


In [8]:
print(df.shape)

(40000, 2)


- There are 40000 reviews in total, each with its label.
- The 1st column (`review`) is X_train and the 2nd column (`label`) is y_train.

In [10]:
np.unique(df.values[:, 1])

array(['neg', 'pos'], dtype=object)

- There are only 2 classes: 'pos' and 'neg'

In [12]:
## convert df to X_train and y_train 
X_train = df.values[:,0]  # 1st column of 'df' which is review is X_train
y_train = df.values[:,1]  # 2nd column of 'df' represents label
X_train.shape, y_train.shape

((40000,), (40000,))

#### Data Visualization:

In [15]:
np.mean(y_train=='pos')

0.500275

- About 50% of the reviews are 'pos' and the remaining 50% are 'neg', so the two classes are balanced.

### 2. Data Cleaning:

In [16]:
from nltk.tokenize import RegexpTokenizer # to make user-defined token
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

1. Tokenizer: converts a sentence into tokens (words). We use a regex to build a custom tokenizer; nltk.tokenize.word_tokenize is an alternative.
2. Stopwords: used to remove words like 'is', 'am', etc., which occur in almost all documents and carry little meaning on their own.
3. Stemming: reduces words to their stem, e.g. loving, lovely and love are all stemmed to the same word: love.
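
As a rough illustration of step 1, the pattern `\w+` can be tried with Python's built-in `re` module, which should behave like a regex tokenizer for this pattern. The sample sentence and the helper name `tokenize_words` are made up for this sketch.

```python
import re

def tokenize_words(text):
    """Lowercase the text and return every run of word characters."""
    return re.findall(r"\w+", text.lower())

# Made-up sample containing the HTML line break that appears in the reviews.
sample = "Loved it!<br /><br />Not boring at all."
tokens = tokenize_words(sample)
print(tokens)
```

Punctuation disappears, but the HTML tag leaks through as two `br` tokens, which is why such tags are worth removing before tokenizing.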

Details of all of these are given in the "../8Natural Language Provessing/02_nltk.ipynb" file.

In [19]:
## create objects of all these classes to use them
tokenizer = RegexpTokenizer(r'\w+')  # custom tokenizer: keeps runs of word characters
## try other regexes as well to remove more unnecessary data.

eng_stopwords = set(stopwords.words('english'))  # set of all English stopwords
ps = PorterStemmer()  # stemmer object, used later for stemming

In [20]:
## define a function that applies all the steps above to clean one review
def getCleanReview(review):

    review = review.lower()     # lowercase everything to generalize the text
    review = review.replace("<br /><br />", " ")  # replace HTML line-break tags with a space

    # Tokenize
    tokens = tokenizer.tokenize(review)   # tokenize using our custom 'tokenizer'
    # drop stopwords, but keep 'not' because it carries sentiment
    new_tokens = [i for i in tokens if (i not in eng_stopwords or i == 'not')]
    stemmed_tokens = [ps.stem(token) for token in new_tokens]  # do stemming

    cleaned_review = ' '.join(stemmed_tokens)  # join all tokens, separated by spaces

    return cleaned_review

- By default, `stopwords.words('english')` also contains 'not', so it would be removed as a stopword. Since 'not' has very high importance in review classification, we keep it as a useful word with:<br>
<font color="cyan">**new_tokens = [i for i in tokens if (i not in eng_stopwords or i=='not')]**</font>
- Later we will also use bigram features so that 'not good' is treated as a different feature from 'not' and 'good' on their own.
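
A minimal sketch of the same filtering idea on a toy stopword list. The five-word set below is a made-up stand-in, not nltk's real list, and `drop_stopwords` is a hypothetical helper name.

```python
# Toy stand-in for set(stopwords.words('english')); the real nltk list is much longer.
toy_stopwords = {"this", "is", "was", "a", "not"}

def drop_stopwords(tokens, stop_set, keep=("not",)):
    """Remove stopwords but keep the words listed in `keep`."""
    return [t for t in tokens if t not in stop_set or t in keep]

tokens = ["this", "movie", "was", "not", "good"]
with_not = drop_stopwords(tokens, toy_stopwords)              # 'not' survives
without_not = drop_stopwords(tokens, toy_stopwords, keep=())  # 'not' is removed
print(with_not)
print(without_not)
```

An equivalent option would be to remove 'not' from the stopword set once, so that the list comprehension needs no special case.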

## Import test data and clean both train and test data 

In [22]:
df2 = pd.read_csv("./Test.csv")
df2.head()

| | review |
|---|---|
| 0 | Remember those old kung fu movies we used to w... |
| 1 | This movie is another one on my List of Movies... |
| 2 | How in the world does a thing like this get in... |
| 3 | "Queen of the Damned" is one of the best vampi... |
| 4 | The Caprica episode (S01E01) is well done as a... |


In [25]:
X_test = df2.values[:,0]
X_test.shape, X_train.shape

((10000,), (40000,))

In [26]:
## Here instead of using for loop to pass in getCleanReview(), we use list comprehension 
xtrain_clean = [getCleanReview(i) for i in X_train] #List Comprehension
xtest_clean = [getCleanReview(i) for i in X_test]

- This takes a long time because there are 40000 (train) + 10000 (test) reviews in total, and **almost all of that time goes to stemming**; everything else is fast. Stemming takes around 70-80 seconds in my case, but this depends on the system configuration.
- We can also try another stemmer, but I wasn't able to find a better one: SnowballStemmer takes even more time than PorterStemmer, and LancasterStemmer is fast but gives a higher error rate.
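
Since the same words recur across reviews, one possible way to cut the stemming time is to cache the stem of each distinct word. The sketch below uses a toy suffix stripper (`toy_stem`, hypothetical) in place of the real stemmer, and whether caching helps in practice would have to be measured.

```python
from functools import lru_cache

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

@lru_cache(maxsize=None)
def cached_stem(word):
    """Compute the stem once per distinct word and reuse it afterwards."""
    return toy_stem(word)

tokens = ["loving", "movies", "loving", "movies", "loving"]
stems = [cached_stem(t) for t in tokens]
info = cached_stem.cache_info()
print(stems)
print(info.hits, info.misses)
```

Only the two distinct words reach the stemmer; the three repeats are served from the cache.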

## 3. Vectorization

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
cv = CountVectorizer() # create object of this class 

In [30]:
## do training with fit() and convert X_train to vectors with transform() 
X_tr_vec = cv.fit_transform(xtrain_clean)  # it gives us sparse matrix
print(X_tr_vec.shape)
print(type(X_tr_vec))  # of type scipy.sparse

(40000, 65742)
<class 'scipy.sparse.csr.csr_matrix'>


- The vocabulary size is 65742, so each of the 40000 reviews becomes a vector of length 65742. A dense matrix would need 40000 × 65742 ≈ $2.63 \times 10^9$ entries, which at 8 bytes per int64 value is about 21 GB (roughly 19.6 GiB).
- This exceeds our RAM, which is why a sparse matrix is used. **If we try to convert it to an array using the toarray() method, it raises an error such as**: "Unable to allocate 20 GiB for an array with shape (40000, 65742) and data type int64"
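
The size of the dense matrix can be checked with a few lines of arithmetic. The shape comes from the output above, and 8 bytes per value assumes the int64 dtype named in the error message.

```python
n_reviews = 40000
vocab_size = 65742
bytes_per_value = 8  # int64

dense_entries = n_reviews * vocab_size          # number of cells in the dense matrix
dense_bytes = dense_entries * bytes_per_value   # memory needed to store them all
dense_gib = dense_bytes / 2**30                 # convert bytes to GiB

print(dense_entries)
print(dense_bytes)
print(round(dense_gib, 1))
```

The sparse matrix avoids this cost by storing only the non-zero counts.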

In [31]:
print(X_tr_vec)  # see sparse matrix

  (0, 36382)	1
  (0, 29108)	1
  (0, 26775)	1
  (0, 10867)	1
  (0, 37086)	1
  (0, 60314)	1
  (0, 20903)	2
  (0, 11291)	1
  (0, 430)	1
  (0, 63325)	2
  (0, 55598)	2
  (0, 43686)	1
  (0, 10402)	1
  (0, 34594)	1
  (0, 59288)	1
  (0, 52684)	1
  (0, 41872)	1
  (0, 51132)	1
  (0, 35311)	1
  (0, 56815)	1
  (1, 27676)	1
  (1, 62250)	1
  (1, 24143)	1
  (1, 12462)	1
  (1, 62266)	1
  :	:
  (39999, 31993)	1
  (39999, 25733)	1
  (39999, 25701)	2
  (39999, 13817)	2
  (39999, 23685)	1
  (39999, 42661)	1
  (39999, 47890)	1
  (39999, 63872)	1
  (39999, 50348)	1
  (39999, 20114)	1
  (39999, 53850)	2
  (39999, 49721)	1
  (39999, 13503)	1
  (39999, 27523)	1
  (39999, 34437)	1
  (39999, 29834)	1
  (39999, 21936)	1
  (39999, 27291)	1
  (39999, 37774)	1
  (39999, 62984)	1
  (39999, 53855)	1
  (39999, 40271)	1
  (39999, 40340)	1
  (39999, 50894)	1
  (39999, 52531)	1


In [32]:
print(cv.vocabulary_)

{'matur': 36382, 'intellig': 29108, 'highli': 26775, 'charg': 10867, 'melodrama': 37086, 'unbelivebl': 60314, 'film': 20903, 'china': 11291, '1948': 430, 'wei': 63325, 'stun': 55598, 'perform': 43686, 'catylast': 10402, 'love': 34594, 'triangl': 59288, 'simpli': 52684, 'oppurun': 41872, 'see': 51132, 'magnific': 35311, 'take': 56815, 'http': 27676, 'video': 62250, 'googl': 24143, 'com': 12462, 'videoplay': 62266, 'docid': 16704, '211772166650071408': 610, 'hl': 26994, 'en': 18756, 'distribut': 16578, 'tri': 59281, 'opt': 41874, 'mass': 36207, 'appeal': 3782, 'want': 62988, 'best': 6649, 'possibl': 45173, 'view': 62302, 'rang': 47053, 'forgo': 21722, 'profit': 45778, 'continu': 13029, 'manual': 35825, 'labor': 32598, 'job': 30322, 'gladli': 23703, 'entertain': 18964, 'work': 64412, 'texa': 57655, 'tale': 56850, 'pleas': 44618, 'write': 64566, 'like': 33852, 'not': 40900, 'alex': 2594, 'stuie': 55590, 'opinion': 41842, 'rule': 49457, 'titl': 58414, 'opera': 41827, '1987': 486, 'director'

In [33]:
a = sorted(cv.vocabulary_.keys())
print(a)

['00', '000', '0000000000001', '00000001', '00001', '00015', '000dm', '001', '003830', '006', '007', '0079', '0080', '0083', '009', '0093638', '00am', '00o', '00pm', '00schneider', '01', '0126', '0148', '01pm', '02', '020410', '0230', '029', '03', '039', '04', '044', '05', '050', '05nomactr', '06', '0615', '07', '07b', '08', '087', '089', '08th', '09', '0f', '0ne', '0r', '0s', '0tt', '10', '100', '1000', '10000', '1000000', '10000000000', '10000000000000', '10000th', '1001', '1004', '100b', '100bt', '100ft', '100ib', '100k', '100m', '100mile', '100min', '100mph', '100th', '100time', '100x', '100yard', '100â', '101', '101st', '102', '102nd', '103', '104', '1040', '1040a', '105', '1050', '105lb', '106', '106min', '107', '108', '1080p', '109', '10_', '10ft', '10ish', '10k', '10line', '10mil', '10min', '10minut', '10p', '10pm', '10star', '10th', '10x', '10yo', '10yr', '10â', '11', '110', '1100', '1100ad', '110mph', '111', '112', '113', '1138', '113min', '113minut', '114', '1146', '115', '1

- This is our vocabulary: 65742 unique words and their index mapping.
- There are many pure numbers such as 10, 20, ... which we could remove. Some may be helpful, since a rating above 6 mentioned in a review will mostly belong to the 'pos' class, but many other numbers are unnecessary. So we can compare results with and without them.
- After sorting, we can also see many words in the dictionary like aaaaaaa, aaaaazzz, aaasd, .... We can remove them too by changing our regex for the tokenizer.
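
One way such a filter could look is sketched below. The helper `is_noise`, its `min_run` threshold and the `drop_numbers` switch are assumptions for illustration; the right threshold would need to be checked against the accuracy.

```python
import re

def is_noise(token, min_run=4, drop_numbers=False):
    """Return True for tokens that look like junk.

    min_run: a token is junk if one character repeats this many times in a row.
    drop_numbers: also treat pure-digit tokens as junk.
    """
    repeated = re.search(r"(.)\1{%d,}" % (min_run - 1), token)
    if repeated:
        return True
    return drop_numbers and token.isdigit()

vocab_sample = ["aaaaaaa", "aaaaazzz", "aaasd", "good", "10", "1080p", "zzzzz"]
kept = [t for t in vocab_sample if not is_noise(t)]
kept_no_numbers = [t for t in vocab_sample if not is_noise(t, drop_numbers=True)]
print(kept)
print(kept_no_numbers)
```

With the default threshold 'aaasd' survives because its longest run is three characters; calling `is_noise(token, min_run=3)` would remove it as well.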

In [34]:
len(cv.vocabulary_)

65742

### Transform test data into vectors using the above vocab

In [35]:
# let's see what our first cleaned test review looks like
print(xtest_clean[0])  # print the 1st cleaned review

rememb old kung fu movi use watch friday saturday late night babysitt thought charg well movi play exactli like one movi patsi kensit biggest claim fame love interest mel gibson charact lethal weapon 2 perform one reason never made big terribl actress lethal weapon 2 thought cute cute enough check movi includ love music love danc anoth big let obvious not impress either attract eye soul scream turn play anoth cheap predict role done badli movi kensit star comedienn not good one either work club franc cut homeland make ear bleed luck even wors french govern want throw expir visa mayb caught act get marri casanova freiss luck predict begin terribl way give movi neg rate 1 10 star rate


In [36]:
## Vectorization on the test set

#transform converts x_test to vector using already learned vocab of X_train which we fit earlier.
xtest_vec = cv.transform(xtest_clean)  
print(xtest_vec)  # uses same vocab as above
print(xtest_vec.shape)

  (0, 49)	1
  (0, 1770)	1
  (0, 1796)	1
  (0, 3476)	2
  (0, 4661)	1
  (0, 5115)	1
  (0, 5232)	1
  (0, 6226)	1
  (0, 6865)	2
  (0, 6881)	1
  (0, 7305)	1
  (0, 10201)	1
  (0, 10413)	1
  (0, 10839)	1
  (0, 10867)	1
  (0, 10986)	1
  (0, 11007)	1
  (0, 11794)	1
  (0, 12097)	1
  (0, 12484)	1
  (0, 14312)	1
  (0, 14316)	2
  (0, 14604)	1
  (0, 16907)	1
  (0, 17927)	1
  :	:
  (9999, 55254)	1
  (9999, 55278)	1
  (9999, 55826)	1
  (9999, 55827)	1
  (9999, 56192)	1
  (9999, 57706)	1
  (9999, 57909)	2
  (9999, 58047)	2
  (9999, 58272)	2
  (9999, 58414)	1
  (9999, 58831)	2
  (9999, 60168)	1
  (9999, 60583)	1
  (9999, 60743)	1
  (9999, 61514)	1
  (9999, 62988)	1
  (9999, 63114)	1
  (9999, 63197)	2
  (9999, 63385)	2
  (9999, 63425)	1
  (9999, 63736)	1
  (9999, 64394)	1
  (9999, 64412)	1
  (9999, 64935)	1
  (9999, 65609)	2
(10000, 65742)
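
The difference between fitting a vocabulary on the training reviews and only transforming the test reviews can be sketched in plain Python. The helpers `fit_vocabulary` and `transform` are hypothetical, and splitting on whitespace is a simplification of what CountVectorizer does.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a word -> column index mapping from the training documents."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}

def transform(doc, vocab):
    """Count only the words already in the vocabulary; unseen words are ignored."""
    counts = Counter(w for w in doc.split() if w in vocab)
    return {vocab[w]: c for w, c in counts.items()}

train_docs = ["good movi", "bad movi"]
vocab = fit_vocabulary(train_docs)
test_vector = transform("good good film", vocab)
print(vocab)
print(test_vector)
```

The word 'film' never appeared in training, so it has no column and is dropped. This is why the test matrix has the same number of columns as the training matrix.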


## Apply Multinomial NB

In [37]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

In [38]:
## create objects of MultinomialNB() and BernoulliNB classes.
mnb = MultinomialNB()
bnb = BernoulliNB()

In [39]:
mnb.fit(X_tr_vec, y_train)

MultinomialNB()

In [40]:
mnb.score(X_tr_vec, y_train)   # check accuracy of model for train data

0.8905

In [41]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(mnb, X_tr_vec, y_train, cv=10)  # run 10-fold CV once and reuse the scores
print(scores)
print(scores.mean())

[0.85475 0.85625 0.8525  0.8475  0.84875 0.863   0.855   0.85675 0.8575
 0.85775]
0.8549749999999999


- It gives 89% accuracy on our train data and an average of 85.5% accuracy under 10-fold cross-validation.
- Now let us try with Bernoulli NB:

In [42]:
bnb.fit(X_tr_vec, y_train)
bnb.score(X_tr_vec, y_train)  # check accuracy of model for train data

0.886725

In [43]:
scores = cross_val_score(bnb, X_tr_vec, y_train, cv=10)  # run 10-fold CV once and reuse the scores
print(scores)
print(scores.mean())

[0.8465  0.85025 0.84575 0.85125 0.8445  0.85025 0.84875 0.847   0.8485
 0.8475 ]
0.848025


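- It gives an average accuracy of 84.8%, slightly lower than Multinomial NB. A likely reason is that the vocabulary is very large and Bernoulli NB only uses whether a word is present or absent, ignoring how often it occurs, which tends to work less well as the vocabulary grows.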
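
To make the difference between the two models concrete, here is a from-scratch sketch of both likelihoods with add-one smoothing. The four-review corpus and the function names are made up for illustration, and this is not how scikit-learn implements them internally.

```python
import math
from collections import Counter

# Tiny made-up training set of already cleaned reviews.
train = [("love great film", "pos"), ("great great act", "pos"),
         ("bad bore film", "neg"), ("bad bad plot", "neg")]
vocab = sorted({w for text, _ in train for w in text.split()})

def multinomial_log_likelihood(doc, label, alpha=1.0):
    """Sum log P(word | class) once per word occurrence, so repeats count."""
    counts = Counter(w for text, y in train if y == label for w in text.split())
    total = sum(counts.values())
    return sum(math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
               for w in doc.split() if w in vocab)

def bernoulli_log_likelihood(doc, label, alpha=1.0):
    """Score every vocabulary word as present or absent; counts are ignored."""
    docs = [set(text.split()) for text, y in train if y == label]
    present = set(doc.split())
    score = 0.0
    for w in vocab:
        p = (sum(w in d for d in docs) + alpha) / (len(docs) + 2 * alpha)
        score += math.log(p) if w in present else math.log(1.0 - p)
    return score

for doc in ("great film", "great great film"):
    print(doc,
          round(multinomial_log_likelihood(doc, "pos"), 3),
          round(bernoulli_log_likelihood(doc, "pos"), 3))
```

Repeating 'great' changes the multinomial score but leaves the Bernoulli score untouched, because the Bernoulli model only sees presence or absence.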

## Testing and predictions for test data:

In [44]:
y_pred = mnb.predict(xtest_vec)
y_pred.shape

(10000,)

In [45]:
print(y_pred)

['neg' 'neg' 'neg' ... 'pos' 'pos' 'neg']


In [46]:
## convert to csv file
df = pd.DataFrame(y_pred, columns=["label"])
df.to_csv("output.csv", index_label="Id")

- Here the accuracy is 84% because we have only used unigram features. Some words change meaning inside bigrams such as 'not good' and 'not present': 'not good' clearly belongs to the negative class, while 'not present' is about equally likely in both classes.

## Let us try with bigram features.
- Step 1 (import data) and Step 2 (data cleaning) are already done and are the same here, so we reuse the same cleaned data.

## 3. Vectorization

In [48]:
from sklearn.feature_extraction.text import CountVectorizer

In [49]:
cv = CountVectorizer(ngram_range=(1,2)) # create object of this class
# ngram_range is (1,2) to create both unigrams and bigrams

- We use both unigrams and bigrams to capture negation. 'not' is normally a stopword, but we kept it during cleaning, so a bigram such as 'not good' becomes a feature of its own that carries the effect of 'not'.
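
A small pure-Python sketch of what an n-gram range of 1 to 2 produces for one cleaned review. The helper `word_ngrams` is hypothetical and only mimics the idea, not the vectorizer's exact tokenization.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of the text for n between n_min and n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("movi not good")
print(features)
```

The bigram 'not good' now exists as a separate feature next to the unigrams 'not' and 'good'.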

In [50]:
## do training with fit() and convert X_train to vectors with transform() 
X_tr_vec = cv.fit_transform(xtrain_clean)  # it gives us sparse matrix
print(X_tr_vec.shape)
print(type(X_tr_vec))  # of type scipy.sparse

(40000, 2265307)
<class 'scipy.sparse.csr.csr_matrix'>


- Now our vocab size is 2265307, which is about 34 times larger than before. In return, it gives more accurate results, as the scores in the next cells show.

In [51]:
print(X_tr_vec)  # see sparse matrix

  (0, 1243022)	1
  (0, 1024782)	1
  (0, 937294)	1
  (0, 337928)	1
  (0, 1257884)	1
  (0, 2095107)	1
  (0, 740062)	2
  (0, 351787)	1
  (0, 7137)	1
  (0, 2187151)	2
  (0, 1929611)	2
  (0, 1476407)	1
  (0, 318993)	1
  (0, 1189454)	1
  (0, 2068077)	1
  (0, 1816567)	1
  (0, 1422699)	1
  (0, 1749999)	1
  (0, 1208261)	1
  (0, 1969985)	1
  (0, 1243135)	1
  (0, 1025075)	1
  (0, 937348)	1
  (0, 338131)	1
  (0, 1258121)	1
  :	:
  (39999, 1810726)	1
  (39999, 1810731)	1
  (39999, 1550226)	1
  (39999, 908695)	1
  (39999, 2206671)	1
  (39999, 904824)	1
  (39999, 1445393)	1
  (39999, 1474970)	1
  (39999, 850920)	1
  (39999, 462153)	1
  (39999, 1759756)	1
  (39999, 1182107)	1
  (39999, 1319705)	1
  (39999, 1739946)	1
  (39999, 793104)	1
  (39999, 1860658)	1
  (39999, 963604)	1
  (39999, 1277266)	1
  (39999, 1508690)	1
  (39999, 1361479)	1
  (39999, 954495)	1
  (39999, 1704790)	1
  (39999, 799602)	1
  (39999, 462167)	1
  (39999, 1859801)	1


In [52]:
print(cv.vocabulary_)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



- In this case the vocabulary is too large to be displayed, so the notebook stops sending the output.

In [53]:
a = sorted(cv.vocabulary_.keys())
print(a)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



- Here also we will have a lot of unnecessary tokens like aaaaa, aaaazz, bbbb, ...., zzzzz.
- We can clean these separately before vectorization by designing a regex that removes them.

In [54]:
len(cv.vocabulary_)

2265307

### Transform test data into vectors using the above vocab

In [55]:
# let's see what our first cleaned test review looks like
print(xtest_clean[0])  # print the 1st cleaned review

rememb old kung fu movi use watch friday saturday late night babysitt thought charg well movi play exactli like one movi patsi kensit biggest claim fame love interest mel gibson charact lethal weapon 2 perform one reason never made big terribl actress lethal weapon 2 thought cute cute enough check movi includ love music love danc anoth big let obvious not impress either attract eye soul scream turn play anoth cheap predict role done badli movi kensit star comedienn not good one either work club franc cut homeland make ear bleed luck even wors french govern want throw expir visa mayb caught act get marri casanova freiss luck predict begin terribl way give movi neg rate 1 10 star rate


In [56]:
## Vectorization on the test set

#transform converts x_test to vector using already learned vocab of X_train which we fit earlier.
xtest_vec = cv.transform(xtest_clean)  
print(xtest_vec)  # uses same vocab as above
print(xtest_vec.shape)

  (0, 717)	1
  (0, 1602)	1
  (0, 36172)	1
  (0, 37193)	1
  (0, 43463)	1
  (0, 105355)	2
  (0, 105545)	1
  (0, 105671)	1
  (0, 149500)	1
  (0, 149699)	1
  (0, 164766)	1
  (0, 171803)	1
  (0, 172024)	1
  (0, 197373)	1
  (0, 198827)	1
  (0, 216485)	2
  (0, 217242)	1
  (0, 218214)	1
  (0, 218264)	1
  (0, 229720)	1
  (0, 311170)	1
  (0, 319050)	1
  (0, 319061)	1
  (0, 332872)	1
  (0, 335115)	1
  :	:
  (9999, 2048075)	2
  (9999, 2048905)	1
  (9999, 2049118)	1
  (9999, 2092862)	1
  (9999, 2101185)	1
  (9999, 2104146)	1
  (9999, 2119957)	1
  (9999, 2121510)	1
  (9999, 2164689)	1
  (9999, 2172326)	1
  (9999, 2175753)	1
  (9999, 2178607)	2
  (9999, 2179670)	1
  (9999, 2180621)	1
  (9999, 2188771)	2
  (9999, 2191514)	1
  (9999, 2192509)	1
  (9999, 2192606)	1
  (9999, 2201110)	1
  (9999, 2201463)	1
  (9999, 2226349)	1
  (9999, 2227939)	1
  (9999, 2250590)	1
  (9999, 2263689)	2
  (9999, 2264054)	1
(10000, 2265307)


## Apply Multinomial NB

In [57]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

In [58]:
## create objects of MultinomialNB() and BernoulliNB classes.
mnb = MultinomialNB()
bnb = BernoulliNB()

In [59]:
mnb.fit(X_tr_vec, y_train)

MultinomialNB()

In [60]:
mnb.score(X_tr_vec, y_train)   # check accuracy of model for train data

0.994675

In [61]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(mnb, X_tr_vec, y_train, cv=10)  # run 10-fold CV once and reuse the scores
print(scores)
print(scores.mean())

[0.884   0.878   0.88225 0.874   0.879   0.88725 0.88075 0.882   0.8805
 0.88725]
0.8815


- It gives 99.5% accuracy on the training data, but only about 88% under cross-validation, so the model fits the training data much more closely than unseen data.

In [62]:
bnb.fit(X_tr_vec, y_train)
bnb.score(X_tr_vec, y_train)  # check accuracy of model for train data

0.992825

In [63]:
scores = cross_val_score(bnb, X_tr_vec, y_train, cv=10)  # run 10-fold CV once and reuse the scores
print(scores)
print(scores.mean())

[0.869   0.87425 0.87075 0.86925 0.86875 0.87375 0.8755  0.86475 0.8705
 0.87025]
0.870675


- It also gives 99.3% accuracy on the training data and an average of 87% under cross-validation.
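
The four experiments can be compared side by side with a short summary script. The numbers are copied from the outputs in this notebook; the layout of the `results` dictionary is just one convenient choice.

```python
from statistics import mean

# (training accuracy, 10 cross-validation fold scores) copied from the outputs above.
results = {
    "MNB unigram": (0.8905, [0.85475, 0.85625, 0.8525, 0.8475, 0.84875,
                             0.863, 0.855, 0.85675, 0.8575, 0.85775]),
    "BNB unigram": (0.886725, [0.8465, 0.85025, 0.84575, 0.85125, 0.8445,
                               0.85025, 0.84875, 0.847, 0.8485, 0.8475]),
    "MNB bigram": (0.994675, [0.884, 0.878, 0.88225, 0.874, 0.879,
                              0.88725, 0.88075, 0.882, 0.8805, 0.88725]),
    "BNB bigram": (0.992825, [0.869, 0.87425, 0.87075, 0.86925, 0.86875,
                              0.87375, 0.8755, 0.86475, 0.8705, 0.87025]),
}

summary = {}
for name, (train_acc, folds) in results.items():
    cv_mean = mean(folds)
    gap = train_acc - cv_mean  # a large gap suggests overfitting
    summary[name] = (cv_mean, gap)
    print(f"{name}: cv mean = {cv_mean:.4f}, train - cv gap = {gap:.4f}")
```

The bigram models score higher under cross-validation but also show a much larger gap between training and cross-validation accuracy than the unigram models.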

## Testing and predictions for test data:

In [64]:
y_pred = mnb.predict(xtest_vec)
y_pred.shape

(10000,)

In [65]:
print(y_pred)

['neg' 'neg' 'neg' ... 'pos' 'pos' 'neg']


In [66]:
## convert to csv file
df = pd.DataFrame(y_pred, columns=["label"])
df.to_csv("output.csv", index_label="Id")

### This gives 87% accuracy
- I uploaded it to the assignment and it gave me 87% accuracy. Earlier it was 84%, so this is a clear improvement. We may be able to reach 89% or 90% accuracy with more careful data cleaning.

- I also tried ngram_range=(1,3), but fit_transform() had not finished even after 1 hour, so I deleted that part. Maybe ngram_range=(2,3) can give somewhat better results.
- Using TF-IDF can also change the accuracy considerably. We can try that as well.

- One reason the accuracy is not higher is that our dictionary contains a lot of unnecessary words, as we saw when we sorted the vocabulary. We can change our custom tokenizer to remove them, for example by dropping words in which the same character occurs more than 3 times. This removes tokens like aaaaaa, aaaaaah, zzzzzzz, dddddda, gggghdss, etc.
- We can then combine this with TF-IDF, ngram_range=(1,2), etc.
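
To give a feel for what TF-IDF changes, the sketch below computes a smoothed inverse document frequency by hand on three made-up cleaned reviews. The formula ln((1 + n) / (1 + df)) + 1 is one common smoothed variant (I believe it matches scikit-learn's default, but treat that as an assumption), and no length normalization is applied here.

```python
import math
from collections import Counter

# Three made-up cleaned reviews.
docs = ["movi good act", "movi bad plot", "movi not good"]
n_docs = len(docs)

# Document frequency: in how many reviews each word appears.
doc_freq = Counter(w for d in docs for w in set(d.split()))

def idf(word):
    """Smoothed inverse document frequency: rarer words get larger weights."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    """Weight each word count in the document by the word's IDF."""
    counts = Counter(doc.split())
    return {w: c * idf(w) for w, c in counts.items()}

weights = tfidf("movi not good")
for word, weight in weights.items():
    print(word, round(weight, 3))
```

'movi' occurs in every review and gets the lowest weight, while the rarer 'not' gets the highest. In scikit-learn, `TfidfVectorizer` could replace `CountVectorizer` to apply this kind of weighting.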