### Movie review sentiment analysis

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('IMDB Dataset.csv')

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
# Encode the label as an integer: 1 for positive, 0 for negative
df['Sentiment'] = df.sentiment.apply(lambda x: 1 if x == 'positive' else 0)

In [5]:
df.head()

Unnamed: 0,review,sentiment,Sentiment
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [6]:
df.drop('sentiment', axis = 1, inplace = True)

In [7]:
df.head()

Unnamed: 0,review,Sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [8]:
df.Sentiment.value_counts()

1    25000
0    25000
Name: Sentiment, dtype: int64

In [9]:
df.shape

(50000, 2)

In [10]:
# The dataset is balanced: 25,000 reviews in each class

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review, df.Sentiment, test_size=0.2, random_state=1
)

In [12]:
X_train.shape

(40000,)

In [13]:
X_test.shape

(10000,)

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
# Learn the vocabulary on the training reviews only, then count tokens per review
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)

In [15]:
X_train_cv

<40000x93163 sparse matrix of type '<class 'numpy.int64'>'
	with 5471200 stored elements in Compressed Sparse Row format>

In [16]:
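# For inspection only; the dense array is not used below.
# At 40000 x 93163 int64 values it needs roughly 30 GB of memory.
X_train_cv_np = X_train_cv.toarray()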

In [17]:
X_train_cv_np

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [18]:
from sklearn.naive_bayes import MultinomialNB
model_1 = MultinomialNB()

In [19]:
model_1.fit(X_train_cv,y_train)

In [20]:
X_test_cv = cv.transform(X_test)

In [21]:
y_pred = model_1.predict(X_test_cv)

In [22]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.86      5044
           1       0.87      0.82      0.85      4956

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



In [38]:
review = ["The movie was quite entertaining. loved it.", "Complete waste of time and money"]

In [39]:
review_count = cv.transform(review)

In [40]:
model_1.predict(review_count)

array([1, 0], dtype=int64)

In [41]:
model_1.predict_proba(review_count)

array([[0.2143717 , 0.7856283 ],
       [0.97842078, 0.02157922]])

In [42]:
# The same model as a Pipeline, which bundles vectorization and classification


In [44]:
from sklearn.pipeline import Pipeline
model_2 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB()),
])

model_2.fit(X_train,y_train)

In [45]:
y_pred = model_2.predict(X_test)

In [46]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.86      5044
           1       0.87      0.82      0.85      4956

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



### Now using Random Forest

In [47]:
from sklearn.ensemble import RandomForestClassifier

In [53]:
model_3 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('rf', RandomForestClassifier(n_estimators=50, criterion='entropy')),
])

In [54]:
model_3.fit(X_train,y_train)

In [55]:
y_pred = model_3.predict(X_test)

In [57]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84      5044
           1       0.83      0.84      0.84      4956

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



In [58]:
### Using the KNN algorithm

In [60]:
from sklearn.neighbors import KNeighborsClassifier

model_4 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean')),
])

In [61]:
model_4.fit(X_train,y_train)

In [63]:
y_pred = model_4.predict(X_test)

In [64]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.65      0.66      0.66      5044
           1       0.65      0.64      0.65      4956

    accuracy                           0.65     10000
   macro avg       0.65      0.65      0.65     10000
weighted avg       0.65      0.65      0.65     10000



Machine learning algorithms do not work on text directly, so we need to convert the text into numeric vectors and feed those to the models during training.

Here we convert each review into a very high-dimensional numeric vector using the bag-of-words technique: `CountVectorizer` produced 93,163 features, one per distinct token in the training set.
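To make the bag-of-words idea concrete, here is a minimal standard-library sketch. The three-document corpus and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are made up for illustration; the tokenization only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters) and is not a replacement for it.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(docs):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(doc, vocab):
    # One count per vocabulary entry; tokens outside the vocabulary are ignored.
    counts = Counter(tokenize(doc))
    return [counts[tok] for tok in vocab]

# Hypothetical toy corpus, not taken from the IMDB data.
docs = ["loved the movie", "hated the movie", "loved loved it"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(list(vocab))
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each document becomes one row with as many columns as there are distinct tokens, which is why a corpus of 40,000 reviews ends up with tens of thousands of mostly zero columns.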

Models like K-Nearest Neighbours (KNN) do not work well with high-dimensional data. With tens of thousands of dimensions, distances between points tend to become similar, so the nearest neighbours are less informative, and computing a distance across every dimension is expensive. This matches the results above: KNN reaches only 0.65 accuracy, against 0.85 for Naive Bayes and 0.84 for Random Forest.
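The effect of dimensionality on distances can be illustrated with a small experiment. This is a sketch under a strong assumption: the points are drawn uniformly at random rather than taken from the review vectors, and `relative_contrast` is a helper invented for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    # (farthest - nearest) / nearest Euclidean distance from one random query point.
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest

for n_dims in (2, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

As the number of dimensions grows, the contrast shrinks, meaning the "nearest" neighbours are barely nearer than any other point. Sparse count vectors behave differently from uniform noise in detail, so treat this as intuition rather than a measurement of the model above.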

Multinomial Naive Bayes suits text classification because the per-class word probabilities can be estimated directly from the bag-of-words counts and stored in a contingency table.
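The following standard-library sketch shows how such a table of per-class word counts turns into a classifier. It assumes whitespace tokenization, add-one (Laplace) smoothing with `alpha=1.0`, and a four-sentence toy corpus; the function names are hypothetical, and this is not how scikit-learn implements `MultinomialNB` internally.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    # Per-class word counts (the contingency table) and class frequencies.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {"log_prior": {}, "log_likelihood": {}, "vocab": vocab}
    for label, n_docs in class_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        # Smoothed total so that unseen (word, class) pairs keep a nonzero probability.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
        }
    return model

def predict(model, doc):
    # Add the log prior and the log likelihood of each known word; skip unknown words.
    scores = {}
    for label, prior in model["log_prior"].items():
        ll = model["log_likelihood"][label]
        words = [w for w in doc.lower().split() if w in model["vocab"]]
        scores[label] = prior + sum(ll[w] for w in words)
    return max(scores, key=scores.get)

# Hypothetical toy corpus: 1 = positive, 0 = negative.
docs = ["loved the movie", "great fun loved it", "waste of time", "boring waste of money"]
labels = [1, 1, 0, 0]
model = train_multinomial_nb(docs, labels)

print(predict(model, "loved it"))
print(predict(model, "complete waste of money"))
```

Training is a single counting pass over the corpus, which is part of why the approach scales well to large vocabularies.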

Random Forest uses bootstrapping (row sampling, combined with random feature sampling) across many decision trees, which reduces the high variance and overfitting that a single tree shows on high-dimensional data. It also exposes feature importances for words, which helps show which words separate the classes.
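The two kinds of sampling can be sketched on their own, without building any trees. The helper names and the row count of 1,000 are assumptions for illustration; the feature count is the vocabulary size reported by the vectorizer earlier, and the square-root rule is a common default for classification forests rather than something this notebook sets explicitly.

```python
import math
import random

def bootstrap_rows(n_rows, rng):
    # Sample row indices with replacement: some rows repeat, others are left out.
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, rng):
    # Consider about sqrt(n_features) distinct candidate features.
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(0)
n_rows, n_features = 1000, 93163  # n_rows is an arbitrary illustrative value

rows = bootstrap_rows(n_rows, rng)
features = random_feature_subset(n_features, rng)

# Fraction of distinct rows in one bootstrap sample (roughly 1 - 1/e on average).
unique_fraction = len(set(rows)) / n_rows
print(len(features), round(unique_fraction, 3))
```

Because every tree sees a different sample of rows and only a small slice of the vocabulary at a time, the trees make partly independent errors, and averaging their votes lowers the variance.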

Machine learning is an empirical, trial-and-error process: we try the candidate algorithms and select the one that gives good results while satisfying requirements such as latency and interpretability.