<a href="https://colab.research.google.com/github/lavanya957/imbd_reviews_sentiment_analysis/blob/main/IMBD_SENTIMENT_ANALYSISipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np

## **IMPORT DATASET**

In [9]:
df=pd.read_csv("/content/IMDB Dataset.csv", on_bad_lines='skip',engine="python")

In [58]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [10]:
df.shape

(3980, 2)

# **CLEANING AND PREPROCESSING**

In [11]:
import re

replace_no_space=re.compile(r"[.;:!\'?,\"()\[\]]")
replace_with_space=re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
def preprocess_reviews(reviews):
  reviews=[replace_no_space.sub("",line.lower()) for line in reviews]
  reviews=[replace_with_space.sub(" ",line) for line in reviews]

  return "".join(reviews)

df["review"]=df["review"].apply(preprocess_reviews)
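Because `.apply` passes one review string at a time, the list comprehensions in `preprocess_reviews` iterate over single characters, so the `<br /><br />` pattern can never match and only the `/` is replaced. That is why the output below still shows `<br ><br >`. A minimal sketch of a string-level variant follows; `preprocess_review` and the constant names are hypothetical, and the patterns are the same ones defined above:

```python
import re

# Same patterns as in the cell above, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"[.;:!\'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_review(review):
    """Clean a single review string (hypothetical string-level variant)."""
    review = review.lower()
    # Delete punctuation outright; '<', '/' and '>' are untouched here,
    # so the line-break tags are still intact for the next step.
    review = REPLACE_NO_SPACE.sub("", review)
    # Replace doubled <br /> tags, hyphens and slashes with a space.
    review = REPLACE_WITH_SPACE.sub(" ", review)
    return review

cleaned = preprocess_review("A wonderful little production. <br /><br />The film")
print(cleaned)
```

It would be applied the same way, for example `df["review"].apply(preprocess_review)`. Note that a single, undoubled `<br />` would still not match the tag pattern.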



In [14]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production <br ><br >the ...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


In [12]:
# map the string labels to integers; assigning the result back avoids chained inplace replacement on a column
df["sentiment"]=df["sentiment"].map({"positive":1,"negative":0})

In [16]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production <br ><br >the ...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically theres a family where a little boy j...,0
4,petter matteis love in the time of money is a ...,1


# **SPLITTING INTO TRAIN DATA AND TEST DATA**

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
x=df["review"]
y=df["sentiment"]

In [15]:
x.head()

0    one of the other reviewers has mentioned that ...
1    a wonderful little production <br  ><br  >the ...
2    i thought this was a wonderful way to spend ti...
3    basically theres a family where a little boy j...
4    petter matteis love in the time of money is a ...
Name: review, dtype: object

In [16]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=1)

In [17]:
x_train.shape

(2985,)

In [18]:
y_train.shape

(2985,)

In [19]:
x_train.head()

996     i hated it i hate self aware pretentious inani...
2264    the penultimate episode of star treks third se...
592     as is often the case when you attempt to take ...
3227    i was forced to read this sappy love story bet...
2619    many other viewers are saying that this is not...
Name: review, dtype: object

In [20]:
x_test.head()

187     while i count myself as a fan of the babylon 5...
1380    this movie which i just discovered at the vide...
1550    <br  ><br  >i saw the glacier fox in the theat...
364     it got to be a running joke around bonanza abo...
2346    water lilies is a well made first film from fr...
Name: review, dtype: object

In [21]:
y_train.head()

996     0
2264    1
592     1
3227    0
2619    1
Name: sentiment, dtype: int64

In [22]:
y_test.head()

187     0
1380    0
1550    1
364     1
2346    1
Name: sentiment, dtype: int64

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer(binary=True)
cv.fit(x_train)
x_cv=cv.transform(x_train)
x_test_cv=cv.transform(x_test)

# **BUILD CLASSIFIER**

In [91]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

x_train_cv,x_val,y_train_cv,y_val=train_test_split(x_cv,y_train,train_size=0.75)

for c in [0.01,0.05,0.25,0.5,1]:
  lr=LogisticRegression(C=c,solver="lbfgs",max_iter=1000)
  lr.fit(x_train_cv,y_train_cv)
  print("Accuracy for C=%s: %s" % (c,accuracy_score(y_val,lr.predict(x_val))))




Accuracy for C=0.01: 0.821954484605087
Accuracy for C=0.05: 0.8353413654618473
Accuracy for C=0.25: 0.8326639892904953
Accuracy for C=0.5: 0.8366800535475234
Accuracy for C=1: 0.8326639892904953


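In the run shown above, C=0.5 gives the highest validation accuracy (0.8367), with C=0.05 close behind (0.8353). Because the validation split is not seeded, these figures change between runs; the final model below uses C=0.05.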
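Reading the best value off a printed list is easy to get wrong when the scores are this close. A minimal sketch, with the accuracies copied from the output above into a hypothetical `val_accuracy` dictionary, picks the maximum programmatically:

```python
# Validation accuracies copied from the printed output above,
# keyed by the regularization strength C.
val_accuracy = {
    0.01: 0.821954484605087,
    0.05: 0.8353413654618473,
    0.25: 0.8326639892904953,
    0.5: 0.8366800535475234,
    1: 0.8326639892904953,
}

# Pick the C whose validation accuracy is largest.
best_c = max(val_accuracy, key=val_accuracy.get)
print("best C: %s (accuracy %s)" % (best_c, val_accuracy[best_c]))
```

In the notebook itself the dictionary could be filled inside the loop instead of being typed by hand.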


# **TRAIN FINAL MODEL**

In [92]:
final_model=LogisticRegression(C=0.05,solver="lbfgs",max_iter=1000)
final_model.fit(x_train_cv,y_train_cv)
print("final accuracy: %s" %accuracy_score(y_test,final_model.predict(x_test_cv)))

final accuracy: 0.8291457286432161



As a sanity check, let's look at the 5 most discriminating words for positive and negative reviews. These are the features with the largest and smallest coefficients, respectively.

In [27]:
feature_to_coef={
    word:coef for word, coef in zip(cv.get_feature_names_out(),final_model.coef_[0])
}

for best_positive in sorted(feature_to_coef.items(),key=lambda x:x[1], reverse=True)[:5]:
  print(best_positive)

for best_negative in sorted(feature_to_coef.items(),key=lambda x:x[1])[:5]:
  print(best_negative)

('excellent', 0.600279087909803)
('great', 0.5644817751380315)
('wonderful', 0.4074706742872957)
('definitely', 0.3800837855057107)
('perfect', 0.3710108049801905)
('worst', -0.6595806496234844)
('bad', -0.602289863093714)
('waste', -0.560977110556151)
('awful', -0.4242936214661599)
('boring', -0.37376711332946827)


# **TEXT PROCESSING**

In [28]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [29]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

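english_stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
def remove_stop_words(corpus):
  # corpus is a single review string; keep only the words that are not stop words
  return ' '.join([word for word in corpus.split() if word not in english_stop_words])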


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [54]:
df["review"]=df["review"].apply(remove_stop_words)


In [55]:
df["review"]

0       one reviewers mentioned watching 1 oz episode ...
1       wonderful little production <br ><br >the film...
2       thought wonderful way spend time hot summer we...
3       basically theres family little boy jake thinks...
4       petter matteis love time money visually stunni...
                              ...                        
3975    visually cluttered plot less incredibly mind n...
3976    okso film well knownand wasnt well publicisedi...
3977    reeves plays haji murad hero 1850s russia<br >...
3978    brother law wife brought movie one night watch...
3979    im big fan zombie movies admit zombie movies u...
Name: review, Length: 3980, dtype: object

**STEMMING**

In [32]:
from nltk.stem.porter import PorterStemmer

stemmer=PorterStemmer()  # create once and reuse for every review
def get_stemmed_test(corpus):
  # corpus is a single review string; stem each whitespace-separated word
  return ' '.join([stemmer.stem(word) for word in corpus.split()])

In [56]:
df["review"]=df["review"].apply(get_stemmed_test)

In [57]:
df["review"]

0       one review mention watch 1 oz episod youll hoo...
1       wonder littl product <br ><br >the film techni...
2       thought wonder way spend time hot summer weeke...
3       basic there famili littl boy jake think there ...
4       petter mattei love time money visual stun film...
                              ...                        
3975    visual clutter plot less incred mind numb rubb...
3976    okso film well knownand wasnt well publicisedi...
3977    reev play haji murad hero 1850 russia<br ><br ...
3978    brother law wife brought movi one night watch ...
3979    im big fan zombi movi admit zombi movi usual a...
Name: review, Length: 3980, dtype: object

**LEMMATIZATION**

In [35]:
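nltk.download('wordnet')  # download once, not once per review
from nltk.stem import WordNetLemmatizer

lemmatizer=WordNetLemmatizer()  # create once and reuse for every review
def get_lemmatized_text(corpus):
  # corpus is a single review string; lemmatize each whitespace-separated word
  return ' '.join([lemmatizer.lemmatize(word) for word in corpus.split()])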

In [None]:
df["review"]=df["review"].apply(get_lemmatized_text)

In [59]:
df["review"]

0       one review mention watch 1 oz episod youll hoo...
1       wonder littl product <br ><br >the film techni...
2       thought wonder way spend time hot summer weeke...
3       basic there famili littl boy jake think there ...
4       petter mattei love time money visual stun film...
                              ...                        
3975    visual clutter plot le incred mind numb rubbis...
3976    okso film well knownand wasnt well publicisedi...
3977    reev play haji murad hero 1850 russia<br ><br ...
3978    brother law wife brought movi one night watch ...
3979    im big fan zombi movi admit zombi movi usual a...
Name: review, Length: 3980, dtype: object

# **SPLITTING AFTER STEMMING AND LEMMATIZATION**

In [60]:
x=df["review"]
y=df["sentiment"]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=1)

In [73]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer=CountVectorizer(binary=True,ngram_range=(1,3))
ngram_vectorizer.fit(x_train)
x_ng=ngram_vectorizer.transform(x_train)


x_train_ng,x_val,y_train_ng,y_val=train_test_split(x_ng,y_train,train_size=0.75)

for c in [0.01,0.05,0.25,0.5,1]:
  lr=LogisticRegression(C=c, solver="lbfgs", max_iter=1000)
  lr.fit(x_train_ng,y_train_ng)
  print("Accuracy for C=%s: %s" %(c, accuracy_score(y_val,lr.predict(x_val))))



Accuracy for C=0.01: 0.8406961178045516
Accuracy for C=0.05: 0.8406961178045516
Accuracy for C=0.25: 0.8514056224899599
Accuracy for C=0.5: 0.8500669344042838
Accuracy for C=1: 0.8487282463186078


In [74]:
x_test_ng=ngram_vectorizer.transform(x_test)

In [75]:
final_ngram=LogisticRegression(C=0.5, solver="lbfgs", max_iter=1000)
final_ngram.fit(x_train_ng,y_train_ng)
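# score the refit final_ngram; the figure recorded below was obtained from lr, the last loop model (C=1)
print("final accuracy: %s" % accuracy_score(y_test,final_ngram.predict(x_test_ng)))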

final accuracy: 0.8231155778894472



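Considering 2- and 3-word sequences in addition to single words (`ngram_range=(1,3)`) raised the best validation accuracy from about 0.837 to about 0.851, roughly 1.5 percentage points. The test accuracy printed above (0.823), however, is slightly below the unigram baseline (0.829). This run also differs from the baseline in its stop-word removal, stemming, and lemmatization, so the change cannot be attributed to n-grams alone.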
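To make concrete what `ngram_range=(1,3)` adds to the vocabulary, here is a minimal pure-Python sketch of word n-gram extraction. The helper `word_ngrams` is hypothetical and only illustrates the idea; it is not how `CountVectorizer` is implemented:

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Three stemmed tokens in the style of the reviews above.
tokens = "visual clutter plot".split()
features = word_ngrams(tokens)
print(features)
```

Three tokens already produce six features, which is why the n-gram vocabulary grows much faster than the unigram one.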

# **REPRESENTATION**


While a binary bag-of-words representation can work well, the vector can encode more information, such as how often each term occurs (word counts) or how distinctive it is across reviews (TF-IDF).
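The three representations compared in this section can be sketched by hand on a toy corpus. The documents below are hypothetical, and the TF-IDF weights use the smoothed formula that scikit-learn documents for its defaults (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization); this is an illustration rather than a replacement for the vectorizers:

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-stemmed reviews.
docs = ["great great movi", "bad movi", "great act"]
tokenized = [d.split() for d in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Word counts: how often each vocabulary term occurs in each document.
counts = [[Counter(doc)[word] for word in vocab] for doc in tokenized]
# Binary: 1 if the term occurs at all, 0 otherwise.
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# Smoothed inverse document frequency per term.
n_docs = len(docs)
doc_freq = [sum(row[j] for row in binary) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(row):
    """Scale a vector to unit Euclidean length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# TF-IDF: count times idf, then normalized per document.
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts[0], binary[0])
print([round(v, 3) for v in tfidf[0]])
```

Terms that occur in fewer documents (here `act` and `bad`) receive a larger idf weight than terms that occur in most of them (`great`, `movi`).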

**WORD COUNTS**

In [88]:
wc_vectorizer=CountVectorizer(binary=False)
wc_vectorizer.fit(x_train)
x_wc=wc_vectorizer.transform(x_train)
x_test_wc=wc_vectorizer.transform(x_test)

x_train_wc,x_val,y_train_wc,y_val=train_test_split(x_wc,y_train,train_size=0.75)

for c in [0.01,0.05,0.25,0.5,1]:
  lr=LogisticRegression(C=c, solver="lbfgs", max_iter=1000)
  lr.fit(x_train_wc,y_train_wc)
  print("Accuracy for C=%s: %s" %(c,accuracy_score(y_val,lr.predict(x_val))))

Accuracy for C=0.01: 0.8192771084337349
Accuracy for C=0.05: 0.8406961178045516
Accuracy for C=0.25: 0.8406961178045516
Accuracy for C=0.5: 0.8406961178045516
Accuracy for C=1: 0.8353413654618473


In [89]:
final_wc=LogisticRegression(C=1, solver="lbfgs", max_iter=1000)
final_wc.fit(x_train_wc,y_train_wc)
print("Final accuracy: %s" % accuracy_score(y_test,final_wc.predict(x_test_wc)))

Final accuracy: 0.821105527638191


**TF-IDF**

In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer=TfidfVectorizer()
tfidf_vectorizer.fit(x_train)
x_tfidf=tfidf_vectorizer.transform(x_train)
x_test_tfidf=tfidf_vectorizer.transform(x_test)

x_train_tf,x_val,y_train_tf,y_val=train_test_split(x_tfidf,y_train,train_size=0.75)

for c in [0.01,0.05,0.25,0.5,1]:
  lr=LogisticRegression(C=c, solver="lbfgs", max_iter=1000)
  lr.fit(x_train_tf,y_train_tf)
  print("Accuracy for C=%s: %s" %(c,accuracy_score(y_val,lr.predict(x_val))))

Accuracy for C=0.01: 0.7443105756358769
Accuracy for C=0.05: 0.8072289156626506
Accuracy for C=0.25: 0.8406961178045516
Accuracy for C=0.5: 0.8420348058902276
Accuracy for C=1: 0.8500669344042838


In [87]:
final_tfidf=LogisticRegression(C=1, solver="lbfgs", max_iter=1000)
final_tfidf.fit(x_train_tf,y_train_tf)
print("Final accuracy: %s" % accuracy_score(y_test,final_tfidf.predict(x_test_tfidf)))

Final accuracy: 0.8251256281407036


# **ALGORITHMS**

**SUPPORT VECTOR MACHINES**

In [71]:
from sklearn.svm import LinearSVC

ngram_vectorizer=CountVectorizer(binary=True,ngram_range=(1,2))
ngram_vectorizer.fit(x_train)
x_ng=ngram_vectorizer.transform(x_train)
x_test_ng=ngram_vectorizer.transform(x_test)

x_train_ng,x_val,y_train_ng,y_val=train_test_split(x_ng,y_train,train_size=0.75)

for c in [0.01,0.05,0.25,0.5,1]:
  svm=LinearSVC(C=c, max_iter=1000)
  svm.fit(x_train_ng,y_train_ng)
  print("Accuracy for C=%s: %s" %(c, accuracy_score(y_val,svm.predict(x_val))))


Accuracy for C=0.01: 0.8473895582329317
Accuracy for C=0.05: 0.8473895582329317
Accuracy for C=0.25: 0.8473895582329317
Accuracy for C=0.5: 0.8500669344042838
Accuracy for C=1: 0.8500669344042838


In [72]:
final_svm_ngram=LinearSVC(C=0.01, max_iter=1000)
final_svm_ngram.fit(x_train_ng,y_train_ng)
print("Final accuracy: %s" % accuracy_score(y_test,final_svm_ngram.predict(x_test_ng)))


Final accuracy: 0.8321608040201005


# **FINAL MODEL**


Among the configurations tried here, removing a small set of stop words and combining TF-IDF features with a support vector machine classifier gives the best results.

In [84]:

stop_words=["in","of","at","a","the"]
tfidf_vectorizer=TfidfVectorizer(stop_words=stop_words)
tfidf_vectorizer.fit(x_train)
x_tfidf=tfidf_vectorizer.transform(x_train)
x_test_tfidf=tfidf_vectorizer.transform(x_test)

x_train_tf,x_val,y_train_tf,y_val=train_test_split(x_tfidf,y_train,train_size=0.75)

for c in [0.01,0.05,0.25,0.5,1]:
  svm=LinearSVC(C=c, max_iter=1000)
  svm.fit(x_train_tf,y_train_tf)
  print("Accuracy for C=%s: %s" %(c, accuracy_score(y_val,svm.predict(x_val))))


Accuracy for C=0.01: 0.8165997322623829
Accuracy for C=0.05: 0.8393574297188755
Accuracy for C=0.25: 0.8514056224899599
Accuracy for C=0.5: 0.85809906291834
Accuracy for C=1: 0.856760374832664


In [85]:
final_svm_tfidf=LinearSVC(C=1, max_iter=1000)
final_svm_tfidf.fit(x_train_tf,y_train_tf)
print("Final accuracy: %s" % accuracy_score(y_test,final_svm_tfidf.predict(x_test_tfidf)))


Final accuracy: 0.8341708542713567
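The vectorizer and classifier used for the final model can also be bundled into a single scikit-learn pipeline, so that the same fitted vocabulary is applied to any new text. The following is a minimal sketch, not the notebook's method: the toy reviews and labels are hypothetical stand-ins for `x_train` and `y_train`, and the stop-word list and `C` value are the ones used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for x_train / y_train.
toy_reviews = [
    "a wonderful and excellent film",
    "the worst waste of time",
    "great acting in a perfect story",
    "boring plot and awful acting",
]
toy_labels = [1, 0, 1, 0]

# TF-IDF features followed by a linear SVM, fitted as one estimator.
model = make_pipeline(
    TfidfVectorizer(stop_words=["in", "of", "at", "a", "the"]),
    LinearSVC(C=1, max_iter=1000),
)
model.fit(toy_reviews, toy_labels)

# New text goes through the same vectorizer automatically.
predictions = model.predict(["an excellent story", "a boring waste"])
print(predictions)
```

With the real data, `model.fit(x_train, y_train)` followed by `model.score(x_test, y_test)` would replace the separate `fit`, `transform`, and `accuracy_score` steps.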
