![](https://i.ytimg.com/vi/38RemgSBG0w/maxresdefault.jpg)


# Introduction

##### The IMDB dataset contains 50K movie reviews for natural language processing or text analytics. It is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative using either classification or deep learning algorithms. For more information on the dataset, see the following link:
http://ai.stanford.edu/~amaas/data/sentiment/

##### Today we will preprocess this dataset using regular expressions and Beautiful Soup, visualize it using matplotlib, seaborn and wordcloud, vectorize it using TfidfVectorizer, and perform sentiment classification using ensemble methods.

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

In [None]:
df

In [None]:
df.shape

The DataFrame has 50,000 rows and 2 columns.

In [None]:
df.isnull().sum()

There are no null values

In [None]:
df['sentiment'].value_counts()

The dataset is balanced: there are 25,000 reviews each for the negative and positive sentiments.

In [None]:
##New column for length of review
df['length']=df['review'].str.len()

In [None]:
#Show full column contents without truncation
pd.set_option('display.max_colwidth', None)

In [None]:
df.head()

In [None]:
#Removing html tags
def removehtml(text):
    soup=BeautifulSoup(text,'html.parser')  #name the parser explicitly to avoid a warning
    return soup.get_text()

df['review']=df['review'].apply(removehtml)

In [None]:
#Converting into lower case
df['review']=df['review'].str.lower()

In [None]:
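#Replacing everything except letters and digits with a space
#regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['review']=df['review'].str.replace(r'[^a-zA-Z0-9]',' ',regex=True)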
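
To see what this pattern does to a single review, here is a minimal sketch using Python's built-in `re` module. The sample string is made up for illustration and is not taken from the dataset. Note that apostrophes are replaced too, so a word like "didn't" ends up as two tokens.

```python
import re

# Hypothetical sample review (an assumption, not from the dataset)
sample = "i didn't like it... 2/10!"

# Same character class as the pandas call: every character that is
# not a letter or a digit becomes a space
cleaned = re.sub(r'[^a-zA-Z0-9]', ' ', sample)

# split() collapses the runs of spaces left behind by the substitution
tokens = cleaned.split()
print(tokens)
```

The leftover one-letter fragments such as `t` are one reason the later stop word step is useful.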

In [None]:
#If the stop word corpus is missing, run nltk.download('stopwords') once first
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
sw=set(stopwords.words('english'))
snow=SnowballStemmer('english')

In [None]:
#Removing stop words and performing stemming
def stemming(text):
    text=' '.join([snow.stem(word) for word in text.split() if word not in sw])
    return text

df['review']=df['review'].apply(stemming)

In [None]:
df.head()

In [None]:
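#Label encoding: negative -> 0, positive -> 1
#Assigning the result back avoids in-place modification of a column selection
df['sentiment']=df['sentiment'].map({'negative':0,'positive':1})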

In [None]:
df.head()

In [None]:
#New column (clean length) after removal of punctuation and stop words
df['clean_length']=df['review'].str.len()

In [None]:
#Review length distribution before cleaning
f,ax=plt.subplots(1,2,figsize=(15,8))

#histplot with kde=True is used here because distplot is deprecated in seaborn
sns.histplot(df[df['sentiment']==1]['length'],bins=20,kde=True,stat='density',ax=ax[0],label='Positive review distribution',color='r')

ax[0].set_xlabel('Positive review length')
ax[0].legend()

sns.histplot(df[df['sentiment']==0]['length'],bins=20,kde=True,stat='density',ax=ax[1],label='Negative review distribution',color='b')

ax[1].set_xlabel('Negative review length')
ax[1].legend()

plt.show()

In [None]:
#Review length distribution after cleaning
f,ax=plt.subplots(1,2,figsize=(15,8))

#histplot with kde=True is used here because distplot is deprecated in seaborn
sns.histplot(df[df['sentiment']==1]['clean_length'],bins=20,kde=True,stat='density',ax=ax[0],label='Positive review distribution',color='r')

ax[0].set_xlabel('Positive review length')
ax[0].legend()

sns.histplot(df[df['sentiment']==0]['clean_length'],bins=20,kde=True,stat='density',ax=ax[1],label='Negative review distribution',color='b')

ax[1].set_xlabel('Negative review length')
ax[1].legend()

plt.show()

In [None]:
from wordcloud import WordCloud

In [None]:
#Most frequent words in positive reviews
positive=df['review'][df['sentiment']==1]
spamcloud=WordCloud(width=1200,height=800,background_color='white',max_words=25).generate(' '.join(positive))

plt.figure(figsize=(12,8),facecolor='r')
plt.imshow(spamcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Positive reviews feature words like well, life, love, play, etc.

In [None]:
#Most frequent words in negative reviews
negative=df['review'][df['sentiment']==0]
spamcloud=WordCloud(width=1200,height=800,background_color='white',max_words=25).generate(' '.join(negative))

plt.figure(figsize=(12,8),facecolor='r')
plt.imshow(spamcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Negative reviews feature words like see, plot, etc.
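
A word cloud sizes words by how often they occur, so the same information can be read off as exact counts. The sketch below uses `collections.Counter` from the standard library on a tiny made-up list of cleaned reviews; the list is an assumption standing in for the `negative` series and the resulting counts say nothing about the real dataset.

```python
from collections import Counter

# Hypothetical cleaned reviews (illustration only, not real data)
reviews = ['bad plot bad act', 'plot make no sens', 'bad film']

# Count every whitespace-separated token across all reviews
counts = Counter(word for review in reviews for word in review.split())

# The two most frequent words with their counts
top = counts.most_common(2)
print(top)
```

On the real data, passing `negative` or `positive` in place of `reviews` should give a ranked list to compare against the clouds.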

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
review=df['review']
tfidf=TfidfVectorizer(ngram_range=(1,3))
review=tfidf.fit_transform(review)
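
Setting `ngram_range=(1,3)` means every single word, every pair of adjacent words and every triple of adjacent words becomes a feature, which makes the feature matrix very wide. The sketch below enumerates those n-grams for one short, made-up cleaned review. The helper `word_ngrams` is hypothetical and only approximates the vectorizer: it splits on whitespace, whereas `TfidfVectorizer` by default also drops one-character tokens.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the review
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

# Hypothetical stemmed review (illustration only)
example = word_ngrams('great act great stori')
print(example)
print(len(set(example)), 'distinct features from 4 words')
```

Even four words yield eight distinct features here, which gives a feel for how quickly the vocabulary grows over 50,000 reviews.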

In [None]:
sentiment=df['sentiment']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
xtrain,xtest,ytrain,ytest=train_test_split(review,sentiment,test_size=0.3,random_state=7)
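
One caveat: when the vectorizer is fitted on all reviews before the split, the vocabulary and document frequencies of the test reviews influence the features. A possible alternative, sketched here as an assumption rather than as the method used in this notebook, is to split the raw text first and wrap the vectorizer and the model in a scikit-learn pipeline so that TF-IDF is fitted on the training part only. The toy corpus and labels are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned reviews
texts = ['great film love', 'bad plot wast time', 'love act great stori',
         'bore bad act', 'great fun love', 'wast bad bore',
         'love great stori', 'bad bore plot']
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so the test reviews stay unseen
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=7, stratify=labels)

# fit() learns the TF-IDF vocabulary from x_train only;
# predict() reuses that vocabulary to transform x_test
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    RandomForestClassifier(n_estimators=50, random_state=7))
pipe.fit(x_train, y_train)
preds = pipe.predict(x_test)
print(len(preds), 'predictions')
```

With a dataset as large as this one the difference in scores may be small, but the pipeline keeps the evaluation honest.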

### Random Forest

In [None]:
model=RandomForestClassifier()
model.fit(xtrain,ytrain)

In [None]:
p=model.predict(xtest)

In [None]:
#sklearn metrics expect the true labels first, then the predictions
print('Accuracy score', accuracy_score(ytest,p))
print('-----------------------------------------')
print('Confusion Matrix')
print(confusion_matrix(ytest,p))
print('-----------------------------------------')
print('Classification Report')
print(classification_report(ytest,p))
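
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy does not depend on that order, but precision and recall do. The sketch below computes the three numbers by hand for the positive class on made-up labels; `binary_metrics` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels and predictions (illustration only)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

correct_order = binary_metrics(y_true, y_pred)
swapped_order = binary_metrics(y_pred, y_true)
print(correct_order)
print(swapped_order)
```

Swapping the arguments leaves accuracy unchanged but exchanges precision and recall, which is the same as transposing the confusion matrix.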

### XGBoost

In [None]:
model=XGBClassifier(verbosity=0)
model.fit(xtrain,ytrain)

In [None]:
p=model.predict(xtest)

In [None]:
#sklearn metrics expect the true labels first, then the predictions
print('Accuracy score', accuracy_score(ytest,p))
print('-----------------------------------------')
print('Confusion Matrix')
print(confusion_matrix(ytest,p))
print('-----------------------------------------')
print('Classification Report')
print(classification_report(ytest,p))

# Conclusion

XGBoost seems to be slightly better than Random Forest. We could also have used TextBlob to correct spelling in the text, but that would require a more powerful machine.