# Natural Language Processing
It analyses the movie review entered by a user for any specific movie and analyses what is the sentiment of the review.
It helps the companies rate the movie and understand crowd sentiment regarding it. 
Sentiment analysis is a natural language processing problem where text is understood and the underlying intent is predicted.

## Dataset description:
    
IMDB dataset having 50K movie reviews for natural language processing or Text analytics.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. 
We provide a set of 25,000 highly polar movie reviews for training and 25,000 for testing.
So, predict the number of positive and negative reviews using either classification or deep learning algorithms.

## Importing the essential libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing the dataset

In [2]:
data=pd.read_csv("IMDB Dataset.csv")
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
# check for any nan or missing values
data.isnull().any()

review       False
sentiment    False
dtype: bool

In [4]:
# describe the dataset
data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,negative
freq,5,25000


## Tokenization

In [5]:
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.tokenize.toktok import ToktokTokenizer

In [6]:
#Tokenization of text
tokenizers=ToktokTokenizer()
#Setting English stopwords
stopwords=nltk.corpus.stopwords.words('english')

In [7]:
#removal of noise from the data
import re,string,unicodedata
from bs4 import BeautifulSoup
def noise_removal(text):
    soup=BeautifulSoup(text,"html.parser")
    text=soup.get_text()
    text=re.sub("\[[^]]*\]","",text)
    return text

In [8]:
#Apply function on review column
data['review']=data['review'].apply(noise_removal)
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Stemming

In [9]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()
def stemming(text):
    text=[ps.stem(word) for word in text.split()]
    text=" ".join(text)
    return text

In [10]:
data["review"]=data["review"].apply(stemming)
data.head()

Unnamed: 0,review,sentiment
0,one of the other review ha mention that after ...,positive
1,A wonder littl production. the film techniqu i...,positive
2,I thought thi wa a wonder way to spend time on...,positive
3,basic there' a famili where a littl boy (jake)...,negative
4,"petter mattei' ""love in the time of money"" is ...",positive


## Lemmatization

In [11]:
from nltk.stem import WordNetLemmatizer 
lemmati = WordNetLemmatizer() 
def lemmatization(text):
    text=[lemmati.lemmatize(word) for word in text.split()]
    text=" ".join(text)
    return text

In [12]:
data["review"]=data["review"].apply(lemmatization)
data.head()

Unnamed: 0,review,sentiment
0,one of the other review ha mention that after ...,positive
1,A wonder littl production. the film techniqu i...,positive
2,I thought thi wa a wonder way to spend time on...,positive
3,basic there' a famili where a littl boy (jake)...,negative
4,"petter mattei' ""love in the time of money"" is ...",positive


## Removing Stop Words

In [13]:
# set of stop words
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))


In [14]:
def stopwords(text):
    text=text.lower()
    text=text.split()
    text=[word for word in text if word not in stop_words]
    text=" ".join(text)
    return text

In [15]:
data["review"]=data["review"].apply(stopwords)
data.head()

Unnamed: 0,review,sentiment
0,one review ha mention watch 1 oz episod hooked...,positive
1,wonder littl production. film techniqu veri un...,positive
2,thought thi wa wonder way spend time hot summe...,positive
3,basic there' famili littl boy (jake) think the...,negative
4,"petter mattei' ""love time money"" visual stun f...",positive


## Train Test split

In [16]:
# Splitting the train data
train_data=data.review[:30000]
# splitting the test data
test_data=data.review[30000:]

## Bag of Words

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
train_cv=cv.fit_transform(train_data)
test_cv=cv.transform(test_data)
print('BOW_cv_train:',train_cv.shape)
print('BOW_cv_test:',test_cv.shape)

BOW_cv_train: (30000, 80393)
BOW_cv_test: (20000, 80393)


## TF_IDF

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer()
train_tf=tf.fit_transform(train_data)
test_tf=tf.transform(test_data)
print('TF_IDF_train:',train_tf.shape)
print('TF_IDF_test:',test_tf.shape)

TF_IDF_train: (30000, 80393)
TF_IDF_test: (20000, 80393)


## Label Encoding

In [19]:
from sklearn.preprocessing import LabelBinarizer
label=LabelBinarizer()
sentiment_data=label.fit_transform(data["sentiment"])
sentiment_data.shape

(50000, 1)

## Training the Logistic Regression Model

In [20]:
train_data=data.sentiment[:30000]
test_data=data.sentiment[30000:]

In [21]:
from sklearn.linear_model import LogisticRegression,SGDClassifier
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
# fitting the model for bag of words
lr_bow=lr.fit(train_cv,train_data)
print(lr_bow)
#Fitting the model for tfidf features
lr_tfidf=lr.fit(train_tf,train_data)
print(lr_tfidf)

LogisticRegression(C=1, max_iter=500, random_state=42)
LogisticRegression(C=1, max_iter=500, random_state=42)


## Predicting the results

In [22]:
#Predicting the model for bag of words
lr_bow_predict=lr.predict(test_cv)
print(lr_bow_predict)

['positive' 'negative' 'negative' ... 'positive' 'negative' 'negative']


In [23]:
#Predicting the model for tf_idf
lr_tf_predict=lr.predict(test_tf)
print(lr_tf_predict)

['positive' 'negative' 'negative' ... 'positive' 'negative' 'negative']


## evaluation of  model

In [24]:
#Accuracy score for bag of words
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
lr_bow_score=accuracy_score(test_data,lr_bow_predict)
print("lr_bow_score :",lr_bow_score)

lr_bow_score : 0.867


In [25]:
lr_tf_score=accuracy_score(test_data,lr_tf_predict)
print("lr_bow_score :",lr_tf_score)

lr_bow_score : 0.88705
