**NLP stands for Natural Language Processing, the task of mining text for meaningful insights such as sentiment, named entities, topics of discussion, and even summaries of the text.**

With this IMDB dataset, we will perform sentiment analysis.

First, we will apply some text-cleaning (pre-processing) techniques, since textual data is free-form.

Since we cannot feed raw text to a machine learning model directly, we have to convert it into a numerical form (a vector representation). We will explore different vectorization / text-encoding techniques.

***Importing basic libraries***

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

***Loading dataset***

In [None]:
dataset = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

print(dataset.shape)
dataset.head(10)


In [None]:
dataset.describe()

In [None]:
dataset.info()

**There are two columns: review and sentiment.
Sentiment is the target column that we will predict.**

In [None]:
dataset['sentiment'].value_counts()

*The dataset is balanced: it has an equal number of positive and negative reviews.*

**Taking one review as a sample to understand why the text needs cleaning.**

In [None]:
review = dataset['review'].iloc[1]
review

**In general, an NLP task involves the following steps, starting with text cleaning -**
1. Removal of HTML content such as "\<br>".
2. Removal of punctuation and special characters.
3. Removal of stopwords such as "the", "when" and "how", which do not offer much insight.
4. Stemming/lemmatization to reduce the different forms of a word to a common base form.
5. Vectorization - encoding the cleaned text in numerical form.
6. Fitting the data to an ML model.




**Applying the techniques to sample data to understand the process first -**

1. Removal of HTML contents

In [None]:
!pip install beautifulsoup4


In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(review, 'html.parser')
review = soup.get_text()
review

**The HTML tags are now removed. Next, remove everything except lowercase/uppercase letters using regular expressions.**

In [None]:
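import re
# Remove text enclosed in square brackets, e.g. "[spoiler]"
review = re.sub(r'\[[^]]*\]', ' ', review)
# Replace every character that is not a letter with a space
review = re.sub('[^a-zA-Z]', ' ', review)
review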
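
To see what the two substitutions do, here is a minimal sketch on a made-up string (the sample text is hypothetical, not taken from the dataset). It assumes the first pattern is meant to strip text enclosed in square brackets.

```python
import re

# Hypothetical sample, not from the IMDB data
sample = "It's a 10/10 film [spoiler alert] <3"

# Step 1: drop anything enclosed in square brackets
no_brackets = re.sub(r'\[[^]]*\]', ' ', sample)

# Step 2: replace every character that is not a letter with a space
letters_only = re.sub('[^a-zA-Z]', ' ', no_brackets)

# Splitting on whitespace shows which tokens survive
tokens = letters_only.split()
print(tokens)
```

Note that the apostrophe in "It's" is replaced too, so the contraction is split into two tokens, and the rating "10/10" disappears entirely.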

**Now converting everything to lowercase**

In [None]:
review = review.lower()
review

**Next comes the removal of stopwords (common words). For this we first create a list of words using .split().**

**Note: called without arguments, split() splits a string on whitespace and returns a list of words.**

In [None]:
review = review.split()
review

In [None]:
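import nltk
from nltk.corpus import stopwords

# Assumption: the stopword list may not be available locally yet, so try to fetch it
nltk.download('stopwords', quiet=True)

# Build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
review = [word for word in review if word not in stop_words]
review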
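
One caveat for sentiment analysis: NLTK's English stopword list includes negations such as "not" and "no", so removing every stopword can turn "not good" into "good". The sketch below shows one possible workaround, which is to keep negations out of the stopword set. It uses a small hand-written set (`demo_stop_words`, an assumption for illustration) instead of the NLTK list so that it runs without any downloads.

```python
# Hypothetical mini stopword list standing in for stopwords.words('english')
demo_stop_words = {"the", "was", "not", "no", "a", "it", "at", "all"}

# Words we may want to keep because they flip the sentiment
negations = {"not", "no"}
stop_words_keep_neg = demo_stop_words - negations

tokens = "the movie was not good at all".split()

# Removing every stopword loses the negation
filtered_all = [w for w in tokens if w not in demo_stop_words]

# Keeping negations preserves the meaning of the phrase
filtered_keep_neg = [w for w in tokens if w not in stop_words_keep_neg]

print(filtered_all)
print(filtered_keep_neg)
```

Whether this helps is an empirical question; the rest of this notebook removes the full stopword list.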

**Stemming/Lemmatization**
We will apply both techniques and observe the difference.

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

review_stemmer = [stemmer.stem(word) for word in review]
review_stemmer = ' '.join(review_stemmer)
review_stemmer

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
review_lemmatize = [lemmatizer.lemmatize(word) for word in review]
review_lemmatize = ' '.join(review_lemmatize)
review_lemmatize

As both outputs show, stemming can produce stems that are not real words, while lemmatization returns dictionary forms that keep their meaning, ***e.g. fantasi (stemmed) vs. fantasy (lemmatized).***

**We will use the lemmatized review from here on.**

Now we will vectorize our text. For this, we first create a corpus.

In [None]:
corpus = []
corpus.append(review_lemmatize)
corpus

**To vectorize we will apply -**
1. Bag of Words model (CountVectorizer)
2. TF-IDF model (TfidfVectorizer)


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()
x
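
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of what a count matrix contains. The two toy documents are made up, and the code only mimics the idea behind CountVectorizer (one column per vocabulary word, one row per document); it is not its actual implementation. The corpus above holds a single review, so its count matrix has just one row.

```python
from collections import Counter

# Toy corpus (hypothetical), already cleaned and lowercased
docs = ["good movie good plot", "bad movie"]

# Vocabulary: every distinct word, sorted alphabetically
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocab])

print(vocab)
print(bow)
```

Each row simply records how often each vocabulary word occurs in that document; word order is discarded.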

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

review_tfidf = tfidf.fit_transform(corpus).toarray()
review_tfidf
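
The TF-IDF weighting can be sketched by hand as well. The code below assumes scikit-learn's default settings (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row) and uses made-up toy documents; it illustrates the formula and is not the library's implementation.

```python
import math
from collections import Counter

docs = ["good movie good plot", "bad movie"]  # hypothetical toy corpus
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Term counts times IDF, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print([round(v, 3) for v in row])
```

The word "movie" occurs in both documents, so it gets the lowest IDF (exactly 1) and a smaller weight than "bad" in the second document. With a single-document corpus like the one above, every word has the same IDF, so the TF-IDF vector is just the L2-normalised count vector.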

**Now we will apply the techniques to the whole dataset, keeping aside 25% of the data for testing**

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(dataset['review'], dataset['sentiment'], test_size=0.25, random_state=42)


**Converting sentiments into numerical form**

In [None]:
# Map the string labels to integers: positive -> 1, negative -> 0
y_train = y_train.map({'positive': 1, 'negative': 0})
y_test = y_test.map({'positive': 1, 'negative': 0})

**Cleaning text and forming train and test corpus**

In [None]:
x_train.iloc[1]

In [None]:
# Build the stopword set and the lemmatizer once, not once per word or per review
stop_words = set(stopwords.words('english'))
lm = WordNetLemmatizer()

def clean_review(text):
    """Strip HTML, keep letters only, lowercase, drop stopwords and lemmatize."""
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'\[[^]]*\]', ' ', text)  # remove text in square brackets
    text = re.sub('[^a-zA-Z]', ' ', text)   # replace non-letters with spaces
    words = text.lower().split()
    return ' '.join(lm.lemmatize(word) for word in words if word not in stop_words)

# Apply the same cleaning to the train and test reviews
corpus_train = [clean_review(text) for text in x_train]
corpus_test = [clean_review(text) for text in x_test]

**Validating sample entry**

In [None]:
corpus_train[-1]

In [None]:
corpus_test[-1]

**Vectorization using TF-IDF Technique**

In [None]:
tfidf = TfidfVectorizer(ngram_range = (1, 3))

tfidf_train = tfidf.fit_transform(corpus_train)
tfidf_test = tfidf.transform(corpus_test)
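
Two things happen in this cell. First, `ngram_range = (1, 3)` makes the vocabulary contain single words, word pairs and word triples. Second, the vectorizer is fitted on the training corpus only and then reused on the test corpus, so no information from the test set leaks into the vocabulary or the IDF weights. The sketch below illustrates the n-gram part in plain Python; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "movie really boring".split()
print(word_ngrams(tokens))
```

N-grams let the model see short phrases rather than isolated words, at the cost of a much larger feature space.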

**Using a linear Support Vector Classifier (LinearSVC) as the first model -**

In [None]:
from sklearn.svm import LinearSVC

linear_svc = LinearSVC(C=0.5, random_state=42)
linear_svc.fit(tfidf_train,y_train)

predict = linear_svc.predict(tfidf_test)


**Performance Metrics**
- Classification Report
- Confusion Matrix
- Accuracy Score

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print('Classification Report: \n', classification_report(y_test, predict, target_names = ['Negative', 'Positive']))
print('Confusion Matrix: \n', confusion_matrix(y_test, predict))
print('Accuracy score: \n', accuracy_score(y_test, predict))
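
For reference, the numbers in these reports all derive from four counts. The sketch below computes them by hand on made-up labels (not the model's predictions), assuming the layout that scikit-learn's confusion_matrix uses for labels 0 and 1: rows are true classes and columns are predicted classes, i.e. [[TN, FP], [FN, TP]].

```python
# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

matrix = [[tn, fp], [fn, tp]]
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(matrix)
print(accuracy, precision, recall)
```

Because the dataset is balanced, accuracy is a reasonable headline metric here; precision and recall show whether the errors lean towards one class.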

In [None]:
predict