___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

In [None]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/Colab Notebooks/IMDB Dataset.csv'


df = pd.read_csv(file_path)
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Check for missing values:
Checking for missing values before modeling is always good practice.

In [None]:
# Count the NaN values in each column of the DataFrame.
nan_values = df.isna().sum()
print(nan_values)

review       0
sentiment    0
dtype: int64


In [None]:
# isnull() is an alias of isna(), so this repeats the NaN check above.
# Note that it does not detect empty strings.
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [None]:
# Check for cells that contain only whitespace.
blanks = []  # indices of rows that have a whitespace-only cell

# Iterate through each row using itertuples
for row in df.itertuples():
    # Skip row[0], which is the index, and check every remaining cell
    for cell in row[1:]:
        if str(cell).isspace():  # True only for non-empty strings made up entirely of whitespace
            blanks.append(row.Index)  # Record the index of the offending row

print(blanks)  # Index positions of all rows that contain a whitespace-only cell

[]


The list is empty, so no cell in the dataset is missing or contains only whitespace.
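
The same check can also be written without an explicit loop. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its contents are made up for illustration, not taken from the IMDB data). It strips every cell and flags the rows where a cell ends up empty, which also catches completely empty strings that `str.isspace()` would miss.

```python
import pandas as pd

# Hypothetical toy frame standing in for the reviews DataFrame
toy = pd.DataFrame({
    'review': ['Great film', '   ', 'Dull plot', ''],
    'sentiment': ['positive', 'negative', '\t', 'negative'],
})

# Strip each cell as a string; a cell that becomes '' was empty or whitespace-only
is_blank = toy.astype(str).apply(lambda col: col.str.strip() == '')

# Collect the index of every row that has at least one blank cell
blank_rows = toy.index[is_blank.any(axis=1)].tolist()
print(blank_rows)  # [1, 2, 3]
```

On the real data the same two lines would be applied to `df` instead of `toy`.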

## Split the data into train & test sets:

In [None]:
from sklearn.model_selection import train_test_split

X = df['review']  # the raw review text is the feature
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Build a Pipeline
The test set must go through the same vectorization steps as the training set, reusing the vocabulary learned from the training data. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier. It chains the text vectorization (here `TfidfVectorizer`, which covers both the counting and the TF-IDF weighting steps) and the classifier into a single estimator.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

## Pipeline takes a list of (name, estimator) tuples, one tuple per step.
## We can provide as many steps as we want, and they run in the order specified.
## Here TfidfVectorizer runs first, followed by LinearSVC (the classification model).
## The steps are named 'tfidf' and 'clf'. You can name them anything you want.
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Fit the pipeline on the training data
text_clf.fit(X_train, y_train)
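
To make clearer what the pipeline saves us from writing, here is a minimal sketch that compares the manual route with the pipeline route. The toy texts and labels are hypothetical, and `random_state=0` is an assumption added only to keep the two runs comparable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, not taken from the IMDB dataset
train_texts = ['great movie', 'wonderful acting', 'terrible plot', 'awful movie']
train_labels = ['positive', 'positive', 'negative', 'negative']
test_texts = ['wonderful movie', 'terrible acting']

# Manual route: fit the vectorizer on the training text only,
# then reuse the fitted vocabulary to transform the test text
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_texts)
X_test_tfidf = vectorizer.transform(test_texts)  # transform, not fit_transform
manual_clf = LinearSVC(random_state=0).fit(X_train_tfidf, train_labels)
manual_pred = manual_clf.predict(X_test_tfidf)

# Pipeline route: the same two steps wrapped in a single estimator
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', LinearSVC(random_state=0))])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

# Both routes should produce the same predictions
print((manual_pred == pipe_pred).all())
```

The pipeline version cannot accidentally call `fit_transform` on the test set, which is the usual mistake in the manual route.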

## Test the classifier and display results

In [None]:
# Form a prediction set.
# Notice that X_test is raw review text: the pipeline first applies the fitted
# TfidfVectorizer to it and then passes the result to LinearSVC.
predictions = text_clf.predict(X_test)

In [None]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[7291  917]
 [ 735 7557]]


In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

    negative       0.91      0.89      0.90      8208
    positive       0.89      0.91      0.90      8292

    accuracy                           0.90     16500
   macro avg       0.90      0.90      0.90     16500
weighted avg       0.90      0.90      0.90     16500



In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.8998787878787878
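
The figures in the classification report and the accuracy score follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four counts printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`negative`, `positive`).

```python
import numpy as np

# Confusion matrix from the output above: rows = true labels, columns = predictions
cm = np.array([[7291, 917],
               [735, 7557]])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions of a class over everything predicted as that class
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions of a class over everything truly in that class
recall = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))   # 0.8999
print(precision.round(2))   # [0.91 0.89]
print(recall.round(2))      # [0.89 0.91]
```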


In [None]:
## Predict the sentiment (positive or negative) of a single new review.
text_clf.predict(["I dont know what to say about this movie to be honest. The story was really good and gripping. However the way it was told and the scequence it was told can be altered to make a better remake. The acting in compare to the directer was superb. There were few direction mistakes which was well compensated with the natural acting."])

array(['positive'], dtype=object)
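
`predict` returns only the label. If you also want a rough idea of how far a review sits from the decision boundary, `LinearSVC` exposes `decision_function`, which the pipeline forwards. The sketch below uses a hypothetical toy classifier (`toy_clf`, trained on made-up texts) rather than the fitted `text_clf`, so the numbers it prints say nothing about the IMDB model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real training set
texts = ['great movie', 'wonderful acting', 'terrible plot', 'awful movie']
labels = ['positive', 'positive', 'negative', 'negative']
toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC(random_state=0))]).fit(texts, labels)

new_review = ['wonderful plot but awful acting']
score = toy_clf.decision_function(new_review)[0]  # signed distance to the hyperplane
label = toy_clf.predict(new_review)[0]

# In a binary problem a positive score maps to classes_[1], a negative one to classes_[0]
print(toy_clf.classes_, score, label)
```

A score close to zero means the review lies near the boundary, so the predicted label is less certain. The score is not a probability.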