# NLP stands for Natural Language Processing: analyzing text to extract information such as sentiment, named entities, topics of discussion, or even a summary of the text.
#### Here we use the IMDB dataset to perform sentiment analysis. Because textual data is unstructured, we first clean it with some text pre-processing. A machine learning model cannot consume raw text directly, so we then convert the text into a numerical form (a vector representation) with a vectorization / text encoding technique; this notebook uses TF-IDF.

## Importing needed libraries

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline 
import re 
from bs4 import BeautifulSoup 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split 
import joblib 


### Loading dataset


In [8]:
dataset = pd.read_csv(r'C:\Users\HP\Desktop\projects\backend\BizBot\sentiment\IMDB Dataset.csv')


### Printing the shape and the first 10 rows of the dataset


In [9]:
print(dataset.shape)
dataset.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


### Describing the dataset

In [11]:
dataset.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [13]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


#### There are two columns: review and sentiment. Sentiment is the target column that the model will learn to predict.


In [14]:
dataset['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

## Performing Sentiment Analysis on the IMDB dataset

#### Defining a function to clean the text by removing HTML tags (using BeautifulSoup), punctuation and all other non-letter characters

In [19]:
def clean_text(text):
  # Removing HTML tags using BeautifulSoup
  text = BeautifulSoup(text, "html.parser").get_text()
  # Replacing every non-letter character (punctuation, digits, symbols) with a space using a regular expression
  text = re.sub("[^a-zA-Z]", " ", text)
  # Converting the text to lower case and splitting it into words
  words = text.lower().split()
  # Returning the cleaned text as a string
  return " ".join(words)
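
For readers who want to try the cleaning steps without installing BeautifulSoup, here is a minimal dependency-free sketch. It is not the notebook's method: `clean_text_simple` is a hypothetical helper that assumes tags can be stripped with a regular expression, which is only reasonable for simple markup such as the `<br />` tags in these reviews. It replaces each tag with a space, whereas `get_text()` with its default empty separator can fuse the words on either side of a tag.

```python
import html
import re


def clean_text_simple(text):
    # Replace each HTML tag with a space (assumption: simple, well-formed tags)
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; into their characters
    text = html.unescape(text)
    # Keep letters only, as in clean_text above
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lower-case and collapse runs of whitespace
    return " ".join(text.lower().split())


sample = "A wonderful little production.<br /><br />The filming is great &amp; fresh!"
print(clean_text_simple(sample))
```

Note that both versions also drop digits and split contractions ("there's" becomes "there s"), which is visible in the cleaned rows further down.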


In [20]:
# Applying the clean_text function to the review column of the dataset
dataset["clean_review"] = dataset["review"].apply(clean_text)

# Printing the first 10 rows of the cleaned dataset
dataset.head(10)



Unnamed: 0,review,sentiment,clean_review
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,positive,a wonderful little production the filming tech...
2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,basically there s a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei s love in the time of money is a...
5,"Probably my all-time favorite movie, a story o...",positive,probably my all time favorite movie a story of...
6,I sure would like to see a resurrection of a u...,positive,i sure would like to see a resurrection of a u...
7,"This show was an amazing, fresh & innovative i...",negative,this show was an amazing fresh innovative idea...
8,Encouraged by the positive comments about this...,negative,encouraged by the positive comments about this...
9,If you like original gut wrenching laughter yo...,positive,if you like original gut wrenching laughter yo...


In [21]:
# Defining a function to convert the sentiment column into numerical labels (0 for negative, 1 for positive, 2 for any other value; this dataset contains only the first two)
def label_sentiment(sentiment):
  # If the sentiment is positive, return 1
  if sentiment == "positive":
    return 1
  # If the sentiment is negative, return 0
  elif sentiment == "negative":
    return 0
  # Else, return 2
  else:
    return 2
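
The same mapping can be written as a dictionary lookup, which some readers may find easier to extend. This is only a sketch: `LABEL_MAP` and `label_sentiment_lookup` are hypothetical names, and the fallback value 2 mirrors the `else` branch above.

```python
# Same mapping as label_sentiment: unknown values fall back to 2
LABEL_MAP = {"negative": 0, "positive": 1}


def label_sentiment_lookup(sentiment):
    # dict.get returns the fallback when the key is missing
    return LABEL_MAP.get(sentiment, 2)


sentiments = ["positive", "negative", "positive", "mixed"]
labels = [label_sentiment_lookup(s) for s in sentiments]
print(labels)
```

With pandas, `dataset["sentiment"].map(LABEL_MAP)` gives the same result for the two known values, but values missing from the dictionary become NaN rather than 2.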


## Applying the label_sentiment function to the sentiment column of the dataset

In [23]:
dataset["label"] = dataset["sentiment"].apply(label_sentiment)

# Printing the first 10 rows of the labeled dataset
dataset.head(10)


Unnamed: 0,review,sentiment,clean_review,label
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,positive,a wonderful little production the filming tech...,1
2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,negative,basically there s a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei s love in the time of money is a...,1
5,"Probably my all-time favorite movie, a story o...",positive,probably my all time favorite movie a story of...,1
6,I sure would like to see a resurrection of a u...,positive,i sure would like to see a resurrection of a u...,1
7,"This show was an amazing, fresh & innovative i...",negative,this show was an amazing fresh innovative idea...,0
8,Encouraged by the positive comments about this...,negative,encouraged by the positive comments about this...,0
9,If you like original gut wrenching laughter yo...,positive,if you like original gut wrenching laughter yo...,1


## Splitting the dataset into features (X) and target (y)

In [26]:
X = dataset["clean_review"]
y = dataset["label"]

# Splitting the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
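
Because `train_test_split` shuffles rows at random, the class balance of the test set can drift slightly from the 50/50 balance of the full dataset. The sketch below illustrates the idea of a stratified split using only the standard library; `stratified_split` is a hypothetical helper, not scikit-learn's implementation. In scikit-learn the same effect is requested by passing `stratify=y` to `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), keeping each label's share in both parts."""
    # Group row indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(toy_labels)
test_counts = [sum(1 for i in test_idx if toy_labels[i] == c) for c in (0, 1)]
print(len(train_idx), len(test_idx), test_counts)
```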


### Creating a TfidfVectorizer object with a maximum of 5000 features and removing stopwords

In [27]:
# Creating a TfidfVectorizer object with a maximum of 5000 features and removing stopwords
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
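
To make the vectorization step less of a black box, here is a small pure-Python sketch of the weighting that, as far as I understand scikit-learn's defaults, `TfidfVectorizer` applies: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The toy documents are made up, and the sketch ignores tokenization details, stop-word removal and `max_features`.

```python
import math
from collections import Counter

docs = ["great movie great cast", "terrible movie", "great story"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(w for doc in tokenized for w in set(doc))
# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_vector(tokens):
    counts = Counter(tokens)
    # Term count times idf for every vocabulary word
    raw = [counts[w] * idf[w] for w in vocab]
    # Scale the vector to unit Euclidean length
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]
for word, value in zip(vocab, vectors[0]):
    print(f"{word:10s} {value:.3f}")
```

Words that occur in fewer documents get a larger idf, so a rare word such as "terrible" weighs more than a common one such as "movie".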

### Fitting the vectorizer, transforming the features and creating the classifier

In [30]:
# Fitting and transforming the training features into a sparse matrix
X_train_vec = vectorizer.fit_transform(X_train)

# Transforming the testing features into a sparse matrix using the same vectorizer
X_test_vec = vectorizer.transform(X_test)

# Creating a LinearSVC object with default parameters
classifier = LinearSVC()


### Fitting the classifier to the training data & Predicting the labels for the testing data

In [33]:
# Fitting the classifier to the training data
classifier.fit(X_train_vec, y_train)

# Predicting the labels for the testing data
y_pred = classifier.predict(X_test_vec)



### Evaluating the performance of the classifier using a classification report, confusion matrix and accuracy score

In [38]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print('Classification Report: \n', classification_report(y_test, y_pred, target_names = ['Negative', 'Positive']))
print('Confusion Matrix: \n', confusion_matrix(y_test,y_pred ))
print('Accuracy score: \n', accuracy_score(y_test, y_pred))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.88      0.86      0.87      4961
    Positive       0.87      0.89      0.88      5039

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

Confusion Matrix: 
 [[4289  672]
 [ 562 4477]]
Accuracy score: 
 0.8766
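
The numbers in the report can be reproduced by hand from the confusion matrix, which helps when reading it. In scikit-learn's convention the rows are the true classes and the columns the predicted classes; `precision_recall` below is a hypothetical helper written for this check.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[4289, 672],   # true negative reviews
      [562, 4477]]   # true positive reviews

total = sum(sum(row) for row in cm)
# Correct predictions sit on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / total


def precision_recall(cm, cls):
    true_pos = cm[cls][cls]
    predicted = cm[0][cls] + cm[1][cls]   # column sum: everything predicted as cls
    actual = sum(cm[cls])                 # row sum: everything that really is cls
    return true_pos / predicted, true_pos / actual


for name, cls in (("Negative", 0), ("Positive", 1)):
    p, r = precision_recall(cm, cls)
    f1 = 2 * p * r / (p + r)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The row sums (4961 and 5039) are the support values shown in the report.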


In [39]:
y_pred

array([0, 1, 0, ..., 1, 1, 1], dtype=int64)

#### Saving the model using joblib

In [41]:
# Saving the classifier to a file using joblib
joblib.dump(classifier, "classifier.joblib")

['classifier.joblib']

#### Loading the model and using it on new data

In [43]:
# Importing joblib for loading models
import joblib

# Loading the classifier from the file using joblib
classifier = joblib.load("classifier.joblib")

# Using the classifier to predict the labels for new data
new_data = ["This movie is awesome", "This movie is terrible", "This movie is okay"]
new_data_vec = vectorizer.transform(new_data) # using the fitted vectorizer still in memory from the cells above; it was not saved with the classifier
new_pred = classifier.predict(new_data_vec)
print(new_pred)

[1 0 0]
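
The cell above only works while the fitted `vectorizer` is still in memory, because only the classifier was written to disk. One common way around this, sketched here under assumptions, is to bundle the vectorizer and classifier into a single scikit-learn pipeline and save that instead. The toy texts, the file name `sentiment_pipeline.joblib` and the default vectorizer settings are illustrative and are not taken from the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for X_train / y_train (assumption, not the IMDB reviews)
toy_texts = [
    "great film loved it",
    "wonderful acting great story",
    "terrible plot awful film",
    "awful boring waste of time",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then classifies it
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(toy_texts, toy_labels)

# Save both steps in one file (written to a temporary directory here)
path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.joblib")
joblib.dump(model, path)

# In a new session, loading the file restores the vectorizer and the classifier together
restored = joblib.load(path)
new_data = ["great story", "awful plot"]
restored_pred = restored.predict(new_data)
print(restored_pred)
```

With this layout, new text only needs the same cleaning as the training data before being passed to `predict`.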


#### The model reaches about 88% accuracy on the held-out test set. You can try different methods to improve its performance.

### ENOCHH BAAH 