
# **Dataset used: IMDB**
- ***A dataset of movie reviews, each labeled with its sentiment (positive or negative)***

# **Models Trained using IMDB:**

- RNN (Recurrent Neural Network)
- SVM (Support Vector Machine)
- Random Forest
- Gradient Boosting
- LSTM (Long Short Term Memory)
- GRU (Gated Recurrent Unit)


The **IMDB dataset** is a popular benchmark for natural language processing (NLP) and sentiment analysis tasks. Here is a brief overview:


### **What it Contains:**

- It includes 50,000 movie reviews from the Internet Movie Database (IMDB), with each review labeled as either positive or negative for sentiment.

### **Purpose:**

- It is primarily used to build models that can classify text into positive or negative sentiment.

### **Dataset Split:**

- In this notebook, 80% of the data (40,000 reviews) is used for training.
- The remaining 20% (10,000 reviews) is held out for testing (see Step-05).

### **Text Format:**

- The CSV stores each review as raw text, including leftover HTML such as `<br />`. The notebook tokenizes and cleans the text, then converts it into numerical TF-IDF features for machine learning.

## Step-01 Load Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings("ignore")




## Step-02 Load Dataset

In [3]:
df = pd.read_csv('/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/IMDB_dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


## Step-03 Data Preprocessing

In [None]:
df.shape

(50000, 2)

In [None]:
df.isnull().sum()

| | 0 |
|---|---|
| review | 0 |
| sentiment | 0 |


In [4]:
# download and prepare stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

stop_words = set(stopwords.words('english'))
print(stop_words)

{'when', 'his', 'on', "didn't", "haven't", 'until', 'why', 'this', "should've", 'himself', 'doesn', 're', 'aren', 'was', 'where', 'no', 'between', 'same', 'through', 'once', "isn't", 'down', "wasn't", "weren't", 'shouldn', 'off', 'are', 'again', 'll', 'having', 'can', 'your', 'm', 'while', 'only', "aren't", 'into', 'whom', 'who', 'were', 'yours', 'had', 'o', 'y', 'a', 'up', 'ain', "doesn't", 'hasn', 'each', 'some', 'then', 'wasn', "mightn't", "she's", 'will', 'too', 'to', 'about', 'her', 'after', 'from', 'hers', 'more', "hadn't", 'those', 'my', 'been', 'she', 'wouldn', 'very', 'they', 'all', 'so', "you've", "wouldn't", 've', 'd', 'shan', 'not', 'there', "it's", 'herself', 'than', 'does', "couldn't", 'now', "needn't", 't', 'or', "that'll", 'the', 'them', 'him', 'such', 'our', 'weren', 'and', 'haven', "hasn't", "shouldn't", 'its', 'with', 'during', 'their', 'yourself', 'ours', 'theirs', 'themselves', 'do', 'that', 'of', 'doing', 'an', 'mightn', 'mustn', 'don', "don't", 'ourselves', 'i', 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
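The stopword set printed above contains negations such as `'not'`, `'no'`, and `"didn't"`, which can carry sentiment. The sketch below shows one possible way to keep them; it is an optional variation, not what this notebook does. The `base_stop_words` set is a small hypothetical subset standing in for the full NLTK list, and `remove_stop_words` is an illustrative helper.

```python
# Hypothetical subset of the NLTK English stopword list printed above;
# in the notebook you would start from set(stopwords.words('english')).
base_stop_words = {"the", "a", "was", "this", "is", "it", "not", "no", "didn't", "wasn't"}

# Negation words often flip the sentiment of a phrase, so one option is to keep them.
negations = {"not", "no", "nor", "didn't", "wasn't", "isn't", "don't"}
custom_stop_words = base_stop_words - negations


def remove_stop_words(text, stop_words):
    """Lowercase the text and drop every whitespace-separated token in stop_words."""
    return " ".join(tok for tok in text.lower().split() if tok not in stop_words)


review = "This movie was not good"
default_cleaned = remove_stop_words(review, base_stop_words)
negation_kept = remove_stop_words(review, custom_stop_words)

print(default_cleaned)
print(negation_kept)
```

With the default set, the negation disappears and the review reads as if it were favourable; with the custom set, `not` survives into the features.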


In [None]:
df.columns

Index(['review', 'sentiment'], dtype='object')

In [5]:
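# Tokenize each review, keep only purely alphabetic tokens, and lowercase them
df['review'] = df['review'].apply(lambda x: ' '.join(token.lower() for token in word_tokenize(x) if token.isalpha()))
# Remove English stopwords from the cleaned text
df['review'] = df['review'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))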

In [9]:
df['review']

| | review |
|---|---|
| 0 | one reviewers mentioned watching oz episode ho... |
| 1 | wonderful little production br br filming tech... |
| 2 | thought wonderful way spend time hot summer we... |
| 3 | basically family little boy jake thinks zombie... |
| 4 | petter mattei love time money visually stunnin... |
| ... | ... |
| 49995 | thought movie right good job creative original... |
| 49996 | bad plot bad dialogue bad acting idiotic direc... |
| 49997 | catholic taught parochial elementary schools n... |
| 49998 | going disagree previous comment side maltin on... |
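Row 1 of the output above still contains `br br`, because the `<br />` tags in the raw reviews are tokenized into the alphabetic token `br`. A minimal sketch of stripping tags before tokenization follows, assuming a simple regular expression is good enough for this data. The sample string is made up, and `strip_html` is an illustrative helper, not part of the notebook.

```python
import re

# Matches any HTML tag such as <br /> or <i>; a simple heuristic, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace HTML tags with a space, then collapse repeated whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


raw = "Great film.<br /><br />Loved it."  # made-up review text
cleaned = strip_html(raw)
print(cleaned)
```

In the notebook this could be applied with `df['review'].apply(strip_html)` before the tokenization step.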


## Step-04 Feature Extraction

* Machine learning models operate on numbers rather than raw text, so feature extraction converts each review into a numeric TF-IDF vector.
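To make the idea of TF-IDF concrete, here is a hand-rolled sketch on a made-up three-document corpus. It assumes the weighting that scikit-learn documents as the `TfidfVectorizer` default (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). It is for illustration only and is not the library's implementation.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good acting"]  # made-up corpus
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF vector of one document over vocab."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_vector(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

Each row has one column per vocabulary term, which is why the real matrix printed below is stored sparsely as `(row, column) value` entries: most reviews use only a tiny fraction of the vocabulary.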

In [6]:
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['review'])
print(X)

  (0, 60399)	0.017974952839640245
  (0, 71441)	0.06462480990968943
  (0, 54092)	0.05609257336852549
  (0, 93368)	0.06575817373268848
  (0, 61684)	0.45623212993794404
  (0, 27602)	0.0975544426738396
  (0, 39899)	0.07094898893961152
  (0, 71837)	0.0735281265840711
  (0, 28436)	0.04904385280355244
  (0, 37578)	0.04906897696993937
  (0, 10061)	0.1047786971609892
  (0, 30894)	0.054421297960725694
  (0, 85831)	0.03303889550109451
  (0, 82165)	0.14215854177509643
  (0, 10973)	0.07903842493089844
  (0, 89869)	0.09486972298505915
  (0, 74571)	0.03257488692995884
  (0, 92320)	0.20057263688247617
  (0, 76083)	0.04041402416824635
  (0, 95155)	0.05023337023058612
  (0, 35029)	0.032218575955768956
  (0, 88294)	0.06204631857696947
  (0, 77159)	0.1032577785093174
  (0, 29340)	0.08475172881437788
  (0, 38275)	0.07921586789377567
  :	:
  (49999, 89906)	0.1082206330087094
  (49999, 48730)	0.09747854593557065
  (49999, 43001)	0.19065207524758182
  (49999, 28775)	0.11166077174693212
  (49999, 5351)	0.12239

## Step-05 Split the data into Training and Testing sets

In [7]:
X_train,X_test,y_train,y_test = train_test_split(X,df['sentiment'], test_size = 0.2 , random_state=42)

# len(X_train),len(y_train),len(X_test), len(y_test)
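In the cells above, the vectorizer is fitted on all 50,000 reviews before the split, so the IDF weights and vocabulary also reflect the test reviews. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below shows that ordering on a made-up mini corpus; the `texts` and `labels` lists stand in for `df['review']` and `df['sentiment']`, and `stratify` is an optional extra that keeps the class ratio equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up mini corpus standing in for df['review'] and df['sentiment']
texts = ["great film", "terrible plot", "loved acting", "boring story",
         "wonderful cast", "awful script", "brilliant ending", "dull pacing",
         "superb direction", "weak dialogue"]
labels = ["positive", "negative"] * 5

# 1) Split the raw text first
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# 2) Learn the vocabulary and IDF weights from the training split only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)

# 3) Reuse the fitted vocabulary for the test split (transform, not fit_transform)
X_test = vectorizer.transform(text_test)

print(X_train.shape, X_test.shape)
```

Words that appear only in the test reviews are ignored by `transform`, which mirrors what happens when the model later sees genuinely new text.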

## Step-06 Build and Train the Model

- **Let's train five different models and compare their accuracies.**

  - Multinomial Naive Bayes
  - Gradient Boosting Classifier
  - Random Forest Classifier
  - Support Vector Classifier
  - LightGBM Classifier


In [8]:
model = MultinomialNB()
print(model.fit(X_train,y_train))

model_gbc = GradientBoostingClassifier()
print(model_gbc.fit(X_train,y_train))

model_rfc = RandomForestClassifier()
print(model_rfc.fit(X_train,y_train))

model_svc = SVC()
print(model_svc.fit(X_train,y_train))

MultinomialNB()
GradientBoostingClassifier()
RandomForestClassifier()
SVC()
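Because all of these estimators share the same `fit` and `predict` interface, the repeated calls can also be organised as a loop over a dictionary, which keeps the later prediction and evaluation steps in one place. The following is only a sketch on made-up toy data with two of the models; the `models` and `scores` names are illustrative, and `n_estimators=10` is chosen just to keep the example fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data; the notebook would use X_train, X_test, y_train, y_test instead
train_texts = ["great wonderful film", "terrible boring film", "wonderful acting",
               "boring plot", "great cast", "terrible script"]
train_labels = ["positive", "negative", "positive", "negative", "positive", "negative"]
test_texts = ["wonderful cast", "boring script"]
test_labels = ["positive", "negative"]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_texts)
X_te = vectorizer.transform(test_texts)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=42),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_tr, train_labels)                                   # train
    scores[name] = accuracy_score(test_labels, clf.predict(X_te))  # evaluate

for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```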


In [10]:
model_lgbm = LGBMClassifier()
model_lgbm.fit(X_train,y_train)

#model_xgb = XGBClassifier()
#model_xgb.fit(X_train,y_train)

[LightGBM] [Info] Number of positive: 19961, number of negative: 20039
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 10.224409 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 733174
[LightGBM] [Info] Number of data points in the train set: 40000, number of used features: 14944
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499025 -> initscore=-0.003900
[LightGBM] [Info] Start training from score -0.003900


## Step-07 Save and Reload the Models

In [11]:
import joblib
joblib.dump(model,'/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model.pkl')

['/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model.pkl']

In [14]:
joblib.dump(model_gbc,'/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_gbc.pkl')

['/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_gbc.pkl']

In [13]:
joblib.dump(model_rfc,'/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_rfc.pkl')

['/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_rfc.pkl']

In [15]:
joblib.dump(model_svc,'/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_svc.pkl')

['/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_svc.pkl']

In [16]:
joblib.dump(model_lgbm,'/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_lgbm.pkl')

['/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_lgbm.pkl']

In [11]:
import joblib
model = joblib.load('/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model.pkl')
model_gbc = joblib.load('/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_gbc.pkl')
model_rfc = joblib.load('/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_rfc.pkl')
model_svc = joblib.load('/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_svc.pkl')
model_lgbm = joblib.load('/content/drive/MyDrive/BIA_class/NLP/IMDB dataset/model_lgbm.pkl')
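The saved `.pkl` files contain only the classifiers. To score new text in a later session, the fitted `TfidfVectorizer` is needed as well, since it holds the vocabulary the models were trained on. One option, sketched below under that assumption, is to wrap the vectorizer and a model in a scikit-learn pipeline and persist both as a single file. The toy data, the temporary directory, and the file name `sentiment_pipeline.pkl` are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for the cleaned reviews and their labels
texts = ["great wonderful film", "terrible boring film", "wonderful acting", "boring plot"]
labels = ["positive", "negative", "positive", "negative"]

# The pipeline applies TF-IDF and then the classifier, so both are fitted and saved together
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# Hypothetical location; the notebook would use its Google Drive folder instead
path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.pkl")
joblib.dump(pipeline, path)

# Later session: load once, then predict directly on raw (preprocessed) text
restored = joblib.load(path)
new_reviews = ["wonderful film", "boring acting"]
original_preds = list(pipeline.predict(new_reviews))
restored_preds = list(restored.predict(new_reviews))
print(restored_preds)
```

Saving the standalone `vectorizer` with its own `joblib.dump` call would work as well; the pipeline simply makes it harder to pair a model with the wrong vocabulary.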

## Step-08 Model Prediction

In [12]:
y_pred_nb = model.predict(X_test)

In [13]:
y_pred_gbc = model_gbc.predict(X_test)
y_pred_rfc = model_rfc.predict(X_test)
y_pred_svc = model_svc.predict(X_test)
y_pred_lgbm = model_lgbm.predict(X_test)
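The next step compares predictions like these against `y_test`. As a reminder of what the reported numbers mean, here is a small hand-written sketch of accuracy, precision, recall, and F1 for one class. The label lists are made up, and `binary_metrics` is an illustrative helper rather than a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 2 true positives, 1 false negative, 1 false positive, 1 true negative
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

`classification_report` prints these per-class values for both `negative` and `positive`, along with their macro and weighted averages.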

## Step-09 Model Evaluation

In [14]:
# Multinomial Naive Bayes
print("___Multinomial Naive Bayes____")
print("Accuracy:" , accuracy_score(y_test,y_pred_nb))
print("Classification Report:", classification_report(y_test,y_pred_nb))

# Gradient Boosting
print("___Gradient Boosting____")
print("Accuracy:" , accuracy_score(y_test,y_pred_gbc))
print("Classification Report:", classification_report(y_test,y_pred_gbc))

# Random Forest
print("___Random Forest____")
print("Accuracy:" , accuracy_score(y_test,y_pred_rfc))
print("Classification Report:", classification_report(y_test,y_pred_rfc))

# SVM
print("___Support Vector Machine____")
print("Accuracy:" , accuracy_score(y_test,y_pred_svc))
print("Classification Report:", classification_report(y_test,y_pred_svc))

# Light GBM Classifier
print("___Light GBM Classifier____")
print("Accuracy:" , accuracy_score(y_test,y_pred_lgbm))
print("Classification Report:", classification_report(y_test,y_pred_lgbm))

___Multinomial Naive Bayes____
Accuracy: 0.8672
Classification Report:               precision    recall  f1-score   support

    negative       0.86      0.88      0.87      4961
    positive       0.88      0.85      0.87      5039

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

___Gradient Boosting____
Accuracy: 0.8102
Classification Report:               precision    recall  f1-score   support

    negative       0.84      0.76      0.80      4961
    positive       0.78      0.86      0.82      5039

    accuracy                           0.81     10000
   macro avg       0.81      0.81      0.81     10000
weighted avg       0.81      0.81      0.81     10000

___Random Forest____
Accuracy: 0.8554
Classification Report:               precision    recall  f1-score   support

    negative       0.85      0.86      0.86      4961
    positive       0.86      0.85      0.86