<a href="https://colab.research.google.com/github/souravmondal01/binary_sentiment_classification_on_IMDB/blob/main/binary_sentiment_classification_on_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import necessary libraries

In [None]:
import os
import kagglehub
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import LancasterStemmer,WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from wordcloud import WordCloud,STOPWORDS
from bs4 import BeautifulSoup
import spacy
import re,string,unicodedata
from textblob import TextBlob
from textblob import Word
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Download latest version of the datasets
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")



Importing the dataset

In [None]:
# Load the full IMDB dataset (50,000 labelled reviews)
imdb_df = pd.read_csv(f'{path}/IMDB Dataset.csv')
print(imdb_df.shape)
imdb_df.head(5)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


sentiment count

In [None]:
#sentiment count
imdb_df['sentiment'].value_counts()


Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
positive,25000
negative,25000


## **Splitting the dataset into train and test sets**

In [None]:
#split the dataset
#train dataset
train_reviews=imdb_df.review[:40000]
train_sentiments=imdb_df.sentiment[:40000]
#test dataset
test_reviews=imdb_df.review[40000:]
test_sentiments=imdb_df.sentiment[40000:]
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)

(40000,) (40000,)
(10000,) (10000,)


## **Text normalization**
**`tokenizer=ToktokTokenizer()`**  
- This line initializes an instance of the `ToktokTokenizer` class from the `nltk.tokenize` module.
- `ToktokTokenizer` is a text tokenizer, which splits a sentence or paragraph into individual tokens (words or punctuation marks).  

**Example**:
```python
from nltk.tokenize import ToktokTokenizer

tokenizer = ToktokTokenizer()
tokens = tokenizer.tokenize("This is an example sentence.")
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'sentence', '.']
```
**`nltk.download('stopwords')`**  
- This command downloads the predefined stopwords dataset from the NLTK (Natural Language Toolkit) library.
- **What Are Stopwords?**:
 - Stopwords are commonly used words in a language (e.g., "is", "the", "and") that often don’t contribute much meaning to the text and are usually removed during preprocessing for natural language processing (NLP) tasks.

**`stopword_list=nltk.corpus.stopwords.words('english')`**  
- This line loads the list of English stopwords from the downloaded stopwords dataset into the variable `stopword_list`.

For example:
```python
text = "This is an example sentence."
tokens = tokenizer.tokenize(text)  # Tokenize the text
filtered_tokens = [word for word in tokens if word.lower() not in stopword_list]  # Remove stopwords
print(filtered_tokens)
# Output: ['example', 'sentence', '.']
```



In [None]:
#Tokenization of text
tokenizer=ToktokTokenizer()
# Downloading Stopwords
nltk.download('stopwords')
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Removing HTML tags and noisy text
* Input Review Text: "This is an amazing movie! [Spoiler Alert] Check out <a href='link'>this link</a> for more details."
* HTML Removal (strip_html): "This is an amazing movie! [Spoiler Alert] Check out this link for more details."
* Square Bracket Removal (remove_between_square_brackets): "This is an amazing movie! Check out this link for more details."
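
The bracket-removal step can be tried on its own with the standard library. This is a minimal sketch: `collapse_spaces` is a hypothetical extra helper (not part of this notebook) that squeezes the doubled space left behind where the bracketed text used to be.

```python
import re

def remove_between_square_brackets(text):
    # Drop any "[...]" span, brackets included.
    return re.sub(r'\[[^]]*\]', '', text)

def collapse_spaces(text):
    # Hypothetical extra step: squeeze runs of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()

sample = "This is an amazing movie! [Spoiler Alert] Check out this link for more details."
stripped = remove_between_square_brackets(sample)
tidied = collapse_spaces(stripped)

print(repr(stripped))  # note the two spaces after "movie!"
print(repr(tidied))
```

Whether the extra whitespace matters depends on the tokenizer used later; most tokenizers ignore it.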


In [None]:
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser") # Parses the input text as if it were an HTML document.
    return soup.get_text() # Extracts the plain text by removing all HTML tags and structure.

#Removing the square brackets
def remove_between_square_brackets(text):
    # \[ and \] match the literal brackets; [^]]* matches everything up to the closing bracket.
    # The whole match is replaced with an empty string, which removes it.
    return re.sub(r'\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
#Apply function on review column
imdb_df['review']=imdb_df['review'].apply(denoise_text)

# Removing special characters

* This function removes special characters (e.g., punctuation, symbols) from a given text while retaining alphanumeric characters and spaces.
* Input Review Text: "Wow! This movie is amazing :) @user123 #MustWatch."
* Resulting Text: "Wow This movie is amazing  user123 MustWatch"
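
A minimal sketch of this filter, assuming `remove_digits` is meant to toggle whether digits are stripped along with punctuation. The default below keeps digits, which matches the example above.

```python
import re

def remove_special_characters(text, remove_digits=False):
    # Keep letters and whitespace; keep digits too unless remove_digits is set.
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

sample = "Wow! This movie is amazing :) @user123 #MustWatch."
kept_digits = remove_special_characters(sample)
no_digits = remove_special_characters(sample, remove_digits=True)

print(repr(kept_digits))
print(repr(no_digits))
```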

In [None]:
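def remove_special_characters(text, remove_digits=False):
    # Keep letters and whitespace; digits are kept too unless remove_digits=True.
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    return text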

# Text stemming
- Stemming reduces words to their root or base form by removing suffixes. For example:
 - "running" → "run"
 - "easily" → "easili"
 - "studies" → "studi"
* It simplifies words, which can help reduce vocabulary size in Natural Language Processing (NLP) tasks.


* **Input Review Text:** "The cats were running swiftly towards the forest."
* **Stemming Each Word and Joining Back:** "the cat were run swiftli toward the forest." (the text is split on whitespace only, so punctuation attached to a word is kept)
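
The stems listed above can be reproduced with NLTK's `PorterStemmer`, which needs no extra data download. This is a small sketch; the sample sentence omits the trailing period so that only the stemming itself is on display.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Stem the individual words from the list above.
words = ["running", "easily", "studies"]
stems = [ps.stem(w) for w in words]
print(dict(zip(words, stems)))

# Stem a whole sentence: split on whitespace, stem each word, join back.
sentence = "The cats were running swiftly towards the forest"
stemmed = ' '.join(ps.stem(w) for w in sentence.split())
print(stemmed)
```

Stems such as "easili" are not dictionary words; that is expected, because stemming only strips suffixes by rule.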


In [None]:
#Stemming the text
def simple_stemmer(text):
    # PorterStemmer is a rule-based stemming algorithm provided by NLTK (imported above).
    ps = PorterStemmer()
    # Split on whitespace, stem each word, then join the stems back into one string.
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text
#Apply function on review column
imdb_df['review']=imdb_df['review'].apply(simple_stemmer)


# **Removing stopwords**
* Stopwords are common words (e.g., "the", "is", "in", "and") that appear frequently in a language but often carry little meaningful information for tasks like text classification or sentiment analysis. Removing stopwords helps reduce the noise in the text.
* **Input Text:** "This is a great movie with wonderful acting and an amazing story."
* **Tokenize Text:** ["This", "is", "a", "great", "movie", "with", "wonderful", "acting", "and", "an", "amazing", "story", "."]
* **Filter Stopwords:** ["great", "movie", "wonderful", "acting", "amazing", "story", "."]
* **Join Tokens:** "great movie wonderful acting amazing story ."
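
The filtering step can be sketched without any downloads. The stopword set here is a tiny hypothetical stand-in for NLTK's English list, and whitespace splitting stands in for `ToktokTokenizer`; both are assumptions made to keep the example self-contained.

```python
# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer).
toy_stopwords = {"this", "is", "a", "with", "and", "an"}

def remove_stopwords(text, stopwords):
    # Whitespace tokenization keeps the sketch dependency-free.
    tokens = text.split()
    # Lowercase each token before the lookup so "This" matches "this".
    kept = [tok for tok in tokens if tok.lower() not in stopwords]
    return ' '.join(kept)

sample = "This is a great movie with wonderful acting and an amazing story"
filtered = remove_stopwords(sample, toy_stopwords)
print(filtered)
```

Storing the stopwords in a `set` keeps each membership check constant-time, which adds up over 50,000 reviews.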


In [None]:
#set stopwords to english
stop=set(stopwords.words('english')) # stopwords.words('english') loads the list of stopwords for the English language from the NLTK library.
# A set is used for efficient membership testing when checking if a word is a stopword.
print(stop) # Displays the list of stopwords to verify or understand what will be removed.

#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text) # Split the input text into tokens.
    tokens = [token.strip() for token in tokens] # Strip leading/trailing whitespace from each token.
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop] # Text is already lowercase: check tokens as-is.
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop] # Lowercase each token before the membership check.
    filtered_text = ' '.join(filtered_tokens) # Combine the remaining tokens into a single space-separated string.
    return filtered_text # The text with stopwords removed.
#Apply function on review column
imdb_df['review'] = imdb_df['review'].apply(remove_stopwords) # Removes stopwords from the reviews and updates the column with the cleaned text.

{'now', 'as', 'any', 'yourselves', "hasn't", 'ourselves', 'o', 'again', 'be', 'having', 'where', 'our', "wouldn't", 'hasn', 're', 'up', 'until', 'above', "mustn't", 'himself', "needn't", 'just', 'those', 'hadn', 'own', 'it', 'same', 'only', 'about', 'his', 'were', 'aren', 'after', 'some', 'doesn', 'further', 'to', 'her', 'an', 'by', 'from', 'few', 'was', "doesn't", 'didn', 'under', "weren't", "she's", 'did', 'ours', 'over', 'but', "hadn't", 'what', 'because', 'how', 'being', 'against', 'been', 've', 'shan', 'has', 'your', 'm', 'into', 'each', 'in', 'other', 'hers', 'during', 'can', 'through', 'doing', 'and', 'that', 'you', 'their', 'this', 'very', 'should', 'll', "haven't", 'a', 'had', "mightn't", 'wouldn', 'nor', 'is', 'itself', 'are', 'do', 'or', 'why', 'than', 't', 'there', 'ain', 'out', 'too', 'so', "won't", 'does', 'which', 'he', 'couldn', 'haven', 'no', 'don', 'i', 'then', 'needn', 'ma', "you'll", "you've", 'isn', 's', "should've", 'yourself', 'd', 'will', 'with', 'of', 'when', '

# Split reviews in train & test

In [None]:
#normalized train reviews
norm_train_reviews=imdb_df.review[:40000] # First 40,000 normalized reviews from the review column of imdb_df.
norm_test_reviews=imdb_df.review[40000:]

# Bag of words Model
The code is converting text data (reviews) into a format called Bag of Words (BoW). This is a common way to prepare text for machine learning models. It counts how many times each word (or set of words) appears in the text.

```python
cv = CountVectorizer(min_df=0.0, max_df=1.0, binary=False, ngram_range=(1,3))
```

- **`min_df=0.0`:** No filtering of rare terms. As a float, `min_df` is a proportion of documents, so even terms that appear in just one document stay in the vocabulary.

- **`max_df=1.0`:** No filtering of common terms. As a float, `max_df` is also a proportion, so even terms that appear in every document stay in the vocabulary. An integer `max_df=1` would instead be an absolute count and would drop every term that occurs in more than one document.

#### Example:
If `max_df=0.8`, terms that appear in **more than 80% of the documents** will be ignored.

- **`ngram_range=(1,3)`** means it considers:
  - Single words (unigrams, e.g., "love"),
  - Two-word combinations (bigrams, e.g., "love learning"),
  - Three-word combinations (trigrams, e.g., "learning is fun").
- **`binary=False`** means it counts how many times each word or phrase appears, rather than only recording presence or absence.
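
The difference between an integer and a float `max_df` is easy to miss, so here is a small sketch on two made-up documents (the documents are illustrative, not taken from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie"]

# Integer max_df is an absolute document count:
# terms found in more than 1 document ("movie") are dropped.
int_vocab = sorted(CountVectorizer(max_df=1).fit(docs).vocabulary_)

# Float max_df is a proportion of documents:
# 1.0 keeps terms even if they appear in every document.
float_vocab = sorted(CountVectorizer(max_df=1.0).fit(docs).vocabulary_)

print("max_df=1   ->", int_vocab)
print("max_df=1.0 ->", float_vocab)
```

With `max_df=1`, every term shared between reviews would be discarded, which is rarely what is wanted for classification.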

```python
cv_train_reviews = cv.fit_transform(norm_train_reviews)
```
- **`fit_transform`** does two things:
  - **`fit`**: It learns all the unique words/phrases from the training data.
  - **`transform`**: It creates a table (matrix) where:
    - Rows = individual reviews,
    - Columns = words or phrases from the vocabulary,
    - Each cell = how many times a word/phrase appears in that review.

```python
cv_test_reviews = cv.transform(norm_test_reviews)
```
- The test data is transformed using the same vocabulary as the training data.

```python
print('BOW_cv_train:', cv_train_reviews.shape)
print('BOW_cv_test:', cv_test_reviews.shape)
```
- **Shape** tells you:
  - **Rows**: Number of reviews in the dataset.
  - **Columns**: Total number of unique words/phrases found in the training data.

Example:
If the training data has 100 reviews and 5000 unique words/phrases:
- Output: `BOW_cv_train: (100, 5000)`

---

### Summary Example:
Imagine you have these two reviews:

1. "I love learning."
2. "Learning is fun."

**Vocabulary (with ngrams):**

With `CountVectorizer`'s default settings the text is lowercased and one-letter tokens such as "i" are dropped, so the vocabulary is:
- Unigrams: `["love", "learning", "is", "fun"]`
- Bigrams: `["love learning", "learning is", "is fun"]`
- Trigrams: `["learning is fun"]`

**Matrix:**

|                | love | learning | is | fun | love learning | learning is | is fun | learning is fun |
|----------------|------|----------|----|-----|---------------|-------------|--------|-----------------|
| Review 1:      | 1    | 1        | 0  | 0   | 1             | 0           | 0      | 0               |
| Review 2:      | 0    | 1        | 1  | 1   | 0             | 1           | 1      | 1               |

This is how your text is converted into numbers for machine learning models!
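
The two-review example can be checked directly. This sketch uses `CountVectorizer` with its default tokenizer; note that the default `token_pattern` only keeps tokens of two or more characters, so "i" does not appear in the learned vocabulary, and columns come out in alphabetical order.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["I love learning.", "Learning is fun."]

cv_demo = CountVectorizer(ngram_range=(1, 3))
counts = cv_demo.fit_transform(reviews)

# Order the vocabulary by column index so names line up with matrix columns.
features = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
dense = counts.toarray()

print(features)
print(dense)
```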




In [None]:
cv=CountVectorizer(min_df=0.0,max_df=1.0,binary=False,ngram_range=(1,3))
cv_train_reviews=cv.fit_transform(norm_train_reviews)
cv_test_reviews=cv.transform(norm_test_reviews)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)


BOW_cv_train: (40000, 7164332)
BOW_cv_test: (10000, 7164332)


# **Labeling the sentiment text**
```python
lb = LabelBinarizer()
```
- The `LabelBinarizer` is a utility that converts categorical labels (e.g., 'positive' and 'negative') into binary format:
  - One class will be represented as `1`.
  - The other class will be represented as `0`.

```python
sentiment_data = lb.fit_transform(imdb_df['sentiment'])
```
- **What it does**:
  - The `fit_transform()` method combines two actions:
    1. **`fit`**: Identifies the unique labels in the `sentiment` column (e.g., 'positive' and 'negative').
    2. **`transform`**: Converts each label into its binary representation.
  - For example:
    - If the sentiment column contains:
      ```python
      ['positive', 'negative', 'positive', 'negative']
      ```
      The binary representation will be:
      ```python
      [[1],
       [0],
       [1],
       [0]]
      ```

- **Extensibility**:
  - If the dataset contained more than two unique sentiment labels, `LabelBinarizer` would return one indicator column per class (a one-hot encoding) instead of a single 0/1 column.
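
A short sketch of `LabelBinarizer` on toy labels. The three-label case uses a hypothetical `'neutral'` class, which does not exist in this dataset, purely to show what the encoder returns when there are more than two classes.

```python
from sklearn.preprocessing import LabelBinarizer

# Two classes: a single 0/1 column; classes are sorted, so 'negative' -> 0.
lb_demo = LabelBinarizer()
binary = lb_demo.fit_transform(['positive', 'negative', 'positive', 'negative'])
print(lb_demo.classes_)
print(binary.ravel())

# Round trip back to the original string labels.
restored = lb_demo.inverse_transform(binary)
print(restored)

# Three classes (hypothetical 'neutral'): one indicator column per class.
multi = LabelBinarizer().fit_transform(['negative', 'neutral', 'positive'])
print(multi)
```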





In [None]:
#labeling the sentiment data
lb=LabelBinarizer()
#transformed sentiment data
sentiment_data=lb.fit_transform(imdb_df['sentiment'])
print(sentiment_data.shape)

(50000, 1)


### **How to Verify What `0` Means**
`lb.classes_` lists the labels in sorted order, and each label's index in that array is its encoded value:
- `'negative'` → `0`
- `'positive'` → `1`



In [None]:
print("Mapping: ", {i: sentiment for i, sentiment in enumerate(lb.classes_)})

Mapping:  {0: 'negative', 1: 'positive'}


## **Split the sentiment data**

In [None]:
#Splitting the sentiment data
train_sentiments=sentiment_data[:40000]
test_sentiments=sentiment_data[40000:]
print(train_sentiments)
print(test_sentiments)

[[1]
 [1]
 [1]
 ...
 [1]
 [0]
 [0]]
[[0]
 [0]
 [0]
 ...
 [0]
 [0]
 [0]]


## **Modelling the dataset using Logistic Regression Model**

```python
lr = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)
```
- **`LogisticRegression`**:
  - A supervised learning algorithm used for classification tasks, such as binary sentiment classification (e.g., positive vs. negative).
  - Logistic regression estimates the probability of each class. `predict` returns the most likely label (`1` or `0` here), while `predict_proba` returns the probabilities themselves.

- **Parameters**:
  - `penalty='l2'`: Adds **L2 regularization**, which penalizes large coefficients to reduce overfitting and help the model generalize.
  - `max_iter=500`: Sets the maximum number of iterations the solver may run. Raising it above the default of 100 gives the solver more room to converge on a feature matrix this large.
  - `C=1`: Inverse of regularization strength (smaller values mean stronger regularization). `1` is the scikit-learn default.
  - `random_state=42`: Sets a random seed for reproducibility. It only affects the `sag`, `saga`, and `liblinear` solvers, so it has no effect with the default `lbfgs` solver used here.
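
To make the relationship between probabilities and predicted labels concrete, here is a sketch on a tiny made-up one-feature dataset (the data is illustrative only). The `penalty` argument is left at its default, which is L2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: small feature values belong to class 0, large ones to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression(C=1, max_iter=500).fit(X, y)

proba = clf.predict_proba(X)   # one row per sample, one column per class
labels = clf.predict(X)        # the class with the highest probability

print(proba.round(3))
print(labels)
```

Each row of `proba` sums to 1, and `predict` simply picks the column with the larger value.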

In [None]:
# Define the Logistic Regression Model
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
# Fit the Model to the Training Data
lr_bow=lr.fit(cv_train_reviews,train_sentiments)
print(lr_bow)

LogisticRegression(C=1, max_iter=500, random_state=42)


In [None]:
#Predict test-set sentiments from the bag-of-words features
lr_bow_predict=lr.predict(cv_test_reviews)
print(lr_bow_predict)

[0 0 1 ... 1 0 0]


In [None]:
#Accuracy score for bag of words
lr_bow_score=accuracy_score(test_sentiments,lr_bow_predict)
print("lr_bow_score :",lr_bow_score)

lr_bow_score : 0.9


In [None]:
#Classification report for bag of words
# Labels are reported in sorted order (0 = negative, 1 = positive), so target_names must follow that order.
lr_bow_report=classification_report(test_sentiments,lr_bow_predict,target_names=['Negative','Positive'])
print(lr_bow_report)

              precision    recall  f1-score   support

    Negative       0.90      0.90      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



In [None]:
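#confusion matrix for bag of words
# Rows are actual labels and columns are predicted labels, both ordered [1 (positive), 0 (negative)].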
cm_bow=confusion_matrix(test_sentiments,lr_bow_predict,labels=[1,0])
print(cm_bow)

[[4515  492]
 [ 508 4485]]
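
The headline metrics can be recomputed by hand from the matrix printed above. This sketch copies the four counts from that output and assumes the scikit-learn layout (rows are actual labels, columns are predicted labels, both in the order `[1, 0]` passed via `labels`).

```python
# Counts copied from the printed matrix: rows = actual, columns = predicted, order [1, 0].
tp, fn = 4515, 492   # actual positive: predicted positive / predicted negative
fp, tn = 508, 4485   # actual negative: predicted positive / predicted negative

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of the reviews predicted positive, how many really are
recall_pos = tp / (tp + fn)      # of the truly positive reviews, how many were found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_pos:.4f}")
print(f"recall:    {recall_pos:.4f}")
```

The row sums also give the class supports: 5007 positive and 4993 negative reviews in the test set.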


In [None]:
# Create a DataFrame to combine reviews, actual sentiments, and predictions
results_df = pd.DataFrame({
    'Review': test_reviews.reset_index(drop=True),  # Reset index of test_reviews
    'Actual Sentiment': test_sentiments.flatten(),     # Ground truth labels (flattened)
    'Predicted Sentiment': lr_bow_predict    # Predictions made by the model
})

# View the first few rows of the DataFrame
print(results_df.head())

                                              Review  Actual Sentiment  \
0  First off I want to say that I lean liberal on...                 0   
1  I was excited to see a sitcom that would hopef...                 0   
2  When you look at the cover and read stuff abou...                 0   
3  Like many others, I counted on the appearance ...                 0   
4  This movie was on t.v the other day, and I did...                 0   

   Predicted Sentiment  
0                    0  
1                    0  
2                    1  
3                    0  
4                    0  


In [None]:
# Inspect a specific review (e.g., at index 0)
sample_index = 0
print("Review:", results_df.loc[sample_index, 'Review'])
print("Actual Sentiment:", results_df.loc[sample_index, 'Actual Sentiment'])
print("Predicted Sentiment:", results_df.loc[sample_index, 'Predicted Sentiment'])

Review: First off I want to say that I lean liberal on the political scale and I found the movie offensive. I managed to watch the whole doggone disgrace of a film . This movie brings a low to original ideas. Yes it was original thus my 2 stars instead of 1. Are our film writers that uncreative that they can only come up with this?? Acting was horrible , and the characters were unlikeable for the most part. The lead lady in the story had no good qualities at all. They made her bf into some sort of a bad guy and I did not see that at all. Maybe I missed something , I do not know.He was the most down to earth, relevant character in the movie. I did not shell out any money for this garbage. I almost wish PETA would come to the rescue of this awful, offensive movie and form a protest. DISGUSTING thats all I have to say anymore !
Actual Sentiment: 0
Predicted Sentiment: 0


## **Modelling the dataset using Random Forest Classifier Model**




In [None]:
from sklearn.ensemble import RandomForestClassifier
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
# Train the model on the Bag-of-Words data
rf.fit(cv_train_reviews, train_sentiments)
# Predict the sentiments for the test data
rf_bow_predict = rf.predict(cv_test_reviews)

In [None]:
#Accuracy score for bag of words
rf_bow_score=accuracy_score(test_sentiments,rf_bow_predict)
print("rf_bow_score :",rf_bow_score)


rf_bow_score : 0.8568


In [None]:
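#confusion matrix for bag of words
# Rows are actual labels and columns are predicted labels, both ordered [1 (positive), 0 (negative)].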
cm_bow=confusion_matrix(test_sentiments,rf_bow_predict,labels=[1,0])
print(cm_bow)

[[4447  560]
 [ 872 4121]]


In [None]:
# Create a DataFrame to combine reviews, actual sentiments, and predictions
results_df = pd.DataFrame({
    'Review': test_reviews.reset_index(drop=True),  # Reset index of test_reviews
    'Actual Sentiment': test_sentiments.flatten(),     # Ground truth labels (flattened)
    'Predicted Sentiment': rf_bow_predict    # Predictions made by the model
})

# View the first few rows of the DataFrame
print(results_df.head())

                                              Review  Actual Sentiment  \
0  First off I want to say that I lean liberal on...                 0   
1  I was excited to see a sitcom that would hopef...                 0   
2  When you look at the cover and read stuff abou...                 0   
3  Like many others, I counted on the appearance ...                 0   
4  This movie was on t.v the other day, and I did...                 0   

   Predicted Sentiment  
0                    0  
1                    0  
2                    1  
3                    0  
4                    0  


In [None]:
# Inspect a specific review (e.g., at index 10)
sample_index = 10
print("Review:", results_df.loc[sample_index, 'Review'])
print("Actual Sentiment:", results_df.loc[sample_index, 'Actual Sentiment'])
print("Predicted Sentiment:", results_df.loc[sample_index, 'Predicted Sentiment'])

Review: "Mr. Bug Goes To Town" was the last major achievement the Fleischer studios produced. The quality of the Superman series produced at the same time is evident in this extraordinary film.<br /><br />The music and lyrics by Frank Loesser and Hoagy Carmichael (with assistance by Flieshcer veteran Sammy Timberg are quite good, but not as much as the scoring of the picture by Leigh Harline who also scored Snow White for Disney. Harline's "atmospheric music" is superb, and a treat for the ears.<br /><br />The layout and staging of the picture was years ahead of it's time, and once again the Fleischer's background artists outdid themselves. The techincolored beauty of the film cannot be denied, and while Hoppity the grasshopper is the star, the characters of Swat the Fly and Smack the Mosquito steal the picture. Swat's voicing by Jack Mercer (of Popeye fame) is priceless. Kenny Gardner (brother-in-law) of Guy Lombardo...and a featured vocalist in his band...does his usual pleasant job 