#  Naive Bayes Classification for Sentiment Analysis on IMDB Movie Reviews

**Description:**

In this notebook, we explore the power of Naive Bayes for sentiment analysis, a fundamental task in natural language processing (NLP). We'll dive into the world of movie reviews, using the popular IMDB dataset to train and evaluate two variations of the Naive Bayes classifier:

* Multinomial Naive Bayes: This model is well-suited for text classification where word frequency matters. We'll see how it performs in distinguishing positive and negative movie reviews based on the occurrence of words.
* Bernoulli Naive Bayes: This model focuses on the presence or absence of words, making it a good choice when the mere existence of certain words is indicative of sentiment. We'll compare its performance to the Multinomial variant.

**Key Highlights:**

* Data Preprocessing: We'll walk through essential steps to clean and prepare the text data, including tokenization, stop word removal, and handling special characters. Understanding these steps is crucial for any NLP project.
* Feature Extraction (Vectorization): We'll demonstrate how to convert raw text into numerical representations that Naive Bayes can understand. We'll use techniques like CountVectorizer to create feature vectors based on word frequencies.
* Model Training and Evaluation: We'll train both Multinomial and Bernoulli Naive Bayes models on the IMDB dataset and evaluate their performance using metrics like accuracy, precision, recall, and F1-score. We'll discuss the strengths and weaknesses of each model.
* Comparison and Insights: We'll analyze the results, comparing the two Naive Bayes variations and drawing insights into their suitability for sentiment analysis tasks.


**Who Should Read This:**

This notebook is perfect for:

* Beginners in NLP: If you're new to natural language processing, this is a great starting point to understand fundamental concepts and build a practical classifier.
* Anyone interested in Sentiment Analysis: Whether you're a data scientist, marketer, or simply curious about how machines understand emotions in text, this notebook offers a hands-on approach to sentiment analysis.
* Fans of Naive Bayes: If you want to see how this simple yet effective algorithm can be applied to real-world text data, this is the notebook for you.

**Let's get started and uncover the sentiment hidden within movie reviews!**

****

**Import the required libraries**

In [2]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os


# Preprocessing data

**1. Reading dataset**

In [3]:
data_dir = '/kaggle/input/imdb-review/aclImdb'

In [4]:
folders = [
    ('train', 'neg', 'negative'),
    ('train', 'pos', 'positive'),
    ('test', 'neg', 'negative'),
    ('test', 'pos', 'positive')
]
data = []

In [5]:
for split, sentiment, label in folders:
    folder_path = os.path.join(data_dir, split, sentiment)
    for filename in os.listdir(folder_path):
        with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
            text = file.read()
            data.append({ 'split': split, 'label': label,'text': text})
df = pd.DataFrame(data)


Data info

In [6]:
print(df.tail())

      split     label                                               text
49995  test  positive  typically, a movie can have factors like "arou...
49996  test  positive  Like the first film in this series (SLAUGHTER,...
49997  test  positive  The problem with so many people watching this ...
49998  test  positive  Not a bad MOW. I was expecting another film ba...
49999  test  positive  Despite myself, I really kinda like this movie...


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   split   50000 non-null  object
 1   label   50000 non-null  object
 2   text    50000 non-null  object
dtypes: object(3)
memory usage: 1.1+ MB


**Here are some samples of negative reviews in the test set**

In [8]:
test_negative_reviews = df[(df['split'] == 'test') & (df['label'] == 'negative')]

In [9]:
test_negative_reviews.head()

|  | split | label | text |
|---|---|---|---|
| 25000 | test | negative | Committed doom and gloomer Peter Watkins goes ... |
| 25001 | test | negative | Most critics have written devastating about th... |
| 25002 | test | negative | Did I waste my time. This is very pretentious ... |
| 25003 | test | negative | What a stinker!!! I swear this movie was writt... |
| 25004 | test | negative | Ever had one of those nights when you couldn't... |


**2. Split the data into training and test sets**

In [10]:
from sklearn.model_selection import train_test_split
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

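`test_size=0.2` holds out 20% of the data as the test set; the remaining 80% is the training set.
`random_state=1` fixes the random seed, so the same split is produced every time the cell is run.
Note that this split is drawn from all 50,000 reviews and does not use the dataset's original `split` column.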
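
As a minimal sketch of the reproducibility claim, the snippet below splits a small toy list twice with the same `random_state` and checks that both test sets match. The toy `texts` and `labels` are made up for illustration and stand in for the review data. The `stratify` argument shown at the end is optional and is not used in this notebook; it keeps the class ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the review texts and labels (not the IMDB data).
texts = [f"review {i}" for i in range(10)]
labels = ["positive", "negative"] * 5

# Same random_state -> identical split on every call.
split_a = train_test_split(texts, labels, test_size=0.2, random_state=1)
split_b = train_test_split(texts, labels, test_size=0.2, random_state=1)
same_split = split_a[1] == split_b[1]  # compare the two test sets

# Optional: stratify keeps the class ratio equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=1, stratify=labels
)
print(same_split, len(X_tr), len(X_te), sorted(y_te))
```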
 

In [11]:
pd.DataFrame({'text':X_train,'Label':y_train}).head(10)

|  | text | Label |
|---|---|---|
| 18165 | Roman Polanski plays Trelkovsky who rents an a... | positive |
| 36059 | I saw this movie last night at the Berlinale a... | negative |
| 13242 | This program is a lot of fun and the title son... | positive |
| 32985 | I love the first and third Beastmasters, but t... | negative |
| 41133 | Ghoulies IV may not be the best out of the ser... | positive |
| 9273 | In the small American town of Meadowvale Dr. A... | negative |
| 12784 | this movie probably had a $750 budget, and sti... | positive |
| 43992 | Great just great! The West Coast got "Dirty" H... | positive |
| 33452 | When I first saw this movie, I said to myself,... | negative |
| 6342 | Masters of Horror: The Screwfly Solution start... | negative |


**3. Transform text into a feature matrix**

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

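**How does it work?**

**+ fit_transform**

`CountVectorizer` first builds a vocabulary from all the words in the given set: a dictionary whose keys are words and whose values are the column index assigned to each word. It then counts how many times each word occurs in each document and returns those counts as a matrix.
**Example: if the document is "Hello world, this is a beautiful world",
the vocabulary will be {"beautiful": 0, "hello": 1, "is": 2, "this": 3, "world": 4}
and the count row will be [1, 1, 1, 1, 2]. With default settings the text is lowercased and one-character tokens such as "a" are dropped.**

These are the first 10 entries of the vocabulary (word : column index):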

In [13]:
cnt = 0
for word,i in vectorizer.vocabulary_.items():
    print(f'{word} : {i}')
    cnt +=1
    if(cnt==10): break

roman : 69798
polanski : 62985
plays : 62640
trelkovsky : 84294
who : 90343
rents : 68110
an : 3970
apartment : 4711
in : 40845
france : 31869


**+ transform**

Creates a vector with one entry per word in the vocabulary. Each entry holds the number of times that word occurs in the new text, or 0 if it does not occur. Words that are not in the vocabulary are ignored.

In [14]:
new_sentence = ["in an apartment"]  # Wrap the string in a list
new_sentence_vector = vectorizer.transform(new_sentence).toarray()
print(new_sentence_vector)


[[0 0 0 ... 0 0 0]]


`transform` returns a sparse matrix, converted here to a dense array with `.toarray()`. The entries for the words in the new sentence hold their counts (1 each in this case) and all other entries are 0. These are the non-zero elements of the vector:

In [15]:
nonZero = new_sentence_vector[new_sentence_vector !=0 ]
print(nonZero)

[1 1 1]
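
A slightly different toy input shows two details that the sentence above does not exercise: a repeated word gets a count above 1, and a word that was never seen during fitting is silently dropped. The sketch fits a throwaway vectorizer on one made-up sentence; `vec` and the token `zzz` are illustrative only. It also prints the sparse row's `indices` and `data` arrays, which are the fields the prediction loop later in the notebook iterates over.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["in an apartment in france"])

# "in" appears twice; "zzz" is not in the fitted vocabulary.
row = vec.transform(["in an apartment in zzz"])

# Map column index -> count for the non-zero entries of the sparse row.
nonzero = {int(i): int(n) for i, n in zip(row.indices, row.data)}

print(vec.vocabulary_)
print(row.toarray())
print(nonzero)
```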


# 2. Naive Bayes model training


**In the Naive Bayes algorithm, we classify a sentence x as positive (p) or negative (n) by computing the posterior probabilities P(p|x) and P(n|x) and comparing them: if P(p|x) > P(n|x), the sentence is positive; otherwise it is negative.**


To calculate the probability P(p|x) (*P(n|x) is calculated similarly*), we use Bayes' theorem:

* P(p|x) = P(x|p) * P(p) / P(x)  or, in general terms:
     P(a|b) = P(b|a) * P(a) / P(b)
* Therefore, the goal of the model is to calculate P(x|p), P(p), and P(x).
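
To make the formula concrete, here is a worked example with made-up numbers; the priors and likelihoods below are hypothetical and are not estimated from the IMDB data. It computes P(x) with the law of total probability and then applies Bayes' theorem to both classes.

```python
# Hypothetical numbers chosen for illustration only.
p_pos = 0.5            # prior P(p)
p_neg = 0.5            # prior P(n)
p_x_given_pos = 0.02   # likelihood P(x|p)
p_x_given_neg = 0.005  # likelihood P(x|n)

# Law of total probability: P(x) = P(x|p)P(p) + P(x|n)P(n)
p_x = p_x_given_pos * p_pos + p_x_given_neg * p_neg

# Bayes' theorem for each class.
post_pos = p_x_given_pos * p_pos / p_x
post_neg = p_x_given_neg * p_neg / p_x

prediction = "positive" if post_pos > post_neg else "negative"
print(f"P(p|x) = {post_pos:.2f}, P(n|x) = {post_neg:.2f} -> {prediction}")
```

Both posteriors share the same denominator P(x), so comparing the numerators alone would give the same prediction.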

# Key Concepts of the Naive Bayes Algorithm in Sentiment Classification (Positive/Negative)

1. Step 1. Classification: For a given sentence "x", the algorithm calculates the probability that it is positive, P(p|x), and the probability that it is negative, P(n|x). The sentence is classified as positive if P(p|x) > P(n|x), and negative otherwise.
2. Step 2. Training: During the training phase, the model pre-computes the values it needs before it sees the sentence to be classified. To calculate P(p|x) in step 1, Bayes' theorem is applied: P(p|x) = P(x|p) * P(p) / P(x). The prior probability P(p) can be computed beforehand as the fraction of training reviews that are positive. To make P(x|p) cheap to compute for future sentences, the training phase also estimates, for every word in the vocabulary, the probability of that word given the positive (p) or negative (n) class. These probabilities are stored in a likelihood dictionary.

1. Note 1: Likelihood Smoothing: In the likelihood dictionary, some words may not appear in any positive (p) sentences, resulting in a zero probability when calculating their likelihood in the positive class. To prevent this, we add a smoothing factor (typically 1) to all word counts. Since each word receives this +1 smoothing, the total number of smoothing additions equals the vocabulary size. Therefore, the denominator of the probability calculation must also be increased by the vocabulary size.

2. Note 2: Logarithmic Calculation: To calculate the probability of a sentence belonging to class (p) or (n), the standard formula is P(x|p) * P(p) / P(x). The prior probability P(p) is pre-computed and stored in a dictionary. A useful trick is to use the logarithm of the likelihood values (log(likelihood)) to avoid underflow issues when multiplying very small probabilities. Additionally, logarithms transform the complex multiplication of probabilities into a simpler addition of log-probabilities. The constant P(x) is the same for all classes and can be omitted from the comparison. We only need to calculate P(x|p) * P(p) using log addition.
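
Both notes can be illustrated with a few lines of plain Python. The word counts below are toy numbers invented for this sketch, and the 200 repeated factors of 0.001 are an arbitrary stand-in for a long review made of rare words.

```python
import math

# Hypothetical word counts in the positive class (toy numbers).
pos_counts = {"good": 3, "bad": 0, "boring": 0}
vocab_size = len(pos_counts)
total = sum(pos_counts.values())

# Laplace smoothing: +1 per word, +vocab_size in the denominator.
smoothed = {w: (n + 1) / (total + vocab_size) for w, n in pos_counts.items()}

# Multiplying 200 small probabilities underflows to exactly 0.0 ...
product = 1.0
for _ in range(200):
    product *= 0.001

# ... while the sum of their logarithms stays finite and comparable.
log_sum = sum(math.log(0.001) for _ in range(200))

print(smoothed)
print(product, log_sum)
```

After smoothing, no word has probability zero and the probabilities still sum to 1, which is why the denominator grows by the vocabulary size.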

In [16]:
import numpy as np

def train_multinomial_nb(X_train, y_train):
    # After vectorization, X_train is a sparse count matrix with one row
    # per review in the training set and one column per vocabulary word.
    # Convert the labels to a NumPy array so boolean masks index rows safely.
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)

    # Priors: P(p) and P(n), the fraction of training reviews in each class.
    # priors is a dict whose keys are the class labels.
    priors = {c: np.sum(y_train == c) / len(y_train) for c in classes}

    # Likelihoods: under the naive (conditional independence) assumption,
    # P(x|p) is the product of P(xi|p) over the words xi in the review.
    # In this step we estimate P(xi|c) for every word in the vocabulary
    # and for each class c.
    likelihoods = {}
    for c in classes:

        # Example: c = 'positive'.
        # class_docs holds the count rows of all positive reviews;
        # its columns are the words in the vocabulary.
        class_docs = X_train[y_train == c]

        # Total number of word occurrences in the positive class.
        # Dividing the number of times word i appears in the class
        # by this total gives P(xi|p).
        total_words = class_docs.sum()

        # Some words never appear in the positive class, so their
        # probability would be 0 and log(0) would fail later.
        # Laplace smoothing adds 1 to every word count:
        #        class_docs.sum(axis=0) + 1
        # The denominator must grow by the same amount, which is one
        # per vocabulary word:
        #           total_words + X_train.shape[1]

        likelihoods[c] = (class_docs.sum(axis=0) + 1) / (total_words + X_train.shape[1])
    return priors, likelihoods
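
One way to sanity-check a hand-written estimate like this is to compare it with scikit-learn's `MultinomialNB`, which applies the same Laplace smoothing when `alpha=1.0`. The sketch below does so on a four-sentence toy corpus invented for illustration; it re-implements the smoothing formula inline rather than calling the notebook's function, so it runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (made up for this check, not the IMDB data).
toy_texts = ["great great film", "boring film", "great acting", "bad boring plot"]
toy_labels = np.array(["positive", "negative", "positive", "negative"])

X_toy = CountVectorizer().fit_transform(toy_texts)

# Hand-rolled estimate: (count + 1) / (total words in class + vocabulary size).
manual = {}
for c in np.unique(toy_labels):
    class_counts = np.asarray(X_toy[toy_labels == c].sum(axis=0)).ravel()
    manual[c] = (class_counts + 1) / (class_counts.sum() + X_toy.shape[1])

# scikit-learn's estimator with alpha=1.0 (Laplace smoothing).
clf = MultinomialNB(alpha=1.0).fit(X_toy, toy_labels)
matches = {
    c: bool(np.allclose(np.exp(row), manual[c]))
    for row, c in zip(clf.feature_log_prob_, clf.classes_)
}
print(matches)
```

`feature_log_prob_` stores log-probabilities, hence the `np.exp` before comparing.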


**Evaluate model with test set**

In [17]:
def evaluate_multinomial_nb(X_test, y_test, priors, likelihoods):
    y_pred = []
    for doc in X_test:
        # Multiplying many very small probabilities underflows toward
        # zero, so we add their logarithms instead of multiplying them.

        # probs holds one log score per class. Each score starts at the
        # log prior, log P(c); the log-likelihoods of the words in the
        # review are added below.
        probs = {c: np.log(priors[c]) for c in priors}

        # doc.indices: column indices of the words present in this review.
        # doc.data: how many times each of those words occurs.
        for i, count in zip(doc.indices, doc.data):
            for c in priors:
                probs[c] += np.log(likelihoods[c][0, i]) * count
        # Predict the class with the larger log score.
        y_pred.append(max(probs, key=probs.get))

    # Compare as NumPy arrays so the check is element-wise by position.
    correct = np.sum(np.asarray(y_pred) == np.asarray(y_test))
    accuracy = correct / len(y_test)
    return accuracy
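
The per-document loop computes, for each class, the log prior plus a count-weighted sum of log-likelihoods. The same scores can be obtained for all documents at once with a single sparse matrix product. The sketch below shows the idea on a hypothetical two-class, three-word model; all numbers and names (`log_likelihood`, `X_docs`) are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy model: 2 classes, 3 vocabulary words.
classes = np.array(["negative", "positive"])
log_prior = np.log(np.array([0.5, 0.5]))
log_likelihood = np.log(np.array([
    [0.6, 0.3, 0.1],   # P(word | negative)
    [0.1, 0.3, 0.6],   # P(word | positive)
]))

# Two documents as sparse count rows.
X_docs = csr_matrix(np.array([[2, 1, 0], [0, 1, 3]]))

# scores[d, c] = log P(c) + sum_i count[d, i] * log P(word_i | c)
scores = X_docs @ log_likelihood.T + log_prior
y_pred = classes[np.argmax(scores, axis=1)]

y_true = np.array(["negative", "positive"])
accuracy = float(np.mean(y_pred == y_true))
print(y_pred, accuracy)
```

Applied to this notebook, `log_likelihood` could presumably be built by stacking `np.log(np.asarray(likelihoods[c]).ravel())` for each class. That adaptation is an untested assumption about the shapes involved.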

**Predict the sentiment of a new review**


In [18]:
def predict_sentiment(review, vectorizer, priors, likelihoods):
    X_new = vectorizer.transform([review])
    
    # Initialize the dictionary of log scores, one per class.
    # Each class starts at the logarithm of its prior probability.
    # Working with logarithms avoids underflow when combining
    # very small probabilities.
    probs = {c: np.log(priors[c]) for c in priors}

    # X_new.data: the non-zero values (word counts) of the sparse matrix.
    # X_new.indices: the column index of each of those values, that is,
    # which vocabulary word each count belongs to.
    for i, count in zip(X_new.indices, X_new.data):
        for c in priors:
            
            probs[c] += np.log(likelihoods[c][0, i]) * count
    predicted_class = max(probs, key=probs.get)
    return predicted_class

In [19]:

priors, likelihoods = train_multinomial_nb(X_train_counts, y_train)
accuracy = evaluate_multinomial_nb(X_test_counts, y_test, priors, likelihoods)
print(f"Accuracy on test set: {accuracy:.2f}")



Accuracy on test set: 0.84
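
Accuracy is only one of the metrics named in the introduction. Precision, recall, and F1 can be computed with `sklearn.metrics` once the predicted labels are available. The evaluation function above returns only the accuracy, so the sketch below uses small made-up label lists; using it on the real test set would assume the function is changed to also return `y_pred`.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up true and predicted labels for illustration.
y_true_toy = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred_toy = ["positive", "positive", "negative", "negative", "negative", "positive"]

# Treat "positive" as the class of interest for the binary metrics.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, pos_label="positive", average="binary"
)
acc = accuracy_score(y_true_toy, y_pred_toy)

print(f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```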


# **Testing on new reviews**

In [26]:
# Predict new review, feel free to test your review
new_review = '''"The Grand Illusion" is a masterpiece of cinematic storytelling. The film's deliberate pacing allows for nuanced character development and the exploration of complex themes. The performances are superb, and the cinematography is visually stunning. The film's emotional impact is profound, leaving a lasting impression on the viewer.'''
predicted_sentiment = predict_sentiment(new_review, vectorizer, priors, likelihoods)
print(f"Predicted sentiment: {predicted_sentiment}")

Predicted sentiment: positive


In [27]:
# Predict new review, feel free to test your review
new_review = '''Overhyped and underwhelming, "The Grand Illusion" is a disappointment. The film's dialogue is stilted, and the acting is uninspired. The cinematography is unoriginal, relying on tired tropes and cliches. The film's attempt to tackle complex social issues is superficial and ultimately fails to deliver any meaningful insights.'''
predicted_sentiment = predict_sentiment(new_review, vectorizer, priors, likelihoods)
print(f"Predicted sentiment: {predicted_sentiment}")

Predicted sentiment: negative
