# Sentiment analysis of IMDB comments using a Bayesian classifier (from scratch)

![IMDB Logo](https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/IMDB_Logo_2016.svg/250px-IMDB_Logo_2016.svg.png)

**What's IMDB?**

### From wikipedia (https://en.wikipedia.org/wiki/IMDb):

> "IMDb (also known as the Internet Movie Database) is an online database, owned by Amazon, of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. An additional fan feature, message boards, was abandoned in February 2017. Originally a fan-operated website, the database is owned and operated by IMDb.com, Inc., a subsidiary of Amazon. As of January 2020, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users."

## The task:

### Given a set of texts and their classification (positive or negative), build a Bayesian classifier without any machine learning framework or library (sklearn, keras, etc.)... ok ok, I'll use pandas... but it's basic, ok?

## Attention: This work has a purely didactic purpose

## 1st ask:
### Please: vote positively if you liked this work

![upvote](https://i.imgflip.com/39skgl.jpg)


## 2nd ask:
### Please: comment critically (doubts or suggestions) if you found something that can be improved in this work


## Table of Contents

### 1. [Probability *super quick* review](#quickreview)
### 2. [Exploratory Data Analysis](#eda)
### 3. [Data preparation](#datapreparation)
### 4. [Naive bayes method](#bayes)


# Probability *super quick* review  <a name="quickreview"></a>

**Q1: What is the simplest definition of probability?**

**A**: It is a number that varies between 0 and 1 that indicates the ratio between the **number of elements in the desired outcome** AND the **total number of outcomes**
e.g.:

* Given a 6-sided "fair die": what is the probability that in 1 throw it will result in number 3?

* Possible results: {1, 2, 3, 4, 5, 6} (6 elements)
* Desired outcome: {3} (1 element)
* Answer: 1 outcome in 6 possible outcomes or 1/6 or ~ 0.167

**Q2: What is conditional probability?**

**A**: It is the probability of an outcome GIVEN that a related event is known to have occurred

e.g.: What is the probability of the outcome being 3 GIVEN you know that the result is odd?

* Possible results: {1, 2, 3, 4, 5, 6} (6 elements)
* Possible odd results: {1, 3, 5}
* Desired outcome: {3} (1 element)
* Answer: 1 outcome in 3 odd results or 1/3 or ~ 0.333
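
The two die examples above can be checked by brute-force enumeration. The snippet below is only a minimal sketch: the helper name `prob` is an assumption introduced for illustration and is not used anywhere else in this notebook.

```python
from fractions import Fraction

# Sample space of a fair 6-sided die
outcomes = {1, 2, 3, 4, 5, 6}

def prob(event, space):
    """Probability of `event` within `space`, assuming equally likely outcomes."""
    return Fraction(len(event & space), len(space))

# Q1: probability of rolling a 3
p_three = prob({3}, outcomes)

# Q2: probability of rolling a 3 GIVEN that the result is odd
odd = {n for n in outcomes if n % 2 == 1}
p_three_given_odd = prob({3}, odd)

print(p_three)            # 1/6
print(p_three_given_odd)  # 1/3
```

Conditioning simply shrinks the sample space from all 6 outcomes to the 3 odd ones.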

**Formally:**

#### $ p(A | B) = p(A \cap B)/p(B)$
or 
#### $ p(B | A) = p(A \cap B)/p(A)$


**Q3: How do we get from the definition of conditional probability to Bayes' theorem?**

**A**: Like this:

#### $ p(A | B) = p(A \cap B)/p(B)$ => $ p(A | B).p(B) = p(A \cap B)$

and

#### $ p(B | A) = p(A \cap B)/p(A)$ => $ p(B | A).p(A) = p(A \cap B)$

so...

#### $ p(A | B).p(B) = p(B | A).p(A)$

or...

###  $p(A | B)= \frac{p(B | A).p(A)}{p(B)}$
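
As a quick sanity check, the formula can be applied to the die example from Q2, taking A = "the result is 3" and B = "the result is odd". This is just an illustrative sketch; the variable names are assumptions chosen here.

```python
from fractions import Fraction

p_a = Fraction(1, 6)          # p(A): the die shows 3
p_b = Fraction(3, 6)          # p(B): the die shows an odd number
p_b_given_a = Fraction(1, 1)  # p(B | A): if the die shows 3, it is certainly odd

# Bayes: p(A | B) = p(B | A).p(A) / p(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 1/3
```

The result matches the 1/3 obtained by counting the odd outcomes directly.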


And that's enough to get the job started ...

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

pd.options.display.max_colwidth = 150

In [None]:
# graphics imports
import plotly
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from wordcloud import WordCloud,STOPWORDS

# Natural language tool kits
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# download stopwords
nltk.download('stopwords')

# string operations
import string 
import re

# general imports
import math

In [None]:
# load data
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df.head()

# Exploratory Data Analysis  <a name="eda"></a>

## 1st question: what is the length of the reviews?

In [None]:

lens = df['review'].str.len()

fig = go.Figure()
fig.add_trace(
    go.Histogram(x=lens, xbins=dict(size=200))
    )
fig.update_layout(title='Length of reviews', 
                    xaxis_title="Length",
                    yaxis_title="# of reviews")
plotly.offline.iplot(fig)

## 2nd question: what is the length of the reviews per outcome?

In [None]:
poslens = df[df['sentiment']=='positive']['review'].str.len()
neglens = df[df['sentiment']=='negative']['review'].str.len()
fig = go.Figure()
fig.add_trace(
    go.Histogram(x=poslens, xbins=dict(size=200), name='positive'),
    )
fig.add_trace(
    go.Histogram(x=neglens, xbins=dict(size=200), name='negative'),
    )
fig.update_layout(title='Length of reviews', 
                    xaxis_title="Length",
                    yaxis_title="# of reviews",)
plotly.offline.iplot(fig)

## 3rd question: what will we see in the word cloud? (review content)


In [None]:
df_pos = df[df['sentiment']=='positive']['review']

wordcloud1 = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      height=2000
                      ).generate(" ".join(df_pos))

plt.figure(1,figsize=(15, 15))
plt.imshow(wordcloud1)
plt.axis('off')
plt.show()

In [None]:
df_neg = df[df['sentiment']=='negative']['review']

wordcloud1 = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      height=2000
                      ).generate(" ".join(df_neg))

plt.figure(1,figsize=(15, 15))
plt.imshow(wordcloud1)
plt.axis('off')
plt.show()

## 4th question: How balanced is the dataset?

In [None]:
# the text mode is enough...
df['sentiment'].value_counts()

# Data preparation (cleaning, transformation...)<a name="datapreparation"></a>

## Data cleaning: removing unnecessary items from the text


**Q: What are the unnecessary items?**

**A:** Stopwords, punctuation, numbers, HTML tags
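
To make the idea concrete, here is a small standard-library sketch that strips those four kinds of items from a sentence. The function name `toy_clean` and the tiny stopword set are hypothetical and exist only for this illustration; the notebook itself uses the NLTK stopword list.

```python
import re
import string

# Hypothetical, tiny stopword list used only for this illustration
toy_stopwords = {"the", "is", "a", "there", "near"}

def toy_clean(text):
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)   # drop html tags
    text = re.sub(r"\d+", " ", text)     # drop numbers
    # drop punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # drop stopwords
    words = [w for w in text.split() if w not in toy_stopwords]
    return " ".join(words)

print(toy_clean("There is a tree near <br/> the river 123! See"))  # tree river see
```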

In [None]:
# show the reviews again... 
df[['review']].head(20)

### Step 1: Transform to lowercase

In [None]:
df['review_lw'] = df['review'].str.lower()
df[['review','review_lw']].head(10)

### Step 2: remove stopwords 'n punctuation

In [None]:
sw = stopwords.words('english')

print(f'Stopwords sample: {sw[0:10]}')
print(f'Number of stopwords: {len(sw)}')

In [None]:
print(f'Punctuation {string.punctuation}')

In [None]:
def transform_text(s):
    
    # remove html
    html=re.compile(r'<.*?>')
    s = html.sub(r'',s)
    
    # remove numbers
    s = re.sub(r'\d+', '', s)
    
    # remove punctuation
    # remove stopwords
    tokens = nltk.word_tokenize(s)
    
    new_string = []
    for w in tokens:
        # keep only words longer than 2 characters that are not stopwords
        if len(w) > 2 and w not in sw:
            new_string.append(w)
    
    
    
    s = ' '.join(new_string)
    s = s.strip()

    exclude = set(string.punctuation)
    s = ''.join(ch for ch in s if ch not in exclude)
    
    return s.strip()

In [None]:
transform_text('there is a tree near <br/> the river 123! see')

In [None]:
df['review_sw'] = df['review_lw'].apply(transform_text)
df[['review','review_lw', 'review_sw']].head(20)

### Step 3: lemmatizer

In [None]:
lemmatizer = WordNetLemmatizer() 

print(lemmatizer.lemmatize("rocks", pos="v"))
print(lemmatizer.lemmatize("gone", pos="v"))

In [None]:
def lemmatizer_text(s):
    tokens = nltk.word_tokenize(s)
    
    new_string = []
    for w in tokens:
        lem = lemmatizer.lemmatize(w, pos="v")
        # keep the lemma only if it is longer than 2 characters
        if len(lem) > 2:
            new_string.append(lem)
    
    s = ' '.join(new_string)
    return s.strip()

In [None]:
df['review_lm'] = df['review_sw'].apply(lemmatizer_text)
df[['review','review_lw', 'review_sw', 'review_lm']].head(20)

## Now I wanna see the word cloud again, this time with the cleaned text

In [None]:
df_pos = df[df['sentiment']=='positive']['review_lm']

wordcloud1 = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      height=2000
                      ).generate(" ".join(df_pos))

plt.figure(1,figsize=(15, 15))
plt.imshow(wordcloud1)
plt.axis('off')
plt.show()

In [None]:
df_neg = df[df['sentiment']=='negative']['review_lm']

wordcloud1 = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      height=2000
                      ).generate(" ".join(df_neg))

plt.figure(1,figsize=(15, 15))
plt.imshow(wordcloud1)
plt.axis('off')
plt.show()

# Naive bayes method <a name="bayes"></a>


![](https://timtyler.org/bayesianism/graphics/bayes_theorem.jpg)

### The classifier below was inspired by the video below ... watch! it is very cool!!! (Thanks statquest!)


[![](http://img.youtube.com/vi/O2L2Uv9pdDA/0.jpg)](http://www.youtube.com/watch?v=O2L2Uv9pdDA "Naive Bayes...clearly explained!!!")

We will now calculate the probability of each review being positive or negative using the naive Bayes method. According to the formula in section 1 we have:


## $p(A | B)= \frac{p(B | A).p(A)}{p(B)}$

Assuming the words are conditionally independent given the class (this is the "naive" part), we can transform the above formula into the format ...

## $p(positive | w_1, w_2, ...w_n)= \frac{p(w_1 | positive).p(w_2 | positive)....p(w_n | positive).p(positive)}{p(w_1, w_2, ...w_n)}$

... in the same way

## $p(negative | w_1, w_2, ...w_n)= \frac{p(w_1 | negative).p(w_2 | negative)....p(w_n | negative).p(negative)}{p(w_1, w_2, ...w_n)}$

Where $w_1, w_2, ...w_n$ are the words of the review.

Note that the denominator is the same for both probabilities ... so we can drop it to save computation. Thus, instead of calculating the probabilities themselves, we will calculate a score that is proportional to each probability:


### $p(positive | w_1, w_2, ...w_n) \: \propto \: p(w_1 | positive).p(w_2 | positive)....p(w_n | positive).p(positive)$

### $p(negative | w_1, w_2, ...w_n) \: \propto \: p(w_1 | negative).p(w_2 | negative)....p(w_n | negative).p(negative)$


As we are going to work with really, really small numbers, it is convenient to work with their logarithms, which turns the multiplication into a sum. The constant $C = -log(p(w_1, w_2, ...w_n))$ comes from the denominator we dropped and is the same for both classes, so it does not affect the comparison:

### $log(p(positive | w_1, w_2, ...w_n)) = log(p(w_1 | positive)) + log(p(w_2 | positive)) + ... + log(p(w_n | positive)) + log(p(positive)) + C$

### $log(p(negative | w_1, w_2, ...w_n)) = log(p(w_1 | negative)) + log(p(w_2 | negative)) + ... + log(p(w_n | negative)) + log(p(negative)) + C$
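
Why bother with logarithms? The sketch below, using made-up per-word probabilities (an assumption for illustration only), shows that multiplying a few hundred small numbers underflows to zero in floating point, while the sum of their logs stays perfectly usable.

```python
import math

# Hypothetical per-word likelihoods: 400 words, each with probability 0.001
word_probs = [1e-3] * 400

product = 1.0
for p in word_probs:
    product *= p  # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in word_probs)  # stays a finite number

print(product)  # 0.0
print(log_sum)  # roughly -2763.1
```

With a product of 0.0 for both classes there would be nothing left to compare, whereas the log scores remain distinguishable.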


**Question: How do we know if a comment is positive or negative?**

**Answer:** Simple...

if $log(p(positive | w_1, w_2, ...w_n)) > log(p(negative | w_1, w_2, ...w_n))$ it is a positive review, otherwise... it is a negative review
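
The decision rule can be sketched with a handful of invented numbers. Everything in the snippet (the probability table, `log_score`, `classify`) is hypothetical and only illustrates the comparison; the real values are estimated from the training data further down.

```python
import math

# Hypothetical conditional probabilities p(word | class), for illustration only
p_word = {
    "positive": {"great": 0.05, "boring": 0.005, "plot": 0.02},
    "negative": {"great": 0.01, "boring": 0.04, "plot": 0.02},
}
p_class = {"positive": 0.5, "negative": 0.5}

def log_score(words, label):
    """log p(label) plus the sum of log p(word | label)."""
    return math.log(p_class[label]) + sum(math.log(p_word[label][w]) for w in words)

def classify(words):
    if log_score(words, "positive") > log_score(words, "negative"):
        return "positive"
    return "negative"

print(classify(["great", "plot"]))   # positive
print(classify(["boring", "plot"]))  # negative
```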


The classifier code is adapted from the reference below (it has been simplified for teaching purposes):

ref.: https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/

## Train and test dataset

Now... we will split the dataset into 2 parts: train and test. As we know, the dataset is balanced (50% for each outcome), so we will take the first 70% of each outcome for training and the remaining 30% for testing.

In [None]:
# There are 25,000 reviews for each outcome, so we can use the first 17,500 (70%) for training and the remaining 7,500 (30%) for testing

# Train dataset (first 17,500 rows of each outcome)
pos_train = df[df['sentiment']=='positive'][['review_lm', 'sentiment']].head(17500)
neg_train = df[df['sentiment']=='negative'][['review_lm', 'sentiment']].head(17500)


# Test dataset (last 7,500 rows of each outcome)
pos_test = df[df['sentiment']=='positive'][['review_lm', 'sentiment']].tail(7500)
neg_test = df[df['sentiment']=='negative'][['review_lm', 'sentiment']].tail(7500)

# put it all together again and shuffle the rows
train_df = pd.concat([pos_train, neg_train]).sample(frac = 1).reset_index(drop=True)
test_df = pd.concat([pos_test, neg_test]).sample(frac = 1).reset_index(drop=True)


In [None]:
train_df.head()

In [None]:
test_df.head()

## The fit method

Now it's time to implement the method to "train" the model.
Thanks again https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/
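
Before the full method, here is a toy sketch of what "training" boils down to: counting how often each word appears in each class and collecting the vocabulary. The three reviews and the names `toy_train`, `toy_counts` and `toy_vocab` are made up for illustration, and `collections.Counter` stands in for the hand-written counting used below.

```python
from collections import Counter

# Hypothetical, already-cleaned training reviews
toy_train = [
    ("great movie love", "positive"),
    ("love great act", "positive"),
    ("bore movie waste", "negative"),
]

toy_counts = {"positive": Counter(), "negative": Counter()}
toy_vocab = set()
for text, label in toy_train:
    words = text.split()
    toy_counts[label].update(words)  # per-class word frequencies
    toy_vocab.update(words)          # every distinct word seen in training

print(toy_counts["positive"]["great"])  # 2
print(toy_counts["negative"]["great"])  # 0
print(len(toy_vocab))                   # 6
```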

In [None]:
def get_word_counts(words):
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0.0) + 1.0
    return word_counts

def fit(df_fit):
    num_messages = {}
    log_class_priors = {}
    word_counts = {}
    vocab = set()
 
    n = df_fit.shape[0]
    num_messages['positive'] = df_fit[df_fit['sentiment']=='positive'].shape[0]
    num_messages['negative'] = df_fit[df_fit['sentiment']=='negative'].shape[0]
    log_class_priors['positive'] = math.log(num_messages['positive'] / n)
    log_class_priors['negative'] = math.log(num_messages['negative'] / n)
    word_counts['positive'] = {}
    word_counts['negative'] = {}
 
    for x, y in zip(df_fit['review_lm'], df_fit['sentiment']):
        
        counts = get_word_counts(nltk.word_tokenize(x))
        for word, count in counts.items():
            if word not in vocab:
                vocab.add(word)
            if word not in word_counts[y]:
                word_counts[y][word] = 0.0
 
            word_counts[y][word] += count
    
    return word_counts, log_class_priors, vocab, num_messages

In [None]:
word_counts, log_class_priors, vocab, num_messages = fit(train_df)

### Let's see some results of fit method ...

In [None]:
word_count_df = pd.DataFrame(word_counts).fillna(0).sort_values(by='positive', ascending=False).reset_index()
word_count_df

In [None]:
# Let's see how some words are distributed
word_count_sample_df = word_count_df.head(5000)
fig = go.Figure(go.Scatter(
    x = word_count_sample_df['positive'],
    y = word_count_sample_df['negative'],
    text = word_count_sample_df['index'],
    mode='markers'
))
fig.update_layout(title='Word distribution sample', 
                xaxis_title="Positive word count",
                yaxis_title="Negative word count",)

plotly.offline.iplot(fig)         

## The predict method

Now let's create the method to predict whether the reviews are positive or negative

In [None]:
def predict(df_predict, vocab, word_counts, num_messages, log_class_priors):
    # total number of word occurrences per class: this is the denominator of the
    # Laplace-smoothed estimate of p(word | class)
    # (num_messages stays in the signature for compatibility but is not needed here)
    total_words = {label: sum(counts.values()) for label, counts in word_counts.items()}
    result = []
    for x in df_predict:
        counts = get_word_counts(nltk.word_tokenize(x))
        positive_score = 0
        negative_score = 0
        for word, _ in counts.items():
            # skip words never seen during training
            if word not in vocab: continue
            
            # add Laplace smoothing: (count + 1) / (words in class + vocabulary size)
            log_w_given_positive = math.log((word_counts['positive'].get(word, 0.0) + 1) / (total_words['positive'] + len(vocab)))
            log_w_given_negative = math.log((word_counts['negative'].get(word, 0.0) + 1) / (total_words['negative'] + len(vocab)))
 
            positive_score += log_w_given_positive
            negative_score += log_w_given_negative
 
        positive_score += log_class_priors['positive']
        negative_score += log_class_priors['negative']
 
        if positive_score > negative_score:
            result.append('positive')
        else:
            result.append('negative')
    return result

In [None]:
result = predict(test_df['review_lm'], vocab, word_counts, num_messages, log_class_priors)
result[0:10] # result sample...
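
A note on the Laplace (add-one) smoothing mentioned in the code: without it, a word that never appeared in one class would get probability 0 there, and `log(0)` is undefined. The sketch below shows the textbook form for word-count models, where the denominator is the total number of words in the class plus the vocabulary size. The counts, the vocabulary size and the helper `smoothed_log_prob` are hypothetical values chosen for illustration.

```python
import math

# Hypothetical word counts for one class and a hypothetical vocabulary size
class_counts = {"great": 3, "movie": 5}
vocab_size = 10
total_words = sum(class_counts.values())  # 8 words observed in this class

def smoothed_log_prob(word):
    """log of (count + 1) / (total words in class + vocabulary size)."""
    return math.log((class_counts.get(word, 0) + 1) / (total_words + vocab_size))

print(math.exp(smoothed_log_prob("great")))   # 4/18, about 0.222
print(math.exp(smoothed_log_prob("boring")))  # 1/18, about 0.056: unseen, yet not zero
```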

## Now let's finally measure the accuracy of the model ...

### Accuracy

In [None]:
y_true = test_df['sentiment'].tolist()

acc = sum(1 for i in range(len(y_true)) if result[i] == y_true[i]) / float(len(y_true))
print("{0:.4f}".format(acc))

### Confusion matrix

In [None]:
y_actu = pd.Series(y_true, name='Real')
y_pred = pd.Series(result, name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
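# normalise each row (real class) so that it sums to 100%; axis=0 aligns the row totals with the rows
df_confusion = df_confusion.div(df_confusion.sum(axis=1), axis=0) * 100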
df_confusion.round(2)
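
To see what accuracy and the row-normalised confusion matrix measure, here is a plain-Python sketch on five invented labels. All names and values are hypothetical; it is not the notebook's data.

```python
from collections import Counter

# Hypothetical true labels and predictions
toy_true = ["positive", "positive", "negative", "negative", "negative"]
toy_pred = ["positive", "negative", "negative", "negative", "positive"]

# Accuracy: share of predictions that match the real label
toy_acc = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

# Count (real, predicted) pairs, then normalise each row by its row total
pairs = Counter(zip(toy_true, toy_pred))
labels = ["negative", "positive"]
row_pct = {
    real: {
        pred: 100 * pairs[(real, pred)] / sum(pairs[(real, q)] for q in labels)
        for pred in labels
    }
    for real in labels
}

print(toy_acc)              # 0.6
print(row_pct["positive"])  # {'negative': 50.0, 'positive': 50.0}
```

Each row answers the question "of the reviews that really belong to this class, what percentage was predicted as each class?".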

In [None]:
def plot_confusion_matrix(df_confusion, title='Confusion matrix'):
    plt.matshow(df_confusion) # imshow
    #plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    #plt.tight_layout()
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)    

plot_confusion_matrix(df_confusion)

# Thank you!!!
![](https://i.pinimg.com/originals/2c/2e/ef/2c2eef8da1285d958914eef079f9b70c.jpg)