# Multinomial and Bernoulli Naive Bayes

To understand Multinomial and Bernoulli Naive Bayes, we will start with a small example and walk through the end-to-end process. In another notebook, we will build a full-fledged email spam classifier.

To start with, let's take a few sentences and classify them into two classes: education or cinema. Each sentence represents one document. In real-world cases, a document can be any piece of text, such as an email, a news article, a book review or a tweet. The analysis and the algorithm involved don't depend on the type of document we use.

The notebook is divided into the following sections:

- Importing and preprocessing data
- Preparing test data
- Building the model: Multinomial Naive Bayes
- Building the model: Bernoulli Naive Bayes

In [1]:
# importing necessary libraries

import numpy as np
import pandas as pd

import sklearn
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

### Importing and preprocessing data

In [2]:
# import train data

train_data = pd.read_csv('example_train.csv')
train_data.head()

Unnamed: 0,Document,Class
0,Upgrad is a great educational institution.,education
1,Educational greatness depends on ethics,education
2,A story of great ethics and educational greatness,education
3,Sholey is a great cinema,cinema
4,good movie depends on good story,cinema


In [3]:
# Map target variable into classes

train_data['Class'] = train_data['Class'].map({'cinema': 0, 'education': 1})

In [4]:
train_data.head()

Unnamed: 0,Document,Class
0,Upgrad is a great educational institution.,1
1,Educational greatness depends on ethics,1
2,A story of great ethics and educational greatness,1
3,Sholey is a great cinema,0
4,good movie depends on good story,0


In [5]:
# convert the Document and Class columns to numpy arrays (X_train, y_train)

X_train = np.array(train_data['Document'])
y_train = np.array(train_data['Class'])

In [6]:
X_train, y_train

(array(['Upgrad is a great educational institution.',
        'Educational greatness depends on ethics',
        'A story of great ethics and educational greatness',
        'Sholey is a great cinema', 'good movie depends on good story'],
       dtype=object),
 array([1, 1, 1, 0, 0], dtype=int64))

In [7]:
# fit the vectorizer on X_train (English stop words removed) and get the vocabulary: token -> column index

cv = CountVectorizer(stop_words='english')
X_train_tok = cv.fit(X_train).vocabulary_

In [8]:
X_train_tok

{'upgrad': 11,
 'great': 5,
 'educational': 2,
 'institution': 7,
 'greatness': 6,
 'depends': 1,
 'ethics': 3,
 'story': 10,
 'sholey': 9,
 'cinema': 0,
 'good': 4,
 'movie': 8}

In [9]:
X_train = cv.transform(X_train)
print(X_train)

  (0, 2)	1
  (0, 5)	1
  (0, 7)	1
  (0, 11)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	1
  (1, 6)	1
  (2, 2)	1
  (2, 3)	1
  (2, 5)	1
  (2, 6)	1
  (2, 10)	1
  (3, 0)	1
  (3, 5)	1
  (3, 9)	1
  (4, 1)	1
  (4, 4)	2
  (4, 8)	1
  (4, 10)	1
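The listing above is a sparse matrix: each `(i, j)  c` line says that document `i` contains vocabulary term `j` exactly `c` times. As a rough illustration of where those numbers come from, the sketch below rebuilds the same counts in plain Python. It is not scikit-learn's implementation. The `STOP_WORDS` set and the `tokenize` helper are assumptions made for this example (the real English stop list is much longer), chosen so that they give the same tokens on these five sentences.

```python
import re

# Training sentences copied from the notebook.
docs = [
    "Upgrad is a great educational institution.",
    "Educational greatness depends on ethics",
    "A story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]

# Assumption: a tiny stand-in stop list, just large enough for these sentences.
STOP_WORDS = {"is", "a", "on", "of", "and"}

def tokenize(text):
    # Lowercase, keep runs of two or more word characters, then drop stop words.
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]

tokens = [tokenize(d) for d in docs]

# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index.
vocab = {term: j for j, term in enumerate(sorted({t for doc in tokens for t in doc}))}

# Dense count matrix: one row per document, one column per vocabulary term.
counts = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(tokens):
    for t in doc:
        counts[i][vocab[t]] += 1

# Print only the non-zero cells, mirroring the (row, column) count listing.
for i, row in enumerate(counts):
    for j, c in enumerate(row):
        if c:
            print((i, j), c)
```

The only count larger than 1 is `(4, 4) 2`, because "good" appears twice in the last sentence. Multinomial Naive Bayes uses that repetition, while Bernoulli Naive Bayes only looks at whether a word is present.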


### Preparing test data

In [10]:
test_data = pd.read_csv('example_test.csv')
test_data.head()

Unnamed: 0,Document,Class
0,very good educational institution,education


In [11]:
# map target variable into integer classes

test_data['Class'] = test_data['Class'].map({'cinema': 0, 'education': 1})

In [12]:
test_data.head()

Unnamed: 0,Document,Class
0,very good educational institution,1


In [13]:
# compute X_test and y_test

X_test = np.array(test_data['Document'])
y_test = np.array(test_data['Class'])

In [14]:
X_test, y_test

(array(['very good educational institution'], dtype=object),
 array([1], dtype=int64))

In [15]:
X_test = cv.transform(X_test)
print(X_test)

  (0, 2)	1
  (0, 4)	1
  (0, 7)	1


### Building the model: Multinomial Naive Bayes

In [16]:
# build model

mnb = MultinomialNB()
mnb = mnb.fit(X_train, y_train)

In [17]:
# predict the probability

proba = mnb.predict_proba(X_test)
print("probability of test document belonging to class CINEMA" , proba[:,0])
print("probability of test document belonging to class EDUCATION" , proba[:,1])

probability of test document belonging to class CINEMA [0.32808399]
probability of test document belonging to class EDUCATION [0.67191601]
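To see where these two numbers come from, the posterior can be worked out by hand. The sketch below is a plain-Python illustration, not the library's code path. It assumes the defaults used above (Laplace smoothing with `alpha = 1` and class priors estimated from the training data), and it hardcodes the token lists that remain after stop-word removal. For each class it multiplies the prior by the smoothed word likelihoods (count of word in class + alpha) / (total words in class + alpha × vocabulary size), then normalises.

```python
from collections import Counter

# Tokens per training document after stop-word removal (1 = education, 0 = cinema).
train = [
    (["upgrad", "great", "educational", "institution"], 1),
    (["educational", "greatness", "depends", "ethics"], 1),
    (["story", "great", "ethics", "educational", "greatness"], 1),
    (["sholey", "great", "cinema"], 0),
    (["good", "movie", "depends", "good", "story"], 0),
]
test_tokens = ["good", "educational", "institution"]  # "very" is dropped as a stop word
alpha = 1.0  # Laplace smoothing (assumed default)

vocab = sorted({t for doc, _ in train for t in doc})
classes = sorted({label for _, label in train})

scores = {}
for c in classes:
    class_docs = [doc for doc, label in train if label == c]
    prior = len(class_docs) / len(train)          # P(class)
    word_counts = Counter(t for doc in class_docs for t in doc)
    total = sum(word_counts.values())             # total words in this class
    score = prior
    for t in test_tokens:
        # Smoothed P(word | class)
        score *= (word_counts[t] + alpha) / (total + alpha * len(vocab))
    scores[c] = score

# Normalise so the two class scores sum to 1.
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}
print("P(cinema | doc)    =", round(posterior[0], 4))
print("P(education | doc) =", round(posterior[1], 4))
```

With 13 word occurrences in the education class, 8 in the cinema class and a 12-word vocabulary, the unnormalised scores are 3/5 × 4/25 × 1/25 × 2/25 for education and 2/5 × 3/20 × 1/20 × 1/20 for cinema, which normalise to roughly 0.672 and 0.328.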


### Building the model: Bernoulli Naive Bayes

In [18]:
# build model

bnb = BernoulliNB()
bnb = bnb.fit(X_train, y_train)

In [20]:
# predict the probability

proba = bnb.predict_proba(X_test)
print("probability of test document belonging to class CINEMA" , proba[:,0])
print("probability of test document belonging to class EDUCATION" , proba[:,1])

probability of test document belonging to class CINEMA [0.2326374]
probability of test document belonging to class EDUCATION [0.7673626]
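The Bernoulli result differs from the multinomial one because the model treats every vocabulary word as a yes/no feature: words that are absent from the test document also contribute a factor, and repeated words count only once. The following hand computation is a sketch under the same assumptions as before (`alpha = 1`, priors estimated from the data, hardcoded tokens after stop-word removal). It is not the library's implementation. The smoothed presence probability is (documents in class containing the word + alpha) / (documents in class + 2 × alpha).

```python
# Tokens per training document after stop-word removal (1 = education, 0 = cinema).
train = [
    (["upgrad", "great", "educational", "institution"], 1),
    (["educational", "greatness", "depends", "ethics"], 1),
    (["story", "great", "ethics", "educational", "greatness"], 1),
    (["sholey", "great", "cinema"], 0),
    (["good", "movie", "depends", "good", "story"], 0),
]
test_present = {"good", "educational", "institution"}  # words present in the test document
alpha = 1.0  # Laplace smoothing (assumed default)

vocab = sorted({t for doc, _ in train for t in doc})

scores = {}
for c in (0, 1):
    # Only presence matters, so each document becomes a set of words.
    class_docs = [set(doc) for doc, label in train if label == c]
    n_c = len(class_docs)
    score = n_c / len(train)                      # P(class)
    for term in vocab:
        doc_freq = sum(term in doc for doc in class_docs)
        p = (doc_freq + alpha) / (n_c + 2 * alpha)  # P(term present | class)
        # Present words contribute p, absent words contribute (1 - p).
        score *= p if term in test_present else (1 - p)
    scores[c] = score

# Normalise so the two class scores sum to 1.
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}
print("P(cinema | doc)    =", round(posterior[0], 4))
print("P(education | doc) =", round(posterior[1], 4))
```

Note that "good" appearing twice in one cinema document counts as a single occurrence here, which is one reason the Bernoulli model is less confident about cinema than the multinomial model.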


# Comprehension - Naive Bayes for Text Classification

In [22]:
data = pd.read_csv('comp_data.csv')
data.head()

Unnamed: 0,Doc.No.,Document,Class
0,0,Coffee Tea Soup Coffee Coffee,Hot
1,1,Coffee is hot and so is Soup and Tea,Hot
2,2,Espresso is a hot Coffee and not a Tea,Hot
3,3,Coffee is neither Tea nor Soup,Hot
4,4,Sprite Pepsi Cold Coffee and cold Tea,Cold


In [23]:
data['Class'] = data['Class'].map({'Hot': 0, 'Cold': 1})

In [25]:
X_train = np.array(data['Document'])
y_train = np.array(data['Class'])

#### Q1) How many words will there be in the dictionary (vocabulary) after removing stop words?

In [32]:
cv = CountVectorizer(stop_words='english')
voc = cv.fit(X_train).vocabulary_
len(voc)

8

**Q2) What will the feature vector look like after transforming the document:**

**“Coffee is neither Tea nor Soup”?**

**The words in the dictionary are ordered as shown below:**

**coffee | cold | espresso | hot | pepsi | soup | sprite | tea**

In [33]:
voc

{'coffee': 0,
 'tea': 7,
 'soup': 5,
 'hot': 3,
 'espresso': 2,
 'sprite': 6,
 'pepsi': 4,
 'cold': 1}

In [37]:
doc_mat = cv.transform(X_train)
doc_mat.toarray()

array([[3, 0, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 1],
       [1, 2, 0, 0, 1, 0, 1, 1]], dtype=int64)

**Q3) What will the feature vector look like after transforming the document:**

**“I hate cold Coffee but love Tea and hot Coffee”?**

**The words in the dictionary are ordered as shown below:**

**coffee | cold | espresso | hot | pepsi | soup | sprite | tea**

In [38]:
doc_mat = cv.transform(['I hate cold Coffee but love Tea and hot Coffee'])
doc_mat.toarray()

array([[2, 1, 0, 1, 0, 0, 0, 1]], dtype=int64)
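The feature vectors in Q2 and Q3 can also be derived by hand: walk through the tokens of the document and increment the slot of every token that is in the fitted dictionary. Words that were never seen during fitting (for example "hate" and "love") have no slot and are dropped, as are stop words. The sketch below shows this in plain Python. The `transform` helper is a hypothetical stand-in written for this example, not the vectorizer's own method.

```python
import re

# Dictionary order as listed in the questions.
vocab = ["coffee", "cold", "espresso", "hot", "pepsi", "soup", "sprite", "tea"]
index = {term: j for j, term in enumerate(vocab)}

def transform(text):
    # Count only the tokens that exist in the fitted dictionary.
    vector = [0] * len(vocab)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        if token in index:
            vector[index[token]] += 1
    return vector

print(transform("Coffee is neither Tea nor Soup"))                    # Q2
print(transform("I hate cold Coffee but love Tea and hot Coffee"))    # Q3
```

The first call gives `[1, 0, 0, 0, 0, 1, 0, 1]`, which is the row at index 3 of the training matrix, and the second gives `[2, 1, 0, 1, 0, 0, 0, 1]`, matching the output above.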