# Naive Bayes

## What is it

* Unlike classification algorithms such as KNN and Decision Trees, which give a hard decision about which class a sample belongs to, Naive Bayes gives the probability that a sample belongs to each class. It is based on Bayes' theorem: we calculate the probability that ***x*** belongs to each class and choose the class with the highest probability. The word "Naive" refers to the assumption that the features are independent of each other given the class.

#### Input
1. A labeled dataset ***A***.
2. An unlabeled data sample ***x***.

#### Output
1. The probability that ***x*** belongs to each class.

## How does it work
* We use Bayes' rule to calculate the probability that ***x*** belongs to a class ***c_i***:

    ***p(c_i|x)*** = ***p(x|c_i)p(c_i) / p(x)***

    Because of the independence assumption, ***p(x|c_i)*** is equal to ***p(x_0|c_i)p(x_1|c_i)...p(x_n|c_i)***. Multiplying many small probabilities can underflow to zero, so we work with the logarithm instead, which turns the product into a sum:

    ***log(p(x|c_i))*** = ***log(p(x_0|c_i)) + log(p(x_1|c_i)) + ... + log(p(x_n|c_i))***

    Since ***p(x)*** is the same for all classes, we only need to compare ***p(x|c_i)p(c_i)***, or equivalently ***log(p(x|c_i)) + log(p(c_i))***, across classes.
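
To make the reason for working in log space concrete, here is a minimal sketch using only the standard library. The likelihood values are made up for illustration and do not come from the dataset used later.

```python
import math

# Hypothetical per-feature likelihoods p(x_j|c_i); the values are made up.
small = [0.5, 0.25, 0.1]

# For a short vector, the log of the product equals the sum of the logs.
log_of_product = math.log(math.prod(small))
sum_of_logs = sum(math.log(p) for p in small)
print(math.isclose(log_of_product, sum_of_logs))  # True

# For a long vector the direct product underflows to 0.0,
# while the sum of logs stays finite.
many = [0.01] * 200
product = math.prod(many)
log_sum = sum(math.log(p) for p in many)
print(product)  # 0.0
print(log_sum)
```

Because the logarithm is monotonic, comparing the log scores of two classes gives the same winner as comparing the raw products would.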

In [13]:
import numpy as np

def train_naive_bayes(one_hot_matrix, labels):
    '''
    A simple binary classification
    '''
    # number of documents (samples)
    num_docs = len(one_hot_matrix)
    # vocabulary size (number of distinct words)
    num_vocab = len(one_hot_matrix[0])
    # p(c_1), since binary, p(c_0) = 1 - p(c_1)
    p1 = sum(labels) / num_docs

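    # word counts of class 0, initialized to 1 instead of 0 (add-one smoothing)
    # so that a word never seen in a class does not get probability 0
    num_p0 = np.ones(num_vocab)
    # word counts of class 1, smoothed the same way
    num_p1 = np.ones(num_vocab)

    # total number of words in each class, initialized to 2.0
    # so that the denominator is never 0
    denom_p0 = 2.0
    denom_p1 = 2.0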
    for i in range(num_docs):
        if labels[i] == 1:
            # if belongs to class 1
            # add word counts
            num_p1 += one_hot_matrix[i]
            # sum total number of words
            denom_p1 += sum(one_hot_matrix[i])
        else:
            # if belongs to class 0
            num_p0 += one_hot_matrix[i]
            denom_p0 += sum(one_hot_matrix[i])
    # class 1: [log(p(x_0|c_1)), log(p(x_1|c_1)), ...]
    p1_vect = np.log(num_p1 / denom_p1)
    # class 0: [log(p(x_0|c_0)), log(p(x_1|c_0)), ...]
    p0_vect = np.log(num_p0 / denom_p0)
    return p0_vect, p1_vect, p1
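
The per-document loop above can also be written with boolean masks. The following is a sketch of an equivalent formulation, not a replacement for the function above. The name `train_naive_bayes_vectorized` is hypothetical, and it assumes binary labels where anything other than 1 is treated as class 0, matching the `else` branch of the loop.

```python
import numpy as np

def train_naive_bayes_vectorized(one_hot_matrix, labels):
    """Hypothetical vectorized variant of the training loop (binary labels)."""
    X = np.asarray(one_hot_matrix, dtype=float)
    y = np.asarray(labels)
    is_c1 = (y == 1)

    # prior p(c_1); p(c_0) = 1 - p(c_1)
    p1 = is_c1.mean()

    # per-word counts per class, starting from 1 as in the loop version
    num_p1 = X[is_c1].sum(axis=0) + 1.0
    num_p0 = X[~is_c1].sum(axis=0) + 1.0

    # total word counts per class, starting from 2.0 as in the loop version
    denom_p1 = X[is_c1].sum() + 2.0
    denom_p0 = X[~is_c1].sum() + 2.0

    return np.log(num_p0 / denom_p0), np.log(num_p1 / denom_p1), p1

# Tiny made-up example: 3 documents, 3 vocabulary words
X_demo = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
y_demo = [1, 0, 1]
p0_demo, p1_demo, prior_demo = train_naive_bayes_vectorized(X_demo, y_demo)
print(np.exp(p0_demo), np.exp(p1_demo), prior_demo)
```

In the demo, class 1 has word counts `[2, 1, 1]` over 4 words, so the smoothed estimates are `[3/6, 2/6, 2/6]`. Class 0 has counts `[0, 1, 1]` over 2 words, giving `[1/4, 2/4, 2/4]`.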

In [22]:
def classify_naive_bayes(data_vect, p0_vect, p1_vect, p1):
    """
    Algorithm:
        # convert multiplication into addition
        multiplication: p(c_i|x) = p(x|c_i)p(c_i)/p(x)
        addition: p(x_0|c_i)*p(x_1|c_i)*...*p(c_i) -> log(p(x_0|c_i))+log(p(x_1|c_i))+....+log(p(c_i))
    :param data_vect: one-hot vector to classify
    :param p0_vect: [log(p(x_0|c_0)),log(p(x_1|c_0)),...] for class 0
    :param p1_vect: [log(p(x_0|c_1)),log(p(x_1|c_1)),...] for class 1
    :param p1: probability of class 1
    """
    # Why data_vect * p1_vect?
    # p1_vect holds the log probability of each vocabulary word given class 1,
    # and data_vect is a one-hot vector: 1 for words present in the sample,
    # 0 for words that are absent.
    # Multiplying them elementwise keeps only the log probabilities of the words in the sample.
    # The scores get their own names so that p1 remains the class prior in both lines.
    log_p1 = sum(data_vect * p1_vect) + np.log(p1)
    log_p0 = sum(data_vect * p0_vect) + np.log(1 - p1)
    if log_p1 > log_p0:
        return 1
    else:
        return 0
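
The classifier above returns a hard label, while the stated output of Naive Bayes is a probability per class. One way to recover the posterior is to normalize the two log scores. The sketch below assumes the same binary setup and the same kind of inputs as the classifier. The name `posterior_class1` is hypothetical, and it uses `np.logaddexp` to compute the log of the normalizer without leaving log space.

```python
import numpy as np

def posterior_class1(data_vect, p0_vect, p1_vect, prior1):
    """Hypothetical helper: return p(c_1|x) from the log likelihood vectors."""
    # log p(x|c_i) + log p(c_i) for each class
    log_joint1 = np.dot(data_vect, p1_vect) + np.log(prior1)
    log_joint0 = np.dot(data_vect, p0_vect) + np.log(1.0 - prior1)
    # log p(x) = log(exp(log_joint0) + exp(log_joint1)), computed stably
    log_evidence = np.logaddexp(log_joint0, log_joint1)
    return float(np.exp(log_joint1 - log_evidence))

# Made-up two-word example: only the first word is present
demo_vect = np.array([1, 0])
demo_p0 = np.log(np.array([0.2, 0.8]))
demo_p1 = np.log(np.array([0.6, 0.4]))
demo_posterior = posterior_class1(demo_vect, demo_p0, demo_p1, 0.5)
print(demo_posterior)
```

In the demo the joint scores are 0.6 × 0.5 = 0.3 for class 1 and 0.2 × 0.5 = 0.1 for class 0, so the posterior for class 1 is 0.3 / 0.4 = 0.75, and the posterior for class 0 is one minus that.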

## Simple example

In [23]:
def create_dataset():
    dataset_x = [['my', 'dog', 'has', 'flea',
                  'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him',
                  'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute',
                  'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how',
                  'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    dataset_y = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return dataset_x, dataset_y

def get_vocabulary(dataset_x):
    '''
    Create unique vocabulary list
    '''
    vocab = set()
    for doc in dataset_x:
        vocab = vocab | set(doc)
    return list(vocab)

def words_to_one_hot(vocab: list, sentence: list):
    '''
    Create a one-hot encoding for sentence:
    1 where a vocabulary word is present, 0 where it is absent
    '''
    one_hot = [0 for _ in vocab]
    for word in sentence:
        if word in vocab:
            one_hot[vocab.index(word)] = 1
        else:
            print('OOV: %s' % word)
    return one_hot

In [24]:
dataset_x, dataset_y = create_dataset()
dataset_x

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

In [25]:
vocabulary = get_vocabulary(dataset_x)
vocabulary

['problems',
 'mr',
 'flea',
 'ate',
 'has',
 'so',
 'dog',
 'my',
 'is',
 'worthless',
 'licks',
 'food',
 'please',
 'stupid',
 'help',
 'to',
 'cute',
 'him',
 'stop',
 'take',
 'maybe',
 'steak',
 'how',
 'I',
 'not',
 'love',
 'posting',
 'park',
 'garbage',
 'dalmation',
 'quit',
 'buying']

In [26]:
one_hot_vectors = []
for doc in dataset_x:
    one_hot_vectors.append(words_to_one_hot(vocabulary, doc))
one_hot_vectors[0]

[1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [27]:
p0_vect, p1_vect, p1 = train_naive_bayes(np.array(one_hot_vectors),
                                         np.array(dataset_y))
print(p0_vect, p1_vect, p1)

[-2.56494936 -2.56494936 -2.56494936 -2.56494936 -2.56494936 -2.56494936
 -2.56494936 -1.87180218 -2.56494936 -3.25809654 -2.56494936 -3.25809654
 -2.56494936 -3.25809654 -2.56494936 -2.56494936 -2.56494936 -2.15948425
 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -3.25809654 -3.25809654 -3.25809654 -2.56494936
 -3.25809654 -3.25809654] [-3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -1.94591015 -3.04452244 -3.04452244 -1.94591015 -3.04452244 -2.35137526
 -3.04452244 -1.65822808 -3.04452244 -2.35137526 -3.04452244 -2.35137526
 -2.35137526 -2.35137526 -2.35137526 -3.04452244 -3.04452244 -3.04452244
 -2.35137526 -3.04452244 -2.35137526 -2.35137526 -2.35137526 -3.04452244
 -2.35137526 -2.35137526] 0.5


In [28]:
test_sample = ['love', 'my', 'dalmation']
test_one_hot = np.array(words_to_one_hot(vocabulary, test_sample))
print(test_sample, 'classified as: ',
      classify_naive_bayes(test_one_hot, p0_vect, p1_vect, p1))
test_sample = ['stupid', 'garbage']
test_one_hot = np.array(words_to_one_hot(vocabulary, test_sample))
print(test_sample, 'classified as: ',
      classify_naive_bayes(test_one_hot, p0_vect, p1_vect, p1))

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
