<a href="https://colab.research.google.com/github/ziqlu0722/Machine-Learning/blob/master/NaiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NaiveBayes

## 1. Concept

* Assume: Feature Independence
* $P(x1, x2, ..., xd \mid y) = P(x1\mid y) * P(x2\mid y) * ... * P(xd \mid y)$

This assumption doesn't really hold, but Naive Bayes still work in many cases, unless the assumption is completely broken



### 1.2 How to Calculate the Simple Uni-dimentional Density Function?

* #### Option 1: Model

> Apply an imposed model, calculate the maximum likelihood parameters for the model
* gaussian, bernoullim, binomial, exponential
* mixture of distributions

* #### Option 2: Histogram

> Bucket/cluster/bin and count feature values in each bucket/cluster/bin, and convert counts into probability

### 1.3 But there are some defects with Naive Bayes to be solved

* ### Problem 1: constant feature

> if $x_j$ is constant, then some estimates is unusable

> solutions:
  1. control the parameters
  2. smoothing, convert the counts into probabilities
  3. feature selection, exclude this feature

* ### Problem 2: zero probability
This situation is common for sparse features, e.g. document data

* ### Solution: Two Ways of Smoothing:

  * $M$: # of observations

  * $N$: # of features

  * $t_i$: # of observations for the ith feature

  * $P(i)$: Original Probability = $t_)/M$

  (1). Laplace Smoothing: $ (t_i + 1 \epsilon) / (M + N \epsilon)$
  
  (2). Background + Foreground: $ \lambda * P(i) + (1- \lambda) * Q(i)$

> *$Q(i)$: Sumation of $t_i^j$ / $M_i^j$ over $j$ experiments (from prior knowledge or preivous experiments)

###1.4 Major Types of Naive

* **Multinomial Naive Bayes:**

> This is mostly used for document classification problem, i.e whether a document belongs to the category of sports, politics, technology etc. The features/predictors used by the classifier are the frequency of the words present in the document.

* **Bernoulli Naive Bayes:**

> This is similar to the multinomial naive bayes but the predictors are boolean variables. The parameters that we use to predict the class variable take up only values yes or no, for example if a word occurs in the text or not.

* **Gaussian Naive Bayes:**

> When the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a gaussian distribution.

> *from [here](https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c)*

##1. Import Data

In [729]:
import numpy as np
import collections

data = []

with open('drive/Data/spambase/spambase.txt') as file:
  for line in file:
    line_split = line.strip().split(',')
    data.append([float(c) for c in line_split])
#     data.append(line.strip().split(','))

print('There are {} data instances'. format(len(data))) 
print(collections.Counter(np.array(data)[:,-1]))

There are 4601 data instances
Counter({0.0: 2788, 1.0: 1813})


## 2. Help Functions

###2.1. K-Folds Cross Validation

In [0]:
from random import randrange

# Split a dataset into k folds
def cross_validation_split(dataset, folds=5):

  dataset_split = []     
  dataset_copy = list(dataset)
  fold_size = int(len(dataset) / folds)
  for i in range(folds):
    fold = []
    while len(fold) < fold_size:
      index = randrange(len(dataset_copy))
      fold.append(dataset_copy.pop(index))
    dataset_split.append(fold)

  idx = 0
  test_set = []
  train_set = []
  for i in range(folds):
    test_set.append(dataset_split[idx])
    train_set_j = []
    train_set_k = []
    for j in dataset_split[:idx] + dataset_split[idx + 1:]:
      for k in j:
        train_set_k.append(k)
    train_set.append(train_set_k)
    idx += 1
  
  return train_set, test_set

In [0]:
# split data

num_folds = 10
train_set, test_set = cross_validation_split(data, folds = num_folds)

### 2.2. Metrics

In [0]:
class Metric:
  def __init__(self, predict, label):    
    self.predict = predict
    self.label = label
    self.num_obs = len(label)
  
  def acc(self):
    predict = [np.argmin(_) for _ in self.predict]
    err = 0
    for i in range(len(self.label)):
      if predict[i] != self.label[i]:
          err += 1
    acc = err/len(self.label)
    print('Accuracy---{}'.format(acc))
    return acc 

## 3. Build Algorithm

###3.1. Bernoulli NaiveBayes

In [0]:
import numpy as np

class BernoulliNB:
  
  def __init__(self, alpha = 1.0):
    self.alpha = alpha
    
  
  def fit(self, train_data):
    x = np.array(train_data)[:,:-1]
    y = np.array(train_data)[:,-1]
    num_samples = x.shape[0]
    num_features = x.shape[1]
    
    # split data for each class:
    self.input = {'c0': x[y==0], 'c1':x[y==1]}

    # calculate the prior as the frequency of class over all data points
    self.prior = {'c0': 1 - y.mean(), 'c1': y.mean()}
    
    # calculate the mean for each x set
    self.mu = x.mean(axis = 0)
    
    # calculate the size for each x set
    self.size = {'c0': x[y==0].shape[0], 'c1': x[y==1].shape[0]}
    
    # calculate probability of each word consider smoothing
    ''' smoothed_prob = (count + 1 * alpha) / (total_count + total_count_features * alpha)'''
    
    self.log_smoothed_p = {'c0': np.log((np.abs((np.count_nonzero(self.input['c0'] >= self.mu, axis = 0))) + self.alpha*1) / (self.size['c0'] + self.alpha*num_features)), \
                           'c1': np.log((np.abs((np.count_nonzero(self.input['c1'] >= self.mu, axis = 0))) + self.alpha*1) / (self.size['c1'] + self.alpha*num_features))}
    
    return self
    
  def predict(self, test_data):
    x_test = np.array(test_data)[:,:-1]
    label = np.array(test_data)[:,-1]
    prob_test = {'c0':np.where(x_test >= self.mu, self.log_smoothed_p['c0'], 1 - self.log_smoothed_p['c0']), \
              'c1':np.where(x_test >= self.mu, self.log_smoothed_p['c1'], 1 - self.log_smoothed_p['c1'])} 
    predict = []
    for i in range(x_test.shape[0]):
      marginal_prob_c0 = np.sum(prob_test['c0'][i]) + np.log(self.prior['c0']) 
      marginal_prob_c1 = np.sum(prob_test['c1'][i]) + np.log(self.prior['c1']) 
      predict.append([marginal_prob_c0, marginal_prob_c1])
    
    return predict

      
  def evaluate(self, test_data):
    test_x = np.array(test_data)[:,:-1]
    label = np.array(test_data)[:,-1]
    predict = self.predict(test_data)
    metric = Metric(predict, label)
    return metric

In [1034]:
# set hyper parameters
alpha = 200

# train
for i in range(num_folds):
  acc = BernoulliNB(alpha = alpha).fit(train_set[i]).evaluate(test_set[i]).acc()
print('The Accuracy Over 10 folds Cross Validation is {}'.format(np.mean(acc)))

Accuracy---0.8978260869565218
Accuracy---0.8934782608695652
Accuracy---0.9021739130434783
Accuracy---0.8782608695652174
Accuracy---0.8956521739130435
Accuracy---0.8760869565217392
Accuracy---0.8739130434782608
Accuracy---0.9347826086956522
Accuracy---0.9130434782608695
Accuracy---0.9043478260869565
The Accuracy Over 10 folds Cross Validation is 0.9043478260869565


### 3.2. Gaussian NaiveBayes

In [0]:
import numpy as np

class GaussianNB:
  
  def __init__(self, epsilon = 1.0):
    # Epsilon is for smoothing the variance
    self.epsilon = epsilon
    
  def smoothing(self, std):
    std_smoothed = np.where(std > self.epsilon, std, self.epsilon)
    return std_smoothed
 
  def fit(self, train_data):
    x = np.array(train_data)[:,:-1]
    y = np.array(train_data)[:,-1]
    num_samples = x.shape[0]
    num_features = x.shape[1]
    self.c = ['c0','c1']
    
    # split data for each class:
    self.input = {'c0': x[y==0], 'c1':x[y==1]}

    # calculate the prior as the frequency of class over all data points
    self.prior = [1-y.mean(), y.mean()]
    
    # calculate the mean for each x set
    self.mu = x.mean(axis = 0)
    
    # calculate the gaussian parameters for each feature   
    
    self.gaussian_mean = [self.input['c0'].mean(axis = 0), \
                          self.input['c1'].mean(axis = 0)]
    
    self.gaussian_std = [self.smoothing(np.std(self.input['c1'], axis = 0)), \
                         self.smoothing(np.std(self.input['c1'], axis = 0))]
 
    return self
  
  def _gaussian(self, x, mean, std):
    p = np.abs(np.exp(-((x-mean)**2/(2*std**2)))/(np.sqrt(2*np.pi)*std))
    log_p = np.log(p)
    
    return log_p
    
  def predict(self, test_data):
    x_test = np.array(test_data)[:,:-1]
    label = np.array(test_data)[:,-1] 
    
    pred = []
    for i in range(x_test.shape[0]):
      pred.append(np.sum(np.array([[self._gaussian(x_test[i][j], mu[j], std[j]) \
                                    for mu, std in list(zip(self.gaussian_mean, self.gaussian_std))] \
                                    for j in range(x_test.shape[1])]), axis = 0) + np.array(self.prior))
    return pred
      
  def evaluate(self, test_data):
    test_x = np.array(test_data)[:,:-1]
    label = np.array(test_data)[:,-1]
    predict = self.predict(test_data)
    metric = Metric(predict, label)
    return metric

In [1022]:
# set hyper parameters
epsilon = 1

# train
for i in range(num_folds):
  acc = GaussianNB(epsilon = epsilon).fit(train_set[i]).evaluate(test_set[i]).acc()
print('The Accuracy Over 10 folds Cross Validation is {}'.format(np.mean(acc)))

Accuracy---0.8282608695652174
Accuracy---0.8043478260869565
Accuracy---0.8369565217391305
Accuracy---0.7913043478260869
Accuracy---0.8326086956521739
Accuracy---0.8347826086956521




Accuracy---0.8065217391304348
Accuracy---0.8282608695652174
Accuracy---0.808695652173913
Accuracy---0.8326086956521739
The Accuracy Over 10 folds Cross Validation is 0.8326086956521739


### 3.3. Multinomial NaiveBayes

In [0]:
import numpy as np

class MultinomialNB:
  
  def __init__(self, alpha, num_bins):
    '''alpha is the param for Laplace Smoothing'''
    self.alpha = alpha
    self.num_bins = num_bins
    
  
  def fit(self, train_data):
    x = np.array(train_data)[:,:-1]
    y = np.array(train_data)[:,-1]
    num_samples = x.shape[0]
    self.num_features = x.shape[1]
    self.num_class = len(np.unique(y))
    
    # class
    self.c = ['c0', 'c1']
    
    # split data for each class:
    self.input = {'c0': x[y==0], 'c1':x[y==1]}

    # calculate the prior as the frequency of class over all data points
    self.prior = {'c0': 1 - y.mean(), 'c1': y.mean()}
    
    # calculate the size for each x set
    self.size = {'c0': x[y==0].shape[0], 'c1': x[y==1].shape[0]}
    
    # calculate the split point for each feature
    # add 1e-10 here to include the max value into the bins
    self.mu = x.mean(axis = 0)
    self.bins = (x.max(axis = 0) + 1e-10 - x.min(axis = 0) + 1)/self.num_bins
    start = x.min(axis = 0).reshape(1, self.num_features)
    llist = []
    for i in range(self.num_bins):
      split = np.array((i + 1) * self.bins).reshape(1, self.num_features)
      llist.append(split)
    self.threshold = np.r_[start, np.concatenate(llist)]
    
    # calculate log smoothed bin probability of each feature consider smoothing
    ''' smoothed_prob = (count + 1 * alpha) / (total_count + total_count_features * alpha)'''
    self.log_prob = {}
    for c in self.c:
      prob = []
      for i in range(self.num_bins):
        match_mat = (self.input[c] >= self.threshold[i]) * (self.input[c] < self.threshold[i+1])
        prob.append(np.log((np.count_nonzero(match_mat, axis = 0) + self.alpha*1)/(self.size[c] + self.alpha * self.num_features)))
      self.log_prob[c] = np.array(prob)
    
    return self
    
  def predict(self, test_data):
    x_test = np.array(test_data)[:,:-1]
    label_test = np.array(test_data)[:,-1]
    
    prob = []
    for i in range(self.num_bins):
      match_mat = (x_test >= self.threshold[i]) * (x_test < self.threshold[i+1])  
      prob.append(match_mat)  
    # transpose here to make the iteration of instance at the uppermost layer
    prob = np.array(prob).transpose(1,0,2)
    
    # get margainal prob for each instance
    prob_test = {}
    for c in self.c:     
      prob_test[c] = np.array(prob * self.log_prob[c])
         
    predict = []
    
    for i in range(x_test.shape[0]):
      marginal_prob_c0 = np.sum(prob_test['c0'][i]) + np.log(self.prior['c0']) 
      marginal_prob_c1 = np.sum(prob_test['c1'][i]) + np.log(self.prior['c1']) 
      predict.append([marginal_prob_c0, marginal_prob_c1])
    
    return predict
      
  def evaluate(self, test_data):
    test_x = np.array(test_data)[:,:-1]
    label = np.array(test_data)[:,-1]
    metric = Metric(self.predict(test_data), label)
    return metric

In [1009]:
# set hyper parameters
alpha = 1
num_bins = 4

# train
for i in range(num_folds):
  acc = MultinomialNB(alpha = alpha, num_bins = num_bins).fit(train_set[i]).evaluate(test_set[i]).acc()
print('The Accuracy Over 10 folds Cross Validation is {}'.format(np.mean(acc)))

Accuracy---0.6543478260869565
Accuracy---0.7021739130434783
Accuracy---0.6739130434782609
Accuracy---0.717391304347826
Accuracy---0.7217391304347827
Accuracy---0.7043478260869566
Accuracy---0.691304347826087
Accuracy---0.7108695652173913
Accuracy---0.658695652173913
Accuracy---0.7369565217391304
The Accuracy Over 10 folds Cross Validation is 0.7369565217391304


In [1010]:
# set hyper parameters
alpha = 2
num_bins = 9

# train
for i in range(num_folds):
  acc = MultinomialNB(alpha = alpha, num_bins = num_bins).fit(train_set[i]).evaluate(test_set[i]).acc()
print('The Accuracy Over 10 folds Cross Validation is {}'.format(np.mean(acc)))

Accuracy---0.8239130434782609
Accuracy---0.8021739130434783
Accuracy---0.8217391304347826
Accuracy---0.8478260869565217
Accuracy---0.8326086956521739
Accuracy---0.8304347826086956
Accuracy---0.8130434782608695
Accuracy---0.85
Accuracy---0.8260869565217391
Accuracy---0.8456521739130435
The Accuracy Over 10 folds Cross Validation is 0.8456521739130435
