# Practical Machine Learning
# Lab 2

## Naive Bayes

We are going to classify the MNIST data using the Naive Bayes classifier from the **scikit-learn** library.

## Bayes' theorem

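$$P(A | B) = \frac{P(B | A) \, P(A)}{P(B)}$$

where $P(A)$ is the prior, $P(B | A)$ is the likelihood, $P(B)$ is the evidence and $P(A | B)$ is the posterior.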
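As a quick numeric check of Bayes' theorem, the sketch below computes a posterior from a prior and two likelihoods. All the numbers are made up for illustration (a hypothetical "digit is a 7" class and a "pixel is on" feature); they are not taken from MNIST.

```python
# Hypothetical probabilities, chosen only for illustration.
p_seven = 0.1               # prior P(y = 7)
p_on_given_seven = 0.8      # likelihood P(pixel on | y = 7)
p_on_given_other = 0.2      # likelihood P(pixel on | y != 7)

# Evidence P(pixel on), using the law of total probability.
p_on = p_on_given_seven * p_seven + p_on_given_other * (1 - p_seven)

# Posterior P(y = 7 | pixel on), using Bayes' theorem.
p_seven_given_on = p_on_given_seven * p_seven / p_on
print(round(p_seven_given_on, 4))
```

Seeing the pixel switched on raises the probability of the class from the prior of 0.1 to roughly 0.31.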

## Naive Bayes

The Naive Bayes method is a supervised learning algorithm that applies Bayes' theorem under the '*naive*' assumption that every pair of features is conditionally independent given the value of the class variable.

Let $X = (x_1, x_2, \dots, x_n)$ be a feature vector and $y_k$ a class label. The predicted label $\hat{y}$ is:
$$\hat{y} = \arg\max_{y_k} P(y_k) \prod_{i=1}^{n} P(x_i | y_k)$$
where $P(y_k)$ is the prior probability of class $y_k$ and $P(x_i | y_k)$ is the likelihood of feature $x_i$ given class $y_k$.
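The decision rule above can be sketched in a few lines of NumPy. Implementations usually sum logarithms instead of multiplying probabilities, because a product of hundreds of small values underflows; the argmax is unchanged. The function name `naive_bayes_predict` and the probabilities below are hypothetical and serve only as an illustration.

```python
import numpy as np

def naive_bayes_predict(log_prior, log_likelihood):
    """Return the index k that maximizes log P(y_k) + sum_i log P(x_i | y_k).

    log_prior:      shape (n_classes,)
    log_likelihood: shape (n_classes, n_features), entry [k, i] = log P(x_i | y_k)
    """
    scores = log_prior + log_likelihood.sum(axis=1)
    return int(np.argmax(scores))

# Made-up probabilities for 2 classes and 3 observed features.
prior = np.array([0.5, 0.5])
likelihood = np.array([[0.9, 0.8, 0.1],
                       [0.2, 0.3, 0.7]])

predicted_class = naive_bayes_predict(np.log(prior), np.log(likelihood))
print(predicted_class)  # class 0: 0.5 * 0.072 = 0.036 beats class 1: 0.5 * 0.042 = 0.021
```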

### Gaussian Naive Bayes

The likelihood of each feature is assumed to be Gaussian:
$$P(x_i | y_k) = \frac{1}{\sqrt{2\pi \sigma^2_{y_k}}} \exp\left(- \frac{(x_i - \mu_{y_k})^2}{2\sigma^2_{y_k}}\right) $$

where $\mu_{y_k}$ and $\sigma^2_{y_k}$ are the mean and variance of feature $x_i$ in class $y_k$, estimated from the training data using maximum likelihood.
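To make the Gaussian model concrete, here is a minimal from-scratch sketch in NumPy. It is not scikit-learn's implementation: the helper names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, the data is synthetic, and the small constant added to the variances is an assumption made to avoid division by zero.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class, per-feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # np.var with the default ddof=0 is the maximum likelihood estimate.
    variances = np.array([X[y == k].var(axis=0) for k in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick, for each row of X, the class with the highest log-posterior score."""
    diff = X[:, None, :] - means[None, :, :]           # (n_samples, n_classes, n_features)
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None])
    scores = np.log(priors) + log_pdf.sum(axis=2)      # (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]

# Synthetic data: two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb(np.array([[0.0, 0.0], [5.0, 5.0]]), *params)
print(pred)
```

The two query points sit at the cluster centres, so they are assigned to class 0 and class 1 respectively.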

### Multinomial Naive Bayes

Multinomial Naive Bayes is used for multinomially distributed data, where $P(x_i | y_k)$ is the probability of feature $i$ appearing in a sample belonging to class $y_k$:

$$P(x_i | y_k) = \frac{\text{number of examples in class } y_k \text{ that have } x_i}{\text{number of examples in class } y_k}$$
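The ratio above is a simplified view. scikit-learn's `MultinomialNB` estimates the feature probabilities from smoothed feature counts, $(N_{y_k i} + \alpha) / (N_{y_k} + \alpha n)$, where $N_{y_k i}$ is the total count of feature $i$ in class $y_k$, $N_{y_k}$ is the total count of all features in that class and $n$ is the number of features. The sketch below recomputes that estimate by hand on a tiny made-up count matrix and compares it with the fitted `feature_log_prob_` attribute.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count data: 4 samples, 3 features, 2 classes.
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4],
              [1, 1, 3]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

# Smoothed estimate of P(x_i | y_k) computed by hand, one row per class.
manual = []
for k in np.unique(y):
    counts = X[y == k].sum(axis=0)
    manual.append((counts + alpha) / (counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

clf = MultinomialNB(alpha=alpha).fit(X, y)
matches = np.allclose(manual, np.exp(clf.feature_log_prob_))
print(manual)
print(matches)
```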




## How to use scikit-learn

In [0]:
# import the library
from sklearn.naive_bayes import GaussianNB, MultinomialNB
import numpy as np

# X, y, X_test and y_test below are placeholders for your own training and test arrays

# define the model (use MultinomialNB() for count-like or binned features)
model = GaussianNB()

# train the model
model.fit(X, y)

# predict the labels
predicted_labels = model.predict(X_test)

# compute the accuracy
accuracy = model.score(X_test, y_test)

# Exercises

### 1. Compute the accuracy of the Multinomial Naive Bayes classifier on the MNIST subset.

In [0]:
# load data
train_images = np.load('data/train_images.npy') # load training images
train_labels = np.load('data/train_labels.npy') # load training labels
test_images = np.load('data/test_images.npy') # load testing images
test_labels = np.load('data/test_labels.npy') # load testing labels

# write your code here
clf = MultinomialNB()
clf.fit(train_images, train_labels)
print('accuracy =', clf.score(test_images, test_labels))


accuracy = 0.846


In [0]:
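def value_to_bin(x, num_bins=3, min_value=0, max_value=255):
    """Map each value in x to the index of its equal-width bin over [min_value, max_value]."""
    # num_bins + 1 edges; the last edge is max_value + 1 so that max_value falls in the final bin
    bins = np.linspace(min_value, max_value + 1, num=num_bins + 1)
    # np.digitize returns 1-based bin indices, so subtract 1 to get 0-based bins
    return np.digitize(x, bins) - 1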

In [0]:
x = np.array([0, 1, 2, 120,  240,  255, 256])
print(value_to_bin(x))

[0 0 0 1 2 2 3]


In [0]:


# discretize the pixel intensities into 9 bins
train_images_cat = value_to_bin(train_images, 9)
print(train_images.min(), train_images.max())  # range of the raw pixel values
test_images_cat = value_to_bin(test_images, 9)

# train and evaluate on the binned features
clf = MultinomialNB()
clf.fit(train_images_cat, train_labels)
print(clf.score(test_images_cat, test_labels))

0.0 255.0
0.842
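
To see how the number of bins affects accuracy, the binning experiment above can be wrapped in a small loop. The following is only a sketch: `accuracy_per_num_bins` is a hypothetical helper, and the arrays are synthetic stand-ins so that the block runs without the `.npy` files, which means the printed scores say nothing about MNIST. Pass `train_images`, `train_labels`, `test_images` and `test_labels` instead to run it on the lab data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def value_to_bin(x, num_bins=3, min_value=0, max_value=255):
    # Same binning as in the lab: equal-width bins, 0-based indices.
    bins = np.linspace(min_value, max_value + 1, num=num_bins + 1)
    return np.digitize(x, bins) - 1

def accuracy_per_num_bins(train_x, train_y, test_x, test_y, candidates):
    """Fit one MultinomialNB per candidate bin count and return {num_bins: accuracy}."""
    results = {}
    for num_bins in candidates:
        clf = MultinomialNB()
        clf.fit(value_to_bin(train_x, num_bins), train_y)
        results[num_bins] = clf.score(value_to_bin(test_x, num_bins), test_y)
    return results

# Synthetic stand-in data (an assumption, not MNIST): 200 "images" of 16 pixels,
# where class 1 is darker in the first 8 pixels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
images = rng.integers(0, 256, size=(200, 16)).astype(float)
images[labels == 1, :8] *= 0.3

results = accuracy_per_num_bins(images[:150], labels[:150],
                                images[150:], labels[150:],
                                candidates=[3, 5, 7, 9])
for num_bins, acc in results.items():
    print(num_bins, acc)
```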
