# Seminar 01: Naive Bayes from scratch

Today we will write a Naive Bayes classifier that supports different probability distributions for the features.

## Loading data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

First, to load the dataset we are going to use the [`sklearn`](https://scikit-learn.org/stable/) package, which we will use extensively throughout the course.

`sklearn` implements most of the classical and frequently used algorithms in Machine Learning. It also provides a [User Guide](https://scikit-learn.org/stable/user_guide.html) describing the principles behind every family of algorithms it implements.

As an entry point to the main `sklearn` concepts we recommend the [getting started tutorial](https://scikit-learn.org/stable/getting_started.html) (check it out yourself). [Further tutorials](https://scikit-learn.org/stable/tutorial/index.html) can also be handy for developing your skills.

The first piece of functionality we use is the convenient loading of [common datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets). All we need is a single function call.

Object generated by [`load_iris`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) is described as:

> Dictionary-like object, the interesting attributes are:
>
> ‘data’, the data to learn,
>
>‘target’, the classification labels,
>
>‘target_names’, the meaning of the labels,
>
>‘feature_names’, the meaning of the features,
>
>‘DESCR’, the full description of the dataset,
>
>‘filename’, the physical location of iris csv dataset (added in version 0.20)

Let's see what we have

In [None]:
dataset = datasets.load_iris()

print(dataset.DESCR)

If you aren't familiar with the Iris dataset, take a minute to read the description above =) (as always, there is [more info about it on Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)).

__TL;DR__ 150 objects, equally distributed over 3 classes, each described by 4 continuous features.
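The summary above can be checked directly. The following is a minimal sketch that relies only on the `data`, `target` and `target_names` attributes quoted earlier; the variable names (`iris`, `class_counts`) are local to this sketch.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Design matrix: one row per flower, one column per feature
n_objects, n_features = iris.data.shape

# Count how many objects carry each class label (0, 1, 2)
class_counts = np.bincount(iris.target)

print(n_objects, n_features)
print(dict(zip(iris.target_names, class_counts)))
```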

Here is a pretty table to look at:

In [None]:
# for now you don't need to understand what happens in this code - just look at the table
ext_target = dataset.target[:, None]
pd.DataFrame(
    np.concatenate((dataset.data, ext_target, dataset.target_names[ext_target]), axis=1),
    columns=dataset.feature_names + ['target label', 'target name'],
)

Now give distinct names to the data we will use

In [None]:
features = dataset.data
target = dataset.target

features.shape, target.shape

__Please, remember!!!__

Everywhere in this course we agree to shape the design matrix (named `features` in the code above) as

`(#number_of_items, #number_of_features)`
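To make the convention concrete, here is a small sketch on a made-up matrix (the array `toy_features` is hypothetical and not part of the Iris data). Rows index items, columns index features, so per-feature statistics reduce over `axis=0`.

```python
import numpy as np

# Hypothetical toy design matrix: 5 items (rows) described by 3 features (columns)
toy_features = np.arange(15, dtype=float).reshape(5, 3)

n_items, n_features = toy_features.shape

first_item = toy_features[0]         # all features of one item, shape (3,)
second_feature = toy_features[:, 1]  # one feature over all items, shape (5,)

# Per-feature statistics reduce over the item axis (axis=0)
feature_means = toy_features.mean(axis=0)  # shape (3,)

print(n_items, n_features, feature_means)
```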

## Visualize dataset

Our dataset has 4 dimensions, but humans are more used to 3- or even 2-dimensional data, so let's plot the first 3 features, colored by label value.

In [None]:
from mpl_toolkits.mplot3d import Axes3D

In [None]:
fig = plt.figure(figsize=(8, 8))

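# add_subplot attaches the 3D axes to the figure; recent matplotlib versions
# no longer do this automatically for a bare Axes3D(fig)
ax = fig.add_subplot(projection='3d')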

ax.scatter(features[:, 0], features[:, 1], features[:, 2], c=target, marker='o')
ax.set_xlabel(dataset.feature_names[0])
ax.set_ylabel(dataset.feature_names[1])
ax.set_zlabel(dataset.feature_names[2])

plt.show()

Then have a look at the feature distributions:

In [None]:
# remember this way to make subplots! It could be useful for you later in your work

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

for i, axis in enumerate(axes.flat):
    axis.hist(features[:, i])
    axis.set_xlabel(dataset.feature_names[i])
    axis.set_ylabel('number of objects')

Note that every plot above has its own scale.

## Classifier implementation

Since we are aiming to implement the Naive Bayes algorithm, we first need a distribution that models each feature within a class (the class-conditional likelihood).

The most common choice is (of course) the Gaussian, and its parameters are the mean and the standard deviation. Let's implement a class that takes an array of feature values, estimates the distribution parameters, and can return the probability density at any given feature value.

In [None]:
class GaussianDistribution:
    def __init__(self, feature):
        '''
        Args:
            feature: column of design matrix, represents all available values
                of feature to model
        '''
        self.mean = feature.mean()
        self.std = feature.std()

    def log_proba(self, value):
        '''Logarithm of probability density at value'''
        return # <YOUR CODE HERE>
    
    def proba(self, value):
        return # <YOUR CODE HERE>

In [None]:
assert np.allclose(
    GaussianDistribution(features[:, 2]).proba(features[:5, 2]),
    np.array([0.19195815, 0.19195815, 0.18463525, 0.19924939, 0.19195815])
), 'Something wrong with the GaussianDistribution class'
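One possible way to fill in `log_proba` and `proba` is the textbook normal density, $\log p(x) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$. The sketch below writes it as free functions (`gaussian_log_pdf` and `gaussian_pdf` are hypothetical names) and cross-checks it against `scipy.stats.norm` on made-up values. The course's reference solution is not shown here, so agreement with the expected numbers in the check above is something to verify rather than assume.

```python
import numpy as np
from scipy.stats import norm


def gaussian_log_pdf(value, mean, std):
    """Log-density of N(mean, std**2) evaluated at value (scalar or array)."""
    value = np.asarray(value, dtype=float)
    return -0.5 * np.log(2.0 * np.pi * std ** 2) - (value - mean) ** 2 / (2.0 * std ** 2)


def gaussian_pdf(value, mean, std):
    """Density obtained by exponentiating the log-density."""
    return np.exp(gaussian_log_pdf(value, mean, std))


# Hypothetical sample values, used only to exercise the functions
sample = np.array([1.4, 1.3, 1.5, 4.7, 6.0])
mean, std = sample.mean(), sample.std()

log_density = gaussian_log_pdf(sample, mean, std)
density = gaussian_pdf(sample, mean, std)

# Cross-check against scipy's implementation of the same distribution
reference_log_density = norm.logpdf(sample, loc=mean, scale=std)
print(np.allclose(log_density, reference_log_density))
```

Computing the log-density first and exponentiating afterwards keeps the two methods consistent and avoids underflow when many log-densities are later summed.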

The next step is to implement the classifier itself.

![title](https://www.saedsayad.com/images/Bayes_rule.png)
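For the implementation it helps to write the rule in log space. The notation here is our own assumption, not taken from the figure: $c$ is a class, $x = (x_1, \dots, x_d)$ is a feature vector, and the "naive" part is that features are treated as conditionally independent given the class.

$$
\log P(c \mid x) = \log P(c) + \sum_{j=1}^{d} \log P(x_j \mid c) - \log P(x),
\qquad
\log P(x) = \log \sum_{c'} \exp\Big( \log P(c') + \sum_{j=1}^{d} \log P(x_j \mid c') \Big)
$$

The last term does not depend on $c$, so it only matters when normalized probabilities are needed; the predicted class is the argmax of the first two terms.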

In [None]:
from scipy.special import logsumexp


class NaiveBayes():
    def fit(self, data, labels, distributions=None):
        self.unique_labels = np.unique(labels)
        
        distributions = distributions or [GaussianDistribution] * data.shape[1]
        self.label_likelihood = {}
        for label in self.unique_labels:
            distr_for_column = []
            for column_index in range(data.shape[1]):
                feature_column = data[labels == label, column_index]
                distr = distributions[column_index](feature_column)
                distr_for_column.append(distr)
            self.label_likelihood[label] = distr_for_column

        self.label_prior = {
            # <YOUR CODE HERE>
        }

    def predict_log_proba(self, batch):
        class_log_probas = np.zeros((batch.shape[0], len(self.unique_labels)))
        for label_idx, label in enumerate(self.unique_labels):
            for idx in range(batch.shape[1]):
                # Add the log-likelihood of feature `idx` given the fixed label
                class_log_probas[:, label_idx] += # <YOUR CODE HERE>
            # Add log proba of label prior
            class_log_probas[:, label_idx] += # <YOUR CODE HERE>

        # To turn joint log-probabilities into posterior log-probabilities,
        # subtract the log evidence (logsumexp over classes) from every row
        class_log_probas -= # <YOUR CODE HERE>
        return class_log_probas
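The two ingredients left blank above, class priors and normalization, can be illustrated on made-up numbers. This is a sketch under stated assumptions, not the reference solution: `toy_labels` and `joint_log_proba` are hypothetical, and the prior is assumed to be the empirical class frequency.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical labels: 3 objects of class 0, 1 of class 1, 4 of class 2
toy_labels = np.array([0, 0, 0, 1, 2, 2, 2, 2])

# Class prior = fraction of training objects that carry each label
unique_labels, counts = np.unique(toy_labels, return_counts=True)
log_prior = {label: np.log(count / len(toy_labels))
             for label, count in zip(unique_labels, counts)}

# Hypothetical joint log-probabilities log p(x, c): 2 objects x 3 classes
joint_log_proba = np.log(np.array([[0.10, 0.02, 0.08],
                                   [0.01, 0.03, 0.06]]))

# Log evidence log p(x): logsumexp over the class axis, kept as a column
log_evidence = logsumexp(joint_log_proba, axis=1, keepdims=True)

# Posterior log-probabilities: each row now exponentiates to a distribution
posterior_log_proba = joint_log_proba - log_evidence

print(np.exp(posterior_log_proba))
```

Using `logsumexp` instead of `np.log(np.exp(...).sum())` avoids underflow when the joint log-probabilities are very negative, which is common once many per-feature log-likelihoods are summed.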

In [None]:
nb = NaiveBayes()
nb.fit(features, target, distributions=[GaussianDistribution]*4)

In [None]:
nb_proba = nb.predict_log_proba(features)
nb_proba

## Compare with reference implementation

In [None]:
from sklearn.naive_bayes import GaussianNB

external_nb = GaussianNB()

external_nb.fit(features, target)

In [None]:
ext_nb_proba = external_nb.predict_proba(features)
ext_nb_proba

In [None]:
# nb_proba holds log-probabilities, so exponentiate before comparing with predict_proba
np.exp(nb_proba) - ext_nb_proba

## Advanced distribution for NaiveBayes

Although we do love the Gaussian distribution, it is unimodal, while our features are substantially multimodal (see the histograms above). So we need a more flexible density estimator: the Kernel Density Estimator (KDE).

The idea behind this method is simple: we assign some probability density to a region around each actual observation. (We will return to density estimation methods and describe them carefully later in this course.)

Fortunately, `sklearn` already has KDE implemented for us. All it needs is the feature values, passed as a 2D array of shape `(n_samples, 1)`.
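To see what the estimator computes, here is a hand-written Gaussian KDE compared with `KernelDensity` on made-up data. It is a sketch under assumptions: the observations, query points and the bandwidth of 0.5 are hypothetical, and `gaussian_kde_density` is a name introduced only for this illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical 1D observations and evaluation points
observations = np.array([1.0, 1.2, 1.5, 4.0, 4.4])
query = np.array([1.1, 3.0, 4.2])
bandwidth = 0.5  # assumed kernel width; a tuning parameter in practice


def gaussian_kde_density(query, observations, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each observation."""
    # Pairwise standardized distances, shape (n_query, n_observations)
    z = (query[:, None] - observations[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)


manual_density = gaussian_kde_density(query, observations, bandwidth)

# Cross-check against sklearn's estimator with the same kernel and bandwidth
kde_check = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
kde_check.fit(observations.reshape(-1, 1))
sklearn_density = np.exp(kde_check.score_samples(query.reshape(-1, 1)))

print(manual_density)
print(sklearn_density)
```

Because the density is a sum of bumps, it can have several modes, one around each cluster of observations, which a single Gaussian cannot represent.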

In [None]:
from sklearn.neighbors import KernelDensity

In [None]:
kde = KernelDensity(kernel='gaussian')

In [None]:
kde.fit(features[:, 2].reshape((-1, 1)))

In [None]:
class GaussianKDE:
    def __init__(self, feature):
        self.kde = KernelDensity()
        self.kde.fit(feature.reshape((-1, 1)))

    def log_proba(self, value):
        return self.kde.score_samples(value.reshape((-1, 1)))

    def proba(self, value):
        return np.exp(self.log_proba(value))

In [None]:
a = GaussianKDE(features[:, 2])

In [None]:
a.proba(features[:5, 2])

Now let's compare the classifiers by their number of errors ;)
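A hedged sketch of such a comparison follows. Since the from-scratch `NaiveBayes` class has blanks to fill in, this sketch uses `sklearn`'s `GaussianNB` as a stand-in for the Gaussian variant and a compact helper, `kde_naive_bayes_predict` (a hypothetical name), for the KDE variant. The bandwidth of 1.0 is an assumption that matches the `KernelDensity` default, and errors are counted on the training data, as in the cells above, so they say nothing about generalization.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KernelDensity

iris = datasets.load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)


def kde_naive_bayes_predict(X_train, y_train, X_test, bandwidth=1.0):
    """Naive Bayes prediction with one 1D Gaussian KDE per (class, feature) pair."""
    scores = np.zeros((X_test.shape[0], len(classes)))
    for class_idx, label in enumerate(classes):
        rows = X_train[y_train == label]
        # Log prior: fraction of training objects with this label
        scores[:, class_idx] = np.log(len(rows) / len(X_train))
        for j in range(X_train.shape[1]):
            kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
            kde.fit(rows[:, [j]])
            # Naive assumption: per-feature log-likelihoods simply add up
            scores[:, class_idx] += kde.score_samples(X_test[:, [j]])
    # argmax is unaffected by the evidence term, so no normalization is needed
    return classes[scores.argmax(axis=1)]


gaussian_pred = GaussianNB().fit(X, y).predict(X)
kde_pred = kde_naive_bayes_predict(X, y, X)

gaussian_errors = int((gaussian_pred != y).sum())
kde_errors = int((kde_pred != y).sum())
print('GaussianNB errors:', gaussian_errors)
print('KDE Naive Bayes errors:', kde_errors)
```

Trying a few smaller bandwidths in `kde_naive_bayes_predict` is a natural next experiment, since the kernel width controls how smooth each estimated density is.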