# Naive Bayes from Scratch

In a classification problem with $k$ classes $y_{1}, y_{2}, ..., y_{k}$ and $n$ attributes $x_{1}, x_{2}, ..., x_{n}$, the goal is to find the probability that an input with those attributes belongs to class $i$, that is, $P(C=y_{i}|x_{1}, x_{2}, ..., x_{n})$.

By Bayes' theorem, $$P(C=y_{i}|x_{1}, x_{2}, ..., x_{n}) = \frac{P(x_{1}, x_{2}, ..., x_{n}|C=y_{i})P(C=y_{i})}{P(x_{1}, x_{2}, ..., x_{n})}$$

* $P(C=y_{i}|x_{1}, x_{2}, ..., x_{n})$: Posterior probability
* $P(C=y_{i})$: Prior probability, estimated as $\frac{\#\text{examples of class } y_{i}}{\#\text{total examples}}$
* $P(x_{1}, x_{2}, ..., x_{n}|C=y_{i})$: Likelihood. Here we assume that the attributes are conditionally independent given the class, so $$P(x_{1}, x_{2}, ..., x_{n}|C=y_{i})=P(x_{1}|C=y_{i})P(x_{2}|C=y_{i})...P(x_{n}|C=y_{i})$$
* $P(x_{1}, x_{2}, ..., x_{n})$: Evidence. It does not depend on the class, so it is the same constant for every $y_{i}$ and can be dropped when comparing classes.

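Therefore, up to a constant factor that is the same for every class, $$P(C=y_{i}|x_{1}, x_{2}, ..., x_{n}) \propto P(C=y_{i})P(x_{1}|C=y_{i})P(x_{2}|C=y_{i})...P(x_{n}|C=y_{i}),$$ and the predicted class is the one that maximizes the right-hand side.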
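
As a minimal sketch of the rule above, the snippet below multiplies a prior by per-attribute likelihoods for two classes and picks the class with the larger product. All numbers are made up for illustration and do not come from the census data.

```python
import numpy as np

# Hypothetical values for a 2-class, 3-attribute problem (illustration only).
priors = np.array([0.7, 0.3])                 # P(C = y_i)
likelihoods = np.array([[0.5, 0.2, 0.1],      # P(x_j | C = y_1)
                        [0.4, 0.6, 0.5]])     # P(x_j | C = y_2)

# Unnormalized posterior: prior times the product of the likelihoods.
scores = priors * likelihoods.prod(axis=1)

# Dividing by the evidence (the sum of the scores) rescales every class
# by the same factor, so the arg max does not change.
posterior = scores / scores.sum()

predicted = int(np.argmax(scores))
print(scores, posterior, predicted)
```

Here the second class wins even though its prior is smaller, because its likelihoods are larger.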

Typically, a different distribution is used for each type of attribute:
* Binary: Bernoulli distribution (a binomial with a single trial)
* Categorical: Multinomial distribution
* Numeric: Gaussian distribution
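
The three cases can be sketched as one likelihood function per attribute type. The function names and the example parameters below are hypothetical and only illustrate the formulas. Note that the implementation in this notebook uses the Gaussian form for every column, including the one-hot encoded ones.

```python
import numpy as np

def bernoulli_likelihood(x, p):
    """P(x | class) for a binary attribute x in {0, 1}, with P(x = 1 | class) = p."""
    return p if x == 1 else 1.0 - p

def categorical_likelihood(x, probs):
    """P(x | class) for a categorical attribute; probs maps category -> probability."""
    return probs[x]

def gaussian_likelihood(x, mean, std):
    """Density of N(mean, std**2) at x, for a numeric attribute."""
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / np.sqrt(2 * np.pi * std ** 2)

# Hypothetical per-class parameters (illustration only).
print(bernoulli_likelihood(1, 0.3))
print(categorical_likelihood("Private", {"Private": 0.8, "State-gov": 0.2}))
print(gaussian_likelihood(40.0, mean=38.0, std=10.0))
```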

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

In [2]:
path = '~/ML-AZ/census.csv'
df = pd.read_csv(path)

df.head()

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
x_census = df.iloc[:, 0:14].values
y_census = df.iloc[:, 14].values

In [4]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# OneHotEncoder creates one dummy column for each unique value of a nominal
# attribute, so no artificial order is introduced, as integer labels such as
# 0, 1, 2, 3... would do.
# ColumnTransformer applies a transformer to selected columns of a
# multi-attribute array.
onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [1,3,5,6,7,8,9,13])],remainder='passthrough')
x_census = onehotencorder.fit_transform(x_census).toarray()
print(x_census.shape)
df_census = pd.DataFrame(data=x_census)

labelencoder = LabelEncoder()
y_census = labelencoder.fit_transform(y_census)
print(y_census.shape)
y_census = pd.DataFrame(data=y_census)

(32561, 108)
(32561,)
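
To make the encoding step concrete, here is a small sketch on a toy array (hypothetical data, not the census file) using the same `ColumnTransformer` and `OneHotEncoder` pattern. It passes `sparse_threshold=0` so the result is always dense, which is an alternative to calling `.toarray()` as the cell above does.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): column 0 is nominal, column 1 is numeric.
toy = np.array([["Private", 39],
                ["State-gov", 50],
                ["Private", 28]], dtype=object)

encoder = ColumnTransformer(
    transformers=[("OneHot", OneHotEncoder(), [0])],
    remainder="passthrough",   # keep the numeric column unchanged
    sparse_threshold=0,        # always return a dense array
)
encoded = encoder.fit_transform(toy)

# One dummy column per category (sorted: Private, State-gov), followed by
# the passed-through numeric column.
print(encoded.shape)
print(encoded)
```

The encoded columns come first and the untouched columns are appended after them, which is why the census matrix grows from 14 to 108 columns.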


In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(df_census, y_census, test_size = 0.25, random_state = 0)

In [6]:
print('Training examples shape is: ',X_train.shape)
print('Training labels shape is: ',Y_train.shape)
print('Test examples shape is: ',X_test.shape)
print('Test labels shape is: ',Y_test.shape)

Training examples shape is:  (24420, 108)
Training labels shape is:  (24420, 1)
Test examples shape is:  (8141, 108)
Test labels shape is:  (8141, 1)


In [7]:
def getParam(data):
    # [mean, standard deviation] of a column; np.std uses ddof=0 (population std)
    return [np.mean(data), np.std(data)]

In [8]:
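def gaussianProb(x, mean_, std_):
    # Gaussian-like density with smoothing. 1e-0 equals 1.0, so a full 1.0 is
    # added to both denominators: this avoids division by zero for columns with
    # zero variance, but the result is not an exact normal density.
    variance = float(std_)**2
    const = (2*np.pi*variance+1e-0)**.5
    return np.exp(-(float(x)-float(mean_))**2/(2*variance+1e-0))/const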
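
The effect of the `1e-0` term (which equals 1.0) can be checked with a short sketch that compares `gaussianProb` with `scipy.stats.norm.pdf`. The function is repeated here only so the snippet runs on its own, and the input values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def gaussianProb(x, mean_, std_):
    # Same formula as in the cell above.
    variance = float(std_)**2
    const = (2*np.pi*variance + 1e-0)**.5
    return np.exp(-(float(x) - float(mean_))**2 / (2*variance + 1e-0)) / const

# Zero-variance column: the exact normal density is undefined here,
# while the smoothed version stays finite.
smoothed_degenerate = gaussianProb(1.0, 0.0, 0.0)   # exp(-1) / 1

# Unit-variance column: the smoothed value differs from the exact density.
smoothed = gaussianProb(0.0, 0.0, 1.0)              # 1 / sqrt(2*pi + 1)
exact = norm.pdf(0.0, loc=0.0, scale=1.0)           # 1 / sqrt(2*pi)

print(smoothed_degenerate, smoothed, exact)
```

Because the same smoothing is applied to every class, the scores remain comparable across classes, but they should not be read as calibrated densities.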

In [9]:
class NaiveBayes:
    def __init__(self, unique_classes):  # initialize the classifier with the unique labels

        self.classes = unique_classes

    def train(self, df, y):
        """
        Training basically consists of building a matrix in which each row represents a class.
        Each entry in a row holds the mean and standard deviation of one attribute column for that class.

        Inputs: training dataset and the column with the class labels.
        Output: none.

        """

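        self.df = df
        self.y = y
        # Note: this adds a 'y' column to the caller's DataFrame in place, and
        # pandas emits a SettingWithCopyWarning when df is a slice of another
        # frame. Passing df.copy() to train() avoids both.
        self.df['y'] = self.y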

        print('-------- Start Training --------')
        print('Training with ', len(df), ' examples and ',
              len(self.classes), ' classes.')

        self.labelList = []
        self.prioriList = []
        for lbls in self.classes:

            # select only the examples whose class is lbls
            df_byClass = self.df.loc[self.df.iloc[:, -1] == lbls]
            # store the log of each class's proportion (the log prior)
            self.prioriList.append(np.log(len(df_byClass)/len(self.df)))

            # for each class, store the [mean, std] of every attribute column
            distList = [getParam(df_byClass.iloc[:, cols])
                        for cols in range(len(df_byClass.columns[:-1]))]

            self.labelList.append(distList)

        print('Log priori probability for class 0: ', self.prioriList[0])
        print('Log priori probability for class 1: ', self.prioriList[1])
        # print(self.labelList)
        print('--------  End Training  --------')

    def prob(self, xx):
        # Returns, for each class, log prior + sum of the attributes' log-likelihoods.
        multList = []
        for i in range(len(self.classes)):

            sum_ = 0
            for cols in range(len(self.labelList[i])):
                # 1e-12 keeps the log finite when the density underflows to 0
                sum_ += np.log(gaussianProb(xx[cols],
                                            self.labelList[i][cols][0],
                                            self.labelList[i][cols][1]) + 1e-12)

            multList.append(self.prioriList[i]+sum_)
        return multList

    def predict(self, x):

        x = x.reset_index(drop=True)
        yhat = [self.classes[np.argmax(
            self.prob(x.iloc[rows].tolist()))] for rows in range(len(x))]

        return yhat
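
The `prob` method returns one unnormalized log score per class, which is enough for `np.argmax`. If actual probabilities are wanted, one possible approach is to normalize in the log domain with `scipy.special.logsumexp`. The helper name `log_scores_to_probs` and the scores below are hypothetical and not part of the class above.

```python
import numpy as np
from scipy.special import logsumexp

def log_scores_to_probs(log_scores):
    """Turn unnormalized log posteriors into probabilities that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtracting logsumexp is the log-domain version of dividing by the evidence.
    return np.exp(log_scores - logsumexp(log_scores))

# Hypothetical log scores for two classes (illustration only).
probs = log_scores_to_probs([-1050.0, -1052.0])
print(probs)
```

Exponentiating scores this negative directly would underflow to zero for both classes, which is why the normalization is done before leaving the log domain.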

In [10]:
labels = list(np.unique(Y_train))
classificador = NaiveBayes(labels)

In [11]:
classificador.train(X_train, Y_train)

-------- Start Training --------
Training with  24420  examples and  2  classes.
Log priori probability for class 0:  -0.2743398634655972
Log priori probability for class 1:  -1.427423528119907
--------  End Training  --------


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.df['y'] = self.y


In [12]:
resultados = classificador.predict(X_test)
#resultados

In [13]:
from sklearn.metrics import confusion_matrix, accuracy_score
# accuracy on the test set ('precisao' is Portuguese for 'precision', but the metric is accuracy)
precisao = accuracy_score(Y_test, resultados)
precisao

0.8193096671170618
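
For context, the log priors printed during training can be converted back to class proportions, which gives a simple reference point: the training-set accuracy of always predicting the most frequent class. This is a rough sketch, and the class balance of the test set may differ slightly from that of the training set.

```python
import numpy as np

# Log priors printed by train() above.
log_prior_0 = -0.2743398634655972
log_prior_1 = -1.427423528119907

# Back to plain proportions of each class in the training set.
prior_0 = np.exp(log_prior_0)
prior_1 = np.exp(log_prior_1)

# Training-set accuracy of always predicting the most frequent class.
majority_baseline = max(prior_0, prior_1)
print(round(prior_0, 4), round(prior_1, 4), round(majority_baseline, 4))
```

Class 0 makes up roughly 0.76 of the training examples, so the accuracy of about 0.82 reported above is an improvement over that baseline.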