# Example Challenge Naïve Bayes Detector

# PSF-PSP 2019-2020

## Grado Ingeniería Biomédica - Biomedical Engineering Degree

### Universidad Rey Juan Carlos


### Authors

#### Óscar Barquero Pérez (<oscar.barquero@urjc.es>), Rebeca Goya Esteban (<rebeca.goyaesteban@urjc.es>), Miguel Ángel Cámara Vázquez (<miguelangel.camara@urjc.es>)

#### Today

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Licencia de Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

## Notebook

In this notebook we develop a very simple Naïve Bayes detector to work on the challenge. Naïve Bayes is a MAP Bayes detector that assumes the features are conditionally independent given the class (it is naïve in that sense). Given the circumstances, we use the scikit-learn implementation instead of writing our own.

## Bayesian Detector -- Naïve Bayes

In this part of the lab we are going to *implement* a Bayesian detector. Recall from the theory lectures that a MAP Bayesian detector, for the binary case, has the following structure:

$$P(H_1|\boldsymbol{x}) \mathop{\gtrless}^{D_1}_{D_0}P(H_0|\boldsymbol{x})$$

Equivalently,

$$P(\boldsymbol{x}|H_1)P(H_1) \mathop{\gtrless}^{D_1}_{D_0}P(\boldsymbol{x}|H_0) P(H_0)$$
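
As a minimal sketch of the rule above, assume a scalar observation with a Gaussian likelihood under each hypothesis. The means, standard deviations and priors are made-up illustrative values, and `gaussian_pdf` and `map_decide` are hypothetical helper names, not part of the challenge code.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian pdf with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

def map_decide(x, prior_0, prior_1):
    """Return 1 if P(x|H1)P(H1) > P(x|H0)P(H0), otherwise 0."""
    # Illustrative likelihoods (assumption): H0 ~ N(0, 1), H1 ~ N(2, 1)
    score_0 = gaussian_pdf(x, mu=0.0, sigma=1.0) * prior_0
    score_1 = gaussian_pdf(x, mu=2.0, sigma=1.0) * prior_1
    return int(score_1 > score_0)

# With equal priors the decision threshold sits halfway between the means
print(map_decide(0.5, prior_0=0.5, prior_1=0.5))
print(map_decide(1.5, prior_0=0.5, prior_1=0.5))
# A strong prior for H0 moves the threshold towards H1
print(map_decide(1.5, prior_0=0.9, prior_1=0.1))
```

The last call illustrates why the priors matter: the same observation can lead to a different decision when one hypothesis is much more likely a priori.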

The main difficulty of this detector is computing the likelihood, which is a joint conditional pdf:

$$p(x_1,\ldots,x_n|H_i)$$

At this point we can make assumptions that simplify our model. Specifically, we make a **naïve** assumption about the relationship between the features $x_i$: **we assume that the features are conditionally independent given the hypothesis**. Therefore:

$$p(x_1,\ldots,x_n|H_i)=p(x_1|H_i)p(x_2|H_i)\cdots p(x_n|H_i)$$
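
The factorization above can be sketched numerically. Assuming Gaussian features (an assumption made only for this illustration), the joint likelihood is a product of one-dimensional pdfs. In practice it is usually safer to sum log-pdfs, because a product of many small values can underflow. All parameter values below are arbitrary.

```python
import numpy as np

def log_gaussian_pdf(x, mu, var):
    """Log of the Gaussian pdf with mean mu and variance var (element-wise)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

# Three features with arbitrary illustrative parameters for one hypothesis
x = np.array([0.9, 1.4, 2.1])
mu = np.array([1.0, 1.5, 2.0])
var = np.array([0.2, 0.3, 0.5])

# Joint likelihood as a product of per-feature pdfs ...
joint_product = np.prod(np.exp(log_gaussian_pdf(x, mu, var)))
# ... and the same quantity as a sum of per-feature log-pdfs
joint_log = np.sum(log_gaussian_pdf(x, mu, var))
print(joint_product, np.exp(joint_log))

# With many features far from the mean, the plain product underflows
x_long = np.zeros(2000)
product_long = np.prod(np.exp(log_gaussian_pdf(x_long, 5.0, 1.0)))
log_sum_long = np.sum(log_gaussian_pdf(x_long, 5.0, 1.0))
print(product_long, log_sum_long)
```

Because the logarithm is monotonic, comparing log-likelihoods leads to the same decision as comparing the likelihoods themselves.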

The conditional distribution of each feature is usually one of the following:
* Bernoulli (single-trial binomial) distribution, when the feature is binary (yes or no)
* Multinomial distribution, when the feature has several levels (categorical variable)
* Gaussian pdf, when the feature is numerical. Note that the variable can be transformed to make it closer to normal (log, etc.)
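
As an illustration of the transformation mentioned in the last bullet, the sketch below draws a skewed synthetic feature (log-normal, chosen only for this example) and compares the sample skewness before and after a log transform. `sample_skewness` is a hypothetical helper, and the data are not from the challenge.

```python
import numpy as np

def sample_skewness(v):
    """Third standardized moment; close to 0 for symmetric data."""
    centered = v - v.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
# Strictly positive, right-skewed synthetic feature
feature = rng.lognormal(mean=0.0, sigma=0.8, size=5000)

skew_raw = sample_skewness(feature)
skew_log = sample_skewness(np.log(feature))
print('Skewness raw: %.2f, after log: %.2f' % (skew_raw, skew_log))
```

A log transform only applies to strictly positive features; for other features a different transformation, or none, may be more appropriate.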

For example, suppose that the pdf of the j-th feature is Gaussian. It is then given by the following equation:

$$p(x_j|H_i)= \frac{1}{\sqrt{2\pi\sigma_j^2}}e^{-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}}$$

This pdf has two parameters whose values are unknown: $\mu_j$ and $\sigma^2_{j}$.

In the training phase we use the available data to estimate these parameters. We also assume that the training samples are independent, so that we can use the following maximum likelihood estimators (for each class $H_i$, the sums run over the training samples of that class, so $N_{train}$ is the number of training samples in that class):

$$\hat{\mu}_j = \frac{1}{N_{train}}\sum^{N_{train}}_{k=1}x_{k,j}$$
$$\hat{\sigma}^2_j = \frac{1}{N_{train}}\sum^{N_{train}}_{k=1}(x_{k,j}-\hat{\mu}_j)^2$$
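
A minimal NumPy sketch of these estimators, computed separately for each class on made-up toy data. `fit_gaussian_per_class` is a hypothetical helper name. Note `ddof=0`, which divides by $N$ as in the maximum likelihood estimator rather than by $N-1$.

```python
import numpy as np

def fit_gaussian_per_class(X, y):
    """Return (classes, means, variances).

    means and variances have shape (n_classes, n_features).
    """
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # ddof=0 divides by N, matching the maximum likelihood estimator
    variances = np.array([X[y == c].var(axis=0, ddof=0) for c in classes])
    return classes, means, variances

# Toy data: 4 samples, 2 features, 2 classes
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [14.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

classes, means, variances = fit_gaussian_per_class(X_toy, y_toy)
print(means)
print(variances)
```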


The last parameters we have to estimate are the prior probabilities of each class. These must also be estimated from the training data.
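
One simple way to do this, sketched here on made-up labels, is to use the relative frequency of each class in the training set:

```python
import numpy as np

# Toy labels for three classes (illustrative values only)
y_toy = np.array([0, 0, 1, 2, 2, 2])

classes, counts = np.unique(y_toy, return_counts=True)
priors = counts / counts.sum()

for c, p in zip(classes, priors):
    print('P(H_%d) = %.3f' % (c, p))
```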

### Read data

In this step we read our data. Since the aim of the notebook is to show how to use Naïve Bayes, we skip any real signal processing and simply use the raw samples of each sequence as features. To keep the number of features small, we use only the first 100 samples.

In the following cells we are going to define a function to extract the samples from a patient in the dataset.

In [4]:
%matplotlib inline

%load_ext autoreload
%autoreload 2

import os
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt



In [27]:
def get_features_one_subject(sub_id):
    """
    Extract the first 100 samples of the X axis from the right-foot sensor.
    """
    # Assumes the current working directory is the folder that contains
    # one sub-folder per subject.

    # Load the right-foot recording from the subject folder (header row skipped)
    right_foot = np.loadtxt(os.path.join(sub_id, 'PD.txt'), skiprows=1)

    signal = right_foot[:, 0]  # x axis

    return signal[:100]

In [28]:
# Move into the data folder

pwd = os.getcwd()
path = 'C:\\Users\\riul0\\Desktop\\Physiological signals and processing\\challeng\\Data\\Training\\'

os.chdir(path)

subjects = os.listdir()

# On macOS, uncomment the next line to drop the '.DS_Store' entry
#subjects.pop(subjects.index('.DS_Store'))

print(subjects)

# Build the data matrix: one row per subject
X = []

for sub in subjects:
    signal = get_features_one_subject(sub)
    X.append(signal)

X = np.array(X)

['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '5', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '6', '60', '7', '8', '9']


In [29]:
print(X.shape)

(60, 100)


Note that X is our data matrix:
 * *Rows*: subjects
 * *Columns*: features (the first 100 samples of each subject)
 
Note also that you need the subjects list to match each row with its class. Important: the subjects list is not in numerical order, because the folder names are strings (for example, '10' comes before '2').

Before implementing the Naïve Bayes detector, we sort the subjects list numerically and reorder the rows of the X matrix accordingly.

In [45]:
# Convert the subjects list into an integer NumPy array
subjects = np.asarray(subjects, dtype=int)

print(subjects)

# Sort the subject ids and keep the sorting indices
sorted_idx = np.argsort(subjects)

subjects_s = subjects[sorted_idx]

print(subjects_s)

# Reorder the rows of X with the same indices
X_s = X[sorted_idx, :]

# Sanity checks
print(X_s[-2, :5])  # subject id 59 (second-to-last row)
print(X_s[58, :5])  # subject id 59 again: row index k holds subject id k+1
print(X[0, :5])     # subject id 1: first row of the unsorted matrix



[ 1 10 11 12 13 14 15 16 17 18 19  2 20 21 22 23 24 25 26 27 28 29  3 30
 31 32 33 34 35 36 37 38 39  4 40 41 42 43 44 45 46 47 48 49  5 50 51 52
 53 54 55 56 57 58 59  6 60  7  8  9]
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60]
[0.72001955 0.92396387 1.04339737 1.19983166 1.44950265]
[0.72001955 0.92396387 1.04339737 1.19983166 1.44950265]
[1.96684025 2.33229933 2.17437078 1.4850388  2.03176323]


In [41]:
print(X_s[0,:])

[ 1.96684025  2.33229933  2.17437078  1.4850388   2.03176323  1.33381438
  0.74213112  0.29864905  0.2613164   0.42963489  0.19909943  0.47547822
  0.65118796  0.67530567  0.79753349  1.08456384  1.24923236  1.47868653
  1.8547614   2.28140199  2.63765537  2.92819578  2.07678628  0.67664052
  0.75447847  0.71689387  0.99024994  1.91589253  0.80978786  1.07909644
  0.91989717  0.98850765  1.05654524  0.989506    1.03508108  1.02343782
  1.01007346  1.02693072  1.00092083  0.95226873  0.92738134  0.92738134
  0.96365429  0.9745363   0.96630292  0.95881796  0.96906888  1.03740952
  1.08231901  1.13986972  1.25336543  1.37647619  1.42105328  1.40918164
  1.61420329  1.95805362  2.28905332  2.11035807  1.82332478  2.11324198
  2.03173954  1.30052276  0.46481567  0.50410835  0.54125859  0.14371019
  0.25323915  0.47446006  0.58431152  0.66166653  0.83831015  1.07978016
  1.31734483  1.62247887  2.32107131  3.06643867  3.18034273  1.99398737
  0.198872   -0.170167    2.52764933  2.31845485  1

In [43]:
print(X_s[0:])

[[1.96684025 2.33229933 2.17437078 ... 0.9408541  0.97079376 1.01645163]
 [0.67467895 0.91166285 1.22602938 ... 1.43560715 1.52093526 1.60447845]
 [0.92738124 1.07558264 1.18935339 ... 1.39171796 1.43693507 1.51485382]
 ...
 [1.93335427 1.47490327 1.7424887  ... 2.10967291 1.9977247  1.73076263]
 [0.72001955 0.92396387 1.04339737 ... 1.92241851 2.45879399 2.49247906]
 [2.25894848 2.2044954  2.2157229  ... 0.37724172 0.61675659 0.25841034]]


Next, we implement the Naïve Bayes detector. For this we need the labels. Since we have no other information, we estimate the prior probabilities from the labels.

In [31]:
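from sklearn.naive_bayes import GaussianNB

# Back to the original folder
os.chdir(pwd)

# Load the labels. Column 1 holds the class of each subject; the rows are
# assumed to be ordered by subject id so that they line up with X_s.
h = np.loadtxt('C:\\Users\\riul0\\Desktop\\Physiological signals and processing\\challeng\\Training.csv',skiprows=1,delimiter=',')

# Prior probabilities, estimated as the relative frequency of each class
Prior_0 = np.sum(h[:,1]==0)/len(h)
Prior_1 = np.sum(h[:,1]==1)/len(h)
Prior_2 = np.sum(h[:,1]==2)/len(h)

# Naive Bayes model
nb_detector = GaussianNB(priors = [Prior_0,Prior_1,Prior_2])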

We have declared the Naïve Bayes detector. Next we need to estimate the mean and variance of the Gaussian for each feature and each class. This is done with the `fit` method.

In [32]:
nb_detector.fit(X_s,h[:,1])

GaussianNB(priors=[0.3333333333333333, 0.3333333333333333, 0.3333333333333333],
           var_smoothing=1e-09)

In [33]:
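# We can inspect some of the fitted parameters, for example the mean and
# variance of the Gaussian for the first class and the first feature.
# Note: scikit-learn 1.0 renamed sigma_ to var_, and sigma_ was removed in 1.2.

print('Mean: %.2f'%nb_detector.theta_[0,0])
print('Variance: %.2f'%nb_detector.sigma_[0,0])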

Mean: 1.40
Variance: 0.34


Now that the detector is trained, the next step is detection. This is done by the `predict` method: given one sample $x^*$, it compares the posteriors and picks the hypothesis with the largest one:

$$\hat{D} = \arg\max_{i \in \{0,1,2\}} P(H_i|x^*)$$
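
To make the rule above concrete, here is a from-scratch NumPy sketch of the decision step for Gaussian features. It is an illustration under made-up parameters, not a copy of scikit-learn's internal code, and `log_posteriors` is a hypothetical helper name. The posteriors are compared in the log domain and without the common normalizing term, which does not change the argmax.

```python
import numpy as np

def log_posteriors(x, means, variances, priors):
    """Unnormalized log posterior of each class for a single sample x.

    means, variances: arrays of shape (n_classes, n_features)
    priors: array of shape (n_classes,)
    """
    # Sum of per-feature Gaussian log-pdfs (naive independence assumption)
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    return np.log(priors) + log_lik

# Illustrative parameters for three classes and two features
means = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
variances = np.ones((3, 2))
priors = np.array([1 / 3, 1 / 3, 1 / 3])

x_star = np.array([1.8, 2.1])
decision = int(np.argmax(log_posteriors(x_star, means, variances, priors)))
print('Detection: %d' % decision)
```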

We are going to test for the first subject id=1. For this subject we know that the class is 2. 

Be aware that we are testing on the same data we used for training, so the results are optimistic (overfitted). Expect worse results on the test set.

In [34]:
x_1 = X_s[0,:]

D = nb_detector.predict(x_1[np.newaxis,:])

print('Detection: %d'%D[0])
print('Hypothesis H: %d' %h[0,1])

Detection: 2
Hypothesis H: 2


Let's check the accuracy (the fraction of subjects whose class is detected correctly).

In [35]:
D = nb_detector.predict(X_s)

print('ACC = %.2f'%np.mean(D == h[:,1]))

ACC = 0.68
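
The accuracy above is measured on the training data. One common way to get a less optimistic estimate is k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic data that only mimics the shape of the notebook's matrix (60 subjects, 100 features, 3 balanced classes); `X_demo` and `y_demo` are made-up stand-ins, so the printed value says nothing about the challenge data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in for the data matrix and labels (assumption, not real data)
y_demo = np.repeat([0, 1, 2], 20)
X_demo = rng.normal(size=(60, 100)) + 0.3 * y_demo[:, np.newaxis]

# 5-fold cross-validation: each fold is held out once and scored
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5)
print('CV accuracy: %.2f +/- %.2f' % (scores.mean(), scores.std()))
```

With the notebook's data, one would pass `X_s` and `h[:,1]` instead of the synthetic arrays.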


## Results

This simple method gives a fairly decent result, at least on the training data. Let's hope this method won't win the challenge.