# Naïve Bayes Activity
In this activity, you will implement the Naïve Bayes algorithm and apply it to the dataset provided with the assignment. The dataset describes stars and characteristics of the light waves they emit, such as amplitude and variance, among others. The task is to classify the stars according to these features. The column to predict is named `label`.

## Preprocessing

1. Normalize the *feature* columns. Do not normalize the `label` column.
2. Split the data into 70% for training and 30% for evaluation.

You may use external libraries for these two steps.
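
As a rough sketch of these two steps, assuming min-max scaling as the normalization and a toy DataFrame (`toy`, `features` and all values below are made up for illustration, not the activity's dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration; the real activity uses the star dataset.
toy = pd.DataFrame({
    "label": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Amplitude": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
    "Std": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0],
})
features = ["Amplitude", "Std"]

# Min-max scaling maps every feature column to [0, 1]; `label` is untouched.
scaled = toy.copy()
col_min = scaled[features].min()
col_max = scaled[features].max()
scaled[features] = (scaled[features] - col_min) / (col_max - col_min)

# 70/30 split. `random_state` makes the split reproducible and `stratify`
# keeps class proportions similar in both parts (both are optional choices).
toy_train, toy_test = train_test_split(
    scaled, test_size=0.3, random_state=0, stratify=scaled["label"]
)
```

Fixing `random_state` is a design choice: without it, every run produces a different split and therefore slightly different evaluation numbers.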

In [2]:
# Basic libraries for the activity. You may drop some of them or add more if needed.
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.metrics import classification_report

In [3]:
DATA_URL = "https://gist.githubusercontent.com/diflores/9d5f62c38718cc6128cf6f19d822a025/raw/f2db6632b8d57812cbb41d9598a67469acdbabaa/naive_db.csv"

In [9]:
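data = pd.read_csv(DATA_URL)
# Feature columns to normalize; `label` is deliberately left out.
por_normalizar = ["Amplitude", "Std", "PeriodLS", "Mean", "MaxSlope", "Meanvariance", "LinearTrend"]
# Min-max scaling: each column is mapped to the [0, 1] range.
data[por_normalizar] = data[por_normalizar].apply(lambda x: (x - x.min())/(x.max()-x.min()))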
data

Unnamed: 0,label,Amplitude,Std,PeriodLS,Mean,MaxSlope,Meanvariance,LinearTrend
0,5,0.112269,0.125596,0.017509,0.126103,0.000215,0.150121,0.423558
1,4,0.146218,0.150753,0.000017,0.862854,0.006795,0.127887,0.450049
2,2,0.014592,0.014232,0.013491,0.352734,0.000029,0.015043,0.462669
3,3,0.100655,0.108427,0.000039,0.807383,0.002514,0.093636,0.450919
4,2,0.001489,0.001681,0.030278,0.348337,0.001224,0.001721,0.444873
...,...,...,...,...,...,...,...,...
995,5,0.078023,0.082672,0.003575,0.260801,0.001876,0.092092,0.209445
996,0,0.213222,0.187659,0.000404,0.839469,0.018958,0.161093,0.452667
997,3,0.272781,0.274041,0.000017,0.905673,0.002996,0.230009,0.446884
998,2,0.001191,0.001292,0.024169,0.371093,0.001801,0.001206,0.444760


In [67]:
train, test = train_test_split(data, test_size=0.3)
train

Unnamed: 0,label,Amplitude,Std,PeriodLS,Mean,MaxSlope,Meanvariance,LinearTrend
592,0,0.196992,0.177612,0.000171,0.944873,0.096609,0.146208,0.449198
547,4,0.128946,0.135694,0.000014,0.828653,0.209013,0.116549,0.367132
549,4,0.060453,0.076840,0.000020,0.412139,0.012320,0.079185,0.488697
712,2,0.006849,0.005645,0.146379,0.363495,0.001316,0.005843,0.445321
86,2,0.060155,0.078090,0.043459,0.201438,0.002928,0.089833,0.443047
...,...,...,...,...,...,...,...,...
881,2,0.005658,0.005747,0.000056,0.280430,0.000400,0.006524,0.444310
180,4,0.088743,0.111025,0.000013,0.671756,0.033945,0.101639,0.497492
546,2,0.002680,0.003269,0.013895,0.384234,0.000660,0.003219,0.443204
382,5,0.031269,0.028583,0.025213,0.238146,0.001463,0.032475,0.807215


## Implementation

1. Write a function `get_priors` that computes the *prior* of each class ($\frac{number\_of\_rows\_with\_label\_l}{total\_number\_of\_rows}$).
2. Implement a function `predict` that computes the numerators of the *posterior* and uses them to classify. Recall that the numerator of the *posterior* for a label `l` is:
$Prior(l) \cdot p(Amplitude|l) \cdot p(Std|l) \cdot p(Mean|l) \cdot p(MaxSlope|l) \cdot p(MeanVariance|l) \cdot p(LinearTrend|l)$.

Each $p$, for a single `test` row, a feature `f` and a label `l`, is given by:

$\frac{\exp\left(-(test\_value\_of\_feature\_f - train\_mean\_of\_feature\_f\_and\_label\_l)^2/(2 \cdot train\_variance\_of\_feature\_f\_and\_label\_l)\right)}{\sqrt{2\pi \cdot train\_variance\_of\_feature\_f\_and\_label\_l}}$

Example: suppose a row has `label` 0 and $Amplitude = 0.08$. Suppose also that the mean of the *Amplitude* column, over the rows whose `label` is 0, equals 0.105777, and that the variance in the same case is 0.829561.

$p(Amplitude=0.08|label=0) = \frac{\exp\left(-(0.08 - 0.105777)^2/(2 \cdot 0.829561)\right)}{\sqrt{2\pi \cdot 0.829561}}$
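
The example above can be checked numerically. The helper name `gaussian_likelihood` is an assumption made for this sketch; it only evaluates the Gaussian density with the numbers given in the example:

```python
import math

def gaussian_likelihood(value, mean, variance):
    """Gaussian density of `value` for a given class mean and variance."""
    exponent = -((value - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)

# Numbers from the example: Amplitude = 0.08, mean = 0.105777, variance = 0.829561.
p = gaussian_likelihood(0.08, 0.105777, 0.829561)
print(round(p, 3))  # about 0.438
```

Because the test value is very close to the class mean relative to the variance, the exponential term is nearly 1 and the result is dominated by the normalizing denominator.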

Therefore, for a single `test` row, you must compute every $p$ and multiply the *prior* by the product of all of them.

Finally, the predicted class of a test row is the `label` with the highest *posterior* numerator.
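
Multiplying many small densities can underflow to zero in floating-point arithmetic. One common workaround, not required by the activity, is to compare sums of logarithms instead of products; since the logarithm is monotonic, the winning label is the same. A minimal sketch, where `toy_priors`, `toy_stats` and the function names are hypothetical and the values are made up for illustration:

```python
import math

# Hypothetical per-class parameters: priors and (mean, variance) per feature.
toy_priors = {0: 0.5, 1: 0.5}
toy_stats = {
    0: {"Amplitude": (0.10, 0.01), "Std": (0.20, 0.02)},
    1: {"Amplitude": (0.60, 0.01), "Std": (0.70, 0.02)},
}

def log_gaussian(value, mean, variance):
    """Logarithm of the Gaussian density used for each p."""
    return -((value - mean) ** 2) / (2 * variance) - 0.5 * math.log(2 * math.pi * variance)

def predict_log(row, priors, stats):
    """Return the label with the highest log of the posterior numerator."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature, (mean, variance) in stats[label].items():
            score += log_gaussian(row[feature], mean, variance)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict_log({"Amplitude": 0.12, "Std": 0.25}, toy_priors, toy_stats))  # 0
print(predict_log({"Amplitude": 0.55, "Std": 0.65}, toy_priors, toy_stats))  # 1
```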

**Important**: you must write the implementation yourself. That is, using Naïve Bayes implementations from external libraries is strictly forbidden.

In [57]:
def get_priors(df):
    # Fraction of rows belonging to each label, returned as a {label: prior} dict.
    return df["label"].value_counts(normalize=True).sort_index().to_dict()

def predict(row: pd.Series, priors) -> int:
    # Uses the global `train` DataFrame to estimate per-class means and variances.
    por_aplicar = ["Amplitude", "Std", "PeriodLS", "Mean", "MaxSlope", "Meanvariance", "LinearTrend"]
    posteriores = {}
    for label in priors:
        # Start from the prior and multiply by one Gaussian likelihood per feature.
        result = priors[label]
        label_data = train[train["label"] == label]
        for feature in por_aplicar:
            mean = label_data[feature].mean()
            var = label_data[feature].var()
            result *= np.exp(-((row[feature] - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
        posteriores[label] = result
    # The predicted label is the one with the largest posterior numerator.
    return max(posteriores, key=posteriores.get)
        
priors = get_priors(train)

## Evaluation

1. Apply the algorithm to your evaluation set. Use the `apply` function from `pandas`.
2. Print the *classification report* from `sklearn`. Comment on the classifier's performance.
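
To interpret the report it helps to recall what its columns mean: for a given class, precision is the fraction of rows predicted as that class that truly belong to it, and recall is the fraction of rows of that class that were predicted correctly. A small sketch with made-up labels (`y_true`, `y_pred` and `precision_recall` are illustrative, not results or code from the activity):

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from paired label lists."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Made-up labels: half of the class-0 rows are misclassified as class 3.
y_true = [0, 0, 0, 0, 3, 3, 3, 3]
y_pred = [0, 0, 3, 3, 3, 3, 3, 3]

print(precision_recall(y_true, y_pred, 0))  # (1.0, 0.5)
print(precision_recall(y_true, y_pred, 3))
```

In this toy case class 0 has high precision but low recall, while class 3 absorbs the misclassified rows and so has perfect recall but lower precision.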

In [70]:
predicted = test.apply(lambda x: predict(x, priors), axis=1)
print(classification_report(test["label"], predicted))

              precision    recall  f1-score   support

           0       0.85      0.41      0.55        69
           2       0.88      0.98      0.93        61
           3       0.64      0.92      0.76        64
           4       0.85      0.92      0.88        48
           5       0.96      0.91      0.94        58

    accuracy                           0.81       300
   macro avg       0.84      0.83      0.81       300
weighted avg       0.83      0.81      0.80       300

