# Practice 6: Distance-Based and Bayesian Classifiers

## Instructions:
1. Select a dataset and apply whatever preprocessing you consider appropriate.
2. Apply the following classification models:
    - 1NN
    - KNN with K={3,5,7,9}
    - Naive Bayes
3. Evaluate them with these validation methods:
    - Hold-out 70/30
    - 10-Fold Cross-Validation
    - Leave-One-Out
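
As a rough orientation, the three validation schemes differ mainly in how many times each model has to be fitted. The sketch below uses a small placeholder array rather than one of the notebook's datasets; `X_demo` and `y_demo` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, LeaveOneOut

# Placeholder data (assumption): 100 samples with an 80/20 class split
X_demo = np.zeros((100, 2))
y_demo = np.array([0] * 80 + [1] * 20)

# Hold-out 70/30: a single stratified split, so one fit per model
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# 10-fold cross-validation: ten fits per model
n_kfold_fits = StratifiedKFold(n_splits=10).get_n_splits(X_demo, y_demo)

# Leave-one-out: one fit per sample
n_loo_fits = LeaveOneOut().get_n_splits(X_demo)

print(len(X_tr), len(X_te), n_kfold_fits, n_loo_fits)
```

The leave-one-out fit count grows linearly with the number of samples, which is worth keeping in mind before running it on a large dataset.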

In [7]:
from joblib import load
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut, cross_validate
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report, matthews_corrcoef, 
                             cohen_kappa_score, balanced_accuracy_score,
                             make_scorer)
import numpy as np
import pandas as pd

In [8]:
models = {
    '1NN': KNeighborsClassifier(n_neighbors=1),
    '3NN': KNeighborsClassifier(n_neighbors=3),
    '5NN': KNeighborsClassifier(n_neighbors=5),
    '7NN': KNeighborsClassifier(n_neighbors=7),
    '9NN': KNeighborsClassifier(n_neighbors=9),
    'Naive Bayes': GaussianNB()
}

In [9]:
def evaluate_model(y_true, y_pred, y_proba=None):
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Balanced Accuracy': balanced_accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred, average='weighted', zero_division=0),
        'Recall': recall_score(y_true, y_pred, average='weighted', zero_division=0),
        'F1-Score': f1_score(y_true, y_pred, average='weighted', zero_division=0),
        'MCC': matthews_corrcoef(y_true, y_pred),
        'Cohen Kappa': cohen_kappa_score(y_true, y_pred)
    }
    
    if y_proba is not None:
        try:
            if len(np.unique(y_true)) == 2:
                metrics['ROC-AUC'] = roc_auc_score(y_true, y_proba[:, 1])
            else:
                metrics['ROC-AUC'] = roc_auc_score(y_true, y_proba, 
                                                   average='weighted', 
                                                   multi_class='ovr')
        except ValueError:
            # roc_auc_score raises ValueError, e.g. when a class is absent from y_true
            metrics['ROC-AUC'] = np.nan
    
    return metrics

## Predictive Maintenance Dataset

The preprocessed datasets, serialized as joblib files, are loaded below.

In [10]:
X_mainteinance_filepath = 'datasets/mainteinance/X_mainteinanceDataset.joblib'
y_mainteinance_filepath = 'datasets/mainteinance/y_mainteinanceDataset.joblib'

In [11]:
X_m = load(X_mainteinance_filepath)
y_m = load(y_mainteinance_filepath)

In [12]:
print(X_m.info())
print(y_m.unique())

<class 'pandas.core.frame.DataFrame'>
Index: 9976 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   air_temperature      9976 non-null   float64
 1   process_temperature  9976 non-null   float64
 2   rotational_speed     9976 non-null   float64
 3   torque               9976 non-null   float64
 4   tool_wear            9976 non-null   float64
 5   L                    9976 non-null   float64
 6   M                    9976 non-null   float64
dtypes: float64(7)
memory usage: 623.5 KB
None
[0 3 5 2 4 1]


## HOLD-OUT 70/30

In [13]:
X_m_train, X_m_test, y_m_train, y_m_test = train_test_split(X_m, y_m, test_size=0.3, 
                                                    random_state=42, stratify=y_m)

results_holdout = []
for name, model in models.items():
    model.fit(X_m_train, y_m_train)
    y_m_pred = model.predict(X_m_test)
    y_m_proba = model.predict_proba(X_m_test) if hasattr(model, 'predict_proba') else None
    
    metrics = evaluate_model(y_m_test, y_m_pred, y_m_proba)
    metrics['Model'] = name
    results_holdout.append(metrics)
    
    print(f"\n{name}:")
    print(f"  Accuracy_m:          {metrics['Accuracy']:.4f}")
    print(f"  Balanced Accuracy_m: {metrics['Balanced Accuracy']:.4f}")
    print(f"  Precision:         {metrics['Precision']:.4f}")
    print(f"  Recall:            {metrics['Recall']:.4f}")
    print(f"  F1-Score:          {metrics['F1-Score']:.4f}")
    print(f"  MCC:               {metrics['MCC']:.4f}")
    print(f"  Cohen Kappa:       {metrics['Cohen Kappa']:.4f}")
    if 'ROC-AUC' in metrics and not np.isnan(metrics['ROC-AUC']):
        print(f"  ROC-AUC:           {metrics['ROC-AUC']:.4f}")
    
    cm = confusion_matrix(y_m_test, y_m_pred)
    print(f"  Confusion Matrix:\n{cm}")

df_holdout = pd.DataFrame(results_holdout)
print(df_holdout.to_string(index=False))


1NN:
  Accuracy_m:          0.9642
  Balanced Accuracy_m: 0.4573
  Precision:         0.9642
  Recall:            0.9642
  F1-Score:          0.9640
  MCC:               0.4169
  Cohen Kappa:       0.4165
  ROC-AUC:           0.6978
  Confusion Matrix:
[[2846   17    6    7    7   13]
 [  17   15    0    0    0    0]
 [  11    0   12    0    0    0]
 [  14    0    0   10    0    0]
 [   4    0    0    0    1    0]
 [  11    0    0    0    0    2]]

3NN:
  Accuracy_m:          0.9709
  Balanced Accuracy_m: 0.3090
  Precision:         0.9620
  Recall:            0.9709
  F1-Score:          0.9637
  MCC:               0.3761
  Cohen Kappa:       0.3287
  ROC-AUC:           0.7965
  Confusion Matrix:
[[2884    7    3    0    0    2]
 [  25    7    0    0    0    0]
 [  15    0    8    0    0    0]
 [  17    0    0    7    0    0]
 [   5    0    0    0    0    0]
 [  13    0    0    0    0    0]]

5NN:
  Accuracy_m:          0.9726
  Balanced Accuracy_m: 0.2834
  Precision:         0.9625


### Observations
**Low balanced accuracy: caused by the class imbalance**

**Best model: 5NN**
- 97.26% accuracy
- Good balance across the general metrics
- 0.8188 ROC-AUC (moderate to good)

**Best ROC-AUC: Naive Bayes**

**KNN models with high K tend to predict the majority class**
**KNN models with low K predict minority classes but are unstable**
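
To make the gap between accuracy and balanced accuracy concrete, the following sketch recomputes both from the 1NN hold-out confusion matrix printed above. Balanced accuracy is the unweighted mean of the per-class recalls, so the five small failure classes weigh as much as the majority class. The variable names are illustrative only.

```python
import numpy as np

# 1NN hold-out confusion matrix from the output above
# (rows = true class, columns = predicted class)
cm_1nn = np.array([
    [2846, 17,  6,  7, 7, 13],
    [  17, 15,  0,  0, 0,  0],
    [  11,  0, 12,  0, 0,  0],
    [  14,  0,  0, 10, 0,  0],
    [   4,  0,  0,  0, 1,  0],
    [  11,  0,  0,  0, 0,  2],
])

# Accuracy: correctly classified samples over all samples
accuracy = np.trace(cm_1nn) / cm_1nn.sum()

# Recall per class: diagonal entry over the row total
per_class_recall = np.diag(cm_1nn) / cm_1nn.sum(axis=1)

# Balanced accuracy: unweighted mean of the per-class recalls
balanced_accuracy = per_class_recall.mean()

print(f"accuracy          = {accuracy:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
print("per-class recall  =", np.round(per_class_recall, 3))
```

The majority class reaches a recall of about 0.98, while every minority class stays near or below 0.5, which is what pulls the balanced accuracy down.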

## 10-Fold Cross-validation

In [14]:
from sklearn.metrics import make_scorer

scoring = {
    'accuracy': 'accuracy',
    'balanced_accuracy': 'balanced_accuracy',
    'precision': make_scorer(precision_score, average='weighted', zero_division=0),
    'recall': make_scorer(recall_score, average='weighted', zero_division=0),
    'f1': make_scorer(f1_score, average='weighted', zero_division=0),
    'roc_auc': 'roc_auc_ovr_weighted'
}

results_cv = []
for name, model in models.items():
    try:
        cv_results = cross_validate(model, X_m, y_m, cv=10, scoring=scoring, 
                                    error_score='raise')
        
        metrics = {
            'Model': name,
            'Accuracy': cv_results['test_accuracy'].mean(),
            'Accuracy_std': cv_results['test_accuracy'].std(),
            'Balanced Accuracy': cv_results['test_balanced_accuracy'].mean(),
            'Balanced Accuracy_std': cv_results['test_balanced_accuracy'].std(),
            'Precision': cv_results['test_precision'].mean(),
            'Precision_std': cv_results['test_precision'].std(),
            'Recall': cv_results['test_recall'].mean(),
            'Recall_std': cv_results['test_recall'].std(),
            'F1-Score': cv_results['test_f1'].mean(),
            'F1-Score_std': cv_results['test_f1'].std(),
            'ROC-AUC': cv_results['test_roc_auc'].mean(),
            'ROC-AUC_std': cv_results['test_roc_auc'].std()
        }
        results_cv.append(metrics)
        
        print(f"\n{name}:")
        print(f"  Accuracy:          {metrics['Accuracy']:.4f} (+/- {metrics['Accuracy_std']:.4f})")
        print(f"  Balanced Accuracy: {metrics['Balanced Accuracy']:.4f} (+/- {metrics['Balanced Accuracy_std']:.4f})")
        print(f"  Precision:         {metrics['Precision']:.4f} (+/- {metrics['Precision_std']:.4f})")
        print(f"  Recall:            {metrics['Recall']:.4f} (+/- {metrics['Recall_std']:.4f})")
        print(f"  F1-Score:          {metrics['F1-Score']:.4f} (+/- {metrics['F1-Score_std']:.4f})")
        print(f"  ROC-AUC:           {metrics['ROC-AUC']:.4f} (+/- {metrics['ROC-AUC_std']:.4f})")
    except Exception as e:
        print(f"\n{name}: Error - {str(e)}")

df_cv = pd.DataFrame(results_cv)
print(df_cv[['Model', 'Accuracy', 'Balanced Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']].to_string(index=False))


1NN:
  Accuracy:          0.9530 (+/- 0.0147)
  Balanced Accuracy: 0.3717 (+/- 0.0502)
  Precision:         0.9608 (+/- 0.0049)
  Recall:            0.9530 (+/- 0.0147)
  F1-Score:          0.9547 (+/- 0.0089)
  ROC-AUC:           0.6583 (+/- 0.0325)

3NN:
  Accuracy:          0.9657 (+/- 0.0097)
  Balanced Accuracy: 0.2886 (+/- 0.0410)
  Precision:         0.9575 (+/- 0.0085)
  Recall:            0.9657 (+/- 0.0097)
  F1-Score:          0.9580 (+/- 0.0071)
  ROC-AUC:           0.7495 (+/- 0.0503)

5NN:
  Accuracy:          0.9683 (+/- 0.0050)
  Balanced Accuracy: 0.2595 (+/- 0.0255)
  Precision:         0.9521 (+/- 0.0053)
  Recall:            0.9683 (+/- 0.0050)
  F1-Score:          0.9579 (+/- 0.0035)
  ROC-AUC:           0.7996 (+/- 0.0424)

7NN:
  Accuracy:          0.9693 (+/- 0.0037)
  Balanced Accuracy: 0.2500 (+/- 0.0256)
  Precision:         0.9524 (+/- 0.0057)
  Recall:            0.9693 (+/- 0.0037)
  F1-Score:          0.9581 (+/- 0.0030)
  ROC-AUC:           0.8136 (+/- 

### Observations
**The trend across K values for NN is confirmed**
**Naive Bayes is far superior to every other model in ROC-AUC**

**High-K models are consistent across folds (accuracy ±0.0037 for 7NN)**
**Naive Bayes shows very high variance across folds (±0.0234)**

**All metrics were slightly higher with hold-out**
**The folds make the instability of low-K NN even more evident**

**Best model: 7NN**
- 96.93% accuracy

## Leave-One-Out (LOO)

In [27]:
loo = LeaveOneOut()
results_loo = []

for name, model in models.items():
    print(f"\nEvaluando {name}...")
    try:
        scores = cross_val_score(model, X_m, y_m, cv=loo, scoring='accuracy')
        
        metrics = {
            'Model': name,
            'Accuracy': scores.mean(),
            'Accuracy_std': scores.std()
        }
        results_loo.append(metrics)
        
        print(f"  Accuracy: {metrics['Accuracy']:.4f} (+/- {metrics['Accuracy_std']:.4f})")
    except Exception as e:
        print(f"  Error: {str(e)}")

df_loo = pd.DataFrame(results_loo)
print(f"\n{'=' * 80}")
print("RESUMEN LEAVE-ONE-OUT")
print(df_loo.to_string(index=False))

summary = pd.DataFrame({
    'Model': [m['Model'] for m in results_holdout],
    'Hold-Out': [m['Accuracy'] for m in results_holdout],
    '10-Fold CV': [m['Accuracy'] for m in results_cv],
    'LOO': [m['Accuracy'] for m in results_loo]
})
print(summary.to_string(index=False))

LEAVE-ONE-OUT

Evaluando 1NN...
  Accuracy: 0.9662 (+/- 0.1807)

Evaluando 3NN...
  Accuracy: 0.9724 (+/- 0.1637)

Evaluando 5NN...
  Accuracy: 0.9716 (+/- 0.1660)

Evaluando 7NN...
  Accuracy: 0.9716 (+/- 0.1660)

Evaluando 9NN...
  Accuracy: 0.9704 (+/- 0.1694)

Evaluando Naive Bayes...
  Accuracy: 0.9583 (+/- 0.1999)

RESUMEN LEAVE-ONE-OUT
      Model  Accuracy  Accuracy_std
        1NN  0.966219      0.180665
        3NN  0.972434      0.163726
        5NN  0.971632      0.166022
        7NN  0.971632      0.166022
        9NN  0.970429      0.169400
Naive Bayes  0.958300      0.199903
      Model  Hold-Out  10-Fold CV      LOO
        1NN  0.964250    0.952990 0.966219
        3NN  0.970932    0.965720 0.972434
        5NN  0.972603    0.968325 0.971632
        7NN  0.971266    0.969327 0.971632
        9NN  0.970264    0.969127 0.970429
Naive Bayes  0.961577    0.953594 0.958300


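### Observations
**Very high standard deviations.** Each leave-one-out fold scores a single sample, so every score is either 0 or 1 and the standard deviation is determined by the accuracy alone; it does not measure instability between folds.
**Best model: 3NN**
- Lowest standard deviation (0.1637)
- Highest accuracy (97.24%)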
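
The size of these standard deviations follows directly from how leave-one-out scores are built. Every fold yields a score of 0 or 1, so for a mean accuracy p the standard deviation of the scores is sqrt(p * (1 - p)). The sketch below checks this against the values in the summary table above, assuming the printed accuracies are exact to six decimals.

```python
import numpy as np

# LOO accuracies copied from the summary table above
loo_accuracy = {
    '1NN': 0.966219,
    '3NN': 0.972434,
    '5NN': 0.971632,
    '7NN': 0.971632,
    '9NN': 0.970429,
    'Naive Bayes': 0.958300,
}

# Scores are 0/1, so their population standard deviation
# (NumPy's default, ddof=0) is sqrt(p * (1 - p))
loo_std = {name: float(np.sqrt(p * (1 - p))) for name, p in loo_accuracy.items()}

for name, std in loo_std.items():
    print(f"{name:12s} std = {std:.6f}")
```

The computed values match the `Accuracy_std` column, so that column carries no information beyond the accuracy itself.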

## Particle Dataset

In [16]:
X_particle_filepath = './datasets/particle/X_particleDataset.joblib'
y_particle_filepath = './datasets/particle/y_particleDataset.joblib'

In [22]:
X_p = load(X_particle_filepath)
y_p = load(y_particle_filepath).astype(int)

In [25]:
print(X_p.info())
print(y_p.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 30 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   DER_mass_MMC                 250000 non-null  float64
 1   DER_mass_transverse_met_lep  250000 non-null  float64
 2   DER_mass_vis                 250000 non-null  float64
 3   DER_pt_h                     250000 non-null  float64
 4   DER_deltaeta_jet_jet         250000 non-null  float64
 5   DER_mass_jet_jet             250000 non-null  float64
 6   DER_prodeta_jet_jet          250000 non-null  float64
 7   DER_deltar_tau_lep           250000 non-null  float64
 8   DER_pt_tot                   250000 non-null  float64
 9   DER_sum_pt                   250000 non-null  float64
 10  DER_pt_ratio_lep_tau         250000 non-null  float64
 11  DER_met_phi_centrality       250000 non-null  float64
 12  DER_lep_eta_centrality       250000 non-null  float64
 13 

## HOLD-OUT 70/30

In [28]:
X_p_train, X_p_test, y_p_train, y_p_test = train_test_split(X_p, y_p, test_size=0.3, 
                                                    random_state=42, stratify=y_p)

results_holdout = []
for name, model in models.items():
    model.fit(X_p_train, y_p_train)
    y_p_pred = model.predict(X_p_test)
    y_p_proba = model.predict_proba(X_p_test) if hasattr(model, 'predict_proba') else None
    
    metrics = evaluate_model(y_p_test, y_p_pred, y_p_proba)
    metrics['Model'] = name
    results_holdout.append(metrics)
    
    print(f"\n{name}:")
    print(f"  Accuracy:          {metrics['Accuracy']:.4f}")
    print(f"  Balanced Accuracy: {metrics['Balanced Accuracy']:.4f}")
    print(f"  Precision:         {metrics['Precision']:.4f}")
    print(f"  Recall:            {metrics['Recall']:.4f}")
    print(f"  F1-Score:          {metrics['F1-Score']:.4f}")
    print(f"  MCC:               {metrics['MCC']:.4f}")
    print(f"  Cohen Kappa:       {metrics['Cohen Kappa']:.4f}")
    if 'ROC-AUC' in metrics and not np.isnan(metrics['ROC-AUC']):
        print(f"  ROC-AUC:           {metrics['ROC-AUC']:.4f}")
    
    cm = confusion_matrix(y_p_test, y_p_pred)
    print(f"  Confusion Matrix:\n{cm}")

df_holdout = pd.DataFrame(results_holdout)
print(df_holdout.to_string(index=False))


1NN:
  Accuracy:          0.7402
  Balanced Accuracy: 0.7163
  Precision:         0.7427
  Recall:            0.7402
  F1-Score:          0.7414
  MCC:               0.4288
  Cohen Kappa:       0.4286
  ROC-AUC:           0.7163
  Confusion Matrix:
[[39066 10234]
 [ 9249 16451]]

3NN:
  Accuracy:          0.7725
  Balanced Accuracy: 0.7467
  Precision:         0.7721
  Recall:            0.7725
  F1-Score:          0.7723
  MCC:               0.4942
  Cohen Kappa:       0.4942
  ROC-AUC:           0.8084
  Confusion Matrix:
[[40855  8445]
 [ 8618 17082]]

5NN:
  Accuracy:          0.7856
  Balanced Accuracy: 0.7592
  Precision:         0.7845
  Recall:            0.7856
  F1-Score:          0.7850
  MCC:               0.5215
  Cohen Kappa:       0.5215
  ROC-AUC:           0.8358
  Confusion Matrix:
[[41566  7734]
 [ 8343 17357]]

7NN:
  Accuracy:          0.7915
  Balanced Accuracy: 0.7641
  Precision:         0.7898
  Recall:            0.7915
  F1-Score:          0.7906
  MCC:     

### Observations

**Best model: 9NN**
- 79.55% accuracy
- 76.77% balanced accuracy
- 0.85 ROC-AUC

**Naive Bayes does not stand out as much here because the imbalance is less severe and there are only two classes**


## 10-Fold Cross-validation

In [29]:
results_cv = []
for name, model in models.items():
    try:
        cv_results = cross_validate(model, X_p, y_p, cv=10, scoring=scoring, 
                                    error_score='raise')
        
        metrics = {
            'Model': name,
            'Accuracy': cv_results['test_accuracy'].mean(),
            'Accuracy_std': cv_results['test_accuracy'].std(),
            'Balanced Accuracy': cv_results['test_balanced_accuracy'].mean(),
            'Balanced Accuracy_std': cv_results['test_balanced_accuracy'].std(),
            'Precision': cv_results['test_precision'].mean(),
            'Precision_std': cv_results['test_precision'].std(),
            'Recall': cv_results['test_recall'].mean(),
            'Recall_std': cv_results['test_recall'].std(),
            'F1-Score': cv_results['test_f1'].mean(),
            'F1-Score_std': cv_results['test_f1'].std(),
            'ROC-AUC': cv_results['test_roc_auc'].mean(),
            'ROC-AUC_std': cv_results['test_roc_auc'].std()
        }
        results_cv.append(metrics)
        
        print(f"\n{name}:")
        print(f"  Accuracy:          {metrics['Accuracy']:.4f} (+/- {metrics['Accuracy_std']:.4f})")
        print(f"  Balanced Accuracy: {metrics['Balanced Accuracy']:.4f} (+/- {metrics['Balanced Accuracy_std']:.4f})")
        print(f"  Precision:         {metrics['Precision']:.4f} (+/- {metrics['Precision_std']:.4f})")
        print(f"  Recall:            {metrics['Recall']:.4f} (+/- {metrics['Recall_std']:.4f})")
        print(f"  F1-Score:          {metrics['F1-Score']:.4f} (+/- {metrics['F1-Score_std']:.4f})")
        print(f"  ROC-AUC:           {metrics['ROC-AUC']:.4f} (+/- {metrics['ROC-AUC_std']:.4f})")
    except Exception as e:
        print(f"\n{name}: Error - {str(e)}")

df_cv = pd.DataFrame(results_cv)
print(df_cv[['Model', 'Accuracy', 'Balanced Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']].to_string(index=False))


1NN:
  Accuracy:          0.7423 (+/- 0.0033)
  Balanced Accuracy: 0.7187 (+/- 0.0036)
  Precision:         0.7449 (+/- 0.0032)
  Recall:            0.7423 (+/- 0.0033)
  F1-Score:          0.7435 (+/- 0.0032)
  ROC-AUC:           0.7187 (+/- 0.0036)

3NN:
  Accuracy:          0.7748 (+/- 0.0020)
  Balanced Accuracy: 0.7489 (+/- 0.0026)
  Precision:         0.7743 (+/- 0.0021)
  Recall:            0.7748 (+/- 0.0020)
  F1-Score:          0.7745 (+/- 0.0020)
  ROC-AUC:           0.8114 (+/- 0.0021)

5NN:
  Accuracy:          0.7876 (+/- 0.0022)
  Balanced Accuracy: 0.7612 (+/- 0.0030)
  Precision:         0.7864 (+/- 0.0024)
  Recall:            0.7876 (+/- 0.0022)
  F1-Score:          0.7869 (+/- 0.0023)
  ROC-AUC:           0.8382 (+/- 0.0023)

7NN:
  Accuracy:          0.7936 (+/- 0.0024)
  Balanced Accuracy: 0.7667 (+/- 0.0032)
  Precision:         0.7920 (+/- 0.0026)
  Recall:            0.7936 (+/- 0.0024)
  F1-Score:          0.7927 (+/- 0.0025)
  ROC-AUC:           0.8508 (+/- 

### Observations
**Best model confirmed: 9NN**
- 79.75% accuracy
- 77.03% balanced accuracy
- 0.8581 ROC-AUC

## Leave-One-Out (LOO)

In [None]:
loo = LeaveOneOut()
results_loo = []

for name, model in models.items():
    print(f"\nEvaluando {name}...")
    try:
        scores = cross_val_score(model, X_p, y_p, cv=loo, scoring='accuracy')
        
        metrics = {
            'Model': name,
            'Accuracy': scores.mean(),
            'Accuracy_std': scores.std()
        }
        results_loo.append(metrics)
        
        print(f"  Accuracy: {metrics['Accuracy']:.4f} (+/- {metrics['Accuracy_std']:.4f})")
    except Exception as e:
        print(f"  Error: {str(e)}")

df_loo = pd.DataFrame(results_loo)
print(f"\n{'=' * 80}")
print("RESUMEN LEAVE-ONE-OUT")
print(df_loo.to_string(index=False))

summary = pd.DataFrame({
    'Model': [m['Model'] for m in results_holdout],
    'Hold-Out': [m['Accuracy'] for m in results_holdout],
    '10-Fold CV': [m['Accuracy'] for m in results_cv],
    'LOO': [m['Accuracy'] for m in results_loo]
})
print(summary.to_string(index=False))


Evaluando 1NN...


### Observations
Because of the size of the dataset (250,000 samples), leave-one-out is not feasible here: the attempted run exceeded 7 hours without finishing.
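
One possible workaround, which was not used in this notebook, applies only to the KNN models. The leave-one-out prediction for a sample is the majority vote of its k nearest neighbors among all the other samples, so a single neighbor query on the full dataset can replace the per-sample refits. The sketch below checks this on a small synthetic binary dataset (an assumption, not the particle data) against the explicit `LeaveOneOut` result. It relies on `kneighbors()` called without arguments, which excludes each point from its own neighbor list. It has not been timed on the 250,000-sample dataset and does not help with Naive Bayes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic binary dataset (assumption; labels are 0/1)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

k = 5  # odd k, so a two-class vote cannot tie
knn = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)

# Without an argument, kneighbors() returns the neighbors of every training
# point and does not count the point as its own neighbor.
_, neighbor_idx = knn.kneighbors()
votes = y_demo[neighbor_idx]                     # shape (n_samples, k)
y_pred = (votes.mean(axis=1) > 0.5).astype(int)  # majority vote over 0/1 labels
fast_loo_accuracy = (y_pred == y_demo).mean()

# Reference: explicit leave-one-out, one fit per sample
loo_accuracy = cross_val_score(
    KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
    cv=LeaveOneOut(), scoring='accuracy').mean()

print(f"single-query LOO accuracy: {fast_loo_accuracy:.4f}")
print(f"explicit LOO accuracy:     {loo_accuracy:.4f}")
```

The majority-vote line assumes two classes encoded as 0 and 1; a multiclass dataset would need a per-row mode instead.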

# Conclusions
Basic classifiers such as Naive Bayes and KNN have their advantages as well as their limitations. Depending on the dimensionality, the volume of data, and how geometrically close the features of each class are, these machine learning classification methods can be effective. However, they are very prone to being affected by issues such as class imbalance, high variance, and outliers. As observed with the first dataset, accuracy on its own can be misleading, so it is worth taking alternative metrics into account.
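
The point about accuracy can be illustrated with a trivial baseline. The sketch below uses synthetic labels with a 95/5 class split (an assumption, not data from this notebook) and a classifier that always predicts the majority class. It scores high on accuracy while its balanced accuracy sits at chance level.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic imbalanced labels (assumption): 95% class 0, 5% class 1
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_balanced = balanced_accuracy_score(y_demo, y_pred)

print(f"accuracy          = {baseline_accuracy:.2f}")
print(f"balanced accuracy = {baseline_balanced:.2f}")
```

Comparing each model against such a baseline on balanced accuracy, MCC, or Cohen's kappa gives a clearer picture than accuracy alone when the classes are imbalanced.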