### Centro Universitário Senac  
**Instructor:** Rafael Cóbe  
**Course:** Introduction to Machine Learning  

### Exercise 5 - **Cross-Validation: Evaluating Estimator Performance**

**Renato Calabro (calabro@live.com)**  
**Ágata Oliveira (agata.aso@hotmail.com)**  
**Lucas Parisi (parisi.lucas@gmail.com)**  
**Douglas Carvalho Rocha (douglas.particular@gmail.com)**  
**Angel Guillermo Morales Romero (aguilhermemr@gmail.com)**

In [54]:
import warnings
import pickle
from pathlib import Path

import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


warnings.filterwarnings('ignore')

# Exercise

The banknote dataset involves predicting whether a given banknote is authentic, based on a set of measurements taken from a photograph.

This is a binary (2-class) classification problem. The number of observations per class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. For more information, see [this link](http://archive.ics.uci.edu/ml/datasets/banknote+authentication).


## Getting the data:

In [1]:
!mkdir -p ../datasets/crossvalidation
!wget -c http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt -O ../datasets/crossvalidation/data_banknote_authentication.txt

--2025-06-26 16:20:16--  http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘../datasets/crossvalidation/data_banknote_authentication.txt’

../datasets/crossva     [  <=>               ]  45.31K   119KB/s    in 0.4s    

2025-06-26 16:19:29 (119 KB/s) - ‘../datasets/crossvalidation/data_banknote_authentication.txt’ saved [46400]



In [8]:
banknote_df = pd.read_csv("../datasets/crossvalidation/data_banknote_authentication.txt", header = None)

In [13]:
display(banknote_df.head(3))
banknote_df.info()  # info() prints its summary and returns None, so it is not wrapped in display()
display(banknote_df.describe())

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       1372 non-null   float64
 1   1       1372 non-null   float64
 2   2       1372 non-null   float64
 3   3       1372 non-null   float64
 4   4       1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


|       | 0 | 1 | 2 | 3 | 4 |
|-------|---|---|---|---|---|
| count | 1372.0 | 1372.0 | 1372.0 | 1372.0 | 1372.0 |
| mean | 0.433735 | 1.922353 | 1.397627 | -1.191657 | 0.444606 |
| std | 2.842763 | 5.869047 | 4.31003 | 2.101013 | 0.497103 |
| min | -7.0421 | -13.7731 | -5.2861 | -8.5482 | 0.0 |
| 25% | -1.773 | -1.7082 | -1.574975 | -2.41345 | 0.0 |
| 50% | 0.49618 | 2.31965 | 0.61663 | -0.58665 | 0.0 |
| 75% | 2.821475 | 6.814625 | 3.17925 | 0.39481 | 1.0 |
| max | 6.8248 | 12.9516 | 17.9274 | 2.4495 | 1.0 |
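
The class balance can be read off the summary statistics: the target column (4) has a mean of 0.444606 over 1372 rows, and the mean of a 0/1 column is the share of class 1. The sketch below is a worked calculation rather than part of the original notebook. The variable names are illustrative, and the majority-class baseline is added only as a reference point for the accuracies reported later.

```python
# Worked arithmetic from the describe() output (illustrative names).
n_total = 1372          # "count" row
mean_target = 0.444606  # "mean" of column 4, i.e. the share of class 1

n_class1 = round(n_total * mean_target)  # number of rows labelled 1
n_class0 = n_total - n_class1            # number of rows labelled 0

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_class0, n_class1) / n_total

print(f"class 0: {n_class0}, class 1: {n_class1}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

This gives 762 observations of class 0 and 610 of class 1, so the imbalance is mild and a classifier needs to beat roughly 0.56 accuracy to be better than always predicting class 0.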


#### Exercises

##### Create [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) and logistic regression classifiers

In [17]:
X = banknote_df.iloc[:, :-1]
y = banknote_df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
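
As a sketch of what `stratify=y` does in the split above, the example below uses synthetic labels (762 zeros and 610 ones, an assumption chosen to mirror the class shares implied by the dataset summary) instead of the real file. It compares the share of class 1 in the full set and in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: labels only, with placeholder features.
y_demo = np.array([0] * 762 + [1] * 610)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, each split keeps roughly the same share of class 1.
share_full = y_demo.mean()
share_train = y_tr.mean()
share_test = y_te.mean()
print(f"full: {share_full:.3f}  train: {share_train:.3f}  test: {share_test:.3f}")
```

Without `stratify`, the shares in the two splits would vary with the random seed, which matters more as the classes become more imbalanced.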

In [35]:
def create_pipeline(model):
    """Wrap a classifier in a pipeline that standardizes the features first."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", model),
    ])

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training split, then report metrics on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Report the name of the wrapped classifier rather than "Pipeline".
    if isinstance(model, Pipeline):
        model_name = type(model.named_steps["classifier"]).__name__
    else:
        model_name = type(model).__name__

    print(f"{model_name} - Classification Report:")
    print(classification_report(y_test, y_pred))

    print(f"{model_name} - Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    return model

In [36]:
logreg = create_pipeline(LogisticRegression())
evaluate(logreg, X_train, y_train, X_test, y_test)

LogisticRegression - Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.96      0.98       229
           1       0.95      1.00      0.98       183

    accuracy                           0.98       412
   macro avg       0.98      0.98      0.98       412
weighted avg       0.98      0.98      0.98       412

LogisticRegression - Confusion Matrix:
[[220   9]
 [  0 183]]
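
The figures in the classification report can be recomputed from the confusion matrix above, where rows are true classes and columns are predicted classes (scikit-learn's convention). A minimal sketch using NumPy, with illustrative variable names:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[220, 9],
               [0, 183]])

correct = np.diag(cm)                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # divide by predicted totals (columns)
recall = correct / cm.sum(axis=1)     # divide by true totals (rows)
accuracy = correct.sum() / cm.sum()

for label in range(len(cm)):
    print(f"class {label}: precision={precision[label]:.2f} recall={recall[label]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, precision for class 1 is 183 / (9 + 183) ≈ 0.95 and recall for class 0 is 220 / (220 + 9) ≈ 0.96, which agrees with the report.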


In [37]:
nbayes = create_pipeline(GaussianNB())
evaluate(nbayes, X_train, y_train, X_test, y_test)

GaussianNB - Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.89      0.88       229
           1       0.85      0.84      0.85       183

    accuracy                           0.86       412
   macro avg       0.86      0.86      0.86       412
weighted avg       0.86      0.86      0.86       412

GaussianNB - Confusion Matrix:
[[203  26]
 [ 30 153]]


#### Run cross-validation on both models to select the best trained models
- Build the confusion matrix to show the differences between parameters

In [47]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "GaussianNB": GaussianNB()
}

best_pipelines = {}

for name, model in models.items():
    pipeline = create_pipeline(model)
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name} - CV Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
    
    # Refit on the full training set after CV
    pipeline.fit(X_train, y_train)
    best_pipelines[name] = pipeline

LogisticRegression - CV Accuracy: 0.9833 ± 0.0083
GaussianNB - CV Accuracy: 0.8396 ± 0.0101
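
What `cross_val_score` does with the pipeline can be sketched as an explicit loop: for each fold, a fresh copy of the pipeline (scaler included) is fitted on the training part only and scored on the held-out part, so the scaler never sees validation data. The example uses synthetic data from `make_classification` as a stand-in for the banknote features. The names `X_demo`, `y_demo`, `manual_scores`, and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 4 features and 2 classes, like the exercise).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, valid_idx in cv.split(X_demo, y_demo):
    fold_model = clone(pipeline)  # unfitted copy for this fold
    # The scaler and classifier are fitted on this fold's training part only.
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Pipeline.score delegates to the classifier, which reports accuracy.
    manual_scores.append(fold_model.score(X_demo[valid_idx], y_demo[valid_idx]))

library_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
print(np.round(manual_scores, 4))
print(np.round(library_scores, 4))
```

Because the splitter has a fixed `random_state`, both approaches see the same folds and should print the same five accuracies.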


In [48]:
# evaluate() refits each pipeline on the training split before scoring it on the test split.
for name, pipeline in best_pipelines.items():
    evaluate(pipeline, X_train, y_train, X_test, y_test)

LogisticRegression - Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.96      0.98       229
           1       0.95      1.00      0.98       183

    accuracy                           0.98       412
   macro avg       0.98      0.98      0.98       412
weighted avg       0.98      0.98      0.98       412

LogisticRegression - Confusion Matrix:
[[220   9]
 [  0 183]]
GaussianNB - Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.89      0.88       229
           1       0.85      0.84      0.85       183

    accuracy                           0.86       412
   macro avg       0.86      0.86      0.86       412
weighted avg       0.86      0.86      0.86       412

GaussianNB - Confusion Matrix:
[[203  26]
 [ 30 153]]
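
The exercise also asks for confusion matrices that show the differences between parameters, whereas the cells above use a single configuration per model. One possible way to do this is sketched below under several assumptions: synthetic data stands in for the banknote features, the values of `C` are arbitrary illustrations, and `cross_val_predict` collects out-of-fold predictions so that every sample contributes to the matrix exactly once.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the banknote file).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

matrices = {}
for C in (0.01, 1.0, 100.0):  # hypothetical regularization strengths to compare
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(C=C, max_iter=1000)),
    ])
    # Out-of-fold predictions: each sample is predicted by a model that did not train on it.
    y_pred = cross_val_predict(pipeline, X_demo, y_demo, cv=cv)
    matrices[C] = confusion_matrix(y_demo, y_pred)
    print(f"C={C}")
    print(matrices[C])
```

Comparing the off-diagonal cells across the printed matrices shows whether a parameter change shifts errors toward false positives or false negatives, which a single accuracy number hides.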


#### Save the best model using Python's Pickle library (see [this link](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/) for reference)

In [61]:
def generate_model_filename(pipeline, prefix="model_", suffix="v1"):
    """Build a file name from the classifier's class name and selected hyperparameters."""
    model = pipeline.named_steps["classifier"]
    model_name = model.__class__.__name__.lower()

    # Parameters worth encoding in the file name (hardcoded per model)
    important_params = {
        "logisticregression": ["C", "penalty", "solver"],
        "gaussiannb": []  # usually no parameters worth highlighting
    }

    selected_keys = important_params.get(model_name, [])
    params = model.get_params()

    param_parts = []
    for key in selected_keys:
        value = params.get(key)
        if value is not None:
            part = f"{key[:4]}{str(value).lower()}"  # e.g. C1.0, solvliblinear
            param_parts.append(part)

    param_str = "-".join(param_parts)
    # The default prefix already ends in "_", so the name contains a double underscore.
    filename = f"{prefix}_{model_name}-{param_str}-{suffix}.pkl"
    return filename.replace(" ", "")



In [63]:
models_path = Path("../models")
models_path.mkdir(parents=True, exist_ok=True)  # make sure the target directory exists

logreg_pipeline = best_pipelines["LogisticRegression"]
filename = models_path / generate_model_filename(logreg_pipeline)
print("Saving to:", filename)

with open(filename, "wb") as f:
    pickle.dump(logreg_pipeline, f)

Saving to: ../models/model__logisticregression-C1.0-penal2-solvlbfgs-v1.pkl
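
To check that a saved model is usable, it can be loaded back and compared with the original. The sketch below is self-contained and makes several assumptions: it trains a small pipeline on synthetic data and writes to a temporary directory under a hypothetical file name, instead of reusing `logreg_pipeline` and `../models`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small fitted pipeline (assumptions for the demo).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    with open(model_file, "wb") as f:
        pickle.dump(pipeline, f)   # serialize the whole fitted pipeline
    with open(model_file, "rb") as f:
        restored = pickle.load(f)  # rebuild it from the file

# The restored pipeline should predict exactly like the original one.
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickling the whole pipeline keeps the fitted scaler together with the classifier, so new data goes through the same preprocessing at prediction time. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.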


### 🆚 Comparison: `pickle` vs `joblib`

| Criterion                   | `pickle`                                                 | `joblib`                                                              |
|----------------------------|----------------------------------------------------------|-----------------------------------------------------------------------|
| **Functionality**          | General-purpose serialization of Python objects          | Optimized for objects that carry large arrays (NumPy, Pandas)         |
| **Performance**            | Can be slower and produce larger files with large arrays | Typically faster and more compact with large data structures          |
| **Large models**           | Less efficient with bulky scientific data                | Well suited to large models and dense data                            |
| **Availability**           | Part of the Python standard library                      | Third-party package; installed as a dependency of `scikit-learn`      |
| **Community**              | Very broad, general use                                  | Popular in ML; covered in the `scikit-learn` persistence docs         |
| **Common use in ML**       | Present, but not ideal for large models                  | De facto standard in `scikit-learn` projects                          |

### ✅ Recommendation

- Use **`pickle`** for small objects or simple projects with no extra dependencies.
- Prefer **`joblib`** for large models built on `NumPy`, `Pandas`, or `scikit-learn`.
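
For comparison, the same kind of save-and-reload round trip with `joblib` might look like the sketch below. It assumes `joblib` is importable (it is installed alongside scikit-learn), uses synthetic data, and writes to a temporary directory under a hypothetical file name. The `compress=3` level is an arbitrary illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a small fitted pipeline (assumptions for the demo).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", GaussianNB()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = Path(tmp_dir) / "demo_model.joblib"  # hypothetical file name
    joblib.dump(model, model_file, compress=3)        # optional compression level
    restored = joblib.load(model_file)

# The restored pipeline should predict exactly like the original one.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

For a model as small as the ones in this exercise, either library is adequate. The difference mainly shows up with estimators that store large arrays.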
