
# Chapter 4: Supervised Learning (Classification)

**Course:** CECIERJ – AI and ML for Practical Solutions  
**Objective:** apply supervised classification algorithms with Scikit-learn and evaluate their performance with appropriate metrics.

---
## Steps covered
1. Load the dataset (Breast Cancer).  
2. Split the data into training and test sets (*train/test split*).  
3. Train models: **Logistic Regression, KNN, Decision Tree, SVM**.  
4. Evaluate with metrics (accuracy, precision, recall, F1, ROC AUC).  
5. Compare results in a DataFrame.  
6. Tune hyperparameters with `GridSearchCV`.  
7. Visualizations: confusion matrix, ROC curve.  
8. Conclusions on the trade-offs between models.


In [None]:

import pandas as pd
from sklearn.datasets import load_breast_cancer

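# Load features and target together as a single DataFrame.
ds = load_breast_cancer(as_frame=True)
df = ds.frame.copy()
X = df.drop(columns=["target"])
# Target encoding in this dataset: 0 = malignant, 1 = benign.
y = df["target"]
print("Shape:", X.shape, "Target balance:", y.value_counts().to_dict())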


In [None]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
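
To check what `stratify=y` does, the following sketch compares the fraction of class 1 in the full dataset and in each split. It reloads the data so it can run on its own; the names `full_ratio`, `train_ratio`, and `test_ratio` are introduced here for illustration and are not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the data so this example runs on its own.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# y holds 0/1 labels, so its mean is the fraction of samples in class 1.
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print(f"class-1 fraction  full: {full_ratio:.3f}  "
      f"train: {train_ratio:.3f}  test: {test_ratio:.3f}")
print("train size:", len(X_train), "test size:", len(X_test))
```

With stratification the three fractions should be nearly identical; without it, a small test set can drift away from the overall class balance.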


In [None]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Distance- and margin-based models get a StandardScaler in front;
# the decision tree is used without scaling.
models = {
    "LogReg": Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]),
    "KNN": Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())]),
    "Tree": DecisionTreeClassifier(random_state=42),
    "SVM": Pipeline([("scaler", StandardScaler()), ("clf", SVC(probability=True))]),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Probability of class 1, needed for ROC AUC; None if the model cannot provide it.
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None
    # precision, recall and F1 treat class 1 as the positive class by default.
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba) if y_proba is not None else None
    results[name] = dict(accuracy=acc, precision=prec, recall=rec, f1=f1, roc_auc=auc)

# One row per model, one column per metric.
pd.DataFrame(results).T
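
A single train/test split yields one number per metric, and that number can shift with `random_state`. As a complement, the sketch below estimates ROC AUC with 5-fold stratified cross-validation on the training data for two of the models. It is a self-contained illustration rather than the notebook's evaluation procedure; `candidates` and `cv_summary` are names introduced here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical subset of the models, kept small so the example runs quickly.
candidates = {
    "LogReg": Pipeline([("scaler", StandardScaler()),
                        ("clf", LogisticRegression(max_iter=1000))]),
    "Tree": DecisionTreeClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rows = {}
for name, model in candidates.items():
    # One ROC AUC value per fold; the test set is never touched here.
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    rows[name] = {"roc_auc_mean": scores.mean(), "roc_auc_std": scores.std()}

cv_summary = pd.DataFrame(rows).T
print(cv_summary)
```

The standard deviation across folds gives a rough sense of how much a difference between two models on a single split should be trusted.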


In [None]:

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# gamma is ignored by the linear kernel, so it is only varied for RBF.
param_grid = [
    {"clf__kernel": ["linear"], "clf__C": [0.1, 1, 10]},
    {"clf__kernel": ["rbf"], "clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]},
]
# probability=True makes predict_proba available on the fitted model.
svm_pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC(probability=True))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(svm_pipe, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
gs.fit(X_train, y_train)
print("Best SVM params:", gs.best_params_, "CV ROC AUC:", gs.best_score_)
best_svm = gs.best_estimator_
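
`best_params_` reports only the winning combination. To see how close the other candidates came, `cv_results_` can be loaded into a DataFrame. The sketch below is self-contained. Two choices in it are assumptions of this example: `gamma` is varied only for the RBF kernel, and the SVC is built without `probability=True`, since ROC AUC scoring can rank samples with `decision_function`. The names `search` and `cv_table` are introduced here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3 linear candidates + 3 x 2 RBF candidates = 9 combinations.
grid = [
    {"clf__kernel": ["linear"], "clf__C": [0.1, 1, 10]},
    {"clf__kernel": ["rbf"], "clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]},
]
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

# Keep the columns that matter for comparing candidates, best first.
cv_table = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table.head())
```

If several candidates sit within one standard deviation of the best score, the simpler one (for example, the linear kernel) is often a reasonable pick.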


In [None]:

from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
import matplotlib.pyplot as plt

y_pred = best_svm.predict(X_test)
y_proba = best_svm.predict_proba(X_test)[:,1]
print(classification_report(y_test, y_pred))

ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
plt.title("Confusion Matrix (tuned SVM)")
plt.show()

RocCurveDisplay.from_estimator(best_svm, X_test, y_test)
plt.title("ROC Curve (tuned SVM)")
plt.show()
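
The confusion matrix and the classification report are easier to read once the class encoding is explicit. In this dataset `target_names` is `['malignant', 'benign']`, so class 1, the default positive class for `precision_score` and `recall_score`, is benign. The sketch below recomputes precision and recall by hand from the matrix and also reports recall for the malignant class via `pos_label=0`. To stay self-contained it fits a default scaled SVC rather than the tuned model; that substitution is an assumption of this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

ds = load_breast_cancer(as_frame=True)
X, y = ds.data, ds.target
print("classes (index 0, index 1):", list(ds.target_names))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stand-in for the tuned model: a scaled SVC with default hyperparameters.
model = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For labels {0, 1}, ravel() yields tn, fp, fn, tp with class 1 as positive.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision_benign = tp / (tp + fp)
recall_benign = tp / (tp + fn)
# Recall with class 0 (malignant) as the positive class.
recall_malignant = tn / (tn + fp)

print(f"precision (benign):  {precision_benign:.3f}")
print(f"recall (benign):     {recall_benign:.3f}")
print(f"recall (malignant):  {recall_malignant:.3f}")
print("sklearn, pos_label=0:", round(recall_score(y_test, y_pred, pos_label=0), 3))
```

In a screening setting, recall on the malignant class is usually the number to watch, because a missed malignant case costs more than a false alarm.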



---
## Conclusions
- Linear models (LogReg) are simple and efficient, which makes them a good baseline.  
- KNN is sensitive to feature scale, so always standardize the features first.  
- Trees capture nonlinearities but can overfit.  
- With tuning, SVM can reach a higher ROC AUC than with its default hyperparameters.  
- The choice depends on the trade-off between interpretability, computational cost, and the target metric.
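
Two of the points above can be checked empirically. The sketch below fits KNN with and without `StandardScaler`, and a decision tree with and without a depth limit, then prints train and test accuracy for each. The depth limit of 3 is an arbitrary value chosen for illustration, `variants` and `scores` are names introduced here, and the size of any gap depends on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

variants = {
    "KNN (raw features)": KNeighborsClassifier(),
    "KNN (standardized)": Pipeline([("scaler", StandardScaler()),
                                    ("clf", KNeighborsClassifier())]),
    "Tree (no depth limit)": DecisionTreeClassifier(random_state=42),
    "Tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=42),
}

scores = {}
for name, model in variants.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy for classifiers.
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    scores[name] = (train_acc, test_acc)
    print(f"{name:<22} train: {train_acc:.3f}  test: {test_acc:.3f}")
```

A training accuracy well above the test accuracy is the usual sign of overfitting; limiting tree depth trades some training fit for a simpler, more interpretable model.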
