
# Chapter 5: Supervised Learning (Regression)

**Course:** CECIERJ, IA e ML para Soluções Práticas (AI and ML for Practical Solutions)  
**Objective:** predict continuous variables (e.g., price, time) and evaluate regression models with appropriate metrics.

---
## Steps covered
1. Load the dataset (Diabetes).  
2. *Train/test split*.  
3. Baseline (mean of the training target).  
4. Models: **Linear, Ridge, Lasso, RandomForest**.  
5. Metrics: **MAE, MSE, RMSE, R²**.  
6. `GridSearchCV` for RandomForest.  
7. Feature importance (`permutation_importance`).  
8. Save the best model (`joblib`).

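The four metrics in step 5 can also be computed directly from their definitions, which helps when reading the numbers later in the chapter. The sketch below is a minimal NumPy version; the helper name `regression_metrics` and the toy values are assumptions made for illustration and are not part of the course material.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))          # mean absolute error
    mse = np.mean(err ** 2)             # mean squared error
    rmse = np.sqrt(mse)                 # same unit as the target
    ss_res = np.sum(err ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Toy values invented for this example only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
m = regression_metrics(y_true, y_pred)
print(m)
```

MAE and RMSE are expressed in the unit of the target, while R² is unitless and compares the model against always predicting the mean of `y_true`.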

In [None]:

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

ds = load_diabetes()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = pd.Series(ds.target, name="target")
print("Shape:", X.shape)
X.head()


In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [None]:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def report(y_true, y_pred, label):
    """Print MAE, MSE, RMSE and R2 for one model and return them as a dict."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # RMSE is in the same unit as the target
    r2 = r2_score(y_true, y_pred)
    print(f"{label:12s} | MAE={mae:.3f} | MSE={mse:.1f} | RMSE={rmse:.3f} | R2={r2:.3f}")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Baseline: always predict the mean of the training target
baseline_pred = np.repeat(y_train.mean(), len(y_test))
baseline = report(y_test, baseline_pred, "Baseline")

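The mean baseline above can also be expressed with scikit-learn's `DummyRegressor`, which makes it usable anywhere an estimator is expected. The sketch below rebuilds the same split so that it runs on its own and checks that both approaches give identical predictions; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# strategy="mean" ignores the features and predicts the training mean
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

# Manual version of the same baseline
manual_pred = np.repeat(y_train.mean(), len(y_test))
print("Same predictions:", np.allclose(dummy_pred, manual_pred))

# R2 of a constant prediction on the test set is never above 0
baseline_r2 = dummy.score(X_test, y_test)
print("Baseline R2:", baseline_r2)
```

Because the test mean is the best possible constant for the test set, any other constant (such as the training mean) yields an R² of zero or slightly below. A model is only useful if it clearly beats this value.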

In [None]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# Linear models are paired with a StandardScaler; the tree-based
# RandomForest does not depend on feature scale, so it gets none.
candidates = {
    "Linear": Pipeline([("scaler", StandardScaler()), ("mdl", LinearRegression())]),
    "Ridge":  Pipeline([("scaler", StandardScaler()), ("mdl", Ridge())]),
    "Lasso":  Pipeline([("scaler", StandardScaler()), ("mdl", Lasso(max_iter=5000))]),
    "RF":     Pipeline([("mdl", RandomForestRegressor(random_state=42))]),
}

metrics = {}
for name, pipe in candidates.items():
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    metrics[name] = report(y_test, pred, name)

pd.DataFrame(metrics).T


In [None]:

from sklearn.model_selection import GridSearchCV

grid = {
    "mdl__n_estimators": [100, 300],
    "mdl__max_depth": [None, 10, 20],
    "mdl__min_samples_split": [2, 5]
}
gs = GridSearchCV(
    candidates["RF"], grid, scoring="neg_root_mean_squared_error", cv=5, n_jobs=-1
)
gs.fit(X_train, y_train)
print("Best RF hyperparameters:", gs.best_params_)
best_rf = gs.best_estimator_
pred = best_rf.predict(X_test)
best_metrics = report(y_test, pred, "RF tuned")

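One detail of the search above is easy to miss: with `scoring="neg_root_mean_squared_error"`, scikit-learn reports the RMSE with its sign flipped, so that a higher score is always better. The sketch below uses a deliberately small grid (an assumption chosen to keep the run short, not the grid from the cell above) to show how to read the cross-validated RMSE back out of `best_score_` and `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Small illustrative grid: 2 x 2 = 4 combinations
small_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# best_score_ is negative; flip the sign to get the RMSE
cv_rmse = -search.best_score_
print(f"Best CV RMSE: {cv_rmse:.2f}")

# One row per combination, ranked from best to worst
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols]
results["mean_rmse"] = -results["mean_test_score"]
print(results.sort_values("rank_test_score"))
```

Comparing `mean_rmse` with `std_test_score` across rows gives a rough sense of whether the differences between hyperparameter combinations are larger than the fold-to-fold variation.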

In [None]:

import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance

# Importance = drop in the test-set score (R2 by default) when one column is shuffled
r = permutation_importance(best_rf, X_test, y_test, n_repeats=10, random_state=42)
imp = pd.Series(r.importances_mean, index=X.columns).sort_values(ascending=False)

plt.figure()
imp.head(10).plot(kind="bar")
plt.title("Feature importance (permutation): top 10")
plt.xlabel("Feature")
plt.ylabel("Mean importance (drop in R2)")
plt.show()


In [None]:

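import os
import joblib

save_path = "/mnt/data/modelo_cap5_rf_tuned.joblib"
# Create the target directory if it does not exist yet
os.makedirs(os.path.dirname(save_path), exist_ok=True)
joblib.dump(best_rf, save_path)
print("Model saved to:", save_path)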

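A saved model is only useful if it can be loaded again and still gives the same predictions. The sketch below shows a round trip with `joblib.dump` and `joblib.load`; it trains a small stand-in forest and writes to a temporary directory (both are assumptions made so the example runs on its own, not the tuned model or path from the cell above).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

# Small stand-in model, used only to demonstrate the round trip
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X[:5]), restored.predict(X[:5]))
print("Predictions match after reload:", same)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.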


---
## Conclusions
- **Linear/Ridge/Lasso**: simple, fast, and interpretable; standardize the features first, because the Ridge and Lasso penalties depend on feature scale.  
- **RandomForest**: captures nonlinear relationships; tuning can lower the RMSE, so compare the tuned and untuned results on the test set.  
- Always compare against the **baseline**, and look at **MAE, RMSE, and R²** together.  
- Use **permutation importance** to explain which features matter.

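To make the interpretability point about linear models concrete, the coefficients of a fitted Lasso can be read straight from the pipeline. The sketch below is one possible way to do it; `alpha=1.0` is the scikit-learn default, and fitting on the full dataset is a simplification for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

ds = load_diabetes()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mdl", Lasso(alpha=1.0, max_iter=5000)),
])
pipe.fit(X, y)

# One coefficient per standardized feature
coefs = pd.Series(pipe.named_steps["mdl"].coef_, index=X.columns)

# Order by absolute size so the strongest effects come first
order = coefs.abs().sort_values(ascending=False).index
print(coefs.reindex(order))
```

Because the features are standardized, each coefficient is the change in the predicted target for a one-standard-deviation change in that feature, and any coefficient that Lasso shrinks to exactly zero marks a feature the model ignores.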