# 03 — Modeling + Validation (CV + GridSearch + Decision)

**Phase(s):** Modeling + Validation (Skills 04 and 05)

**Expected outputs:**
- `artifacts/03/benchmark.csv`
- `artifacts/03/validation_summary.md`
- `models/best_model.pkl`

## Project Configuration

This notebook reads shared parameters from `artifacts/00_config/params.json` so that every phase uses the same settings.

In [None]:
import json
from pathlib import Path

# Path('.').name is always '', so compare against the resolved working directory instead.
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
PARAMS_PATH = PROJECT_ROOT / 'artifacts' / '00_config' / 'params.json'

with open(PARAMS_PATH, 'r', encoding='utf-8') as f:
    params = json.load(f)

params
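
The cells in this notebook read `target`, `random_state`, `cv_n_splits`, and `paths.gold_dir` from `params`. As a rough sketch of what a compatible config might look like, the block below writes an example file to a temporary directory and checks that those keys are present. Only the key names come from this notebook; every value is an illustrative assumption, and the real `params.json` may hold more fields.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical values: only the key names are taken from the cells in this notebook.
example_params = {
    "target": "target",                  # assumed name of the label column
    "random_state": 42,
    "cv_n_splits": 5,
    "paths": {"gold_dir": "data/gold"},  # assumed location of the ABT
}

# Write to a temp directory so the real artifacts/00_config/params.json is untouched.
example_path = Path(tempfile.mkdtemp()) / "params.json"
example_path.write_text(json.dumps(example_params, indent=2), encoding="utf-8")

# Reload and check for the keys the later cells rely on.
loaded = json.loads(example_path.read_text(encoding="utf-8"))
required = ("target", "random_state", "cv_n_splits", "paths")
missing = [key for key in required if key not in loaded]
print("missing keys:", missing)
```

A check like this, run right after loading, fails early when the config is incomplete instead of surfacing as a `KeyError` in a later cell.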

## 4. Modeling (Skill 04)

### Objective
Compare the baseline and the benchmark models using stratified cross-validation.
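
To illustrate why stratification matters here, the sketch below runs `StratifiedKFold` on a toy imbalanced label vector and measures the positive rate in each test fold. The data (`X_imb`, `y_imb`) is synthetic and purely illustrative; it is not the project's ABT.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 90 negatives and 10 positives (synthetic, for illustration only).
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive rate in each test fold; stratification keeps it close to the overall rate.
fold_pos_rates = [y_imb[test_idx].mean() for _, test_idx in skf.split(X_imb, y_imb)]
print(fold_pos_rates)
```

With 10 positives spread over 5 folds, each test fold receives 2 of them, so metrics such as ROC AUC and F1 stay defined in every fold. A plain `KFold` gives no such guarantee on imbalanced targets.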

In [None]:

import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedKFold, cross_validate, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# TODO: load the ABT (analytical base table)
# abt = pd.read_parquet(PROJECT_ROOT / params["paths"]["gold_dir"] / "abt_train.parquet")
# X = abt.drop(columns=[params["target"]])
# y = abt[params["target"]]


In [None]:

cv = StratifiedKFold(
    n_splits=params.get("cv_n_splits", 5),
    shuffle=True,
    random_state=params.get("random_state", 42)
)

scoring = {"roc_auc": "roc_auc", "accuracy": "accuracy", "f1": "f1"}

models = {
    "lr": LogisticRegression(max_iter=2000),
    "rf": RandomForestClassifier(random_state=params.get("random_state", 42)),
    "gb": GradientBoostingClassifier(random_state=params.get("random_state", 42)),
}


In [None]:

# TODO: run CV and build the benchmark table
# results = []
# for name, model in models.items():
#     cv_res = cross_validate(model, X, y, cv=cv, scoring=scoring, n_jobs=-1)
#     results.append({...})
# benchmark = pd.DataFrame(results)
# benchmark
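
One possible way to fill in the loop above is sketched below on synthetic data. It aggregates the per-fold `test_<metric>` arrays returned by `cross_validate` into mean and standard deviation columns. The `DummyClassifier` baseline, the `*_demo` names, and the column layout are assumptions made for this example, not something this notebook prescribes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the ABT (assumption: binary target, numeric features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring_demo = {"roc_auc": "roc_auc", "accuracy": "accuracy", "f1": "f1"}

models_demo = {
    "baseline": DummyClassifier(strategy="prior"),  # assumed baseline: predicts class priors
    "lr": LogisticRegression(max_iter=2000),
}

rows = []
for name, model in models_demo.items():
    cv_res = cross_validate(model, X_demo, y_demo, cv=cv_demo, scoring=scoring_demo)
    row = {"model": name}
    for metric in scoring_demo:
        fold_scores = cv_res[f"test_{metric}"]  # one score per fold
        row[f"{metric}_mean"] = fold_scores.mean()
        row[f"{metric}_std"] = fold_scores.std()
    rows.append(row)

# Rank models by mean ROC AUC, best first.
benchmark_demo = (
    pd.DataFrame(rows)
    .sort_values("roc_auc_mean", ascending=False)
    .reset_index(drop=True)
)
print(benchmark_demo)
```

Keeping the standard deviation next to the mean makes it easier to judge whether a gap between two models is larger than the fold-to-fold noise. The constant-prediction baseline may trigger an undefined-metric warning for F1, which is expected.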


In [None]:

# TODO: save the benchmark
# out = PROJECT_ROOT / "artifacts/03/benchmark.csv"
# out.parent.mkdir(parents=True, exist_ok=True)
# benchmark.to_csv(out, index=False)
# out


## 5. Validation (Skill 05)

### Objective
Validate the champion model against the acceptance criterion and record the Go/No-Go decision.

In [None]:

# TODO: grid search on the champion (LR shown as an example)
# param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
# grid = GridSearchCV(LogisticRegression(max_iter=2000), param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
# grid.fit(X, y)
# grid.best_params_, grid.best_score_
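
The grid search, the Go/No-Go check, and the model export could be wired together roughly as follows. This is a sketch on synthetic data. The threshold `MIN_ROC_AUC` is a hypothetical acceptance criterion (this notebook does not state one), saving with `joblib` only on a "Go" is an assumed policy, and the model is written to a temporary directory rather than `models/best_model.pkl`.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the ABT (assumption: binary target, numeric features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_demo = GridSearchCV(
    LogisticRegression(max_iter=2000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=cv_demo,
    scoring="roc_auc",
)
grid_demo.fit(X_demo, y_demo)  # refits the best configuration on all rows by default

MIN_ROC_AUC = 0.75  # hypothetical acceptance threshold, not defined in this notebook
decision = "Go" if grid_demo.best_score_ >= MIN_ROC_AUC else "No-Go"
print(grid_demo.best_params_, round(grid_demo.best_score_, 4), decision)

# Assumed policy: persist the refitted champion only when the gate passes.
model_path = Path(tempfile.mkdtemp()) / "best_model.pkl"
if decision == "Go":
    joblib.dump(grid_demo.best_estimator_, model_path)
```

Note that `best_score_` is computed on the same folds used to pick the hyperparameters, so it tends to be slightly optimistic. A held-out test set or nested cross-validation gives a less biased number for the final decision.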


In [None]:

# TODO: save the validation summary
# (PROJECT_ROOT / "artifacts/03/validation_summary.md").write_text("...", encoding="utf-8")
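
As one possible shape for the summary file, the sketch below renders a short markdown report and writes it to a temporary directory. The helper `render_validation_summary`, its fields, and all numbers passed to it are hypothetical placeholders, not results from this project.

```python
import tempfile
from pathlib import Path


def render_validation_summary(model_name, best_params, cv_score, threshold, metric="roc_auc"):
    """Build a markdown summary with the champion, its score, and the Go/No-Go decision."""
    decision = "Go" if cv_score >= threshold else "No-Go"
    lines = [
        "# Validation Summary",
        "",
        f"- Champion: {model_name}",
        f"- Best params: {best_params}",
        f"- CV {metric}: {cv_score:.4f}",
        f"- Acceptance threshold: {threshold:.4f}",
        f"- Decision: {decision}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder inputs for illustration only; in the notebook these would come
# from the grid search (best_params_, best_score_) and the agreed criterion.
summary = render_validation_summary("lr", {"C": 1}, cv_score=0.8123, threshold=0.75)

out_demo = Path(tempfile.mkdtemp()) / "validation_summary.md"
out_demo.write_text(summary, encoding="utf-8")
print(summary)
```

Recording the threshold alongside the score keeps the decision auditable: a reader of the file can see both what was measured and what it was compared against.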


## ✅ Gate 03 - Model validated?
- [ ] Benchmark saved
- [ ] Champion chosen and justified
- [ ] Validation summary produced