# Incremental Book Evaluation: Credit Risk FPD

**Objective**: Measure the marginal impact of each book (Recarga = recharge, Pagamento = payment, Faturamento = billing) on the performance of the FPD model.

**Methodology**:
- Step 1: Base (Cadastro + Telco) + Recarga book (REC_*)
- Step 2: Base + Recarga + Pagamento book (PAG_*)
- Step 3: Base + Recarga + Pagamento + Faturamento book (FAT_*), the full feature set

**Models**: Logistic Regression (L1) + LightGBM (GBDT)

**Validation**: Train / OOS (stratified 25% holdout from the training vintages, 202410 to 202501) / OOT1 (202502) / OOT2 (202503)

**Metrics**: KS, AUC, Gini (rank-based, threshold-independent)
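
To make the three metrics concrete, here is a minimal from-scratch sketch on toy data. The helper names `auc_pairwise` and `ks_two_sample` and the toy values are illustrative assumptions, not part of this notebook; the notebook itself relies on `roc_auc_score` and `ks_2samp`, which compute the same quantities.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_two_sample(y_true, y_score):
    """KS as the largest gap between the score ECDFs of positives and negatives."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    gaps = []
    for t in sorted(set(y_score)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)


# Toy labels and scores (illustrative values, not project data)
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]

auc = auc_pairwise(y_true, y_score)
gini = 2 * auc - 1
ks = ks_two_sample(y_true, y_score)
print(f"AUC={auc:.4f}  Gini={gini:.4f}  KS={ks:.4f}")
# AUC=0.9333  Gini=0.8667  KS=0.8000

# Rank-based: a monotonic rescaling of the scores leaves the metrics unchanged
y_cubed = [s ** 3 for s in y_score]
assert auc_pairwise(y_true, y_cubed) == auc
assert ks_two_sample(y_true, y_cubed) == ks
```

Because only the ordering of the scores matters, no classification threshold has to be chosen before comparing increments.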

In [None]:
# =============================================================================
# IMPORTS AND CONFIGURATION
# =============================================================================
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import warnings
from datetime import date

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from category_encoders import CountEncoder
from scipy.stats import ks_2samp
from pyspark.sql import functions as F
from pyspark.sql import Window

warnings.filterwarnings('ignore')

# Centralized config
import sys; sys.path.insert(0, "/lakehouse/default/Files/projeto-final")
from config.pipeline_config import EXPERIMENT_NAME, SAFRAS

import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s', datefmt='%H:%M:%S')
logger = logging.getLogger('incremental_eval')

print('Imports OK')

In [None]:
# =============================================================================
# UTILITY FUNCTIONS (reused from the baseline)
# =============================================================================

def ks_stat(y_true, y_score):
    """Compute the KS statistic between the scores of positives and negatives."""
    # ks_2samp is already imported at the top of the notebook
    pos_scores = y_score[y_true == 1]
    neg_scores = y_score[y_true == 0]
    if len(pos_scores) == 0 or len(neg_scores) == 0:
        logger.warning('KS undefined: pos=%d, neg=%d', len(pos_scores), len(neg_scores))
        return np.nan
    ks, _ = ks_2samp(pos_scores, neg_scores)
    return ks


def compute_metrics(y_true, y_score):
    """Compute rank-based metrics: KS, AUC, Gini.

    Precision/Recall/F1 are left out because they depend on a threshold,
    which makes them a poor fit for an imbalanced target (FPD ~5-15%).
    """
    auc = roc_auc_score(y_true, y_score)
    ks = ks_stat(y_true, y_score)
    gini = 2 * auc - 1
    return {
        'KS': round(ks, 4) if not np.isnan(ks) else 0.0,
        'AUC': round(auc, 4),
        'Gini': round(gini, 4),
    }


def filter_xy_by_safra(X, y, list_safras):
    """Filter X and y by a list of SAFRAs (vintages)."""
    mask = X['SAFRA'].isin(list_safras)
    return X[mask].copy(), y[mask].copy()


def split_stratified_data(df, percent=0.25, target_col='FPD'):
    """Stratified PySpark split by (SAFRA, FPD); `percent` is the OOS share."""
    w = Window.partitionBy('SAFRA', target_col).orderBy(F.rand(seed=42))
    # Cache the ranked frame so both filters below can reuse one evaluation of the
    # random ordering instead of recomputing it for each output
    df_ranked = df.withColumn('_rank', F.percent_rank().over(w)).cache()
    df_sample = df_ranked.filter(F.col('_rank') <= (1.0 - percent)).drop('_rank')
    df_oos = df_ranked.filter(F.col('_rank') > (1.0 - percent)).drop('_rank')
    return df_sample, df_oos


print('Utility functions loaded')

In [None]:
# =============================================================================
# NEW FUNCTIONS: INCREMENTAL EVALUATION
# =============================================================================

# Columns that are NOT features (metadata, targets, keys)
NON_FEATURE_COLS = {
    'NUM_CPF', 'SAFRA', 'FPD', 'TARGET_SCORE_01', 'TARGET_SCORE_02',
    '_execution_id', '_data_inclusao', '_data_alteracao_silver',
    'DT_PROCESSAMENTO', 'DATADENASCIMENTO', 'FLAG_INSTALACAO',
}

# Book prefixes
BOOK_PREFIXES = ['REC_', 'PAG_', 'FAT_']


def get_feature_groups(columns):
    """Split columns into groups: base (cadastro+telco), REC_, PAG_, FAT_.

    Args:
        columns: List of column names.

    Returns:
        dict with keys 'base', 'REC', 'PAG', 'FAT' mapping to lists of columns.
    """
    groups = {'base': [], 'REC': [], 'PAG': [], 'FAT': []}
    for col in columns:
        if col in NON_FEATURE_COLS:
            continue
        if col.startswith('REC_'):
            groups['REC'].append(col)
        elif col.startswith('PAG_'):
            groups['PAG'].append(col)
        elif col.startswith('FAT_'):
            groups['FAT'].append(col)
        else:
            groups['base'].append(col)
    return groups


def build_increment_features(feature_groups, increment_id):
    """Return the feature list for the given increment.

    Increments:
        1: base + REC_*
        2: base + REC_* + PAG_*
        3: base + REC_* + PAG_* + FAT_* (full)
    """
    features = list(feature_groups['base'])
    if increment_id >= 1:
        features += feature_groups['REC']
    if increment_id >= 2:
        features += feature_groups['PAG']
    if increment_id >= 3:
        features += feature_groups['FAT']
    return features


def build_pipeline(X, model_type='LR'):
    """Build a sklearn pipeline with preprocessing + model.

    Args:
        X: Feature DataFrame (used to detect column types).
        model_type: 'LR' for Logistic Regression, 'LGBM' for LightGBM.

    Returns:
        Configured sklearn Pipeline.
    """
    num_features = [c for c in X.select_dtypes(include=['int32', 'int64', 'float32', 'float64']).columns]
    cat_features = [c for c in X.select_dtypes(include=['object', 'category']).columns]
    
    num_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])
    cat_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', CountEncoder(normalize=True, handle_unknown=0, handle_missing=0)),
    ])
    
    preprocessor = ColumnTransformer([
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features),
    ], remainder='drop')
    
    if model_type == 'LR':
        model = LogisticRegression(
            solver='liblinear', penalty='l1', C=0.1,
            max_iter=2000, tol=1e-3, class_weight='balanced', random_state=42
        )
    else:  # LGBM
        model = LGBMClassifier(
            objective='binary', boosting_type='gbdt',
            learning_rate=0.05, n_estimators=250, max_depth=7,
            colsample_bytree=0.8, subsample=0.8,
            random_state=42, n_jobs=-1, verbosity=-1,
        )
    
    return Pipeline([('prep', preprocessor), ('model', model)])


def train_and_evaluate_increment(
    X_train, y_train, X_oos, y_oos, X_oot, y_oot,
    features, increment_id, increment_name,
    safras_oot_detail=None
):
    """Train LR + LGBM for one increment, evaluate on every split, log to MLflow.

    Args:
        X_train, y_train: Training data.
        X_oos, y_oos: Out-of-sample data.
        X_oot, y_oot: Out-of-time data.
        features: Feature list for this increment.
        increment_id: 1, 2 or 3.
        increment_name: Descriptive name of the increment.
        safras_oot_detail: Dict {safra: (X, y)} for per-vintage OOT evaluation.

    Returns:
        tuple: (results, trained_models). results is a list[dict] with metrics
        per model and split; trained_models maps 'LR'/'LGBM' to the fitted pipeline.
    """
    # Select this increment's features
    keep_cols = [c for c in features if c in X_train.columns]
    # reset_index keeps X and y positionally aligned
    X_tr = X_train[keep_cols].reset_index(drop=True)
    y_tr = y_train.reset_index(drop=True)
    X_os = X_oos[keep_cols].reset_index(drop=True)
    y_os = y_oos.reset_index(drop=True)
    X_ot = X_oot[keep_cols].reset_index(drop=True)
    y_ot = y_oot.reset_index(drop=True)
    
    results = []
    trained_models = {}
    
    for model_type in ['LR', 'LGBM']:
        model_label = 'LR_L1' if model_type == 'LR' else 'LGBM'
        run_name = f'{model_label}_increment{increment_id}'
        
        with mlflow.start_run(run_name=run_name, nested=True):
            mlflow.set_tags({
                'model_type': model_label,
                'increment_id': str(increment_id),
                'increment_name': increment_name,
                'n_features': str(len(keep_cols)),
            })
            mlflow.log_param('n_features', len(keep_cols))
            mlflow.log_param('increment_name', increment_name)
            mlflow.log_param('features', str(keep_cols[:20]) + '...' if len(keep_cols) > 20 else str(keep_cols))
            
            # Train
            pipe = build_pipeline(X_tr, model_type)
            pipe.fit(X_tr, y_tr)
            trained_models[model_type] = pipe
            
            # Evaluate on each split
            for split_name, X_eval, y_eval in [
                ('Train', X_tr, y_tr),
                ('OOS', X_os, y_os),
                ('OOT', X_ot, y_ot),
            ]:
                scores = pipe.predict_proba(X_eval)[:, 1]
                metrics = compute_metrics(y_eval, scores)
                
                for metric_name, metric_val in metrics.items():
                    mlflow.log_metric(f'{split_name}_{metric_name}', metric_val)
                
                results.append({
                    'Increment': increment_id,
                    'Increment_Name': increment_name,
                    'N_Features': len(keep_cols),
                    'Model': model_label,
                    'Split': split_name,
                    **metrics,
                })
            
            # Evaluate each OOT vintage individually
            if safras_oot_detail:
                for safra_name, (X_s, y_s) in safras_oot_detail.items():
                    X_s_inc = X_s[keep_cols].reset_index(drop=True)
                    y_s_inc = y_s.reset_index(drop=True)
                    scores_s = pipe.predict_proba(X_s_inc)[:, 1]
                    metrics_s = compute_metrics(y_s_inc, scores_s)
                    for metric_name, metric_val in metrics_s.items():
                        mlflow.log_metric(f'OOT_{safra_name}_{metric_name}', metric_val)
                    results.append({
                        'Increment': increment_id,
                        'Increment_Name': increment_name,
                        'N_Features': len(keep_cols),
                        'Model': model_label,
                        'Split': f'OOT_{safra_name}',
                        **metrics_s,
                    })
            
            # Log the model
            mlflow.sklearn.log_model(pipe, f'model_{model_label}_inc{increment_id}')
    
    return results, trained_models


print('Incremental evaluation functions loaded')

In [None]:
# =============================================================================
# MLFLOW SETUP
# =============================================================================
mlflow.set_experiment(EXPERIMENT_NAME)
# Disable autolog: manual control avoids duplicated metrics/params
mlflow.autolog(disable=True)

print(f'MLflow experiment: {EXPERIMENT_NAME}')
print(f'Tracking URI: {mlflow.get_tracking_uri()}')

In [None]:
# =============================================================================
# LOAD DATA FROM THE GOLD FEATURE STORE
# =============================================================================
logger.info('Loading feature store...')

df_spark = spark.sql('SELECT * FROM Gold.feature_store.clientes_consolidado')

# Keep only customers with FLAG_INSTALACAO = 1 (active customers)
df_spark_pos = df_spark.filter(F.col('FLAG_INSTALACAO') == 1)

# Drop columns with more than 75% missing values.
# Null counts come from a single aggregation instead of one Spark job per column.
total = df_spark_pos.count()
if total == 0:
    raise ValueError('No rows with FLAG_INSTALACAO = 1 in the feature store')
null_counts = df_spark_pos.select([
    F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df_spark_pos.columns
]).first().asDict()
cols_to_keep = [c for c in df_spark_pos.columns if null_counts[c] / total <= 0.75]

df_spark_clean = df_spark_pos.select(cols_to_keep)

logger.info('Feature store: %d records, %d columns (after missing-value filter)', total, len(cols_to_keep))
print(f'Records: {total:,}')
print(f'Columns: {len(cols_to_keep)}')

In [None]:
# =============================================================================
# TEMPORAL SPLIT: SAMPLE (TRAIN+VAL) / OOS / OOT
# =============================================================================
safras_train_oos = [202410, 202411, 202412, 202501]
safras_oot = [202502, 202503]

# Separate OOT (temporal holdout)
df_train_pool = df_spark_clean.filter(F.col('SAFRA').isin(safras_train_oos))
df_oot_spark = df_spark_clean.filter(F.col('SAFRA').isin(safras_oot))

# Stratified train/OOS split (75/25)
df_sample_spark, df_oos_spark = split_stratified_data(df_train_pool, percent=0.25)

logger.info('Converting to Pandas...')
df_sample = df_sample_spark.toPandas()
df_oos = df_oos_spark.toPandas()
df_oot = df_oot_spark.toPandas()

logger.info('Sample: %d rows, OOS: %d rows, OOT: %d rows', len(df_sample), len(df_oos), len(df_oot))
print(f'Sample (Train): {df_sample.shape}')
print(f'OOS: {df_oos.shape}')
print(f'OOT: {df_oot.shape}')

In [None]:
# =============================================================================
# PREPARE X/Y SPLITS
# =============================================================================
target = 'FPD'

# Drop duplicates and rows with a missing target
df_sample = df_sample.drop_duplicates(subset=['NUM_CPF', 'SAFRA']).dropna(subset=[target])
df_oos = df_oos.drop_duplicates(subset=['NUM_CPF', 'SAFRA']).dropna(subset=[target])
df_oot = df_oot.drop_duplicates(subset=['NUM_CPF', 'SAFRA']).dropna(subset=[target])

# Separate X / y. SAFRA is excluded to avoid temporal leakage (C1 fix).
# NON_FEATURE_COLS (defined above) already lists the keys, targets and metadata columns.
feature_cols = [c for c in df_sample.columns if c not in NON_FEATURE_COLS]

X_train = df_sample[feature_cols].copy()
y_train = df_sample[target].astype(int).copy()

X_oos = df_oos[feature_cols].copy()
y_oos = df_oos[target].astype(int).copy()

X_oot = df_oot[feature_cols].copy()
y_oot = df_oot[target].astype(int).copy()

# Per-vintage OOT (use SAFRA from the original df, since X no longer carries it)
safras_oot_detail = {}
for safra in safras_oot:
    mask = df_oot['SAFRA'] == safra
    safras_oot_detail[str(safra)] = (X_oot[mask].copy(), y_oot[mask].copy())

print(f'X_train: {X_train.shape}, FPD rate: {y_train.mean():.4f}')
print(f'X_oos:   {X_oos.shape}, FPD rate: {y_oos.mean():.4f}')
print(f'X_oot:   {X_oot.shape}, FPD rate: {y_oot.mean():.4f}')
for s, (x, y) in safras_oot_detail.items():
    print(f'  OOT {s}: {x.shape[0]:,} rows, FPD rate: {y.mean():.4f}')

In [None]:
# =============================================================================
# SPLIT FEATURES BY PREFIX (BOOK)
# =============================================================================
feature_groups = get_feature_groups(feature_cols)

print('Feature Groups:')
for group, cols in feature_groups.items():
    print(f'  {group}: {len(cols)} features')
    if len(cols) <= 10:
        print(f'    {cols}')
    else:
        print(f'    {cols[:5]} ... {cols[-3:]}')

print(f'\nTotal features available: {sum(len(v) for v in feature_groups.values())}')

In [None]:
# =============================================================================
# DEFINE INCREMENTS
# =============================================================================
INCREMENTS = [
    {'id': 1, 'name': 'Base + Recarga',           'books': ['REC']},
    {'id': 2, 'name': 'Base + Recarga + Pagamento', 'books': ['REC', 'PAG']},
    {'id': 3, 'name': 'Full (+ Faturamento)',       'books': ['REC', 'PAG', 'FAT']},
]

for inc in INCREMENTS:
    features = build_increment_features(feature_groups, inc['id'])
    print(f"Step {inc['id']}: {inc['name']} — {len(features)} features")

In [None]:
# =============================================================================
# MAIN LOOP: TRAIN AND EVALUATE EACH INCREMENT
# =============================================================================
all_results = []
all_models = {}

with mlflow.start_run(run_name='Incremental_Book_Evaluation') as parent_run:
    mlflow.set_tag('evaluation_type', 'incremental_books')
    mlflow.log_param('n_increments', len(INCREMENTS))
    mlflow.log_param('safras_train', str(safras_train_oos))
    mlflow.log_param('safras_oot', str(safras_oot))
    
    for inc in INCREMENTS:
        logger.info('='*60)
        logger.info('INCREMENT %d: %s', inc['id'], inc['name'])
        logger.info('='*60)
        
        features = build_increment_features(feature_groups, inc['id'])
        logger.info('Features: %d', len(features))
        
        results, models = train_and_evaluate_increment(
            X_train, y_train,
            X_oos, y_oos,
            X_oot, y_oot,
            features=features,
            increment_id=inc['id'],
            increment_name=inc['name'],
            safras_oot_detail=safras_oot_detail,
        )
        
        all_results.extend(results)
        all_models[inc['id']] = models
        
        # Log the increment summary on the parent run
        for r in results:
            if r['Split'] == 'OOT':
                mlflow.log_metric(f"inc{inc['id']}_{r['Model']}_KS_OOT", r['KS'])
                mlflow.log_metric(f"inc{inc['id']}_{r['Model']}_AUC_OOT", r['AUC'])
    
    # Log parent run metadata
    mlflow.log_param('parent_run_id', parent_run.info.run_id)

logger.info('Incremental evaluation loop finished: %d results', len(all_results))
print(f'Total results: {len(all_results)}')

In [None]:
# =============================================================================
# CONSOLIDATE RESULTS INTO A DATAFRAME
# =============================================================================
df_results = pd.DataFrame(all_results)

print('\n' + '='*80)
print('FULL RESULTS TABLE')
print('='*80)
# print (not display) so the to_string() table renders with real line breaks
print(df_results.to_string(index=False))

# Pivot: KS by Increment x Model x Split
df_ks_pivot = df_results.pivot_table(
    values='KS', index=['Increment', 'Increment_Name', 'N_Features'],
    columns=['Model', 'Split'], aggfunc='first'
)
print('\n' + '='*80)
print('KS TABLE (Increment x Model x Split)')
print('='*80)
display(df_ks_pivot)

# Pivot: AUC by Increment x Model x Split
df_auc_pivot = df_results.pivot_table(
    values='AUC', index=['Increment', 'Increment_Name', 'N_Features'],
    columns=['Model', 'Split'], aggfunc='first'
)
print('\n' + '='*80)
print('AUC TABLE (Increment x Model x Split)')
print('='*80)
display(df_auc_pivot)

## Results: Marginal Contribution

In [None]:
# =============================================================================
# MARGINAL CONTRIBUTION (DELTA BETWEEN INCREMENTS)
# =============================================================================
marginal_rows = []

for model in ['LR_L1', 'LGBM']:
    for split in ['Train', 'OOS', 'OOT']:
        prev_ks = None
        prev_auc = None
        for inc in INCREMENTS:
            row = df_results[
                (df_results['Increment'] == inc['id']) &
                (df_results['Model'] == model) &
                (df_results['Split'] == split)
            ]
            if row.empty:
                continue
            ks = row.iloc[0]['KS']
            auc = row.iloc[0]['AUC']
            gini = row.iloc[0]['Gini']
            delta_ks = round(ks - prev_ks, 4) if prev_ks is not None else None
            delta_auc = round(auc - prev_auc, 4) if prev_auc is not None else None
            marginal_rows.append({
                'Increment': inc['id'],
                'Name': inc['name'],
                'Model': model,
                'Split': split,
                'KS': ks,
                'AUC': auc,
                'Gini': gini,
                'Delta_KS': delta_ks,
                'Delta_AUC': delta_auc,
            })
            prev_ks = ks
            prev_auc = auc

df_marginal = pd.DataFrame(marginal_rows)

print('='*80)
print('MARGINAL CONTRIBUTION PER BOOK')
print('='*80)
print(df_marginal.to_string(index=False))

# Executive summary
print('\n--- EXECUTIVE SUMMARY ---')
for model in ['LR_L1', 'LGBM']:
    print(f'\nModel: {model}')
    for split in ['OOS', 'OOT']:
        subset = df_marginal[(df_marginal['Model'] == model) & (df_marginal['Split'] == split)]
        for _, r in subset.iterrows():
            # pandas stores the missing baseline delta as NaN, so test with pd.notna
            delta = f"(delta KS: {r['Delta_KS']:+.4f})" if pd.notna(r['Delta_KS']) else '(baseline)'
            print(f"  Step {r['Increment']} [{split}]: KS={r['KS']:.4f} {delta}")

In [None]:
# =============================================================================
# VISUALIZATION 1: KS EVOLUTION PER INCREMENT
# =============================================================================
fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=True)
fig.suptitle('KS by Split: Incremental Book Evolution', fontsize=14, fontweight='bold')

for idx, split in enumerate(['Train', 'OOS', 'OOT']):
    ax = axes[idx]
    for model in ['LR_L1', 'LGBM']:
        subset = df_results[(df_results['Model'] == model) & (df_results['Split'] == split)]
        if not subset.empty:
            ax.plot(
                subset['Increment'].values,
                subset['KS'].values,
                marker='o', linewidth=2, markersize=8,
                label=model,
            )
            # Annotate values
            for _, r in subset.iterrows():
                ax.annotate(f"{r['KS']:.3f}", (r['Increment'], r['KS']),
                           textcoords='offset points', xytext=(0, 10), ha='center', fontsize=9)
    
    ax.set_title(f'{split}', fontsize=12)
    ax.set_xlabel('Increment')
    ax.set_xticks([1, 2, 3])
    ax.set_xticklabels(['Base+REC', '+PAG', '+FAT'], rotation=15)
    if idx == 0:
        ax.set_ylabel('KS')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('/tmp/fig_incremental_ks_evolution.png', dpi=150, bbox_inches='tight')
plt.show()
print('KS chart saved')

In [None]:
# =============================================================================
# VISUALIZATION 2: AUC PROGRESSION (COMPARATIVE BAR CHART)
# =============================================================================
fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=True)
fig.suptitle('AUC by Split: Incremental Book Evolution', fontsize=14, fontweight='bold')

colors = {'LR_L1': '#2196F3', 'LGBM': '#4CAF50'}
bar_width = 0.35

for idx, split in enumerate(['Train', 'OOS', 'OOT']):
    ax = axes[idx]
    for j, model in enumerate(['LR_L1', 'LGBM']):
        subset = df_results[(df_results['Model'] == model) & (df_results['Split'] == split)]
        if not subset.empty:
            x = np.arange(len(subset))
            bars = ax.bar(
                x + j * bar_width, subset['AUC'].values,
                bar_width, label=model, color=colors[model], alpha=0.85
            )
            for bar, val in zip(bars, subset['AUC'].values):
                ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.003,
                       f'{val:.3f}', ha='center', fontsize=8)
    
    ax.set_title(f'{split}', fontsize=12)
    ax.set_xticks(np.arange(3) + bar_width / 2)
    ax.set_xticklabels(['Base+REC', '+PAG', '+FAT'], rotation=15)
    if idx == 0:
        ax.set_ylabel('AUC')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('/tmp/fig_incremental_auc_progression.png', dpi=150, bbox_inches='tight')
plt.show()
print('AUC chart saved')

In [None]:
# =============================================================================
# VISUALIZATION 3: MARGINAL CONTRIBUTION (WATERFALL / DELTA BARS)
# =============================================================================
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Marginal Contribution per Book (Delta KS)', fontsize=14, fontweight='bold')

for idx, model in enumerate(['LR_L1', 'LGBM']):
    ax = axes[idx]
    subset = df_marginal[
        (df_marginal['Model'] == model) &
        (df_marginal['Split'] == 'OOT') &
        (df_marginal['Delta_KS'].notna())
    ]
    if not subset.empty:
        colors_bar = ['#4CAF50' if d > 0 else '#F44336' for d in subset['Delta_KS']]
        bars = ax.bar(
            range(len(subset)), subset['Delta_KS'].values,
            color=colors_bar, alpha=0.85, edgecolor='black', linewidth=0.5
        )
        for bar, val in zip(bars, subset['Delta_KS'].values):
            ax.text(bar.get_x() + bar.get_width()/2,
                   bar.get_height() + 0.001 * (1 if val >= 0 else -3),
                   f'{val:+.4f}', ha='center', fontsize=10, fontweight='bold')
    
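    ax.set_title(f'{model}: OOT', fontsize=12)
    # Derive tick labels from the rows actually plotted so ticks and labels always match
    tick_labels = {2: '+Pagamento', 3: '+Faturamento'}
    ax.set_xticks(range(len(subset)))
    ax.set_xticklabels([tick_labels.get(i, f'Step {i}') for i in subset['Increment']], rotation=0)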
    ax.set_ylabel('Delta KS')
    ax.axhline(y=0, color='black', linewidth=0.8)
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('/tmp/fig_incremental_marginal_delta.png', dpi=150, bbox_inches='tight')
plt.show()
print('Marginal delta chart saved')

In [None]:
# =============================================================================
# VISUALIZATION 4: TOP-15 FEATURE IMPORTANCE PER INCREMENT (LGBM)
# =============================================================================
fig, axes = plt.subplots(1, 3, figsize=(20, 8))
fig.suptitle('Top 15 Features per Increment (LGBM Importance)', fontsize=14, fontweight='bold')

for idx, inc in enumerate(INCREMENTS):
    ax = axes[idx]
    lgbm_model = all_models[inc['id']].get('LGBM')
    if lgbm_model is None:
        ax.set_title(f"Step {inc['id']}: N/A")
        continue
    
    # Extract importances
    booster = lgbm_model.named_steps['model']
    prep = lgbm_model.named_steps['prep']
    try:
        feature_names = prep.get_feature_names_out()
    except Exception:
        features_inc = build_increment_features(feature_groups, inc['id'])
        feature_names = [c for c in features_inc if c != 'SAFRA']
    
    importances = booster.feature_importances_
    n = min(len(feature_names), len(importances))
    df_imp = pd.DataFrame({
        'feature': list(feature_names)[:n],
        'importance': list(importances)[:n]
    }).sort_values('importance', ascending=True).tail(15)
    
    # Color by source book. Strip the ColumnTransformer prefix (num__/cat__) and match
    # on the start of the name, so a base column that merely contains 'REC_' stays gray.
    book_colors = {'REC_': '#2196F3', 'PAG_': '#FF9800', 'FAT_': '#9C27B0'}  # blue, orange, purple
    colors_fi = []
    for f in df_imp['feature']:
        raw_name = f
        for transformer_prefix in ('num__', 'cat__'):
            if raw_name.startswith(transformer_prefix):
                raw_name = raw_name[len(transformer_prefix):]
                break
        colors_fi.append(next(
            (color for prefix, color in book_colors.items() if raw_name.startswith(prefix)),
            '#607D8B',  # gray (base)
        ))
    
    ax.barh(range(len(df_imp)), df_imp['importance'].values, color=colors_fi, alpha=0.85)
    ax.set_yticks(range(len(df_imp)))
    ax.set_yticklabels(df_imp['feature'].values, fontsize=7)
    ax.set_title(f"Step {inc['id']}: {inc['name']}\n({len(build_increment_features(feature_groups, inc['id']))} features)", fontsize=10)
    ax.set_xlabel('Importance')

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#607D8B', label='Base (Cadastro+Telco)'),
    Patch(facecolor='#2196F3', label='Recarga (REC_)'),
    Patch(facecolor='#FF9800', label='Pagamento (PAG_)'),
    Patch(facecolor='#9C27B0', label='Faturamento (FAT_)'),
]
fig.legend(handles=legend_elements, loc='lower center', ncol=4, fontsize=10)

plt.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.savefig('/tmp/fig_incremental_feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()
print('Feature importance chart saved')

In [None]:
# =============================================================================
# SWAP ANALYSIS PER INCREMENT
# =============================================================================

def swap_analysis_incremental(y_true_1, scores_1, y_true_2, scores_2, top_pct=0.1):
    """Compare model behavior across two periods (OOT1 vs OOT2) as a stability check.

    Compares KS, AUC and the FPD rate among the top `top_pct` highest-risk scores
    of each period. Each period is ranked on its own; individual customers are not
    tracked across periods, so this is not a customer-level swap-in/swap-out.
    The 'Top10' key names assume the default top_pct=0.1.
    """
    # Reset the index so positional indices from np.argsort line up with y
    y1 = y_true_1.reset_index(drop=True)
    y2 = y_true_2.reset_index(drop=True)
    s1 = np.array(scores_1)
    s2 = np.array(scores_2)

    n1 = int(len(s1) * top_pct)
    n2 = int(len(s2) * top_pct)
    
    # Highest-risk rows in each period. Slice from len - n because [-0:] would
    # select every row when n is 0.
    top_idx_1 = np.argsort(s1)[len(s1) - n1:]
    top_idx_2 = np.argsort(s2)[len(s2) - n2:]
    
    # FPD rate in the top band
    fpd_rate_top_1 = y1.iloc[top_idx_1].mean() if len(top_idx_1) > 0 else 0
    fpd_rate_top_2 = y2.iloc[top_idx_2].mean() if len(top_idx_2) > 0 else 0
    
    # Overall metrics
    metrics_1 = compute_metrics(y1, s1)
    metrics_2 = compute_metrics(y2, s2)
    
    return {
        'KS_OOT1': metrics_1['KS'],
        'KS_OOT2': metrics_2['KS'],
        'Delta_KS': round(metrics_2['KS'] - metrics_1['KS'], 4),
        'AUC_OOT1': metrics_1['AUC'],
        'AUC_OOT2': metrics_2['AUC'],
        'Delta_AUC': round(metrics_2['AUC'] - metrics_1['AUC'], 4),
        'FPD_Rate_Top10_OOT1': round(fpd_rate_top_1, 4),
        'FPD_Rate_Top10_OOT2': round(fpd_rate_top_2, 4),
        'Delta_FPD_Top10': round(fpd_rate_top_2 - fpd_rate_top_1, 4),
    }


# Run the stability comparison for each increment
swap_results = []

for inc in INCREMENTS:
    features = build_increment_features(feature_groups, inc['id'])
    keep_cols = [c for c in features if c in X_oot.columns]
    
    for model_type in ['LR', 'LGBM']:
        model_label = 'LR_L1' if model_type == 'LR' else 'LGBM'
        pipe = all_models[inc['id']][model_type]
        
        # Scores per OOT vintage
        X_oot1, y_oot1 = safras_oot_detail['202502']
        X_oot2, y_oot2 = safras_oot_detail['202503']
        
        scores_oot1 = pipe.predict_proba(X_oot1[keep_cols].reset_index(drop=True))[:, 1]
        scores_oot2 = pipe.predict_proba(X_oot2[keep_cols].reset_index(drop=True))[:, 1]
        
        swap = swap_analysis_incremental(y_oot1, scores_oot1, y_oot2, scores_oot2)
        swap_results.append({
            'Increment': inc['id'],
            'Name': inc['name'],
            'Model': model_label,
            **swap
        })

df_swap = pd.DataFrame(swap_results)

print('='*80)
print('SWAP ANALYSIS: TEMPORAL STABILITY (OOT1 vs OOT2)')
print('='*80)
print(df_swap.to_string(index=False))

In [None]:
# =============================================================================
# VISUALIZATION 5: SWAP COMPARISON
# =============================================================================
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Swap Analysis: OOT1 vs OOT2 Stability', fontsize=14, fontweight='bold')

for idx, metric in enumerate(['Delta_KS', 'Delta_FPD_Top10']):
    ax = axes[idx]
    for j, model in enumerate(['LR_L1', 'LGBM']):
        subset = df_swap[df_swap['Model'] == model]
        x = np.arange(len(subset))
        bars = ax.bar(
            x + j * 0.35, subset[metric].values,
            0.35, label=model, alpha=0.85
        )
        for bar, val in zip(bars, subset[metric].values):
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                   f'{val:+.4f}', ha='center', fontsize=9)
    
    ax.set_title(metric.replace('_', ' '), fontsize=12)
    ax.set_xticks(np.arange(3) + 0.175)
    ax.set_xticklabels(['Base+REC', '+PAG', '+FAT'], rotation=0)
    ax.axhline(y=0, color='black', linewidth=0.8)
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('/tmp/fig_incremental_swap_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
print('Swap chart saved')

In [None]:
# =============================================================================
# EXPORT: CSVs AND SUMMARY
# =============================================================================
import os
artifacts_dir = '/tmp/incremental_eval'
os.makedirs(artifacts_dir, exist_ok=True)

# CSVs
df_results.to_csv(f'{artifacts_dir}/incremental_evaluation_full_results.csv', index=False)
df_marginal.to_csv(f'{artifacts_dir}/incremental_marginal_contribution.csv', index=False)
df_swap.to_csv(f'{artifacts_dir}/swap_analysis_consolidated.csv', index=False)
df_ks_pivot.to_csv(f'{artifacts_dir}/incremental_comparison_ks.csv')
df_auc_pivot.to_csv(f'{artifacts_dir}/incremental_comparison_auc.csv')

# Log artifacts to MLflow
with mlflow.start_run(run_name='Incremental_Summary_Artifacts', nested=False):
    mlflow.set_tag('artifact_type', 'incremental_evaluation_summary')
    for fname in os.listdir(artifacts_dir):
        mlflow.log_artifact(f'{artifacts_dir}/{fname}', 'incremental_evaluation')
    
    # Log figures
    for fig_path in [
        '/tmp/fig_incremental_ks_evolution.png',
        '/tmp/fig_incremental_auc_progression.png',
        '/tmp/fig_incremental_marginal_delta.png',
        '/tmp/fig_incremental_feature_importance.png',
        '/tmp/fig_incremental_swap_comparison.png',
    ]:
        if os.path.exists(fig_path):
            mlflow.log_artifact(fig_path, 'figures')

print(f'Artifacts exported to: {artifacts_dir}')
print(f'CSVs: {len([f for f in os.listdir(artifacts_dir) if f.endswith(".csv")])}')
print('MLflow artifacts logged')

In [None]:
# =============================================================================
# FINAL SUMMARY HEATMAP
# =============================================================================
fig, ax = plt.subplots(figsize=(12, 6))
fig.suptitle('Heatmap: KS by Increment, Model and Split', fontsize=14, fontweight='bold')

heatmap_data = df_results.pivot_table(
    values='KS',
    index=['Increment', 'Model'],
    columns='Split',
    aggfunc='first'
)

# Order the columns
col_order = ['Train', 'OOS', 'OOT']
heatmap_data = heatmap_data[[c for c in col_order if c in heatmap_data.columns]]

im = ax.imshow(heatmap_data.values, cmap='RdYlGn', aspect='auto', vmin=0.15, vmax=0.45)

ax.set_xticks(range(len(heatmap_data.columns)))
ax.set_xticklabels(heatmap_data.columns, fontsize=11)
ax.set_yticks(range(len(heatmap_data.index)))
ax.set_yticklabels([f'Step {i} — {m}' for i, m in heatmap_data.index], fontsize=10)

# Annotate values
for i in range(len(heatmap_data.index)):
    for j in range(len(heatmap_data.columns)):
        val = heatmap_data.values[i, j]
        ax.text(j, i, f'{val:.3f}', ha='center', va='center', fontsize=11, fontweight='bold')

plt.colorbar(im, ax=ax, label='KS')
plt.tight_layout()
plt.savefig('/tmp/fig_incremental_summary_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print('Summary heatmap saved')

## Conclusions

### Interpreting the Results

| Increment | What it evaluates |
|-----------|-------------------|
| Step 1 (Base + Recarga) | Predictive power of the base features plus recharge behavior. There is no base-only run, so the Recarga book's own marginal gain is not isolated |
| Step 2 (+ Pagamento) | Marginal gain from payment history |
| Step 3 (+ Faturamento) | Marginal gain from the billing profile |

### Decision Criteria

- **Delta KS > +0.02**: The book contributes significantly
- **Delta KS between -0.01 and +0.02**: Marginal contribution (weigh added complexity against the gain)
- **Delta KS < -0.01**: The book may be adding noise (investigate)
- **|Delta FPD Top10| < 0.02**: Model is stable over time (acceptable swap)
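
The decision bands above can be written as a small helper. This is a minimal sketch: the function names `classify_book_contribution` and `is_temporally_stable` and the example deltas are hypothetical and are not results from this notebook.

```python
def classify_book_contribution(delta_ks, gain_threshold=0.02, noise_threshold=-0.01):
    """Map a delta KS to one of the three decision bands."""
    if delta_ks > gain_threshold:
        return "significant"
    if delta_ks < noise_threshold:
        return "possible noise"
    return "marginal"


def is_temporally_stable(delta_fpd_top10, tolerance=0.02):
    """Stable when the top-decile FPD rate moves by less than the tolerance."""
    return abs(delta_fpd_top10) < tolerance


# Hypothetical deltas, for illustration only
examples = {"+PAG": 0.031, "+FAT": 0.004, "+OTHER": -0.015}
for book, delta in examples.items():
    print(f"{book}: delta KS {delta:+.3f} -> {classify_book_contribution(delta)}")
```

Applied to `df_marginal`, a helper like this would typically be run on the OOT rows only, since those deltas reflect out-of-time behavior.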

### Next Steps

1. If a book does not contribute, consider removing it for simplicity
2. Apply feature selection (IV + L1) on the best increment
3. Train the final model with the selected features
4. Register the model in the MLflow Model Registry
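
For step 2, Information Value (IV) for a pre-binned feature could look like the sketch below. The function name `information_value`, the additive `smoothing` term (used here to avoid division by zero in empty bins) and the toy data are assumptions for illustration; the project's actual IV routine is not shown in this notebook.

```python
import math
from collections import Counter


def information_value(bins, y, smoothing=0.5):
    """IV = sum over bins of (pct_good - pct_bad) * ln(pct_good / pct_bad).

    bins: bin label of each row; y: 1 for FPD (bad), 0 otherwise (good).
    smoothing: pseudo-count added to every bin (an assumption of this sketch).
    """
    good = Counter(b for b, t in zip(bins, y) if t == 0)
    bad = Counter(b for b, t in zip(bins, y) if t == 1)
    labels = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(labels)
    total_bad = sum(bad.values()) + smoothing * len(labels)
    iv = 0.0
    for label in labels:
        pct_good = (good[label] + smoothing) / total_good
        pct_bad = (bad[label] + smoothing) / total_bad
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv


# Toy data: bin 'a' is mostly good, bin 'b' is mostly bad
bins = ['a', 'a', 'a', 'b', 'b', 'b']
y = [0, 0, 1, 1, 1, 0]
iv_informative = information_value(bins, y)

# A feature whose bins have the same good/bad mix carries no information
iv_flat = information_value(['a', 'a', 'b', 'b'], [0, 1, 0, 1])
print(f"informative: {iv_informative:.4f}, flat: {iv_flat:.4f}")
```

Each term of the sum is non-negative, so IV only grows as the good and bad distributions diverge across bins; features with IV near zero are natural candidates to drop before the L1 step.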