# Credit Risk Pipeline Quickstart

This notebook runs the **Unified Risk Pipeline** end to end on the bundled synthetic dataset.
The sample includes stratified monthly observations, a long-run calibration hold-out, a recent sample
for stage-2 calibration and a future scoring batch, so every major pipeline stage can be validated quickly.

## 1. Imports and paths

All sample inputs live under `examples/data/credit_risk_sample`. The path is relative, so start the
notebook from the directory that contains `examples/`.

In [None]:
from pathlib import Path
import pandas as pd

from risk_pipeline.core.config import Config
from risk_pipeline.unified_pipeline import UnifiedRiskPipeline

BASE_DIR = Path('examples/data/credit_risk_sample')
DEV_PATH = BASE_DIR / 'development.csv'
CAL_LONG_PATH = BASE_DIR / 'calibration_longrun.csv'
CAL_RECENT_PATH = BASE_DIR / 'calibration_recent.csv'
SCORE_PATH = BASE_DIR / 'scoring_future.csv'
DICT_PATH = BASE_DIR / 'data_dictionary.csv'
OUTPUT_DIR = Path('output/credit_risk_sample_notebook')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

dev_df = pd.read_csv(DEV_PATH)
cal_long_df = pd.read_csv(CAL_LONG_PATH)
cal_recent_df = pd.read_csv(CAL_RECENT_PATH)
score_df = pd.read_csv(SCORE_PATH)
data_dictionary = pd.read_csv(DICT_PATH)

dev_df.head()
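
If one of the `read_csv` calls above raises `FileNotFoundError`, a small check like the one below can report which sample files are absent. This is only a sketch: `missing_inputs` and `EXPECTED_FILES` are hypothetical names introduced here, not part of `risk_pipeline`. The file names are copied from the path constants in the cell above.

```python
from pathlib import Path


def missing_inputs(base_dir, filenames):
    """Return the names in `filenames` that are not regular files under `base_dir`."""
    base = Path(base_dir)
    return [name for name in filenames if not (base / name).is_file()]


# Hypothetical list; the names mirror DEV_PATH, CAL_LONG_PATH, etc. above.
EXPECTED_FILES = [
    'development.csv',
    'calibration_longrun.csv',
    'calibration_recent.csv',
    'scoring_future.csv',
    'data_dictionary.csv',
]

missing = missing_inputs('examples/data/credit_risk_sample', EXPECTED_FILES)
if missing:
    print(f'Missing sample files: {missing}')
```

An empty `missing` list means all five inputs were found.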

## 2. Quick sanity checks

In [None]:
dev_df['target'].value_counts(normalize=True).rename('default_rate')

In [None]:
dev_df.groupby('snapshot_month')['target'].mean().rename('monthly_default_rate')
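
A monthly default rate is easier to judge next to the number of observations behind it, because a month with few rows can show an extreme rate by chance. The sketch below applies the same `groupby` to a toy frame and adds counts. The toy values are made up for illustration; only the column names `snapshot_month` and `target` come from the cells above.

```python
import pandas as pd

# Toy stand-in for dev_df; the values are invented for illustration only.
toy = pd.DataFrame({
    'snapshot_month': ['2023-01', '2023-01', '2023-01', '2023-02', '2023-02', '2023-03'],
    'target': [0, 1, 0, 0, 0, 1],
})

# One row per month: observation count, number of defaults, and default rate.
monthly = (
    toy.groupby('snapshot_month')['target']
       .agg(n_obs='count', n_defaults='sum', default_rate='mean')
)
monthly
```

Replacing `toy` with `dev_df` should give the same table for the sample data.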

## 3. Configure the pipeline

The configuration below enables tsfresh feature generation, the dual modelling flow (WOE + raw),
the calibration stages and risk band optimisation (six quantile bands). It stays light enough for a
laptop by switching off Optuna tuning, SHAP calculation and the noise sentinel.

In [None]:
cfg = Config(
    # Column roles
    target_column='target',
    id_column='customer_id',
    time_column='app_dt',
    # Test and out-of-time splits
    create_test_split=True,
    stratify_test=True,
    oot_months=2,
    # Pipeline stages to enable
    enable_dual=True,
    enable_tsfresh_features=True,
    enable_scoring=True,
    enable_stage2_calibration=True,
    output_folder=str(OUTPUT_DIR),
    # Risk bands
    n_risk_bands=6,
    risk_band_method='quantile',
    # Feature selection
    max_psi=0.6,
    selection_steps=['psi', 'univariate', 'iv', 'correlation', 'stepwise'],
    # Candidate algorithms
    algorithms=['logistic', 'lightgbm'],
    # Expensive extras switched off to keep the run short
    use_optuna=False,
    calculate_shap=False,
    use_noise_sentinel=False,
    # Reproducibility
    random_state=42,
)
# model_type is assigned after construction rather than passed to Config().
cfg.model_type = ['LogisticRegression', 'LightGBM']
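
The configuration lists `'psi'` as the first selection step and sets `max_psi=0.6`. How the pipeline computes PSI internally is not shown in this notebook, and the reading that features above the threshold are dropped is an assumption. The sketch below uses the common textbook definition of the Population Stability Index, the sum over bins of (actual share - expected share) x ln(actual share / expected share), to give a feel for the scale of the number. The function `psi` and its `eps` floor are hypothetical names for illustration.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are per-bin counts over the same bins. `eps` is a floor
    applied to each proportion so that empty bins do not cause log(0).
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError('both inputs must use the same bins')
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


# Identical distributions give zero; a fully reversed one gives a large value.
stable = psi([100, 200, 300, 400], [100, 200, 300, 400])
shifted = psi([100, 200, 300, 400], [400, 300, 200, 100])
print(f'identical: {stable:.4f}, reversed: {shifted:.4f}')
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, so a cap of 0.6 is permissive.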

## 4. Run the unified pipeline

In [None]:
pipe = UnifiedRiskPipeline(cfg)
results = pipe.fit(
    dev_df,
    data_dictionary=data_dictionary,
    calibration_df=cal_long_df,
    stage2_df=cal_recent_df,
    score_df=score_df,
)

## 5. Inspect key outputs

In [None]:
best_model = results.get('best_model_name')
model_scores = results.get('model_results', {}).get('scores', {})
print(f'Best model: {best_model}')
pd.DataFrame(model_scores).T

In [None]:
feature_report = pipe.reporter.reports_.get('features')
feature_report.head() if feature_report is not None else 'No feature report available.'

In [None]:
calibration_report = pipe.reporter.reports_.get('calibration')
calibration_report

In [None]:
risk_bands = pipe.reporter.reports_.get('risk_bands_summary', {})
risk_bands
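
The configuration asks for `n_risk_bands=6` with `risk_band_method='quantile'`. The pipeline's own banding code is not shown here. As an assumption about what quantile banding means, the sketch below uses `pd.qcut`, pandas' standard equal-frequency binning, on made-up scores.

```python
import pandas as pd

# Made-up scores for illustration; in the notebook these would be model outputs.
scores = pd.Series([0.01, 0.02, 0.03, 0.05, 0.08, 0.10,
                    0.15, 0.20, 0.30, 0.40, 0.55, 0.70])

# Six equal-frequency bands, labelled 1 (lowest scores) to 6 (highest scores).
bands = pd.qcut(scores, q=6, labels=[1, 2, 3, 4, 5, 6])

# Per-band size and score range.
summary = scores.groupby(bands, observed=True).agg(
    n='count', min_score='min', max_score='max'
)
summary
```

With equal-frequency bands every band holds about the same number of accounts, so the band boundaries follow the score distribution rather than fixed cut-offs.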

## 6. Generated files

In [None]:
sorted(p.relative_to(OUTPUT_DIR.parent) for p in OUTPUT_DIR.glob('**/*') if p.is_file())

## 7. Automating via script

`examples/quickstart_demo.py` mirrors the steps above so that the flow can be run headlessly,
for example in a CI pipeline.