# Risk Pipeline Quickstart Notebook

This notebook demonstrates how to run the **Unified Risk Pipeline** end-to-end on the bundled synthetic dataset. It is designed as a minimal yet comprehensive validation harness so you can exercise the entire modelling flow (splitting, WOE, feature selection, model training, risk banding and reporting) with just a few cells.

> Tip: If you are running this notebook outside of the repository checkout, install the latest development branch first.
>
> ```bash
> pip install "git+https://github.com/selimoksuz/risk-model-pipeline.git@development"
> ```


## 1. Imports and paths

The quickstart dataset and data dictionary live under `examples/data/quickstart`. All outputs are written to `output/notebook_quickstart` by default.

In [None]:
from pathlib import Path
import pandas as pd

from risk_pipeline.core.config import Config
from risk_pipeline.unified_pipeline import UnifiedRiskPipeline

DATA_DIR = Path('examples/data/quickstart')
INPUT_CSV = DATA_DIR / 'loan_applications.csv'
DICTIONARY_CSV = DATA_DIR / 'data_dictionary.csv'
OUTPUT_DIR = Path('output/notebook_quickstart')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

df = pd.read_csv(INPUT_CSV)
data_dictionary = pd.read_csv(DICTIONARY_CSV)

df.head()

## 2. Quick sanity checks

Inspect the class balance and a few categorical distributions before running the pipeline.

In [None]:
df['target'].value_counts(normalize=True).rename('default_rate')

In [None]:
df[['region', 'segment']].describe(include='all')
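
Two further checks are often useful before fitting: the share of missing values per column, and the default rate per calendar month (relevant because the pipeline later holds out the most recent month). The sketch below runs on a small made-up frame so it is self-contained; the column names mirror this notebook, but the values are invented, and it assumes `app_dt` parses as a date.

```python
import pandas as pd

# Tiny stand-in frame; values are invented for illustration only.
demo = pd.DataFrame({
    'app_id': [1, 2, 3, 4, 5, 6],
    'app_dt': ['2024-01-05', '2024-01-20', '2024-02-11',
               '2024-02-28', '2024-03-03', '2024-03-19'],
    'target': [0, 0, 1, 0, 0, 1],
    'region': ['north', 'south', None, 'north', 'east', 'south'],
})

# Share of missing values per column, highest first.
missing_share = demo.isna().mean().sort_values(ascending=False)

# Row count and default rate per calendar month.
demo['app_dt'] = pd.to_datetime(demo['app_dt'])
monthly = demo.groupby(demo['app_dt'].dt.to_period('M'))['target'].agg(['size', 'mean'])
monthly.columns = ['n_rows', 'default_rate']

print(missing_share)
print(monthly)
```

To apply the same checks to the quickstart data, replace `demo` with `df`.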

## 3. Configure the pipeline

The configuration below keeps the full modelling sequence but with light settings so the notebook finishes in a couple of minutes on a laptop. You can tweak any parameter if you want to stress-test specific stages.

In [None]:
cfg = Config(
    target_column='target',
    id_column='app_id',
    time_column='app_dt',
    create_test_split=True,
    test_size=0.25,
    stratify_test=True,
    oot_months=1,
    enable_dual=False,
    enable_tsfresh_features=False,
    enable_scoring=True,
    output_folder=str(OUTPUT_DIR),
    enable_stage2_calibration=False,
    n_risk_bands=5,
    risk_band_method='quantile',
    selection_steps=['psi', 'univariate', 'iv', 'correlation', 'stepwise'],
    max_psi=0.6,
    algorithms=['logistic'],
    use_optuna=False,
    calculate_shap=False,
    use_noise_sentinel=False,
    random_state=42,
)
cfg.model_type = 'logistic'
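
The `'psi'` selection step and the `max_psi=0.6` setting refer to the Population Stability Index, which measures how far a feature's distribution has drifted between two samples. The sketch below is a generic PSI calculation with quantile bins, not the pipeline's own implementation; the function name `psi`, the bin count and the `eps` floor are assumptions made for illustration. A threshold such as `max_psi` would presumably be compared against a value like the one this function returns.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges from quantiles of the reference sample;
    # the outermost bins are open-ended.
    inner_edges = np.unique(
        np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    )
    n_cells = len(inner_edges) + 1
    exp_counts = np.bincount(
        np.searchsorted(inner_edges, expected, side='right'), minlength=n_cells
    )
    act_counts = np.bincount(
        np.searchsorted(inner_edges, actual, side='right'), minlength=n_cells
    )
    # Floor the shares so empty bins do not produce log(0).
    exp_share = np.clip(exp_counts / len(expected), eps, None)
    act_share = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_share - exp_share) * np.log(act_share / exp_share)))

rng = np.random.default_rng(0)
train_values = rng.normal(size=5000)

same = psi(train_values, train_values)           # identical samples
shifted = psi(train_values, train_values + 1.0)  # mean shifted by one unit

print(f'PSI, identical samples: {same:.4f}')
print(f'PSI, shifted sample:    {shifted:.4f}')
```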

## 4. Run the unified pipeline

The pipeline prints progress for each stage. Expect to see data processing, splitting, WOE, feature selection, modelling, calibration (Stage 1 only), risk band optimisation and scoring outputs.

In [None]:
pipe = UnifiedRiskPipeline(cfg)
results = pipe.fit(df, data_dictionary=data_dictionary, score_df=df)
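
For intuition about the splitting stage, the sketch below shows one plausible reading of `oot_months=1`, `test_size=0.25` and `stratify_test=True`: the most recent calendar month becomes an out-of-time sample, and the remaining rows are split into train and test with the target rate preserved. This is an assumption about the intended behaviour, written in plain pandas on invented data; the pipeline's actual splitter may differ.

```python
import pandas as pd

# Invented stand-in data: eight in-time rows and four rows in the last month.
demo = pd.DataFrame({
    'app_id': list(range(1, 13)),
    'app_dt': ['2024-01-03', '2024-01-10', '2024-01-17', '2024-01-24',
               '2024-02-02', '2024-02-09', '2024-02-16', '2024-02-23',
               '2024-03-01', '2024-03-08', '2024-03-15', '2024-03-22'],
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})
demo['app_dt'] = pd.to_datetime(demo['app_dt'])

oot_months = 1
test_size = 0.25

# Rows in the last `oot_months` calendar months form the out-of-time sample.
months = demo['app_dt'].dt.to_period('M')
first_oot_month = months.max() - (oot_months - 1)
is_oot = months >= first_oot_month

oot = demo[is_oot]
in_time = demo[~is_oot]

# Stratified test split: sample the same fraction from each target class.
test = in_time.groupby('target').sample(frac=test_size, random_state=42)
train = in_time.drop(index=test.index)

print(len(train), len(test), len(oot))
```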

## 5. Inspect results

The `results` dictionary collects the artefacts most downstream processes need. Below are a few high-level summaries.

In [None]:
best_model = results.get('best_model_name')
model_scores = results.get('model_results', {}).get('scores', {})
print(f'Best model: {best_model}')
pd.DataFrame(model_scores).T

In [None]:
feature_report = pipe.reporter.reports_.get('features')
feature_report.head() if feature_report is not None else 'No feature report available.'

In [None]:
pipe.reporter.reports_.get('risk_bands_summary', {})
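
To make the `risk_band_method='quantile'` and `n_risk_bands=5` settings concrete, the sketch below cuts a set of simulated scores into five equal-sized bands with `pd.qcut` and summarises each band. It is a generic illustration on random data; the column names `pd_score` and `band` are hypothetical, and the pipeline's own banding and summary layout may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated scores, with outcomes drawn so that the default probability
# equals the score. Real pipeline output will look different.
scores = pd.DataFrame({'pd_score': rng.uniform(0, 1, size=1000)})
scores['target'] = (rng.uniform(0, 1, size=1000) < scores['pd_score']).astype(int)

# Quantile banding: five bands with (roughly) equal row counts,
# numbered 1 (lowest scores) to 5 (highest scores).
scores['band'] = pd.qcut(scores['pd_score'], q=5, labels=False) + 1

summary = scores.groupby('band').agg(
    n=('target', 'size'),
    min_score=('pd_score', 'min'),
    max_score=('pd_score', 'max'),
    bad_rate=('target', 'mean'),
)
print(summary)
```

In a well-ordered banding the `bad_rate` column should rise from band 1 to band 5, which is a quick visual check on the summary the pipeline reports.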

## 6. Generated files

All artefacts are stored under the configured output directory. Use the list below for a quick peek.

In [None]:
sorted(p.name for p in OUTPUT_DIR.glob('**/*') if p.is_file())

## 7. Automating via script (optional)

The repository also ships with `examples/quickstart_demo.py`. You can execute it from the console or from this notebook to validate the pipeline in CI environments.

```python
from examples.quickstart_demo import run_quickstart
run_quickstart('output/quickstart_script')
```
