# Initial Data Exploration

This notebook inspects the synthetic transition history shipped with the project so newcomers can understand the reporting periods, segments, and risk movements captured in the pipeline. The examples mirror the production configuration so the exploratory flow respects the same filters.


In [None]:
import pathlib
import sys

import pandas as pd

PROJECT_ROOT = pathlib.Path('..').resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.pd_transition_matrix import config
from src.pd_transition_matrix.data_management import filter_raw_data


In [None]:
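# Load the synthetic transition history relative to the project root resolved above.
raw = pd.read_csv(PROJECT_ROOT / 'data' / 'raw_transition_data.csv', parse_dates=['period_end'])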
raw.head()


The pipeline filters the raw data before any downstream processing. Below we reuse the exact production configuration, which keeps only observations from 2024 onward and the whitelisted segments, so that the notebook reflects production assumptions.
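
For readers who want to see the idea without opening the package, a filter of this kind can be sketched roughly as follows. This is only an illustration: `sketch_filter`, `min_period_end`, and `allowed_segments` are hypothetical names, and the real `filter_raw_data` and its configuration fields may differ.

```python
import pandas as pd


def sketch_filter(df, min_period_end, allowed_segments):
    """Hypothetical stand-in for the project's filter.

    Keeps rows dated on or after min_period_end whose segment is whitelisted.
    """
    keep_period = df["period_end"] >= pd.Timestamp(min_period_end)
    keep_segment = df["segment"].isin(allowed_segments)
    return df.loc[keep_period & keep_segment].reset_index(drop=True)


# Tiny synthetic frame, invented purely for illustration.
demo = pd.DataFrame(
    {
        "period_end": pd.to_datetime(["2023-12-31", "2024-03-31", "2024-06-30"]),
        "segment": ["retail", "retail", "corporate"],
        "exposure": [100.0, 150.0, 200.0],
    }
)

# Only the 2024 retail row passes both conditions.
demo_filtered = sketch_filter(demo, "2024-01-01", ["retail"])
```

The two boolean masks are built separately so that either condition can be inspected on its own when a row unexpectedly drops out.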


In [None]:
filters = config.pipeline_config.filters
filtered = filter_raw_data(raw, filters)
filtered[['period_end', 'segment']].drop_duplicates().sort_values(['period_end', 'segment'])


In [None]:
# Total exposure for each period, segment, bucket transition, and term.
group_cols = ['period_end', 'segment', 'risk_bucket_start', 'risk_bucket_end', 'term_months']
summary = (
    filtered.groupby(group_cols)['exposure']
    .sum()
    .reset_index()
)
summary.head(10)
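
Because the summary is keyed by start and end bucket, one possible follow-up is to view it as exposure-weighted transition shares. The sketch below uses an invented frame that borrows the `risk_bucket_start`, `risk_bucket_end`, and `exposure` column names from above. The numbers are synthetic, and the row normalisation is an illustrative assumption, not necessarily how the pipeline builds its matrices.

```python
import pandas as pd

# Synthetic exposures by start and end bucket, invented for illustration.
demo_summary = pd.DataFrame(
    {
        "risk_bucket_start": ["A", "A", "B", "B"],
        "risk_bucket_end": ["A", "B", "A", "B"],
        "exposure": [80.0, 20.0, 10.0, 30.0],
    }
)

# Rows are start buckets, columns are end buckets.
matrix = demo_summary.pivot_table(
    index="risk_bucket_start",
    columns="risk_bucket_end",
    values="exposure",
    aggfunc="sum",
    fill_value=0.0,
)

# Divide each row by its total so every row sums to 1.
shares = matrix.div(matrix.sum(axis=1), axis=0)
```

On real data the same pivot would typically be applied per `period_end` and `segment`, for example inside a `groupby`, so that shares are not mixed across reporting periods.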


With the filters applied, we have a consistent view of the reporting periods from 2024 onward to analyse. Adjust the configuration in `config.PipelineFilters` to widen the window or focus on a particular segment, then rerun the cells above to repeat the exploration.
