## Notebook 2: 02_target_sampling_and_setup.ipynb

#### Author: Satveer Kaur
#### Date: 2025-10-19

#### Notebook Purpose:
This notebook begins the analytical development phase of the project by preparing the **clean data for efficient risk modeling and reporting**. This involves two steps: **Target Variable Definition** and **Analytical Sampling**.

**Primary Objective:** To transform the structurally clean data into a format ready for quantitative risk assessment and feature engineering.

**Key Deliverables:**
1. **Binary Target Variable** (`is_default`): Consolidation of the `loan_status` text categories into a 0/1 outcome (non-default/default) that defines the risk metric.
2. **Representative Sample:** A stratified 10% subset of the full data (2.26M records) so that development workflows (EDA, binning, visualization) run quickly.

**Input:** `clean_data_for_sampling.csv` (Output from Notebook 01).

**Output:** `sample_data_for_development.csv` (Input for Notebook 03).

#### 1. Setup and Data Ingestion
**Purpose:** To initialize the environment, set the pandas display options, and load the cleaned dataset (`clean_data_for_sampling.csv`) produced by the previous notebook. This establishes the primary DataFrame (`df`) for all subsequent transformations.

In [13]:
# Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split  # for stratified sampling

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.float_format','{:.2f}'.format)

# Load the cleaned dataset produced by Notebook 01
df = pd.read_csv('../data/processed/clean_data_for_sampling.csv', low_memory=False, parse_dates=['issue_date'])

print(f'Clean data loaded successfully, Total records: {len(df):,.0f}')
print(f'Initial Columns: {df.shape[1]}')

Clean data loaded successfully, Total records: 2,260,668
Initial Columns: 102
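
A quick structural check right after loading can catch a wrong or stale input file before any transformation runs. The sketch below is a minimal illustration: `check_required_columns` is a hypothetical helper that is not part of the original notebook, and it is demonstrated on a tiny stand-in frame rather than the real CSV. The column names are the ones this notebook relies on.

```python
import pandas as pd


def check_required_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are missing from `frame`."""
    return [col for col in required if col not in frame.columns]


# Tiny stand-in for the loaded data (illustrative values only)
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off"],
    "issue_date": pd.to_datetime(["2015-01-01", "2016-06-01"]),
})

# Columns this notebook needs at load time
missing = check_required_columns(toy, ["loan_status", "issue_date"])

# `is_default` does not exist yet at this stage, so it is reported as missing
missing_with_target = check_required_columns(
    toy, ["loan_status", "issue_date", "is_default"]
)

print(f"Missing at load time: {missing}")
print(f"Missing before target creation: {missing_with_target}")
```

On the real data, the same helper could be called on `df` and followed by an `assert not missing` to stop the notebook early.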


#### 2. Define the Target Variable `is_default`
**Purpose:** To consolidate the `loan_status` text categories into a binary target variable (`is_default`). Statuses that indicate loan non-performance (`Charged Off`, `Default`, and the charged-off "Does not meet the credit policy" category) are coded as default = 1. Every other status, including current and late loans, is coded as non-default = 0. This definition is the basis for quantifying and modeling credit risk.

In [14]:
df.loan_status.unique()

array(['Fully Paid', 'Current', 'Charged Off', 'In Grace Period',
       'Late (31-120 days)', 'Late (16-30 days)', 'Default',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)

In [15]:
# Statuses treated as default
default_statuses = [
    'Charged Off', 'Default', 'Does not meet the credit policy. Status:Charged Off'
]

# Create the is_default column: 1 if the status is in the default list, 0 otherwise
# (vectorized membership test instead of a row-wise apply)
df['is_default'] = df['loan_status'].isin(default_statuses).astype(int)

default_rate = df['is_default'].mean() * 100
print('Target Variable "is_default" created')
print(f'Observed Default Rate (ODR) in the full portfolio: {default_rate:.2f}%')

Target Variable "is_default" created
Observed Default Rate (ODR) in the full portfolio: 11.92%
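
One way to audit the mapping is to cross-tabulate each status against the target, so that every status visibly lands in exactly one class. This is a sketch on a small made-up frame (`toy` and its rows are illustrative, not project data). It assumes the same three default statuses listed above.

```python
import pandas as pd

default_statuses = [
    "Charged Off",
    "Default",
    "Does not meet the credit policy. Status:Charged Off",
]

# Illustrative rows only; the real notebook applies this to `df`
toy = pd.DataFrame({
    "loan_status": [
        "Fully Paid",
        "Current",
        "Charged Off",
        "Default",
        "Late (31-120 days)",
        "Does not meet the credit policy. Status:Charged Off",
    ]
})

# True -> 1 (default), False -> 0 (non-default)
toy["is_default"] = toy["loan_status"].isin(default_statuses).astype(int)

# Audit table: one row per status, one column per target value
audit = pd.crosstab(toy["loan_status"], toy["is_default"])

# Each status should have counts in only one of the two target columns
statuses_in_both_classes = int(((audit > 0).sum(axis=1) > 1).sum())

print(audit)
print(f"Statuses mapped to both classes: {statuses_in_both_classes}")
```

A table like this also shows that late and in-grace-period loans are counted as non-default under this definition.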


#### 3. Analytical Sampling (Stratified)
**Purpose:** To speed up iterative development (EDA, feature engineering) by creating a **statistically representative** 10% **sample** of the full dataset. Stratifying on `is_default` keeps the sample's **Observed Default Rate (ODR)** in line with the full population's ODR (up to rounding of the class counts), so features can be validated on the sample without distorting the default rate.

In [16]:
# Stratified sampling using train_test_split
df_sample, df_remainder = train_test_split(
    df,
    test_size=0.9,  # 90% goes to df_remainder; the other 10% is df_sample
    stratify=df['is_default'], # stratify by the binary target
    random_state=42
)

# Verification: compare the ODR of the full data with the ODR of the sample
full_default_rate = df['is_default'].mean() * 100
sample_default_rate = df_sample['is_default'].mean() * 100

print('Stratified Sample Created (10% of full data)')
print(f'Sample Size: {len(df_sample):,.0f} rows')
print(f'Full Portfolio ODR: {full_default_rate:.4f}%')
print(f'Sample ODR: {sample_default_rate:.4f}%')

Stratified Sample Created (10% of full data)
Sample Size: 226,066 rows
Full Portfolio ODR: 11.9151%
Sample ODR: 11.9151%
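
To illustrate what stratification does and does not guarantee, the sketch below runs a stratified `train_test_split` on synthetic data. The 12% default rate and the 10,000 rows are arbitrary assumptions, not project figures. Because each class is sampled in proportion, the sample rate can differ from the full rate only through rounding of the per-class counts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic binary target with roughly 12% ones (assumed rate, for illustration)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"is_default": (rng.random(10_000) < 0.12).astype(int)})

# Keep 10% as the sample, stratified on the target
toy_sample, toy_remainder = train_test_split(
    toy,
    train_size=0.1,
    stratify=toy["is_default"],
    random_state=42,
)

full_rate = toy["is_default"].mean()
sample_rate = toy_sample["is_default"].mean()  # computed on the sample, not on `toy`
gap = abs(full_rate - sample_rate)

print(f"Sample rows: {len(toy_sample):,}")
print(f"Full rate:   {full_rate:.4%}")
print(f"Sample rate: {sample_rate:.4%}")
print(f"Absolute gap: {gap:.6f}")
```

The gap shrinks as the sample grows, which is why the two rates are very close on a sample of several hundred thousand rows.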


#### 4. Checkpoint and Export Sample Data
**Purpose:** To save the development sample (`df_sample`), which includes the newly created binary target variable, to the processed data folder. This file is the input for all subsequent exploratory and feature engineering work in Notebook 03.

In [17]:
df_sample.to_csv('../data/processed/sample_data_for_development.csv', index=False)
print('Notebook 2 Complete. Development sample saved.')

Notebook 2 Complete. Development sample saved.
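
A round-trip read is a cheap way to confirm the checkpoint is usable downstream. The sketch below writes a toy frame to a temporary directory (not the project's `../data/processed/` folder) and checks that the row count, columns, and target rate survive the CSV round trip. The frame contents and the file name are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative stand-in for df_sample
toy_sample = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Default"],
    "is_default": [0, 1, 0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "sample_roundtrip.csv"  # hypothetical file name
    toy_sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare the reloaded frame with the frame that was written
same_shape = reloaded.shape == toy_sample.shape
same_columns = list(reloaded.columns) == list(toy_sample.columns)
same_rate = bool(reloaded["is_default"].mean() == toy_sample["is_default"].mean())

print(f"Shape preserved: {same_shape}")
print(f"Columns preserved: {same_columns}")
print(f"Target rate preserved: {same_rate}")
```

Datetime columns such as `issue_date` are written as text, so the reading notebook needs `parse_dates` again to restore them.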


#### 5. Summary and Next Steps
##### Summary
1. **Target Definition:** Defined the binary analytical target variable, `is_default`, by consolidating the `loan_status` categories.
2. **Observed Default Rate (ODR):** The full portfolio's ODR was calculated at `11.9151%`.
3. **Analytical Efficiency:** A 10% sample was created using stratified sampling on `is_default`, so the default rate is preserved in the sample.
4. **Sample Validation:** Stratification keeps the sample's ODR in line with the full portfolio's ODR (up to rounding of the class counts), supporting its use for all iterative development.

##### Next Steps: Feature Development and Exploration
The data is clean, the target is defined, and a representative sample is ready. The focus now shifts to exploring the data and developing the final risk segmentation features.

**Action:** Proceed to Notebook 03 to begin the iterative exploration phase.
1. **Exploratory Data Analysis (EDA):** Perform in-depth visualization of key risk features (FICO, DTI, Income) against the `is_default` target.
2. **Feature Engineering:** Develop and finalize the binning logic for the primary risk drivers to create the auditable segmentation features.
3. **Monotonicity Validation:** Generate charts and tables to formally validate the consistent separation of risk provided by the engineered features.