# 📊 Data Splitting Strategy

This notebook splits the processed transaction data for the fraud detection model into three sets: training, holdout (validation), and out-of-time (OOT) test. Keeping these sets separate lets the model be tuned on one set and evaluated on data it has never seen.

## 🎯 Key Components

1. **Out-of-Time (OOT) Split**
   - Separates data based on transaction date
   - Keeps every test transaction later than the training period, so no future information leaks into training
   - Helps evaluate model performance on future data

2. **Train/Validation Split**
   - Further splits training data into train and holdout sets
   - Uses stratified sampling to maintain class distribution
   - Enables model validation during training

3. **Feature Transformation**
   - Handles categorical variables
   - Handles unknown categories
   - Custom encoding enables control over how categorical variables are handled

## 📚 Import Libraries and Load Data <a name="import-libraries"></a>

In [5]:
# Import required libraries for data manipulation and model training
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Add project root to Python path for importing custom modules
import sys
sys.path.append('../')

# Import custom encoder for handling categorical variables
from utils.multicolumn_encoder import MultiColumnEncoder

# Load processed data
data = pd.read_parquet('../data/processed_credit_card_transactions.parquet')

# Set the transaction number as the index
data.set_index('trans_num', inplace=True)

## ⏰ Out-of-Time Split

Split data into training and out-of-time test sets based on transaction date.


### Split Criteria
- Training data: Transactions before July 2020
- OOT data: Transactions from July 2020 onwards

In [6]:
# Define split criteria based on transaction date
_expression = "trans_date_trans_time < '2020-07-01 00:00:00'"
print(f"Splitting dataframe based on expression {_expression!r}.")

# Split into training and OOT sets
train = data.query(_expression)
oot = data.query(f"~({_expression})")

# Separate features and target for OOT set
oot_y = oot['is_fraud']
oot_X = oot.drop(columns=['is_fraud'])

Splitting dataframe based on expression "trans_date_trans_time < '2020-07-01 00:00:00'".
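
A quick sanity check can confirm that the two sets partition the data with no temporal overlap. The sketch below uses a small synthetic frame (the rows and the `demo` names are hypothetical, not the real transactions) and a boolean mask that is equivalent to the query expression above.

```python
import pandas as pd

# Synthetic stand-in for the transactions table (hypothetical values).
demo = pd.DataFrame(
    {
        "trans_date_trans_time": pd.to_datetime(
            [
                "2020-05-30 10:00:00",
                "2020-06-30 23:59:59",
                "2020-07-01 00:00:00",
                "2020-08-15 12:30:00",
            ]
        ),
        "is_fraud": [0, 1, 0, 1],
    },
    index=pd.Index(["a", "b", "c", "d"], name="trans_num"),
)

cutoff = pd.Timestamp("2020-07-01 00:00:00")

# Boolean mask equivalent to the query expression used for the real split.
is_train = demo["trans_date_trans_time"] < cutoff
demo_train = demo[is_train]
demo_oot = demo[~is_train]

# The two sets must cover every row exactly once and must not overlap in time.
assert len(demo_train) + len(demo_oot) == len(demo)
assert demo_train["trans_date_trans_time"].max() < cutoff
assert demo_oot["trans_date_trans_time"].min() >= cutoff
```

One thing to watch for: in a datetime column, rows with a missing timestamp fail the `<` comparison, so the negated expression would place them in the OOT set.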


## 🔄 Train/Validation Split

Further split the training data into train and holdout sets, using stratified sampling so that both sets keep the same fraud rate as the full training data. The holdout set supports:

1. Model validation during training
2. Hyperparameter tuning

In [7]:
# Prepare training data by separating features and target
X = train.drop(columns=['is_fraud'])
y = train['is_fraud']

# Split into train and holdout sets with stratification
train_X, holdout_X, train_y, holdout_y = train_test_split(
    X, y, 
    test_size=0.2,  # 20% of data for holdout
    random_state=42,  # For reproducibility
    stratify=y  # Maintain class distribution
)
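
To see what `stratify` buys on an imbalanced target, the following sketch compares the positive rate before and after the split. The data is synthetic (a hypothetical 2% fraud rate and a made-up `amt` column), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 1,000 rows with a 2% positive rate (hypothetical).
demo_y = pd.Series(
    np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)], name="is_fraud"
)
demo_X = pd.DataFrame({"amt": np.arange(1000, dtype=float)})

# Same arguments as the real split: 20% holdout, fixed seed, stratified on the target.
tr_X, ho_X, tr_y, ho_y = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

# With stratification, the positive rate should be nearly identical in every set.
print(
    f"overall: {demo_y.mean():.3f}, "
    f"train: {tr_y.mean():.3f}, "
    f"holdout: {ho_y.mean():.3f}"
)
```

Without `stratify`, a rare positive class can end up noticeably over- or under-represented in the holdout set purely by chance.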


## 🔧 Transform Features

Encode categorical variables using our custom MultiColumnEncoder to prepare the data for model training. This ensures:

1. Proper handling of categorical features
2. Consistent encoding across all datasets
3. Handling of unknown categories

In [8]:
# Identify categorical columns in the dataset
categorical_columns = X.select_dtypes(include=['object']).columns.tolist()

# Initialize and fit encoder on training data
encoder = MultiColumnEncoder(categorical_columns)
train_X = encoder.fit_transform(train_X)

# Transform holdout and OOT sets using fitted encoder
holdout_X = encoder.transform(holdout_X)
oot_X = encoder.transform(oot_X)
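
The internals of `MultiColumnEncoder` are not shown in this notebook. As a rough illustration of the fit-on-train, transform-everywhere pattern, the sketch below uses scikit-learn's `OrdinalEncoder` as a stand-in (an assumption, not the project's encoder) and maps categories unseen during fitting to `-1`. The `encode` helper and the sample frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample frames; "gaming" never appears in the fitting data.
fit_df = pd.DataFrame(
    {
        "category": pd.Series(["grocery", "travel", "grocery"], dtype=object),
        "amt": [10.0, 250.0, 32.5],
    }
)
new_df = pd.DataFrame(
    {
        "category": pd.Series(["travel", "gaming"], dtype=object),
        "amt": [99.0, 5.0],
    }
)
cat_cols = ["category"]

# Stand-in encoder: unseen categories are encoded as -1 instead of raising.
stand_in = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)


def encode(df, enc, cols, fit=False):
    """Return a copy of df with the given columns replaced by their integer codes."""
    codes = enc.fit_transform(df[cols]) if fit else enc.transform(df[cols])
    return df.assign(**{col: codes[:, i] for i, col in enumerate(cols)})


fit_enc = encode(fit_df, stand_in, cat_cols, fit=True)  # learn categories here only
new_enc = encode(new_df, stand_in, cat_cols)            # reuse the learned mapping

print(new_enc)
```

Fitting only on the training split keeps the category mapping free of information from the holdout and OOT sets, which is why the other two sets are passed through `transform` rather than `fit_transform`.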

## 💾 Save Split Datasets

Save the processed datasets to parquet files for future use. This includes:

1. Training data
2. Holdout validation data
3. Out-of-time test data

In [10]:
# Combine features and targets for each dataset
train_data = pd.concat([train_X, train_y], axis=1)
holdout_data = pd.concat([holdout_X, holdout_y], axis=1)
oot_data = pd.concat([oot_X, oot_y], axis=1)

# Save datasets to parquet files for future use
train_data.to_parquet('../data/train_data.parquet')
holdout_data.to_parquet('../data/holdout_data.parquet')
oot_data.to_parquet('../data/oot_data.parquet')
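
Because `pd.concat(..., axis=1)` aligns rows by index, it silently produces NaN-padded rows if a transformer returns features with a different index than the target. A small guard before saving can catch that. The `combine` helper and the sample data below are hypothetical and only sketch the idea.

```python
import pandas as pd

# Hypothetical features and target sharing the same trans_num index.
feats = pd.DataFrame(
    {"amt": [10.0, 20.0]}, index=pd.Index(["t1", "t2"], name="trans_num")
)
target = pd.Series([0, 1], index=feats.index, name="is_fraud")


def combine(features, labels):
    """Concatenate features and labels column-wise, refusing misaligned indexes."""
    if not features.index.equals(labels.index):
        raise ValueError("feature and target indexes differ; concat would misalign rows")
    return pd.concat([features, labels], axis=1)


combined = combine(feats, target)

# A transformer that resets the index would trip the guard instead of corrupting rows.
try:
    combine(feats.reset_index(drop=True), target)
except ValueError as exc:
    print(exc)
```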