# Prediction Pipeline for Data Preprocessing

This notebook runs the complete four-stage data preprocessing pipeline for model prediction (inference).

**Pipeline Stages:**
1. Data Cleaning: Raw CSV → Cleaned CSV
2. Feature Engineering (Inference): Cleaned CSV → Clustered CSV (using pre-trained model)
3. Reorganize by Member: Clustered CSV → Organized by Member (Fraud=0)
4. Feature Encoding: Member-organized CSV → Final Encoded Dataset

**Key Differences from Training:**
- Stage 2: Uses pre-trained clustering model (no training)
- Stage 3: Simple reorganization with Fraud=0 (no fraud matching)
- No temp directory needed
- Simpler output structure

## 0. Import and Configuration

**New Approach**: All paths and parameters are configured in `config/pipeline_config.py`
- No more scattered path definitions
- Easy to switch between train/pred modes
- Centralized parameter management
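
The contents of `config/pipeline_config.py` are not shown in this notebook, so the following is only a minimal sketch of how such a centralized configuration could be organized. The class name `PipelineConfig` and its internals are hypothetical; the method names `get_path`, `get_stage_io`, and `create_directories` mirror the calls made later in this notebook, and the directory names follow the configuration printed below.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PipelineConfig:
    """Hypothetical stand-in for the object returned by get_pred_config()."""
    project_root: Path
    mode: str = "pred"  # "train" or "pred"
    paths: dict = field(default_factory=dict)

    def __post_init__(self):
        data_dir = self.project_root / "data" / self.mode
        # Directory names taken from the printed configuration in this notebook.
        for name, sub in [("raw", "raw"), ("cleaned", "cleaned"),
                          ("clustered", "clustered_out"),
                          ("by_member", "by_member"), ("final", "final")]:
            self.paths.setdefault(name, data_dir / sub)

    def get_path(self, name: str) -> Path:
        return self.paths[name]

    def get_stage_io(self, stage: int) -> dict:
        # Assumption: each stage reads the previous stage's output directory.
        order = ["raw", "cleaned", "clustered", "by_member", "final"]
        return {"input": self.paths[order[stage - 1]],
                "output": self.paths[order[stage]]}

    def create_directories(self) -> None:
        for path in self.paths.values():
            path.mkdir(parents=True, exist_ok=True)
```

With a layout like this, switching between train and pred modes only changes the `mode` field, and every stage derives its input and output directories from one place.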

In [2]:
import sys
import os
import pandas as pd
import importlib.util
from pathlib import Path

# Add project root to path (assumes the notebook runs from src/data_preprocess/,
# two levels below the project root)
PROJECT_ROOT = Path.cwd().parents[1]
sys.path.insert(0, str(PROJECT_ROOT))

# Import centralized configuration
from config.pipeline_config import get_pred_config

# Get prediction configuration
config = get_pred_config()
config.print_config()

# Create all necessary directories
print("\nCreating directories...")
config.create_directories()

print("\n✓ Configuration loaded successfully!")

ClearShield Pipeline Configuration - Mode: PRED

Project Root: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield

Data Paths:
  raw                 : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/raw
  cleaned             : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/cleaned
  clustered           : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/clustered_out
  by_member           : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/by_member
  final               : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/final
  model_dir           : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/src/data_preprocess/02_feature_engineering/02b_description_encoding
  cluster_model       : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/src/data_preprocess/02_feature_engineering/02b_description_encoding/global_cluster_model.pkl
  tokenize_config     : /Users/wwy/Documents/CMU/25-Fall/Practi

### Load Processing Modules

In [3]:
# Import data cleaning module
sys.path.insert(0, os.path.abspath('./01_data_cleaning'))
spec = importlib.util.spec_from_file_location(
    "data_cleaning",
    "./01_data_cleaning/01_data_cleaning.py"
)
dc = importlib.util.module_from_spec(spec)
spec.loader.exec_module(dc)

# Import feature engineering inference module
spec = importlib.util.spec_from_file_location(
    "inference_stage2",
    "02_feature_engineering/inference_stage2.py"
)
fe_inf = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fe_inf)

# Import stage 3 inference module (reorganize by member; lives in 03_fraud_relabeling/)
spec = importlib.util.spec_from_file_location(
    "inference_stage3",
    "03_fraud_relabeling/inference_stage3.py"
)
fr_inf = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fr_inf)

# Import encoding inference module
spec = importlib.util.spec_from_file_location(
    "inference_stage4",
    "04_encoding/inference_stage4.py"
)
enc_inf = importlib.util.module_from_spec(spec)
spec.loader.exec_module(enc_inf)

print("✓ All modules loaded successfully!")

✓ All modules loaded successfully!
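
The four loading blocks above repeat the same three `importlib` calls. This pattern is needed because file and directory names such as `01_data_cleaning.py` start with a digit and cannot be used in a regular `import` statement. A small helper could remove the repetition; `load_module` is a hypothetical name, not part of the project.

```python
import importlib.util


def load_module(name: str, path: str):
    """Load a Python source file as a module, even if its filename
    is not a valid identifier (e.g. starts with a digit)."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


# Usage with the paths from the cell above:
# dc = load_module("data_cleaning", "./01_data_cleaning/01_data_cleaning.py")
# fe_inf = load_module("inference_stage2", "02_feature_engineering/inference_stage2.py")
```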


### Configure Modules Using Centralized Config

In [4]:
# Configure data cleaning
dc.ENABLE_RENAMING = True
dc.RAW_DIR = str(config.get_path('raw'))
dc.CLEANED_DIR = str(config.get_path('cleaned'))

# Get stage I/O paths
stage2_io = config.get_stage_io(2)
stage3_io = config.get_stage_io(3)
stage4_io = config.get_stage_io(4)

# Configure inference stage 3
fr_inf.INPUT_DIR = str(stage3_io['input'])
fr_inf.OUTPUT_MEMBER_DIR = str(stage3_io['output'])

# Configure inference stage 4
enc_inf.PROCESSED_DIR = str(stage4_io['input'])
enc_inf.OUTPUT_DIR = str(stage4_io['output'])
enc_inf.CONFIG_PATH = str(config.get_path('tokenize_config'))

# Model paths
cluster_model_path = str(config.get_path('cluster_model'))
tokenize_config_path = str(config.get_path('tokenize_config'))

print("✓ All modules configured successfully!")
print("\nPipeline will process:")
print(f"  Raw data:     {dc.RAW_DIR}")
print(f"  Final output: {enc_inf.OUTPUT_DIR}")
print("\nUsing pre-trained models:")
print(f"  Cluster model: {cluster_model_path}")

✓ All modules configured successfully!

Pipeline will process:
  Raw data:     /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/raw
  Final output: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/final

Using pre-trained models:
  Cluster model: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/src/data_preprocess/02_feature_engineering/02b_description_encoding/global_cluster_model.pkl


---

## 1. Data Cleaning

This cell performs the following preprocessing tasks:
1. Standardize headers (e.g., "AccountID" → "Account ID")
2. Fix comma issues (remove extra commas in field values)
3. Clean Amount field (remove $ and commas, convert to numeric)
4. Fill missing values (missing Amount → 0, other missing fields → "Unknown"; the literal string "null" → empty)
5. Rename files based on date range (MM-DD-YYYY_to_MM-DD-YYYY.csv)

**Data Flow**: `data/pred/raw/` → `data/pred/cleaned/`
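
The cleaning rules listed above can be sketched roughly as follows. This is an illustration, not the code in `01_data_cleaning.py`: the function names are hypothetical, the sketch is written without pandas to stay dependency-free, and it assumes that raw dates look like `MM/DD/YYYY` and that a "null" string is first treated as empty and then filled with "Unknown".

```python
import re
from datetime import datetime


def standardize_header(name: str) -> str:
    """Insert a space at lower-to-upper case boundaries: 'AccountID' -> 'Account ID'."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name.strip())


def clean_amount(value) -> float:
    """Strip '$' and thousands separators; missing or unparseable values become 0."""
    text = str(value or "").replace("$", "").replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        return 0.0


def fill_missing(row: dict) -> dict:
    """Apply the fill rules to one row (a dict keyed by column name)."""
    out = {}
    for key, value in row.items():
        if key == "Amount":
            out[key] = clean_amount(value)
        else:
            text = "" if value is None else str(value).strip()
            # Assumption: "null" counts as empty, and empty becomes "Unknown".
            out[key] = text if text and text.lower() != "null" else "Unknown"
    return out


def date_range_filename(dates, fmt="%m/%d/%Y") -> str:
    """Build the MM-DD-YYYY_to_MM-DD-YYYY.csv name from a file's date column."""
    parsed = [datetime.strptime(d, fmt) for d in dates]
    return f"{min(parsed):%m-%d-%Y}_to_{max(parsed):%m-%d-%Y}.csv"
```

For a file whose dates all fall on one day, the start and end of the range coincide, which matches the `01-01-2025_to_01-01-2025.csv` name in the output below.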

In [5]:
dc.main()

Raw: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/raw
Cleaned: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/cleaned

Found 1 CSV files in /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/raw

CSV Files List ⬇️
  1. test.csv (0.06 MB)

Processing Files...

[1/1] test.csv... Missing:31, →01-01-2025_to_01-01-2025.csv

Processing Complete!


---

## 2. Feature Engineering (Inference Mode)

This stage applies the pre-trained clustering model to new data:

1. **BERT Encoding**: Encode "Transaction Description" with the same BERT-tiny model used in training (`prajjwal1/bert-tiny`)
2. **Apply Pre-trained PCA**: Transform using saved PCA model
3. **Apply Pre-trained Clustering**: Assign cluster IDs using saved KMeans model
4. **Add Cluster ID**: Append `cluster_id` column to each CSV file

**Data Flow**: `data/pred/cleaned/` → `data/pred/clustered_out/`

**Note**: This uses the pre-trained model saved during training. No model training occurs.
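
To make steps 2 and 3 concrete, here is a dependency-free sketch of the arithmetic behind applying an already fitted PCA and KMeans model: center each embedding with the training mean, project it onto the saved components, then pick the nearest saved centroid. The internals of `inference_stage2.py` and of the saved `.pkl` file are not shown here, so the function names and the toy numbers are purely illustrative; this mirrors what a non-whitened PCA transform followed by a nearest-centroid prediction computes.

```python
def pca_transform(x, mean, components):
    """Project one embedding: subtract the training mean, then take the
    dot product with each saved principal component."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(c * v for c, v in zip(comp, centered)) for comp in components]


def assign_cluster(z, centroids):
    """Return the index of the nearest centroid (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(z, centroids[k]))


# Toy "saved model": 3-dimensional embeddings, 2 components, 2 clusters.
mean = [1.0, 1.0, 1.0]
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]

embeddings = [[1.5, 0.5, 9.0], [6.0, 7.0, 1.0]]
cluster_ids = [assign_cluster(pca_transform(e, mean, components), centroids)
               for e in embeddings]
print(cluster_ids)  # [0, 1]
```

Nothing is fitted at inference time: the mean, components, and centroids are all read from the saved model, which is why new data lands in the same cluster space as the training data.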

In [6]:
# Check if model exists
if not os.path.exists(cluster_model_path):
    raise FileNotFoundError(f"Pre-trained model not found: {cluster_model_path}")

# Apply clustering using pre-trained model
output_files = fe_inf.infer_clusters_for_directory(
    input_dir=str(stage2_io['input']),
    output_dir=str(stage2_io['output']),
    model_path=cluster_model_path,
    text_column=config.feature_engineering['text_column'],
    batch_size=config.feature_engineering['batch_size'],
    verbose=True
)

[Inference] Found 1 CSV files

[1/1] Processing: 01-01-2025_to_01-01-2025.csv
[Load Model] /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/src/data_preprocess/02_feature_engineering/02b_description_encoding/global_cluster_model.pkl
  BERT Model: prajjwal1/bert-tiny
  PCA Components: 20
  KMeans Clusters: 60
[Process] /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/cleaned/01-01-2025_to_01-01-2025.csv
  Rows: 537
  Device: mps
  Encoding: 537/537 (100.0%)
  Applying PCA...
  Assigning clusters...
[Done] Saved → /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/clustered_out/01-01-2025_to_01-01-2025.csv
  Cluster distribution: min=0, max=59, unique=55

[Complete] Processed 1/1 files


---

## 3. Reorganize by Member (Inference Mode)

This stage reorganizes transactions by member and sets Fraud=0 for all transactions:

1. **Group by Member**: Reorganize transactions by Member ID
2. **Set Fraud=0**: All transactions are assumed non-fraudulent
3. **Save by Member**: Create individual member files

**Data Flow**: `data/pred/clustered_out/` → `data/pred/by_member/`

**Note**: No fraud matching is performed. All transactions are labeled as Fraud=0.
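
A rough sketch of the grouping logic, under stated assumptions: rows are plain dicts, the member column is called `Member ID`, and dates and times look like `MM/DD/YYYY` and `HH:MM:SS`. The function name is hypothetical and this is not the code in `inference_stage3.py`; the sort by Post Date and Post Time follows the stage's log output below.

```python
from collections import defaultdict
from datetime import datetime


def reorganize_by_member(rows, member_key="Member ID",
                         fmt="%m/%d/%Y %H:%M:%S"):
    """Group rows by member, add Fraud=0, and sort each group chronologically."""
    grouped = defaultdict(list)
    for row in rows:
        # No fraud matching at inference time: every row gets Fraud=0.
        grouped[row[member_key]].append(dict(row, Fraud=0))

    def timestamp(r):
        # Assumed date/time format; adjust fmt to match the cleaned files.
        return datetime.strptime(f"{r['Post Date']} {r['Post Time']}", fmt)

    for member_rows in grouped.values():
        member_rows.sort(key=timestamp)
    return dict(grouped)


rows = [
    {"Member ID": "A", "Post Date": "01/02/2025", "Post Time": "09:00:00"},
    {"Member ID": "B", "Post Date": "01/01/2025", "Post Time": "10:00:00"},
    {"Member ID": "A", "Post Date": "01/01/2025", "Post Time": "23:30:00"},
]
by_member = reorganize_by_member(rows)
```

Each value in `by_member` would then be written to its own member file in `data/pred/by_member/`.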

In [7]:
# Reorganize by member with Fraud=0
num_members = fr_inf.main()

STAGE 3-1: FRAUD RELABELING & REORGANIZE BY MEMBER
Input directory:  /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/clustered_out
Output directory: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/by_member
Found 1 CSV file(s)

[1/1] Processing 01-01-2025_to_01-01-2025.csv

Modified 300 member file(s) this run
Sorting modified member files by Post Date / Post Time...
  Sorted 300/300 member file(s)

STAGE 3-1 COMPLETE
Member files modified & sorted: 300
Output location: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/by_member


---

## 4. Feature Encoding

This stage encodes categorical features and prepares the final dataset for model prediction:

1. **Remove ID Columns**: Delete Account ID and Member ID
2. **Encode Categorical Features**: Convert categorical columns to numeric using predefined dictionary
   - Account Type, Action Type, Source Type, Product ID
3. **Parse Time Features**: Convert Post Time to decimal hours
4. **Convert Date Features**: Parse Post Date and Account Open Date
5. **Clean Up**: Remove text columns (Transaction Description, Fraud Adjustment Indicator)

**Data Flow**: `data/pred/by_member/` → `data/pred/final/`
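
The encoding steps can be illustrated with a short sketch. The structure of `tokenize_dict.json` is not shown in this notebook, so the nested `{column: {category: code}}` layout, the `-1` code for unseen categories, and the `HH:MM:SS` time format are all assumptions, and the function names are hypothetical. Date parsing (step 4) is left out because the target representation is not specified.

```python
# Columns removed in steps 1 and 5 above.
DROP_COLUMNS = ("Account ID", "Member ID",
                "Transaction Description", "Fraud Adjustment Indicator")


def time_to_decimal_hours(text: str) -> float:
    """Convert 'HH:MM:SS' to hours as a float, e.g. '13:30:00' -> 13.5."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours + minutes / 60 + seconds / 3600


def encode_row(row: dict, encoding: dict, unknown: int = -1) -> dict:
    """Drop ID/text columns, map categoricals to codes, convert Post Time."""
    out = {}
    for key, value in row.items():
        if key in DROP_COLUMNS:
            continue
        if key in encoding:
            # Assumption: categories missing from the dictionary map to `unknown`.
            out[key] = encoding[key].get(value, unknown)
        elif key == "Post Time":
            out[key] = time_to_decimal_hours(value)
        else:
            out[key] = value
    return out


# Hypothetical excerpt of an encoding dictionary.
encoding = {"Account Type": {"Checking": 0, "Savings": 1}}
row = {"Member ID": "A", "Account Type": "Checking",
       "Post Time": "06:15:00", "Amount": 12.0}
print(encode_row(row, encoding))
# {'Account Type': 0, 'Post Time': 6.25, 'Amount': 12.0}
```

Using the same dictionary at training and inference time keeps the integer codes consistent, which is why this stage loads it from the shared config path rather than rebuilding it from the new data.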

In [8]:
# Encode features for prediction
total_processed = enc_inf.encode_features(
    enc_inf.PROCESSED_DIR,
    enc_inf.OUTPUT_DIR,
    enc_inf.CONFIG_PATH
)

STAGE 4: FEATURE ENCODING (INFERENCE)
Input Dir: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/by_member
Output Dir: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/final
Config Path: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/config/tokenize_dict.json

Loaded encoding dictionary with 4 features

Found 300 member file(s)
  Processed 100/300 files
  Processed 200/300 files
  Processed 300/300 files

STAGE 4 COMPLETE (INFERENCE)
Total files processed: 300/300
Output location: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/pred/final


---

## Pipeline Complete!

The complete prediction preprocessing pipeline consists of 4 stages:

1. **Data Cleaning**: Raw CSV → Cleaned CSV (`data/pred/cleaned/`)
2. **Feature Engineering (Inference)**: Cleaned CSV → Clustered CSV (`data/pred/clustered_out/`)
3. **Reorganize by Member**: Clustered CSV → Organized by Member (`data/pred/by_member/`)
4. **Feature Encoding**: Member-organized CSV → Final Encoded Dataset (`data/pred/final/`)

**Final Output**: `data/pred/final/member_*.csv`

These final encoded files are ready for model prediction!