# Training Pipeline for Data Preprocessing

Complete 4-stage data preprocessing pipeline for model training.

**Pipeline Stages:**
1. Data Cleaning: Raw CSV → Cleaned CSV
2. Feature Engineering: Cleaned CSV → Clustered CSV (with BERT encoding)
3. Fraud Matching: Clustered CSV → Categorized by Member
4. Feature Encoding: Processed CSV → Final Encoded Dataset

## 0. Import and Configuration

**New Approach**: All paths and parameters are configured in `config/pipeline_config.py`
- No more scattered path definitions
- Easy to switch between train/pred modes
- Centralized parameter management
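
For readers who have not opened `config/pipeline_config.py`, a centralized configuration of this kind might look roughly like the sketch below. The class name `PipelineConfig`, its fields, and the directory mapping are assumptions inferred from how this notebook calls `get_path` and `create_directories`; the project's actual module may be organized differently.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PipelineConfig:
    """Hypothetical stand-in for the object returned by get_train_config()."""
    project_root: Path
    mode: str = "train"  # "train" or "pred"
    # Stage name -> sub-directory under data/<mode>/ (layout assumed from the notebook output)
    subdirs: dict = field(default_factory=lambda: {
        "raw": "raw",
        "cleaned": "cleaned",
        "clustered": "clustered_out",
        "by_member": "by_member",
        "final": "final",
    })

    def get_path(self, key: str) -> Path:
        """Resolve a stage name to a directory under the project root."""
        return self.project_root / "data" / self.mode / self.subdirs[key]

    def create_directories(self) -> None:
        """Create every configured directory, ignoring ones that already exist."""
        for key in self.subdirs:
            self.get_path(key).mkdir(parents=True, exist_ok=True)


# Switching between train and pred mode only changes one field
demo_config = PipelineConfig(project_root=Path("/tmp/clearshield_demo"), mode="pred")
print(demo_config.get_path("clustered"))
```

Keeping every path behind one `get_path` call is what lets the stage modules below be reconfigured without editing their source.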

In [1]:
import sys
import os
import pandas as pd
import importlib.util
from pathlib import Path

# Add project root to path
PROJECT_ROOT = Path.cwd().parents[1]
sys.path.insert(0, str(PROJECT_ROOT))

# Import centralized configuration
from config.pipeline_config import get_train_config

# Get training configuration
config = get_train_config()
config.print_config()

# Create all necessary directories
print("\nCreating directories...")
config.create_directories()

print("\n✓ Configuration loaded successfully!")

ClearShield Pipeline Configuration - Mode: TRAIN

Project Root: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield

Data Paths:
  raw                 : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/raw
  cleaned             : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/cleaned
  clustered           : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/clustered_out
  by_member           : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member
  final               : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/final
  by_member_temp      : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member/temp
  by_member_matched   : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member/matched
  by_member_unmatched : /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member/unmatched
  by_member_no_fraud  : /Users/wwy/Documents/CMU/25-F

### Load Processing Modules

In [2]:
def load_module(name, path):
    """Load a stage script by file path (the filenames start with digits, so a plain import will not work)."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Add the data cleaning folder to the import path
sys.path.insert(0, os.path.abspath('./01_data_cleaning'))

# Load the four stage modules
dc = load_module("data_cleaning", "./01_data_cleaning/01_data_cleaning.py")
fe = load_module("feature_engineering", "02_feature_engineering/02_feature_engineering.py")
fr = load_module("fraud_relabeling", "03_fraud_relabeling/03_fraud_relabeling.py")
enc = load_module("encoding", "04_encoding/04_encoding.py")

print("✓ All modules loaded successfully!")

✓ All modules loaded successfully!


### Configure Modules Using Centralized Config

In [3]:
# Configure data cleaning
dc.ENABLE_RENAMING = True
dc.RAW_DIR = str(config.get_path('raw'))
dc.CLEANED_DIR = str(config.get_path('cleaned'))

# Configure feature engineering
fe.PROCESSED_DIR = str(config.get_path('cleaned'))
fe.MODEL_NAME = config.feature_engineering['model_name']
fe.TEXT_COLUMN = config.feature_engineering['text_column']
fe.BATCH_SIZE = config.feature_engineering['batch_size']
fe.MAX_LENGTH = config.feature_engineering['max_length']
fe.PCA_DIM = config.feature_engineering['pca_dim']
fe.MIN_K = config.feature_engineering['min_k']
fe.MAX_K = config.feature_engineering['max_k']
fe.K_STEP = config.feature_engineering['k_step']
fe.SAMPLE_SIZE = config.feature_engineering['sample_size']
fe.CLUSTER_BATCH_SIZE = config.feature_engineering['cluster_batch_size']
fe.RANDOM_STATE = config.feature_engineering['random_state']

# Configure fraud relabeling
fr.INPUT_DIR = str(config.get_path('clustered'))
fr.OUTPUT_MEMBER_DIR = str(config.get_path('by_member_temp'))
fr.OUTPUT_PROCESSED_DIR = str(config.get_path('by_member'))
fr.CHUNKSIZE = config.fraud_matching['chunksize']
fr.MIN_HISTORY_LENGTH = config.fraud_matching['min_history_length']

# Configure encoding
enc.PROCESSED_DIR = str(config.get_path('by_member'))
enc.OUTPUT_DIR = str(config.get_path('final'))
enc.CONFIG_PATH = str(config.get_path('tokenize_config'))

print("✓ All modules configured successfully!")
print(f"\nPipeline will process:")
print(f"  Raw data:     {dc.RAW_DIR}")
print(f"  Final output: {enc.OUTPUT_DIR}")

✓ All modules configured successfully!

Pipeline will process:
  Raw data:     /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/raw
  Final output: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/final


---

## 1. Data Cleaning

This cell performs the following preprocessing tasks:
1. Standardize headers (e.g., "AccountID" → "Account ID")
2. Fix comma issues (remove extra commas in field values)
3. Clean Amount field (remove $ and commas, convert to numeric)
4. Fill missing values (Amount→0, others→"Unknown", "null"→empty)
5. Rename files based on date range (MM-DD-YYYY_to_MM-DD-YYYY.csv)

**Data Flow**: `data/train/raw/` → `data/train/cleaned/`
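
As an illustration of tasks 1, 3, and 4 above, the following sketch shows per-value helpers that behave the way the list describes. The function names are hypothetical and are not taken from `01_data_cleaning.py`, and the handling of `"null"` reflects one reading of the rule (the literal string becomes empty, while genuinely missing text becomes `"Unknown"`).

```python
import math
import re


def standardize_header(name: str) -> str:
    """Insert a space before a trailing 'ID' glued to a word, e.g. 'AccountID' -> 'Account ID'."""
    return re.sub(r"(?<=[a-z])ID\b", " ID", name.strip())


def clean_amount(value) -> float:
    """Strip '$' and thousands separators; missing or unparseable values become 0."""
    if value is None:
        return 0.0
    text = str(value).replace("$", "").replace(",", "").strip()
    try:
        number = float(text)
    except ValueError:
        return 0.0
    # float("nan") parses successfully, so treat it as missing too
    return 0.0 if math.isnan(number) else number


def fill_text(value) -> str:
    """Assumed rule: missing text -> 'Unknown', the literal string 'null' -> empty."""
    if value is None or str(value).strip() == "":
        return "Unknown"
    if str(value).strip().lower() == "null":
        return ""
    return str(value)


print(standardize_header("AccountID"), clean_amount("$1,234.50"), repr(fill_text("null")))
```

With pandas, helpers like these could be applied column by column, for example `df["Amount"].map(clean_amount)`.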

In [6]:
dc.main()

Raw: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/raw
Cleaned: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/cleaned

Found 1 CSV files in /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/raw

CSV Files List ⬇️
  1. test.csv (0.06 MB)

Processing Files...

[1/1] test.csv... Missing:31, →01-01-2025_to_01-01-2025.csv

Processing Complete!


---

## 2. Feature Engineering: Description Encoding and Clustering

This stage performs advanced feature engineering on transaction descriptions:

1. **BERT Encoding**: Use BERT-tiny model to encode "Transaction Description" text into embeddings
2. **Dimensionality Reduction**: Apply PCA to reduce embedding dimensions (default: 20D)
3. **Automatic Clustering**: Find optimal cluster count (k) via heuristic search and cluster with MiniBatchKMeans
4. **Add Cluster ID**: Append `cluster_id` column to each CSV file

**Data Flow**: `data/train/cleaned/` → `data/train/clustered_out/`

**Note**: This step is compute-intensive (BERT inference on GPU or CPU) and may take significant time on large datasets.
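
Steps 2 and 3 can be sketched with scikit-learn as follows. The notebook does not say which heuristic `02_feature_engineering.py` uses to choose k, so this sketch assumes the silhouette score; the function name `reduce_and_cluster` is hypothetical, and random blobs stand in for real BERT embeddings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def reduce_and_cluster(embeddings, pca_dim=20, min_k=10, max_k=60, k_step=10,
                       sample_size=1000, batch_size=1024, random_state=42):
    """Reduce embeddings with PCA, then pick k by silhouette score (assumed heuristic)."""
    n_samples, n_features = embeddings.shape
    # PCA cannot return more components than min(n_samples, n_features)
    n_components = min(pca_dim, n_samples, n_features)
    reduced = PCA(n_components=n_components, random_state=random_state).fit_transform(embeddings)

    best_k, best_score, best_labels = None, -1.0, None
    for k in range(min_k, max_k + 1, k_step):
        if k >= n_samples:  # the silhouette score needs fewer clusters than samples
            break
        model = MiniBatchKMeans(n_clusters=k, batch_size=batch_size,
                                random_state=random_state, n_init=3)
        labels = model.fit_predict(reduced)
        if len(set(labels)) < 2:  # the silhouette score is undefined for one cluster
            continue
        score = silhouette_score(reduced, labels,
                                 sample_size=min(sample_size, n_samples),
                                 random_state=random_state)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Synthetic stand-in for embeddings: three well-separated blobs in 8 dimensions
rng = np.random.default_rng(0)
fake_embeddings = np.vstack([rng.normal(c, 0.5, size=(50, 8)) for c in (0.0, 10.0, 20.0)])
best_k, cluster_ids = reduce_and_cluster(fake_embeddings, pca_dim=2, min_k=2, max_k=6, k_step=1)
print(best_k, len(cluster_ids))
```

Scoring on a sample (the `sample_size` argument) keeps the search affordable, since the silhouette score is quadratic in the number of points it evaluates.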

In [7]:
outputs = fe.main()

STAGE 2: DESCRIPTION ENCODING AND CLUSTERING
Input: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/cleaned
Output: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/clustered_out
Model: prajjwal1/bert-tiny
Text Column: Transaction Description
PCA Dimensions: 20
Cluster Range: 10-60 (step 10)

[Scan] Found 1 CSV file(s) in /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/cleaned


                            

[Done] Saved 1 clustered file(s) to /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/clustered_out

STAGE 2 COMPLETE
Processed 1 files
Output location: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/clustered_out




---

## 3. Fraud Matching and Re-labeling

This stage matches fraud adjustments (refund records) to the original transactions and re-labels them, in two steps:

1. **Reorganize by Member**: Group all transactions by Member ID into individual files (temp directory)
2. **Match Fraud Adjustments**: For members with at least 10 transactions, find and mark the original fraudulent transaction for each refund record
   - Match by amount and date (date extracted from the description, or a 30-day range)
   - Prevent duplicate matching
   - Categorize as matched/unmatched/no_fraud
   - Automatically delete temp directory after processing

**Data Flow**: `data/train/clustered_out/` → `data/train/by_member/[matched|unmatched|no_fraud]/`
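
The matching rule can be illustrated with a small, self-contained sketch. It covers only the 30-day range case (not date extraction from the description), and it assumes the window looks backward from the refund date and that ties go to the most recent candidate; neither detail is confirmed by the notebook. The function and field names are hypothetical.

```python
from datetime import date, timedelta


def match_refunds(transactions, refunds, window_days=30):
    """Pair each refund with one earlier transaction of equal amount.

    transactions: list of dicts with 'id', 'amount', 'date'
    refunds:      list of dicts with 'amount', 'date'
    Returns (matches, unmatched): matches maps refund index -> transaction id,
    unmatched lists the refund indexes with no candidate.
    """
    used = set()  # transaction ids already claimed, which prevents duplicate matching
    matches, unmatched = {}, []
    for index, refund in enumerate(refunds):
        earliest = refund["date"] - timedelta(days=window_days)
        candidates = [
            t for t in transactions
            if t["id"] not in used
            and round(t["amount"], 2) == round(refund["amount"], 2)
            and earliest <= t["date"] <= refund["date"]
        ]
        if candidates:
            # Assumed tie-break: take the most recent candidate
            chosen = max(candidates, key=lambda t: t["date"])
            used.add(chosen["id"])
            matches[index] = chosen["id"]
        else:
            unmatched.append(index)
    return matches, unmatched


txns = [
    {"id": 1, "amount": 50.0, "date": date(2025, 1, 3)},
    {"id": 2, "amount": 50.0, "date": date(2025, 1, 10)},
    {"id": 3, "amount": 20.0, "date": date(2024, 11, 1)},
]
refunds = [
    {"amount": 50.0, "date": date(2025, 1, 15)},
    {"amount": 50.0, "date": date(2025, 1, 20)},
    {"amount": 20.0, "date": date(2025, 1, 20)},
]
matches, unmatched = match_refunds(txns, refunds)
print(matches, unmatched)
```

Here the two 50.00 refunds claim different transactions because of the `used` set, and the 20.00 refund stays unmatched because its only candidate falls outside the window.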

### Stage 3-1: Reorganize transactions by member

**Data Flow**: `data/train/clustered_out/` → `data/train/by_member/temp/` (temporary)
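
Conceptually, the reorganization appends every row to a per-member file. The sketch below shows that idea with the standard `csv` module; it is not the code in `03_fraud_relabeling.py`, it omits the chunked reading and sorting that the real stage reports, and the function name `split_by_member` is hypothetical. The `member_<id>.csv` naming follows the glob pattern used later in this notebook.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_member(input_csv, output_dir, member_column="Member ID"):
    """Append each row of input_csv to output_dir/member_<id>.csv; return the member count."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    rows_by_member = defaultdict(list)
    with open(input_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_member[row[member_column]].append(row)

    for member_id, rows in rows_by_member.items():
        target = output_dir / f"member_{member_id}.csv"
        is_new = not target.exists()  # write the header only once per file
        with open(target, "a", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            if is_new:
                writer.writeheader()
            writer.writerows(rows)
    return len(rows_by_member)


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "clustered.csv"
    source.write_text("Member ID,Amount\n101,5.00\n102,7.50\n101,3.25\n")
    out_dir = Path(tmp) / "by_member_temp"
    member_count = split_by_member(source, out_dir)
    created = sorted(p.name for p in out_dir.glob("member_*.csv"))
    first_member_rows = (out_dir / "member_101.csv").read_text().splitlines()
print(member_count, created)
```

Opening files in append mode is what allows several input CSVs to contribute rows to the same member file across runs.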

In [8]:
num_members = fr.run_stage1()

STAGE 1: DATA REORGANIZATION
Input: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/clustered_out
Output: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member/temp

Found 1 files
Processing 1/1: 01-01-2025_to_01-01-2025.csv
Modified 300 member files this run
Sorting modified files...

300 member files created



### Check member transaction distribution

In [None]:
from glob import glob

# Get member files and count transactions
# (run this before Stage 3-2, which deletes the temp directory)
member_files = glob(os.path.join(fr.OUTPUT_MEMBER_DIR, 'member_*.csv'))
counts = [len(pd.read_csv(f)) for f in member_files]

# Calculate statistics
threshold = fr.MIN_HISTORY_LENGTH
total_count = len(counts)
above_n = sum(1 for c in counts if c >= threshold)
below_n = total_count - above_n

# Print results (guard against an empty directory to avoid dividing by zero)
print(f"Threshold set to: {threshold}")
if total_count == 0:
    print(f"No member files found in {fr.OUTPUT_MEMBER_DIR}")
else:
    print(f"Members with >= {threshold} transactions: {above_n:,} ({above_n / total_count:.2%})")
    print(f"Members with < {threshold} transactions: {below_n:,} ({below_n / total_count:.2%})")

### Stage 3-2: Fraud detection and matching

Filter members with minimum history length (≥10 transactions), then match fraud adjustments to original transactions.

**Data Flow**: `data/train/by_member/temp/` → `data/train/by_member/[matched|unmatched|no_fraud]/`

**Note**: The temp directory will be automatically deleted after processing completes.

In [9]:
stats = fr.run_stage2(fr.MIN_HISTORY_LENGTH)

STAGE 2: FRAUD DETECTION
Input: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member/temp
Output: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member
Min History Length: 10

Found 300 member files
Filtering: only processing members with >= 10 transactions
Summary saved to: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member/member_summary.csv

Cleaned up temporary directory: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member/temp

Processing Summary:
  Total Processed: 1
  Skipped (< 10 txns): 299
  No Fraud: 1
  Matched: 0
  Unmatched: 0

COMPLETE


---

## 4. Feature Encoding

This stage encodes categorical features and prepares the final dataset for model training:

1. **Remove ID Columns**: Delete Account ID and Member ID
2. **Encode Categorical Features**: Convert categorical columns to numeric using predefined dictionary
   - Account Type, Action Type, Source Type, Product ID
3. **Parse Time Features**: Convert Post Time to decimal hours
4. **Convert Date Features**: Parse Post Date and Account Open Date
5. **Clean Up**: Remove text columns (Transaction Description, Fraud Adjustment Indicator)

**Data Flow**: `data/train/by_member/[matched|unmatched|no_fraud]/` → `data/train/final/[matched|unmatched|no_fraud]/`
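
Steps 2 and 3 amount to a dictionary lookup and a small time conversion. The sketch below assumes Post Time arrives as `HH:MM` or `HH:MM:SS` text and uses a made-up mapping, `ACCOUNT_TYPE_CODES`, as a placeholder for the real dictionary in `config/tokenize_dict.json`; the fallback code for unseen categories is also an assumption.

```python
def post_time_to_decimal_hours(text: str) -> float:
    """Convert 'HH:MM' or 'HH:MM:SS' to hours as a float (format assumed)."""
    parts = [int(p) for p in text.strip().split(":")]
    hours, minutes, seconds = (parts + [0, 0])[:3]  # pad missing minutes/seconds with 0
    return hours + minutes / 60 + seconds / 3600


def encode_category(value, mapping, unknown=-1):
    """Look up a categorical value; unseen values get the `unknown` code (assumed fallback)."""
    return mapping.get(value, unknown)


# Hypothetical mapping for illustration only
ACCOUNT_TYPE_CODES = {"Checking": 0, "Savings": 1}

print(post_time_to_decimal_hours("14:30"))
print(encode_category("Savings", ACCOUNT_TYPE_CODES))
print(encode_category("Brokerage", ACCOUNT_TYPE_CODES))
```

Encoding from a fixed dictionary, rather than fitting an encoder on each file, keeps the codes consistent across member files and between train and pred modes.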

In [10]:
total_processed = enc.encode_features(enc.PROCESSED_DIR, enc.OUTPUT_DIR, enc.CONFIG_PATH)

FEATURE ENCODING
Input Dir: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/by_member
Output Dir: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/train/final
Config Path: /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/config/tokenize_dict.json

Loaded encoding dictionary with 4 features

matched: No member files found

unmatched: No member files found

no_fraud: Found 1 files
  no_fraud: Encoded 1/1 files

Encoding Complete!
Total files found: 1
Total files processed: 1


---

## Pipeline Complete!

The complete data preprocessing pipeline consists of 4 stages:

1. **Data Cleaning**: Raw CSV → Cleaned CSV (`data/train/cleaned/`)
2. **Feature Engineering**: Cleaned CSV → Clustered CSV (`data/train/clustered_out/`)
3. **Fraud Matching**: Clustered CSV → Categorized by Member (`data/train/by_member/[matched|unmatched|no_fraud]/`)
4. **Feature Encoding**: Processed CSV → Final Encoded Dataset (`data/train/final/[matched|unmatched|no_fraud]/`)

**Final Output**: `data/train/final/[matched|unmatched|no_fraud]/member_*.csv`

These final encoded files are ready for model training!