# Pipeline for Data Preprocessing

## 0. Import and Configuration

In [1]:
import sys
import os
import pandas as pd
import importlib.util

# Import data cleaning module
sys.path.insert(0, os.path.abspath('./01_data_cleaning'))
spec = importlib.util.spec_from_file_location(
    "data_cleaning",
    "./01_data_cleaning/01_data_cleaning.py"
)
dc = importlib.util.module_from_spec(spec)
spec.loader.exec_module(dc)

# Import feature engineering module
spec = importlib.util.spec_from_file_location(
    "feature_engineering",
    "02_feature_engineering/02b_description_encoding/test_run_pipeline.py"
)
fe = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fe)

# Import fraud relabeling module
spec = importlib.util.spec_from_file_location(
    "fraud_relabeling",
    "03_fraud_relabeling/03_fraud_relabeling.py"
)
fr = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fr)

# Import encoding module
spec = importlib.util.spec_from_file_location(
    "encoding",
    "04_encoding/04_encoding.py"
)
enc = importlib.util.module_from_spec(spec)
spec.loader.exec_module(enc)

# Configure data cleaning
dc.ENABLE_RENAMING = True
dc.RAW_DIR = '../../data/pred/raw'
dc.CLEANED_DIR = '../../data/pred/cleaned'

# Configure feature engineering
fe.PROCESSED_DIR = '../../data/pred/cleaned'  # Read from cleaned data
fe.MODEL_NAME = 'prajjwal1/bert-tiny'
fe.TEXT_COLUMN = 'Transaction Description'
fe.BATCH_SIZE = 64
fe.MAX_LENGTH = 64
fe.PCA_DIM = 20
fe.MIN_K = 10
fe.MAX_K = 60
fe.K_STEP = 10
fe.SAMPLE_SIZE = 10000
fe.CLUSTER_BATCH_SIZE = 4096
fe.RANDOM_STATE = 42

# Configure fraud relabeling
fr.INPUT_DIR = '../../data/pred/clustered_out'  # Read from feature engineering output
fr.OUTPUT_MEMBER_DIR = '../../data/pred/by_member'
fr.OUTPUT_PROCESSED_DIR = '../../data/pred/processed'
fr.CHUNKSIZE = 50000

# Configure encoding
enc.PROCESSED_DIR = '../../data/pred/processed'  # Read from fraud relabeling output
enc.OUTPUT_DIR = '../../data/pred/final'
enc.CONFIG_PATH = '../../config/tokenize_dict.json'

# Create directories if they don't exist
directories = [
    dc.RAW_DIR,
    dc.CLEANED_DIR,
    fr.INPUT_DIR,
    fr.OUTPUT_MEMBER_DIR,
    fr.OUTPUT_PROCESSED_DIR,
    enc.OUTPUT_DIR
]

for directory in directories:
    # Create each missing directory (including parents) and report it
    if not os.path.exists(directory):
        os.makedirs(directory, exist_ok=True)
        print(f"Created directory: {directory}")

Created directory: ../../data/pred/raw
Created directory: ../../data/pred/cleaned
Created directory: ../../data/pred/clustered_out
Created directory: ../../data/pred/by_member
Created directory: ../../data/pred/processed
Created directory: ../../data/pred/final


## 1. Data Cleaning

This cell performs the following preprocessing tasks:
1. Standardize headers (e.g., "AccountID" → "Account ID")
2. Fix comma issues (remove extra commas inside field values)
3. Clean the Amount field (remove `$` and commas, convert to numeric)
4. Fill missing values (Amount → 0, other fields → "Unknown", literal "null" → empty)
5. Rename files based on their date range (`MM-DD-YYYY_to_MM-DD-YYYY.csv`)

**Input**: Raw CSV files from `../../data/raw/`  
**Output**: Cleaned and renamed CSV files saved to `../../data/cleaned/`
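
The header, Amount, and missing-value rules listed above can be illustrated with a minimal sketch. The helper names `standardize_header` and `clean_amount` are hypothetical and are not taken from `01_data_cleaning.py`; the actual module may apply different or additional rules.

```python
import re


def standardize_header(name: str) -> str:
    """Split a trailing 'ID' glued to a word, e.g. 'AccountID' -> 'Account ID'."""
    # Only matches when 'ID' directly follows a lowercase letter,
    # so an already-clean header such as 'Account ID' is left unchanged.
    return re.sub(r"(?<=[a-z])ID$", " ID", name.strip())


def clean_amount(raw) -> float:
    """Remove '$' and thousands separators; map missing or 'null' values to 0."""
    if raw is None:
        return 0.0
    text = str(raw).strip().replace("$", "").replace(",", "")
    if text == "" or text.lower() == "null":
        return 0.0
    return float(text)


print(standardize_header("AccountID"))  # Account ID
print(clean_amount("$1,234.50"))        # 1234.5
print(clean_amount("null"))             # 0.0
```

Keeping each rule in its own small function makes it possible to check the rules independently of file I/O.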

In [2]:
dc.main()

Raw: ../../data/raw
Cleaned: ../../data/cleaned

Found 1 CSV files in ../../data/raw

CSV Files List ⬇️
  1. Transaction Data 09232025 (5).csv (63.05 MB)

Processing Files...

[1/1] Transaction Data 09232025 (5).csv... Missing:28077, →09-01-2024_to_09-01-2025.csv

Processing Complete!
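
The renamed file in the log above follows a `start_to_end` date pattern. A minimal sketch of how such a name could be derived from a file's dates is shown here; the function name `date_range_filename` and the `MM/DD/YYYY` input format are assumptions, not details confirmed by the cleaning module.

```python
from datetime import datetime


def date_range_filename(date_strings, fmt="%m/%d/%Y"):
    """Build 'MM-DD-YYYY_to_MM-DD-YYYY.csv' from the earliest and latest dates."""
    # Assumption: dates arrive as strings in `fmt`; empty values are skipped.
    dates = [datetime.strptime(s, fmt) for s in date_strings if s]
    if not dates:
        raise ValueError("no parseable dates found")
    start, end = min(dates), max(dates)
    return f"{start:%m-%d-%Y}_to_{end:%m-%d-%Y}.csv"


print(date_range_filename(["09/01/2024", "03/15/2025", "09/01/2025"]))
# 09-01-2024_to_09-01-2025.csv
```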


### Cleaned File Analysis

#### Size, Rows, and Date Span

#### Overlap Detection

## 2. Feature Engineering: Description Encoding and Clustering

This stage derives a cluster feature from the transaction descriptions:

1. **BERT Encoding**: Encode the "Transaction Description" text into embeddings with the BERT-tiny model (`prajjwal1/bert-tiny`)
2. **Dimensionality Reduction**: Apply PCA to reduce the embedding dimensions (default: 20)
3. **Automatic Clustering**: Search for the best cluster count (k) over a configured range (10 to 60 in steps of 10 in this run) using a heuristic, then cluster with MiniBatchKMeans
4. **Add Cluster ID**: Append a `cluster_id` column to each CSV file

**Input**: Cleaned CSV files from `../../data/cleaned/`  
**Output**: Clustered files saved to `../../data/clustered_out/`

**Note**: This step is compute-intensive (BERT inference, PCA, and clustering, on GPU or CPU), and its runtime grows with the size of the data.
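
The reduction and clustering steps can be sketched with scikit-learn as follows. This is an illustration only: it starts from precomputed embeddings (random vectors stand in for BERT output in the usage lines), and it uses the silhouette score on a sample as the selection heuristic, which is an assumption; the heuristic actually used by the feature engineering module is not shown in this notebook. The function name `choose_k_and_cluster` is hypothetical, and its defaults mirror the configuration values above.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def choose_k_and_cluster(embeddings, pca_dim=20, min_k=10, max_k=60, k_step=10,
                         sample_size=10000, batch_size=4096, random_state=42):
    """Reduce embeddings with PCA, pick k by silhouette score, cluster all rows."""
    reduced = PCA(n_components=pca_dim, random_state=random_state).fit_transform(embeddings)

    # Score candidate k values on a random sample to keep the search cheap.
    rng = np.random.default_rng(random_state)
    n_rows = len(reduced)
    sample = reduced[rng.choice(n_rows, size=min(sample_size, n_rows), replace=False)]

    best_k, best_score = None, -1.0
    for k in range(min_k, max_k + 1, k_step):
        if k >= len(sample):
            break  # silhouette needs fewer clusters than samples
        labels = MiniBatchKMeans(n_clusters=k, batch_size=batch_size, n_init=3,
                                 random_state=random_state).fit_predict(sample)
        score = silhouette_score(sample, labels)
        if score > best_score:
            best_k, best_score = k, score
    if best_k is None:
        raise ValueError("not enough rows for the requested cluster range")

    # Fit the final model on all rows with the selected k.
    final = MiniBatchKMeans(n_clusters=best_k, batch_size=batch_size, n_init=3,
                            random_state=random_state).fit(reduced)
    return best_k, final.labels_


# Usage with stand-in data: 500 random 128-dimensional vectors.
fake_embeddings = np.random.default_rng(0).normal(size=(500, 128))
k, cluster_ids = choose_k_and_cluster(fake_embeddings, min_k=2, max_k=8, k_step=2)
print(k, cluster_ids.shape)
```

The returned labels would correspond to the `cluster_id` column appended in step 4.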

In [4]:
outputs = fe.main()

STAGE 2: DESCRIPTION ENCODING AND CLUSTERING
Input: ../../data/cleaned
Model: prajjwal1/bert-tiny
Text Column: Transaction Description
PCA Dimensions: 20
Cluster Range: 10-60 (step 10)

[Scan] Found 1 CSV file(s) in ../../data/cleaned

[Done] Saved 1 clustered file(s) to /Users/wwy/Documents/CMU/25-Fall/Practicum/Clearshield/data/clustered_out

STAGE 2 COMPLETE
Processed 1 files
Output location: ../../data/cleaned


## 3. Fraud Matching and Relabeling

This stage identifies and relabels fraudulent transactions in two steps:

1. **Reorganize by Member**: Group all transactions by Member ID into individual files
2. **Match Fraud Adjustments**: For each refund record (>10), find and mark the original fraudulent transaction
   - Match by amount and date (date extracted from the description, or a 30-day range)
   - Prevent duplicate matching
   - Categorize as matched/unmatched/no_fraud

**Input**: Clustered files from `../../data/clustered_out/`  
**Output**: Processed member files saved to `../../data/processed/[matched|unmatched|no_fraud]/`
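
The matching rules above can be sketched for a single member as follows. This is a simplified illustration under stated assumptions, not the logic of `03_fraud_relabeling.py`: records are plain dicts with hypothetical `id`, `amount`, and `date` keys, only the 30-day window is modeled (not date extraction from the description), amounts are compared by absolute value, and a member counts as `matched` only when every refund found an original transaction.

```python
from datetime import timedelta


def match_refunds(transactions, refunds, window_days=30):
    """Map each refund id to the id of its original transaction, or None."""
    used = set()  # transaction ids already claimed, to prevent duplicate matching
    matches = {}
    for refund in sorted(refunds, key=lambda r: r["date"]):
        earliest = refund["date"] - timedelta(days=window_days)
        candidates = [
            t for t in transactions
            if t["id"] not in used
            and abs(abs(t["amount"]) - abs(refund["amount"])) < 0.005
            and earliest <= t["date"] <= refund["date"]
        ]
        if candidates:
            # Prefer the candidate closest in time before the refund.
            original = max(candidates, key=lambda t: t["date"])
            used.add(original["id"])
            matches[refund["id"]] = original["id"]
        else:
            matches[refund["id"]] = None
    return matches


def categorize_member(matches):
    """Assumed rule: no refunds -> no_fraud; all matched -> matched; else unmatched."""
    if not matches:
        return "no_fraud"
    if all(original is not None for original in matches.values()):
        return "matched"
    return "unmatched"
```

Tracking claimed transaction ids in a set is what keeps two refunds of the same amount from both pointing at one original transaction.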

### Stage 3-1: Reorganize transactions by member

**Input**: Clustered files from `../../data/clustered_out/`  
**Output**: Member-grouped files saved to `../../data/by_member/`
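
A minimal sketch of the reorganization step, using only the standard library, is shown below. The function name `split_by_member` and the `Post Date` sort column are assumptions; the sketch also holds all rows in memory, whereas the real stage is configured with `fr.CHUNKSIZE` and presumably streams the input in chunks before sorting each member file.

```python
import csv
import os
from collections import defaultdict


def split_by_member(csv_paths, output_dir, member_col="Member ID", date_col="Post Date"):
    """Write one CSV per member, sorted by date; return the number of member files."""
    os.makedirs(output_dir, exist_ok=True)
    rows_by_member = defaultdict(list)
    fieldnames = None
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as src:
            reader = csv.DictReader(src)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            for row in reader:
                rows_by_member[row[member_col]].append(row)

    for member_id, rows in rows_by_member.items():
        # Assumption: dates are ISO strings (YYYY-MM-DD), so string order is date order.
        rows.sort(key=lambda r: r[date_col])
        out_path = os.path.join(output_dir, f"member_{member_id}.csv")
        with open(out_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return len(rows_by_member)
```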

In [29]:
num_members = fr.run_stage1()

STAGE 1: DATA REORGANIZATION
Input: ../../data/clustered_out
Output: ../../data/by_member

Found 4 files
Processing 1/4: 01-01-2025_to_03-31-2025.csv
Processing 2/4: 04-01-2025_to_06-30-2025.csv
Processing 3/4: 07-01-2025_to_09-01-2025.csv
Processing 4/4: 09-01-2024_to_12-31-2024.csv
Created 29656 member files
Sorting files...
  Sorted 1000/29656 files
  Sorted 2000/29656 files
  Sorted 3000/29656 files
  Sorted 4000/29656 files
  Sorted 5000/29656 files
  Sorted 6000/29656 files
  Sorted 7000/29656 files
  Sorted 8000/29656 files
  Sorted 9000/29656 files
  Sorted 10000/29656 files
  Sorted 11000/29656 files
  Sorted 12000/29656 files
  Sorted 13000/29656 files
  Sorted 14000/29656 files
  Sorted 15000/29656 files
  Sorted 16000/29656 files
  Sorted 17000/29656 files
  Sorted 18000/29656 files
  Sorted 19000/29656 files
  Sorted 20000/29656 files
  Sorted 21000/29656 files
  Sorted 22000/29656 files
  Sorted 23000/29656 files
  Sorted 24000/29656 files
  Sorted 25000/29656 files
  [output truncated]

## 4. Feature Encoding

This stage encodes categorical features and prepares the final dataset for model training:

1. **Remove ID Columns**: Delete Account ID and Member ID
2. **Encode Categorical Features**: Convert categorical columns to numeric using predefined dictionary
   - Account Type, Action Type, Source Type, Product ID
3. **Parse Time Features**: Convert Post Time to decimal hours
4. **Convert Date Features**: Parse Post Date and Account Open Date
5. **Clean Up**: Remove text columns (Transaction Description, Fraud Adjustment Indicator)

**Input**: Processed member files from `../../data/processed/[matched|unmatched|no_fraud]/`  
**Output**: Final encoded files saved to `../../data/final/[matched|unmatched|no_fraud]/`
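
The encoding steps above can be illustrated on a single row. The layout of `tokenize_dict.json` is not shown in this notebook, so the sketch assumes a mapping of the form `{feature: {category: code}}`; the `HH:MM:SS` time format, the `-1` code for unseen categories, and the helper names are likewise assumptions. Date parsing (step 4) is left out.

```python
ID_COLUMNS = ("Account ID", "Member ID")
TEXT_COLUMNS = ("Transaction Description", "Fraud Adjustment Indicator")


def time_to_decimal_hours(text):
    """Convert 'HH:MM' or 'HH:MM:SS' to hours as a float, e.g. '13:30:00' -> 13.5."""
    parts = [int(p) for p in text.strip().split(":")]
    parts += [0] * (3 - len(parts))  # pad missing minutes/seconds with zeros
    hours, minutes, seconds = parts[:3]
    return hours + minutes / 60 + seconds / 3600


def encode_row(row, mapping, unknown_code=-1):
    """Drop ID/text columns, map categoricals to codes, convert Post Time."""
    encoded = {}
    for column, value in row.items():
        if column in ID_COLUMNS or column in TEXT_COLUMNS:
            continue
        if column in mapping:
            encoded[column] = mapping[column].get(value, unknown_code)
        elif column == "Post Time":
            encoded[column] = time_to_decimal_hours(value)
        else:
            encoded[column] = value
    return encoded


# Hypothetical mapping and row, for illustration only.
mapping = {"Account Type": {"Checking": 0, "Savings": 1}}
row = {
    "Account ID": "A1",
    "Member ID": "M1",
    "Account Type": "Savings",
    "Post Time": "13:30:00",
    "Amount": "25.0",
    "Transaction Description": "POS PURCHASE",
}
print(encode_row(row, mapping))
# {'Account Type': 1, 'Post Time': 13.5, 'Amount': '25.0'}
```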

In [34]:
total_processed = enc.encode_features(enc.PROCESSED_DIR, enc.OUTPUT_DIR, enc.CONFIG_PATH)

FEATURE ENCODING
Input Dir: ../../data/processed
Output Dir: ../../data/final
Config Path: ../../config/tokenize_dict.json

Loaded encoding dictionary with 4 features

matched: Found 544 files
  Processed 100/544 files
  Processed 200/544 files
  Processed 300/544 files
  Processed 400/544 files
  Processed 500/544 files
  matched: Encoded 544/544 files

unmatched: Found 78 files
  unmatched: Encoded 78/78 files

no_fraud: Found 22673 files
  Processed 100/22673 files
  Processed 200/22673 files
  Processed 300/22673 files
  Processed 400/22673 files
  Processed 500/22673 files
  Processed 600/22673 files
  Processed 700/22673 files
  Processed 800/22673 files
  Processed 900/22673 files
  Processed 1000/22673 files
  Processed 1100/22673 files
  Processed 1200/22673 files
  Processed 1300/22673 files
  Processed 1400/22673 files
  Processed 1500/22673 files
  Processed 1600/22673 files
  Processed 1700/22673 files
  Processed 1800/22673 files
  Processed 1900/22673 files
  [output truncated]

TODO: vulnerability scanner

---

## Pipeline Complete!

The complete data preprocessing pipeline consists of 4 stages:

1. **Data Cleaning**: Raw CSV → Cleaned CSV (`data/cleaned/`)
2. **Feature Engineering**: Cleaned CSV → Clustered CSV (`data/clustered_out/`)
3. **Fraud Matching**: Clustered CSV → Categorized by Member (`data/processed/`)
4. **Feature Encoding**: Processed CSV → Final Encoded Dataset (`data/final/`)

**Final Output**: `../../data/final/[matched|unmatched|no_fraud]/member_*.csv`

These final encoded files are ready for model training!