# Phase 2 — Feature Engineering
**Inputs:** `X_train.csv`, `y_train.csv` (from Phase 1)  
**Outputs:** `X_train_fe.csv`, `y_train_fe.csv`  
**Constraints:** No SMOTE · No modeling · No future data · No test-set touch

---

## Leakage Prevention Strategy

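Every feature in this notebook is computed using only information available **at the time of each transaction** (the transaction's own fields, earlier rows in the stream, and constants fitted on the training set):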

| Risk | Mitigation |
|---|---|
| Using future transactions in rolling windows | `.shift(1)` before `.rolling()` so window sits entirely in the past |
| Amount normalisation using test statistics | `mean` and `std` computed on **train only**, stored as constants |
| Fraud-rate features leaking current label | `shift(1)` applied to `Class` before rolling, excluding the current row |
| NaN fill introducing future info | Rows with NaN from cold-start dropped entirely (no forward/backward fill) |
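
The first and third rows of the table rely on the same mechanic. As a minimal sketch on a toy series (the values and the window of 3 are illustrative assumptions, not taken from the dataset), shifting before rolling keeps the current row out of its own window:

```python
import pandas as pd

# Toy amounts; values and window size are illustrative only.
amounts = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])
WINDOW = 3

# Leaky: the window for row i includes row i itself.
leaky = amounts.rolling(WINDOW, min_periods=1).mean()

# Safe: shift first, so the window for row i ends at row i-1.
safe = amounts.shift(1).rolling(WINDOW, min_periods=1).mean()

print(pd.DataFrame({"amount": amounts, "leaky": leaky, "safe": safe}))
```

In the last row the leaky mean is pulled up by the 1000.0 spike it is supposed to be compared against, while the safe mean reflects only the three preceding amounts.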

## Load Data

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

X_train = pd.read_csv("data/processed/X_train.csv")
y_train = pd.read_csv("data/processed/y_train.csv")

train = X_train.copy()
train["Class"] = y_train["Class"].values

print(f"X_train: {X_train.shape}  |  y_train: {y_train.shape}")
print(f"Fraud rate : {train['Class'].mean() * 100:.4f}%")
print(f"Time range : {train['Time'].min():.0f}s → {train['Time'].max():.0f}s")


Loaded  X_train: (227845, 30)  |  y_train: (227845, 1)
Columns: ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']
Time range: 0s → 145247s
Fraud rate: 0.1830%


## Step 1 — Basic Transformations

**`log_amount`** compresses the heavy right-skew of transaction amounts.  
**`amount_zscore_global`** standardises amount using *train-only* statistics — the stored `mean_train` / `std_train` will be reused later when transforming the test set, ensuring no test information bleeds into training.
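
To illustrate how the stored constants might be reused later, here is a small sketch on made-up amounts. The series `train_amount` and `test_amount` are hypothetical stand-ins, and the test-side transformation is an assumption about a later phase, not code from this notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test amounts (illustrative values).
train_amount = pd.Series([5.0, 20.0, 75.0, 300.0])
test_amount = pd.Series([10.0, 5000.0])

# Fit on train only; ddof=0 matches the population std used in this notebook.
mean_train = train_amount.mean()
std_train = train_amount.std(ddof=0)

# Transform test with the stored train constants, never with test statistics.
test_log_amount = np.log1p(test_amount)
test_zscore = (test_amount - mean_train) / std_train

print(f"mean_train={mean_train:.2f}  std_train={std_train:.2f}")
print(test_zscore.round(3).tolist())
```

Large test amounts can produce z-scores well outside the range seen in training, which is expected when the scaling is fitted on train alone.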

In [None]:
# stable sort keeps the input order of rows that share the same Time value
train = train.sort_values("Time", kind="stable").reset_index(drop=True)

train["log_amount"] = np.log1p(train["Amount"])

# compute on train only — will be reused when transforming the test set
mean_train = train["Amount"].mean()
std_train  = train["Amount"].std(ddof=0)
train["amount_zscore_global"] = (train["Amount"] - mean_train) / std_train

pd.DataFrame([{"mean_amount": mean_train, "std_amount": std_train}]).to_csv(
    "data/processed/train_scaling_params.csv", index=False
)

print(f"mean={mean_train:.4f}  std={std_train:.4f}")
print(f"log_amount range    : {train['log_amount'].min():.3f} → {train['log_amount'].max():.3f}")
print(f"zscore range        : {train['amount_zscore_global'].min():.3f} → {train['amount_zscore_global'].max():.3f}")


Amount  →  mean=90.8249  std=250.5032
log_amount       : min=0.000  max=9.886
amount_zscore_global: min=-0.363  max=78.106

Scaling params saved to data/processed/train_scaling_params.csv


## Step 2 — Rolling Behavioral Features (Time-Safe)

Because the dataset has no individual card/user ID, all transactions are treated as a single chronological stream ordered by `Time`.

**Leakage prevention:** `.shift(1)` is applied *before* `.rolling()`. The rolling window for row *i* therefore covers at most the 100 preceding rows (*i-100 … i-1*), and the current transaction is never included in its own feature computation.

| Feature | Window | Meaning |
|---|---|---|
| `rolling_mean_amount` | 100 txns | Typical recent transaction size |
| `rolling_std_amount` | 100 txns | Recent volatility in amounts |
| `time_diff` | 1 txn | Gap (seconds) since last transaction |
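
The `min_periods` settings used in the next cell determine how many leading rows come out as NaN. A small sketch on a hypothetical five-row series (window of 3 instead of 100, made-up values) illustrates the pattern: the shifted mean and the time gap are undefined only for the first row, the shifted sample standard deviation for the first two.

```python
import pandas as pd

# Illustrative values only; the real notebook uses a 100-row window.
times = pd.Series([0.0, 1.0, 1.0, 3.0, 3.0])
amounts = pd.Series([12.0, 7.5, 30.0, 4.0, 18.0])
shifted = amounts.shift(1)

roll_mean = shifted.rolling(3, min_periods=1).mean()
roll_std = shifted.rolling(3, min_periods=2).std(ddof=1)
time_diff = times - times.shift(1)

print("mean NaNs     :", roll_mean.isna().sum())
print("std NaNs      :", roll_std.isna().sum())
print("time_diff NaNs:", time_diff.isna().sum())
```

This matches the non-null counts reported below, which fall short of the row count by one for the mean and `time_diff` and by two for the standard deviation.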

In [None]:
WINDOW_AMOUNT = 100

# shift(1) so row i only sees the (up to 100) rows before it: no current-row leakage
amount_shifted = train["Amount"].shift(1)

train["rolling_mean_amount"] = amount_shifted.rolling(WINDOW_AMOUNT, min_periods=1).mean()
train["rolling_std_amount"]  = amount_shifted.rolling(WINDOW_AMOUNT, min_periods=2).std(ddof=1)
train["time_diff"]           = train["Time"] - train["Time"].shift(1)

print(f"rolling_mean_amount : {train['rolling_mean_amount'].notna().sum():,} non-null")
print(f"rolling_std_amount  : {train['rolling_std_amount'].notna().sum():,} non-null")
print(f"time_diff           : {train['time_diff'].notna().sum():,} non-null")
print()
print(train[["Time", "Amount", "rolling_mean_amount", "rolling_std_amount", "time_diff"]].iloc[98:104])


Rolling behavioral features added:
  rolling_mean_amount  — 227,844 non-null
  rolling_std_amount   — 227,843 non-null
  time_diff            — 227,844 non-null

Sample (rows 98–103):
     Time  Amount  rolling_mean_amount  rolling_std_amount  time_diff
98   67.0   28.28            65.635306          186.355543        0.0
99   68.0   15.99            65.257980          185.440320        1.0
100  68.0    2.69            64.765300          184.567145        0.0
101  68.0   22.36            63.296000          184.469625        0.0
102  69.0    9.47            63.492700          184.414827        1.0
103  69.0    0.76            59.800800          181.717369        0.0


## Step 3 — Amount Deviation Features

These capture how unusual the *current* transaction amount is relative to the recent stream.

- **`amount_deviation`**: raw distance from the rolling mean.  
- **`amount_zscore_rolling`**: normalised deviation. Zeros in `rolling_std_amount` are replaced with NaN via `.replace(0, np.nan)`, so a zero or missing standard deviation yields NaN (dropped in Step 5) rather than ±inf.
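
The effect of the zero guard can be seen on a three-row toy example. The values are illustrative; only the `.replace(0, np.nan)` pattern comes from the cell below:

```python
import numpy as np
import pandas as pd

deviation = pd.Series([5.0, -3.0, 8.0])
rolling_std = pd.Series([2.0, 0.0, np.nan])  # includes a zero and a NaN

# Without the guard, dividing by zero gives an infinite z-score.
naive = deviation / rolling_std

# With the guard, the zero becomes NaN and so does the result.
guarded = deviation / rolling_std.replace(0, np.nan)

print(naive.tolist())
print(guarded.tolist())
```

The distinction matters because `dropna` removes NaN but, by default, leaves infinite values in place.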

In [None]:
train["amount_deviation"] = train["Amount"] - train["rolling_mean_amount"]

# guard against zero or NaN std (will be NaN → dropped later)
std_safe = train["rolling_std_amount"].replace(0, np.nan)
train["amount_zscore_rolling"] = train["amount_deviation"] / std_safe

print(train[["amount_deviation", "amount_zscore_rolling"]].describe().round(4))


Amount deviation features added:
  amount_deviation      — 227,844 non-null
  amount_zscore_rolling — 227,843 non-null

Descriptive stats:
       amount_deviation  amount_zscore_rolling
count       227844.0000            227843.0000
mean            -0.0035                 0.0518
std            251.0174                 1.5976
min           -295.4574                -2.4175
25%            -83.2858                -0.4253
50%            -53.6495                -0.2963
75%             -5.6945                -0.0335
max          19557.1228               155.6727


## Step 4 — Fraud Momentum Features

These features capture **fraud bursts**: periods of elevated fraud activity in the recent stream.

**Critical leakage point:** The `Class` label of row *i* must not contribute to its own feature. `.shift(1)` moves each label down one position, so row *i* receives the label of row *i-1*, and the rolling window for row *i* covers the labels of at most the 500 preceding rows (*i-500 … i-1*).

| Feature | Window | Meaning |
|---|---|---|
| `rolling_fraud_rate_500` | 500 txns | Share of recent txns flagged as fraud |
| `rolling_fraud_count_500` | 500 txns | Absolute count of recent fraud txns |
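
One way to convince yourself that a row's own label stays out of its feature is a label-flip check: change the label of a single row and confirm that the feature at that row does not move. The sketch below uses a hypothetical `rolling_fraud_rate` helper, toy labels, and a window of 3 as a stand-in for 500:

```python
import pandas as pd

def rolling_fraud_rate(labels: pd.Series, window: int) -> pd.Series:
    """Past-only fraud rate: shift labels by one row, then roll."""
    return labels.shift(1).rolling(window, min_periods=1).mean()

labels = pd.Series([0, 0, 1, 0, 0, 0])  # illustrative labels
WINDOW = 3  # small stand-in for the 500-row window

base = rolling_fraud_rate(labels, WINDOW)

# Flip the label of row 4 and recompute.
flipped_labels = labels.copy()
flipped_labels.iloc[4] = 1
flipped = rolling_fraud_rate(flipped_labels, WINDOW)

# Row 4's own feature is unchanged; only later rows react.
print(pd.DataFrame({"base": base, "flipped": flipped}))
```

Without the shift, the flipped label would immediately raise the rate at row 4, which is exactly the leak the notebook guards against.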

In [None]:
WINDOW_FRAUD = 500

# shift(1): current row's label must not feed into its own fraud-rate feature
class_shifted = train["Class"].shift(1)

train["rolling_fraud_count_500"] = class_shifted.rolling(WINDOW_FRAUD, min_periods=1).sum()
train["rolling_fraud_rate_500"]  = class_shifted.rolling(WINDOW_FRAUD, min_periods=1).mean()

print(f"Global fraud rate  : {train['Class'].mean():.6f}")
print(f"Rolling mean rate  : {train['rolling_fraud_rate_500'].mean():.6f}")
print()
print(train["rolling_fraud_rate_500"].describe().round(6))


Fraud momentum features (window=500) added:
  rolling_fraud_count_500 — 227,844 non-null
  rolling_fraud_rate_500  — 227,844 non-null

Fraud rate feature stats:
count    227844.000000
mean          0.001830
std           0.003992
min           0.000000
25%           0.000000
50%           0.000000
75%           0.002000
max           0.056000
Name: rolling_fraud_rate_500, dtype: float64

Global fraud rate in train         : 0.001830
Mean of rolling_fraud_rate_500     : 0.001830


## Step 5 — Clean Dataset & Save

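Shifting and rolling leave NaN values at the start of the series (cold-start). With `min_periods` set to 1 for the means and sums and 2 for the standard deviation, only the first two rows are affected. These rows are dropped **entirely**: no forward/backward fill, no imputation with future values.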

The `Class` column is separated into `y_train_fe` and the engineered feature matrix is saved as `X_train_fe`.
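
The drop is done on the combined frame so that features and labels lose the same rows. A minimal sketch of that pattern, with a round trip through a temporary directory (toy columns and values, hypothetical file names):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with one cold-start NaN (illustrative columns and values).
df = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0],
    "rolling_mean_amount": [np.nan, 10.0, 15.0],
    "Class": [0, 1, 0],
})

# Drop NaN rows on the combined frame so features and labels stay aligned.
clean = df.dropna().reset_index(drop=True)
X, y = clean.drop(columns="Class"), clean[["Class"]]

with tempfile.TemporaryDirectory() as tmp:
    X.to_csv(os.path.join(tmp, "X.csv"), index=False)
    y.to_csv(os.path.join(tmp, "y.csv"), index=False)
    X_back = pd.read_csv(os.path.join(tmp, "X.csv"))
    y_back = pd.read_csv(os.path.join(tmp, "y.csv"))

print(X_back.shape, y_back.shape)
```

Dropping NaN rows from the feature matrix alone would leave the label frame one row longer and silently misaligned.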

In [None]:
import os

ENGINEERED_COLS = [
    "log_amount", "amount_zscore_global",
    "rolling_mean_amount", "rolling_std_amount", "time_diff",
    "amount_deviation", "amount_zscore_rolling",
    "rolling_fraud_count_500", "rolling_fraud_rate_500",
]

all_feature_cols = list(X_train.columns) + ENGINEERED_COLS
feature_df = train[all_feature_cols + ["Class"]].copy()

rows_before = len(feature_df)
feature_df.dropna(inplace=True)
print(f"Dropped {rows_before - len(feature_df):,} NaN rows (rolling cold-start)")
print(f"Remaining : {len(feature_df):,}  |  fraud rate: {feature_df['Class'].mean() * 100:.4f}%")

X_train_fe = feature_df[all_feature_cols].reset_index(drop=True)
y_train_fe = feature_df[["Class"]].reset_index(drop=True)

os.makedirs("data/processed", exist_ok=True)
X_train_fe.to_csv("data/processed/X_train_fe.csv", index=False)
y_train_fe.to_csv("data/processed/y_train_fe.csv", index=False)

print(f"X_train_fe : {X_train_fe.shape}  →  data/processed/X_train_fe.csv")
print(f"y_train_fe : {y_train_fe.shape}  →  data/processed/y_train_fe.csv")


Rows before NaN drop : 227,845
Rows dropped (NaN)   : 2
Rows after  NaN drop : 227,843
Fraud rate after drop: 0.1830%

Saved:
  X_train_fe.csv       → 227,843 rows × 39 col(s)
  y_train_fe.csv       → 227,843 rows × 1 col(s)

Engineered feature columns (9):
  log_amount
  amount_zscore_global
  rolling_mean_amount
  rolling_std_amount
  time_diff
  amount_deviation
  amount_zscore_rolling
  rolling_fraud_count_500
  rolling_fraud_rate_500


## Feature Summary & Leakage Audit

In [None]:
# sanity check: all engineered cols should be NaN-free after the drop
# (this confirms completeness; it does not by itself prove the absence of leakage)
null_counts = X_train_fe[ENGINEERED_COLS].isnull().sum()
assert null_counts.sum() == 0, f"Unexpected NaNs:\n{null_counts[null_counts > 0]}"

print(f"Features  : {X_train_fe.shape[1]}  ({len(ENGINEERED_COLS)} engineered + {X_train_fe.shape[1] - len(ENGINEERED_COLS)} original)")
print(f"Rows      : {X_train_fe.shape[0]:,}")
print(f"Fraud rate: {y_train_fe['Class'].mean() * 100:.4f}%")
print()
for col in ENGINEERED_COLS:
    print(f"  {col}")


FEATURE ENGINEERING SUMMARY
  log_amount                      NaN%=0.00   log(1 + Amount)                   — no future info
  amount_zscore_global            NaN%=0.00   (Amount − train_mean) / train_std — train stats only
  rolling_mean_amount             NaN%=0.00   shift(1) + rolling(100).mean()    — past only
  rolling_std_amount              NaN%=0.00   shift(1) + rolling(100).std()     — past only
  time_diff                       NaN%=0.00   Time − Time.shift(1)              — previous row only
  amount_deviation                NaN%=0.00   Amount − rolling_mean_amount      — past only
  amount_zscore_rolling           NaN%=0.00   amount_deviation / rolling_std    — past only
  rolling_fraud_count_500         NaN%=0.00   shift(1) + rolling(500).sum()     — past labels only
  rolling_fraud_rate_500          NaN%=0.00   shift(1) + rolling(500).mean()    — past labels only

Final X_train_fe shape : (227843, 39)
Final y_train_fe shape : (227843, 1)
Fraud rate             : 0.1830%
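
The NaN assertion above confirms completeness, not time-safety. One way to probe the latter is a truncation test: recompute the features on a prefix of the data and check that they match the full-data values, since a past-only feature cannot change when later rows are removed. The sketch below uses a hypothetical `build_features` helper, a window of 5, and synthetic data; it illustrates the idea and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Hypothetical helper mirroring the shift-then-roll pattern."""
    out = pd.DataFrame(index=df.index)
    amt = df["Amount"].shift(1)
    out["rolling_mean_amount"] = amt.rolling(window, min_periods=1).mean()
    out["rolling_fraud_rate"] = (
        df["Class"].shift(1).rolling(window, min_periods=1).mean()
    )
    out["time_diff"] = df["Time"] - df["Time"].shift(1)
    return out

# Synthetic stream: sorted times, skewed amounts, rare positive labels.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Time": np.sort(rng.integers(0, 1_000, size=n)).astype(float),
    "Amount": rng.exponential(50.0, size=n).round(2),
    "Class": (rng.random(n) < 0.05).astype(int),
})

full = build_features(toy)
cut = 120
prefix = build_features(toy.iloc[:cut])

# Past-only features must be identical on the shared rows.
pd.testing.assert_frame_equal(full.iloc[:cut], prefix)
print("truncation check passed")
```

A feature built without the shift, or with a centred window, would fail this comparison near the cut point. The global z-score is exempt by construction, because its mean and standard deviation are fixed constants rather than functions of the row position.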

