# Feature Engineering — PaySim Fraud Detection

In the EDA notebook, I discovered that fraud only occurs in TRANSFER and CASH_OUT transactions, and that balance columns may reveal anomalies. Now I'll create features that capture these patterns. The goal is a small set of interpretable features that highlight suspicious behaviour without overfitting to the training data.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/PS_20174392719_1491204439457_log.csv")
print(f"Loaded {df.shape[0]:,} transactions with {df.shape[1]} columns")

Loaded 6,362,620 transactions with 11 columns


## Filtering to Fraud-Relevant Transactions

Since the EDA showed that fraud only occurs in TRANSFER and CASH_OUT transactions, I'll filter the dataset to these two types. This removes transaction types that contain no fraud at all and makes the problem more tractable.

In [2]:
df_filtered = df[df["type"].isin(["TRANSFER", "CASH_OUT"])].copy()

print(f"Filtered from {df.shape[0]:,} to {df_filtered.shape[0]:,} transactions")
print(f"Fraud cases retained: {df_filtered['isFraud'].sum():,} / {df['isFraud'].sum():,}")
print("\nTransaction type breakdown:")
print(df_filtered["type"].value_counts())

Filtered from 6,362,620 to 2,770,409 transactions
Fraud cases retained: 8,213 / 8,213

Transaction type breakdown:
type
CASH_OUT    2237500
TRANSFER     532909
Name: count, dtype: int64


All 8,213 fraud cases are retained after filtering, which confirms that restricting to these two transaction types discards no positive examples while dropping about 56% of the rows. The dataset is now focused on the transactions where fraud actually happens.
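
One way to double-check that no fraud sits outside the two retained types is to count fraud cases per transaction type before filtering. The sketch below runs on a tiny hand-made frame (`toy` is made up for illustration, not PaySim data) just to show the shape of the check.

```python
import pandas as pd

# Toy stand-in for the raw frame; values are invented for illustration.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN", "TRANSFER", "DEBIT"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Count fraud cases per transaction type.
fraud_by_type = toy.groupby("type")["isFraud"].sum()

# Types that contain at least one fraud case.
fraud_types = sorted(fraud_by_type[fraud_by_type > 0].index)
print(fraud_types)

# The filter is lossless only if every fraud-bearing type is in the keep list.
keep = ["TRANSFER", "CASH_OUT"]
assert set(fraud_types) <= set(keep)
```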

## Feature Engineering

I'm creating seven features based on domain knowledge about how fraud typically works in payment systems. Each feature targets a specific suspicious pattern — balance inconsistencies, account draining, or transaction type risk.

### Feature 1: Origin Balance Error

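In a consistent ledger, the arithmetic should work out: `oldbalanceOrg - amount = newbalanceOrig`. When it doesn't, either the recorded balances are unreliable or the transaction is part of a fraud scheme that manipulates balances. A non-zero error is not rare, though: about 94% of the filtered transactions have one (see the count below), so the magnitude of the error is likely more informative than its mere presence.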

In [3]:
df_filtered["orig_balance_error"] = (
    df_filtered["oldbalanceOrg"] - df_filtered["amount"]
) - df_filtered["newbalanceOrig"]

print("orig_balance_error statistics:")
print(df_filtered["orig_balance_error"].describe())
print(f"\nNon-zero errors: {(df_filtered['orig_balance_error'] != 0).sum():,}")

orig_balance_error statistics:
count    2.770409e+06
mean    -2.859850e+05
std      8.753230e+05
min     -9.244552e+07
25%     -2.798912e+05
50%     -1.435971e+05
75%     -5.185310e+04
max      1.000000e-02
Name: orig_balance_error, dtype: float64

Non-zero errors: 2,596,313
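
Because the balance columns are floats, an exact `!= 0` comparison can count rounding residue as an error; the maximum of 0.01 above looks like that kind of residue, although I haven't verified it. A hedged alternative is to compare against a small tolerance. The 0.01 threshold and the toy values below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows: two balanced, one clearly unbalanced, one off only by float rounding.
toy = pd.DataFrame({
    "oldbalanceOrg": [1000.0, 500.0, 0.0, 0.3],
    "amount": [200.0, 500.0, 750.0, 0.1],
    "newbalanceOrig": [800.0, 0.0, 0.0, 0.2],
})

error = (toy["oldbalanceOrg"] - toy["amount"]) - toy["newbalanceOrig"]

# Exact comparison flags the rounding-only row as an error.
exact_flag = error != 0

# Tolerance-based comparison (atol is an assumed threshold of one cent).
has_error = ~np.isclose(error, 0.0, atol=0.01)

print(pd.DataFrame({"error": error, "exact": exact_flag, "tolerant": has_error}))
```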


### Feature 2: Destination Balance Error

Same logic for the receiving end: `oldbalanceDest + amount` should equal `newbalanceDest`. If money was sent but didn't arrive (or the balance changed unexpectedly), that's a red flag.

In [4]:
df_filtered["dest_balance_error"] = (
    df_filtered["oldbalanceDest"] + df_filtered["amount"]
) - df_filtered["newbalanceDest"]

print("dest_balance_error statistics:")
print(df_filtered["dest_balance_error"].describe())

dest_balance_error statistics:
count    2.770409e+06
mean    -2.864713e+04
std      5.934794e+05
min     -7.588573e+07
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.000000e+07
Name: dest_balance_error, dtype: float64
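
With this definition the sign of the error carries meaning: a positive value means the destination ended up with less than `oldbalanceDest + amount`, and a negative value means it ended up with more. The sketch below labels the three cases on made-up rows; the `error_direction` column and its label strings are hypothetical, not one of the seven features.

```python
import numpy as np
import pandas as pd

# Toy rows: same starting balance and amount, three different outcomes.
toy = pd.DataFrame({
    "oldbalanceDest": [100.0, 100.0, 100.0],
    "amount": [50.0, 50.0, 50.0],
    "newbalanceDest": [150.0, 100.0, 400.0],
})

toy["dest_balance_error"] = (
    toy["oldbalanceDest"] + toy["amount"]
) - toy["newbalanceDest"]

# Map the sign of the error to a readable label (labels are illustrative).
labels = {0.0: "balanced", 1.0: "under-credited", -1.0: "over-credited"}
toy["error_direction"] = np.sign(toy["dest_balance_error"]).map(labels)

print(toy[["dest_balance_error", "error_direction"]])
```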


### Feature 3: Origin Account Emptied

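The intuition is that fraudsters typically drain an account completely: take everything and run. This binary flag captures that pattern. It fires whenever the post-transaction balance is zero, which includes accounts that were already empty before the transaction, so in this dataset it is far from rare: about 90% of the filtered transactions carry it.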

In [5]:
df_filtered["orig_emptied"] = (df_filtered["newbalanceOrig"] == 0).astype(int)

print("orig_emptied distribution:")
print(df_filtered["orig_emptied"].value_counts())
print(f"\nFraud rate when account emptied: {df_filtered[df_filtered['orig_emptied'] == 1]['isFraud'].mean():.4%}")
print(f"Fraud rate when account not emptied: {df_filtered[df_filtered['orig_emptied'] == 0]['isFraud'].mean():.4%}")

orig_emptied distribution:
orig_emptied
1    2496656
0     273753
Name: count, dtype: int64

Fraud rate when account emptied: 0.3226%
Fraud rate when account not emptied: 0.0584%


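The difference in fraud rates is clear: transactions that leave the origin balance at zero have a fraud rate roughly 5.5 times higher than those that don't (0.32% vs 0.06%). Since about 90% of transactions set the flag, though, it is better at ruling fraud out than at pinpointing it, and I expect it to be most useful in combination with the other features.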
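
Since `newbalanceOrig == 0` is also true for accounts that held nothing to begin with, a stricter variant could require a positive starting balance. The sketch below compares the two on made-up rows; `orig_drained` is a hypothetical column name and is not part of the seven features in this notebook.

```python
import pandas as pd

# Toy rows: drained, already empty, partially spent, drained.
toy = pd.DataFrame({
    "oldbalanceOrg": [5000.0, 0.0, 1200.0, 300.0],
    "newbalanceOrig": [0.0, 0.0, 700.0, 0.0],
})

# Flag as defined in this notebook: ending balance is zero.
toy["orig_emptied"] = (toy["newbalanceOrig"] == 0).astype(int)

# Hypothetical stricter flag: a positive balance went to zero.
toy["orig_drained"] = (
    (toy["oldbalanceOrg"] > 0) & (toy["newbalanceOrig"] == 0)
).astype(int)

print(toy)
```

Whether the stricter flag separates fraud any better would need to be checked against the real data.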

### Feature 4: Amount to Balance Ratio

A transaction that represents a large fraction of the account balance is inherently riskier. If someone is transferring 95% of their funds in one go, that's more suspicious than transferring 5%. I'm adding 1 to the denominator so that accounts with a zero starting balance don't cause division by zero.

In [6]:
df_filtered["amount_to_balance_ratio"] = df_filtered["amount"] / (
    df_filtered["oldbalanceOrg"] + 1
)

print("amount_to_balance_ratio statistics:")
print(df_filtered["amount_to_balance_ratio"].describe())

amount_to_balance_ratio statistics:
count    2.770409e+06
mean     1.577019e+05
std      7.615423e+05
min      0.000000e+00
25%      4.993073e+00
50%      5.607207e+02
75%      1.567414e+05
max      9.244552e+07
Name: amount_to_balance_ratio, dtype: float64
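
The median ratio above is in the hundreds rather than below 1, which suggests many origin accounts start at or near zero: with the `+ 1` denominator, the ratio for those rows is roughly the raw amount. The toy example below shows this, together with one possible bounded variant. The name `ratio_clipped` and the cap of 1.0 are assumptions for illustration, not something taken from `src/features.py`.

```python
import pandas as pd

# Toy rows: 95% of balance, 5% of balance, and a transfer from an empty account.
toy = pd.DataFrame({
    "amount": [950.0, 50.0, 20000.0],
    "oldbalanceOrg": [999.0, 999.0, 0.0],
})

# Ratio as defined in this notebook.
toy["amount_to_balance_ratio"] = toy["amount"] / (toy["oldbalanceOrg"] + 1)

# Hypothetical bounded variant: cap the ratio at 1.0.
toy["ratio_clipped"] = toy["amount_to_balance_ratio"].clip(upper=1.0)

print(toy)
```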


### Feature 5: Destination Balance Unchanged

This is a particularly suspicious pattern — money was supposedly sent to a destination account, but the destination balance didn't change at all. This could indicate a fake receiving account or data manipulation as part of a fraud scheme.

In [7]:
df_filtered["dest_unchanged"] = (
    df_filtered["oldbalanceDest"] == df_filtered["newbalanceDest"]
).astype(int)

print("dest_unchanged distribution:")
print(df_filtered["dest_unchanged"].value_counts())
print(f"\nFraud rate when dest unchanged: {df_filtered[df_filtered['dest_unchanged'] == 1]['isFraud'].mean():.4%}")
print(f"Fraud rate when dest changed: {df_filtered[df_filtered['dest_unchanged'] == 0]['isFraud'].mean():.4%}")

dest_unchanged distribution:
dest_unchanged
0    2764617
1       5792
Name: count, dtype: int64

Fraud rate when dest unchanged: 70.5456%
Fraud rate when dest changed: 0.1493%
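
The paired fraud-rate prints used for the binary flags can be collapsed into a single grouped summary. A minimal sketch, assuming a helper named `fraud_rate_by` (hypothetical, not in `src/features.py`) and made-up rows:

```python
import pandas as pd


def fraud_rate_by(df: pd.DataFrame, flag: str, label: str = "isFraud") -> pd.DataFrame:
    """Return row count, fraud count, and fraud rate for each value of `flag`."""
    return df.groupby(flag)[label].agg(n="count", fraud="sum", fraud_rate="mean")


# Toy data: 3 rows with the flag set (2 fraud), 5 rows without (1 fraud).
toy = pd.DataFrame({
    "dest_unchanged": [1, 1, 1, 0, 0, 0, 0, 0],
    "isFraud": [1, 1, 0, 0, 0, 0, 1, 0],
})

summary = fraud_rate_by(toy, "dest_unchanged")
print(summary)
```

Showing the group sizes next to the rates makes it harder to over-read a high rate that rests on a small group.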


### Feature 6: Is Transfer

From the EDA, TRANSFER transactions have a higher fraud rate than CASH_OUT (0.77% vs 0.18%). A simple binary flag captures this difference in risk profile between the two transaction types.

In [8]:
df_filtered["is_transfer"] = (df_filtered["type"] == "TRANSFER").astype(int)

print("is_transfer distribution:")
print(df_filtered["is_transfer"].value_counts())

is_transfer distribution:
is_transfer
0    2237500
1     532909
Name: count, dtype: int64


### Feature 7: Hour of Day

Although the EDA didn't show a strong hourly pattern in fraud, I'm including hour anyway. In PaySim, each `step` represents one hour of simulated time, so `step % 24` gives the hour of day. Even weak predictive signals can contribute when combined with other features, and temporal patterns might interact with other variables in ways the models can learn.

In [9]:
df_filtered["hour"] = df_filtered["step"] % 24

print("hour distribution (sample):")
print(df_filtered["hour"].value_counts().sort_index().head(10))

hour distribution (sample):
hour
0     18214
1      6005
2      1922
3       780
4       512
5       632
6       904
7      2542
8     10302
9    120028
Name: count, dtype: int64
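
As a plain integer, hour 23 and hour 0 look far apart even though they are adjacent on the clock. If that matters for the chosen model, a common alternative is a sine/cosine encoding. This is an optional sketch: `hour_sin` and `hour_cos` are hypothetical columns, not among the seven features here, and the step values are made up.

```python
import numpy as np
import pandas as pd

# Toy steps spanning several simulated days (one step = one hour).
toy = pd.DataFrame({"step": [1, 23, 24, 47, 743]})

# Hour of day as defined in this notebook.
toy["hour"] = toy["step"] % 24

# Cyclical encoding: map the hour onto a point on the unit circle.
angle = 2 * np.pi * toy["hour"] / 24
toy["hour_sin"] = np.sin(angle)
toy["hour_cos"] = np.cos(angle)

print(toy)
```

Tree-based models can usually split on the raw integer without trouble, so this is mainly relevant for linear or distance-based models.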


## Feature Validation

I'll do a quick sanity check to make sure all features were created correctly and look reasonable. No nulls should exist, and the feature distributions should align with expectations.

In [10]:
feature_cols = [
    "orig_balance_error", "dest_balance_error", "orig_emptied",
    "amount_to_balance_ratio", "dest_unchanged", "is_transfer", "hour"
]

print("Feature summary:")
print(df_filtered[feature_cols].describe())
print("\nNull values per feature:")
print(df_filtered[feature_cols].isnull().sum())
print("\nInfinite values per feature:")
# np.isinf catches both +inf and -inf; comparing to np.inf alone would miss -inf
print(np.isinf(df_filtered[feature_cols]).sum())
print(f"\nDataset shape: {df_filtered.shape}")

Feature summary:
       orig_balance_error  dest_balance_error  orig_emptied  \
count        2.770409e+06        2.770409e+06  2.770409e+06   
mean        -2.859850e+05       -2.864713e+04  9.011868e-01   
std          8.753230e+05        5.934794e+05  2.984111e-01   
min         -9.244552e+07       -7.588573e+07  0.000000e+00   
25%         -2.798912e+05        0.000000e+00  1.000000e+00   
50%         -1.435971e+05        0.000000e+00  1.000000e+00   
75%         -5.185310e+04        0.000000e+00  1.000000e+00   
max          1.000000e-02        1.000000e+07  1.000000e+00   

       amount_to_balance_ratio  dest_unchanged   is_transfer          hour  
count             2.770409e+06    2.770409e+06  2.770409e+06  2.770409e+06  
mean              1.577019e+05    2.090666e-03  1.923575e-01  1.530878e+01  
std               7.615423e+05    4.567599e-02  3.941525e-01  4.004595e+00  
min               0.000000e+00    0.000000e+00  0.000000e+00  0.000000e+00  
25%               4.993073e+00    0.000000e+00 

orig_balance_error         0
dest_balance_error         0
orig_emptied               0
amount_to_balance_ratio    0
dest_unchanged             0
is_transfer                0
hour                       0
dtype: int64

Dataset shape: (2770409, 18)


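No nulls or infinities in the engineered features. The binary features take values in {0, 1} and hour is in [0, 23]. The continuous features are heavily skewed: the balance errors reach tens of millions in magnitude and `amount_to_balance_ratio` peaks near 9.2e7, so scale-sensitive models may need a transform or scaling step.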
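
These checks could also be turned into assertions so that a violation fails loudly instead of relying on a visual scan of the output. A minimal sketch, assuming a helper named `validate_features` (hypothetical) and a made-up two-row frame:

```python
import numpy as np
import pandas as pd


def validate_features(df: pd.DataFrame) -> None:
    """Raise AssertionError if an engineered feature violates its expected range."""
    binary_cols = ["orig_emptied", "dest_unchanged", "is_transfer"]
    numeric_cols = ["orig_balance_error", "dest_balance_error", "amount_to_balance_ratio"]

    for col in binary_cols:
        assert df[col].isin([0, 1]).all(), f"{col} is not binary"
    assert df["hour"].between(0, 23).all(), "hour outside [0, 23]"
    # np.isfinite is False for NaN, +inf, and -inf alike.
    for col in numeric_cols:
        assert np.isfinite(df[col]).all(), f"{col} has NaN or infinite values"
    assert (df["amount_to_balance_ratio"] >= 0).all(), "negative ratio"


# Toy frame with valid values in every engineered column.
toy = pd.DataFrame({
    "orig_balance_error": [0.0, -150.0],
    "dest_balance_error": [0.0, 100.0],
    "orig_emptied": [1, 0],
    "amount_to_balance_ratio": [0.99, 12.5],
    "dest_unchanged": [0, 1],
    "is_transfer": [1, 0],
    "hour": [0, 23],
})

validate_features(toy)
print("all checks passed")
```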

## Using the Production Function

The features above are also implemented in `src/features.py` as a single reusable function called `engineer_features()`. This is what the API will use for inference, so I need to verify it produces identical results. Having a single source of truth for feature engineering prevents training/serving skew.

In [11]:
import sys
sys.path.insert(0, "..")

from src.features import engineer_features

# Apply function to original data
df_from_function = engineer_features(df)

# Verify results match
print(f"Manual shape: {df_filtered.shape}")
print(f"Function shape: {df_from_function.shape}")

assert df_filtered.shape == df_from_function.shape, "Shape mismatch!"

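# Check feature values match, collecting any columns that differ
print("\nFeature comparison:")
mismatched = []
for col in feature_cols:
    match = (df_filtered[col].values == df_from_function[col].values).all()
    print(f"  {col}: {'MATCH' if match else 'MISMATCH'}")
    if not match:
        mismatched.append(col)

# Only report success if every column actually matched
assert not mismatched, f"Feature mismatch in: {mismatched}"
print("\nAll features match between manual creation and function import.")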

Manual shape: (2770409, 18)
Function shape: (2770409, 18)

Feature comparison:
  orig_balance_error: MATCH
  dest_balance_error: MATCH
  orig_emptied: MATCH
  amount_to_balance_ratio: MATCH
  dest_unchanged: MATCH
  is_transfer: MATCH
  hour: MATCH

All features match between manual creation and function import.


The `engineer_features()` function produces identical results to the manual feature creation above. This is critical — the modelling notebook and the production API will both import this function, ensuring consistency between training and inference.
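
An alternative to the element-wise loop is `pandas.testing.assert_frame_equal`, which raises with a description of the first difference it finds and, in recent pandas versions, compares float columns within a tolerance by default. A minimal sketch on toy frames (`manual` and `from_function` are stand-ins, not the real data):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-ins for the manually built frame and the function output.
manual = pd.DataFrame({
    "orig_emptied": [1, 0],
    "amount_to_balance_ratio": [0.5, 2.0],
})
from_function = manual.copy()

# Passes silently when the frames agree.
assert_frame_equal(manual, from_function)

# Introduce a deliberate difference to show the failure path.
from_function.loc[0, "orig_emptied"] = 0
try:
    assert_frame_equal(manual, from_function)
    frames_match = True
except AssertionError as exc:
    frames_match = False
    print(exc)
```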

## Summary

I've created seven features targeting different fraud signals:

- **Balance anomalies**: `orig_balance_error`, `dest_balance_error`, `dest_unchanged`
- **Account behaviour**: `orig_emptied`, `amount_to_balance_ratio`
- **Transaction metadata**: `is_transfer`, `hour`

The `engineer_features()` function in `src/features.py` encapsulates this logic for use in both training and the production API. In the next notebook, I'll use these features to train and evaluate fraud detection models.
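
`src/features.py` is not reproduced in this notebook. Based only on the cells above, `engineer_features()` might look roughly like the following. The signature, the `FRAUD_TYPES` constant, and the choice to return a filtered copy are assumptions inferred from how the function is called here, not the actual implementation.

```python
import pandas as pd

FRAUD_TYPES = ["TRANSFER", "CASH_OUT"]  # assumed constant name


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical reconstruction: filter to fraud-relevant types, add seven features."""
    out = df[df["type"].isin(FRAUD_TYPES)].copy()
    out["orig_balance_error"] = (out["oldbalanceOrg"] - out["amount"]) - out["newbalanceOrig"]
    out["dest_balance_error"] = (out["oldbalanceDest"] + out["amount"]) - out["newbalanceDest"]
    out["orig_emptied"] = (out["newbalanceOrig"] == 0).astype(int)
    out["amount_to_balance_ratio"] = out["amount"] / (out["oldbalanceOrg"] + 1)
    out["dest_unchanged"] = (out["oldbalanceDest"] == out["newbalanceDest"]).astype(int)
    out["is_transfer"] = (out["type"] == "TRANSFER").astype(int)
    out["hour"] = out["step"] % 24
    return out


# Toy input with the raw columns the features depend on (values are made up).
toy = pd.DataFrame({
    "step": [1, 25, 30],
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [100.0, 20.0, 50.0],
    "oldbalanceOrg": [100.0, 500.0, 200.0],
    "newbalanceOrig": [0.0, 480.0, 150.0],
    "oldbalanceDest": [0.0, 0.0, 10.0],
    "newbalanceDest": [0.0, 0.0, 60.0],
})

features = engineer_features(toy)
print(features[["orig_emptied", "dest_unchanged", "is_transfer", "hour"]])
```

Working on a copy keeps the caller's frame untouched, which matters if the same raw frame is reused elsewhere in a notebook.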