### Transaction Anomaly Detection for Financial Controls

Objective: Detect unusual or potentially fraudulent transactions to help finance teams identify financial risk, prioritize investigations, and strengthen internal controls.

### Step 1: Data Preparation

In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv("Synthetic_Financial_datasets_log.csv")

In [3]:
df.shape

(6362620, 11)

In [4]:
df.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [8]:
df.head()

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig     nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36  M1979787155             0.0             0.0        0               0
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72  M2044282225             0.0             0.0        0               0
2     1  TRANSFER     181.0  C1305486145          181.0             0.0   C553264065             0.0             0.0        1               0
3     1  CASH_OUT     181.0   C840083671          181.0             0.0    C38997010         21182.0             0.0        1               0
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86  M1230701703             0.0             0.0        0               0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [9]:
df['isFraud'].value_counts()

isFraud
0    6354407
1       8213
Name: count, dtype: int64

In [10]:
df['isFraud'].value_counts(normalize=True)

isFraud
0    0.998709
1    0.001291
Name: proportion, dtype: float64

In [11]:
df['type'].value_counts()

type
CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: count, dtype: int64

In [12]:
df.groupby('type')['isFraud'].mean().sort_values(ascending=False)

type
TRANSFER    0.007688
CASH_OUT    0.001840
CASH_IN     0.000000
DEBIT       0.000000
PAYMENT     0.000000
Name: isFraud, dtype: float64

### STEP 2: FEATURE ENGINEERING

In this step, we engineer finance-oriented features designed to identify
unusual or potentially fraudulent transaction behavior.

The focus is on creating interpretable signals commonly used in
financial controls, audit, and fraud monitoring systems.

Exploratory analysis shows that labeled fraud is concentrated in
`TRANSFER` and `CASH_OUT` transactions; the other three types show a
fraud rate of zero in the breakdown above.

To reduce noise and focus on meaningful risk patterns, the analysis
is restricted to these transaction types only.

In [13]:
df_risk = df[df['type'].isin(['TRANSFER', 'CASH_OUT'])].copy()
df_risk.shape

(2770409, 11)

Legitimate financial transactions follow basic accounting logic:
- The origin account balance decreases by the transaction amount
- The destination account balance increases by the transaction amount

Large deviations from this logic indicate inconsistencies and may signal
fraudulent activity or system manipulation.

In [14]:
df_risk['orig_balance_diff'] = (
    df_risk['oldbalanceOrg'] - df_risk['amount'] - df_risk['newbalanceOrig']
)

In [15]:
df_risk['dest_balance_diff'] = (
    df_risk['oldbalanceDest'] + df_risk['amount'] - df_risk['newbalanceDest']
)
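As a rough illustration of the accounting check described above, the sketch below applies the same origin-side formula to a few made-up transactions. The toy values, the `TOLERANCE` threshold, and the `orig_inconsistent` column are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy transactions (made-up values) using the notebook's column names.
toy = pd.DataFrame({
    "amount":         [100.0, 250.0, 500.0],
    "oldbalanceOrg":  [1000.0, 250.0, 0.0],
    "newbalanceOrig": [900.0, 0.0, 0.0],
})

# Same formula as orig_balance_diff: zero means the origin balance
# dropped by exactly the transaction amount.
toy["orig_balance_diff"] = (
    toy["oldbalanceOrg"] - toy["amount"] - toy["newbalanceOrig"]
)

# Hypothetical tolerance to absorb rounding noise in reported balances.
TOLERANCE = 0.01
toy["orig_inconsistent"] = toy["orig_balance_diff"].abs() > TOLERANCE

print(toy)
```

Only the third row is marked inconsistent: the account reports a zero balance before and after a transfer of 500.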

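The amount-to-balance ratio measures how large a transaction is relative
to the origin account’s available balance. Adding 1 to the denominator
avoids division by zero for accounts with a zero starting balance.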

Transactions that consume a large proportion of an account balance
are considered high-risk and are commonly observed in fraud scenarios.

In [16]:
df_risk['amount_to_balance_ratio'] = (
    df_risk['amount'] / (df_risk['oldbalanceOrg'] + 1)
)
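To make the behavior of this ratio concrete, here is a small sketch on made-up values. It suggests one caveat: when the starting balance is zero, the ratio equals the raw amount, which may explain very large values in this feature. The `orig_zero_balance` flag is a hypothetical addition, not something the original notebook computes.

```python
import pandas as pd

# Toy transactions (made-up values).
toy = pd.DataFrame({
    "amount":        [50.0, 1000.0, 500.0],
    "oldbalanceOrg": [999.0, 999.0, 0.0],
})

# Same formula as amount_to_balance_ratio in the notebook.
toy["amount_to_balance_ratio"] = toy["amount"] / (toy["oldbalanceOrg"] + 1)

# Hypothetical helper flag: zero-balance origins make the ratio equal
# to the raw amount, so it can help to track them separately.
toy["orig_zero_balance"] = toy["oldbalanceOrg"] == 0

print(toy)
```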

Fraudulent behavior often involves repeated transactions originating
from the same account or directed toward the same destination.

Transaction frequency features help identify abnormal usage patterns
and potential mule or intermediary accounts.

In [17]:
df_risk['orig_txn_count'] = (
    df_risk.groupby('nameOrig')['amount'].transform('count')
)

In [18]:
df_risk['dest_txn_count'] = (
    df_risk.groupby('nameDest')['amount'].transform('count')
)
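The frequency features rely on `groupby(...).transform('count')`, which returns one value per row rather than one per group. A minimal sketch on made-up account names shows the effect:

```python
import pandas as pd

# Toy transactions (made-up account names and amounts).
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["C9", "C9", "C9", "C8"],
    "amount":   [10.0, 20.0, 30.0, 40.0],
})

# transform('count') broadcasts each group's size back onto its rows,
# so the result aligns with the original index.
toy["dest_txn_count"] = toy.groupby("nameDest")["amount"].transform("count")

print(toy)
```

Destination `C9` receives three transactions, so each of its rows carries a count of 3, while `C8` carries a count of 1.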

The engineered features capture complementary dimensions of transaction risk:
- Balance inconsistencies
- Unusually large relative transaction amounts
- Abnormal transaction frequency patterns

These signals will be used in the next step to detect anomalous
transactions using unsupervised learning techniques.

In [20]:
df_risk[['orig_balance_diff','dest_balance_diff','amount_to_balance_ratio']].describe()

       orig_balance_diff  dest_balance_diff  amount_to_balance_ratio
count          2770409.0          2770409.0                2770409.0
mean           -285985.0          -28647.13                 157701.9
std             875323.0           593479.4                 761542.3
min          -92445520.0        -75885730.0                      0.0
25%            -279891.2                0.0                 4.993073
50%            -143597.1                0.0                 560.7207
75%             -51853.1                0.0                 156741.4
max                 0.01         10000000.0               92445520.0


In [19]:
df_risk['isFraud'].mean()

np.float64(0.002964544224336551)

### STEP 3: ANOMALY DETECTION

In this step, we apply an unsupervised anomaly detection model to identify
unusual financial transactions.

The objective is not to predict fraud labels directly, but to assign
risk scores that help finance teams prioritize investigations and
monitor abnormal transaction behavior.

In [21]:
features = [
    'amount',
    'orig_balance_diff',
    'dest_balance_diff',
    'amount_to_balance_ratio',
    'orig_txn_count',
    'dest_txn_count'
]

X = df_risk[features].copy()
X.shape

(2770409, 6)

The selected features capture complementary dimensions of transaction risk:
- Transaction magnitude
- Balance inconsistencies
- Relative transaction size
- Repetition and behavioral patterns

These signals are commonly used in financial controls and fraud monitoring.

In [22]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


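Features are standardized so that all variables sit on a comparable scale.
Isolation Forest splits on one feature at a time and is largely insensitive
to linear rescaling, so this step mainly matters if the same matrix is
reused with a scale-sensitive detector.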
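To illustrate what standardization does to features on very different scales, here is a small sketch with made-up values. `X_toy` is a stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X_toy = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
    [10000.0, 4.0],
])

# StandardScaler subtracts each column's mean and divides by its
# (population) standard deviation.
X_toy_scaled = StandardScaler().fit_transform(X_toy)

print(X_toy_scaled.mean(axis=0))  # approximately 0 per column
print(X_toy_scaled.std(axis=0))   # approximately 1 per column
```

Note that the extreme value in the first column is still extreme after scaling; standardization changes units, not the shape of the distribution.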

In [None]:
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.003,  # expected anomaly share, close to the ~0.3% fraud rate in df_risk
    random_state=42,
    n_jobs=1
)

iso_forest.fit(X_scaled)


Isolation Forest is an unsupervised algorithm well suited for fraud
and anomaly detection, especially when labeled fraud cases are rare.

The model isolates observations that differ significantly from normal
transaction behavior.
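The isolation idea can be sketched on synthetic data: a point far from the main cluster tends to be separated after only a few random splits, so it receives the most extreme score. The data, the toy contamination value, and the names `X_toy` and `toy_forest` below are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 200 "normal" points around the origin plus one obvious outlier.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[10.0, 10.0]])
X_toy = np.vstack([normal, outlier])  # the outlier is the last row (index 200)

toy_forest = IsolationForest(
    n_estimators=100,
    contamination=0.01,  # toy value, not the notebook's setting
    random_state=0,
)
toy_forest.fit(X_toy)

# Negate decision_function so that higher means more unusual.
toy_scores = -toy_forest.decision_function(X_toy)
most_unusual = int(np.argmax(toy_scores))

print(most_unusual)
```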

In [24]:
# Negate decision_function so that higher scores mean more unusual
df_risk['anomaly_score'] = -iso_forest.decision_function(X_scaled)
df_risk['is_anomalous'] = iso_forest.predict(X_scaled)

# Convert output: -1 = anomaly, 1 = normal
df_risk['is_anomalous'] = df_risk['is_anomalous'].map({1: 0, -1: 1})

Each transaction receives:
- An anomaly score (higher = more unusual)
- A binary anomaly flag

These outputs are designed to support investigation prioritization
rather than automated decision-making.
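The two outputs are closely related: in scikit-learn, `predict` returns -1 exactly where `decision_function` is negative, so a positive negated score corresponds to a raised flag. The sketch below checks this on synthetic data and builds a simple review queue. `TOP_N` and `review_queue` are hypothetical names, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the scaled feature matrix.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 3))

toy_forest = IsolationForest(contamination=0.02, random_state=1).fit(X_toy)

toy = pd.DataFrame({
    "anomaly_score": -toy_forest.decision_function(X_toy),
    "is_anomalous": (toy_forest.predict(X_toy) == -1).astype(int),
})

# A positive anomaly_score and is_anomalous == 1 should always coincide.
flags_match = bool(
    ((toy["anomaly_score"] > 0) == (toy["is_anomalous"] == 1)).all()
)

# Hypothetical review queue: the TOP_N highest-scored rows first.
TOP_N = 10
review_queue = toy.sort_values("anomaly_score", ascending=False).head(TOP_N)

print(flags_match)
print(review_queue)
```

Ranking by score rather than relying on the binary flag lets a team size the queue to its investigation capacity.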

In [25]:
df_risk['is_anomalous'].value_counts(normalize=True)

is_anomalous
0    0.997002
1    0.002998
Name: proportion, dtype: float64

In [26]:
df_risk.groupby('is_anomalous')['isFraud'].mean()

is_anomalous
0    0.002972
1    0.000482
Name: isFraud, dtype: float64

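Contrary to expectation, transactions flagged as anomalous show a lower
fraud rate (about 0.05%) than unflagged transactions (about 0.30%). With
the current features and settings, the anomaly flag does not enrich for
labeled fraud, so the model needs further tuning before it can serve as a
risk prioritization tool.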
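One way to quantify how well the scores rank labeled fraud is to compare the fraud rate among the top-scored rows with the overall base rate. A ratio (lift) above 1 means the top of the ranking is denser in fraud than the population; below 1 means the opposite. The sketch uses made-up scores and labels, and `fraud_rate_at_k` is a hypothetical helper.

```python
import pandas as pd

# Made-up scores and labels for illustration only.
toy = pd.DataFrame({
    "anomaly_score": [0.30, 0.25, 0.10, -0.05, -0.10, -0.12, -0.15, -0.20],
    "isFraud":       [1, 0, 1, 0, 0, 0, 1, 0],
})


def fraud_rate_at_k(scored: pd.DataFrame, k: int) -> float:
    """Share of labeled fraud among the k highest-scored rows."""
    top_k = scored.sort_values("anomaly_score", ascending=False).head(k)
    return float(top_k["isFraud"].mean())


base_rate = float(toy["isFraud"].mean())  # fraud share in the whole toy set
rate_top3 = fraud_rate_at_k(toy, 3)       # fraud share in the top 3 rows
lift_top3 = rate_top3 / base_rate

print(base_rate, rate_top3, lift_top3)
```

Applied to `df_risk`, the same comparison at several values of `k` would show whether any part of the ranking concentrates fraud, even if the binary flag does not.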

In [28]:
df_export = df_risk[[
    'step', 'type', 'amount',
    'nameOrig', 'nameDest',
    'anomaly_score', 'is_anomalous',
    'isFraud'
]]

In [29]:
df_export.to_csv(
    "transactions_anomaly_scored.csv",
    index=False
)