## 1. Introduction


#### 🎯 Project Goal
> Build a machine learning system that can automatically detect fraudulent financial transactions based on transaction metadata.

#### 📊 Dataset Summary

- **Files**: 183 `.pkl` files, each representing transactions for a specific date.  
- **Combined Records**: 1,754,155 transactions after merging all files (Section 2.4), of which 14,681 are labeled fraudulent.

  ![image.png](attachment:91b59484-b97d-410d-84e4-0937bdb61036.png)
  

#### 🚨 Simulated Fraud Scenarios

1. **Amount-Based Fraud**
   - Any transaction with `TX_AMOUNT > 220` is marked as fraud (baseline pattern).

2. **Terminal-Based Fraud**
   - 2 terminals chosen daily.
   - All transactions from these terminals over the next 28 days are fraudulent.

3. **Customer-Based Fraud**
   - 3 customers chosen daily.
   - Over the next 14 days, ~33% of their transactions are marked fraudulent with `TX_AMOUNT` multiplied by 5.
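
The first scenario is a plain threshold rule, so it can be reproduced directly. The following is a minimal sketch, assuming a DataFrame with the `TX_AMOUNT` column described above; the function name `flag_amount_fraud` and the toy amounts are illustrative and not part of the dataset tooling:

```python
import pandas as pd

AMOUNT_THRESHOLD = 220  # threshold stated in the amount-based scenario


def flag_amount_fraud(df: pd.DataFrame, threshold: float = AMOUNT_THRESHOLD) -> pd.Series:
    """Return 1 where TX_AMOUNT strictly exceeds the threshold, else 0."""
    return (df["TX_AMOUNT"] > threshold).astype(int)


# Toy data (made-up amounts) to show the rule in isolation.
toy = pd.DataFrame({"TX_AMOUNT": [57.16, 220.00, 220.01, 350.00]})
toy["RULE_FLAG"] = flag_amount_fraud(toy)
print(toy)
```

Comparing such a flag against `TX_FRAUD` on the real data would show how much of the labeled fraud the amount rule alone explains; the terminal- and customer-based scenarios depend on history and cannot be captured by a single-row rule like this.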


## 2. Importing Libraries

In [9]:
# Core libraries
import os
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & Feature Engineering
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

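# Suppress library warnings to keep notebook output readable.
# Note: this hides all warnings, including deprecation notices.
import warnings
warnings.filterwarnings('ignore')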

# Optional: apply seaborn's whitegrid style to all plots
sns.set_theme(style="whitegrid")


## Data Loading

### 2.1 Define Dataset Path

In [14]:
data_path = "data"  # Directory containing the daily .pkl files

### 2.2 Count Files

In [13]:
files = sorted([f for f in os.listdir(data_path) if f.endswith(".pkl")])
print("Total .pkl files found:", len(files))
print("First 3 files:", files[:3])

Total .pkl files found: 183
First 3 files: ['2018-04-01.pkl', '2018-04-02.pkl', '2018-04-03.pkl']
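
Since each file is named after a date, one quick sanity check is whether the files cover every day with no gaps. The following is a minimal sketch, assuming all file names follow the `YYYY-MM-DD.pkl` pattern seen in the first three entries; `missing_dates` and the example listing are hypothetical:

```python
import os

import pandas as pd


def missing_dates(filenames):
    """Return the dates absent between the earliest and latest file name."""
    # Strip the extension, then parse the remaining 'YYYY-MM-DD' stem.
    stems = [os.path.splitext(name)[0] for name in filenames]
    dates = pd.to_datetime(stems, format="%Y-%m-%d")
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    return expected.difference(dates)


# Hypothetical listing with one day deliberately left out.
example_files = ["2018-04-01.pkl", "2018-04-02.pkl", "2018-04-04.pkl"]
gaps = missing_dates(example_files)
print("Missing dates:", [d.strftime("%Y-%m-%d") for d in gaps])
```

On the real listing, `missing_dates(files)` returning an empty index would indicate one file per calendar day.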


### 2.3 Read Sample File

In [19]:
sample_df = pd.read_pickle(os.path.join(data_path, files[0]))
print("Sample file shape:", sample_df.shape)
sample_df.head()

Sample file shape: (9488, 9)


|   | TRANSACTION_ID | TX_DATETIME | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_TIME_SECONDS | TX_TIME_DAYS | TX_FRAUD | TX_FRAUD_SCENARIO |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-04-01 00:00:31 | 596 | 3156 | 57.16 | 31 | 0 | 0 | 0 |
| 1 | 1 | 2018-04-01 00:02:10 | 4961 | 3412 | 81.51 | 130 | 0 | 0 | 0 |
| 2 | 2 | 2018-04-01 00:07:56 | 2 | 1365 | 146.0 | 476 | 0 | 0 | 0 |
| 3 | 3 | 2018-04-01 00:09:29 | 4128 | 8737 | 64.49 | 569 | 0 | 0 | 0 |
| 4 | 4 | 2018-04-01 00:10:34 | 927 | 9906 | 50.99 | 634 | 0 | 0 | 0 |


In [20]:
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9488 entries, 0 to 9487
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   TRANSACTION_ID     9488 non-null   int64         
 1   TX_DATETIME        9488 non-null   datetime64[ns]
 2   CUSTOMER_ID        9488 non-null   object        
 3   TERMINAL_ID        9488 non-null   object        
 4   TX_AMOUNT          9488 non-null   float64       
 5   TX_TIME_SECONDS    9488 non-null   object        
 6   TX_TIME_DAYS       9488 non-null   object        
 7   TX_FRAUD           9488 non-null   int64         
 8   TX_FRAUD_SCENARIO  9488 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 741.2+ KB
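
`CUSTOMER_ID`, `TERMINAL_ID`, `TX_TIME_SECONDS`, and `TX_TIME_DAYS` are stored as `object` even though the sample rows show integer values. If that holds across the whole dataset (an assumption worth verifying), converting them to numeric types would reduce memory use and simplify later feature engineering. The following is a sketch on a toy frame built from the first three rows shown above:

```python
import pandas as pd

# Toy frame mimicking the object-typed columns; values come from the
# first three rows of the sample file.
toy = pd.DataFrame(
    {
        "CUSTOMER_ID": pd.Series([596, 4961, 2], dtype="object"),
        "TERMINAL_ID": pd.Series([3156, 3412, 1365], dtype="object"),
        "TX_TIME_SECONDS": pd.Series([31, 130, 476], dtype="object"),
        "TX_TIME_DAYS": pd.Series([0, 0, 0], dtype="object"),
    }
)

object_cols = ["CUSTOMER_ID", "TERMINAL_ID", "TX_TIME_SECONDS", "TX_TIME_DAYS"]
for col in object_cols:
    # errors="raise" (the default) surfaces any non-numeric value
    # instead of silently turning it into NaN.
    toy[col] = pd.to_numeric(toy[col], errors="raise")

print(toy.dtypes)
```

Applying the same loop to `sample_df` or the merged frame would either succeed or raise on the first value that is not numeric, which is itself useful information about the data.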


In [21]:
sample_df.isnull().sum()

TRANSACTION_ID       0
TX_DATETIME          0
CUSTOMER_ID          0
TERMINAL_ID          0
TX_AMOUNT            0
TX_TIME_SECONDS      0
TX_TIME_DAYS         0
TX_FRAUD             0
TX_FRAUD_SCENARIO    0
dtype: int64

In [24]:
sample_df["TX_FRAUD"].value_counts()

TX_FRAUD
0    9485
1       3
Name: count, dtype: int64

### 2.4 Merge All Files

In [25]:
dfs = [pd.read_pickle(os.path.join(data_path, f)) for f in files]
full_df = pd.concat(dfs).sort_values("TX_DATETIME").reset_index(drop=True)

In [26]:
print("Combined DataFrame shape:", full_df.shape)
print("Columns:", full_df.columns.tolist())
print("Fraud vs Legit counts:\n", full_df["TX_FRAUD"].value_counts())

Combined DataFrame shape: (1754155, 9)
Columns: ['TRANSACTION_ID', 'TX_DATETIME', 'CUSTOMER_ID', 'TERMINAL_ID', 'TX_AMOUNT', 'TX_TIME_SECONDS', 'TX_TIME_DAYS', 'TX_FRAUD', 'TX_FRAUD_SCENARIO']
Fraud vs Legit counts:
 TX_FRAUD
0    1739474
1      14681
Name: count, dtype: int64
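
These counts show a heavily imbalanced target, which matters for how models are evaluated later: a classifier that predicts "legit" for every transaction would already be right more than 99% of the time. A short calculation makes the imbalance explicit, using the counts copied from the output above:

```python
# Counts copied from the value_counts() output above.
legit_count = 1_739_474
fraud_count = 14_681

total = legit_count + fraud_count
fraud_rate = fraud_count / total

print(f"Total transactions: {total:,}")
print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Legit-to-fraud ratio: about {legit_count / fraud_count:.0f} to 1")
```

With a positive class this rare, metrics such as precision, recall, and ROC AUC (already imported in Section 2) are likely more informative than plain accuracy.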
