# 📊 00_eda_overview.ipynb
**Exploratory Data Analysis: Fraud Detection in Financial Transactions**

This notebook walks through the data behind our fraud detection problem using the PaySim synthetic dataset. The goal is to explore transaction types, detect patterns in fraudulent behavior, and identify key challenges for modeling.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')

# Load the dataset
df = pd.read_csv('/mnt/data/imbl_fraud.csv')
df.head()

| Unnamed: 0 | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 548 | CASH_OUT | 107758.11 | C532543098 | 0.0 | 0.0 | C480664961 | 277432.79 | 385190.89 | 0 |
| 1 | 258 | TRANSFER | 140540.86 | C469332498 | 0.0 | 0.0 | C2012266745 | 157713.83 | 298254.69 | 0 |
| 2 | 214 | CASH_OUT | 174294.02 | C559944430 | 20390.0 | 0.0 | C1109783085 | 5663788.15 | 5838082.17 | 0 |
| 3 | 19 | CASH_OUT | 424292.19 | C1386932347 | 0.0 | 0.0 | C164234467 | 589460.78 | 1013752.97 | 0 |
| 4 | 428 | CASH_OUT | 67515.86 | C897752718 | 0.0 | 0.0 | C562051757 | 1103984.18 | 1171500.04 | 0 |
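
The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when the file was saved. That is an assumption, not something the notebook confirms. If it holds, a minimal sketch of how `index_col=0` would absorb it is shown below. The CSV text reproduces the first two rows of the output so the example runs without the original file.

```python
import io

import pandas as pd

# First two rows of the output above, re-typed as CSV text (blank first header
# cell) so the sketch does not depend on the original file being present.
csv_text = (
    ",step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud\n"
    "0,548,CASH_OUT,107758.11,C532543098,0.0,0.0,C480664961,277432.79,385190.89,0\n"
    "1,258,TRANSFER,140540.86,C469332498,0.0,0.0,C2012266745,157713.83,298254.69,0\n"
)

# Default load: pandas names the header-less first column 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Assumption: the first column is a saved index, so use it as the index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(indexed.columns.tolist())
```

Whether to drop or keep the column depends on whether those values carry meaning in the original data, so it is worth checking before discarding it.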


<a id='top'></a>
#### Outline: 
#### 1. <a href='#import'>Import</a>
#### 2. <a href='#EDA'>Exploratory Data Analysis</a>
21. <a href='#fraud-trans'>Which types of transactions are fraudulent?</a>
22. <a href='#isFlaggedFraud'>What determines whether the feature *isFlaggedFraud* gets set or not?</a>
23. <a href='#merchant'>Are expected merchant accounts accordingly labelled?</a>
24. <a href='#common-accounts'>Are there account labels common to fraudulent TRANSFERs and CASH_OUTs?</a>

#### 3. <a href='#clean'>Data Cleaning</a>
31. <a href='#imputation'>Imputation of Latent Missing Values</a>

#### 4. <a href='#feature-eng'>Feature Engineering</a>
#### 5. <a href='#visualization'>Data Visualization</a>
51. <a href='#time'>Dispersion over time</a>
52. <a href='#amount'>Dispersion over amount</a>
53. <a href='#error'>Dispersion over error in balance in destination accounts</a>
54. <a href='#separation'>Separating out genuine from fraudulent transactions</a>
55. <a href='#correlation'>Fingerprints of genuine and fraudulent transactions</a>

#### 6. <a href='#ML'>Machine Learning to Detect Fraud in Skewed Data</a>
61. <a href='#importance'>What are the important features for the ML model?</a>
62. <a href='#decision-tree'>Visualization of ML model</a>
63. <a href='#learning-curve'>Bias-variance tradeoff</a>

#### 7. <a href='#conclusion'>Conclusion</a>

<a id='EDA'></a>
#### 2. Exploratory Data Analysis
From this section through section 4, we wrangle the data exclusively with pandas DataFrame methods. This is the most succinct way to gain insights into the dataset. More elaborate visualizations follow in subsequent sections.

From the exploratory data analysis (EDA) of section <a href='#EDA'>2</a>, we know that fraud occurs only in
'TRANSFER' and 'CASH_OUT' transactions. So we assemble only the rows of those two types in X
for analysis.
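
The filtering step described above can be sketched as follows. The rows are hypothetical toy data standing in for the PaySim frame, and only the name `X` comes from the text; this illustrates the idea and is not the notebook's own cell.

```python
import pandas as pd

# Hypothetical toy rows standing in for the PaySim frame (illustration only).
df = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "PAYMENT", "CASH_IN", "DEBIT", "TRANSFER"],
    "amount": [107758.11, 140540.86, 950.00, 12000.00, 300.00, 500000.00],
    "isFraud": [0, 0, 0, 0, 0, 1],
})

# Check which transaction types contain at least one fraudulent row.
fraud_types = sorted(df.loc[df["isFraud"] == 1, "type"].unique())

# Keep only the two types in which fraud is reported to occur.
X = df.loc[df["type"].isin(["TRANSFER", "CASH_OUT"])].copy()

print(fraud_types)
print(X)
```

Taking a `.copy()` keeps later column assignments on `X` from triggering pandas' chained-assignment warning.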


## 🧠 Summary: What Did We Learn?
- The dataset is highly imbalanced (~6% fraud).
- Fraud occurs only in **TRANSFER** and **CASH_OUT** transactions.
- Fraudulent transactions often involve **large amounts**.
- Some patterns in balances and recipient types could be useful features.

These observations will guide our modeling decisions and help us avoid common traps in fraud detection tasks.
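
The checks behind the summary points above can be sketched with a few DataFrame methods. The rows below are hypothetical toy data, so the printed numbers illustrate the method only and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows (illustration only; not drawn from the real file).
df = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "PAYMENT", "CASH_IN",
             "DEBIT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [107758.11, 140540.86, 950.00, 12000.00,
               300.00, 500000.00, 480000.00, 75.50],
    "isFraud": [0, 0, 0, 0, 0, 1, 1, 0],
})

# Class imbalance: the share of rows labelled as fraud.
fraud_rate = df["isFraud"].mean()

# Fraud count and fraud share per transaction type.
fraud_by_type = df.groupby("type")["isFraud"].agg(["sum", "mean"])

# Typical amount for genuine (0) versus fraudulent (1) transactions.
median_amount = df.groupby("isFraud")["amount"].median()

print(f"fraud rate: {fraud_rate:.2%}")
print(fraud_by_type)
print(median_amount)
```

The median is used here instead of the mean because transaction amounts are usually heavily right-skewed, and a few very large transfers can dominate an average.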
