
In [4]:
import pandas as pd

# Load transaction data
train_txn = pd.read_csv('train_transaction.csv')

# Load identity data
train_id = pd.read_csv('train_identity.csv')

print("Transaction shape:", train_txn.shape)
print("Identity shape:", train_id.shape)




Transaction shape: (80008, 394)
Identity shape: (144233, 41)


In [5]:
train_txn.head()


,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
train_id.head()

,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987004,0.0,70787.0,,,,,,,,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,...,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,...,chrome 62.0,,,,F,F,T,T,desktop,Windows
3,2987011,-5.0,221832.0,,,0.0,-6.0,,,,...,chrome 62.0,,,,F,F,T,T,desktop,
4,2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,...,chrome 62.0,24.0,1280x800,match_status:2,T,F,T,T,desktop,MacOS


In [7]:
train_txn.shape

(80008, 394)

In [8]:
train_id.shape

(144233, 41)

🚀 Phase 2 – Step 2 (Execution)

1️⃣ Merge Transaction + Identity Tables

👉 Why this step (quick reminder)

Transaction = mandatory

Identity = optional enrichment

Join key = TransactionID

Join type = left join

In [9]:
# Merge transaction and identity data
data = train_txn.merge(train_id, on='TransactionID', how='left')

print("Merged data shape:", data.shape)


Merged data shape: (80008, 434)
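
The row count staying at 80008 suggests the left join did not duplicate transactions. As a minimal sketch on toy data (the frames `txn` and `ident` and their values are hypothetical stand-ins, not the real tables), pandas can enforce this during the merge with `validate` and report match coverage with `indicator`:

```python
import pandas as pd

# Toy stand-ins for train_txn and train_id (hypothetical values).
txn = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "isFraud": [0, 0, 1, 0],
})
ident = pd.DataFrame({
    "TransactionID": [2, 4, 5],  # ID 5 has no matching transaction
    "DeviceType": ["mobile", "desktop", "mobile"],
})

# validate="one_to_one" raises MergeError if TransactionID repeats on either side.
# indicator=True adds a "_merge" column showing where each row was found.
merged = txn.merge(
    ident,
    on="TransactionID",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# A left join must keep exactly the transaction rows.
assert len(merged) == len(txn)

# Share of transactions that have identity information attached.
identity_coverage = (merged["_merge"] == "both").mean()
print(merged)
print("Identity coverage:", identity_coverage)
```

On the real data the same call would be `train_txn.merge(train_id, on='TransactionID', how='left', validate='one_to_one', indicator=True)`, assuming `TransactionID` is unique in both files.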


In [10]:
# Basic sanity checks
print("Target column:", 'isFraud')
print("TransactionID unique:", data['TransactionID'].is_unique)


Target column: isFraud
TransactionID unique: True
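
The first print above only echoes the column name. A slightly stricter version of these checks could look like the sketch below; the helper name `check_target` and the toy frame are hypothetical, and the checks encode the assumption that `isFraud` is a complete 0/1 column.

```python
import pandas as pd

def check_target(df: pd.DataFrame, target: str = "isFraud", key: str = "TransactionID") -> None:
    """Raise AssertionError if basic assumptions about the table do not hold."""
    assert target in df.columns, f"missing target column {target!r}"
    assert df[target].notna().all(), "target contains missing values"
    assert set(df[target].unique()) <= {0, 1}, "target is not binary"
    assert df[key].is_unique, f"{key} contains duplicates"

# Toy frame (hypothetical values) that satisfies every check.
toy = pd.DataFrame({"TransactionID": [10, 11, 12], "isFraud": [0, 1, 0]})
check_target(toy)
print("all checks passed")
```

Calling `check_target(data)` on the merged frame would fail loudly instead of printing a value that has to be read by eye.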


3️⃣ Check Fraud Rate (CRITICAL STEP)

This tells us how imbalanced the problem is.

In [13]:
fraud_count = data['isFraud'].sum()
total_count = len(data)
fraud_rate = fraud_count / total_count

fraud_count, total_count, fraud_rate

(np.int64(2126), 80008, np.float64(0.026572342765723428))

✅ What We Skipped (Intentionally) and Why

1️⃣ We skipped understanding every column

Dataset has 400+ features

Many (Vxxx) are anonymized & engineered

Understanding each one is not required to build a good model

👉 Why skipped:
In real projects, you focus on patterns & signals, not decoding every feature.

2️⃣ We skipped perfect data cleaning

We did NOT:

Fill all NaNs

Remove all missing values

Normalize everything

👉 Why skipped:

Missing values can be informative in fraud data

Gradient-boosted tree libraries such as XGBoost handle NaNs natively

Over-cleaning can destroy signals
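
One way to see whether missingness carries signal is to compare the fraud rate between rows where a column is missing and rows where it is present. The sketch below uses made-up toy values; `id_02` merely stands in for any sparsely populated column, and the numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, chosen only to illustrate the comparison).
toy = pd.DataFrame({
    "id_02":   [np.nan, 7.0, np.nan, 3.0, np.nan, 5.0, 1.0, np.nan],
    "isFraud": [1,      0,   1,      0,   0,      0,   0,   1],
})

# Label each row by whether the value is missing.
is_missing = toy["id_02"].isna()
group = is_missing.map({True: "missing", False: "present"})

# Fraud rate within each group.
fraud_rate_by_missing = toy.groupby(group)["isFraud"].mean()
print(fraud_rate_by_missing)

# Keep the signal explicitly as a 0/1 indicator instead of imputing it away.
toy["id_02_missing"] = is_missing.astype(int)
```

If the two rates differ clearly on the real columns, filling NaNs with a mean or median would blur exactly the pattern a model could use.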

3️⃣ We skipped feature engineering early

No aggregations

No time windows

No encodings yet

👉 Why skipped:
You first need to understand the problem + baseline behavior before adding complexity.

4️⃣ We skipped modeling early

No Logistic Regression

No XGBoost yet

👉 Why skipped:
Jumping to models without first understanding the following leads to wrong conclusions:

Class imbalance

Business cost

Data behavior

🎯 Why We Did NOT Focus on Accuracy
Key reason:

Accuracy is misleading when classes are imbalanced

In our data:

97.34% genuine

2.66% fraud

A model predicting:

"All transactions are genuine"

Gives:

~97.3% accuracy

0 fraud detected

0 business value
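
The arithmetic behind this baseline can be checked directly from the counts printed in the fraud-rate cell (2126 frauds out of 80008 rows). The variable names below are illustrative only:

```python
# Counts taken from the fraud-rate cell earlier in the notebook.
fraud_count = 2126
total_count = 80008
genuine_count = total_count - fraud_count

# A "model" that labels every transaction as genuine:
true_negatives = genuine_count   # every genuine transaction is labelled correctly
false_negatives = fraud_count    # every fraud is missed
true_positives = 0               # nothing is ever flagged as fraud

accuracy = (true_positives + true_negatives) / total_count
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

Accuracy comes out at roughly 0.9734 while recall is exactly 0, which is why accuracy alone says nothing useful here.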

🎯 Why We Focus on Precision Instead

You decided earlier:

🔵 Blocking a genuine customer is worse than missing a fraud

So the bank's real question is:

"When we block a transaction, are we confident it's actually fraud?"

That is precision.

📌 Accuracy vs Precision (One-Liner)

Accuracy: "How often am I correct overall?" ❌ misleading here

Precision: "When I say fraud, how often am I right?" ✅ business-aligned
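
To make the two definitions concrete, here is a small sketch that computes precision and recall from binary labels. The helper `precision_recall` and the toy predictions are hypothetical; no model has been trained at this point in the notebook.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # fraud, blocked
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # genuine, blocked
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # fraud, let through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical predictions: 3 transactions blocked, 2 of them truly fraud,
# while 2 further frauds slip through.
y_true = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Here precision is 2/3 (one blocked customer was genuine) and recall is 1/2 (half of the frauds were caught).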

🏦 Real Banking Interpretation

| Metric | Meaning |
| --- | --- |
| Accuracy | Useless in heavy imbalance |
| Recall | Catching fraud (secondary) |
| Precision | Avoid blocking real customers (primary) |
| PR-AUC | Best summary metric |

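
PR-AUC is commonly estimated as average precision. The sketch below is a hand-rolled version for illustration only: the function name `average_precision` and the toy scores are hypothetical, and it assumes distinct scores and at least one fraud. In practice a library routine such as scikit-learn's `average_precision_score` would normally be used instead.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a fraud appears.

    Assumes scores are distinct (no ties) and at least one positive exists.
    """
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))  # highest score first
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits == 1].mean())

# Toy ranking: frauds sit at rank 1 and rank 3.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

ap = average_precision(y_true, scores)
print(f"average precision = {ap:.4f}")
```

With frauds at ranks 1 and 3, precision at those ranks is 1/1 and 2/3, so the average is 5/6. A model that ranks at random would score close to the fraud rate, which is the natural baseline for this metric.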
🧠 Interview-Ready Summary (You Can Memorize)

"We intentionally avoided early modeling and accuracy-based evaluation because fraud detection is a highly imbalanced problem. Since false positives directly impact customer experience, we prioritized precision and PR-based metrics aligned with real business costs."

This answer alone puts you above the average candidate.


🧠 Final Complete Brief (Clean Version)

You can save this as a project explanation:

• We did not attempt to understand or clean all 400+ features upfront, as many were anonymized and missingness carried information.

• We avoided early modeling to first understand class imbalance and business risk.

• Accuracy was not used due to severe imbalance (~2.7% fraud), as it gives a misleading picture of performance.

• Precision and PR-based metrics were prioritized since false positives directly impact customer experience.

• We deferred train-validation splitting so that it can be done by time rather than at random, avoiding time-based data leakage.

• Decision thresholds were deferred to align with business risk tolerance rather than arbitrary values.

5️⃣ We skipped random train–validation splitting inside the training data to avoid time-based data leakage. A validation split is still needed later, because the separate test set Kaggle provides is unlabeled.

Labelled data has the correct answers and is used for learning and evaluation, while unlabelled data has no answers and is used only for making predictions.
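
A time-ordered split, as opposed to the random split that was skipped, could look like the sketch below. It assumes `TransactionDT` increases with time; the helper name `time_based_split`, the `valid_frac` parameter and the toy labels are hypothetical.

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "TransactionDT", valid_frac: float = 0.2):
    """Split so that no validation row is earlier than any training row."""
    ordered = df.sort_values(time_col, kind="stable")
    n_valid = int(round(len(ordered) * valid_frac))
    cut = len(ordered) - n_valid
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy frame: timestamps in shuffled order, labels made up for illustration.
toy = pd.DataFrame({
    "TransactionDT": [86506, 86400, 86499, 86401, 86469],
    "isFraud":       [0,     0,     1,     0,     0],
})

train_part, valid_part = time_based_split(toy, valid_frac=0.4)
print("train:", train_part["TransactionDT"].tolist())
print("valid:", valid_part["TransactionDT"].tolist())
```

The model then only ever learns from transactions that happened before the ones it is evaluated on, which mirrors how it would be used in production.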

**BASIC EDA** - To understand how transactions actually behave


In [15]:
data.groupby('isFraud')['TransactionAmt'].describe()


isFraud,count,mean,std,min,25%,50%,75%,max
0,77882.0,128.68392,207.267638,1.0,44.0,75.0,125.0,4829.95
1,2126.0,135.035269,195.814846,0.292,39.9625,77.0,150.0,3081.97
