# About this Notebook

This notebook addresses **bank transaction fraud detection and prevention**. Fraud is a significant risk in the financial sector: it causes substantial losses and erodes customer trust. Robust methods for identifying and preventing fraudulent transactions are therefore a key priority.

This project uses a publicly available dataset from Kaggle (https://www.kaggle.com/datasets/marusagar/bank-transaction-fraud-detection) containing 200,000 bank transaction records across 24 columns. The primary objective is to develop a machine learning model that outperforms a baseline rule-based fraud detection strategy. **The goal is a more efficient and accurate system that improves business outcomes by minimizing fraud losses and strengthening security.**

The methodology employed in this project includes:

1. Data Acquisition: The dataset will be downloaded from Kaggle with the `kagglehub` package.
2. Data Splitting: The data will be partitioned into training, validation, and test sets for rigorous model evaluation.
3. Exploratory Data Analysis (EDA): EDA will be conducted on the training data to gain insights and inform feature engineering.
4. Baseline Model (Rule-Based Strategy): A rule-based strategy will be defined and implemented as a performance benchmark.
5. Preprocessing and Pipeline Creation: Preprocessing steps will be applied to the training data, and a pipeline will be constructed for efficient data transformation and modeling.
6. Model Selection: Several candidate machine learning models will be compared and the most promising ones selected.
7. Training and Hyperparameter Optimization: The selected models will be fitted on the training set, and their hyperparameters tuned against the validation set.
8. Evaluation: The final models will be evaluated on the test set using relevant technical and financial metrics.
9. Conclusion: The project will conclude with a summary of findings and an assessment of the approach's effectiveness.
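
As a rough illustration of step 2, the sketch below shows one way to partition a frame into training, validation, and test sets while keeping the fraud rate similar in each part. It is not the notebook's actual implementation: the 70/15/15 proportions, the use of scikit-learn's `train_test_split`, the helper name `split_train_val_test`, and the label column name `Is_Fraud` are all assumptions made for the example, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical label column name; check df.columns for the real one.
LABEL = "Is_Fraud"


def split_train_val_test(df, label=LABEL, val_size=0.15, test_size=0.15, seed=42):
    """Return stratified (train, val, test) frames; the sizes are assumed proportions."""
    # Convert the fractions to row counts up front so both splits are
    # expressed relative to the full frame.
    n_test = round(len(df) * test_size)
    n_val = round(len(df) * val_size)

    # Hold out the test set first, stratifying on the label so the rare
    # positive class is represented in every partition.
    rest, test = train_test_split(
        df, test_size=n_test, stratify=df[label], random_state=seed
    )
    # Split what remains into training and validation sets.
    train, val = train_test_split(
        rest, test_size=n_val, stratify=rest[label], random_state=seed
    )
    return train, val, test


# Synthetic stand-in for the real data: 1,000 rows, 5% labelled as fraud.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Transaction_Amount": rng.uniform(10, 5000, size=1000),
    LABEL: np.repeat([0, 1], [950, 50]),
})

train, val, test = split_train_val_test(demo)
for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(part[LABEL].mean(), 3))
```

Stratifying matters here because fraud labels are typically a small minority; a purely random split could leave the validation or test set with too few positive cases to evaluate on reliably.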

Importing libraries that will be used:

In [None]:
# Importing libraries
import os
import kagglehub
import pandas as pd



## 1. Data Acquisition 

In [None]:
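# Download the dataset; kagglehub returns the local directory that holds its files
base_path = kagglehub.dataset_download("marusagar/bank-transaction-fraud-detection")
# Load the CSV into a DataFrame and summarize its columns, dtypes, and non-null counts
final_path = os.path.join(base_path, "Bank_Transaction_Fraud_Detection.csv")
df = pd.read_csv(final_path)
df.info()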

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 24 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Customer_ID              200000 non-null  object 
 1   Customer_Name            200000 non-null  object 
 2   Gender                   200000 non-null  object 
 3   Age                      200000 non-null  int64  
 4   State                    200000 non-null  object 
 5   City                     200000 non-null  object 
 6   Bank_Branch              200000 non-null  object 
 7   Account_Type             200000 non-null  object 
 8   Transaction_ID           200000 non-null  object 
 9   Transaction_Date         200000 non-null  object 
 10  Transaction_Time         200000 non-null  object 
 11  Transaction_Amount       200000 non-null  float64
 12  Merchant_ID              200000 non-null  object 
 13  Transaction_Type         200000 non-null  object 
 14  Merc