# Payments Fraud Detection Project

`Author:` [Syed Muhammad Ebad](https://www.kaggle.com/syedmuhammadebad)\
`Date:` 21-Sept-2024\
[Send me an email](mailto:mohammadebad1@hotmail.com)\
[Visit my GitHub profile](https://github.com/smebad)

[Dataset used in this notebook](https://www.kaggle.com/datasets/ealaxi/paysim1)

## Introduction
The dataset used in this project contains financial transaction data. It includes various types of transactions such as CASH_OUT, PAYMENT, TRANSFER, and others. The goal is to detect fraudulent transactions by building a classification model. In this project, we will explore the dataset, analyze correlations, and build a machine learning model to predict whether a transaction is fraudulent or not.

## 1. Importing Necessary Libraries
In this section, we import the required libraries, including Pandas for data manipulation, NumPy for numerical operations, Plotly for visualization, and Scikit-learn for building a machine learning model.

In [34]:
# importing necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

## 2. Loading and Exploring the Dataset
We start by loading the dataset and checking the first few rows to understand its structure.

In [17]:
df = pd.read_csv('PS_20174392719_1491204439457_log.csv')
print(df.head(10))

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815      170136.00       160296.36   
1     1   PAYMENT   1864.28  C1666544295       21249.00        19384.72   
2     1  TRANSFER    181.00  C1305486145         181.00            0.00   
3     1  CASH_OUT    181.00   C840083671         181.00            0.00   
4     1   PAYMENT  11668.14  C2048537720       41554.00        29885.86   
5     1   PAYMENT   7817.71    C90045638       53860.00        46042.29   
6     1   PAYMENT   7107.77   C154988899      183195.00       176087.23   
7     1   PAYMENT   7861.64  C1912850431      176087.23       168225.59   
8     1   PAYMENT   4024.36  C1265012928        2671.00            0.00   
9     1     DEBIT   5337.77   C712410124       41720.00        36382.23   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0            0.00        0               0  
1  M2044282225            

* We also check the data types and see if there are any null values.

In [55]:
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            int64  
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         object 
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

### Observations:

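* The dataset contains columns such as type (transaction type), amount, isFraud, and various balance-related fields.
* No missing values are present in the dataset.
* The output above lists type as int64 and isFraud as object because this cell (In [55]) was last run after the encoding step in Section 5. In the raw CSV, type holds strings such as PAYMENT and isFraud holds 0 or 1, as the first rows printed earlier show.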
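
Alongside the null check, it can be useful to look at how the target classes are distributed, since fraud labels are usually a small minority. The following is a minimal sketch on a made-up frame: the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset. On the real data the same calls would be applied to `df["isFraud"]`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `df`
toy = pd.DataFrame({"isFraud": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Absolute count and relative share of each class
class_counts = toy["isFraud"].value_counts()
class_share = toy["isFraud"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

A strongly skewed share is a signal that plain accuracy will be hard to interpret later on.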

## 3. Transaction Type Analysis
We analyze the different types of transactions in the dataset and visualize them using bar and pie charts.

In [56]:
# Exploring the data of transactions
df.type.value_counts()

type
1    2237500
2    2151495
3    1399284
4     532909
5      41432
Name: count, dtype: int64

In [57]:
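# Let's count the transaction types and prepare them for plotting
# (named type_counts so that Python's built-in type() is not shadowed)
type_counts = df["type"].value_counts()
transactions = type_counts.index
values = type_counts.values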

In [58]:
# plotting the plotly bar chart
figure = px.bar(x = transactions, y = values, color = values, 
  color_continuous_scale = "sunset")
figure.show()

In [59]:
# plotting the pie chart for more clarity
figure = px.pie(df, values = values, names = transactions, title="Types of Transaction")
figure.show()

### Observations:

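* With the encoding from Section 5 (1 = CASH_OUT, 2 = PAYMENT, 3 = CASH_IN, 4 = TRANSFER, 5 = DEBIT), CASH_OUT (2,237,500) and PAYMENT (2,151,495) are the most common types of transactions, followed by CASH_IN, TRANSFER, and DEBIT.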
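
Because the counts above are indexed by numeric codes, the charts are easier to read if the codes are turned back into names first. A minimal sketch, assuming the encoding dictionary from Section 5; the names `type_codes`, `code_to_name`, and `named_counts` are introduced here for illustration only:

```python
import pandas as pd

# Encoding used in this notebook (Section 5)
type_codes = {"CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5}

# Invert the dictionary to go from code back to name
code_to_name = {code: name for name, code in type_codes.items()}

# Counts copied from the value_counts() output above
counts = pd.Series({1: 2237500, 2: 2151495, 3: 1399284, 4: 532909, 5: 41432})

# Relabel the index so the plots show names instead of codes
named_counts = counts.rename(index=code_to_name)
print(named_counts)
```

Passing `named_counts.index` and `named_counts.values` to the plotting calls would then label the bars and pie slices by transaction type.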

## 4. Correlation Analysis
Next, we check how the numerical features correlate with the target variable isFraud.

In [27]:
# Select only the numerical columns ("number" matches every integer and float dtype)
numeric_df = df.select_dtypes(include="number")

# Now compute the correlation
correlation = numeric_df.corr()

# Print sorted correlation values with 'isFraud'
print(correlation["isFraud"].sort_values(ascending=False))


isFraud           1.000000
amount            0.076688
isFlaggedFraud    0.044109
step              0.031578
oldbalanceOrg     0.010154
newbalanceDest    0.000535
oldbalanceDest   -0.005885
newbalanceOrig   -0.008148
Name: isFraud, dtype: float64


### Observations:

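* All linear correlations with isFraud are weak. The strongest is amount (about 0.077), followed by isFlaggedFraud (about 0.044) and step (about 0.032); the balance fields such as oldbalanceOrg and newbalanceOrig are all close to zero.
* A weak linear correlation does not mean a feature is useless: a decision tree can still pick up non-linear patterns in these columns.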
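
To illustrate why a near-zero Pearson correlation does not rule a feature out, here is a small sketch on synthetic data. Everything in it (the feature `x`, the 0.8 threshold, the sample size) is an assumption chosen for illustration and has nothing to do with the dataset's actual fraud pattern.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic feature and target (illustrative only, not dataset rows).
# The label is 1 only at the extremes of x, so the relationship is non-linear.
x = rng.uniform(-1.0, 1.0, size=5000)
y = (np.abs(x) > 0.8).astype(int)

# Pearson correlation between x and y is close to zero by symmetry
pearson_r = np.corrcoef(x, y)[0, 1]

# A shallow tree can still separate the classes with two thresholds
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(x.reshape(-1, 1), y)
train_accuracy = tree.score(x.reshape(-1, 1), y)

print(f"Pearson r: {pearson_r:.3f}, tree accuracy: {train_accuracy:.3f}")
```

The correlation table is therefore best read as a quick first look rather than a ranking of feature usefulness.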

## 5. Preprocessing Categorical Variables
We need to convert categorical variables (such as transaction types) into numerical values to use them in machine learning models.

In [28]:
# Transforming categorical variables into numerical codes
df["type"] = df["type"].map({"CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5})
# Relabelling the target so that predictions are human-readable
df["isFraud"] = df["isFraud"].map({0: "No Fraud", 1: "Fraud"})
df.head(10)

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | No Fraud | 0 |
| 1 | 1 | 2 | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | No Fraud | 0 |
| 2 | 1 | 4 | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | Fraud | 0 |
| 3 | 1 | 1 | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | Fraud | 0 |
| 4 | 1 | 2 | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | No Fraud | 0 |
| 5 | 1 | 2 | 7817.71 | C90045638 | 53860.0 | 46042.29 | M573487274 | 0.0 | 0.0 | No Fraud | 0 |
| 6 | 1 | 2 | 7107.77 | C154988899 | 183195.0 | 176087.23 | M408069119 | 0.0 | 0.0 | No Fraud | 0 |
| 7 | 1 | 2 | 7861.64 | C1912850431 | 176087.23 | 168225.59 | M633326333 | 0.0 | 0.0 | No Fraud | 0 |
| 8 | 1 | 2 | 4024.36 | C1265012928 | 2671.0 | 0.0 | M1176932104 | 0.0 | 0.0 | No Fraud | 0 |
| 9 | 1 | 5 | 5337.77 | C712410124 | 41720.0 | 36382.23 | C195600860 | 41898.0 | 40348.79 | No Fraud | 0 |


#### Note: We have encoded the transaction types into numbers and relabeled isFraud for clarity.
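
One thing to watch with dictionary-based encoding is that `Series.map` returns NaN for any value that has no key in the dictionary. A quick check like the one below can catch that. It is only a sketch: the `raw_types` column and its "REFUND" label are hypothetical and do not occur in the dataset.

```python
import pandas as pd

type_codes = {"CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5}

# Hypothetical column containing one label ("REFUND") that the mapping does not cover
raw_types = pd.Series(["PAYMENT", "TRANSFER", "REFUND", "DEBIT"])
encoded = raw_types.map(type_codes)

# Series.map leaves NaN wherever a value has no entry in the dictionary
unmapped = raw_types[encoded.isna()].unique().tolist()
print(unmapped)
```

The same effect appears if the encoding cell is run twice in a notebook: the second run looks up the already-encoded integers in a dictionary with string keys, so every value becomes NaN.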

## 6. Building the Classification Model
We will now build a decision tree classifier to predict whether a transaction is fraudulent based on features such as transaction type, amount, and balances.

In [33]:
# Let's select the features (x) and the target (y) for the fraud classification model
x = np.array(df[["type", "amount", "oldbalanceOrg", "newbalanceOrig"]])
y = np.array(df["isFraud"])

In [36]:
# Splitting the dataset for model training
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 42)
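
When one class is rare, a purely random split can leave the test set with a slightly different class ratio than the training set. One option is the `stratify` parameter of `train_test_split`. The sketch below uses synthetic labels with an assumed 2% positive rate; the names `x_demo` and `y_demo` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption: 2% positives, for illustration only)
y_demo = np.array([1] * 20 + [0] * 980)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks scikit-learn to preserve the class ratio in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each split
print(y_tr.mean(), y_te.mean())
```

In this notebook the equivalent change would be to pass `stratify = y` in the call above.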

In [38]:
model = DecisionTreeClassifier()
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest) * 100)

99.97005950378932


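#### Model Accuracy: The decision tree classifies about 99.97% of the test transactions correctly. Fraud is typically a small minority of transactions, so accuracy alone can overstate how well fraud is caught; a model that always answered "No Fraud" would also score high.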
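
Accuracy is only one way to score a classifier, and on imbalanced labels precision and recall are often more telling. The sketch below uses invented labels (95 legitimate, 5 fraudulent) to show how a model that never predicts fraud still reaches high accuracy. None of these numbers come from the notebook's model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = fraud, 0 = not fraud (values invented for illustration)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that never predicts fraud
y_always_zero = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, y_always_zero)  # high despite catching no fraud
baseline_recall = recall_score(y_true, y_always_zero)      # share of fraud cases caught

# A model that catches 4 of the 5 fraud cases and raises 1 false alarm
y_pred = y_true.copy()
y_pred[95] = 0  # one missed fraud
y_pred[0] = 1   # one false alarm

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(cm)
print(precision, recall)
```

With the string labels used in this notebook, `precision_score` and `recall_score` would need `pos_label="Fraud"` to know which class counts as positive.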

## 7. Predicting Fraudulent Transactions
Let's test the model by predicting whether specific transactions are fraudulent.

* First Example Prediction:

In [52]:
# Let's predict a TRANSFER (type 4) that empties the origin account
# feature order: type, amount, oldbalanceOrg, newbalanceOrig
features = np.array([[4, 9000.60, 9000.60, 0.00]])
print(model.predict(features))

['Fraud']


* Second Example Prediction:

In [54]:
# Let's predict a PAYMENT (type 2) with the same values as row 0 of the dataset
features = np.array([[2, 9839.64, 170136.00, 160296.36]])
print(model.predict(features))

['No Fraud']


### Observations:

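* The model labels the first example, a TRANSFER of 9000.60 that reduces the origin balance from 9000.60 to 0.00, as Fraud.
* The model labels the second example, a PAYMENT whose values match row 0 of the dataset, as No Fraud. That row may have been part of the training data, so this is a sanity check rather than an independent test.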
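
Writing the feature vector by hand makes it easy to mix up the column order or the type code. A small helper can make the intent explicit. This is a sketch only: `encode_transaction` and `TYPE_CODES` are hypothetical names introduced here, built on the encoding from Section 5 and the feature order used for training in Section 6.

```python
import numpy as np

# Same encoding as in Section 5
TYPE_CODES = {"CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5}

def encode_transaction(type_name, amount, old_balance_orig, new_balance_orig):
    """Return a (1, 4) array in the training column order:
    type, amount, oldbalanceOrg, newbalanceOrig."""
    if type_name not in TYPE_CODES:
        raise ValueError(f"Unknown transaction type: {type_name!r}")
    return np.array([[TYPE_CODES[type_name], amount, old_balance_orig, new_balance_orig]])

# Rebuild the first example from above
features = encode_transaction("TRANSFER", 9000.60, 9000.60, 0.00)
print(features)
```

The resulting array can be passed to `model.predict` in the same way as the hand-written ones.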

## 8. Summary
### What I Did in This Project:
* Data Exploration: I explored the dataset and visualized the types of transactions and their counts.
* Correlation Analysis: I calculated the correlation between the numerical features and the isFraud label.
* Data Preprocessing: I converted categorical variables into numerical values to make them suitable for machine learning.
* Model Building: I built a Decision Tree Classifier to predict fraudulent transactions.
* Prediction: I measured the model's accuracy on the test set and tried it on two example transactions.

### What I Learned:
* I learned how to handle a real-world dataset, preprocess it, and extract meaningful insights using visualization and correlation.
* I gained hands-on experience with building a simple decision tree model and learned how to evaluate its accuracy.
* I understood the importance of feature selection and how certain features (like balance) can play a significant role in fraud detection.

#### By the end of this project, I was able to detect fraudulent transactions with a machine learning model and gained a deeper understanding of the data and the modeling process.

