# <span style="color:pink"> -Credit Card Fraud Detection- </span>


![](https://miro.medium.com/v2/resize:fit:1024/0*nDTApwXjrDUH3kqm.png)



## **Credit Card Fraud Detection**

Hi, this is my notebook where I explore and try to understand a credit card transactions dataset.  
**The main goal here is simple: find out which transactions are fraud and which are normal.**  

Fraud is a big problem for banks and their customers because it causes large financial losses,
so machine learning can help detect it.  

**This dataset has details about financial transactions, such as:**
- step → time of transaction.
- type → type of transaction (CASH_IN, CASH_OUT, TRANSFER, PAYMENT, etc.).
- amount → the transaction amount.
- nameOrig → sender’s ID.
- oldbalanceOrg → sender’s balance before the transaction.
- newbalanceOrig → sender’s balance after the transaction.
- nameDest → receiver’s ID.
- oldbalanceDest → receiver’s balance before the transaction.
- newbalanceDest → receiver’s balance after the transaction.
- isFraud → target column (0 = normal transaction, 1 = fraud).

**In this project I will do two parts:** 
1. Data Analysis (EDA) → I will look at the data, make some plots, and answer questions.  
2. Machine Learning → I will build a model to predict if a transaction is fraud or not.

Let's start with the analysis part first in this notebook!


## <span style="color:pink"> 1-Importing the Libraries</span>

In [None]:

# importing all the libraries we will use in this project.
import numpy as np                       # Numerical computations
import pandas as pd                      # Data manipulation & analysis
import matplotlib.pyplot as plt          # Basic visualization
import seaborn as sns                    # Statistical visualization
import plotly.express as px              # Interactive visualization

## <span style="color:pink"> 2-Loading Data & Initial Exploration</span>

In [None]:
# loading the dataset
df = pd.read_csv('data/transactions.csv')

In [None]:
# Display the shape of the dataframe
df.shape

In [None]:
# Display the first few rows of the dataframe
df.head()

In [None]:
# Display the last few rows of the dataframe
df.tail()

In [None]:
# Display the summary statistics of the dataframe
df.describe()

In [None]:
# Display the information about the dataframe
df.info()

In [None]:
# checking for duplicates
df.duplicated().sum()

In [None]:
# checking for missing values
df.isnull().sum()

There are no missing values in the data.

In [None]:
df.dtypes

The dtype of the `type` column needs to be changed: it is read as plain text although it only holds a small set of transaction categories.
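The cell below is a minimal sketch of that dtype change on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the real dataset). It uses pandas' `category` dtype, which is one reasonable choice for a column with few distinct values.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({"type": ["CASH_OUT", "PAYMENT", "TRANSFER", "PAYMENT"]})

# Convert the text column to pandas' categorical dtype.
toy["type"] = toy["type"].astype("category")

print(toy["type"].dtype)
print(list(toy["type"].cat.categories))
```

On the real data the same call would be `df['type'] = df['type'].astype('category')`.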

In [None]:
# Drop 'step' and the ID columns 'nameOrig' and 'nameDest' in place (columns that are already missing are ignored)
df.drop(columns=['step', 'nameOrig', 'nameDest'], inplace=True, errors='ignore')

# confirm change
df.columns

## <span style="color:pink"> 3-Univariate Analysis</span>

### Categorical columns
**'type' , 'isFraud'**

In [None]:

# Plot histograms for categorical columns using plotly
catC = ['type', 'isFraud']
for col in catC:
    fig = px.histogram(df, x=col, color=col, text_auto=True, title=f'Count of {col}')
    fig.show()

In [None]:
# Calculate the percentage of fraudulent transactions and transaction types
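# normalize=True returns shares instead of raw counts, so multiplying by 100 gives percentages
f = df['isFraud'].value_counts(normalize=True) * 100
t = df['type'].value_counts(normalize=True) * 100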
print(f"\nPercentage of Fraudulent Transactions:\n{f}\n")
print(f"\nPercentage of Transaction Types:\n{t}\n")

In [None]:
fig = px.pie(df,names='type',title='Distribution of Transaction Types',hole=0.3)
fig.show()

In [None]:
fig=px.pie(df, names='isFraud', title='Distribution of Fraud and Non-Fraud Transactions', hole=0.3)
fig.show()

- The most common transaction types are **CASH_OUT and PAYMENT**.

- Most transactions are normal (isFraud=0) and only a very small portion are fraud (isFraud=1). **This indicates the dataset is highly imbalanced**, which will be important to address before modeling.
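To make the imbalance concrete, here is a small sketch on made-up labels (the `labels` Series is an assumption, not the real `isFraud` column). It prints the class shares and computes inverse-frequency class weights, `n_samples / (n_classes * class_count)`, which is one common way to compensate for imbalance later. It is not necessarily what the modeling part will use.

```python
import pandas as pd

# Made-up labels: 98 normal transactions and 2 fraud cases.
labels = pd.Series([0] * 98 + [1] * 2, name="isFraud")

counts = labels.value_counts()
share = labels.value_counts(normalize=True) * 100

# Inverse-frequency weights: rarer classes get larger weights.
weights = len(labels) / (counts.size * counts)

print(share)
print(weights)
```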

### Numerical columns
**'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest' , 'newbalanceDest'**

#### Distribution Analysis

In [None]:
df['amount'].describe()

In [None]:
px.histogram(df, x='amount', nbins=100,marginal='box', title='Distribution of Transaction Amounts').show()

- Most transaction amounts are relatively small (25% are below 13K and 50% are below 74K). However, some transactions reach extremely high values, up to 52M. The mean (180K) is much higher than the median (74K), which indicates that the distribution is highly right-skewed due to extreme outliers.


In [None]:
df['oldbalanceOrg'].describe()

In [None]:
px.histogram(df, x='oldbalanceOrg', nbins=100,marginal='box', title='Distribution of Old Balance (Origin)').show()

- Most origin account balances are very small (25% are zero and 50% are below 14K). However, some accounts reach balances as high as 50M. The mean (831K) is much higher than the median (14K), which shows that the distribution is highly right-skewed with extreme outliers.



In [None]:
df['newbalanceOrig'].describe()

In [None]:
px.histogram(df, x='newbalanceOrig',marginal='box', nbins=100, title='Distribution of New Balance (Origin)').show()

- At least half of the accounts end up with a zero balance after the transaction (both the 25th percentile and the median are 0). However, some accounts still hold very large balances (up to 40M). The mean (852K) is much higher than the median (0), indicating a very right-skewed distribution dominated by zeros with a few extremely large values.


In [None]:
df['oldbalanceDest'].describe()

In [None]:
px.histogram(df, x='oldbalanceDest', nbins=100, marginal='box', title='Distribution of Old Balance (Destination)').show()

- Many destination accounts had very small balances before the transaction (25th percentile = 0, median ≈ 132K). However, some accounts held extremely large balances (up to 236M). The mean (1.09M) is much higher than the median (132K), which indicates a right-skewed distribution with outliers.

In [None]:
df['newbalanceDest'].describe()

In [None]:
px.histogram(df, x='newbalanceDest', nbins=100,marginal='box', title='Distribution of New Balance (Destination)').show()

- After the transaction, 25% of destination accounts still have a zero balance and 50% are below 214K. However, some accounts reach extremely high balances (up to 311M). The mean (1.21M) is larger than the median (214K), which indicates a right-skewed distribution caused by extreme outliers.
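The "mean much higher than median" pattern above can also be checked with a number instead of by eye. The sketch below uses a made-up `amounts` Series (an assumption, not the real data) and compares the sample skewness before and after a `log1p` transform. The log transform is only shown for comparison and is not a step this notebook applies.

```python
import numpy as np
import pandas as pd

# Made-up amounts: mostly small values plus one extreme outlier.
amounts = pd.Series([1, 2, 5, 10, 20, 50, 100, 1_000, 100_000], dtype=float)

mean, median = amounts.mean(), amounts.median()

# Positive skewness means a long right tail.
skew_raw = amounts.skew()
# log1p compresses large values, which shortens the right tail.
skew_log = np.log1p(amounts).skew()

print(f"mean={mean:.1f}, median={median:.1f}")
print(f"skew raw={skew_raw:.2f}, skew after log1p={skew_log:.2f}")
```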

#### Outliers Detection

In [None]:
# outlier detection
columns = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
for col in columns:
    px.box(df, y=col, title=f'Box Plot of {col}').show()

- From the box plots it's clear that there are many outliers in the amounts and balances. Since these outliers may represent fraudulent transactions, I decided not to remove them. Instead, I will apply **"Scaling"** to reduce their impact while keeping the information available to the model.
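As a rough idea of what scaling without dropping outliers can look like, here is a sketch of median/IQR (robust) scaling on a made-up `toy` frame. Which scaler the modeling part will actually use is not decided here, so treat the method and the `amount_scaled` column name as assumptions.

```python
import pandas as pd

# Made-up amounts with one extreme outlier.
toy = pd.DataFrame({"amount": [10.0, 20.0, 30.0, 40.0, 1000.0]})

# Center on the median and divide by the interquartile range (IQR).
# Both statistics are barely affected by the outlier itself.
median = toy["amount"].median()
q1, q3 = toy["amount"].quantile([0.25, 0.75])
iqr = q3 - q1

toy["amount_scaled"] = (toy["amount"] - median) / iqr
print(toy)
```

The outlier row is kept, so the model can still see that it is unusual, but typical values now sit in a small range around zero.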

## <span style="color:pink"> 4-Feature Engineering </span>


In [None]:
# flag: 1 if the sender's new balance is not equal to the old balance minus the amount
df['mismatch_org'] = (df['oldbalanceOrg'] - df['amount']) != df['newbalanceOrig']

# flag: 1 if the amount is larger than the sender's balance before the transaction
# (despite the name, this is a 0/1 flag and not a numeric ratio)
df['amount_ratio'] = (df['amount'] > df['oldbalanceOrg']).astype(int)

# flag: 1 if the receiver's new balance is not equal to the old balance plus the amount,
# because sometimes fraudsters manipulate destination balances
df['mismatch_dest'] = (df['oldbalanceDest'] + df['amount']) != df['newbalanceDest']

# convert the boolean flags to 0/1 integers
df['mismatch_org'] = df['mismatch_org'].astype(int)
df['mismatch_dest'] = df['mismatch_dest'].astype(int)

df['mismatch_org'].value_counts()


In [None]:
df['mismatch_dest'].value_counts()

In [None]:
df['amount_ratio'].value_counts()

- **mismatch_org**: Most transactions do not have a logical balance change for the sender after the transfer.

- **mismatch_dest**: Many transactions show that the receiver’s balance does not increase as expected.

- **amount_ratio**: Many transactions involve sending more money than the sender’s available balance.

These features helped me notice unusual balance and amount behavior in the data.
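One caveat with exact `!=` checks on float balances: tiny rounding errors can be counted as mismatches. The sketch below compares the strict check with a tolerant one based on `np.isclose` on a made-up `toy` frame. The tolerance of 0.01 and the column names `mismatch_strict` and `mismatch_tolerant` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: a rounding-only difference, a consistent row, a real mismatch.
toy = pd.DataFrame({
    "oldbalanceOrg":  [0.3, 100.0, 50.0],
    "amount":         [0.1,  40.0, 50.0],
    "newbalanceOrig": [0.2,  60.0, 10.0],
})

expected = toy["oldbalanceOrg"] - toy["amount"]

# Strict comparison: flags the first row because 0.3 - 0.1 is not exactly 0.2 in floating point.
toy["mismatch_strict"] = (expected != toy["newbalanceOrig"]).astype(int)

# Tolerant comparison: differences within about one cent count as equal.
toy["mismatch_tolerant"] = (~np.isclose(expected, toy["newbalanceOrig"], atol=0.01)).astype(int)

print(toy[["mismatch_strict", "mismatch_tolerant"]])
```

If the two versions give very different counts on the real data, a large share of the "mismatches" comes from rounding rather than from real inconsistencies.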




## <span style="color:pink"> 5-Bivariate Analysis </span>

### Categorical columns

In [None]:
df['type'].value_counts()

In [None]:
# Calculate the percentage of fraudulent transactions for each type
df.groupby('type')['isFraud'].mean() * 100  

In [None]:
# Visualizing the percentage of fraudulent transactions by type
fraud_percentage = df.groupby('type')['isFraud'].mean() * 100
fraud_percentage = fraud_percentage.reset_index()
fig = px.bar(fraud_percentage, x='type', y='isFraud', title='Percentage of Fraudulent Transactions by Type')
fig.show()

- We see that the percentage of fraudulent transactions is highest for the 'TRANSFER' and 'CASH_OUT' types, which indicates that these transaction types are more prone to fraud.


In [None]:
df.groupby('mismatch_org')['isFraud'].value_counts()

This feature shows that many non-fraud transactions still have abnormal sender balance behavior, which can help the model detect hidden fraud patterns.

In [None]:
# aggregate count of mismatch_dest per fraud status 
df.groupby('mismatch_dest')['isFraud'].value_counts()

This feature shows that many non-fraud transactions still have abnormal destination balance behavior, which can help the model detect hidden fraud patterns.

In [None]:
#fraud vs non-fraud within each amount_ratio value
df.groupby('amount_ratio')['isFraud'].value_counts()

Some transactions labeled as non-fraud involve sending amounts larger than the sender’s balance, which indicates unusual behavior that can be useful for the model.
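Raw counts like the ones above are hard to compare because the groups have very different sizes. As a sketch, `pd.crosstab` with `normalize='index'` turns each row into percentages, so the fraud rate per flag value can be read directly. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Made-up flags and labels.
toy = pd.DataFrame({
    "amount_ratio": [0, 0, 0, 0, 1, 1, 1, 1],
    "isFraud":      [0, 0, 0, 1, 0, 0, 1, 1],
})

# Each row sums to 100: share of non-fraud and fraud within each flag value.
rates = pd.crosstab(toy["amount_ratio"], toy["isFraud"], normalize="index") * 100
print(rates)
```

The same call works for `mismatch_org` and `mismatch_dest` by swapping the first argument.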

### Numerical columns

In [None]:
# box plot for 'amount' by isFraud
px.box(df, y='amount', color='isFraud', title='Box Plot of amount by Fraud Status').show()

Fraudulent transactions have a higher median amount (472K) than non-fraud transactions (71K). Non-fraud transactions can reach extreme outliers up to 52M, while fraud cases typically stay below 10M. This shows that transaction amount can help differentiate fraud from non-fraud.

In [None]:
# heatmap for correlation between features
numeric_df = df.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Heatmap of Feature Correlations')
plt.show()  

- oldbalanceDest and newbalanceDest are highly correlated (0.97), suggesting redundancy between them.
- Transaction amount has a moderate correlation with destination balances (0.32 with oldbalanceDest, 0.49 with newbalanceDest).
- The direct correlation of isFraud with the numeric features is weak (all values near 0), meaning fraud cannot be explained by a simple linear relationship with these variables.
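The heatmap uses the default Pearson correlation, which only measures linear relationships and is sensitive to outliers. As a hedged side check, the sketch below compares it with a rank-based (Spearman) correlation, computed here as the Pearson correlation of the ranks. The `toy` frame is made up and says nothing about the real dataset.

```python
import pandas as pd

# Made-up data: the label rises with amount, but one huge amount distorts the linear fit.
toy = pd.DataFrame({
    "amount":  [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0],
    "isFraud": [0, 0, 0, 1, 1, 1],
})

# Pearson: linear association on the raw values.
pearson = toy["amount"].corr(toy["isFraud"])

# Spearman: Pearson correlation of the ranks, so only the ordering matters.
spearman = toy["amount"].rank().corr(toy["isFraud"].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

On the real data, `numeric_df.corr(method='spearman')` would give the rank-based version of the whole matrix.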

## <span style="color:pink"> 6-Answering Questions </span>

#### 1. Which transaction types have the highest fraud ratio?
Fraud is concentrated in CASH_OUT and TRANSFER transactions, while other types show almost no fraud.

#### 2. Do fraudulent transactions usually involve higher amounts compared to non-fraud?
Yes. Fraudulent transactions have a much higher median amount (472K) compared to non-fraud (71K), even though non-fraud transactions can reach extreme outliers up to 52M.

#### 3. What is the average transaction amount for fraud vs non-fraud?

In [None]:
df.groupby('isFraud')['amount'].mean()
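Because the amounts are strongly right-skewed, the mean alone can be pulled up by a few extreme values. A small variant, sketched here on a made-up `toy` frame, reports the mean and the median side by side with `agg`.

```python
import pandas as pd

# Made-up amounts: one large non-fraud outlier inflates the non-fraud mean.
toy = pd.DataFrame({
    "isFraud": [0, 0, 0, 1, 1],
    "amount":  [10.0, 20.0, 600.0, 400.0, 500.0],
})

# Mean and median amount per class in one table.
summary = toy.groupby("isFraud")["amount"].agg(["mean", "median"])
print(summary)
```

On the real data the equivalent call is `df.groupby('isFraud')['amount'].agg(['mean', 'median'])`.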

In [None]:
df.to_csv('cleaned_fraud_data.csv', index=False)

In [None]:
df