<a href="https://colab.research.google.com/github/viktoruebelhart/analyzing-financial-fraud/blob/main/analyzing_financial_fraud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This project is dedicated to building a robust solution for detecting fraudulent activities. By developing and implementing advanced analytical techniques, the project aims to identify suspicious patterns and irregularities that may indicate fraudulent behavior. The ultimate goal is to create a reliable system that can flag potentially fraudulent transactions or activities, enhancing security and supporting proactive measures against fraud.

Key stages of the process include:

*  Conducting exploratory data analysis
*  Analyzing correlations
*  Engineering features
*  Building and training models
*  Assessing model performance

Dataset: https://drive.google.com/file/d/1zjK8zQK5zvhR_r2chWI5dCjeOwASlPfb/view?usp=sharing

# Importing the dataset

In [None]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import asarray

from datetime import datetime as dt
from datetime import timedelta as td

from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# Feature importance
from sklearn.inspection import permutation_importance

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Alura/fraud_detection_dataset.csv')

In [None]:
df.head()

- step - indicates the time elapsed since the start of the simulation, in hours, ranging from 1 to 744 (744 hours is 31 days).

- type - specifies the type of transaction: CASH_IN (deposit), CASH_OUT (withdrawal), DEBIT, PAYMENT, or TRANSFER.

- amount - reflects the total value involved in the transaction.

- nameOrig - identifies the customer who initiated the transaction.

- oldbalanceOrg - shows the account balance of the initiating party before the transaction occurred.

- newbalanceOrig - shows the updated balance of the origin account after the transaction.

- nameDest - refers to the intended recipient or target of the transaction.

- oldbalanceDest - captures the balance of the recipient's account prior to the transaction.

- newbalanceDest - displays the balance of the destination account after the transaction.

- isFraud - indicates if the transaction is classified as fraudulent. In this scenario, fraud is assumed to occur when a user’s account is accessed and drained through transfers, followed by a withdrawal from the destination account.

- isFlaggedFraud - marked by the bank as potential fraud if the transaction attempts to transfer an amount exceeding 200,000.
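
The isFlaggedFraud rule described above can be checked against the data. The sketch below is only an illustration on a toy frame with hypothetical values; it assumes the rule applies to TRANSFER rows with an amount above 200,000, and `FLAG_THRESHOLD`, `expected_flag`, and `mismatches` are names introduced here.

```python
import pandas as pd

# Toy transactions (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250_000.0, 150_000.0, 300_000.0, 500.0],
    "isFlaggedFraud": [1, 0, 0, 0],
})

# Assumed rule: a TRANSFER above the 200,000 threshold should be flagged.
FLAG_THRESHOLD = 200_000
expected_flag = (
    (toy["type"] == "TRANSFER") & (toy["amount"] > FLAG_THRESHOLD)
).astype(int)

# Rows where the recorded flag disagrees with the rule as described.
mismatches = toy[expected_flag != toy["isFlaggedFraud"]]
print(f"Rows disagreeing with the rule: {len(mismatches)}")
```

Running the same comparison on `df` would show how closely the recorded flag follows the stated rule.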

## Exploratory Data Analysis (EDA)


## Check the dataset's structure, data types, and completeness

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isna().sum()

The dataset contains no null values.

In [None]:
df.describe()

In [None]:
df['isFraud'].value_counts()

In [None]:
# Percentage of each class (0 = non-fraud, 1 = fraud)
fraud_percentage = df['isFraud'].value_counts(normalize=True) * 100
fraud_percentage

In [None]:
# Calculate the percentage of fraudulent transactions
fraud_percentage = df['isFraud'].value_counts(normalize=True) * 100

# Create the plot
plt.figure(figsize=(8, 6))
sns.set_palette("coolwarm")
ax = sns.countplot(x='isFraud', data=df)

# Customize title and axis labels
plt.title('Distribution of Fraudulent Transactions')
plt.xlabel('Fraudulent (1) / Not Fraudulent (0)')
plt.ylabel('Number of Transactions')

# Add percentage labels to the bars (sort_index matches the 0, 1 order of the bars)
for p, percentage in zip(ax.patches, fraud_percentage.sort_index()):
    ax.annotate(f'{percentage:.2f}%',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center',
                va='bottom',
                xytext=(0, 5),
                textcoords='offset points',
                fontsize=12,
                color='black')

plt.show()

Based on the analysis, the percentage of fraudulent transactions (isFraud) is very low, with only 0.13% of transactions marked as fraud, while 99.87% are non-fraudulent. This highly imbalanced distribution suggests a significant class imbalance, which can impact the performance of any predictive models trained on this data.
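
To illustrate why this imbalance matters, the sketch below uses toy labels with the same 0.13% fraud share (13 of 10,000, a hypothetical sample) and scores a baseline that always predicts non-fraud. It also computes inverse-frequency class weights, one common way to compensate for imbalance. All names here are introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy labels with a 0.13% fraud share: 13 frauds among 10,000 rows.
y = pd.Series(
    np.r_[np.ones(13, dtype=int), np.zeros(9_987, dtype=int)], name="isFraud"
)

# A "model" that always predicts non-fraud.
always_negative = np.zeros(len(y), dtype=int)

# Accuracy looks excellent while recall on the fraud class is zero.
accuracy = (always_negative == y).mean()
recall = ((always_negative == 1) & (y == 1)).sum() / (y == 1).sum()

# Inverse-frequency class weights: n_samples / (n_classes * class_count).
counts = y.value_counts()
class_weights = len(y) / (counts.size * counts)

print(f"accuracy={accuracy:.4f}, recall={recall:.4f}")
print(class_weights.round(2).to_dict())
```

Because accuracy rewards the majority class, metrics such as recall and precision on the fraud class are more informative here.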

Let's begin by evaluating the effectiveness of the bank's fraud detection engine.
We start by counting how many fraudulent transactions the bank actually flagged.

In [None]:
# Analyze the effectiveness of the bank's fraud detection system.
# Calculate how many fraudulent transactions were correctly identified by the bank.

true_positives = len(df[(df['isFraud'] == 1) & (df['isFlaggedFraud'] == 1)])
false_negatives = len(df[(df['isFraud'] == 1) & (df['isFlaggedFraud'] == 0)])
total_fraudulent = len(df[df['isFraud'] == 1])

# Share of fraudulent transactions that the bank flagged (recall, not overall accuracy)
bank_recall = (true_positives / total_fraudulent) * 100

print(f"Number of True Positives: {true_positives}")
print(f"Number of False Negatives: {false_negatives}")
print(f"Total Fraudulent Transactions: {total_fraudulent}")
print(f"Bank recall in identifying fraud: {bank_recall:.2f}%")

In [None]:
ax = sns.barplot(x=['Flagged by Bank', 'Actual Fraud'],
                 y=[df.isFlaggedFraud.sum(), df.isFraud.sum()])
plt.title('Bank-Flagged vs Actual Fraud')

# Add count labels on each bar
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center',
                va='bottom',
                xytext=(0, 5),
                textcoords='offset points',
                fontsize=10)

plt.show()

The bank's fraud detection engine catches very few fraudulent transactions: only 16 true positives out of 8,213 fraudulent transactions, leaving 8,197 false negatives where fraud went undetected. The detection rate (recall) is therefore just 0.19%, indicating that the current system needs substantial improvement to identify and mitigate fraudulent activity effectively.
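
The share of fraud that gets flagged is only half the picture; the share of flags that are correct matters too. The sketch below computes both on toy labels with hypothetical values. `flag_metrics` is a helper introduced here, not part of the notebook.

```python
import pandas as pd


def flag_metrics(is_fraud: pd.Series, is_flagged: pd.Series) -> dict:
    """Precision and recall of a binary flag against the fraud label."""
    tp = int(((is_fraud == 1) & (is_flagged == 1)).sum())
    fp = int(((is_fraud == 0) & (is_flagged == 1)).sum())
    fn = int(((is_fraud == 1) & (is_flagged == 0)).sum())
    # Guard against division by zero when nothing is flagged or nothing is fraud.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy labels (hypothetical): 4 frauds, 1 of them flagged, no false alarms.
toy = pd.DataFrame({
    "isFraud":        [1, 1, 1, 1, 0, 0, 0, 0],
    "isFlaggedFraud": [1, 0, 0, 0, 0, 0, 0, 0],
})
metrics = flag_metrics(toy["isFraud"], toy["isFlaggedFraud"])
print(metrics)
```

On the real data the call would be `flag_metrics(df['isFraud'], df['isFlaggedFraud'])`.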

## Fraud analysis over time

In [None]:
# Number of frauds per step (hour of the simulation)
fraud_counts = df[df['isFraud'] == 1].groupby('step').size()
fraud_counts

In [None]:
# Plot fraud_counts (fraud per each hour of the month)
plt.figure(figsize=(14, 6))
plt.plot(fraud_counts.index, fraud_counts.values, label="Fraud per Hour of the Month")
plt.xlabel('Hour of the Month (Step)')
plt.ylabel('Number of Frauds')
plt.title('Fraud Distribution Over the Month (Step)')
plt.legend()
plt.show()

In [None]:
# Get the top 5 hours with the highest fraud counts
top_5_fraud_counts = fraud_counts.nlargest(5)

# Display the top 5 hours
print("Top 5 Hours with Highest Fraud Counts:")
print(top_5_fraud_counts)

In [None]:
# Calculates fraud count for each hour of the day (0 to 23)
fraud_counts_by_hour = df[df['isFraud'] == 1].groupby(df['step'] % 24).size()

# Show results
fraud_counts_by_hour

In [None]:
# Plot fraud_counts_by_hour (fraud per each hour of the day)
plt.figure(figsize=(10, 5))
plt.bar(fraud_counts_by_hour.index, fraud_counts_by_hour.values, color='orange', label="Fraud per Hour of the Day")
plt.xlabel('Hour of the Day')
plt.ylabel('Number of Frauds')
plt.title('Fraud Distribution by Hour of the Day')
plt.xticks(range(24))  # Sets the x-axis to show all hours of the day (0 to 23)
plt.legend()
plt.show()

In [None]:
# Get the top 5 hours with the highest fraud counts by hour
top_5_fraud_counts_by_hour = fraud_counts_by_hour.nlargest(5)

# Display the top 5 hours
print("Top 5 Hours with Highest Fraud Counts:")
print(top_5_fraud_counts_by_hour)

The top 5 hours by fraud count are tightly clustered, with only small numerical differences between them. Certain hours, such as 10, 2, and 8, show slightly elevated fraud activity, but overall the distribution has no pronounced spikes or fluctuations across the hours of the day.

This consistent pattern indicates that fraud attempts may be occurring throughout the day rather than being concentrated in specific high-risk periods.
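
One way to put a number on "limited variation" is the coefficient of variation together with the ratio between the busiest and quietest hour. The sketch below uses hypothetical hourly counts, not the notebook's output; with the real data, `fraud_counts_by_hour` would take the place of `toy_counts`.

```python
import pandas as pd

# Toy hourly fraud counts for hours 0-23 (hypothetical values).
toy_counts = pd.Series([100, 110, 90, 105] * 6, index=range(24), name="fraud_count")

# Coefficient of variation: standard deviation relative to the mean.
cv = toy_counts.std() / toy_counts.mean()

# Ratio between the busiest and the quietest hour.
max_min_ratio = toy_counts.max() / toy_counts.min()

print(f"CV = {cv:.3f}, max/min = {max_min_ratio:.2f}")
```

A small coefficient of variation and a max/min ratio close to 1 would support the reading that fraud is spread evenly across the day.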

## Distribution of transaction amounts and balances

In [None]:
# Plot the transaction amount split by fraud/non-fraud
sns.boxplot(data=df, x='isFraud', y='amount')
plt.xlabel('Fraud Status (0 = Non-Fraud, 1 = Fraud)')
plt.ylabel('Transaction Amount')
plt.title('Transaction Amounts by Fraud and Non-Fraud')
plt.show()

In [None]:
# Checking distribution of source and destination accounts when FRAUD
features_counts = ['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

fig = plt.figure(figsize=(12, 8))

for i, column in enumerate(features_counts):
    ax = fig.add_subplot(2, 2, i + 1)  # One subplot per balance column
    sns.boxplot(data=df[df.isFraud == 1], y=column, ax=ax)  # Draw on this subplot explicitly
    ax.set_title(f'Distribution of {column} When Fraud')  # Add title for clarity

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show()


In [None]:
# Filter only fraudulent transactions
fraudulent_transactions = df[df['isFraud'] == 1]

In [None]:
# Descriptive statistics for amount
amount_stats_amount = fraudulent_transactions['amount'].describe()
print("Descriptive Statistics for Amount in Fraudulent Transactions:")
print(amount_stats_amount)

In [None]:
# Descriptive statistics for balances
balance_stats = fraudulent_transactions[['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']].describe()

print("\nDescriptive Statistics for Balances in Fraudulent Transactions:")
print(balance_stats)

Overall Conclusions

- High-value transactions: fraudulent transactions often involve large amounts of money, suggesting that fraudsters target substantial sums.

- Variability: the high standard deviations across amounts and balances indicate a diverse range of behaviors in fraudulent transactions.

- Impact on origin accounts: the balances of origin accounts drop sharply after the transaction, consistent with funds being drained as part of the fraud.

- Targeting of accounts: accounts with high balances are frequently targeted, both for withdrawals and for transfers.

- Potential for multiple transactions: the wide range of amounts and balances may mean that fraudsters combine smaller transactions with larger ones to evade detection, although the statistics above do not show this directly.
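
The claim that origin accounts are drained can be measured directly. The sketch below uses toy rows with hypothetical values and computes the share of fraudulent transactions that move the full prior balance and leave the origin account at zero; `drained_share` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Toy fraudulent transactions (hypothetical values).
toy_fraud = pd.DataFrame({
    "amount":         [1000.0, 5000.0, 200.0, 750.0],
    "oldbalanceOrg":  [1000.0, 5000.0, 900.0, 750.0],
    "newbalanceOrig": [0.0,    0.0,    700.0, 0.0],
})

# Origin account left empty after the transaction.
emptied = toy_fraud["newbalanceOrig"] == 0

# Amount equals the full prior balance (isclose tolerates float rounding).
full_balance = np.isclose(toy_fraud["amount"], toy_fraud["oldbalanceOrg"])

drained_share = (emptied & full_balance).mean()
print(f"Share of fraud rows that drain the origin account: {drained_share:.0%}")
```

Applied to `fraudulent_transactions`, the same expression would give the actual share in the dataset.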

## Analyze transaction types

In [None]:
# Count the transactions of each type
df['type'].value_counts()

In [None]:
# Calculate the percentage of each transaction type
type_percentage = df['type'].value_counts(normalize=True) * 100

# Display the result
print('The percentage of each transaction \n', type_percentage)

In [None]:
# Plotting the distribution of transaction types with distinct colors for each type
plt.figure(figsize=(8, 4))

# Countplot with a custom palette to apply different colors to each transaction type
sns.countplot(x='type', data=df, palette='Set2')

# Setting the title and axis labels
plt.title('Distribution of Transaction Types')
plt.xlabel('Transaction Type')
plt.ylabel('Number of Transactions')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.show()


In [None]:
# Count the transactions of each type among fraudulent transactions
type_fraud = df[df['isFraud'] == 1]['type'].value_counts()
type_fraud

In [None]:
# Calculate the percentage of each fraud transaction type
type_fraud_percentage = df[df['isFraud'] == 1]['type'].value_counts(normalize=True) * 100

# Display the result
print('The percentage of each fraud transaction type \n', type_fraud_percentage)

In [None]:
sns.countplot(x='type', data=df[df['isFraud'] == 1], palette='Set2')
plt.title('Distribution of Fraudulent Transaction Types')
plt.xlabel('Transaction Type')
plt.ylabel('Number of Fraudulent Transactions')
plt.xticks(rotation=45)
plt.show()

## Evaluating account names

In [None]:
df.head()

In [None]:
# Number of transactions per origin account
df.nameOrig.value_counts()

In [None]:
# Number of transactions per destination account
df['nameDest'].value_counts()

The same destination name appears in far more transactions than any single origin name does.

It would be useful to conduct a deeper analysis of the destination names.

In [None]:
# Analyze 'nameDest' for fraudulent transactions

# Group fraudulent transactions by 'nameDest' and count occurrences
fraudulent_nameDest_counts = df[df['isFraud'] == 1]['nameDest'].value_counts()

# Display the top N most frequent 'nameDest' in fraudulent transactions
N = 10  # Change N to see a different number of top names
print(f"Top {N} 'nameDest' involved in fraudulent transactions:")
print(fraudulent_nameDest_counts.head(N))

# Analyze the distribution of transaction amounts for these top 'nameDest'
for name in fraudulent_nameDest_counts.head(N).index:
    amounts = df[(df['isFraud'] == 1) & (df['nameDest'] == name)]['amount']
    print(f"\nTransaction amounts for 'nameDest' = {name}:")
    print(amounts.describe())


In [None]:
# Group by 'nameDest' and 'isFraud' and count transactions.
transaction_counts = df.groupby(['nameDest', 'isFraud']).size().unstack(fill_value=0)

# Get the 'nameDest' from fraudulent_nameDest_counts
names = fraudulent_nameDest_counts.index

# Filter transaction counts for the names
transaction_counts = transaction_counts.loc[names]

# Display the results
transaction_counts

Several recipients involved in fraudulent transactions received numerous transactions over the period, with only one or two of them labeled as fraudulent.

Let's examine the distribution of transaction counts for recipients who experienced at least one fraudulent transaction.

In [None]:
# Boxplot of transaction_counts

plt.figure(figsize=(12, 6))
sns.boxplot(data=transaction_counts)
plt.title('Boxplot of Transaction Counts for Recipients with Fraudulent Transactions')
plt.xlabel('Transaction Status (0: Not Fraud, 1: Fraud)')
plt.ylabel('Number of Transactions')
plt.show()

In [None]:
# Interquartile range (IQR) of transaction_counts

# Calculate the first quartile (Q1)
Q1 = transaction_counts.quantile(0.25)

# Calculate the third quartile (Q3)
Q3 = transaction_counts.quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

print("IQR of transaction_counts:")
IQR

## Does this distribution differ from that of other recipients (those with no involvement in fraudulent transactions)?

In [None]:
# Analyze transactions for nameDest where isFraud is 0
non_fraudulent_transactions = df[df['isFraud'] == 0]

# Group by 'nameDest' and count transactions
non_fraudulent_nameDest_counts = non_fraudulent_transactions['nameDest'].value_counts()

# Display the top N most frequent 'nameDest' in non-fraudulent transactions
N = 10  # Change N to see a different number of top names
print(f"Top {N} 'nameDest' involved in non-fraudulent transactions:")
print(non_fraudulent_nameDest_counts.head(N))

# Analyze the distribution of transaction amounts for these top 'nameDest'
for name in non_fraudulent_nameDest_counts.head(N).index:
    amounts = non_fraudulent_transactions[(non_fraudulent_transactions['nameDest'] == name)]['amount']
    print(f"\nTransaction amounts for 'nameDest' = {name}:")
    print(amounts.describe())

# For further analysis, you could compare these results with the fraudulent transactions
# to look for patterns or differences.  For example, you could compute the average
# transaction amount for each nameDest in both fraudulent and non-fraudulent
# transactions, and compare them using a box plot.

In [None]:
non_fraudulent_nameDest_counts

In [None]:
# Boxplot of non_fraudulent_nameDest_counts

plt.figure(figsize=(12, 6))
sns.boxplot(x=non_fraudulent_nameDest_counts)  # Horizontal box so the counts sit on the x-axis
plt.title('Boxplot of Transaction Counts for Recipients with Non-Fraudulent Transactions')
plt.xlabel('Number of Transactions')
plt.show()

In [None]:
# Interquartile range (IQR) of non_fraudulent_nameDest_counts

# Calculate the first quartile (Q1)
Q1 = non_fraudulent_nameDest_counts.quantile(0.25)

# Calculate the third quartile (Q3)
Q3 = non_fraudulent_nameDest_counts.quantile(0.75)

# Calculate the interquartile range (IQR)
IQR = Q3 - Q1

print(f"The Interquartile Range (IQR) of non_fraudulent_nameDest_counts is: {IQR}")

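The two distributions of transaction counts differ, as the boxplots and IQRs above show. Note that non_fraudulent_nameDest_counts counts non-fraudulent transactions for every recipient, including recipients that also received fraudulent ones, so the two groups overlap.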
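
A stricter comparison separates recipients with at least one fraudulent transaction from recipients with none, then compares the total number of transactions in each group. The sketch below does this on toy data with hypothetical recipients; `per_dest`, `group`, and `quartiles` are names introduced here.

```python
import pandas as pd

# Toy transactions (hypothetical recipients and labels).
toy = pd.DataFrame({
    "nameDest": ["A", "A", "A", "B", "B", "C", "D", "D", "D", "D"],
    "isFraud":  [0,   1,   0,   0,   0,   0,   1,   0,   0,   0],
})

# Per recipient: total transactions and number of fraudulent ones.
per_dest = toy.groupby("nameDest")["isFraud"].agg(total="size", frauds="sum")

# Recipients with at least one fraud vs. recipients with none.
per_dest["group"] = per_dest["frauds"].gt(0).map({True: "has_fraud", False: "no_fraud"})

# Quartiles and IQR of the total transaction count in each group.
quartiles = per_dest.groupby("group")["total"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

With the real data, `toy` would be replaced by `df`, which keeps the two groups disjoint.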

In [None]:
# Select only numerical features for correlation calculation
numerical_df = df.select_dtypes(include=['number'])

# plotting the correlation heatmap
plt.figure(figsize=(10,8))
sns.heatmap(numerical_df.corr().round(4),
            annot=True)
plt.show()

In [None]:
df['type'].value_counts()

In [None]:
# Convert the string column 'type' to integer codes

# Create a mapping for transaction types to numerical values
type_mapping = {
    'PAYMENT': 0,
    'TRANSFER': 1,
    'CASH_OUT': 2,
    'DEBIT': 3,
    'CASH_IN': 4
}

# Apply the mapping to the 'type' column
df['type_int'] = df['type'].map(type_mapping)

# Now 'type_int' contains the corresponding numerical values
df.head()

In [None]:
# Select only numerical features for correlation calculation
numerical_df = df.select_dtypes(include=['number'])

# plotting the correlation heatmap
plt.figure(figsize=(10,8))
sns.heatmap(numerical_df.corr().round(4),
            annot=True)
plt.show()

In [None]:
#correlation
correlation_matrix = numerical_df.corr()

# Display the correlation matrix
print(correlation_matrix)


In [None]:
#Selecting the highest correlations
correlation_matrix = numerical_df.corr()

# Transforms the correlation matrix into a DataFrame and resets the index
correlation_pairs = correlation_matrix.unstack().reset_index()
correlation_pairs.columns = ['Variable1', 'Variable2', 'Correlation']

# Removes the correlations of a variable with itself (.copy() so the column added below does not write to a slice)
correlation_pairs = correlation_pairs[correlation_pairs['Variable1'] != correlation_pairs['Variable2']].copy()

# Sorts by the absolute value of the correlation, in descending order
correlation_pairs['AbsCorrelation'] = correlation_pairs['Correlation'].abs()
sorted_correlation_pairs = correlation_pairs.sort_values(by='AbsCorrelation', ascending=False)

# Displays the pairs with the strongest correlations (e.g., above 0.4)
strong_correlations = sorted_correlation_pairs[sorted_correlation_pairs['AbsCorrelation'] > 0.4]
print(strong_correlations)


Strong Positive Correlations:

oldbalanceOrg and newbalanceOrig have an extremely high correlation (0.9988): the two move almost in lockstep, plausibly because the new balance of the origin account (newbalanceOrig) is derived from its old balance (oldbalanceOrg).
newbalanceDest and oldbalanceDest also have a very high correlation (0.9766), implying a similarly close relationship between these two variables.

Conclusion: Each of these pairs carries largely redundant information for predictive modeling. For models whose coefficient estimates are sensitive to multicollinearity, such as linear or logistic regression, we could consider dropping one variable from each pair.
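
As a minimal sketch of dropping one variable per highly correlated pair, the code below keeps only the strict upper triangle of the absolute correlation matrix, so each pair is considered once. The toy columns and the 0.95 cut-off are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric features (hypothetical): b is almost a copy of a, c is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns that are highly correlated with an earlier column.
THRESHOLD = 0.95  # assumed cut-off, tune as needed
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]

reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Which member of a pair to keep is a modeling choice; this sketch simply keeps the column that comes first.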

In [None]:
# Show the correlation of every feature with isFraud

# Calculate the correlation between all features and 'isFraud'
correlation_with_fraud = numerical_df.corr()['isFraud'].sort_values(ascending=False)

# Print the correlations
correlation_with_fraud

Weak Correlation with All Variables:

The highest correlation with isFraud is with amount (0.0767), which is still quite low. None of the variables have a strong or even moderate correlation with isFraud.
Other variables like isFlaggedFraud (0.0441) and step (0.0316) show even weaker positive correlations, while some features like type_int, oldbalanceDest, and newbalanceOrig have very slight negative correlations.

Implications:

These weak correlations suggest that none of these individual variables alone have a strong linear relationship with isFraud. This could imply that identifying fraud might require more complex interactions between features rather than relying on single-variable thresholds.
It also suggests that linear models may struggle to accurately predict fraud based solely on these features. Non-linear models or models that capture interactions between multiple variables (like decision trees, random forests, or neural networks) might perform better in detecting fraud.
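
A related caveat is that the integer codes assigned to `type` impose an arbitrary order, so the Pearson correlation of `type_int` with isFraud is hard to interpret. The sketch below, on toy data with hypothetical labels, shows two alternatives: the fraud rate per transaction type, and one indicator column per type via `pd.get_dummies`.

```python
import pandas as pd

# Toy data (hypothetical labels): fraud appears in only two of the five types.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN",
             "PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"],
    "isFraud": [0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Fraud rate per transaction type.
fraud_rate_by_type = toy.groupby("type")["isFraud"].mean()

# One indicator column per type instead of a single arbitrary integer code.
dummies = pd.get_dummies(toy["type"], prefix="type", dtype=int)
corr_with_fraud = dummies.corrwith(toy["isFraud"]).sort_values(ascending=False)

print(fraud_rate_by_type)
print(corr_with_fraud)
```

Indicator columns let each type carry its own signal, which is usually a safer input for both linear and tree-based models than a hand-assigned integer.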

## Fraud percentage relative to the number of transactions per hour

In [None]:
# Create a new column 'hour_of_day' representing the hour of the day (0-23)
df['hour_of_day'] = df['step'] % 24


In [None]:
# Calculate the total number of transactions per hour
transactions_per_hour = df.groupby('hour_of_day').size()

print(transactions_per_hour)

In [None]:
# Create a new column 'transactions_per_hour' in the DataFrame
df['transactions_per_hour'] = df['hour_of_day'].map(transactions_per_hour)
df

In [None]:
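# Count fraudulent transactions per hour (0 for hours without any fraud)
fraudulent_transactions_per_hour = (
    df[df['isFraud'] == 1]
    .groupby('hour_of_day')
    .size()
    .reindex(transactions_per_hour.index, fill_value=0)
)

# Calculate the percentage of fraudulent transactions per hour
fraud_percentage_per_hour = (fraudulent_transactions_per_hour / transactions_per_hour) * 100

# Print the results
print(fraud_percentage_per_hour)

# Plot the data for easier comparison across hours
plt.figure(figsize=(10, 6))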
plt.plot(fraud_percentage_per_hour.index, fraud_percentage_per_hour.values)
plt.xlabel('Hour of the Day')
plt.ylabel('Fraud Percentage')
plt.title('Fraud Percentage per Hour of the Day')
plt.show()