# Group 4 - Script Assignment 6 

## Problem Statement

In Financial Technology (FinTech), anomaly detection plays a crucial role in identifying fraudulent activities such as money laundering or unauthorized transactions. One popular method for anomaly detection is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which effectively identifies clusters of data points that are closely packed together while marking outliers as noise. In this assignment, you are tasked with implementing a Python script to detect anomalous transactions in financial data using DBSCAN.
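
To make the clustering-versus-noise behaviour concrete, here is a minimal sketch on synthetic points (not the bank data). The coordinates and the `eps` and `min_samples` values are illustrative assumptions; the point is only that DBSCAN assigns the label `-1` to points that do not belong to any dense region.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data for illustration only: two tight groups plus one isolated point
points = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],   # dense group B
    [10.0, 0.0],                                      # isolated point
])

# eps is the neighbourhood radius; min_samples is the number of points
# (including the point itself) required to form a dense region
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# Count clusters while ignoring the noise label (-1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_points = points[labels == -1]

print("labels:", labels)
print("clusters:", n_clusters, "| noise points:", len(noise_points))
```

In the script below, the same `-1` label is what marks a transaction as a candidate anomaly.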

## Download the provided financial transaction dataset. Preprocess the data if necessary (e.g., normalization, feature engineering). (5 points)

### Imports and Read File

In [27]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load transaction data from the chosen file
transactions = pd.read_excel('../data/bank.xlsx')
transactions

| | Account No | DATE | TRANSACTION DETAILS | CHQ.NO. | VALUE DATE | WITHDRAWAL AMT | DEPOSIT AMT | BALANCE AMT | . |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 409000611074' | 2017-06-29 | TRF FROM Indiaforensic SERVICES | | 2017-06-29 | | 1000000.0 | 1.000000e+06 | . |
| 1 | 409000611074' | 2017-07-05 | TRF FROM Indiaforensic SERVICES | | 2017-07-05 | | 1000000.0 | 2.000000e+06 | . |
| 2 | 409000611074' | 2017-07-18 | FDRL/INTERNAL FUND TRANSFE | | 2017-07-18 | | 500000.0 | 2.500000e+06 | . |
| 3 | 409000611074' | 2017-08-01 | TRF FRM Indiaforensic SERVICES | | 2017-08-01 | | 3000000.0 | 5.500000e+06 | . |
| 4 | 409000611074' | 2017-08-16 | FDRL/INTERNAL FUND TRANSFE | | 2017-08-16 | | 500000.0 | 6.000000e+06 | . |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116196 | 409000362497' | 2019-03-05 | TRF TO 1196428 Indiaforensic SE | | 2019-03-05 | 117934.30 | | -1.901902e+09 | . |
| 116197 | 409000362497' | 2019-03-05 | FDRL/INTERNAL FUND TRANSFE | | 2019-03-05 | | 300000.0 | -1.901602e+09 | . |
| 116198 | 409000362497' | 2019-03-05 | FDRL/INTERNAL FUND TRANSFE | | 2019-03-05 | | 300000.0 | -1.901302e+09 | . |
| 116199 | 409000362497' | 2019-03-05 | IMPS 05-03-20194C | | 2019-03-05 | 109868.65 | | -1.901412e+09 | . |


### Review dataset

In [28]:
transactions.describe()

| | DATE | CHQ.NO. | VALUE DATE | WITHDRAWAL AMT | DEPOSIT AMT | BALANCE AMT |
|---|---|---|---|---|---|---|
| count | 116201 | 905.0 | 116201 | 53549.0 | 62652.0 | 116201.0 |
| mean | 2017-05-20 00:08:40.477448448 | 791614.503867 | 2017-05-20 00:04:43.288439808 | 4489190.0 | 3806586.0 | -1404852000.0 |
| min | 2015-01-01 00:00:00 | 1.0 | 2015-01-01 00:00:00 | 0.01 | 0.01 | -2045201000.0 |
| 25% | 2016-05-30 00:00:00 | 704231.0 | 2016-05-30 00:00:00 | 3000.0 | 99000.0 | -1690383000.0 |
| 50% | 2017-06-05 00:00:00 | 873812.0 | 2017-06-05 00:00:00 | 47083.0 | 426500.0 | -1661395000.0 |
| 75% | 2018-05-26 00:00:00 | 874167.0 | 2018-05-26 00:00:00 | 5000000.0 | 4746411.0 | -1236888000.0 |
| max | 2019-03-05 00:00:00 | 874525.0 | 2019-03-05 00:00:00 | 459447500.0 | 544800000.0 | 8500000.0 |
| std | | 151205.93291 | | 10848500.0 | 8683093.0 | 534820200.0 |


In [29]:
transactions.isnull().sum()

Account No                  0
DATE                        0
TRANSACTION DETAILS      2499
CHQ.NO.                115296
VALUE DATE                  0
WITHDRAWAL AMT          62652
DEPOSIT AMT             53549
BALANCE AMT                 0
.                           0
dtype: int64
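
The missing-value counts for WITHDRAWAL AMT (62652) and DEPOSIT AMT (53549) add up to the 116201 rows in the dataset, which is consistent with each row recording either a withdrawal or a deposit. Under that assumption, one possible feature-engineering step is a single signed amount column. The sketch below uses a hypothetical three-row sample with the dataset's column names; `SIGNED AMT` is an illustrative name, not a column in the file.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample that reuses the dataset's column names
sample = pd.DataFrame({
    'WITHDRAWAL AMT': [np.nan, 117934.30, np.nan],
    'DEPOSIT AMT': [1000000.0, np.nan, 300000.0],
})

# True where exactly one of the two amount columns is populated
exactly_one = sample['WITHDRAWAL AMT'].notna() ^ sample['DEPOSIT AMT'].notna()
print("Every row has exactly one amount:", exactly_one.all())

# Signed amount: deposits positive, withdrawals negative
# ('SIGNED AMT' is a hypothetical column name used for illustration)
sample['SIGNED AMT'] = (
    sample['DEPOSIT AMT'].fillna(0) - sample['WITHDRAWAL AMT'].fillna(0)
)
print(sample)
```

Running the same `exactly_one` check on the full `transactions` frame would confirm or refute the assumption before any such feature is relied on.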

### Clean and drop features

In [30]:
# Replace missing values with 0 for WITHDRAWAL AMT and DEPOSIT AMT

transactions['WITHDRAWAL AMT'] = transactions['WITHDRAWAL AMT'].fillna(0)
transactions['DEPOSIT AMT'] = transactions['DEPOSIT AMT'].fillna(0)


# Remove trailing single quote from Account No
transactions['Account No'] = transactions['Account No'].str.replace("'", "")

In [31]:
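# Standardize the numeric amount columns only.
# StandardScaler cannot convert Timestamp or string columns to float,
# so passing the whole DataFrame raises a TypeError.
# CHQ.NO. is left out because it is missing for most rows.
feature_cols = ['WITHDRAWAL AMT', 'DEPOSIT AMT', 'BALANCE AMT']
scaler = StandardScaler()
transactions_scaled = scaler.fit_transform(transactions[feature_cols])

# fit_transform returns a NumPy array, so wrap it in a DataFrame to preview it
pd.DataFrame(transactions_scaled, columns=feature_cols).head(20)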

In [None]:

# DBSCAN clustering: small grid search over eps and min_samples
eps_values = [0.1, 0.5, 1.0]
min_samples_values = [5, 10, 15]

best_score = -1
best_params = None
best_labels = None

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(transactions_scaled)
        # Score = number of clusters, ignoring the noise label (-1)
        score = len(set(labels)) - (1 if -1 in labels else 0)
        if score > best_score:
            best_score = score
            best_params = (eps, min_samples)
            best_labels = labels

# Extracting the best parameters and clustering results
best_eps, best_min_samples = best_params
print("Best parameters: eps={}, min_samples={}".format(best_eps, best_min_samples))
print("Number of clusters found:", best_score)

# Assigning cluster labels to the original data
transactions['cluster'] = best_labels

# Accessing clustered transactions
for cluster_id in transactions['cluster'].unique():
    cluster_transactions = transactions[transactions['cluster'] == cluster_id]
    print(f"Cluster {cluster_id}:")
    print(cluster_transactions.head())

# Analyzing anomalies
anomaly_mask = best_labels == -1  
anomalies = transactions[anomaly_mask]

# Analyzing characteristics of anomalies
anomalies_description = anomalies.describe()
print("Characteristics of anomalies:")
print(anomalies_description)

# Visualizing clusters and outliers
plt.figure(figsize=(10, 6))

# Plotting clustered transactions
# Color by best_labels; `labels` only holds the result of the last grid iteration
plt.scatter(transactions['BALANCE AMT'], transactions['DATE'], c=best_labels, cmap='viridis', alpha=0.5)
plt.colorbar(label='Cluster')
plt.title('DBSCAN Clustering of Transactions')
plt.xlabel('BALANCE AMT')
plt.ylabel('DATE')
plt.grid(True)

# Highlighting anomalies
plt.scatter(anomalies['BALANCE AMT'], anomalies['DATE'], color='red', label='Anomalies')
plt.legend()

plt.show()
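
The grid above tries three fixed `eps` values. A common alternative heuristic, not used in this script, is to look at the sorted distance from each point to its k-th nearest neighbour and pick `eps` near the point where that curve bends upward. The sketch below runs on synthetic data standing in for the scaled features; the data, the seed, and `k = 5` are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features: one dense blob plus a few distant points
X = np.vstack([
    rng.normal(0, 1, size=(200, 3)),
    rng.normal(8, 1, size=(5, 3)),
])
X_scaled = StandardScaler().fit_transform(X)

k = 5  # would be set to the min_samples value under consideration

# Querying the training data itself returns each point as its own nearest
# neighbour, so the k-th column counts the point itself, as min_samples does
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

# Candidate eps values are usually read off where the sorted curve bends upward;
# a few percentiles give a rough numeric view of that curve
print(np.percentile(k_distances, [50, 90, 99]))
```

Plotting `k_distances` with `plt.plot` would show the bend visually. On the real data, `transactions_scaled` would take the place of `X_scaled`.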