<a href="https://colab.research.google.com/github/sn1218/Transactions_Fraud_Detection/blob/main/Testing_Clustering_Model_on_New_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Clustering Model on New Data
### <i>By Sum-Ye</i>

## Imports

In [21]:
! pip install hdbscan

Collecting hdbscan
  Downloading hdbscan-0.8.39-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Downloading hdbscan-0.8.39-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.39


In [23]:
# for data analytics
import pandas as pd
import numpy as np

# for working with files
from google.colab import files
import pickle

# for modelling
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import hdbscan
from sklearn.neighbors import NearestNeighbors

## Source Data

This notebook tests the model on a second 10% sample of the same dataset used in the notebook 'Online Payment Fraud Detection Using Machine Learning.ipynb'.

In [None]:
# upload token downloaded from kaggle in order to connect to the kaggle API
files.upload()

In [6]:
# make directory called kaggle
! mkdir ~/.kaggle
# move kaggle.json file into new directory
! cp kaggle.json ~/.kaggle/
# set permissions
! chmod 600 ~/.kaggle/kaggle.json

In [7]:
# download the online payments fraud detection dataset
! kaggle datasets download rupakroy/online-payments-fraud-detection-dataset
# unzip the folder
! unzip online-payments-fraud-detection-dataset
# remove the zipped file
! rm -r online-payments-fraud-detection-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/rupakroy/online-payments-fraud-detection-dataset
License(s): CC-BY-NC-SA-4.0
Downloading online-payments-fraud-detection-dataset.zip to /content
 92% 164M/178M [00:01<00:00, 141MB/s]
100% 178M/178M [00:01<00:00, 109MB/s]
Archive:  online-payments-fraud-detection-dataset.zip
  inflating: PS_20174392719_1491204439457_log.csv  


In [10]:
f = '/content/PS_20174392719_1491204439457_log.csv'
df = pd.read_csv(f)

**About the dataset:**
* **step:** represents a unit of time where 1 step equals 1 hour
* **type:** type of online transaction
* **amount:** the amount of the transaction
* **nameOrig:** customer starting the transaction
* **oldbalanceOrg:** sender's balance before the transaction
* **newbalanceOrig:** sender's balance after the transaction
* **nameDest:** recipient of the transaction
* **oldbalanceDest:** recipient's balance before the transaction
* **newbalanceDest:** recipient's balance after the transaction
* **isFraud:** whether the transaction is fraudulent (1) or not (0)

## Preparing New Data

### Producing New Sample

In [None]:
# upload previous df sample - we'll use it to make sure we choose data points that have not previously been used
# also upload high_fraud_cluster_lists and cluster_centroids
files.upload()

In [25]:
# unpickle files

with open('df_sampled.pkl', 'rb') as f:
    df_sampled = pickle.load(f)

with open('cluster_centroids.pkl', 'rb') as file:
    original_cluster_centroids = pickle.load(file)
    original_subcluster_centroids = pickle.load(file)

with open('high_fraud_cluster_lists.pkl', 'rb') as file:
    high_fraud_clusters = pickle.load(file)
    high_fraud_subclusters = pickle.load(file)
    high_fraud_subclusters_62 = pickle.load(file)

In [14]:
df_sampled.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6266285 | 615 | TRANSFER | 56510.5 | C956959892 | 56510.5 | 0.0 | C582952352 | 0.0 | 0.0 | 1 | 0 |
| 6358608 | 714 | TRANSFER | 10000000.0 | C131412689 | 10390763.24 | 390763.24 | C1846567377 | 0.0 | 0.0 | 1 | 0 |
| 4472587 | 324 | TRANSFER | 21580.37 | C765450059 | 21580.37 | 0.0 | C125935569 | 0.0 | 0.0 | 1 | 0 |
| 6293769 | 668 | CASH_OUT | 1576531.53 | C2000069930 | 1576531.53 | 0.0 | C1345448010 | 0.0 | 1576531.53 | 1 | 0 |
| 6055691 | 499 | TRANSFER | 488243.65 | C2117568077 | 488243.65 | 0.0 | C1238599279 | 0.0 | 0.0 | 1 | 0 |


In [16]:
# separate df into fraudulent and non-fraudulent
df_non_fraud = df[df['isFraud']==0]
df_fraud = df[df['isFraud']==1]

# separate df_sampled into fraudulent and non-fraudulent
df_non_fraud_sampled = df_sampled[df_sampled['isFraud']==0]
df_fraud_sampled = df_sampled[df_sampled['isFraud']==1]

In [34]:
# get a new sample from the original dataframe. This will act as our new data
df_fraud_remaining = df_fraud.drop(df_fraud_sampled.index)
df_non_fraud_remaining = df_non_fraud.drop(df_non_fraud_sampled.index)

# 10% like last time
df_non_fraud_sample_new = df_non_fraud_remaining.sample(n=635440, random_state=42)
df_fraud_sample_new = df_fraud_remaining.sample(n=821, random_state=42)

df_sampled_new = pd.concat([df_fraud_sample_new, df_non_fraud_sample_new])
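
Because the new sample is meant to contain only rows that the original sample did not use, it can be worth confirming that the two share no index labels. The following is a minimal sketch on a toy frame; `df_toy`, `previous_sample`, `remaining` and `new_sample` are hypothetical names standing in for `df`, `df_sampled` and the frames built above.

```python
import pandas as pd

# Toy stand-ins for the full dataset and the earlier sample (hypothetical data)
df_toy = pd.DataFrame({"amount": range(10), "isFraud": [0, 1] * 5})
previous_sample = df_toy.sample(n=4, random_state=0)

# Drop the rows that were already used, then sample from what remains
remaining = df_toy.drop(previous_sample.index)
new_sample = remaining.sample(n=4, random_state=42)

# The two samples should share no index labels
overlap = previous_sample.index.intersection(new_sample.index)
assert overlap.empty, f"{len(overlap)} rows appear in both samples"
print(len(remaining), len(new_sample), len(overlap))
```

The same `index.intersection` check could be run on `df_sampled` and `df_sampled_new`, assuming both still carry the row labels of the original `df`.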

In [35]:
df_sampled_new.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6362477 | 731 | CASH_OUT | 219934.26 | C2054799485 | 219934.26 | 0.0 | C997585991 | 0.0 | 219934.26 | 1 | 0 |
| 6303115 | 683 | TRANSFER | 93483.23 | C1834440439 | 93483.23 | 0.0 | C678733325 | 0.0 | 0.0 | 1 | 0 |
| 1511375 | 151 | CASH_OUT | 258103.59 | C2058115852 | 258103.59 | 0.0 | C1350667222 | 244362.91 | 502466.5 | 1 | 0 |
| 6127445 | 541 | TRANSFER | 108755.28 | C172956459 | 108755.28 | 0.0 | C249062179 | 0.0 | 0.0 | 1 | 0 |
| 6074514 | 517 | TRANSFER | 699183.61 | C735610108 | 699183.61 | 0.0 | C1408037662 | 0.0 | 0.0 | 1 | 0 |


### Feature Engineering

In [36]:
def prepare_features(df):

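  # find the relative hour of the day and a cyclical day-of-week feature (we don't know the absolute time that step starts at)
  # note: step counts hours, so step % 7 repeats every 7 hours; (step // 24) % 7 would repeat every 7 days
  df['hour'] = df['step'] % 24
  df['day_of_week'] = df['step'] % 7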

  # how large is the transaction relative to the sender's available balance - fraudulent activities might exhibit large ratios
  df['amount_to_oldbalance_ratio'] = df['amount'] / (df['oldbalanceOrg'] + 1e-5)  # to avoid division by zero

  # how much of the sender's balance is left after the transaction - draining the account might be indicative of fraud
  df['remaining_balance_ratio'] = df['newbalanceOrig'] / (df['oldbalanceOrg'] + 1e-5)

  # how large is the transaction relative to the recipient's initial balance - fraudulent transactions may involve sending large amounts to accounts with little to no initial balance
  df['amount_to_dest_balance_ratio'] = df['amount'] / (df['oldbalanceDest'] + 1e-5)

  # net change in the sender's balance after the transaction
  df['balance_change'] = df['newbalanceOrig'] - df['oldbalanceOrg']

  # net change for the recipient
  df['rec_balance_change'] = df['newbalanceDest'] - df['oldbalanceDest']

  # find time since last transaction
  # replace nulls (for first transactions) with large values because filling with 0 indicates no time gap between transactions which could distort clustering
  df['time_since_last_transaction_cust'] = df.groupby('nameOrig')['step'].diff().fillna(df['step'].max())
  df['time_since_last_transaction_rec'] = df.groupby('nameDest')['step'].diff().fillna(df['step'].max())

  # one-hot encode type category
  df = pd.get_dummies(df, columns=['type'])

  # sent to merchant or not
  df['is_merchant'] = df['nameDest'].apply(lambda x: 1 if x[0] == 'M' else 0)

  return df
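
The time-since-last-transaction features rely on `groupby(...).diff()`, which subtracts each row from the previous row of the same group in whatever order the rows currently sit. The sketch below, on hypothetical toy data, compares that row-order gap with a gap computed after sorting by `step`; the column names `gap_row_order` and `gap_time_order` are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy transactions, deliberately not in time order
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C1", "C1", "C2"],
    "step":     [10,   4,    3,    7,    9],
})

# Same fill strategy as prepare_features: first transactions get the largest step
fill_value = toy["step"].max()

# Row-order diff: the gaps depend on how the rows happen to be ordered
toy["gap_row_order"] = toy.groupby("nameOrig")["step"].diff().fillna(fill_value)

# Time-ordered diff: sort by step first; assignment aligns back on the index
toy["gap_time_order"] = (
    toy.sort_values("step").groupby("nameOrig")["step"].diff().fillna(fill_value)
)
print(toy)
```

On a randomly sampled frame the row order is generally not chronological, so the row-order variant can produce negative gaps; whether that matters for the clustering is not something this sketch settles.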

In [37]:
df_sampled_new = prepare_features(df_sampled_new)

In [38]:
df_sampled_new.head()

| | step | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | ... | balance_change | rec_balance_change | time_since_last_transaction_cust | time_since_last_transaction_rec | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER | is_merchant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6362477 | 731 | 219934.26 | C2054799485 | 219934.26 | 0.0 | C997585991 | 0.0 | 219934.26 | 1 | 0 | ... | -219934.26 | 219934.26 | 743.0 | 743.0 | False | True | False | False | False | 0 |
| 6303115 | 683 | 93483.23 | C1834440439 | 93483.23 | 0.0 | C678733325 | 0.0 | 0.0 | 1 | 0 | ... | -93483.23 | 0.0 | 743.0 | 743.0 | False | False | False | False | True | 0 |
| 1511375 | 151 | 258103.59 | C2058115852 | 258103.59 | 0.0 | C1350667222 | 244362.91 | 502466.5 | 1 | 0 | ... | -258103.59 | 258103.59 | 743.0 | 743.0 | False | True | False | False | False | 0 |
| 6127445 | 541 | 108755.28 | C172956459 | 108755.28 | 0.0 | C249062179 | 0.0 | 0.0 | 1 | 0 | ... | -108755.28 | 0.0 | 743.0 | 743.0 | False | False | False | False | True | 0 |
| 6074514 | 517 | 699183.61 | C735610108 | 699183.61 | 0.0 | C1408037662 | 0.0 | 0.0 | 1 | 0 | ... | -699183.61 | 0.0 | 743.0 | 743.0 | False | False | False | False | True | 0 |


In [20]:
features = [
    'hour',
    'day_of_week',
    'amount',
    'amount_to_oldbalance_ratio',
    'remaining_balance_ratio',
    'amount_to_dest_balance_ratio',
    'balance_change', 'rec_balance_change',
    'time_since_last_transaction_cust', 'time_since_last_transaction_rec',
    'type_CASH_IN', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'type_TRANSFER',
    'is_merchant'
]

In [39]:
# scale the features and reduce them to 4 principal components, as in the original notebook
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_sampled_new[features])
pca = PCA(n_components=4)
reduced_data_new = pca.fit_transform(X_scaled)
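
The cell above fits a fresh `StandardScaler` and `PCA` on the new sample. One alternative, which this notebook does not use, is to fit both transforms once on the original sample and only call `transform` on new data, so that both samples are projected into the same space. The sketch below illustrates that pattern on synthetic arrays; `X_reference`, `X_new`, `scaler_ref` and `pca_ref` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_reference = rng.normal(size=(200, 6))      # hypothetical original sample
X_new = rng.normal(loc=0.5, size=(50, 6))    # hypothetical new sample

# Fit the transforms once on the reference data...
scaler_ref = StandardScaler().fit(X_reference)
pca_ref = PCA(n_components=4).fit(scaler_ref.transform(X_reference))

# ...then only transform the new data, so both samples share one projection
reduced_new = pca_ref.transform(scaler_ref.transform(X_new))
print(reduced_new.shape)
```

In practice this would mean pickling the fitted scaler and PCA alongside the centroids in the original notebook, which is an assumption about that notebook rather than something it is known to do.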

## Modelling

Use HDBSCAN on the new data to form clusters.

In [40]:
# cluster using HDBSCAN
clusterer = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=10, prediction_data=True, approx_min_span_tree=True)
new_cluster_labels = clusterer.fit_predict(reduced_data_new)

#add cluster labels to dataframe
df_sampled_new['cluster'] = new_cluster_labels

# get centroids for the new clusters
new_cluster_centroids = df_sampled_new.groupby('cluster')[features].mean()

Map the new clusters to the original ones produced during the initial building of the model.

In [42]:
# fit a KNN model on the original centroids
knn = NearestNeighbors(n_neighbors=1)
knn.fit(original_cluster_centroids)

# find the nearest original cluster for each new cluster
distances, indices = knn.kneighbors(new_cluster_centroids)

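# map the new clusters to the original clusters
# note: enumerate() keys this map by row position (0, 1, 2, ...), while new_cluster_centroids
# is indexed by cluster label starting at -1 (noise), so position 0 is the noise centroid
new_to_original_cluster_map = {new_cluster: original_cluster_centroids.index[nearest_idx]
                               for new_cluster, nearest_idx in enumerate(indices)}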
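
The nearest-centroid mapping can also be keyed by the cluster labels themselves rather than by row position. The following is a minimal sketch on hypothetical toy centroids, not the notebook's own mapping: `original_centroids`, `new_centroids`, `knn_toy` and `label_map` are illustrative names, and the index of each frame holds the cluster labels with -1 as noise.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical centroids; the index holds cluster labels, with -1 as noise
original_centroids = pd.DataFrame(
    {"f1": [0.0, 5.0, 10.0], "f2": [0.0, 5.0, 10.0]}, index=[-1, 3, 7]
)
new_centroids = pd.DataFrame(
    {"f1": [9.0, 0.5, 4.0], "f2": [11.0, -0.5, 6.0]}, index=[-1, 0, 1]
)

knn_toy = NearestNeighbors(n_neighbors=1).fit(original_centroids.to_numpy())
_, nearest = knn_toy.kneighbors(new_centroids.to_numpy())

# nearest has shape (n_new, 1); flatten it and key the map by the new labels
label_map = dict(zip(new_centroids.index, original_centroids.index[nearest.ravel()]))
print(label_map)
```

Flattening `nearest` with `ravel()` yields plain labels as map values, so no unwrapping of `pd.Index` objects is needed afterwards.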

In [43]:
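# map new clusters to original clusters, and get just the number of the cluster
def lookup_original_cluster(label):
    # labels missing from the map (e.g. -1) fall back to themselves
    mapped = new_to_original_cluster_map.get(label, label)
    # each map value is a one-element pd.Index; unwrap it to the bare cluster number
    return mapped[0] if isinstance(mapped, pd.Index) else mapped

df_sampled_new['mapped_cluster'] = df_sampled_new['cluster'].apply(lookup_original_cluster)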

In [44]:
# preview the new cluster labels next to their mapped original clusters
df_sampled_new.head()

| | step | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | ... | time_since_last_transaction_cust | time_since_last_transaction_rec | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER | is_merchant | cluster | mapped_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6362477 | 731 | 219934.26 | C2054799485 | 219934.26 | 0.0 | C997585991 | 0.0 | 219934.26 | 1 | 0 | ... | 743.0 | 743.0 | False | True | False | False | False | 0 | -1 | -1 |
| 6303115 | 683 | 93483.23 | C1834440439 | 93483.23 | 0.0 | C678733325 | 0.0 | 0.0 | 1 | 0 | ... | 743.0 | 743.0 | False | False | False | False | True | 0 | -1 | -1 |
| 1511375 | 151 | 258103.59 | C2058115852 | 258103.59 | 0.0 | C1350667222 | 244362.91 | 502466.5 | 1 | 0 | ... | 743.0 | 743.0 | False | True | False | False | False | 0 | -1 | -1 |
| 6127445 | 541 | 108755.28 | C172956459 | 108755.28 | 0.0 | C249062179 | 0.0 | 0.0 | 1 | 0 | ... | 743.0 | 743.0 | False | False | False | False | True | 0 | -1 | -1 |
| 6074514 | 517 | 699183.61 | C735610108 | 699183.61 | 0.0 | C1408037662 | 0.0 | 0.0 | 1 | 0 | ... | 743.0 | 743.0 | False | False | False | False | True | 0 | 20 | 42 |


As we did in the original model, subcluster the noise cluster (mapped_cluster -1) and mapped_cluster 62.

In [45]:
try:
    # separate the points mapped to the noise cluster (-1), as in the original model;
    # .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
    noise_points_new = df_sampled_new[df_sampled_new['mapped_cluster'] == -1].copy()

    if noise_points_new.empty:
        print("No points mapped to cluster -1. Skipping subclustering.")
    else:
        # perform HDBSCAN subclustering on noise points
        subclusterer = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=5)
        noise_points_new['subcluster'] = subclusterer.fit_predict(noise_points_new[features])

        # assign subcluster labels back into the main dataframe (for noise points only)
        df_sampled_new.loc[df_sampled_new['mapped_cluster'] == -1, 'subcluster'] = noise_points_new['subcluster']

except Exception as e:
    # report any error raised during subclustering instead of stopping the notebook
    print(f"An error occurred: {e}")


In [46]:
try:
    # filter the new dataframe for points mapped to cluster 62
    cluster_62_new = df_sampled_new[df_sampled_new['mapped_cluster'] == 62].copy()

    # check if there are points to cluster
    if cluster_62_new.empty:
        print("No points mapped to cluster 62. Skipping subclustering.")
    else:
        # perform HDBSCAN subclustering on the points in cluster 62
        subclusterer_62 = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=5)
        cluster_62_new['subcluster'] = subclusterer_62.fit_predict(cluster_62_new[features])

        # assign subcluster labels back into the main dataframe for those mapped to cluster 62
        df_sampled_new.loc[df_sampled_new['mapped_cluster'] == 62, 'subcluster'] = cluster_62_new['subcluster']

except Exception as e:
    # Handle any exceptions that may occur
    print(f"An error occurred: {e}")

No points mapped to cluster 62. Skipping subclustering.


In [47]:
# find new subcluster centroids
new_subcluster_centroids = df_sampled_new.groupby(['mapped_cluster', 'subcluster'])[features].mean()

In [48]:
# fit a KNN model on the original subcluster centroids
knn_subcluster = NearestNeighbors(n_neighbors=1)
knn_subcluster.fit(original_subcluster_centroids)

# find the nearest original subcluster for each new subcluster
distances, indices = knn_subcluster.kneighbors(new_subcluster_centroids)

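# create a mapping from new subclusters to original subclusters
# note: enumerate() keys this map by row position in new_subcluster_centroids, not by (mapped_cluster, subcluster) label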
new_to_original_subcluster_map = {
    new_subcluster: original_subcluster_centroids.index[nearest_idx]
    for new_subcluster, nearest_idx in enumerate(indices)
}
df_sampled_new['mapped_subcluster'] = df_sampled_new['subcluster'].map(new_to_original_subcluster_map)

In [49]:
# look up each subcluster in the map and unwrap the one-element pd.Index to get the (cluster, subcluster) tuple
def lookup_original_subcluster(label):
    mapped = new_to_original_subcluster_map.get(label, label)
    return mapped[0] if isinstance(mapped, pd.Index) else mapped

df_sampled_new['mapped_subcluster'] = df_sampled_new['subcluster'].apply(lookup_original_subcluster)

# get 2nd value in tuple for the subcluster number
df_sampled_new['mapped_subcluster_value'] = df_sampled_new['mapped_subcluster'].apply(
    lambda x: x[1] if isinstance(x, tuple) and len(x) > 1 else x
)

Check whether the mapped clusters and subclusters are in the existing lists of high-risk clusters; if so, mark every transaction in that cluster as fraudulent.

In [50]:
# check if the mapped clusters and subclusters are classed as high risk

df_sampled_new['high_fraud_risk'] = df_sampled_new.apply(
    lambda row: (
        1 if (
            row['mapped_cluster'] in high_fraud_clusters or
            (row['mapped_cluster'] == 62 and
             row['mapped_subcluster_value'] in high_fraud_subclusters_62) or
            (row['mapped_cluster'] == -1 and
             row['mapped_subcluster_value'] in high_fraud_subclusters)
        ) else 0
    ),
    axis=1
)
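
The same rule can be expressed with boolean masks instead of a row-wise `apply`, which tends to be faster on a frame of this size. The sketch below uses hypothetical toy data; `high_clusters`, `high_subclusters_noise` and `high_subclusters_62` stand in for the unpickled lists, and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the unpickled lists of high-risk labels
high_clusters = [4, 9]
high_subclusters_noise = [2]
high_subclusters_62 = [5]

toy = pd.DataFrame({
    "mapped_cluster":          [4, 62, 62, -1, -1, 7],
    "mapped_subcluster_value": [None, 5, 1, 2, 3, None],
})

# One mask per branch of the rule
in_high_cluster = toy["mapped_cluster"].isin(high_clusters)
in_high_sub_62 = (toy["mapped_cluster"] == 62) & toy["mapped_subcluster_value"].isin(high_subclusters_62)
in_high_sub_noise = (toy["mapped_cluster"] == -1) & toy["mapped_subcluster_value"].isin(high_subclusters_noise)

# A row is high risk if any branch matches
toy["high_fraud_risk"] = (in_high_cluster | in_high_sub_62 | in_high_sub_noise).astype(int)
print(toy["high_fraud_risk"].tolist())
```

This assumes the high-risk lists hold plain labels that `isin` can compare directly against the mapped columns.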

Check how the model did.

In [51]:
# check how the model did

# true labels (actual fraud cases)
y_true = df_sampled_new['isFraud']

# predicted fraud risk based on clusters
y_pred = df_sampled_new['high_fraud_risk']

# confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

# classification report
report = classification_report(y_true, y_pred)
print("Classification Report:")
print(report)

Confusion Matrix:
[[626592   8848]
 [   763     58]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99    635440
           1       0.01      0.07      0.01       821

    accuracy                           0.98    636261
   macro avg       0.50      0.53      0.50    636261
weighted avg       1.00      0.98      0.99    636261
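
The fraud-class precision and recall in the report can be recomputed directly from the confusion matrix, which makes the rounded figures easier to interpret. This is a small worked example using only the four counts printed above.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions
cm = np.array([[626592, 8848],
               [   763,   58]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # share of flagged transactions that are fraud
recall = tp / (tp + fn)           # share of fraud cases that were flagged
accuracy = (tp + tn) / cm.sum()   # dominated by the non-fraud majority class
print(f"precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```

The high accuracy comes almost entirely from the 635,440 non-fraud rows, which is why precision and recall on the fraud class are the more informative figures here.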



## Conclusions

Although the model met the initial goal of >50% recall and 25% precision when first built, its performance degraded significantly on new data: recall fell to 7% (58 of 821 fraud cases flagged) and precision to about 1% (58 true positives among 8,906 flagged transactions).

To improve performance in future iterations, I would:
* Train the model on a larger, more representative sample of the data.
* Explore techniques for handling class imbalance more effectively, such as adjusting loss functions or applying oversampling techniques.
* Dedicate more time to subclustering different clusters and experimenting with alternative clustering techniques to ensure that the model captures general patterns rather than specific clusters.
* Apply more comprehensive validation techniques, such as k-fold cross-validation or testing on multiple datasets, to identify potential weaknesses earlier in the modelling process.
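
As one illustration of the validation point in the list above, stratified k-fold splitting keeps the fraud rate roughly constant across folds, which matters when fraud is this rare. The sketch below uses synthetic labels and scikit-learn's `StratifiedKFold`; the data and the 2% fraud rate are hypothetical, and how cluster-based predictions would be scored within each fold is left open.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 1,000 transactions, 2% fraud
y = np.array([1] * 20 + [0] * 980)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Each test fold should receive an equal share of the fraud cases
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_fraud_counts = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(fold_fraud_counts)
```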

In conclusion, while the model did not generalise well to the new data, this experience provides valuable insights into the limitations of the current approach. With these reflections, I can design a more robust and adaptable fraud detection model in future projects.