https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset

https://www.kaggle.com/code/soheiltehranipour/how-to-detect-fraud-in-crypto

About Dataset
Context

This dataset contains rows of known fraud and valid transactions made over Ethereum, a type of cryptocurrency. The dataset is imbalanced, so keep that in mind when modelling.
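
As a rough illustration of why the imbalance matters, the sketch below scores a trivial baseline that labels every sample as non-fraud. The class counts are the test-set figures quoted later in this notebook (1547 non-fraud, 422 fraud); the variable names are illustrative only.

```python
# Test-set class counts quoted later in this notebook (assumed representative).
n_non_fraud = 1547
n_fraud = 422

# Baseline: predict NON-FRAUD (0) for every sample.
true_negatives = n_non_fraud   # every non-fraud case is trivially correct
true_positives = 0             # no fraud case is ever flagged

accuracy = (true_positives + true_negatives) / (n_non_fraud + n_fraud)
recall = true_positives / n_fraud

print(f"baseline accuracy: {accuracy:.3f}")
print(f"baseline recall:   {recall:.3f}")
```

Accuracy alone looks respectable for this baseline even though it catches no fraud, which is why recall and the confusion matrix are more informative here.
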
Content

Here is a description of the columns of the dataset:

    Index: the index number of a row

    Address: the address of the Ethereum account

    FLAG: whether the transaction is fraud or not

    Avg min between sent tnx: Average time between sent transactions for account in minutes

    Avg_min_between_received_tnx: Average time between received transactions for account in minutes

    Time_Diff_between_first_and_last(Mins): Time difference between the first and last transaction in minutes

    Sent_tnx: Total number of sent normal transactions

    Received_tnx: Total number of received normal transactions

    Number_of_Created_Contracts: Total Number of created contract transactions

    Unique_Received_From_Addresses: Total Unique addresses from which account received transactions

    Unique_Sent_To_Addresses: Total Unique addresses to which account sent transactions

    Min_Value_Received: Minimum value in Ether ever received

    Max_Value_Received: Maximum value in Ether ever received

    Avg_Value_Received: Average value in Ether ever received

    Min_Val_Sent: Minimum value of Ether ever sent

    Max_Val_Sent: Maximum value of Ether ever sent

    Avg_Val_Sent: Average value of Ether ever sent

    Min_Value_Sent_To_Contract: Minimum value of Ether sent to a contract

    Max_Value_Sent_To_Contract: Maximum value of Ether sent to a contract

    Avg_Value_Sent_To_Contract: Average value of Ether sent to contracts

    Total_Transactions(Including_Tnx_to_Create_Contract): Total number of transactions

    Total_Ether_Sent: Total Ether sent for account address

    Total_Ether_Received: Total Ether received for account address

    Total_Ether_Sent_Contracts: Total Ether sent to Contract addresses

    Total_Ether_Balance: Total Ether Balance following enacted transactions

    Total_ERC20_Tnxs: Total number of ERC20 token transfer transactions

    ERC20_Total_Ether_Received: Total ERC20 token received transactions in Ether

    ERC20_Total_Ether_Sent: Total ERC20 token sent transactions in Ether

    ERC20_Total_Ether_Sent_Contract: Total ERC20 token transfer to other contracts in Ether

    ERC20_Uniq_Sent_Addr: Number of ERC20 token transactions sent to Unique account addresses

    ERC20_Uniq_Rec_Addr: Number of ERC20 token transactions received from Unique addresses

    ERC20_Uniq_Rec_Contract_Addr: Number of ERC20 token transactions received from Unique contract addresses

    ERC20_Avg_Time_Between_Sent_Tnx: Average time between ERC20 token sent transactions in minutes

    ERC20_Avg_Time_Between_Rec_Tnx: Average time between ERC20 token received transactions in minutes

    ERC20_Avg_Time_Between_Contract_Tnx: Average time between ERC20 token contract transactions in minutes

    ERC20_Min_Val_Rec: Minimum value in Ether received from ERC20 token transactions for account

    ERC20_Max_Val_Rec: Maximum value in Ether received from ERC20 token transactions for account

    ERC20_Avg_Val_Rec: Average value in Ether received from ERC20 token transactions for account

    ERC20_Min_Val_Sent: Minimum value in Ether sent from ERC20 token transactions for account

    ERC20_Max_Val_Sent: Maximum value in Ether sent from ERC20 token transactions for account

    ERC20_Avg_Val_Sent: Average value in Ether sent from ERC20 token transactions for account

    ERC20_Uniq_Sent_Token_Name: Number of Unique ERC20 tokens transferred

    ERC20_Uniq_Rec_Token_Name: Number of Unique ERC20 tokens received

    ERC20_Most_Sent_Token_Type: Most sent token for account via ERC20 transaction

    ERC20_Most_Rec_Token_Type: Most received token for account via ERC20 transactions


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, auc, classification_report

import pickle

In [None]:
df = pd.read_csv('/kaggle/input/ethereum-frauddetection-dataset/transaction_dataset.csv', index_col=0)
df.sample(3)

In [None]:
df.shape

In [None]:


# Drop the first two columns (Index, Address)
df = df.iloc[:,2:]



In [None]:
from pycaret.classification import *

In [None]:
setup(df,target="FLAG",session_id=85)

In [None]:
compare_models()

In [None]:
df.info()

In [None]:
for col in df:
    print(f'{col} : {df[col].nunique(dropna=False)}')

In [None]:
df.select_dtypes(include=['float','int']).describe()

In [None]:
df['FLAG'].value_counts()

In [None]:
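# Use the class labels (the index of value_counts) as slice names, not the counts themselves
flag_counts = df['FLAG'].value_counts()
fig = px.pie(values=flag_counts.values, names=flag_counts.index,
             title='Target distribution of being Fraud or not', color_discrete_sequence=px.colors.sequential.RdBu)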
fig.show()

In [None]:
print(f'Percentage of non-fraudulent instances : {len(df.loc[df["FLAG"]==0])/len(df["FLAG"])*100}')
print(f'Percentage of fraudulent instances : {len(df.loc[df["FLAG"]==1])/len(df["FLAG"])*100}')

In [None]:
df.isnull().sum()

In [None]:
# Select the object-dtype (categorical) columns
categories = df.select_dtypes('O').columns
df[categories]

In [None]:
# Drop the two categorical features
df.drop(columns=categories, inplace=True)

In [None]:
# Replace missings of numerical variables with median
df.fillna(df.median(), inplace=True)

In [None]:
df.isnull().sum()

In [None]:
# Filtering the features with 0 variance
no_var = df.var() == 0
df.var()[no_var]

In [None]:
# Drop features with 0 variance --- these features will not help in the performance of the model
df.drop(df.var()[no_var].index, axis = 1, inplace = True)
print(df.var())

In [None]:
df.shape

In [None]:
df.info()

In [None]:
corr = df.corr()

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)]=True
with sns.axes_style('white'):
    fig, ax = plt.subplots(figsize=(60,60))
    sns.heatmap(corr,  mask=mask, annot=True, cmap='CMRmap', center=0, linewidths=0.1, square=True,annot_kws={"size": 16})

In [None]:
sorted_corr = corr.sort_values(by=['FLAG'], key=abs)

In [None]:
sorted_corr

In [None]:
corr['min val sent']

In [None]:


drop = ['total transactions (including tnx to create contract',
        'total ether sent contracts',
        'max val sent to contract',
        ' ERC20 avg val rec',
        ' ERC20 max val rec',
        ' ERC20 min val rec',
        ' ERC20 uniq rec contract addr',
        'max val sent',
        ' ERC20 avg val sent',
        ' ERC20 min val sent',
        ' ERC20 max val sent',
        ' Total ERC20 tnxs',
        'avg value sent to contract',
        'Unique Sent To Addresses',
        'Unique Received From Addresses',
        'total ether received',
        ' ERC20 uniq sent token name',
        'min value received',
        'min val sent',
        ' ERC20 uniq rec addr' ]
df.drop(drop, axis=1, inplace=True)



In [None]:
# Recheck the Correlation matrix
corr = df.corr()

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)]=True
with sns.axes_style('white'):
    fig, ax = plt.subplots(figsize=(18,18))
    sns.heatmap(corr,  mask=mask, annot=True, cmap='CMRmap', center=0, linewidths=0.1, square=True,annot_kws={"size": 8})

In [None]:
columns = df.columns
columns


In [None]:
# Some features take only a few distinct values
for i in df.columns[1:]:
    if len(df[i].value_counts()) < 10:
        print(f'The column {i} has the following distribution: \n{df[i].value_counts()}')
        print('======================================')



In [None]:
drops = ['min value sent to contract', ' ERC20 uniq sent addr.1']
df.drop(drops, axis=1, inplace=True)
print(df.shape)
df.head()

In [None]:
y = df.iloc[:, 0]
X = df.iloc[:, 1:]
print(X.shape, y.shape)


In [None]:
# Split into training (80%) and testing set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
X_train


In [None]:
sc = StandardScaler()
sc_train = sc.fit_transform(X_train)


In [None]:
sc_df = pd.DataFrame(sc_train, columns=X_train.columns)
sc_df



In [None]:
oversample = SMOTE()
print(f'Shape of the training before SMOTE: {sc_train.shape, y_train.shape}')


In [None]:
x_tr_resample, y_tr_resample = oversample.fit_resample(sc_train, y_train)
print(f'Shape of the training after SMOTE: {x_tr_resample.shape, y_tr_resample.shape}')
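
SMOTE balances the classes by creating synthetic minority samples that lie between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step on two made-up points. It is a simplified illustration, not the imblearn implementation, and `smote_like_sample` is a hypothetical helper name.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)             # fixed seed so the sketch is repeatable
minority_point = [1.0, 2.0]        # made-up minority (fraud) sample
minority_neighbor = [3.0, 6.0]     # made-up nearby minority sample

synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because the synthetic points are derived from the training data, the oversampler is fitted on the training split only, as in the cell above, and the test set is left untouched.
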



In [None]:
# Target distribution before SMOTE
non_fraud = 0
fraud = 0

for i in y_train:
    if i == 0:
        non_fraud +=1
    else:
        fraud +=1

# Target distribution after SMOTE
no = 0
yes = 0  # start both counters at zero

for j in y_tr_resample:
    if j == 0:
        no +=1
    else:
        yes +=1


print(f'BEFORE OVERSAMPLING \n \tNon-frauds: {non_fraud} \n \tFrauds: {fraud}')
print(f'AFTER OVERSAMPLING \n \tNon-frauds: {no} \n \tFrauds: {yes}')

In [None]:
LR = LogisticRegression(random_state=42)
LR.fit(x_tr_resample, y_tr_resample)

# Transform test features
sc_test = sc.transform(X_test)

preds = LR.predict(sc_test)


In [None]:
print(y_test.shape)
y_test.value_counts()


In [None]:
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))




Considering the confusion matrix:

    The LR model correctly identified 367 FRAUD cases (TP) out of 422 (P).
    The LR model flagged 712 of the 1547 NON-FRAUD cases as FRAUD (FP).

In a fraud detection scenario, we care most about the transactions that were actually FRAUD but were treated as NON-FRAUD by our model (FN = 55), i.e. Type II errors.

Therefore, let's try to increase the recall using other methods.
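
The counts quoted above are enough to recompute the headline metrics by hand. The following is a small sketch using only those four numbers; the variable names are illustrative.

```python
# Counts taken from the confusion-matrix discussion above.
tp = 367          # frauds correctly flagged
positives = 422   # actual frauds in the test set
fp = 712          # non-frauds wrongly flagged as fraud
negatives = 1547  # actual non-frauds in the test set

fn = positives - tp   # frauds the model missed
tn = negatives - fp   # non-frauds correctly left alone

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"FN = {fn}, TN = {tn}")
print(f"recall = {recall:.3f}, precision = {precision:.3f}, F1 = {f1:.3f}")
```

The recall is reasonable, but the large number of false positives pulls the precision down sharply.
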




Random Forest Classifier


In [None]:
RF = RandomForestClassifier(random_state=42)
RF.fit(x_tr_resample, y_tr_resample)
preds_RF = RF.predict(sc_test)

print(classification_report(y_test, preds_RF))
print(confusion_matrix(y_test, preds_RF))



The RF classifier seems to produce more effective results:

    Both FP and FN are reduced considerably, increasing both recall and precision.
    Using RF, the model fails to detect 24 FRAUD cases.

Let's see if we can improve these results.




XGB Classifier


In [None]:
xgb_c = xgb.XGBClassifier(random_state=42)
xgb_c.fit(x_tr_resample, y_tr_resample)
preds_xgb = xgb_c.predict(sc_test)

print(classification_report(y_test, preds_xgb))
print(confusion_matrix(y_test, preds_xgb))



The XGBClassifier results show that it does slightly better than RF on NON-FRAUD transactions, flagging 19 cases as fraud when they were actually non-fraud.

When it comes to identifying FRAUD, XGBClassifier missed 19 transactions out of 422, giving the best recall score of the three models.

Given that, XGBClassifier is the model we choose.

Let's see if we can improve these results.
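
To make the comparison concrete, recall can be derived for each model from the number of missed FRAUD cases reported in the commentary (55 for LR, 24 for RF, 19 for XGB) and the 422 actual frauds in the test set. This is a sketch; the dictionary keys are just labels.

```python
# Missed frauds (false negatives) per model, as reported in the commentary.
missed_frauds = {"LogisticRegression": 55, "RandomForest": 24, "XGBClassifier": 19}
actual_frauds = 422

# Recall = detected frauds / actual frauds.
recalls = {
    name: (actual_frauds - missed) / actual_frauds
    for name, missed in missed_frauds.items()
}

for name, value in recalls.items():
    print(f"{name}: recall = {value:.4f}")

best_model = max(recalls, key=recalls.get)
print(f"Highest recall: {best_model}")
```
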




Hyperparameter tuning for XGB Classifier


In [None]:


params_grid = {'learning_rate': [0.01, 0.1, 0.5],
               'n_estimators': [100, 200],
               'subsample': [0.5, 0.9],
               'max_depth': [3, 4],
               'colsample_bytree': [0.3, 0.7]}

grid = GridSearchCV(estimator=xgb_c, param_grid=params_grid, scoring='recall', cv = 10, verbose = 0)

grid.fit(x_tr_resample, y_tr_resample)
print(f'Best params found for XGBoost are: {grid.best_params_}')
print(f'Best recall obtained by the best params: {grid.best_score_}')



Best params found for XGBoost are: {'colsample_bytree': 0.7, 'learning_rate': 0.5, 'max_depth': 4, 'n_estimators': 200, 'subsample': 0.9}
Best recall obtained by the best params: 0.9849451237123328
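
To put the search in perspective, the grid can be enumerated with the standard library. This is only a counting sketch, assuming the `params_grid` values and `cv = 10` from the cell above; it does not train any model, and `on_upper_edge` is an illustrative name.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV cell.
params_grid = {'learning_rate': [0.01, 0.1, 0.5],
               'n_estimators': [100, 200],
               'subsample': [0.5, 0.9],
               'max_depth': [3, 4],
               'colsample_bytree': [0.3, 0.7]}
cv_folds = 10

combinations = list(product(*params_grid.values()))
total_cv_fits = len(combinations) * cv_folds
print(f"{len(combinations)} combinations x {cv_folds} folds = {total_cv_fits} cross-validation fits")

# Best parameters as printed in the output above.
best_params = {'colsample_bytree': 0.7, 'learning_rate': 0.5, 'max_depth': 4,
               'n_estimators': 200, 'subsample': 0.9}

# Which of the selected values sit at the largest value tried for that parameter?
on_upper_edge = [name for name, value in best_params.items()
                 if value == max(params_grid[name])]
print(f"Selected values on the upper edge of the grid: {on_upper_edge}")
```

Every selected value sits at the top of its range, which may indicate that a wider grid is worth exploring. Note also that the cross-validated recall is measured on the SMOTE-resampled training data, so it is not directly comparable to recall on the untouched test set.
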


In [None]:


preds_best_xgb = grid.best_estimator_.predict(sc_test)
print(classification_report(y_test, preds_best_xgb))
print(confusion_matrix(y_test, preds_best_xgb))



In [None]:
# Plot the ROC curve and AUC for the untuned XGB Classifier (xgb_c)
probs = xgb_c.predict_proba(sc_test)
pred = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, pred)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(12,8))
plt.title('ROC for untuned XGB Classifier')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0,1], [0,1], 'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
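
The AUC shown in the legend can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. Below is a minimal sketch of that interpretation on made-up toy scores; the function name `auc_by_pairs` and the data are illustrative and not taken from the notebook.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and predicted fraud probabilities (made up for illustration).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = auc_by_pairs(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.2f}")
```

Three of the four positive/negative pairs are ordered correctly in this toy example, so the AUC is 0.75.
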

