# Ethereum Fraud Detection

This notebook builds a deep neural network (DNN) with Keras, fed by a scikit-learn preprocessing pipeline. The accuracy came out worse than that of the XGBoost model [here](https://www.kaggle.com/stuartday274/basic-xgb-model), but it was a fun learning experience. If anyone has pointers for improving the accuracy, I'd love to hear them :)

Here is a description of the columns of the dataset:
- Index: the index number of a row
- Address: the address of the ethereum account
- FLAG: whether the account is flagged as fraudulent or not
- Avg min between sent tnx: Average time between sent transactions for account in minutes
- Avgminbetweenreceivedtnx: Average time between received transactions for account in minutes
- TimeDiffbetweenfirstand_last(Mins): Time difference between the first and last transaction
- Sent_tnx: Total number of sent normal transactions
- Received_tnx: Total number of received normal transactions
- NumberofCreated_Contracts: Total Number of created contract transactions
- UniqueReceivedFrom_Addresses: Total Unique addresses from which account received transactions
- UniqueSentTo_Addresses: Total Unique addresses to which account sent transactions
- MinValueReceived: Minimum value in Ether ever received
- MaxValueReceived: Maximum value in Ether ever received
- AvgValueReceived: Average value in Ether ever received
- MinValSent: Minimum value of Ether ever sent
- MaxValSent: Maximum value of Ether ever sent
- AvgValSent: Average value of Ether ever sent
- MinValueSentToContract: Minimum value of Ether sent to a contract
- MaxValueSentToContract: Maximum value of Ether sent to a contract
- AvgValueSentToContract: Average value of Ether sent to contracts
- TotalTransactions(IncludingTnxtoCreate_Contract): Total number of transactions
- TotalEtherSent: Total Ether sent for account address
- TotalEtherReceived: Total Ether received for account address
- TotalEtherSent_Contracts: Total Ether sent to Contract addresses
- TotalEtherBalance: Total Ether Balance following enacted transactions
- TotalERC20Tnxs: Total number of ERC20 token transfer transactions
- ERC20TotalEther_Received: Total ERC20 token received transactions in Ether
- ERC20TotalEther_Sent: Total ERC20 token sent transactions in Ether
- ERC20TotalEtherSentContract: Total ERC20 token transfer to other contracts in Ether
- ERC20UniqSent_Addr: Number of ERC20 token transactions sent to Unique account addresses
- ERC20UniqRec_Addr: Number of ERC20 token transactions received from Unique addresses
- ERC20UniqRecContractAddr: Number of ERC20 token transactions received from Unique contract addresses
- ERC20AvgTimeBetweenSent_Tnx: Average time between ERC20 token sent transactions in minutes
- ERC20AvgTimeBetweenRec_Tnx: Average time between ERC20 token received transactions in minutes
- ERC20AvgTimeBetweenContract_Tnx: Average time between sent ERC20 token transactions
- ERC20MinVal_Rec: Minimum value in Ether received from ERC20 token transactions for account
- ERC20MaxVal_Rec: Maximum value in Ether received from ERC20 token transactions for account
- ERC20AvgVal_Rec: Average value in Ether received from ERC20 token transactions for account
- ERC20MinVal_Sent: Minimum value in Ether sent from ERC20 token transactions for account
- ERC20MaxVal_Sent: Maximum value in Ether sent from ERC20 token transactions for account
- ERC20AvgVal_Sent: Average value in Ether sent from ERC20 token transactions for account
- ERC20UniqSentTokenName: Number of Unique ERC20 tokens transferred
- ERC20UniqRecTokenName: Number of Unique ERC20 tokens received
- ERC20MostSentTokenType: Most sent token for account via ERC20 transaction
- ERC20MostRecTokenType: Most received token for account via ERC20 transactions
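The names in the list above are tidied versions of the CSV headers. Judging by the entries in `cols_to_drop` in the loading code, some raw headers carry a leading space or mix spaces with underscores. Below is a minimal sketch of one way to normalize such names. The `normalize` helper and the capitalization of the sample headers are assumptions for illustration; the notebook itself only lowercases the names, so its literal strings in `cols_to_drop` would need updating if this were adopted.

```python
import re


def normalize(name: str) -> str:
    """Hypothetical helper: trim, lowercase, and collapse spaces/underscores into '_'."""
    return re.sub(r'[\s_]+', '_', name.strip().lower())


# Sample headers (illustrative; the exact raw capitalization is an assumption)
raw = [
    ' ERC20 most sent token type',
    ' ERC20_most_rec_token_type',
    'Avg min between sent tnx',
]

clean = [normalize(c) for c in raw]
print(clean)
```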

# Load data

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn import metrics

In [2]:
df = pd.read_csv('transaction_dataset.csv')
df.columns = [x.lower() for x in df.columns]

In [3]:
cols_to_drop = [
    ' erc20 most sent token type',
    ' erc20_most_rec_token_type',
    'address',
    'index',
    'unnamed: 0'
]

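# Keep every column except the target and the dropped identifier/text columns
features = [x for x in df.columns if (x != 'flag' and x not in cols_to_drop)]

# Drop constant columns: a column with a single unique value carries no signal
unique_values = df.nunique()
non_constant = set(unique_values[unique_values > 1].index)
features = [x for x in features if x in non_constant]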


In [4]:
df[features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9841 entries, 0 to 9840
Data columns (total 38 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   avg min between sent tnx                              9841 non-null   float64
 1   avg min between received tnx                          9841 non-null   float64
 2   time diff between first and last (mins)               9841 non-null   float64
 3   sent tnx                                              9841 non-null   int64  
 4   received tnx                                          9841 non-null   int64  
 5   number of created contracts                           9841 non-null   int64  
 6   unique received from addresses                        9841 non-null   int64  
 7   unique sent to addresses                              9841 non-null   int64  
 8   min value received                                    9841

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin


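class BasePipeStep(BaseEstimator, TransformerMixin):
    """Base class for pipeline steps that operate on a list of DataFrame columns."""

    def __init__(self, columns=None):
        # None rather than [] so instances never share one mutable default list
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        return X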
    
class SelectColumns(BasePipeStep):
    
    def transform(self, X):
        X = X.copy()
        return X[self.columns]
    
class FillNumericData(BasePipeStep):
    
    def fit(self, X, y=None):
        self.means = { col: X[col].mean() for col in self.columns}
        return self
        
    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].fillna(self.means[col])
        return X


class ScaleNumeric(BasePipeStep):
    
    def fit(self, X, y=None):
        self.scaler = StandardScaler()
        self.scaler.fit(X[self.columns])
        return self
        
    def transform(self, X):
        X = X.copy()
        X[self.columns] = self.scaler.transform(X[self.columns])
        return X

class GetValues(BasePipeStep):
    
    def transform(self, X):
        X = X.copy()
        return X.values

In [6]:
from sklearn.pipeline import Pipeline
preprocessing = Pipeline([
    ('feature_selection', SelectColumns(features)),
    ('fill_missing', FillNumericData(features)),
    ('standard_scaling', ScaleNumeric(features)),
    ('returnValues', GetValues())
])

In [8]:
!pip install keras






In [11]:
from tensorflow.keras.utils import to_categorical

X = df[features]
y = df['flag']
y = to_categorical(y)


X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=42)
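`to_categorical` turns the 0/1 `flag` column into two-column one-hot rows, which matches the two-unit softmax output of the model, and `np.argmax` is used later to map rows back to class indices. A pure-Python sketch of that round trip on made-up labels (the helper names `one_hot` and `argmax` are illustrative, not Keras functions):

```python
def one_hot(labels, num_classes=2):
    """Encode integer class labels as one-hot rows."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]


def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)


flags = [0, 1, 1, 0]                      # made-up labels: 1 = fraud
encoded = one_hot(flags)                  # one row per label
decoded = [argmax(row) for row in encoded]

print(encoded)
print(decoded)
```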



In [12]:
X_train = preprocessing.fit_transform(X_train)
X_test = preprocessing.transform(X_test)
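Calling `fit_transform` on the training split and only `transform` on the test split means the imputation means and the scaling statistics come from training data alone. Here is a minimal pure-Python sketch of that fit/transform split for a single made-up column. It mirrors the order of the pipeline steps (fill with the training mean, then standardize the filled values) and uses `pstdev` because `StandardScaler` divides by the population standard deviation.

```python
from statistics import fmean, pstdev

train = [1.0, None, 3.0, 5.0]   # made-up column with one missing value
test = [None, 7.0]

# "fit": every statistic is computed from the training split only
fill = fmean(v for v in train if v is not None)           # imputation mean
filled_train = [fill if v is None else v for v in train]
mu, sigma = fmean(filled_train), pstdev(filled_train)      # scaling statistics


def transform(column):
    """Impute with the training mean, then standardize with training mu/sigma."""
    return [((fill if v is None else v) - mu) / sigma for v in column]


print(transform(train))
print(transform(test))
```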

In [34]:
from keras.models import Sequential
from keras.layers import Dense
from keras import Input
#create model
model = Sequential()
#add model layers
model.add(Input(shape=(len(features),)))

model.add(Dense(len(features), activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_4 (Dense)             (None, 38)                1482      
                                                                 
 dense_5 (Dense)             (None, 20)                780       
                                                                 
 dense_6 (Dense)             (None, 5)                 105       
                                                                 
 dense_7 (Dense)             (None, 2)                 12        
                                                                 
Total params: 2,379
Trainable params: 2,379
Non-trainable params: 0
_________________________________________________________________


In [35]:
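# Note: the 'false_positive' metric string named in the error below is not in the compile() call above, so this output seems to predate it; such counts need metric objects like keras.metrics.FalsePositives()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10)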

Epoch 1/10


ValueError: in user code:

    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\training.py", line 817, in train_step
        self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\compile_utils.py", line 439, in update_state
        self.build(y_pred, y_true)
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\compile_utils.py", line 359, in build
        self._metrics = tf.__internal__.nest.map_structure_up_to(y_pred, self._get_metric_objects,
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\compile_utils.py", line 485, in _get_metric_objects
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\compile_utils.py", line 485, in <listcomp>
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\engine\compile_utils.py", line 504, in _get_metric_object
        metric_obj = metrics_mod.get(metric)
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\metrics.py", line 3785, in get
        return deserialize(str(identifier))
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\metrics.py", line 3741, in deserialize
        return deserialize_keras_object(
    File "C:\Users\tziya\anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 708, in deserialize_keras_object
        raise ValueError(

    ValueError: Unknown metric function: false_positive. Please ensure this object is passed to the `custom_objects` argument. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.


In [45]:
from sklearn import metrics

# Convert softmax outputs and one-hot labels back to class indices (0 = not fraud, 1 = fraud)
test_prediction = np.argmax(model.predict(X_test), axis=1)
real_answers = np.argmax(y_test, axis=1)

acc = metrics.accuracy_score(real_answers, test_prediction)

print(real_answers.sum())       # accounts labelled as fraud in the test set
print(test_prediction.sum())    # accounts predicted as fraud
print(test_prediction.shape)
print(test_prediction.size)

print(f'Accuracy: {acc:,.2%}')

721
11
(3248,)
3248
Accuracy: 77.71%
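The printed totals allow a quick sanity check. Of the 3248 test accounts, 721 are labelled as fraud, while only 11 are predicted as fraud. A classifier that always answers "not fraud" would already be right on (3248 - 721) / 3248 of the test set, so the accuracy figure is best read against that baseline. The arithmetic, using only the numbers printed above:

```python
n_test = 3248    # test-set size printed above
n_fraud = 721    # accounts labelled as fraud in the test set

# Accuracy of always predicting the majority class ("not fraud")
baseline_acc = (n_test - n_fraud) / n_test
print(f'Majority-class baseline accuracy: {baseline_acc:.2%}')
```

The baseline lands slightly above the model's 77.71%, which suggests that accuracy alone says little on this imbalanced split.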


In [53]:
import keras

# ROC AUC from the predicted probability of the fraud class (column 1)
score = metrics.roc_auc_score([np.argmax(y) for y in y_test], model.predict(X_test)[:,1])

score1 = keras.metrics.AUC(
    num_thresholds=200,
    curve="ROC",
    summation_method="interpolation",
    name=None,
    dtype=None,
    thresholds=None,
    multi_label=False,
    num_labels=None,
    label_weights=None,
    from_logits=False,
)
print(f'Area under ROC of Model On Test Set - {score:,.2%}')

# Keras metrics take update_state(y_true, y_pred), in that order
m = keras.metrics.TruePositives()
m.update_state(real_answers, test_prediction)
print(m.result().numpy())

m = keras.metrics.FalsePositives()
m.update_state(real_answers, test_prediction)
print(m.result().numpy())

m = keras.metrics.TrueNegatives()
m.update_state(real_answers, test_prediction)
print(m.result().numpy())

m = keras.metrics.FalseNegatives()
m.update_state(real_answers, test_prediction)
print(m.result().numpy())

# Caution: the arguments here are (predictions, labels), the reverse of (y_true, y_pred),
# and the predictions are hard 0/1 labels rather than probabilities, so this value
# is not comparable to the scikit-learn AUC above
score1.update_state(test_prediction, real_answers)
print(score1.result())

Area under ROC of Model On Test Set - 58.11%
4.0
7.0
2520.0
717.0
tf.Tensor(0.57106745, shape=(), dtype=float32)
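Keras metrics expect `update_state(y_true, y_pred)`, so it helps to cross-check the confusion-matrix counts against totals that do not depend on argument order. The sketch below derives them from the figures printed earlier (3248 test accounts, 721 labelled as fraud, 11 predicted as fraud, 4 true positives) and computes precision, recall and F1 for the fraud class. The variable names are illustrative.

```python
n_test = 3248          # test-set size
actual_fraud = 721     # accounts labelled as fraud
predicted_fraud = 11   # accounts predicted as fraud
tp = 4                 # true positives

fp = predicted_fraud - tp        # predicted fraud, actually not fraud
fn = actual_fraud - tp           # actual fraud the model missed
tn = n_test - tp - fp - fn       # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'FP={fp}  FN={fn}  TN={tn}')
print(f'Precision: {precision:.2%}  Recall: {recall:.2%}  F1: {f1:.2%}')
```

The derived true-negative count agrees with the 2520 printed above. The very low recall indicates that the model labels almost every account as "not fraud".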
