<h1> Ethereum Fraud Detection EDA </h1>

<h3>The purpose of this notebook is to gain a better understanding of the data. It sets out to answer the following questions:</h3>
<h4>Q1. Do we have any missing values?</h4>
<h4>Q2. Is the data balanced?</h4>
<h4>Q3. Is the data skewed?</h4>
<h4>Q4. What feature values often belong to fraud accounts?</h4>
<h4>Q5. Is our data random or does it follow a certain trend?</h4>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from pandas_profiling import ProfileReport
from pandas.plotting import lag_plot,autocorrelation_plot

sns.set_style('ticks')
pd.set_option('display.max_columns',500)

In [None]:
df = pd.read_csv('/kaggle/input/transaction_dataset.csv')

The `Preprocessor` class below handles data cleaning. For now, I am only using it to remove unneeded features.

In [None]:
class Preprocessor:
    """
    Base Preprocessor class that will be used for
    any data preprocessing required
    """
    def __init__(self,df):
        self.df = df

    def clean(self):
        self.remove_features()
        self.drop_duplicates()
    
    def add_columns(self,inference=0):
        """
        This method assigns column names to the data fetched via the REST API
        Parameters:
        inference = currently unused
        Returns:
        None; the column names of self.df are set in place
        """
        # Define list of columns
        cols = ['Index',
                         'Address',
                         'FLAG',
                         'Avg min between sent tnx',
                         'Avg min between received tnx',
                         'Time Diff between first and last (Mins)',
                         'Sent tnx',
                         'Received Tnx',
                         'Number of Created Contracts',
                         'Unique Received From Addresses',
                         'Unique Sent To Addresses',
                         'min value received',
                         'max value received ',
                         'avg val received',
                         'min val sent',
                         'max val sent',
                         'avg val sent',
                         'min value sent to contract',
                         'max val sent to contract',
                         'avg value sent to contract',
                         'total transactions (including tnx to create contract',
                         'total Ether sent',
                        #  'total ether received',
                         'total ether sent contracts',
                         'total ether balance',
                         ' Total ERC20 tnxs',
                         ' ERC20 total Ether received',
                         ' ERC20 total ether sent',
                         ' ERC20 total Ether sent contract',
                         ' ERC20 uniq sent addr',
                         ' ERC20 uniq rec addr',
                         ' ERC20 uniq sent addr.1',
                         ' ERC20 uniq rec contract addr',
                         ' ERC20 avg time between sent tnx',
                         ' ERC20 avg time between rec tnx',
                         ' ERC20 avg time between rec 2 tnx',
                         ' ERC20 avg time between contract tnx',
                         ' ERC20 min val rec',
                         ' ERC20 max val rec',
                         ' ERC20 avg val rec',
                         ' ERC20 min val sent',
                         ' ERC20 max val sent',
                        #  ' ERC20 avg val sent',
                         ' ERC20 min val sent contract',
                         ' ERC20 max val sent contract',
                         ' ERC20 avg val sent contract',
                         ' ERC20 uniq sent token name',
                         ' ERC20 uniq rec token name',
                         ' ERC20 most sent token type',
                         ' ERC20_most_rec_token_type']

        # Assign the column names to the DataFrame
        self.df.columns = cols

    def remove_features(self,inference=False):
        """
        This method removes unnecessary features in place
        Parameters:
        inference = currently unused
        Returns:
        None; the unneeded columns are dropped from self.df
        """
        # Remove unnecessary fields
        unneeded = [
            'Index',
            'Address',
            ' ERC20 uniq sent token name',
            ' ERC20 uniq rec token name',
            ' ERC20 most sent token type',
            ' ERC20_most_rec_token_type',
            ' ERC20 min val sent contract',
            ' ERC20 max val sent contract',
            ' ERC20 avg val sent contract',
            'min value sent to contract',
            'max val sent to contract',
            'avg value sent to contract',
            ' ERC20 avg time between sent tnx',
            ' ERC20 avg time between rec tnx',
            ' ERC20 avg time between rec 2 tnx',
            'total ether sent contracts',
            ' ERC20 avg time between contract tnx',
            ' ERC20 total Ether sent contract',
            ' ERC20 uniq sent addr.1',
        ]
        self.df.drop(unneeded,axis=1,inplace=True)

    def drop_duplicates(self):
        self.df.drop_duplicates(inplace=True)

In [None]:
preprocessor = Preprocessor(df)
preprocessor.remove_features()

In [None]:
ProfileReport(df,minimal=True)

In [None]:
df.head(5)

In [None]:
df.info()

In [None]:
df.nunique()

Straight away, we can see that several features have missing values, so either imputation or removal will be required.

In [None]:
df.skew()

We can also see here that the majority of our features are heavily skewed, so we will have to apply feature engineering and possibly some transformations to them.

In [None]:
df.describe()

Here we can see that the features all lie in different ranges. Usually, we would normalise our features before training. However, I am going to use a tree-based model, which splits on feature thresholds and is unaffected by feature scale, so normalisation is not needed here.

<h1>Q1. Do we have any missing values?</h1>

In [None]:
df.isnull().sum()

We can see that there are 12 features, each missing 829 values. In other words:

In [None]:
print(f'Percentage of missing rows: {829 / len(df) * 100:.1f}%')

8.4% of the rows have missing values. Possible courses of action:

1. Drop the rows with NaN values
2. Impute the NaN values
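
The two options above can be sketched on a toy frame. The feature names and values below are made up for illustration, and median imputation is only one assumed choice of strategy, not necessarily the one used later on.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    'FLAG': [0, 0, 1, 1, 0],
    'feature_a': [1.0, 2.0, np.nan, 4.0, 100.0],
    'feature_b': [10.0, np.nan, np.nan, 40.0, 50.0],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100

# Option 1: drop every row that contains at least one NaN
dropped = toy.dropna()

# Option 2: fill each NaN with the median of its column
imputed = toy.fillna(toy.median())
```

Dropping is simpler but discards whole rows, while imputing keeps every row at the cost of inventing values for the missing cells.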

Let's take a closer look at the rows with missing values:

In [None]:
df[df.isnull().any(axis=1)]

Here, we notice something: all the missing values seem to belong to fraudulent accounts. We can confirm this:

In [None]:
df[df.isnull().any(axis=1)]['FLAG'].value_counts()

Our theory is confirmed: all the rows with missing values belong to the positive (fraudulent) class!
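
This kind of check can also be done by counting rows with missing values per class. The sketch below uses a toy frame with made-up values; only the `FLAG` column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): missing values only occur where FLAG == 1
toy = pd.DataFrame({
    'FLAG': [0, 0, 0, 1, 1, 1],
    'feature_a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan],
})

# True for rows that have at least one missing value
has_missing = toy.isnull().any(axis=1)

# Number of rows with missing values in each class
missing_by_class = has_missing.groupby(toy['FLAG']).sum()
```

A count of zero for class 0 means that every incomplete row belongs to the positive class.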

<h1>Q2. Is the data balanced?</h1>

In [None]:
# Pass the column by keyword; newer seaborn versions no longer accept it positionally
sns.countplot(x='FLAG', data=df)
plt.show()

In [None]:
df['FLAG'].value_counts()

In [None]:
print(f'Percentage of non-fraudulent instances: {7662 / len(df) * 100:.0f}%')

In [None]:
print(f'Percentage of fraudulent instances: {2179 / len(df) * 100:.0f}%')

We can clearly see here that the data is heavily imbalanced, with only 22% of the accounts labelled as fraudulent. Possible courses of action:

1. Oversampling/Undersampling.
2. Leaving it as it is for the model.
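
As a rough illustration of the first option, here is a minimal sketch of random oversampling using only pandas. The toy frame and the `feature_a` column are made up, and random oversampling is just one of several resampling techniques. Resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.

```python
import pandas as pd

# Toy imbalanced data (hypothetical values): 8 negatives, 2 positives
toy = pd.DataFrame({
    'FLAG': [0] * 8 + [1] * 2,
    'feature_a': range(10),
})

# Class proportions, computed from the data rather than hard-coded
proportions = toy['FLAG'].value_counts(normalize=True)

majority = toy[toy['FLAG'] == 0]
minority = toy[toy['FLAG'] == 1]

# Random oversampling: draw minority rows with replacement
# until both classes have the same number of rows
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

# Combine both classes and shuffle the rows
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)
```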

<h1>Q3. Is the data skewed?</h1>

In [None]:
df.skew()

The answer is yes. Some features, such as `ERC20 avg val sent`, are heavily skewed, with most of the mass concentrated at one end of the range and a long tail on the other side. The exception is `total ether balance`, which is only slightly skewed, and in the opposite direction.

If we plot a KDE plot of `ERC20 avg val sent`:

In [None]:
# `bw` is deprecated in recent seaborn versions; `bw_method` is its replacement
sns.kdeplot(df.dropna()[' ERC20 avg val sent'],bw_method=1.5)
plt.show()

The resulting density does not resemble any standard distribution. However, when we apply a Box-Cox transformation to the data:

In [None]:
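erc20_avg_sent = df.dropna()[' ERC20 avg val sent']
# Estimate the Box-Cox lambda on the shifted values (Box-Cox needs positive data)
lmbda = boxcox_normmax(erc20_avg_sent + 1)
sns.kdeplot(boxcox1p(erc20_avg_sent, lmbda), bw_method=1.5)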
plt.show()

We get data that is much closer to a normal distribution!
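
The effect of the transformation can also be checked numerically by comparing the skewness before and after. The sketch below uses synthetic log-normal draws as a stand-in for a skewed feature, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

rng = np.random.default_rng(0)

# Synthetic, non-negative feature with a long right tail (hypothetical data)
values = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# Estimate the Box-Cox lambda on the shifted values
lam = boxcox_normmax(values + 1)

# boxcox1p(x, lam) applies the Box-Cox transform to 1 + x
transformed = boxcox1p(values, lam)

skew_before = pd.Series(values).skew()
skew_after = pd.Series(transformed).skew()
```

A skewness close to zero only indicates symmetry. A Q-Q plot or a normality test would be needed to back up a stronger claim about normality.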

<h1>Q4. What feature values often belong to fraud accounts?</h1>

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df.corr(),annot=False,cmap='coolwarm',fmt='')
plt.show()

In [None]:
df.corr()['FLAG'].sort_values(ascending=False)[1:]

As we can see, no feature is strongly correlated with `FLAG`. The most correlated feature is `Time Diff between first and last (Mins)`, with a correlation of around -0.26. However, there could be some underlying relationships that a linear correlation does not capture:

In [None]:
plt.figure(figsize=(10,10))
# Pass x and y by keyword; newer seaborn versions require it
sns.barplot(x=df['Number of Created Contracts'],y=df['FLAG'])
plt.show()

We can see that the more contracts an account has created, the more likely it is to be fraudulent.
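
Another way to look for relationships that Pearson correlation understates is rank-based (Spearman) correlation, which `DataFrame.corr` supports through its `method` argument. The sketch below uses synthetic data with a made-up monotonic but non-linear relationship. It is not a result from this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)

toy = pd.DataFrame({
    'feature_a': x,
    # Monotonic but strongly non-linear in feature_a (hypothetical relationship)
    'target': np.exp(x),
})

# Pearson measures linear association, Spearman measures monotonic association
pearson = toy.corr(method='pearson').loc['feature_a', 'target']
spearman = toy.corr(method='spearman').loc['feature_a', 'target']
```

Here the Spearman coefficient is higher than the Pearson coefficient, because the ranks of the two columns agree even though the relationship is far from a straight line.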

<h1>Q5. Is our data random or does it follow a certain trend?</h1>

In [None]:
plt.figure(figsize=(15,15))
autocorrelation_plot(df['total ether balance'])
plt.show()

In [None]:
plt.figure(figsize=(10,10))
lag_plot(df['total ether balance'])
plt.show()

Here we can see that the data is mostly random, with the autocorrelation plot showing that most of the points lie within the 99% confidence band.

The lag plot tells a similar story: many of the points are clustered at the center and, apart from a few extreme values, they do not follow any trend.
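
The same reading can be made numerically with `Series.autocorr`. The sketch below compares a purely random series with a trending one, both synthetic, against the half-width of a 99% band. I am assuming the usual normal approximation `z / sqrt(n)` for that band.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# An i.i.d. series (no structure) and a series with a clear upward trend
random_series = pd.Series(rng.normal(size=n))
trend_series = pd.Series(np.arange(n) + rng.normal(size=n))

# Lag-1 autocorrelation: correlation between each value and the previous one
random_ac = random_series.autocorr(lag=1)
trend_ac = trend_series.autocorr(lag=1)

# Half-width of a 99% confidence band under the normal approximation
band_99 = 2.5758 / np.sqrt(n)
```

An autocorrelation well outside the band, as for the trending series, suggests that the order of the rows carries information. Values close to zero are consistent with random ordering.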