# EDA Version 2

In this notebook, I continue where v1 left off, but for now I use only `data/test_data`.

In [1]:
import mysql.connector
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

In [None]:
db_password = os.environ.get('DB_PASSWORD')

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password=db_password,  # Use your MySQL password
    database="blockchain_fraud"
)

# Fetch all transactions
query = "SELECT * FROM transactions;"
df = pd.read_sql(query, conn)

# Close connection
conn.close()


In [None]:
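# Drop the database id column and name the DataFrame index 'id' instead
# (axis=0 targets the row index; axis=1 would name the columns axis)
df.drop(columns=['id'], inplace=True)
df.rename_axis('id', axis=0, inplace=True)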

In [None]:
df.describe()

Here, we can see that most transactions are relatively small, at around 1-2 ether. The maximum value (as of the last check) was 33183 ether, roughly four orders of magnitude larger than a typical transaction.

In [None]:
df.info()

No NaN values, which is a good sign.

Let's check for duplicate values in `hash`, since it should be a unique key.

In [None]:
df.duplicated(subset=['hash']).sum()

No duplicates, which is also good.

Let's check the distributions of some of our features.
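
If the count above were non-zero, it would help to look at the offending rows rather than just the count. A minimal sketch on a small made-up frame (the `toy` DataFrame and its hash strings are invented for illustration; only the `hash` column name comes from the real table):

```python
import pandas as pd

# Hypothetical toy data; the real table has many more columns.
toy = pd.DataFrame({
    "hash": ["0xaa", "0xbb", "0xaa", "0xcc"],
    "value": [1.0, 2.5, 1.0, 0.3],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy[toy.duplicated(subset=["hash"], keep=False)].sort_values("hash")
print(dupes)

# The same check expressed as a uniqueness test on the column itself.
print(toy["hash"].is_unique)
```

Using `keep=False` shows both copies side by side, which makes it easier to tell an exact re-insert from two different transactions that happen to share a hash.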

In [None]:
def plot_log_scatter(index, values, title, xlabel, ylabel):
    """Scatter-plot values against index with a log-scaled y-axis."""
    sns.set_style("whitegrid")

    plt.figure(figsize=(12, 6))

    # Tiny, semi-transparent markers keep dense regions readable
    plt.scatter(index,
                values,
                marker='o',
                s=1,
                alpha=0.5,
                color='black',
                label="Transaction Value")

    # Use log scale for better visualization of outliers
    plt.yscale("log")

    plt.xticks(np.linspace(0, len(values), num=10, dtype=int))

    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    plt.title(title, fontsize=16, fontweight='bold')

    plt.grid(True, linestyle='--', alpha=0.6)

    plt.legend()

    plt.show()


In [None]:
plot_log_scatter(df.index,
                df['value'],
                'Distribution of transaction value (ETH)',
                'Transaction Index',
                'ETH Value (Log Scale)')

In [None]:
plt.figure(figsize=(12, 6))

sns.histplot(df['value'], bins=50, log_scale=True, kde=True)

plt.xlabel("Transaction Value (ETH)", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.title("Distribution of Ethereum Transaction Values", fontsize=16)
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()


On a log scale, the transaction values in ETH look roughly normal (that is, approximately log-normal), with the mass leaning slightly toward the right side. There appear to be at least as many outliers on the left as on the right, possibly more.
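
One way to put a number on that visual impression is the sample skewness of the log-transformed values. The sketch below uses synthetic log-normal draws as a stand-in for `df['value']` (the `values` array, its parameters, and the seed are assumptions for illustration, not data from the table). On the real column, the sign of the log-space skew would indicate which tail is longer.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df['value']: log-normal draws (assumption, not real data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=2.0, size=5_000)

# Zero-value transactions have no logarithm, so drop them first.
log_values = np.log10(values[values > 0])

# Skewness near 0 means the log-values are roughly symmetric;
# positive means a longer right tail, negative a longer left tail.
raw_skew = stats.skew(values)
log_skew = stats.skew(log_values)
print(f"skew of raw values:   {raw_skew:.2f}")
print(f"skew of log10 values: {log_skew:.2f}")
```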

In [None]:
plt.figure(figsize=(20, 5))
sns.boxplot(x=df['value'], showfliers=True)

plt.xlabel("Transaction Value (ETH)", fontsize=20)
plt.title("Ethereum Transaction Value Distribution", fontsize=25)
plt.show()
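
The individual points a box plot draws beyond its whiskers are, under the default whisker setting of 1.5, values more than 1.5 × IQR away from the quartiles. A small sketch of that rule, using a made-up `values` array in place of `df['value']` (the numbers are illustrative assumptions only):

```python
import numpy as np

# Synthetic stand-in for df['value'] (assumption, not real data).
values = np.array([0.1, 0.5, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 40.0, 500.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Default whisker rule: anything past 1.5 * IQR from the box is a flier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

fliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"fliers: {fliers}")
```

Because transaction values cannot be negative, the lower fence often falls below zero on skewed data like this, so the rule tends to flag only large transactions.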

Let's find out where our outliers are (for the larger transactions).

In [None]:
def find_outliers(data, threshold):
    """Return positional indices of entries whose z-score is >= threshold."""
    zscores = stats.zscore(data)
    return np.where(zscores >= threshold)[0]

outliers_gt_2 = find_outliers(df['value'], 2)
outliers_gt_3 = find_outliers(df['value'], 3)

# np.where returns positions, so select with .iloc rather than by label
outliers_gt_2_val = df['value'].iloc[outliers_gt_2].sort_values()
outliers_gt_3_val = df['value'].iloc[outliers_gt_3].sort_values()

In [None]:
print(f'Number of transactions with a Z-score of 2 or higher: {len(outliers_gt_2_val)}')
print('--------------------------------------------------------------------------------')
print(f'Values with a Z-score of 2 or higher:\n {outliers_gt_2_val}')

In [None]:
print(f'Number of transactions with a Z-score of 3 or higher: {len(outliers_gt_3_val)}')
print('--------------------------------------------------------------------------------')
print(f'Values with a Z-score of 3 or higher:\n {outliers_gt_3_val}')

Having the z-score on hand in our data can be a useful feature, so let's add a column for it.

In [None]:
df['value_zscore'] = stats.zscore(df['value'])
df.head()
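
Because the raw values are strongly right-skewed, a few very large transfers dominate the mean and standard deviation, so the raw z-score mostly separates those transfers from everything else. A possible complement, sketched here on made-up numbers, is a z-score computed in log space. The `toy` frame and the `value_log_zscore` column name are hypothetical and not part of the original data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the transactions table (assumption, not real data).
toy = pd.DataFrame(
    {"value": [0.001, 0.5, 1.0, 1.2, 1.5, 2.0, 2.2, 3.0, 5.0, 30000.0]}
)

# Raw z-score: the single huge transfer inflates the mean and standard deviation.
toy["value_zscore"] = stats.zscore(toy["value"].to_numpy())

# Log-space z-score (hypothetical column name). Zero values would need to be
# masked first because log10(0) is undefined; this toy frame has none.
toy["value_log_zscore"] = stats.zscore(np.log10(toy["value"].to_numpy()))

print(toy.round(2))
```

In this toy example the raw z-score leaves the 0.001 ETH transfer indistinguishable from the typical ones, while the log-space score places it nearly two standard deviations below the mean.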

Something else that might be useful is detecting transaction patterns between senders and recipients. Unusually frequent transfers between the same pair of addresses can be a sign of suspicious activity, such as wash trading.

Let's use a SQL query to conduct this analysis.

In [None]:
# Open another connection
db_password = os.environ.get('DB_PASSWORD')

# Connect to MySQL
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password=db_password,  # Use your MySQL password
    database="blockchain_fraud"
)

query = '''SELECT sender, recipient, COUNT(*) AS num_transactions
           FROM transactions
           GROUP BY sender, recipient
           HAVING COUNT(*) > 50'''

df_wash_trading = pd.read_sql(query, conn)

conn.close()

In [None]:
df_wash_trading.describe()
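
For reference, the same `GROUP BY ... HAVING` aggregation can also be expressed in pandas, which is handy when the transactions are already loaded into a DataFrame. This is a sketch on invented data: the `toy` frame, its single-letter addresses, and the lowered `threshold` are assumptions for illustration, and only the `sender` and `recipient` column names come from the query above.

```python
import pandas as pd

# Hypothetical toy transfers; addresses are made up for illustration.
toy = pd.DataFrame({
    "sender":    ["A", "A", "B", "A", "C", "B"],
    "recipient": ["B", "B", "A", "B", "A", "A"],
})

# pandas equivalent of GROUP BY sender, recipient HAVING COUNT(*) > threshold.
threshold = 2  # the SQL query uses 50; lowered so the toy data has a match
pair_counts = (
    toy.groupby(["sender", "recipient"])
       .size()
       .reset_index(name="num_transactions")
)
frequent = pair_counts[pair_counts["num_transactions"] > threshold]
print(frequent)
```

Note that, like the SQL version, this counts each direction separately: transfers from A to B and from B to A form two different groups.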