# Ethereum Fraud Detection System

This notebook documents our Ethereum fraud detection system, which uses machine learning to flag potentially fraudulent wallets based on their transaction behavior.

## Project Overview

Our fraud detection system consists of two main components:
1. A model training pipeline (`model.py`)
2. A prediction system for new wallets (`predict.py`)

The system uses a Random Forest classifier trained on blockchain transaction data to identify patterns associated with fraudulent activities in the Ethereum network.

## 1. Model Training (`model.py`)

The `model.py` script implements a complete machine learning pipeline for fraud detection:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, classification_report
import pickle
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('csv_files/transaction_dataset.csv', index_col=0)
print(f"Dataset shape: {df.shape}")

# Handle categorical columns
categories = df.select_dtypes('O').columns
print(f"Dropping categorical columns: {list(categories)}")
df.drop(columns=categories, inplace=True)

# Fill missing values with median
df.fillna(df.median(), inplace=True)

# Remove features with zero variance
no_var = df.var() == 0
if no_var.any():
    zero_var_cols = list(no_var[no_var].index)
    print(f"Dropping zero variance features: {zero_var_cols}")
    df.drop(columns=zero_var_cols, inplace=True)

# Drop redundant or less important features
drop = ['total transactions (including tnx to create contract', 'total ether sent contracts', 
        'max val sent to contract', ' ERC20 avg val rec', ' ERC20 max val rec', ' ERC20 min val rec', 
        ' ERC20 uniq rec contract addr', 'max val sent', ' ERC20 avg val sent', ' ERC20 min val sent', 
        ' ERC20 max val sent', ' Total ERC20 tnxs', 'avg value sent to contract', 'Unique Sent To Addresses',
        'Unique Received From Addresses', 'total ether received', ' ERC20 uniq sent token name', 
        'min value received', 'min val sent', ' ERC20 uniq rec addr', 'min value sent to contract', 
        ' ERC20 uniq sent addr.1']

# Only drop columns that exist in the dataframe
drop_existing = [col for col in drop if col in df.columns]
if drop_existing:
    df.drop(drop_existing, axis=1, inplace=True)
    print(f"Dropped {len(drop_existing)} redundant features")

# Display updated dataset shape
print(f"Dataset shape after cleaning: {df.shape}")

# Split features and target
y = df['FLAG']
X = df.drop('FLAG', axis=1)
print(f"Features: {X.shape}, Target: {y.shape}")

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")

# Normalize the features
norm = PowerTransformer()
norm_train_f = norm.fit_transform(X_train)
norm_test_f = norm.transform(X_test)

# Before SMOTE - Class distribution
print(f"Before SMOTE - Class distribution: {np.bincount(y_train)}")
fraud_nonfraud_before = np.bincount(y_train)
plt.figure(figsize=(8, 6))
plt.pie(fraud_nonfraud_before, labels=['Non-Fraud', 'Fraud'], autopct='%1.1f%%', startangle=90, colors=['skyblue', 'orange'])
plt.title('Fraud vs Non-Fraud Distribution (Before SMOTE)')
plt.savefig('images/fraud_distribution_before_smote.png')
plt.show()

# Apply SMOTE to handle class imbalance
oversample = SMOTE(random_state=42)  # fixed seed so the resampling is reproducible
x_tr_resample, y_tr_resample = oversample.fit_resample(norm_train_f, y_train)
print(f"After SMOTE - Class distribution: {np.bincount(y_tr_resample)}")

# After SMOTE - Class distribution
fraud_nonfraud_after = np.bincount(y_tr_resample)
plt.figure(figsize=(8, 6))
plt.pie(fraud_nonfraud_after, labels=['Non-Fraud', 'Fraud'], autopct='%1.1f%%', startangle=90, colors=['skyblue', 'orange'])
plt.title('Fraud vs Non-Fraud Distribution (After SMOTE)')
plt.savefig('images/fraud_distribution_after_smote.png')
plt.show()

# Train Random Forest model
print("Training Random Forest model...")
RF = RandomForestClassifier(random_state=42, n_estimators=100)
RF.fit(x_tr_resample, y_tr_resample)

# Make predictions
preds_RF = RF.predict(norm_test_f)

# Evaluate model
print("\nModel Evaluation:")
print(classification_report(y_test, preds_RF))
print(f"Confusion Matrix:\n{confusion_matrix(y_test, preds_RF)}")
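# ROC AUC is computed from predicted fraud probabilities rather than hard labels
probs_RF = RF.predict_proba(norm_test_f)[:, 1]
print(f"ROC AUC Score: {roc_auc_score(y_test, probs_RF)}")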

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': RF.feature_importances_
}).sort_values(by='Importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
plt.title('Feature Importance')
plt.tight_layout()
plt.savefig('images/feature_importance.png')  # save before show(), which can leave an empty figure
plt.show()

# Save the model
model_filename = 'models/ethereum_fraud_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(RF, file)
print(f"\nModel saved as {model_filename}")

# Save the scaler for future predictions
scaler_filename = 'models/ethereum_fraud_scaler.pkl'
with open(scaler_filename, 'wb') as file:
    pickle.dump(norm, file)
print(f"Scaler saved as {scaler_filename}")

### Model Training Pipeline Walkthrough

The `model.py` script implements a comprehensive machine learning workflow for fraud detection in Ethereum transactions:

1. **Data Loading and Preprocessing**
   - Loads transaction data from a CSV file
   - Removes categorical features that can't be directly used for modeling
   - Handles missing values by filling with median values
   - Removes features with zero variance that provide no discriminative information
   - Drops redundant or less important features based on domain knowledge

2. **Data Preparation**
   - Splits the data into features (X) and target (y)
   - Further splits into training and testing sets (60/40 split)
   - Normalizes features using `PowerTransformer` (Yeo-Johnson transform with standardization by default), fitted on the training set only and then applied to the test set

3. **Class Imbalance Handling**
   - Visualizes the class distribution before resampling (typically highly imbalanced in fraud detection)
   - Applies SMOTE (Synthetic Minority Over-sampling Technique) to the training set only, creating synthetic samples of the minority class
   - Visualizes the balanced class distribution after SMOTE

4. **Model Training and Evaluation**
   - Trains a Random Forest classifier with 100 trees
   - Makes predictions on the test set
   - Evaluates model using classification metrics (precision, recall, F1-score)
   - Generates confusion matrix and ROC AUC score

5. **Feature Importance Analysis**
   - Extracts feature importance scores from the Random Forest model
   - Visualizes the top 10 most important features

6. **Model Persistence**
   - Saves the trained model and scaler for future use
   - These files will be used by the prediction script
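
Step 4 reports a ROC AUC score. As a hedged illustration (the labels and scores below are invented for this example, not drawn from the dataset), the following sketch shows why ROC AUC is usually computed from predicted probabilities, such as the second column of `predict_proba`, rather than from hard class labels: thresholding discards the ranking information that the metric measures.

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and hypothetical fraud probabilities (illustrative values only)
y_true = [0, 0, 0, 1, 1]
fraud_probs = [0.1, 0.3, 0.6, 0.4, 0.9]

# Hard labels obtained by thresholding the probabilities at 0.5
hard_labels = [int(p >= 0.5) for p in fraud_probs]

# Same metric, two different inputs
auc_from_probs = roc_auc_score(y_true, fraud_probs)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(f"ROC AUC from probabilities: {auc_from_probs:.3f}")
print(f"ROC AUC from hard labels:   {auc_from_labels:.3f}")
```

On this toy input the two numbers differ noticeably, so it is worth checking which input a reported ROC AUC was computed from before comparing models.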

## 2. Prediction System (`predict.py`)

The `predict.py` script uses the trained model to make predictions on new wallet data:

In [None]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

def predict_fraud(wallet_data_path='csv_files/wallet_data.csv'):
    """
    Make fraud predictions on wallet data using the pre-trained model
    
    Parameters:
    wallet_data_path (str): Path to CSV file containing wallet data
    
    Returns:
    pd.DataFrame: DataFrame with wallet addresses and fraud predictions
    """
    # Load the pre-trained model and scaler
    try:
        print("Loading pre-trained model and scaler...")
        with open('models/ethereum_fraud_model.pkl', 'rb') as f:
            model = pickle.load(f)
        
        with open('models/ethereum_fraud_scaler.pkl', 'rb') as f:
            scaler = pickle.load(f)
    except FileNotFoundError as e:
        print(f"Error: {e}")
        print("Ensure that the model and scaler files exist in the 'models' directory.")
        return
    
    # Load wallet data
    print(f"Loading wallet data from {wallet_data_path}...")
    wallet_data = pd.read_csv(wallet_data_path, index_col=0)
    print(f"Wallet data shape: {wallet_data.shape}")
    
    
    # Store wallet addresses if present
    addresses = None
    if 'Address' in wallet_data.columns:
        addresses = wallet_data['Address'].copy()
        wallet_data = wallet_data.drop('Address', axis=1)
    
    # Handle missing values (numeric columns only, so text columns do not break median())
    wallet_data.fillna(wallet_data.median(numeric_only=True), inplace=True)
    
    # Drop FLAG column if present (for testing purposes)
    true_labels = None
    if 'FLAG' in wallet_data.columns:
        true_labels = wallet_data['FLAG'].copy()
        wallet_data = wallet_data.drop('FLAG', axis=1)
    
    # Align feature names with training data
    required_features = scaler.feature_names_in_  # Features used during training
    for feature in required_features:
        if feature not in wallet_data.columns:
            wallet_data[feature] = 0  # Add missing features with default value 0
    wallet_data = wallet_data[required_features]  # Select only required features
    
    # Scale features
    wallet_data_scaled = scaler.transform(wallet_data)
    
    # Make predictions
    print("Making predictions...")
    fraud_probs = model.predict_proba(wallet_data_scaled)[:, 1]
    fraud_preds = model.predict(wallet_data_scaled)
    
    # Create results dataframe
    if addresses is not None:
        results = pd.DataFrame({
            'Address': addresses,
            'Fraud_Prediction': fraud_preds,
            'Fraud_Probability': fraud_probs
        })
    else:
        results = pd.DataFrame({
            'Fraud_Prediction': fraud_preds,
            'Fraud_Probability': fraud_probs
        })
    
    # Sort by fraud probability
    results = results.sort_values('Fraud_Probability', ascending=False).reset_index(drop=True)
    
    # Save results
    results.to_csv('csv_files/fraud_predictions.csv', index=True)
    print("Predictions saved to 'csv_files/fraud_predictions.csv'")
    
    # Display top potential fraudsters
    print(results.head(10))
    
    # Plot fraud probability distribution
    plt.figure(figsize=(10, 6))
    sns.histplot(fraud_probs, bins=50)
    plt.title('Distribution of Fraud Probabilities')
    plt.xlabel('Probability of Fraud')
    plt.ylabel('Count')
    plt.savefig('images/fraud_probability_distribution.png')
    
    # Evaluate against true labels if available
    if true_labels is not None:
        from sklearn.metrics import classification_report, confusion_matrix
        print("\nModel performance on this dataset:")
        print(classification_report(true_labels, fraud_preds))
        print(f"Confusion Matrix:\n{confusion_matrix(true_labels, fraud_preds)}")
    
    return results

if __name__ == "__main__":
    predict_fraud()

### Prediction System Walkthrough

The `predict.py` script provides a streamlined workflow for making predictions on new wallet data:

1. **Model Loading**
   - Loads the pre-trained Random Forest model and PowerTransformer scaler from pickle files
   - If either file is missing, prints an error message and returns without making predictions

2. **Data Loading and Preprocessing**
   - Loads wallet data from a CSV file
   - Preserves wallet addresses for the final results
   - Fills missing values with column medians, as in training (the medians are recomputed from the input file rather than reused from the training set)
   - Optionally preserves true labels if present (for evaluation purposes)

3. **Feature Alignment**
   - Ensures that the input data has the same features used during training
   - Adds any missing feature as a column filled with 0
   - Selects only the required features in the correct order

4. **Prediction Generation**
   - Scales the features using the saved PowerTransformer
   - Generates both binary predictions (fraud/non-fraud) and fraud probabilities
   - Creates a comprehensive results DataFrame

5. **Results Processing**
   - Sorts results by fraud probability (highest risk first)
   - Saves predictions to a CSV file
   - Displays the 10 wallets with the highest fraud probability

6. **Visualization and Evaluation**
   - Generates a histogram of fraud probabilities
   - If true labels are available, evaluates model performance with metrics like precision, recall, and F1-score
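
The feature alignment in step 3 can also be expressed with `DataFrame.reindex`. The sketch below is one possible formulation, not the code in `predict.py`: the helper name `align_features` and the column names are hypothetical, and it additionally returns the list of features that had to be filled in so that gaps in the input are visible.

```python
import pandas as pd

def align_features(wallet_data, required_features):
    """Return (aligned_frame, missing_names).

    The aligned frame has exactly `required_features`, in that order.
    Columns absent from `wallet_data` are created and filled with 0;
    columns not listed in `required_features` are dropped.
    """
    required = list(required_features)
    missing = [name for name in required if name not in wallet_data.columns]
    aligned = wallet_data.reindex(columns=required, fill_value=0)
    return aligned, missing

# Hypothetical feature names, for illustration only
required = ['Sent tnx', 'Received Tnx', 'total Ether sent']
new_wallets = pd.DataFrame({
    'total Ether sent': [1.5, 0.0],
    'Sent tnx': [10, 3],
    'extra column': ['a', 'b'],
})

aligned, missing = align_features(new_wallets, required)
print(aligned)
print(f"Features filled with 0: {missing}")
```

Filling an absent feature with 0 keeps the pipeline running but may distort predictions, so reporting the missing names is a cheap safeguard.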

## Core Libraries and Their Roles

| Library | Purpose | Key Functions Used |
|---------|---------|-------------------|
| **Pandas** | Data manipulation and analysis | `read_csv`, `DataFrame`, `drop`, `fillna` |
| **NumPy** | Numerical operations | `bincount`, array operations |
| **Scikit-learn** | Machine learning functionality | `train_test_split`, `PowerTransformer`, `RandomForestClassifier`, evaluation metrics |
| **Imbalanced-learn** | Handling class imbalance | `SMOTE` |
| **Matplotlib/Seaborn** | Data visualization | `plt.figure`, `plt.pie`, `sns.barplot`, `sns.histplot` |
| **Pickle** | Model persistence | `dump`, `load` |

## Typical Workflow

1. **Training Phase** (run once):
   - Run `model.py` to train the fraud detection model
   - Review the model evaluation metrics and visualizations
   - Analyze feature importance to understand key fraud indicators

2. **Prediction Phase** (run as needed):
   - Prepare new wallet data in the required format
   - Run `predict.py` to generate fraud predictions
   - Review the sorted results to identify high-risk wallets
   - Analyze the distribution of fraud probabilities
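
The split into a one-time training phase and a repeatable prediction phase depends on the pickled model and scaler behaving identically after reloading. The sketch below checks that round trip in memory on synthetic stand-in data; the data, the tree count, and the variable names are assumptions made for this example, not values from the project.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in data (not the real transaction dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# "Training phase": fit the scaler, then the model on scaled features
scaler = PowerTransformer().fit(X)
model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(scaler.transform(X), y)

# Serialize to bytes in memory; model.py writes the same kind of objects to .pkl files
model_bytes = pickle.dumps(model)
scaler_bytes = pickle.dumps(scaler)

# "Prediction phase": restore both artifacts and score unseen rows
restored_model = pickle.loads(model_bytes)
restored_scaler = pickle.loads(scaler_bytes)

X_new = rng.normal(size=(5, 4))
original_probs = model.predict_proba(scaler.transform(X_new))[:, 1]
restored_probs = restored_model.predict_proba(restored_scaler.transform(X_new))[:, 1]

print(np.allclose(original_probs, restored_probs))
```

Pickle files should only be loaded from trusted sources, and loading them with a different scikit-learn version than the one used for training is not guaranteed to work.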

## Key Fraud Indicators (Feature Importance)

The Random Forest model identifies the most important features for detecting fraudulent wallets. These typically include:

1. Transaction frequency patterns
2. Value distribution of transactions
3. Network connectivity measures
4. Temporal behavior patterns
5. ERC20 token interaction patterns

The feature importance visualization in `model.py` provides specific insights into which features are most predictive in the current dataset.
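
To make the ranking concrete, the sketch below fits a small Random Forest on synthetic data and sorts its impurity-based importances. The data and the `feature_i` names are placeholders, so the ranking it prints says nothing about real fraud indicators.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for wallet features (names are placeholders)
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_names)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances: one non-negative score per feature, summing to 1
importance = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importance.head(3))
```

Impurity-based importances can favor features with many distinct values; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.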

## System Limitations and Considerations

1. **Class Imbalance**: Fraud detection typically deals with highly imbalanced datasets. While SMOTE helps address this, it creates synthetic data that may not perfectly represent real-world fraud patterns.

2. **Feature Engineering**: The system relies on pre-extracted features. Advanced feature engineering could potentially improve performance.

3. **Model Updates**: Fraud patterns evolve over time. Regular retraining with new data is recommended.

4. **False Positives**: High-risk predictions should be investigated further before taking action, as legitimate wallets may sometimes exhibit unusual patterns.

5. **Scalability**: For very large datasets or real-time prediction needs, additional optimizations may be required.
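
The first limitation is easier to see with the mechanics in view. The sketch below is a simplified NumPy illustration of the SMOTE idea, not the `imblearn` implementation: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. The function name, the points, and the parameter values are made up for this example.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create synthetic points by interpolating toward nearest minority neighbors.

    Simplified illustration of the SMOTE idea; requires k < len(minority).
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))          # pick a minority sample
        neighbor = minority[rng.choice(neighbors[base])]  # pick one of its k neighbors
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (neighbor - minority[base])
    return synthetic

# Four made-up minority-class points in a 2-D feature space
fraud_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(fraud_points, n_new=5)
print(new_points)
```

Because every synthetic point lies between existing minority samples, oversampling of this kind cannot produce fraud patterns outside the region already covered by the observed fraud cases.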

## Conclusion

This Ethereum fraud detection system provides a framework for identifying potentially fraudulent wallet addresses from their transaction patterns. It combines data preprocessing, class imbalance handling, and a Random Forest classifier; the classification report, confusion matrix, and ROC AUC printed by `model.py` show how well that combination performs on the held-out test set.

By leveraging both binary predictions and probability scores, the system enables risk-based prioritization of wallets for further investigation, helping to focus resources on the most suspicious cases.
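
As a minimal sketch of such prioritization, the example below groups made-up rows, shaped like the DataFrame returned by `predict_fraud()`, into risk tiers. The `Risk_Tier` column and the cut-offs of 0.3 and 0.7 are hypothetical; suitable thresholds would depend on investigation capacity and the cost of false positives.

```python
import pandas as pd

# Made-up rows in the same shape as the results returned by predict_fraud()
results = pd.DataFrame({
    'Address': ['0xaaa', '0xbbb', '0xccc', '0xddd'],
    'Fraud_Prediction': [1, 1, 0, 0],
    'Fraud_Probability': [0.95, 0.62, 0.41, 0.05],
})

# Hypothetical cut-offs: [0, 0.3] low, (0.3, 0.7] medium, (0.7, 1.0] high
results['Risk_Tier'] = pd.cut(
    results['Fraud_Probability'],
    bins=[0.0, 0.3, 0.7, 1.0],
    labels=['low', 'medium', 'high'],
    include_lowest=True,
)

# Wallets to investigate first
review_queue = results[results['Risk_Tier'] == 'high']
print(results)
```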