# Blockchain Fraud Detection: Feature Engineering

This notebook focuses on creating new features to enhance the fraud detection model. We'll develop features from two main sources:

1. **Transaction Features**: From the raw features in the nodes dataset
2. **Graph Features**: Based on the network structure from the edges dataset

Our goal is to combine these feature sets to create a comprehensive representation for the Graph Neural Network model.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
import warnings
import os
import torch
from torch_geometric.data import Data

# Set plotting style
sns.set(style="whitegrid")
plt.style.use('seaborn-v0_8-whitegrid')
warnings.filterwarnings('ignore')

# Create output directory for processed data
os.makedirs('../data/processed', exist_ok=True)

## 1. Load and Prepare Data

In [None]:
# Load data
nodes_path = '../data/raw/nodes.csv'
edges_path = '../data/raw/edges.csv'

df_nodes = pd.read_csv(nodes_path)
df_edges = pd.read_csv(edges_path)

print(f"Node data shape: {df_nodes.shape}")
print(f"Edge data shape: {df_edges.shape}")

# Map each transaction ID (first column) to its row index
id2idx = {tx_id: idx for idx, tx_id in enumerate(df_nodes.iloc[:, 0])}
print(f"Mapped {len(id2idx)} transaction IDs to row indices")

In [None]:
# Check for low-variance features
feature_cols = df_nodes.columns[2:]  # Exclude ID and class
feature_variance = df_nodes[feature_cols].var()

low_var_threshold = 0.01
low_var_features = feature_variance[feature_variance < low_var_threshold].index.tolist()

print(f"Number of low-variance features (variance < {low_var_threshold}): {len(low_var_features)}")
print("Removing these features from the dataset")

# Remove low-variance features
df_nodes_clean = df_nodes.drop(columns=low_var_features)
print(f"Original feature count: {len(feature_cols)}")
print(f"Cleaned feature count: {len(df_nodes_clean.columns) - 2}")
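
The variance filter above depends on the scale of each column: a feature measured in small units can fall below the threshold even when it carries signal. The sketch below applies the same rule to a made-up matrix. The array `X` and the names in `names` are hypothetical, and `ddof=1` is used on the assumption that it should match the default of pandas' `DataFrame.var()`.

```python
import numpy as np

# Hypothetical toy matrix: 3 rows, 3 feature columns (not taken from the dataset).
X = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.1, 20.0],
    [1.0, 0.0, 30.0],
])
names = ["constant", "nearly_constant", "informative"]

threshold = 0.01
# ddof=1 gives the sample variance, matching pandas' DataFrame.var() default.
variances = X.var(axis=0, ddof=1)
keep_mask = variances >= threshold

kept = [name for name, keep in zip(names, keep_mask) if keep]
dropped = [name for name, keep in zip(names, keep_mask) if not keep]
print("kept:", kept)
print("dropped:", dropped)
```

If the raw features are on very different scales, it may be worth inspecting the dropped columns before discarding them.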

## 2. Extract Transaction-Based Features

In [None]:
# Extract transaction features from the raw data
transaction_features = df_nodes_clean.iloc[:, 2:].values
print(f"Transaction features shape: {transaction_features.shape}")

# Save original feature names for interpretation
transaction_feature_names = df_nodes_clean.columns[2:].tolist()

# Review sample features
print("Sample transaction features (first 5 rows, first 5 features):")
print(transaction_features[:5, :5])

In [None]:
# Calculate feature importance with mutual information
labels = df_nodes_clean['class'].values
selector = SelectKBest(mutual_info_classif, k='all')
selector.fit(transaction_features, labels)

# Create DataFrame of feature importances
feature_importance = pd.DataFrame({
    'Feature': transaction_feature_names,
    'Mutual Information': selector.scores_
})
feature_importance = feature_importance.sort_values('Mutual Information', ascending=False)

# Display top 20 important features
print("Top 20 features by mutual information with the target:")
feature_importance.head(20)

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Mutual Information', y='Feature', data=feature_importance.head(20))
plt.title('Top 20 Features by Mutual Information with Target', fontsize=15)
plt.tight_layout()
plt.show()
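
To make the mutual information scores easier to interpret, here is a minimal plug-in estimate for two discrete variables, in nats. It is only an illustration: `mutual_info_classif` estimates mutual information for continuous features with a nearest-neighbour method, so its scores will not match this formula exactly. The function name and the toy lists are hypothetical.

```python
from collections import Counter
from math import log

def discrete_mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two equally long discrete sequences."""
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    mi = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        p_x = x_counts[x] / n
        p_y = y_counts[y] / n
        mi += p_xy * log(p_xy / (p_x * p_y))
    return mi

labels_toy = [0, 0, 1, 1]
perfect_feature = [0, 0, 1, 1]      # determines the label completely
unrelated_feature = [0, 1, 0, 1]    # statistically independent of the label

mi_perfect = discrete_mutual_information(perfect_feature, labels_toy)
mi_unrelated = discrete_mutual_information(unrelated_feature, labels_toy)
print(mi_perfect, mi_unrelated)
```

A feature that fully determines a balanced binary label scores ln 2 (about 0.693 nats), while an independent feature scores 0, which gives a rough scale for reading the bar chart.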

## 3. Engineer Graph-Based Features

In [None]:
# Create NetworkX graph from edges
G = nx.DiGraph()

# Add all edges to the graph
edge_count = 0
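# Iterate over the first two columns as plain tuples; this avoids positional
# access such as row[0] on a labelled Series, which recent pandas versions deprecate
for source_id, target_id in df_edges.iloc[:, :2].itertuples(index=False, name=None):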
    # Only add edges between nodes present in df_nodes
    if source_id in id2idx and target_id in id2idx:
        G.add_edge(source_id, target_id)
        edge_count += 1

print(f"Created graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
print(f"Added {edge_count} out of {len(df_edges)} edges from the dataset")
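
The edge filter can also be written without a Python loop. The sketch below is one possible vectorized version on a toy edge list; the column names `txId1` and `txId2` and the ID values are assumptions for illustration, not taken from the dataset. It also shows why the number of accepted rows can exceed the number of edges in the graph: a `DiGraph` stores each repeated (source, target) pair only once.

```python
import pandas as pd

# Hypothetical edge list; the second row repeats the first, and 9 is an unknown node.
df_edges_toy = pd.DataFrame({"txId1": [1, 1, 2, 9], "txId2": [2, 2, 3, 1]})
known_ids = {1, 2, 3}

# Keep rows whose source and target both appear among the known node IDs.
mask = df_edges_toy["txId1"].isin(known_ids) & df_edges_toy["txId2"].isin(known_ids)
valid_edges = df_edges_toy[mask]

# A DiGraph collapses repeated (source, target) pairs into a single edge.
distinct_edges = valid_edges.drop_duplicates()

print(f"accepted rows: {len(valid_edges)}, distinct edges: {len(distinct_edges)}")
```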

In [None]:
# Calculate basic node centrality metrics
print("Calculating node centrality metrics...")

# In-degree and out-degree centrality
in_degree_centrality = nx.in_degree_centrality(G)
out_degree_centrality = nx.out_degree_centrality(G)

# PageRank (retry with a looser tolerance if power iteration does not converge)
try:
    pagerank = nx.pagerank(G, alpha=0.85, max_iter=100)
except nx.PowerIterationFailedConvergence:
    print("PageRank failed to converge, retrying with a looser tolerance")
    pagerank = nx.pagerank(G, alpha=0.85, max_iter=100, tol=1e-3)

# HITS (hub and authority scores), with the same fallback
try:
    hubs, authorities = nx.hits(G, max_iter=100)
except nx.PowerIterationFailedConvergence:
    print("HITS failed to converge, retrying with a looser tolerance")
    hubs, authorities = nx.hits(G, max_iter=100, tol=1e-3)

print("Centrality metrics calculation completed")
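
For intuition about what the PageRank call computes and why it can fail to converge, the following is a simplified power-iteration sketch on an unweighted graph given as index pairs. It is not the NetworkX implementation, which also supports weights and personalization; the function name, the tolerance, and the toy edges are hypothetical. It assumes the edge list contains no repeated pairs.

```python
import numpy as np

def pagerank_power_iteration(edges, n_nodes, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration on an unweighted directed graph given as (source, target) index pairs."""
    out_degree = np.zeros(n_nodes)
    for source, _ in edges:
        out_degree[source] += 1

    rank = np.full(n_nodes, 1.0 / n_nodes)
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is redistributed uniformly.
        dangling_mass = rank[out_degree == 0].sum()
        new_rank = np.full(n_nodes, (1.0 - alpha) / n_nodes + alpha * dangling_mass / n_nodes)
        for source, target in edges:
            new_rank[target] += alpha * rank[source] / out_degree[source]
        # Stop once the total change between iterations drops below the tolerance.
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    raise RuntimeError("power iteration did not converge")

# Toy graph: nodes 0 and 1 both point to node 2, which has no outgoing edges.
toy_edges = [(0, 2), (1, 2)]
ranks = pagerank_power_iteration(toy_edges, n_nodes=3)
print(ranks)
```

The stopping rule compares the change between iterations with a tolerance, which is why loosening `tol` is a reasonable fallback when the default setting does not converge within `max_iter` steps.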

In [None]:
# Calculate local graph metrics (more computationally efficient)
print("Calculating local graph metrics...")

# Raw in-degree and out-degree
in_degrees = dict(G.in_degree())
out_degrees = dict(G.out_degree())

# Successor and predecessor counts (in a DiGraph these equal the out- and in-degrees above)
successors_count = {node: len(list(G.successors(node))) for node in G.nodes()}
predecessors_count = {node: len(list(G.predecessors(node))) for node in G.nodes()}

# Calculate the clustering coefficient of every node on the undirected version of the graph
G_undirected = G.to_undirected()
clustering = nx.clustering(G_undirected)

print("Local graph metrics calculation completed")
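
As a reference for the clustering feature, the local clustering coefficient of a node with k neighbours is the number of links among those neighbours divided by k(k-1)/2. The sketch below computes it by hand on a hypothetical four-node undirected graph (a triangle with one pendant node), without NetworkX.

```python
from itertools import combinations

# Hypothetical undirected graph: triangle a-b-c plus a pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

neighbors = {}
for u, v in toy_edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

def local_clustering(node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to connect
    links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return 2.0 * links / (k * (k - 1))

toy_clustering = {node: local_clustering(node) for node in neighbors}
print(toy_clustering)
```

Node `c` has three neighbours of which only one pair is linked, so its coefficient is 1/3, while the pendant node `d` gets 0.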

In [None]:
# Assemble graph features into a DataFrame
node_ids = df_nodes_clean['txId'].values
graph_features = np.zeros((len(node_ids), 8))

for i, node_id in enumerate(node_ids):
    if node_id in G.nodes():
        # Centrality measures
        graph_features[i, 0] = in_degree_centrality.get(node_id, 0)
        graph_features[i, 1] = out_degree_centrality.get(node_id, 0)
        graph_features[i, 2] = pagerank.get(node_id, 0)
        graph_features[i, 3] = hubs.get(node_id, 0)
        graph_features[i, 4] = authorities.get(node_id, 0)
        
        # Local metrics
        graph_features[i, 5] = in_degrees.get(node_id, 0)
        graph_features[i, 6] = out_degrees.get(node_id, 0)
        graph_features[i, 7] = clustering.get(node_id, 0)

print(f"Created graph features matrix with shape {graph_features.shape}")

# Define feature names for interpretation
graph_feature_names = [
    'in_degree_centrality',
    'out_degree_centrality',
    'pagerank',
    'hub_score',
    'authority_score',
    'in_degree',
    'out_degree',
    'clustering_coefficient'
]

# Create DataFrame for analysis
df_graph_features = pd.DataFrame(graph_features, columns=graph_feature_names)
df_graph_features['class'] = df_nodes_clean['class'].values

print("Basic statistics of graph features:")
df_graph_features.describe()
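
One thing to keep in mind about this feature matrix: NetworkX defines in-degree centrality as the in-degree divided by n - 1, so `in_degree_centrality` is a constant multiple of `in_degree` (and likewise for out-degree). After standardization the two columns become identical. The sketch below illustrates this with made-up degrees; the array values and `n_nodes` are hypothetical.

```python
import numpy as np

# Hypothetical in-degrees for a five-node graph.
in_degree = np.array([0.0, 1.0, 3.0, 2.0, 0.0])
n_nodes = 5
in_degree_centrality = in_degree / (n_nodes - 1)

def standardize(column):
    """Zero mean, unit variance (population standard deviation)."""
    return (column - column.mean()) / column.std()

z_raw = standardize(in_degree)
z_centrality = standardize(in_degree_centrality)

columns_identical = np.allclose(z_raw, z_centrality)
print(columns_identical)
```

Keeping both columns is harmless for most models, but it means the eight graph features carry only six distinct signals once scaled.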

In [None]:
# Analyze graph features by class
graph_features_by_class = df_graph_features.groupby('class')[graph_feature_names].mean()

print("Average graph features by class:")
graph_features_by_class

In [None]:
# Visualize the distribution of graph features by class
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for i, feature in enumerate(graph_feature_names):
    sns.boxplot(x='class', y=feature, data=df_graph_features, palette=['#4285F4', '#EA4335'], ax=axes[i])
    axes[i].set_title(f'{feature} by Class', fontsize=12)
    axes[i].set_xlabel('Class (0=Legitimate, 1=Fraudulent)', fontsize=10)
    axes[i].set_ylabel(feature, fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Calculate feature importance for graph features
selector_graph = SelectKBest(mutual_info_classif, k='all')
selector_graph.fit(graph_features, labels)

# Create DataFrame of feature importances
graph_importance = pd.DataFrame({
    'Feature': graph_feature_names,
    'Mutual Information': selector_graph.scores_
})
graph_importance = graph_importance.sort_values('Mutual Information', ascending=False)

print("Graph features ranked by mutual information with the target:")
graph_importance

## 4. Engineer Additional Features (Optional)

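Let's add some additional features that might be useful for fraud detection. Note that the fraud proximity features below are computed from the class labels of neighbouring nodes, so they leak target information into the inputs unless the labels used are restricted to the training split.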

In [None]:
# Calculate fraud proximity features
# These measure how closely a node is connected to fraudulent nodes

print("Calculating fraud proximity features...")

# Get fraud nodes
fraud_nodes = set(df_nodes_clean[df_nodes_clean['class'] == 1]['txId'].values)
print(f"Number of fraud nodes: {len(fraud_nodes)}")

# Initialize features
fraud_proximity_features = np.zeros((len(node_ids), 2))

for i, node_id in enumerate(node_ids):
    if node_id in G.nodes():
        # Count fraud nodes in predecessors (incoming connections)
        predecessors = set(G.predecessors(node_id))
        fraud_predecessors = predecessors.intersection(fraud_nodes)
        if predecessors:
            fraud_proximity_features[i, 0] = len(fraud_predecessors) / len(predecessors)
        
        # Count fraud nodes in successors (outgoing connections)
        successors = set(G.successors(node_id))
        fraud_successors = successors.intersection(fraud_nodes)
        if successors:
            fraud_proximity_features[i, 1] = len(fraud_successors) / len(successors)

print(f"Created fraud proximity features with shape {fraud_proximity_features.shape}")

# Define feature names
fraud_proximity_names = [
    'fraud_predecessor_ratio',
    'fraud_successor_ratio'
]

# Create DataFrame for analysis
df_fraud_proximity = pd.DataFrame(fraud_proximity_features, columns=fraud_proximity_names)
df_fraud_proximity['class'] = df_nodes_clean['class'].values

print("Basic statistics of fraud proximity features:")
df_fraud_proximity.describe()
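
Because these ratios are built from class labels, computing them from every labelled node lets test labels reach the model inputs. A minimal sketch of one way to limit this, assuming a train/test split exists, is to count only fraud nodes from the training set. The toy graph, the labels, and `train_nodes` are all hypothetical.

```python
# Hypothetical toy graph: predecessors[node] = set of nodes with an edge into node.
predecessors = {
    "t1": set(),
    "t2": {"t1"},
    "t3": {"t1", "t2"},
    "t4": {"t2", "t3"},
}
toy_labels = {"t1": 1, "t2": 0, "t3": 1, "t4": 0}  # 1 = fraud
train_nodes = {"t1", "t2"}                          # assumed training split

def fraud_predecessor_ratio(node, fraud_set):
    """Share of a node's predecessors that belong to fraud_set (0.0 if it has none)."""
    preds = predecessors[node]
    if not preds:
        return 0.0
    return len(preds & fraud_set) / len(preds)

# Using every label, including those of held-out nodes.
all_fraud = {node for node, label in toy_labels.items() if label == 1}
leaky_ratio = {node: fraud_predecessor_ratio(node, all_fraud) for node in predecessors}

# Using only labels that would be available at training time.
train_fraud = {node for node in train_nodes if toy_labels[node] == 1}
train_only_ratio = {node: fraud_predecessor_ratio(node, train_fraud) for node in predecessors}

print(leaky_ratio)
print(train_only_ratio)
```

For `t4` the ratio drops from 0.5 to 0.0 once the held-out label of `t3` is no longer used. The denominator here still counts all predecessors; restricting it to labelled training predecessors is an equally plausible design choice.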

In [None]:
# Analyze fraud proximity features by class
fraud_proximity_by_class = df_fraud_proximity.groupby('class')[fraud_proximity_names].mean()

print("Average fraud proximity features by class:")
fraud_proximity_by_class

In [None]:
# Visualize fraud proximity features
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

for i, feature in enumerate(fraud_proximity_names):
    sns.boxplot(x='class', y=feature, data=df_fraud_proximity, palette=['#4285F4', '#EA4335'], ax=axes[i])
    axes[i].set_title(f'{feature} by Class', fontsize=14)
    axes[i].set_xlabel('Class (0=Legitimate, 1=Fraudulent)', fontsize=12)
    axes[i].set_ylabel(feature, fontsize=12)

plt.tight_layout()
plt.show()

## 5. Combine All Features

In [None]:
# Combine all feature sets
all_features = np.hstack((transaction_features, graph_features, fraud_proximity_features))
all_feature_names = transaction_feature_names + graph_feature_names + fraud_proximity_names

print(f"Combined features shape: {all_features.shape}")
print(f"Number of features: {len(all_feature_names)}")

In [None]:
# Normalize features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(all_features)

print("Features normalized to zero mean and unit variance")
print(f"Scaled features shape: {scaled_features.shape}")
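
The scaler above is fit on every row. If the data is later split into training and test sets, a common alternative is to estimate the mean and standard deviation on the training rows only and apply them to all rows. The sketch below shows that idea with plain NumPy; the matrix and `train_mask` are hypothetical.

```python
import numpy as np

# Hypothetical feature matrix; the last row is a held-out outlier.
X_toy = np.array([
    [1.0, 10.0],
    [3.0, 30.0],
    [5.0, 50.0],
    [100.0, 1000.0],
])
train_mask = np.array([True, True, True, False])  # assumed split

# Estimate statistics on training rows only (population std, as StandardScaler does).
train_mean = X_toy[train_mask].mean(axis=0)
train_std = X_toy[train_mask].std(axis=0)
train_std[train_std == 0] = 1.0  # leave constant columns unscaled

X_toy_scaled = (X_toy - train_mean) / train_std
print(X_toy_scaled)
```

With scikit-learn the same effect comes from calling `fit` on the training rows and `transform` on all rows.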

In [None]:
# Feature importances for combined features
selector_all = SelectKBest(mutual_info_classif, k='all')
selector_all.fit(scaled_features, labels)

# Create DataFrame of feature importances
all_importance = pd.DataFrame({
    'Feature': all_feature_names,
    'Mutual Information': selector_all.scores_
})
all_importance = all_importance.sort_values('Mutual Information', ascending=False)

print("Top 20 features by mutual information with the target:")
all_importance.head(20)

In [None]:
# Visualize top combined features
plt.figure(figsize=(12, 8))
sns.barplot(x='Mutual Information', y='Feature', data=all_importance.head(20))
plt.title('Top 20 Combined Features by Mutual Information with Target', fontsize=15)
plt.tight_layout()
plt.show()

## 6. Build Edge Index for PyTorch Geometric

In [None]:
# Build edge index for PyTorch Geometric
print("Building edge index for PyTorch Geometric...")

edges = []
skipped = 0

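# Iterate over the first two columns as plain tuples; this avoids positional
# access such as row[0] on a labelled Series, which recent pandas versions deprecate
for source_id, target_id in df_edges.iloc[:, :2].itertuples(index=False, name=None):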
    
    # Only include edges where both nodes are in the node dataset
    if source_id in id2idx and target_id in id2idx:
        source_idx = id2idx[source_id]
        target_idx = id2idx[target_id]
        edges.append([source_idx, target_idx])
    else:
        skipped += 1

# Convert to PyTorch tensor
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

print(f"Created edge index with shape {edge_index.shape}")
print(f"Skipped {skipped} edges due to missing nodes")
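
The layout PyTorch Geometric expects is a 2 x E integer array whose first row holds source indices and second row holds target indices. The sketch below reproduces the construction with NumPy on hypothetical IDs so the shape is easy to inspect; the `reshape(-1, 2)` step is an addition that keeps the result two-dimensional even when no edge survives the filter.

```python
import numpy as np

node_ids_toy = [101, 205, 309]                     # hypothetical transaction IDs
raw_edges = [(101, 205), (205, 309), (999, 101)]   # 999 is not a known node

id2idx_toy = {tx_id: idx for idx, tx_id in enumerate(node_ids_toy)}

# Translate IDs to row indices, dropping edges that touch unknown nodes.
pairs = [
    (id2idx_toy[source], id2idx_toy[target])
    for source, target in raw_edges
    if source in id2idx_toy and target in id2idx_toy
]
skipped_toy = len(raw_edges) - len(pairs)

# Shape (E, 2) -> (2, E); reshape keeps two dimensions even for an empty list.
edge_index_toy = np.array(pairs, dtype=np.int64).reshape(-1, 2).T

print(edge_index_toy)
print(f"skipped: {skipped_toy}")
```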

## 7. Save Processed Data

In [None]:
# Save processed features
np.save('../data/processed/features.npy', scaled_features)
with open('../data/processed/feature_names.txt', 'w') as f:
    for name in all_feature_names:
        f.write(f"{name}\n")

# Save labels
np.save('../data/processed/labels.npy', labels)

# Save edge index
torch.save(edge_index, '../data/processed/edge_index.pt')

# Save feature importances
all_importance.to_csv('../data/processed/feature_importance.csv', index=False)

print("Saved processed data to ../data/processed/")

In [None]:
# Create PyTorch Geometric Data object
features_tensor = torch.tensor(scaled_features, dtype=torch.float)
labels_tensor = torch.tensor(labels, dtype=torch.long)

data = Data(x=features_tensor, edge_index=edge_index, y=labels_tensor)
print(f"Created PyTorch Geometric Data object: {data}")

# Save the complete Data object
torch.save(data, '../data/processed/data.pt')
print("Saved PyTorch Geometric Data object to ../data/processed/data.pt")
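
Before handing the saved arrays to a model, a quick consistency check can catch shape or indexing mistakes early. The helper below is a hypothetical sketch that works on NumPy arrays (for example the saved `features.npy` and `labels.npy`, plus the edge index converted to an array); the function name and the checks it performs are assumptions, not part of PyTorch Geometric.

```python
import numpy as np

def check_graph_arrays(x, edge_index, y):
    """Return a list of human-readable problems; an empty list means the arrays look consistent."""
    problems = []
    if x.ndim != 2:
        problems.append("features must be a 2-D array")
    if y.shape[0] != x.shape[0]:
        problems.append("labels and features have different row counts")
    if edge_index.ndim != 2 or edge_index.shape[0] != 2:
        problems.append("edge_index must have shape (2, num_edges)")
    elif edge_index.size and (edge_index.min() < 0 or edge_index.max() >= x.shape[0]):
        problems.append("edge_index refers to a node outside the feature matrix")
    if not np.isfinite(x).all():
        problems.append("features contain NaN or infinite values")
    return problems

# Hypothetical arrays: three nodes, two features each, two edges.
x_ok = np.zeros((3, 2))
y_ok = np.array([0, 1, 0])
edges_ok = np.array([[0, 1], [1, 2]])
edges_bad = np.array([[0, 1], [1, 5]])  # node index 5 does not exist

print(check_graph_arrays(x_ok, edges_ok, y_ok))
print(check_graph_arrays(x_ok, edges_bad, y_ok))
```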

## 8. Summary of Feature Engineering

In this notebook, we've created a comprehensive feature set for blockchain fraud detection by combining:

1. **Transaction Features**:
   - Cleaned and normalized the original features
   - Removed low-variance features
   - Identified the most informative features

2. **Graph Features**:
   - Centrality metrics (degree, PageRank, HITS)
   - Local graph structure (clustering coefficient)
   - Raw topological features (in/out degrees)

3. **Fraud Proximity Features**:
   - Ratio of fraudulent predecessors
   - Ratio of fraudulent successors

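The mutual information rankings above show which of the combined features are most informative about the class label. Two caveats apply before these numbers are used to judge a model: the fraud proximity features are derived from the labels themselves, and the scaler was fit on all rows, so both should be recomputed from training data only once a train/test split is defined. The data is now in the format required by PyTorch Geometric, which will be used to build our GNN models in the next notebook.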