# Blockchain Fraud Detection: Fraud Case Study

This notebook focuses on analyzing specific fraudulent transactions and understanding the patterns that our GNN model has learned. We'll investigate the characteristics of different types of fraud and visualize their network structures to gain deeper insights into blockchain fraud detection.

In [None]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, SAGEConv, GATConv
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN
import warnings

# Set plotting style
sns.set(style="whitegrid")
plt.style.use('seaborn-v0_8-whitegrid')
warnings.filterwarnings('ignore')

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create directory for case study results
os.makedirs('../reports/case_study', exist_ok=True)

## 1. Load Data and Model

In [None]:
# Load raw data
nodes_path = '../data/raw/nodes.csv'
edges_path = '../data/raw/edges.csv'

try:
    df_nodes = pd.read_csv(nodes_path)
    df_edges = pd.read_csv(edges_path)
    print(f"Loaded raw data from CSV files")
except FileNotFoundError:
    raise FileNotFoundError(f"Raw data files not found. Please make sure {nodes_path} and {edges_path} exist.")

print(f"Nodes shape: {df_nodes.shape}")
print(f"Edges shape: {df_edges.shape}")

In [None]:
# Load processed data
try:
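    # Try loading the complete Data object.
    # Note: PyTorch 2.6+ defaults torch.load to weights_only=True, which rejects a pickled
    # Data object; if that happens, pass weights_only=False (only for files you trust).
    data = torch.load('../data/processed/data.pt')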
    print(f"Loaded PyTorch Geometric Data object: {data}")
except FileNotFoundError:
    # If not found, load individual components
    features = np.load('../data/processed/features.npy')
    labels = np.load('../data/processed/labels.npy')
    edge_index = torch.load('../data/processed/edge_index.pt')
    
    # Convert to PyTorch tensors
    x = torch.FloatTensor(features)
    y = torch.LongTensor(labels)
    
    # Create Data object
    data = Data(x=x, edge_index=edge_index, y=y)
    print(f"Created PyTorch Geometric Data object from components")

# Load feature names if available
try:
    with open('../data/processed/feature_names.txt', 'r') as f:
        feature_names = [line.strip() for line in f.readlines()]
    print(f"Loaded {len(feature_names)} feature names")
except FileNotFoundError:
    feature_names = [f'Feature_{i}' for i in range(data.num_features)]
    print(f"Created generic feature names")

# Move data to device
data = data.to(device)

In [None]:
# Define GCN model
class GCNModel(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers=3, 
                 dropout=0.5, batch_norm=True, residual=True):
        super(GCNModel, self).__init__()
        
        self.num_layers = num_layers
        self.dropout = dropout
        self.batch_norm = batch_norm
        self.residual = residual
        
        # Input layer
        self.convs = torch.nn.ModuleList([GCNConv(input_dim, hidden_dim)])
        
        # Hidden layers
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))
        
        # Output layer
        self.convs.append(GCNConv(hidden_dim, output_dim))
        
        # Batch normalization layers
        if batch_norm:
            self.bns = torch.nn.ModuleList([torch.nn.BatchNorm1d(hidden_dim) for _ in range(num_layers - 1)])
    
    def forward(self, x, edge_index):
        # Input layer
        h = self.convs[0](x, edge_index)
        h = F.relu(h)
        h = F.dropout(h, p=self.dropout, training=self.training)
        
        # Hidden layers with residual connections
        for i in range(1, self.num_layers - 1):
            h_prev = h
            h = self.convs[i](h, edge_index)
            
            if self.batch_norm:
                h = self.bns[i-1](h)
            
            h = F.relu(h)
            
            if self.residual:
                h = h + h_prev
            
            h = F.dropout(h, p=self.dropout, training=self.training)
        
        # Output layer
        h = self.convs[-1](h, edge_index)
        
        return F.log_softmax(h, dim=1)
    
    def get_embeddings(self, x, edge_index, layer=-2):
        """Return the activations after conv layer `layer`.

        Negative indices count from the end: -1 is the output layer (raw logits),
        -2 is the last hidden layer (the default).
        """
        target = layer if layer >= 0 else self.num_layers + layer

        # Input layer
        h = self.convs[0](x, edge_index)
        h = F.relu(h)
        h = F.dropout(h, p=self.dropout, training=self.training)
        if target == 0:
            return h

        # Hidden layers (same computation as forward)
        for i in range(1, self.num_layers - 1):
            h_prev = h
            h = self.convs[i](h, edge_index)

            if self.batch_norm:
                h = self.bns[i-1](h)

            h = F.relu(h)

            if self.residual:
                h = h + h_prev

            h = F.dropout(h, p=self.dropout, training=self.training)

            # Return only after the requested layer has been applied
            if i == target:
                return h

        # Output layer (layer=-1): logits before log_softmax
        return self.convs[-1](h, edge_index)
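
To make the model definition above more concrete, here is a minimal pure-Python sketch of the propagation rule that GCN layers are based on: add self-loops, then average over neighbors with symmetric degree normalization. It is an illustration on a made-up three-node graph, not PyTorch Geometric's implementation; the learned weight matrix and bias are left out, and `gcn_propagate` is a hypothetical helper name.

```python
import math

def gcn_propagate(adj, features):
    """Symmetric-normalized aggregation with self-loops: D^-1/2 (A + I) D^-1/2 X."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own signal
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * len(features[0])
        for j in range(n):
            if a_hat[i][j]:
                # Edge weight is scaled by the degrees of both endpoints
                w = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
                for k, value in enumerate(features[j]):
                    row[k] += w * value
        out.append(row)
    return out

# Toy undirected path graph 0 - 1 - 2 with one scalar feature per node
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
smoothed = gcn_propagate(adj, features)
```

After one step, node 0's signal has spread to its direct neighbor (node 1) but not to node 2, which is two hops away. Stacking layers is what lets information travel further, which is why a three-layer model sees up to three hops around each transaction.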

In [None]:
# Load the best model
try:
    with open('../models/best_model_name.txt', 'r') as f:
        best_model_name = f.read().strip()
except FileNotFoundError:
    # Default to GCN if model name is not found
    best_model_name = 'GCN'

# Initialize model with appropriate architecture
input_dim = data.num_features
hidden_dim = 256
output_dim = 2

# For simplicity, we'll use GCN model for this case study
model = GCNModel(input_dim, hidden_dim, output_dim, num_layers=3).to(device)
model_path = '../models/best_model.pt'

# Try to load the model parameters
try:
    model.load_state_dict(torch.load(model_path, map_location=device))
    print(f"Loaded model from {model_path}")
except FileNotFoundError:
    # Try model-specific path
    model_specific_path = f"../models/{best_model_name.lower()}_best.pt"
    try:
        model.load_state_dict(torch.load(model_specific_path, map_location=device))
        print(f"Loaded model from {model_specific_path}")
    except FileNotFoundError:
        raise FileNotFoundError(f"Could not find model file at {model_path} or {model_specific_path}")

# Set model to evaluation mode
model.eval()

## 2. Get Predictions and Node Embeddings

In [None]:
# Generate predictions for all nodes
with torch.no_grad():
    # Forward pass
    out = model(data.x, data.edge_index)
    
    # Get probabilities and predictions
    probs = torch.exp(out)
    preds = out.argmax(dim=1)
    
    # Get embeddings from the second-to-last layer
    embeddings = model.get_embeddings(data.x, data.edge_index, layer=-2)

# Move to CPU for processing
true_labels = data.y.cpu().numpy()
predicted_labels = preds.cpu().numpy()
fraud_probs = probs[:, 1].cpu().numpy()
node_embeddings = embeddings.cpu().numpy()

print(f"Generated predictions and embeddings for {len(predicted_labels)} nodes")
print(f"Node embeddings shape: {node_embeddings.shape}")
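
The model returns log-probabilities (`log_softmax`), which is why the cell above exponentiates the output to get fraud probabilities. A minimal sketch of that relationship, using made-up scores for a single node and only the standard library:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]

logits = [0.2, 1.5]  # made-up scores for [legitimate, fraud]
log_probs = log_softmax(logits)

# Exponentiating log-probabilities recovers probabilities that sum to 1
probs = [math.exp(lp) for lp in log_probs]

# argmax is unchanged by the log, so it can be taken on either form
predicted = max(range(len(log_probs)), key=log_probs.__getitem__)
```

Because the logarithm is monotonic, taking `argmax` on the log-probabilities gives the same predicted class as taking it on the probabilities.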

In [None]:
# Create a mapping from node indices to original transaction IDs
node_indices = np.arange(len(df_nodes))
node_ids = df_nodes.iloc[:, 0].values  # First column contains the transaction IDs
idx_to_id = dict(zip(node_indices, node_ids))

# Create a results DataFrame
results_df = pd.DataFrame({
    'node_idx': node_indices,
    'tx_id': node_ids,
    'true_label': true_labels,
    'predicted_label': predicted_labels,
    'fraud_probability': fraud_probs,
    'correct_prediction': true_labels == predicted_labels
})

# Add time step information if available
if 'time_step' in df_nodes.columns:
    results_df['time_step'] = df_nodes['time_step'].values

# Print basic statistics
print("Prediction Results Summary:")
print(f"Total nodes: {len(results_df)}")
print(f"Actual fraudulent nodes: {results_df['true_label'].sum()}")
print(f"Predicted fraudulent nodes: {results_df['predicted_label'].sum()}")
print(f"Correctly predicted nodes: {results_df['correct_prediction'].sum()}")
print(f"Accuracy: {results_df['correct_prediction'].mean():.4f}")

## 3. Analyze Different Types of Fraud

Let's try to identify different patterns of fraudulent behavior by clustering the embeddings of fraudulent transactions.

In [None]:
# Extract embeddings of actual fraudulent transactions
fraud_idx = np.where(true_labels == 1)[0]
fraud_embeddings = node_embeddings[fraud_idx]
fraud_results = results_df[results_df['true_label'] == 1].copy()

print(f"Analyzing {len(fraud_idx)} fraudulent transactions")

In [None]:
# Apply t-SNE to visualize fraud embeddings in 2D
# n_iter is omitted: 1000 is the default, and recent scikit-learn releases rename the argument to max_iter
tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(fraud_embeddings) - 1))
fraud_embeddings_2d = tsne.fit_transform(fraud_embeddings)

# Create a plotting DataFrame
fraud_plot_df = pd.DataFrame({
    'x': fraud_embeddings_2d[:, 0],
    'y': fraud_embeddings_2d[:, 1],
    'tx_id': fraud_results['tx_id'].values,
    'correct_prediction': fraud_results['correct_prediction'].values,
    'fraud_probability': fraud_results['fraud_probability'].values
})

# Visualize the fraud embeddings
plt.figure(figsize=(10, 8))
scatter = plt.scatter(fraud_plot_df['x'], fraud_plot_df['y'], 
                       c=fraud_plot_df['fraud_probability'], cmap='Reds', 
                       alpha=0.7, s=50, edgecolors='w')
plt.colorbar(scatter, label='Fraud Probability')
plt.title('t-SNE Visualization of Fraudulent Transaction Embeddings', fontsize=15)
plt.xlabel('t-SNE Dimension 1', fontsize=12)
plt.ylabel('t-SNE Dimension 2', fontsize=12)
plt.tight_layout()
plt.savefig('../reports/case_study/fraud_embeddings.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Use clustering to identify different fraud patterns
# Let's try DBSCAN for density-based clustering
dbscan = DBSCAN(eps=3.0, min_samples=5)  # Adjust parameters based on your data
fraud_clusters = dbscan.fit_predict(fraud_embeddings_2d)

# Add cluster information to the DataFrame
fraud_plot_df['cluster'] = fraud_clusters

# Count points per cluster
n_clusters = len(set(fraud_clusters)) - (1 if -1 in fraud_clusters else 0)
n_noise = list(fraud_clusters).count(-1)
print(f"DBSCAN identified {n_clusters} clusters, with {n_noise} noise points")

# If DBSCAN doesn't work well, try KMeans as an alternative
if n_clusters <= 1:
    print("DBSCAN clustering not effective, trying KMeans instead")
    n_clusters = min(5, len(fraud_embeddings))  # Limit to 5 clusters or fewer
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    fraud_clusters = kmeans.fit_predict(fraud_embeddings_2d)
    fraud_plot_df['cluster'] = fraud_clusters
    print(f"KMeans identified {n_clusters} clusters")

# Create a colormap for clusters
# plt.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap accepts the same arguments
cmap = plt.get_cmap('tab10', n_clusters)

# Visualize the clusters
plt.figure(figsize=(12, 10))

# Plot each cluster with a different color
for cluster_id in range(-1, n_clusters):  # -1 is for noise points in DBSCAN
    cluster_points = fraud_plot_df[fraud_plot_df['cluster'] == cluster_id]
    
    if cluster_id == -1:
        # Noise points in black
        plt.scatter(cluster_points['x'], cluster_points['y'], color='black', 
                    alpha=0.5, s=30, label=f'Noise ({len(cluster_points)} points)')
    else:
        # Cluster points in color
        plt.scatter(cluster_points['x'], cluster_points['y'], color=cmap(cluster_id), 
                    alpha=0.7, s=50, label=f'Cluster {cluster_id} ({len(cluster_points)} points)')

plt.title('Fraud Patterns: Clustered Embeddings', fontsize=15)
plt.xlabel('t-SNE Dimension 1', fontsize=12)
plt.ylabel('t-SNE Dimension 2', fontsize=12)
plt.legend(fontsize=10)
plt.tight_layout()
plt.savefig('../reports/case_study/fraud_clusters.png', dpi=300, bbox_inches='tight')
plt.show()
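
The DBSCAN `eps` value is data-dependent. One common heuristic for choosing it is the k-distance curve: for every point, compute the distance to its k-th nearest neighbor, sort those distances, and look for a sharp jump. The sketch below shows the idea on made-up 2D points using only the standard library; `k_distances` is a hypothetical helper, and the brute-force loop is only suitable for small samples.

```python
import math

def k_distances(points, k):
    """For each point, the distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Made-up 2D points: two tight groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
curve = k_distances(points, k=3)
```

Here the first eight values sit near 1.41 and the last one jumps above 50, so any `eps` between those levels would separate the two groups and leave the isolated point as noise. On real embeddings the jump is usually less clean, so treat the result as a starting point rather than a rule.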

In [None]:
# Analyze the relationship between clusters and correct predictions
cluster_stats = fraud_plot_df.groupby('cluster').agg({
    'tx_id': 'count',
    'correct_prediction': 'mean',
    'fraud_probability': 'mean'
}).reset_index()

cluster_stats.columns = ['Cluster', 'Count', 'Detection_Rate', 'Avg_Probability']
cluster_stats = cluster_stats.sort_values('Count', ascending=False)

print("Fraud Cluster Statistics:")
print(cluster_stats)

In [None]:
# Visualize the cluster statistics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot detection rate by cluster
ax1.bar(cluster_stats['Cluster'].astype(str), cluster_stats['Detection_Rate'], color='teal')
ax1.set_ylim([0, 1])
ax1.set_xlabel('Cluster', fontsize=12)
ax1.set_ylabel('Detection Rate', fontsize=12)
ax1.set_title('Fraud Detection Rate by Cluster', fontsize=14)
for i, v in enumerate(cluster_stats['Detection_Rate']):
    ax1.text(i, v + 0.02, f'{v:.2f}', ha='center')

# Plot average probability by cluster
ax2.bar(cluster_stats['Cluster'].astype(str), cluster_stats['Avg_Probability'], color='purple')
ax2.set_ylim([0, 1])
ax2.set_xlabel('Cluster', fontsize=12)
ax2.set_ylabel('Average Fraud Probability', fontsize=12)
ax2.set_title('Average Fraud Probability by Cluster', fontsize=14)
for i, v in enumerate(cluster_stats['Avg_Probability']):
    ax2.text(i, v + 0.02, f'{v:.2f}', ha='center')

plt.tight_layout()
plt.savefig('../reports/case_study/cluster_statistics.png', dpi=300, bbox_inches='tight')
plt.show()

## 4. Network Analysis of Fraud Patterns

Let's examine the network structure of each fraud cluster to understand different fraud patterns.

In [None]:
# Create a NetworkX graph from the edges
G = nx.DiGraph()

# Add all edges (the first two columns hold the source and target transaction IDs)
G.add_edges_from(df_edges.iloc[:, :2].itertuples(index=False, name=None))

print(f"Created graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")

In [None]:
# Create a mapping from node/transaction ID to cluster
tx_id_to_cluster = dict(zip(fraud_plot_df['tx_id'], fraud_plot_df['cluster']))

# Function to extract a subgraph around a transaction ID
def extract_transaction_neighborhood(G, tx_id, depth=1):
    """Extract a subgraph of nodes within 'depth' hops from tx_id"""
    # Start with the transaction node
    nodes = {tx_id}
    frontier = {tx_id}
    
    # Add predecessors and successors up to depth
    for _ in range(depth):
        new_frontier = set()
        for node in frontier:
            # Add predecessors (incoming edges)
            if node in G:
                predecessors = set(G.predecessors(node))
                new_frontier.update(predecessors)
                
                # Add successors (outgoing edges)
                successors = set(G.successors(node))
                new_frontier.update(successors)
        
        # Update nodes and frontier
        nodes.update(new_frontier)
        frontier = new_frontier
    
    # Extract the subgraph
    return G.subgraph(nodes)
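
The neighborhood extraction above is a hop-limited breadth-first expansion that follows edges in both directions. A dependency-free sketch of the same idea on a plain adjacency dictionary, with a made-up payment flow (`neighborhood` and the edge data are illustrative only):

```python
def neighborhood(successors, start, depth=1):
    """Collect nodes within `depth` hops of `start`, following edges in both directions."""
    # Build the reverse adjacency once so predecessors are cheap to look up
    predecessors = {}
    for src, targets in successors.items():
        for dst in targets:
            predecessors.setdefault(dst, set()).add(src)

    seen = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            next_frontier |= set(successors.get(node, ()))
            next_frontier |= predecessors.get(node, set())
        frontier = next_frontier - seen  # expand only newly discovered nodes
        seen |= frontier
    return seen

# Hypothetical payment flow: a -> b -> c -> d, plus e -> c
edges = {"a": ["b"], "b": ["c"], "c": ["d"], "e": ["c"]}
one_hop = neighborhood(edges, "c", depth=1)
two_hop = neighborhood(edges, "c", depth=2)
```

With `depth=1` the result contains `c` plus its direct senders and receivers (`b`, `e`, `d`); `depth=2` also reaches `a`. Subtracting already-seen nodes from the frontier does not change the result, it only avoids expanding the same node twice.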

In [None]:
# Function to visualize a transaction network with node colors
def visualize_transaction_network(subgraph, central_node=None, title="Transaction Network", 
                                  output_path=None, figsize=(12, 10)):
    """Visualize a transaction subgraph with coloring based on fraud labels"""
    plt.figure(figsize=figsize)
    
    # Create a position layout
    pos = nx.spring_layout(subgraph, seed=42)
    
    # Prepare node colors and sizes
    node_colors = []
    node_sizes = []

    # Ground-truth labels keyed by transaction ID, built once instead of scanning df_nodes per node
    tx_id_to_label = dict(zip(results_df['tx_id'], results_df['true_label']))

    for node in subgraph.nodes():
        # Highlight the central node
        if node == central_node:
            node_colors.append('red')
            node_sizes.append(500)
        # Color fraudulent nodes based on their cluster if known
        elif node in tx_id_to_cluster:
            cluster = tx_id_to_cluster[node]
            if cluster == -1:  # Noise points
                node_colors.append('gray')
            else:
                node_colors.append(plt.cm.tab10(cluster))
            node_sizes.append(300)
        # Color remaining labeled nodes using the same labels the model was evaluated on
        elif node in tx_id_to_label:
            if tx_id_to_label[node] == 0:  # Legitimate
                node_colors.append('lightblue')
                node_sizes.append(200)
            else:  # Fraudulent but not clustered
                node_colors.append('orange')
                node_sizes.append(300)
        # Unknown or uncategorized nodes
        else:
            node_colors.append('lightgray')
            node_sizes.append(150)
    
    # Draw the network
    nx.draw_networkx_nodes(subgraph, pos, node_color=node_colors, node_size=node_sizes, 
                          alpha=0.8, edgecolors='black', linewidths=0.5)
    nx.draw_networkx_edges(subgraph, pos, alpha=0.4, arrows=True)
    
    # Add minimal labels to avoid clutter
    if len(subgraph) < 50:  # Only add labels for small graphs
        labels = {}
        for node in subgraph.nodes():
            if node == central_node:
                labels[node] = f"ID: {node}"
            elif node in tx_id_to_cluster:
                labels[node] = f"F{tx_id_to_cluster[node]}"
            elif len(str(node)) > 10:
                labels[node] = str(node)[:3] + '...' + str(node)[-3:]
            else:
                labels[node] = str(node)
        nx.draw_networkx_labels(subgraph, pos, labels, font_size=8)
    
    plt.title(title, fontsize=15)
    plt.axis('off')
    
    # Add legend
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Central Transaction'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='lightblue', markersize=10, label='Legitimate'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='orange', markersize=10, label='Fraudulent (Unclustered)'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='gray', markersize=10, label='Noise Points')
    ]
    
    # Add cluster colors to legend
    for cluster in range(n_clusters):
        legend_elements.append(
            Line2D([0], [0], marker='o', color='w', markerfacecolor=plt.cm.tab10(cluster), 
                   markersize=10, label=f'Fraud Cluster {cluster}')
        )
    
    plt.legend(handles=legend_elements, loc='upper right', fontsize=10)
    
    plt.tight_layout()
    
    # Save figure if path provided
    if output_path:
        plt.savefig(output_path, dpi=300, bbox_inches='tight')
    
    plt.show()

In [None]:
# Analyze representative transactions from each cluster
for cluster_id in sorted(cluster_stats['Cluster'].unique()):
    if cluster_id == -1 and n_noise < 5:  # Skip noise if there are few noise points
        continue
        
    print(f"\nAnalyzing Cluster {cluster_id}:")
    
    # Get transactions in this cluster
    cluster_txs = fraud_plot_df[fraud_plot_df['cluster'] == cluster_id]['tx_id'].values
    
    if len(cluster_txs) == 0:
        continue
    
    # Select a representative transaction
    if len(cluster_txs) <= 5:
        # For small clusters, just pick the first one
        tx_id = cluster_txs[0]
    else:
        # For larger clusters, pick one with a high fraud probability
        # (named cluster_probs so the model's `probs` tensor is not overwritten)
        cluster_probs = fraud_plot_df[fraud_plot_df['cluster'] == cluster_id]['fraud_probability'].values
        idx = np.argsort(cluster_probs)[-3]  # Index of the third-highest probability
        tx_id = cluster_txs[idx]
    
    print(f"Representative transaction ID: {tx_id}")
    
    # Extract and visualize the transaction's neighborhood
    try:
        subgraph = extract_transaction_neighborhood(G, tx_id, depth=1)
        print(f"Extracted subgraph with {subgraph.number_of_nodes()} nodes and {subgraph.number_of_edges()} edges")
        
        # Limit to smaller graphs for visualization
        if subgraph.number_of_nodes() > 50:
            print("Subgraph is too large. Sampling up to 20 immediate neighbors.")
            # A depth-0 neighborhood would contain only the central node, so sample
            # directly from the one-hop neighbors instead
            neighbors = list(set(G.predecessors(tx_id)) | set(G.successors(tx_id)))
            if len(neighbors) > 20:
                neighbors = list(np.random.choice(neighbors, 20, replace=False))
            subgraph = G.subgraph(neighbors + [tx_id])
        
        # Visualize the subgraph
        title = f"Transaction Network: Fraud Cluster {cluster_id}"
        output_path = f"../reports/case_study/cluster_{cluster_id}_network.png"
        visualize_transaction_network(subgraph, central_node=tx_id, 
                                      title=title, output_path=output_path)
        
        # Calculate network statistics
        in_degree = G.in_degree(tx_id) if tx_id in G else 0
        out_degree = G.out_degree(tx_id) if tx_id in G else 0
        
        print(f"Network statistics for transaction {tx_id}:")
        print(f"In-degree (incoming transactions): {in_degree}")
        print(f"Out-degree (outgoing transactions): {out_degree}")
        
    except Exception as e:
        print(f"Error analyzing transaction {tx_id}: {str(e)}")

## 5. Time-based Analysis of Fraud Patterns

If time step information is available, let's analyze how fraud patterns evolve over time.

In [None]:
# Check if time step information is available
if 'time_step' in results_df.columns:
    # Add time step information to fraud_plot_df
    fraud_plot_df = fraud_plot_df.merge(
        results_df[['tx_id', 'time_step']], 
        on='tx_id', how='left'
    )
    
    # Analyze fraud clusters by time step
    time_cluster_counts = fraud_plot_df.groupby(['time_step', 'cluster']).size().unstack(fill_value=0)
    
    # Plot the evolution of clusters over time
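    # Pass an explicit Axes: DataFrame.plot would otherwise open its own figure and leave this one empty
    fig, ax = plt.subplots(figsize=(14, 8))
    time_cluster_counts.plot(kind='bar', stacked=True, cmap='tab10', ax=ax)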
    plt.title('Evolution of Fraud Clusters Over Time', fontsize=15)
    plt.xlabel('Time Step', fontsize=12)
    plt.ylabel('Number of Fraudulent Transactions', fontsize=12)
    plt.legend(title='Cluster', fontsize=10)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.savefig('../reports/case_study/fraud_evolution.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Calculate cluster proportions by time step
    time_cluster_props = time_cluster_counts.div(time_cluster_counts.sum(axis=1), axis=0)
    
    # Plot the proportions
    # Same as above: draw on an explicit Axes so the figure size is respected
    fig, ax = plt.subplots(figsize=(14, 8))
    time_cluster_props.plot(kind='bar', stacked=True, cmap='tab10', ax=ax)
    plt.title('Proportion of Fraud Clusters Over Time', fontsize=15)
    plt.xlabel('Time Step', fontsize=12)
    plt.ylabel('Proportion of Fraudulent Transactions', fontsize=12)
    plt.legend(title='Cluster', fontsize=10)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.savefig('../reports/case_study/fraud_proportions.png', dpi=300, bbox_inches='tight')
    plt.show()
    
else:
    print("Time step information not available. Skipping time-based analysis.")
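
The second plot above normalizes each time step's counts so that they sum to 1, which separates changes in the cluster mix from changes in overall fraud volume. A small standard-library sketch of that row-wise normalization, using made-up `(time_step, cluster)` pairs:

```python
from collections import Counter, defaultdict

# Made-up (time_step, cluster) pairs for fraudulent transactions
records = [(1, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 1), (2, 0)]

# Count fraud cases per cluster within each time step
counts = defaultdict(Counter)
for time_step, cluster in records:
    counts[time_step][cluster] += 1

# Divide each count by its time step's total to get proportions
proportions = {
    t: {c: n / sum(by_cluster.values()) for c, n in by_cluster.items()}
    for t, by_cluster in counts.items()
}
```

In this toy data, cluster 1 grows from one third of the fraud at time step 1 to three quarters at time step 2, a shift that would be harder to see in raw counts because the total volume also changes.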

## 6. Analysis of False Negatives

Let's analyze fraudulent transactions that the model failed to detect (false negatives) to understand their characteristics.

In [None]:
# Extract false negatives (fraudulent transactions classified as legitimate)
false_negatives = results_df[(results_df['true_label'] == 1) & (results_df['predicted_label'] == 0)]
print(f"Analyzing {len(false_negatives)} false negatives (missed fraud)")

In [None]:
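# Analysis of false negatives by cluster
fn_plot_df = fraud_plot_df[~fraud_plot_df['correct_prediction']]
cluster_sizes = fraud_plot_df.groupby('cluster').size()

# Reindex so clusters with no false negatives appear with a count of 0 instead of dropping out
fn_cluster_counts = fn_plot_df.groupby('cluster').size().reindex(cluster_sizes.index, fill_value=0)
fn_cluster_props = fn_cluster_counts / cluster_sizes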

# Plot the false negative analysis by cluster
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Number of false negatives by cluster
fn_cluster_counts.plot(kind='bar', ax=ax1, color='orange')
ax1.set_title('False Negatives by Cluster', fontsize=14)
ax1.set_xlabel('Cluster', fontsize=12)
ax1.set_ylabel('Number of False Negatives', fontsize=12)

# Proportion of false negatives by cluster
fn_cluster_props.plot(kind='bar', ax=ax2, color='red')
ax2.set_title('False Negative Rate by Cluster', fontsize=14)
ax2.set_xlabel('Cluster', fontsize=12)
ax2.set_ylabel('False Negative Rate', fontsize=12)
ax2.set_ylim([0, 1])

plt.tight_layout()
plt.savefig('../reports/case_study/false_negatives_by_cluster.png', dpi=300, bbox_inches='tight')
plt.show()
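
A detail worth keeping in mind when reading per-cluster false negative rates: a cluster in which every fraud case was detected has no rows among the false negatives, so a grouped count simply omits it unless zeros are filled in. A minimal sketch with made-up `(cluster, detected)` pairs shows one way to keep every cluster in the result:

```python
from collections import Counter

# Made-up (cluster, detected) pairs for fraudulent transactions
fraud = [(0, True), (0, False), (0, True), (1, True), (1, True), (2, False)]

cluster_sizes = Counter(cluster for cluster, _ in fraud)
missed = Counter(cluster for cluster, detected in fraud if not detected)

# Counter returns 0 for clusters with no misses, so every cluster gets a rate
fn_rate = {cluster: missed[cluster] / size
           for cluster, size in sorted(cluster_sizes.items())}
```

Cluster 1 ends up with an explicit rate of 0.0 rather than being missing. The false negative rate of a cluster is also one minus its detection rate, so the two charts in this notebook should mirror each other.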

In [None]:
# Analyze a sample of false negatives
if len(false_negatives) > 0:
    # Select a few false negative examples
    num_examples = min(3, len(false_negatives))
    fn_examples = false_negatives.sample(num_examples, random_state=42)
    
    for i, (_, row) in enumerate(fn_examples.iterrows()):
        print(f"\nFalse Negative Example {i+1}:")
        print(f"Transaction ID: {row['tx_id']}")
        print(f"Fraud Probability: {row['fraud_probability']:.4f}")
        
        # Extract and visualize the transaction's neighborhood
        try:
            subgraph = extract_transaction_neighborhood(G, row['tx_id'], depth=1)
            print(f"Extracted subgraph with {subgraph.number_of_nodes()} nodes and {subgraph.number_of_edges()} edges")
            
            # Limit to smaller graphs for visualization
            if subgraph.number_of_nodes() > 50:
                print("Subgraph is too large. Sampling up to 20 immediate neighbors.")
                # A depth-0 neighborhood would contain only the central node, so sample
                # directly from the one-hop neighbors instead
                neighbors = list(set(G.predecessors(row['tx_id'])) | set(G.successors(row['tx_id'])))
                if len(neighbors) > 20:
                    neighbors = list(np.random.choice(neighbors, 20, replace=False))
                subgraph = G.subgraph(neighbors + [row['tx_id']])
            
            # Calculate network statistics
            in_degree = G.in_degree(row['tx_id']) if row['tx_id'] in G else 0
            out_degree = G.out_degree(row['tx_id']) if row['tx_id'] in G else 0
            
            print(f"Network statistics:")
            print(f"In-degree (incoming transactions): {in_degree}")
            print(f"Out-degree (outgoing transactions): {out_degree}")
            
            # Visualize the subgraph
            title = f"False Negative Example {i+1}: Transaction {row['tx_id']}"
            output_path = f"../reports/case_study/false_negative_{i+1}.png"
            visualize_transaction_network(subgraph, central_node=row['tx_id'], 
                                         title=title, output_path=output_path)
            
        except Exception as e:
            print(f"Error analyzing transaction {row['tx_id']}: {str(e)}")
else:
    print("No false negatives to analyze.")

## 7. Feature Importance for Different Fraud Patterns

Let's analyze which features are most important for detecting each type of fraud pattern.

In [None]:
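# Estimate input-feature importance from the model's first layer.
# The last layer's weights have one entry per hidden unit (hidden_dim), not per input
# feature, so they cannot be matched to feature_names. The first GCN layer maps input
# features to hidden units, so its weight magnitudes give a rough, class-agnostic proxy
# for how strongly each input feature is used.
with torch.no_grad():
    # Shape: (hidden_dim, input_dim)
    first_layer_weights = model.convs[0].lin.weight.detach().cpu().numpy()

n_features = first_layer_weights.shape[1]

# Create a feature importance DataFrame
feature_importance_df = pd.DataFrame({
    'feature_index': np.arange(n_features),
    'feature_name': feature_names[:n_features] if len(feature_names) >= n_features else [f"Feature_{i}" for i in range(n_features)],
    'importance': np.abs(first_layer_weights).mean(axis=0),  # Mean |weight| across hidden units
    'raw_weight': first_layer_weights.mean(axis=0)  # Mean signed weight across hidden units
}).sort_values('importance', ascending=False)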

In [None]:
# Display top 20 most important features
print("Top 20 Features for Fraud Detection:")
feature_importance_df.head(20)

In [None]:
# Analyze feature importance for each cluster
# For each cluster, we'll calculate the average feature values and compare to overall

# Get feature values for fraudulent transactions
fraud_features = data.x[fraud_idx].cpu().numpy()

# Calculate overall mean feature values for fraudulent transactions
overall_fraud_means = fraud_features.mean(axis=0)

# Create a DataFrame to store feature statistics by cluster
cluster_feature_stats = pd.DataFrame()

# Map transaction IDs to row indices in the data (built once, outside the loop)
id_to_idx = dict(zip(df_nodes.iloc[:, 0].values, range(len(df_nodes))))

# Calculate feature means for each cluster
for cluster_id in np.sort(fraud_plot_df['cluster'].unique()):
    # Get the transaction IDs in this cluster and map them to row indices
    cluster_tx_ids = fraud_plot_df[fraud_plot_df['cluster'] == cluster_id]['tx_id'].values
    cluster_indices = [id_to_idx[tx_id] for tx_id in cluster_tx_ids if tx_id in id_to_idx]
    
    if len(cluster_indices) > 0:
        # Get feature values for this cluster
        cluster_features = data.x[cluster_indices].cpu().numpy()
        
        # Calculate mean feature values
        cluster_means = cluster_features.mean(axis=0)
        
        # Relative difference from the overall fraud means. The absolute value in the
        # denominator keeps the sign meaningful when a feature's mean is negative;
        # the small constant avoids division by zero.
        relative_diff = (cluster_means - overall_fraud_means) / (np.abs(overall_fraud_means) + 1e-10)
        
        # Add to DataFrame
        cluster_feature_stats[f'Cluster_{cluster_id}_Mean'] = cluster_means
        cluster_feature_stats[f'Cluster_{cluster_id}_RelDiff'] = relative_diff
    
# Add overall means and feature names
cluster_feature_stats['Overall_Fraud_Mean'] = overall_fraud_means
cluster_feature_stats['Feature_Index'] = np.arange(len(overall_fraud_means))
cluster_feature_stats['Feature_Name'] = feature_names[:len(overall_fraud_means)] if len(feature_names) >= len(overall_fraud_means) else [f"Feature_{i}" for i in range(len(overall_fraud_means))]

# Align importance by feature index: feature_importance_df is sorted by importance,
# so its raw .values order does not match the feature order used here
cluster_feature_stats['Importance'] = (
    feature_importance_df.set_index('feature_index')['importance']
    .reindex(np.arange(len(overall_fraud_means)))
    .values
)

# Sort by overall importance
cluster_feature_stats = cluster_feature_stats.sort_values('Importance', ascending=False)
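
One caveat when reading relative differences: if the features are standardized, a reference mean can be negative, and dividing by a negative number flips the sign of the result. The sketch below uses made-up numbers to show the effect and one way to keep the sign meaningful (dividing by the absolute value); `relative_diff` here is a hypothetical helper, not part of the notebook's pipeline.

```python
def relative_diff(cluster_mean, reference_mean, eps=1e-10):
    """Signed relative difference; the absolute denominator keeps the sign meaningful."""
    return (cluster_mean - reference_mean) / (abs(reference_mean) + eps)

# Made-up standardized feature: the fraud-wide mean is negative
reference = -0.50
cluster = -0.25  # the cluster sits above the reference

# Dividing by the signed mean reports the cluster as "lower"
naive = (cluster - reference) / (reference + 1e-10)

# Dividing by the absolute mean reports it as "higher", which matches the data
signed = relative_diff(cluster, reference)
```

Relative differences also blow up when the reference mean is close to zero, so very large values in the tables and charts below may reflect a tiny denominator rather than a large shift.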

In [None]:
# Display distinctive features for each cluster
top_n = 10  # Number of top features to display

for cluster_id in np.sort(fraud_plot_df['cluster'].unique()):
    if cluster_id == -1:  # Skip noise
        continue
        
    rel_diff_col = f'Cluster_{cluster_id}_RelDiff'
    if rel_diff_col in cluster_feature_stats.columns:
        # Sort by absolute relative difference for this cluster
        cluster_distinctive = cluster_feature_stats.copy()
        cluster_distinctive['Abs_RelDiff'] = np.abs(cluster_distinctive[rel_diff_col])
        cluster_distinctive = cluster_distinctive.sort_values('Abs_RelDiff', ascending=False)
        
        print(f"\nTop {top_n} Distinctive Features for Cluster {cluster_id}:")
        for i, (_, row) in enumerate(cluster_distinctive.head(top_n).iterrows()):
            rel_diff = row[rel_diff_col]
            direction = "higher" if rel_diff > 0 else "lower"
            print(f"{i+1}. {row['Feature_Name']}: {abs(rel_diff):.0%} {direction} than the fraud average")

In [None]:
# Visualize distinctive features for each cluster
for cluster_id in np.sort(fraud_plot_df['cluster'].unique()):
    if cluster_id == -1:  # Skip noise
        continue
        
    rel_diff_col = f'Cluster_{cluster_id}_RelDiff'
    if rel_diff_col in cluster_feature_stats.columns:
        # Get top distinctive features
        cluster_distinctive = cluster_feature_stats.copy()
        cluster_distinctive['Abs_RelDiff'] = np.abs(cluster_distinctive[rel_diff_col])
        cluster_distinctive = cluster_distinctive.sort_values('Abs_RelDiff', ascending=False).head(top_n)

        # Plot: blue bars for above-average features, red bars for below-average ones
        plt.figure(figsize=(12, 6))
        colors = [plt.cm.RdBu(0.1) if x < 0 else plt.cm.RdBu(0.9) for x in cluster_distinctive[rel_diff_col]]
        plt.barh(range(len(cluster_distinctive)), cluster_distinctive[rel_diff_col], color=colors)
        plt.yticks(range(len(cluster_distinctive)), 
                  [f"{name} (Imp: {imp:.3f})" for name, imp in 
                   zip(cluster_distinctive['Feature_Name'], cluster_distinctive['Importance'])])
        plt.gca().invert_yaxis()  # Put the most distinctive feature at the top
        plt.axvline(x=0, color='black', linestyle='--')
        plt.title(f"Distinctive Features for Fraud Cluster {cluster_id}", fontsize=15)
        plt.xlabel("Relative Difference from Average (+ = higher, - = lower)", fontsize=12)
        plt.tight_layout()
        plt.savefig(f"../reports/case_study/cluster_{cluster_id}_features.png", dpi=300, bbox_inches='tight')
        plt.show()
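
The `Cluster_<id>_RelDiff` columns used above are computed earlier in the notebook. As a minimal sketch of what such a column typically holds, the example below assumes the definition (cluster mean - overall fraud mean) / |overall fraud mean|. That definition, the toy values, and the names `toy`, `feature_cols`, and `rel_diff` are illustrative assumptions, not taken from the notebook's actual computation.

```python
import numpy as np
import pandas as pd

# Toy fraud feature table (hypothetical values, not the notebook's data)
toy = pd.DataFrame({
    "amount":   [10.0, 12.0, 30.0, 28.0],
    "n_inputs": [2.0, 2.0, 1.0, 1.0],
    "cluster":  [0, 0, 1, 1],
})

feature_cols = ["amount", "n_inputs"]
overall_means = toy[feature_cols].mean()                     # mean over all fraud rows
cluster_means = toy.groupby("cluster")[feature_cols].mean()  # one row per cluster

# Assumed definition: (cluster mean - overall mean) / |overall mean|.
# Zero overall means become NaN so the division never divides by zero.
denom = overall_means.abs().replace(0, np.nan)
rel_diff = (cluster_means - overall_means) / denom

print(rel_diff.round(3))
```

Under this assumed definition, a value of 0.45 means the cluster mean sits 45% above the overall fraud mean, and a value of -0.45 means 45% below it.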

## 8. Summary and Insights

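This section condenses the analysis into headline metrics (fraud rate, detection rate, and false positive rate), the number of fraud clusters found, and the most important features, then writes them to a Markdown report.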
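
The detection rate and false positive rate reported in this section follow the usual confusion-matrix definitions: detection rate = TP / (TP + FN) and false positive rate = FP / (FP + TN). The sketch below works through both on a small toy table. The data and the name `toy_results` are hypothetical; the column names simply mirror those of `results_df`.

```python
import pandas as pd

# Toy predictions (hypothetical values, not the notebook's data)
toy_results = pd.DataFrame({
    "true_label":      [1, 1, 1, 0, 0, 0, 0, 0],
    "predicted_label": [1, 1, 0, 0, 0, 1, 0, 0],
})

is_fraud = toy_results["true_label"] == 1
flagged = toy_results["predicted_label"] == 1

# Confusion-matrix counts
tp = int((is_fraud & flagged).sum())    # fraud correctly flagged
fn = int((is_fraud & ~flagged).sum())   # fraud that was missed
fp = int((~is_fraud & flagged).sum())   # legitimate transactions flagged as fraud
tn = int((~is_fraud & ~flagged).sum())  # legitimate transactions left alone

# Guard against empty classes so the ratios never divide by zero
detection_rate = tp / (tp + fn) if (tp + fn) else float("nan")
false_positive_rate = fp / (fp + tn) if (fp + tn) else float("nan")

print(f"Detection rate: {detection_rate:.2%}")
print(f"False positive rate: {false_positive_rate:.2%}")
```

On heavily imbalanced data, even a small false positive rate can translate into many flagged legitimate transactions, so the two rates are best read together.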

In [None]:
# Generate summary statistics
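fraud_mask = results_df['true_label'] == 1
legit_mask = results_df['true_label'] == 0
n_legit = int(legit_mask.sum())

summary_stats = {
    'total_transactions': len(results_df),
    'fraud_transactions': int(results_df['true_label'].sum()),  # cast so it prints as an integer
    'fraud_rate': results_df['true_label'].mean(),
    # Detection rate (recall): share of true fraud that the model predicted correctly
    'detection_rate': results_df.loc[fraud_mask, 'correct_prediction'].mean(),
    # False positive rate: legitimate transactions flagged as fraud / all legitimate transactions
    'false_positive_rate': (
        (legit_mask & (results_df['predicted_label'] == 1)).sum() / n_legit
        if n_legit else float('nan')
    ),
    'clusters_identified': n_clusters,
    # nlargest does not depend on feature_importance_df already being sorted
    'top_features': [str(name) for name in feature_importance_df.nlargest(5, 'importance')['feature_name']]
}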

print("Blockchain Fraud Analysis Summary:")
print(f"Total Transactions: {summary_stats['total_transactions']}")
print(f"Fraudulent Transactions: {summary_stats['fraud_transactions']} ({summary_stats['fraud_rate']:.2%})")
print(f"Fraud Detection Rate: {summary_stats['detection_rate']:.2%}")
print(f"False Positive Rate: {summary_stats['false_positive_rate']:.2%}")
print(f"Fraud Patterns (Clusters) Identified: {summary_stats['clusters_identified']}")
print(f"Top 5 Important Features: {', '.join(summary_stats['top_features'])}")

In [None]:
# Generate a markdown summary report
summary_md = f"""# Blockchain Fraud Case Study: Patterns and Insights

## Summary Statistics
- **Total Transactions:** {summary_stats['total_transactions']}
- **Fraudulent Transactions:** {summary_stats['fraud_transactions']} ({summary_stats['fraud_rate']:.2%})
- **Fraud Detection Rate:** {summary_stats['detection_rate']:.2%}
- **False Positive Rate:** {summary_stats['false_positive_rate']:.2%}
- **Distinct Fraud Patterns:** {summary_stats['clusters_identified']} clusters identified

## Key Fraud Indicators
The most important features for detecting fraud are:
{', '.join([f'**{f}**' for f in summary_stats['top_features']])}

## Fraud Pattern Analysis

"""

# Add information about each cluster
for cluster_id in sorted(cluster_stats['Cluster'].unique()):
    if cluster_id == -1:  # Skip noise in the summary
        continue
        
    cluster_count = cluster_stats[cluster_stats['Cluster'] == cluster_id]['Count'].iloc[0]
    detection_rate = cluster_stats[cluster_stats['Cluster'] == cluster_id]['Detection_Rate'].iloc[0]
    
    summary_md += f"""### Fraud Pattern {cluster_id}
- **Transactions:** {cluster_count} ({cluster_count / summary_stats['fraud_transactions']:.2%} of all fraud)
- **Detection Rate:** {detection_rate:.2%}
- **Key Characteristics:**
"""
    
    # Add distinctive features for this cluster
    rel_diff_col = f'Cluster_{cluster_id}_RelDiff'
    if rel_diff_col in cluster_feature_stats.columns:
        cluster_distinctive = cluster_feature_stats.copy()
        cluster_distinctive['Abs_RelDiff'] = np.abs(cluster_distinctive[rel_diff_col])
        cluster_distinctive = cluster_distinctive.sort_values('Abs_RelDiff', ascending=False).head(5)
        
        for _, row in cluster_distinctive.iterrows():
            rel_diff = row[rel_diff_col]
            direction = "higher" if rel_diff > 0 else "lower"
            # Report the signed relative difference itself; it is not a "times" multiplier
            summary_md += f"  - {row['Feature_Name']}: relative difference {rel_diff:+.2f} ({direction} than average)\n"
    
    summary_md += "\n"

# Add conclusions
summary_md += """## Conclusions and Recommendations

The analysis reveals distinct patterns of fraudulent behavior in the blockchain transaction network. Key insights include:

1. **Multiple Fraud Patterns:** We identified several distinct clusters of fraudulent transactions, each with unique characteristics.

2. **Network Structure:** The network analysis shows that fraudulent transactions often have distinctive connection patterns in the transaction graph.

3. **Feature Importance:** Certain features are consistently important across fraud patterns, while others are specific to particular types of fraud.

4. **Detection Challenges:** Some fraud patterns are more challenging to detect than others, with varying detection rates across clusters.

### Recommendations for Improving Fraud Detection:

1. **Enhanced Feature Engineering:** Develop specialized features for each fraud pattern to improve detection across all types.

2. **Pattern-Specific Models:** Consider training separate models for different fraud patterns or using ensemble methods.

3. **Graph Features:** Further leverage the transaction network structure, as graph-based features are highly informative.

4. **Continuous Monitoring:** Implement real-time monitoring for emerging fraud patterns, as they may evolve over time.
"""

# Save the summary report
with open('../reports/case_study/summary.md', 'w', encoding='utf-8') as f:
    f.write(summary_md)

print("Saved summary report to '../reports/case_study/summary.md'")

## 9. Conclusion

In this notebook, we conducted a detailed case study of blockchain fraud patterns using our trained GNN model. We identified distinct clusters of fraudulent transactions, analyzed their network structures, and examined the features that best characterize each fraud pattern.

Our analysis revealed that:
1. Fraudulent transactions can be grouped into several distinct patterns, each with unique characteristics
2. Different fraud patterns have varying detection rates, with some being more challenging to identify than others
3. Network structure and connection patterns provide valuable insights into fraudulent behavior
4. Certain features are consistently important across all fraud types, while others are specific to particular patterns
5. The combination of transaction features and graph structure is powerful for fraud detection

These insights can guide the development of more specialized fraud detection strategies and inform future feature engineering efforts. By understanding the distinctive characteristics of different fraud patterns, we can build more effective models that target specific types of fraudulent activity in blockchain networks.