# Bitcoin Transaction Fraud Detection: Data Preparation

This notebook prepares the Bitcoin transaction dataset for fraud detection with Graph Neural Networks (GNNs). We load the raw CSV files, explore and preprocess them, build a PyTorch Geometric graph, split the nodes into training, validation, and test sets, and save the results for model training.

## Table of Contents
1. [Setup](#Setup)
2. [Loading Data](#Loading-Data)
3. [Data Exploration](#Data-Exploration)
4. [Data Preprocessing](#Data-Preprocessing)
5. [Creating Graph Representation](#Creating-Graph-Representation)
6. [Data Splitting](#Data-Splitting)
7. [Saving Processed Data](#Saving-Processed-Data)
8. [Summary](#Summary)

## Setup

Let's import the necessary libraries and configure the environment.

In [None]:
import os
import pandas as pd
import numpy as np
import torch
from torch_geometric.data import Data
from sklearn.model_selection import train_test_split
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Create directories for data
os.makedirs('data/raw', exist_ok=True)
os.makedirs('data/processed', exist_ok=True)

## Loading Data

The blockchain dataset consists of three main files:
1. `classes.csv` - Contains transaction IDs (`txId`) and their classes (legitimate, fraudulent, or `unknown`)
2. `edgelist.csv` - Contains transaction connections (edges) as `txId1`, `txId2` pairs
3. `Features.csv` - Contains transaction features, keyed by transaction ID in the first column

Let's load these files and examine their contents.
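
To make the file layout concrete, here is a minimal sketch with made-up rows (the IDs, labels, column names `const`, `f0`, `f1`, and feature values are invented for illustration and are not taken from the dataset). It shows how an inner join on `txId` keeps only transactions that appear in both the class table and the feature table, which is the behavior `load_data` relies on.

```python
import io
import pandas as pd

# Hypothetical miniature versions of classes.csv and Features.csv
classes_csv = io.StringIO("txId,class\n101,1\n102,2\n103,unknown\n")
features_csv = io.StringIO("txId,const,f0,f1\n101,1,0.5,-1.2\n102,1,0.1,0.3\n104,1,0.9,0.0\n")

toy_classes = pd.read_csv(classes_csv)
toy_features = pd.read_csv(features_csv).drop(columns=["const"])

# Inner join: 103 has no features and 104 has no class, so both are dropped
toy_nodes = pd.merge(toy_classes, toy_features, on="txId", how="inner")
print(toy_nodes)

# A single non-numeric label makes pandas read the whole column as text
print(toy_classes["class"].tolist())
```

Note that the class column comes back as strings here, so comparisons against the integer `1` would not match any row.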

In [None]:
def load_data(classes_path, edgelist_path, features_path):
    """
    Load blockchain dataset from CSV files with the special format.
    
    Parameters:
    -----------
    classes_path : str
        Path to the classes CSV file with txId and class columns
    edgelist_path : str
        Path to the edgelist CSV file with txId1 and txId2 columns
    features_path : str
        Path to the features CSV file
        
    Returns:
    --------
    df_nodes : pandas.DataFrame
        DataFrame containing node data with txId, class, and features
    df_edges : pandas.DataFrame
        DataFrame containing edge data
    """
    logger.info(f"Loading data from {classes_path}, {edgelist_path}, and {features_path}")
    
    # Load classes data
    df_classes = pd.read_csv(classes_path)
    logger.info(f"Loaded {len(df_classes)} transactions with class information")
    
    # Load edge data
    df_edges = pd.read_csv(edgelist_path)
    logger.info(f"Loaded {len(df_edges)} edges")
    
    # Load features data - this has a non-standard format
    try:
        # Read the features file
        df_features = pd.read_csv(features_path)
        
        # First column should be txId
        df_features = df_features.rename(columns={df_features.columns[0]: 'txId'})
        
        # Second column is a constant (1) - drop it
        df_features = df_features.drop(columns=[df_features.columns[1]])
        
        # Rename remaining columns to feature_0, feature_1, etc.
        feature_cols = [col for col in df_features.columns if col != 'txId']
        feature_rename = {col: f'feature_{i}' for i, col in enumerate(feature_cols)}
        df_features = df_features.rename(columns=feature_rename)
        
        logger.info(f"Loaded {len(df_features.columns)-1} features for {len(df_features)} transactions")
    except Exception as e:
        logger.error(f"Error loading features: {str(e)}")
        raise
    
    # Merge classes and features based on txId
    df_nodes = pd.merge(df_classes, df_features, on='txId', how='inner')
    logger.info(f"Combined data has {len(df_nodes)} transactions with both class and feature information")
    
    return df_nodes, df_edges

# Paths to data files
classes_path = 'data/raw/classes.csv'
edgelist_path = 'data/raw/edgelist.csv'
features_path = 'data/raw/Features.csv'

# Check that all input files exist before loading
missing = [p for p in [classes_path, edgelist_path, features_path] if not os.path.exists(p)]

if missing:
    # Fail early: later cells depend on df_nodes and df_edges
    logger.error(f"Required input files not found: {missing}")
    raise FileNotFoundError(f"Missing input files: {missing}. Please place them in the data/raw directory.")

# Load data
df_nodes, df_edges = load_data(classes_path, edgelist_path, features_path)

## Data Exploration

Let's explore the loaded data to better understand its structure and characteristics.

In [None]:
# Check the first few rows of the node data
print("Node data (first 5 rows):")
df_nodes.head()

In [None]:
# Check the first few rows of the edge data
print("Edge data (first 5 rows):")
df_edges.head()

In [None]:
# Get basic statistics about the dataset
print("Dataset statistics:")
print(f"Number of transactions: {len(df_nodes)}")
print(f"Number of edges: {len(df_edges)}")
print(f"Number of features per transaction: {len(df_nodes.columns) - 2}")

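# Check class distribution
class_counts = df_nodes['class'].value_counts()
print("\nClass distribution:")
print(class_counts)
# Compare as strings: if the column contains 'unknown', pandas reads every label as text,
# and an integer lookup such as class_counts.get(1, 0) would silently return 0
fraud_share = (df_nodes['class'].astype(str) == '1').mean()
print(f"Percentage fraudulent: {fraud_share * 100:.2f}%")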

# Visualize class distribution
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='class', data=df_nodes, palette='viridis')
plt.title('Class Distribution')
plt.xlabel('Class (0: Legitimate, 1: Fraudulent)')
plt.ylabel('Count')

# Add count labels on top of bars
for p in ax.patches:
    # get_height() returns a float, so cast to int before formatting the count
    ax.annotate(f'{int(p.get_height()):,}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Explore feature distributions
feature_cols = [col for col in df_nodes.columns if col not in ['txId', 'class']]

# Calculate feature statistics
feature_stats = df_nodes[feature_cols].describe().T
feature_stats['variance'] = df_nodes[feature_cols].var()
feature_stats = feature_stats.sort_values('variance', ascending=False)

# Display the top 10 features with highest variance
print("Top 10 features with highest variance:")
feature_stats.head(10)

In [None]:
# Visualize the distribution of a few high-variance features
top_features = feature_stats.index[:4]  # Top 4 high-variance features

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, feature in enumerate(top_features):
    sns.histplot(df_nodes[feature], bins=50, kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {feature}')
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
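# Explore feature correlations with the target class
# Coerce labels to numbers first; non-numeric labels such as 'unknown' become NaN,
# and Series.corr() ignores rows with missing values
class_numeric = pd.to_numeric(df_nodes['class'], errors='coerce')
correlations = []
for feature in feature_cols:
    corr = df_nodes[feature].corr(class_numeric)
    correlations.append((feature, corr))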

# Sort by absolute correlation
correlations.sort(key=lambda x: abs(x[1]), reverse=True)

# Display top 15 correlated features
print("Top 15 features by correlation with the target:")
pd.DataFrame(correlations[:15], columns=['Feature', 'Correlation'])

In [None]:
# Create a small network visualization to understand the graph structure
# We'll take a sample for visualization purposes
sample_size = min(1000, len(df_nodes))  # Limit to 1000 nodes for visualization
sampled_nodes = df_nodes.sample(sample_size, random_state=42)

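# Create a graph with sampled nodes
sampled_ids = set(sampled_nodes['txId'])
G = nx.DiGraph()
for tx_id, cls in zip(sampled_nodes['txId'], sampled_nodes['class']):
    # Compare as strings so the flag is also set when labels were read as text
    G.add_node(tx_id, fraud=(str(cls) == '1'))

# Add edges whose endpoints are both in the sample
# (vectorized filter instead of iterating over every edge row)
edge_mask = df_edges['txId1'].isin(sampled_ids) & df_edges['txId2'].isin(sampled_ids)
G.add_edges_from(zip(df_edges.loc[edge_mask, 'txId1'], df_edges.loc[edge_mask, 'txId2']))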

# Set node colors based on class (red for fraud, blue for legitimate)
node_colors = ['red' if G.nodes[node]['fraud'] else 'blue' for node in G.nodes]

# Visualize the graph
plt.figure(figsize=(12, 10))
pos = nx.spring_layout(G, seed=42)  # Position nodes using force-directed layout
nx.draw_networkx(G, pos, with_labels=False, node_size=50, node_color=node_colors, alpha=0.7, arrows=False)
plt.title('Sample Transaction Network (Red = Fraudulent, Blue = Legitimate)')
plt.axis('off')
plt.tight_layout()
plt.show()

# Print some statistics about the sampled graph
print("Sampled graph statistics:")
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")
# Guard against division by zero if the sample is empty
print(f"Average degree: {sum(dict(G.degree()).values()) / max(G.number_of_nodes(), 1):.2f}")

## Data Preprocessing

Now that we've explored the data, let's preprocess it for training GNN models. This includes handling unknown classes, mapping non-numeric class labels to integers, removing low-variance features, and normalizing the features.
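
The normalization step is a z-score transform, `z = (x - mean) / std`, applied per feature. As a hedged illustration (toy numbers, NumPy only, and a hypothetical `toy_train_rows` index array that is not part of this notebook), the sketch below estimates the mean and standard deviation from training rows only and then applies them to every row. This is a variant that keeps validation and test statistics out of the scaler; the preprocessing cell in this notebook instead fits `StandardScaler` on all labeled nodes.

```python
import numpy as np

# Toy feature matrix: 6 nodes, 2 features (values invented for illustration)
toy_x = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0],
                  [5.0, 50.0],
                  [6.0, 60.0]])

# Hypothetical training rows; in practice these would come from the split step
toy_train_rows = np.array([0, 1, 2, 3])

# Estimate mean and standard deviation from the training rows only
mu = toy_x[toy_train_rows].mean(axis=0)
sigma = toy_x[toy_train_rows].std(axis=0)
sigma[sigma == 0] = 1.0  # avoid dividing by zero for constant features

# Apply the same transform to every row
toy_z = (toy_x - mu) / sigma

print(toy_z[toy_train_rows].mean(axis=0))  # approximately zero for each feature
print(toy_z[toy_train_rows].std(axis=0))   # approximately one for each feature
```

Using this variant in the notebook would mean splitting before scaling, which changes the order of the steps shown here.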

In [None]:
def create_node_mapping(df_nodes):
    """
    Create a mapping between transaction IDs and indices for graph construction.
    
    Parameters:
    -----------
    df_nodes : pandas.DataFrame
        DataFrame containing node data
        
    Returns:
    --------
    id2idx : dict
        Dictionary mapping transaction IDs to indices
    """
    # Create mapping of txId to index
    id2idx = {tx_id: i for i, tx_id in enumerate(df_nodes['txId'])}
    logger.info(f"Created mapping for {len(id2idx)} transactions")
    return id2idx

def preprocess_data(df_nodes, df_edges):
    """
    Preprocess the dataset by handling unknown classes and normalizing features.
    
    Parameters:
    -----------
    df_nodes : pandas.DataFrame
        DataFrame containing node data
    df_edges : pandas.DataFrame
        DataFrame containing edge data
        
    Returns:
    --------
    df_processed : pandas.DataFrame
        Processed DataFrame
    df_edges : pandas.DataFrame
        Processed edges DataFrame
    """
    logger.info("Preprocessing data")
    
    # Make a copy to avoid modifying the original
    df_processed = df_nodes.copy()
    
    # Handle unknown classes
    unknown_mask = df_processed['class'] == 'unknown'
    unknown_count = unknown_mask.sum()
    logger.info(f"Found {unknown_count} transactions with unknown class ({unknown_count/len(df_processed):.2%})")
    
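    if unknown_count > 0:
        # Copy after filtering so later column assignments act on an independent frame
        df_processed = df_processed[~unknown_mask].copy()
        logger.info(f"Removed transactions with unknown class. Remaining: {len(df_processed)}")
    
    # Convert non-numeric class labels to integers
    if not pd.api.types.is_numeric_dtype(df_processed['class']):
        # Sort the labels so the mapping does not depend on row order.
        # Check the logged mapping to confirm which integer denotes fraud.
        unique_classes = sorted(df_processed['class'].unique(), key=str)
        class_map = {cls: i for i, cls in enumerate(unique_classes)}
        df_processed['class'] = df_processed['class'].map(class_map)
        logger.info(f"Mapped class values to integers: {class_map}")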
    
    # Identify feature columns (exclude txId and class)
    feature_cols = [col for col in df_processed.columns if col not in ['txId', 'class']]
    
    # Check for and remove low-variance features
    variance = df_processed[feature_cols].var()
    low_var_threshold = 0.01
    low_var_cols = variance[variance < low_var_threshold].index.tolist()
    
    if low_var_cols:
        logger.info(f"Removing {len(low_var_cols)} low-variance features")
        df_processed = df_processed.drop(columns=low_var_cols)
        # Update feature columns
        feature_cols = [col for col in df_processed.columns if col not in ['txId', 'class']]
    
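    # Normalize features to zero mean and unit variance.
    # Note: the scaler is fit on all labeled nodes, including those later
    # assigned to the validation and test splits.
    scaler = StandardScaler()
    df_processed[feature_cols] = scaler.fit_transform(df_processed[feature_cols])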
    
    # Make sure edges only include transactions with known classes
    valid_tx_ids = set(df_processed['txId'])
    edges_before = len(df_edges)
    df_edges = df_edges[df_edges['txId1'].isin(valid_tx_ids) & df_edges['txId2'].isin(valid_tx_ids)]
    edges_after = len(df_edges)
    logger.info(f"Filtered edges to include only known transactions: {edges_before} -> {edges_after}")
    
    return df_processed, df_edges

# Preprocess the data
df_processed, df_edges = preprocess_data(df_nodes, df_edges)

# Create node mapping
id2idx = create_node_mapping(df_processed)

## Creating Graph Representation

Now we'll create the graph representation for PyTorch Geometric, which includes the edge index tensor, node features, and node labels.
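
PyTorch Geometric stores edges in coordinate (COO) form: a tensor of shape `[2, num_edges]` whose first row holds source node indices and whose second row holds target node indices. The sketch below illustrates that layout on three invented transactions. It uses a NumPy array as a stand-in for `torch.LongTensor` so the example runs without PyTorch; `toy_ids` and `toy_edges` are hypothetical.

```python
import numpy as np

# Hypothetical transaction IDs in node order, and directed edges between them
toy_ids = [101, 102, 105]
toy_edges = [(101, 102), (102, 105), (101, 999)]  # 999 is not a known node

# Map each transaction ID to its row position in the node table
toy_id2idx = {tx_id: i for i, tx_id in enumerate(toy_ids)}

# Keep only edges whose endpoints are both known, translated to indices
pairs = [(toy_id2idx[s], toy_id2idx[t])
         for s, t in toy_edges
         if s in toy_id2idx and t in toy_id2idx]

# Transpose the [num_edges, 2] list into the [2, num_edges] COO layout
toy_edge_index = np.array(pairs, dtype=np.int64).T

print(toy_edge_index)
# [[0 1]
#  [1 2]]
```

Row 0 lists the sources (nodes 0 and 1) and row 1 lists the targets (nodes 1 and 2); the edge to the unknown ID 999 is dropped.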

In [None]:
def build_edge_index(df_edges, id2idx):
    """
    Construct edge_index tensor for PyTorch Geometric.
    
    Parameters:
    -----------
    df_edges : pandas.DataFrame
        DataFrame containing edge data
    id2idx : dict
        Dictionary mapping transaction IDs to indices
        
    Returns:
    --------
    edge_index : torch.LongTensor
        Edge index tensor for PyTorch Geometric
    """
    logger.info("Building edge index tensor")
    
    # Map transaction IDs to node indices in bulk (faster than iterating row by row)
    src = df_edges['txId1'].map(id2idx)
    dst = df_edges['txId2'].map(id2idx)
    
    # IDs missing from the mapping become NaN; those edges are skipped
    valid = src.notna() & dst.notna()
    skipped = int((~valid).sum())
    
    if skipped > 0:
        logger.warning(f"Skipped {skipped} edges with missing transaction IDs")
    
    if not valid.any():
        logger.warning("No valid edges found")
        edge_index = torch.zeros((2, 0), dtype=torch.long)
    else:
        # Stack sources and targets into a [2, num_edges] array, then convert to a tensor
        edge_array = np.vstack([src[valid].to_numpy(), dst[valid].to_numpy()]).astype(np.int64)
        edge_index = torch.from_numpy(edge_array)
    
    logger.info(f"Built edge index with shape {edge_index.shape}")
    
    return edge_index

# Build edge index tensor
edge_index = build_edge_index(df_edges, id2idx)

# Print statistics about the edge index
print(f"Edge index shape: {edge_index.shape}")
print(f"Number of edges: {edge_index.shape[1]}")
print(f"Number of unique source nodes: {len(set(edge_index[0].numpy()))}")
print(f"Number of unique target nodes: {len(set(edge_index[1].numpy()))}")

## Data Splitting

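Now we'll split the nodes into training, validation, and test sets (70% / 15% / 15% by default). We'll use stratified sampling so that each split has approximately the same class distribution as the full labeled dataset; if stratification fails, for example because a class has too few samples, the code falls back to a plain random split.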
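
Because the split is done in two stages, the second call needs a rescaled fraction: once 70% of the nodes go to training, validation must take `0.15 / (0.15 + 0.15) = 0.5` of the remaining 30% to end up with 15% overall. The sketch below checks that arithmetic and shows one common way to turn index arrays into boolean node masks. The mask dictionary, the toy size of 20 nodes, and the unstratified permutation are assumptions for illustration, not something this notebook defines.

```python
import numpy as np

train_size, val_size, test_size = 0.7, 0.15, 0.15

# Fraction of the held-out pool (val + test) that goes to validation
val_share_of_rest = val_size / (val_size + test_size)

# Overall validation fraction implied by the two-stage split
overall_val = (1 - train_size) * val_share_of_rest
print(val_share_of_rest, round(overall_val, 2))

# Toy index arrays for 20 nodes (hypothetical, unstratified)
rng = np.random.default_rng(42)
perm = rng.permutation(20)
toy_split = {"train": perm[:14], "val": perm[14:17], "test": perm[17:]}

# Convert each index array into a boolean mask over all nodes
toy_masks = {}
for name, idx in toy_split.items():
    mask = np.zeros(20, dtype=bool)
    mask[idx] = True
    toy_masks[name] = mask

# Every node should belong to exactly one split
coverage = sum(m.astype(int) for m in toy_masks.values())
print(coverage.min(), coverage.max())
```

Boolean masks of this kind are a common convention for node-level tasks, since the full graph stays in one object and each mask selects which nodes contribute to the loss or the metrics.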

In [None]:
def create_data_splits(df_processed, edge_index, train_size=0.7, val_size=0.15, test_size=0.15, random_state=42):
    """
    Create train/validation/test splits and PyTorch Geometric Data object.
    
    Parameters:
    -----------
    df_processed : pandas.DataFrame
        Processed DataFrame
    edge_index : torch.LongTensor
        Edge index tensor
    train_size, val_size, test_size : float
        Proportions for train/val/test splits
    random_state : int
        Random seed
        
    Returns:
    --------
    data : torch_geometric.data.Data
        PyTorch Geometric Data object
    split_idx : dict
        Dictionary containing indices for train/val/test splits
    """
    logger.info("Creating data splits")
    
    # Get feature matrix
    feature_cols = [col for col in df_processed.columns if col not in ['txId', 'class']]
    features = torch.FloatTensor(df_processed[feature_cols].values)
    
    # Get labels
    labels = torch.LongTensor(df_processed['class'].values)
    
    # Create Data object
    data = Data(x=features, edge_index=edge_index, y=labels)
    
    # Create splits
    indices = np.arange(len(df_processed))
    
    try:
        # First split: train vs. (val+test)
        train_idx, temp_idx = train_test_split(
            indices, 
            train_size=train_size, 
            stratify=df_processed['class'].values,
            random_state=random_state
        )
        
        # Second split: val vs. test
        val_size_adjusted = val_size / (val_size + test_size)
        val_idx, test_idx = train_test_split(
            temp_idx,
            train_size=val_size_adjusted,
            stratify=df_processed.iloc[temp_idx]['class'].values,
            random_state=random_state
        )
    except ValueError as e:
        # If stratified split fails (e.g., too few samples in some class),
        # fall back to regular split
        logger.warning(f"Stratified split failed: {str(e)}. Using random split.")
        
        train_idx, temp_idx = train_test_split(
            indices, 
            train_size=train_size, 
            random_state=random_state
        )
        
        val_size_adjusted = val_size / (val_size + test_size)
        val_idx, test_idx = train_test_split(
            temp_idx,
            train_size=val_size_adjusted,
            random_state=random_state
        )
    
    # Create split dictionary
    split_idx = {
        'train': train_idx,
        'val': val_idx,
        'test': test_idx
    }
    
    logger.info(f"Created splits: train={len(train_idx)}, val={len(val_idx)}, test={len(test_idx)}")
    
    return data, split_idx

# Create data splits and PyTorch Geometric Data object
data, split_idx = create_data_splits(df_processed, edge_index)

# Print information about the data object
print("PyTorch Geometric Data object:")
print(f"Number of nodes: {data.num_nodes}")
print(f"Number of edges: {data.num_edges}")
print(f"Number of features: {data.num_features}")
print(f"Number of classes: {len(torch.unique(data.y))}")

# Print information about the splits
print("\nData splits:")
print(f"Training set: {len(split_idx['train'])} nodes ({len(split_idx['train'])/data.num_nodes:.2%})")
print(f"Validation set: {len(split_idx['val'])} nodes ({len(split_idx['val'])/data.num_nodes:.2%})")
print(f"Test set: {len(split_idx['test'])} nodes ({len(split_idx['test'])/data.num_nodes:.2%})")

# Check class distribution in each split
train_class_dist = np.bincount(data.y[split_idx['train']].numpy())
val_class_dist = np.bincount(data.y[split_idx['val']].numpy())
test_class_dist = np.bincount(data.y[split_idx['test']].numpy())

print("\nClass distribution in each split:")
print(f"Training set: {train_class_dist} ({np.round(train_class_dist / train_class_dist.sum() * 100, 2)}%)")
print(f"Validation set: {val_class_dist} ({np.round(val_class_dist / val_class_dist.sum() * 100, 2)}%)")
print(f"Test set: {test_class_dist} ({np.round(test_class_dist / test_class_dist.sum() * 100, 2)}%)")

## Saving Processed Data

Finally, let's save the processed data and splits for later use in training and evaluation.
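
As a hedged sketch of the round trip, the example below writes toy arrays under the same file names used by `save_processed_data` and reads them back with `np.load`. It works in a temporary directory with invented values, so nothing in `data/processed` is touched. The `.pt` files would be read with `torch.load` instead; on recent PyTorch releases, loading a pickled `Data` object may require passing `weights_only=False`, which should only be done for files you trust.

```python
import os
import tempfile
import numpy as np

# Toy stand-ins for the processed arrays (values invented for illustration)
toy_saved_x = np.array([[0.1, -0.2], [1.3, 0.4], [-0.7, 0.9]], dtype=np.float32)
toy_saved_y = np.array([0, 1, 0], dtype=np.int64)
toy_saved_train_idx = np.array([0, 2])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the arrays using the same file names as the save step
    np.save(os.path.join(tmp_dir, "features.npy"), toy_saved_x)
    np.save(os.path.join(tmp_dir, "labels.npy"), toy_saved_y)
    np.save(os.path.join(tmp_dir, "train_idx.npy"), toy_saved_train_idx)

    # Read them back into memory before the directory is removed
    loaded_x = np.load(os.path.join(tmp_dir, "features.npy"))
    loaded_y = np.load(os.path.join(tmp_dir, "labels.npy"))
    loaded_train_idx = np.load(os.path.join(tmp_dir, "train_idx.npy"))

# Select the training rows using the saved indices
train_x = loaded_x[loaded_train_idx]
train_y = loaded_y[loaded_train_idx]
print(train_x.shape, train_y.tolist())
```

Keeping the index arrays separate from the features means the same feature file can be reused with a different split later.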

In [None]:
def save_processed_data(data, split_idx, output_dir='data/processed'):
    """
    Save processed data and splits to disk.
    
    Parameters:
    -----------
    data : torch_geometric.data.Data
        PyTorch Geometric Data object
    split_idx : dict
        Dictionary containing indices for train/val/test splits
    output_dir : str
        Directory to save processed data
    """
    os.makedirs(output_dir, exist_ok=True)
    
    logger.info(f"Saving processed data to {output_dir}")
    
    # Save data object
    torch.save(data, os.path.join(output_dir, 'data.pt'))
    
    # Save edge index separately
    torch.save(data.edge_index, os.path.join(output_dir, 'edge_index.pt'))
    
    # Save features and labels
    np.save(os.path.join(output_dir, 'features.npy'), data.x.numpy())
    np.save(os.path.join(output_dir, 'labels.npy'), data.y.numpy())
    
    # Save splits
    for split in split_idx:
        np.save(os.path.join(output_dir, f'{split}_idx.npy'), split_idx[split])
    
    logger.info(f"Successfully saved processed data to {output_dir}")

# Save processed data
save_processed_data(data, split_idx)

print("Data preparation completed successfully!")
print("Saved processed data to data/processed/ directory.")

## Summary

In this notebook, we:

1. Loaded the Bitcoin transaction dataset (class labels, features, and edges)
2. Explored the data to understand its structure and characteristics
3. Preprocessed the data by removing transactions with unknown classes, mapping class labels to integers, removing low-variance features, and normalizing features
4. Created a graph representation using PyTorch Geometric
5. Split the data into training, validation, and test sets
6. Saved the processed data for later use

The processed data is now ready for feature engineering and model training in subsequent notebooks.