# Data Graph Construction

Construct the Bitcoin transaction graph from raw data.

**Steps:**
1. Load node features from CSV
2. Load node labels (fraud/non-fraud)
3. Load edge list (transactions)
4. Create PyTorch Geometric Data object
5. Save to artifacts for training

**Data:** Elliptic++ Bitcoin Dataset
- ~203k nodes (Bitcoin addresses)
- ~230k edges (transactions)
- 182 features per node (temporal + structural)

## Setup

In [None]:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

import sys
import pandas as pd
import numpy as np
import torch
from torch_geometric.data import Data
from torch_geometric.utils import add_self_loops
import matplotlib.pyplot as plt
from tqdm import tqdm
from pathlib import Path

# Add project root
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src.config import DATASET_CONFIG, TRAINING_CONFIG, DATA_DIR, ARTIFACTS_DIR, FIGURES_DIR
from src.utils import set_random_seeds

# Set random seeds
set_random_seeds(TRAINING_CONFIG['random_seed'])
print("✓ Setup complete!")

### What This Cell Does
This cell imports all necessary libraries and sets up the environment:
- **os, sys, Path**: File system operations
- **pandas**: Reading CSV files
- **numpy**: Numerical operations
- **torch**: PyTorch framework (neural networks)
- **torch_geometric**: Graph neural network library
- **matplotlib**: Visualization
- **tqdm**: Progress bars

Then it:
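1. Sets `KMP_DUPLICATE_LIB_OK=TRUE`, which suppresses the error raised when more than one copy of the OpenMP runtime is loaded (a CPU library issue, not a GPU setting)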
2. Adds project root to Python path (so we can import `src/`)
3. Loads configuration from `src/config.py`
4. Sets random seeds for reproducibility
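
`src/config.py` is not shown in this notebook, so the following is only a sketch of what it might contain. The dictionary keys are the ones the cells below read, and the three file names come from the explanations in this notebook. The directory layout, column names, raw label values, and seed are all assumptions and should be checked against the real file.

```python
from pathlib import Path

# Hypothetical layout; the real src/config.py may differ.
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
FIGURES_DIR = PROJECT_ROOT / "figures"

DATASET_CONFIG = {
    "features_file": "txs_features.csv",
    "classes_file": "txs_classes.csv",
    "edgelist_file": "txs_edgelist.csv",
    "id_column": "txId",             # assumed column name
    "label_column": "class",         # assumed column name
    "edge_source_column": "txId1",   # assumed column name
    "edge_target_column": "txId2",   # assumed column name
    "illicit_label": 1,              # assumed raw value for illicit
    "licit_label": 2,                # assumed raw value for licit
}

TRAINING_CONFIG = {"random_seed": 42}  # assumed seed value
```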

## Load Features

In [None]:
print("\nLoading features...")
features_file = DATA_DIR / DATASET_CONFIG['features_file']
df_features = pd.read_csv(features_file, header=0)
print(f"Shape: {df_features.shape}")

node_ids = df_features.iloc[:, 0].values
timesteps = df_features.iloc[:, 1].values
features = df_features.iloc[:, 2:].values.astype(np.float32)

node_id_to_idx = {nid: idx for idx, nid in enumerate(node_ids)}
print(f"Nodes: {len(node_ids):,}, Features: {features.shape[1]}")

### What This Cell Does (Load Features)
This cell loads the **node features** (182 numbers per Bitcoin address):

1. **Read CSV file**: `txs_features.csv` contains:
   - Column 0: node_id (unique address identifier)
   - Column 1: timestep (time period)
   - Columns 2-183: 182 numerical features

2. **Extract components**:
   - `node_ids`: Address identifiers (needed for mapping)
   - `timesteps`: Time information
   - `features`: The 182 numerical values per node

3. **Create mapping**: `node_id_to_idx` is a dictionary:
   - Key: original node ID from CSV
   - Value: index 0-203k (needed for edges later)
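
The mapping step can be tried on a toy example. The node IDs below are made up; only the dictionary comprehension is taken from the cell above.

```python
# Toy node IDs in the order they appear in the feature file (values are made up).
node_ids = [230425980, 5530458, 232022460]

# Same comprehension as the notebook: original ID -> row index.
node_id_to_idx = {nid: idx for idx, nid in enumerate(node_ids)}

# Row i of the feature matrix belongs to node_ids[i], so lookups round-trip.
assert all(node_ids[node_id_to_idx[nid]] == nid for nid in node_ids)

print(node_id_to_idx.get(5530458))  # 1
print(node_id_to_idx.get(42))       # None (ID not present in the feature file)
```

If an ID appeared twice in the feature file, the later row would overwrite the earlier one in the dictionary, so it may be worth checking `len(node_id_to_idx) == len(node_ids)` on the real data.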

## Load Labels

In [None]:
print("\nLoading labels...")
classes_file = DATA_DIR / DATASET_CONFIG['classes_file']
df_classes = pd.read_csv(classes_file)
print("Original class distribution:")
print(df_classes[DATASET_CONFIG['label_column']].value_counts())

labels = np.full(len(node_ids), -1, dtype=np.int64)

illicit_label = DATASET_CONFIG['illicit_label']
licit_label = DATASET_CONFIG['licit_label']

for _, row in tqdm(df_classes.iterrows(), total=len(df_classes), desc="Processing labels"):
    node_id = row[DATASET_CONFIG['id_column']]
    if node_id in node_id_to_idx:
        idx = node_id_to_idx[node_id]
        if row[DATASET_CONFIG['label_column']] == illicit_label:
            labels[idx] = 1  # Illicit
        elif row[DATASET_CONFIG['label_column']] == licit_label:
            labels[idx] = 0  # Licit
        # Unknown (3) remains -1

labeled_mask = labels != -1
print(f"\nLabeled: {labeled_mask.sum():,}, Unlabeled: {(~labeled_mask).sum():,}")
print(f"Licit (0): {(labels==0).sum():,}, Illicit (1): {(labels==1).sum():,}")

### What This Cell Does (Load Labels)
This cell loads the **classification labels** (whether each address is illicit or licit):

1. **Read CSV file**: `txs_classes.csv` contains:
   - Column: node_id (address identifier)
   - Column: class (raw values; `illicit_label` and `licit_label` in `DATASET_CONFIG` say which values mean illicit and licit, and 3 = unknown)

2. **Initialize label array**: Create an array of size 203k filled with -1 (unlabeled):
   - Every node starts as unlabeled
   - Known labels are filled in afterwards

3. **Map labels to indices**: For each node in classes file:
   - Find its index using `node_id_to_idx` dictionary
   - Set label[index] = 0 (licit) or 1 (illicit)
   - Unknown nodes (3) remain -1 for semi-supervised learning

4. **Calculate statistics**: 
   - Count how many nodes have labels vs don't
   - Show class balance (how many illicit vs licit)
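
The row-by-row loop above is easy to follow but can be slow on a few hundred thousand rows. As a sketch of a vectorized alternative (not what the notebook runs), the same mapping can be expressed with `Series.map`. The column names `txId` and `class` and the raw label values 1 and 2 are assumptions for this toy example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's variables (all values are made up).
node_ids = [101, 205, 309, 412]
node_id_to_idx = {nid: idx for idx, nid in enumerate(node_ids)}

# Hypothetical classes table; 999 has no features, and class 3 is unknown.
df_classes = pd.DataFrame({"txId": [205, 101, 412, 999], "class": [1, 2, 3, 1]})
illicit_label, licit_label = 1, 2  # assumed raw values

# Map node IDs to row indices; IDs missing from the feature file become NaN.
idx = df_classes["txId"].map(node_id_to_idx)
# Map raw classes to 1 (illicit) / 0 (licit); any other value becomes NaN.
mapped = df_classes["class"].map({illicit_label: 1, licit_label: 0})

# Keep rows where both the node and the label are known.
keep = idx.notna() & mapped.notna()
labels = np.full(len(node_ids), -1, dtype=np.int64)
labels[idx[keep].astype(np.int64).to_numpy()] = mapped[keep].astype(np.int64).to_numpy()

print(labels.tolist())  # [0, 1, -1, -1]
```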

## Load Edges

In [None]:
print("\nLoading edges...")
edgelist_file = DATA_DIR / DATASET_CONFIG['edgelist_file']
df_edges = pd.read_csv(edgelist_file)
print(f"Edges in file: {len(df_edges):,}")

edge_list = []
for _, row in tqdm(df_edges.iterrows(), total=len(df_edges), desc="Processing edges"):
    src_id = row[DATASET_CONFIG['edge_source_column']]
    dst_id = row[DATASET_CONFIG['edge_target_column']]
    
    if src_id in node_id_to_idx and dst_id in node_id_to_idx:
        src_idx = node_id_to_idx[src_id]
        dst_idx = node_id_to_idx[dst_id]
        edge_list.append([src_idx, dst_idx])
        edge_list.append([dst_idx, src_idx])  # Make undirected

edge_index = torch.tensor(edge_list, dtype=torch.long).t().contiguous()
print(f"Edge index shape (before self-loops): {edge_index.shape}")

# Add self-loops for GAT
edge_index, _ = add_self_loops(edge_index, num_nodes=len(node_ids))
print(f"Edge index shape (after self-loops): {edge_index.shape}")

### What This Cell Does (Load Edges)
This cell loads the **transaction edges** that connect addresses:

1. **Read CSV file**: `txs_edgelist.csv` contains:
   - Column: source_id (from address)
   - Column: target_id (to address) 
   - Each row is ONE transaction direction

2. **Map edges to indices**:
   - For each edge (source → target):
   - Look up source_id in `node_id_to_idx` to get source index
   - Look up target_id in `node_id_to_idx` to get target index
   - Keep only edges between nodes we have features for (skip if not in node_id_to_idx)

3. **Make graph undirected**:
   - Add reverse edge (target → source) for each original edge
   - This means each transaction becomes TWO edges (bidirectional)
   - Messages can then flow in both directions, so a node aggregates from both senders and receivers

4. **Create edge_index tensor**:
   - Stack all edges into a [2, num_edges] tensor
   - First row: source indices
   - Second row: target indices
   - This is the edge format PyTorch Geometric expects

5. **Add self-loops**:
   - `add_self_loops` appends one edge per node where source = target
   - Lets each node attend to its own features alongside its neighbors'
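
The edge steps above can be traced on a three-node toy graph. This sketch uses plain Python lists rather than tensors and adds the self-loops by hand instead of calling `add_self_loops`; the IDs and edges are made up.

```python
# Toy node IDs and directed edges (source_id, target_id); values are made up.
node_ids = [101, 205, 309]
node_id_to_idx = {nid: idx for idx, nid in enumerate(node_ids)}
raw_edges = [(101, 205), (205, 309), (309, 999)]  # 999 has no features

sources, targets = [], []
for src_id, dst_id in raw_edges:
    # Skip edges that touch a node without features.
    if src_id in node_id_to_idx and dst_id in node_id_to_idx:
        s, d = node_id_to_idx[src_id], node_id_to_idx[dst_id]
        sources += [s, d]  # forward edge, then its reverse
        targets += [d, s]

# One self-loop per node, appended after the real edges.
for i in range(len(node_ids)):
    sources.append(i)
    targets.append(i)

edge_index = [sources, targets]  # layout: [2, num_edges]
print(edge_index)  # [[0, 1, 1, 2, 0, 1, 2], [1, 0, 2, 1, 0, 1, 2]]
```

Two valid input edges become four directed edges, and three self-loops bring the total to seven columns. Note that if the edge list already contained both directions of a pair, this approach would store that pair twice.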

## Create PyTorch Geometric Data

In [None]:
x = torch.tensor(features, dtype=torch.float32)
y = torch.tensor(labels, dtype=torch.long)
timestep_tensor = torch.tensor(timesteps, dtype=torch.long)
labeled_mask_tensor = torch.tensor(labeled_mask, dtype=torch.bool)
unlabeled_mask_tensor = torch.tensor(~labeled_mask, dtype=torch.bool)

data = Data(
    x=x,
    edge_index=edge_index,
    y=y,
    timestep=timestep_tensor,
    labeled_mask=labeled_mask_tensor,
    unlabeled_mask=unlabeled_mask_tensor
)

print("\n" + "="*50)
print(data)
print("="*50)

### What This Cell Does (Create PyTorch Geometric Data)
This cell packages all the loaded data into a **PyTorch Geometric Data object**:

1. **Convert to PyTorch tensors**:
   - `x`: Node features (203k × 182) - the 182 numerical values per address
   - `y`: Labels (203k,) - 0 for licit, 1 for illicit, -1 for unlabeled
   - `timestep`: Time period for each node (for temporal analysis)
   - `labeled_mask`: Boolean array showing which nodes have labels
   - `unlabeled_mask`: Boolean array showing which nodes don't have labels

2. **Create Data object**:
   - Combines all the above into ONE object
   - `edge_index`: The transaction graph connectivity
   - PyTorch Geometric models expect this format
   - This is the standard way to represent graphs in deep learning

3. **Why this matters**:
   - The Data object is what gets passed to neural network models
   - Models expect: node features (x), edges (edge_index), labels (y)
   - The masks mark which nodes have labels, so only labeled nodes contribute to the loss and metrics; no train/validation/test split is created in this notebook
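
As a small illustration of how a labeled mask is typically used downstream, the sketch below computes accuracy over labeled nodes only. NumPy arrays stand in for tensors here (boolean-mask indexing works the same way on `torch` tensors), and the predictions are made up.

```python
import numpy as np

# Toy labels: -1 marks unlabeled nodes, as in the notebook.
y = np.array([0, 1, -1, 0, -1])
labeled_mask = y != -1

# Made-up predictions for all five nodes.
preds = np.array([0, 0, 1, 0, 1])

# Only labeled nodes are compared; unlabeled nodes are ignored entirely.
accuracy = (preds[labeled_mask] == y[labeled_mask]).mean()
print(f"{labeled_mask.sum()} labeled nodes, accuracy {accuracy:.3f}")
```

The same indexing pattern applies to a loss: select `logits[labeled_mask]` and `y[labeled_mask]` before computing it, so the -1 placeholders never reach the loss function.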

## Save Graph

In [None]:
# Create directories if needed
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

# Save graph
save_path = ARTIFACTS_DIR / 'elliptic_graph.pt'
torch.save(data, save_path)
print(f"✓ Graph saved to {save_path}")

### What This Cell Does (Save Graph)
This cell saves the constructed graph to disk for **reuse in later training**:

1. **Create directories**:
   - Ensure `artifacts/` folder exists (where models and data are saved)
   - Ensure `figures/` folder exists (where visualizations go)

2. **Save as PyTorch checkpoint**:
   - Save the entire Data object to `elliptic_graph.pt`
   - Uses pickle format (PyTorch's standard `torch.save()`)
   - File size: ~200MB (includes all features, edges, labels)

3. **Why save instead of rebuild**:
   - Graph creation takes 5-10 minutes (reading CSVs, mapping nodes, processing edges)
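   - Saving lets later notebooks reload the graph with a single `torch.load` call instead of re-parsing the CSVs (on PyTorch 2.6+, where `torch.load` defaults to `weights_only=True`, loading a pickled `Data` object may need `weights_only=False`)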
   - Consistency: Everyone uses the exact same graph (no random variations)
   - `03_data_sanity_checks.ipynb` and the training notebooks load and reuse this file

## Visualize Data Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Labels
label_counts = [(labels==0).sum(), (labels==1).sum(), (~labeled_mask).sum()]
axes[0].bar(['Licit', 'Illicit', 'Unknown'], label_counts, color=['green', 'red', 'gray'])
axes[0].set_ylabel('Count')
axes[0].set_title('Label Distribution')
axes[0].set_yscale('log')

# Timesteps
axes[1].hist(timesteps, bins=50, edgecolor='black')
axes[1].set_xlabel('Timestep')
axes[1].set_ylabel('Count')
axes[1].set_title('Temporal Distribution')

plt.tight_layout()
save_path = FIGURES_DIR / 'data_distribution.png'
plt.savefig(save_path, dpi=150)
print(f"✓ Figure saved to {save_path}")
plt.show()

### What This Cell Does (Visualize Data Distribution)
This cell creates visualizations to understand the **dataset composition**:

1. **Left plot - Label distribution** (log scale):
   - Shows how many licit addresses (green)
   - Shows how many illicit addresses (red)
   - Shows how many unlabeled/unknown addresses (gray)
   - Uses a log scale so the small illicit bar stays visible next to the much larger unlabeled bar
   - Reveals class imbalance: far fewer illicit than licit addresses

2. **Right plot - Temporal distribution**:
   - Histogram of timesteps (time periods)
   - Shows how many nodes fall into each time period
   - Reveals whether the data is spread evenly over time or concentrated in a few periods

3. **Save figure**:
   - Saves both plots to `figures/data_distribution.png`
   - Good for reports, presentations, documentation
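
If the timesteps are consecutive integers (an assumption about the dataset), 50 equal-width bins will not line up with them exactly, which can leave an artificial gap in the histogram. A sketch of two alternatives on toy data: exact per-timestep counts, and bin edges centered on each integer, which can also be passed to `axes[1].hist(..., bins=edges)`.

```python
import numpy as np

# Toy timesteps (made up); the real array comes from the features file.
timesteps = np.array([1, 1, 2, 3, 3, 3])

# Exact count per distinct timestep.
steps, counts = np.unique(timesteps, return_counts=True)

# Bin edges at half-integers, so each bin holds exactly one timestep.
edges = np.arange(timesteps.min() - 0.5, timesteps.max() + 1.5, 1.0)
hist, _ = np.histogram(timesteps, bins=edges)

print(steps.tolist(), counts.tolist())  # [1, 2, 3] [2, 1, 3]
print(hist.tolist())                    # [2, 1, 3]
```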

## Summary

In [None]:
print("\n" + "="*50)
print("✅ GRAPH CONSTRUCTION COMPLETE!")
print("="*50)
print("\nNext steps:")
print("1. Run 01_baseline_gat_training.ipynb for baseline model")
print("2. Run 02_quantum_gat_training.ipynb for quantum model")
print("3. Run 03_data_sanity_checks.ipynb for validation")

### What This Cell Does (Summary)
This cell provides a **completion summary and next steps**:

1. **Marks completion**:
   - Prints a success message indicating the graph is ready
   - Does not re-check the file; the Save Graph cell's "Graph saved" message is what confirms `elliptic_graph.pt` was written

2. **Tells what to do next**:
   - Run `01_baseline_gat_training.ipynb` to train the baseline model
   - Run `02_quantum_gat_training.ipynb` to train the quantum model
   - Run `03_data_sanity_checks.ipynb` to verify graph structure

3. **Workflow progression**:
   - This notebook (create_graph): Build the graph → ✓ DONE
   - Next: Verify it's correct (sanity_checks)
   - Then: Train models using this graph