## Selected Dataset: Elliptic Data Set

**Selection Date**: 2025-08-29  
**Selected from**: dataset-exploration-risk-scoring-research.md  
**Rank**: #1 out of 10 evaluated datasets  
**Suitability Score**: 92/100  

### Dataset Overview

### **Elliptic Data Set**
- **Source Platform**: Kaggle
- **Direct URL**: https://www.kaggle.com/datasets/ellipticco/elliptic-data-set
- **Dataset Size**: ~200,000 transactions × 166 features, ~697.46 MB
- **Problem Relevance**: High - Bitcoin illicit transaction classification
- **Data Quality**: Excellent - professionally curated by Elliptic Co.
- **License Type**: Open (with attribution requirements)
- **Last Updated**: 2019 (stable reference dataset)
- **Preprocessing Needs**: Minimal - ready for ML training

### Key Features and Structure

### Feature Categories
- **Local Features**: 94 transaction-specific features
- **Aggregate Features**: 72 neighborhood/graph-based features  
- **Total Features**: 166 feature dimensions
- **Temporal Component**: Time step information included
- **Labels**: Binary classification (illicit/licit); transactions without a label are marked `unknown`

### Specific Features Include:
- Transaction fees
- Input/output volumes  
- Neighbor aggregates
- Time steps
- BTC amounts
- Graph topology metrics
- Wallet clustering information
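
The features file is loaded later without a header row, so its columns end up as bare integers. A minimal sketch of descriptive column names is shown below. It assumes the layout in the dataset's public description: a transaction ID, then the time step, then the remaining local features, then the 72 aggregate features. The names `time_step`, `local_*`, and `agg_*` are illustrative, not part of the dataset.

```python
import pandas as pd

def elliptic_column_names(n_local: int = 93, n_aggregate: int = 72) -> list[str]:
    """Build names for the header-less features file.

    Assumed layout: txId, time step, 93 further local features, 72 aggregate
    features (the time step counts as one of the 94 local features).
    """
    return (
        ["txId", "time_step"]
        + [f"local_{i}" for i in range(1, n_local + 1)]
        + [f"agg_{i}" for i in range(1, n_aggregate + 1)]
    )

# Tiny synthetic frame standing in for a file read with header=None.
columns = elliptic_column_names()
demo = pd.DataFrame([[0] * len(columns)], columns=range(len(columns)))
demo.columns = columns
print(len(columns), demo.columns[:3].tolist())
```

With the defaults this yields 167 names: one ID column plus the 166 feature dimensions listed above.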

### Importing libraries

Import required packages

In [1]:
import kaggle
import warnings
import pandas as pd
import plotly.io as pio
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from IPython.display import display
from pathlib import Path
from azure_utils import AzureBlobDownloader 
warnings.filterwarnings('ignore')

# Set renderer for VS Code compatibility
pio.renderers.default = "plotly_mimetype+notebook"

azureClient = AzureBlobDownloader("https://stmvppos.blob.core.windows.net", "mvpkytsup")

### Dataset Visualization Strategy

Let's inspect the dataset to decide which preprocessing steps to apply:
 1. Create the dataset directory and download the original data from source
 2. Load all datasets
 3. Display the feature dataset
 4. Display the classes and edges

**1. Create the dataset directory and download the original data from source**

In [2]:
# Download dataset from Azure Blob Storage or Kaggle if not already present
dataset_str = "elliptic_bitcoin_dataset"
dataset_dir = Path(dataset_str)
original_dir = Path("../original")
specific_dir = original_dir / dataset_dir

# Skip the download when the dataset folder already contains files
if specific_dir.exists() and any(specific_dir.iterdir()):
    print(f"Dataset already exists in {specific_dir}, skipping download.")
else:
    print("Attempting to download dataset from Azure Blob Storage...")

    if azureClient.download_documents(dataset_str):
        print("Download from Azure Blob Storage completed successfully.")
    else:
        print("Falling back to Kaggle download...")
        original_dir.mkdir(parents=True, exist_ok=True)

        # Download from Kaggle as a fallback
        dataset_name = "ellipticco/elliptic-data-set"
        kaggle.api.dataset_download_files(
            dataset_name,
            path=str(original_dir),
            unzip=True
        )

Attempting to download dataset from Azure Blob Storage...
Successfully downloaded 3 files from Azure Blob Storage
Download from Azure Blob Storage completed successfully.
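
The output above reports three downloaded files. A quick way to confirm that the expected CSVs actually arrived is sketched below. The helper `missing_files` is hypothetical, and the demonstration runs against a temporary directory rather than the real download location.

```python
from pathlib import Path
import tempfile

# File names used by the loading step of this notebook.
EXPECTED_FILES = (
    "elliptic_txs_features.csv",
    "elliptic_txs_classes.csv",
    "elliptic_txs_edgelist.csv",
)

def missing_files(directory: Path, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are absent or empty in `directory`."""
    return [
        name for name in expected
        if not (directory / name).is_file() or (directory / name).stat().st_size == 0
    ]

# Demonstration: only one of the three files is present.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "elliptic_txs_features.csv").write_text("1,2,3\n")
    demo_missing = missing_files(tmp_dir)

print(demo_missing)
```

In the notebook, the same call could be made as `missing_files(specific_dir)` before loading, so that a partial download is reported instead of silently producing empty DataFrames.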


**2. Load all datasets**


In [None]:
# Load each CSV if present; fall back to an empty DataFrame otherwise
def load_csv(path, **kwargs):
    return pd.read_csv(path, **kwargs) if path.exists() else pd.DataFrame()

features_file = specific_dir / "elliptic_txs_features.csv"
classes_file = specific_dir / "elliptic_txs_classes.csv"
edges_file = specific_dir / "elliptic_txs_edgelist.csv"

df_features = load_csv(features_file, header=None)  # features file has no header row
df_classes = load_csv(classes_file)
df_edges = load_csv(edges_file)
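
Once the three tables are loaded, it can help to confirm that they refer to the same set of transactions. The sketch below is one possible check, assuming the first column of the features and classes tables and the first two columns of the edge list all hold transaction IDs. The function name and the synthetic frames are illustrative only.

```python
import pandas as pd

def check_id_consistency(features: pd.DataFrame,
                         classes: pd.DataFrame,
                         edges: pd.DataFrame) -> dict:
    """Count IDs in the classes and edge tables that are absent from the features table."""
    known_ids = set(features.iloc[:, 0])
    edge_ids = set(edges.iloc[:, 0]) | set(edges.iloc[:, 1])
    return {
        "classes_not_in_features": len(set(classes.iloc[:, 0]) - known_ids),
        "edge_nodes_not_in_features": len(edge_ids - known_ids),
    }

# Synthetic stand-ins for the three tables (not real data).
demo_features = pd.DataFrame([[10, 1, 0.5], [11, 1, -0.2], [12, 2, 0.1]])
demo_classes = pd.DataFrame({"txId": [10, 11, 12], "class": ["1", "2", "unknown"]})
demo_edges = pd.DataFrame({"txId1": [10, 11], "txId2": [11, 99]})

report = check_id_consistency(demo_features, demo_classes, demo_edges)
print(report)
```

Non-zero counts would suggest a truncated or mismatched download and are worth resolving before merging the tables.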

**3. Display the feature dataset**

In [None]:
# Summary of all datasets
print(f"\n📊 Dataset Summary:")
print(f"  - Features: {df_features.shape[0]:,} transactions × {df_features.shape[1]} features")
print(f"  - Classes: {df_classes.shape[0]:,} labeled transactions")
print(f"  - Edges: {df_edges.shape[0]:,} transaction relationships")

# Try Plotly table with explicit display
try:
    print("\n📋 Interactive Plotly Table (first 100 rows, first 10 columns):")
    fig = go.Figure(data=[go.Table(
        header=dict(
            values=list(df_features.columns[:10]),
            fill_color='paleturquoise',
            align='left'
        ),
        cells=dict(
            values=[df_features.iloc[:100, i] for i in range(10)],
            fill_color='lavender',
            align='left'
        )
    )])
    fig.update_layout(title="Features Dataset - Interactive Table")
    
    # Pass the renderer explicitly for VS Code compatibility
    fig.show(renderer="plotly_mimetype+notebook")
    
except Exception as e:
    print(f"⚠️ Plotly display failed: {e}")
    print("Using pandas display instead:")
    display(df_features.head(10))

**4. Display the classes and edges**

In [None]:
# Display sample data using pandas
print("\n🔍 Classes Dataset Sample:")
display(df_classes.head(10))

print("\n🔍 Edges Dataset Sample:")
display(df_edges.head(10))
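
Beyond looking at sample rows, the edge list can be summarized per transaction. The sketch below computes in-degree and out-degree with pandas, assuming the first edge column is the source and the second is the destination. The function name and the demo frame are hypothetical.

```python
import pandas as pd

def degree_table(edges: pd.DataFrame) -> pd.DataFrame:
    """Return out-degree and in-degree per node from a two-column edge list."""
    src, dst = edges.columns[:2]
    out_deg = edges[src].value_counts().rename("out_degree")
    in_deg = edges[dst].value_counts().rename("in_degree")
    # Outer-align on node ID; nodes missing from one side get degree 0.
    table = pd.concat([out_deg, in_deg], axis=1).fillna(0).astype(int)
    table.index.name = "txId"
    return table.sort_index()

# Synthetic edge list: 1 -> 2, 1 -> 3, 2 -> 3
demo_edges = pd.DataFrame({"txId1": [1, 1, 2], "txId2": [2, 3, 3]})
demo_degrees = degree_table(demo_edges)
print(demo_degrees)
```

Applied to `df_edges`, the resulting table gives a first impression of the graph's topology, for example how many transactions have no outgoing edge.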

### Dataset Analysis Strategy

Let's analyze the following: 
1. Check if the data is already in standard scale format
2. Check for dataset class balance 

**1. Check if the data is already in standard scale format**

In [None]:
# Check if df_features is in standard scale format (mean ≈ 0, std ≈ 1)
# Column 0 is txId and column 1 is the integer time step; neither is a scaled feature
numerical_features = df_features.iloc[:, 2:]
means = numerical_features.mean()
stds = numerical_features.std()

# Standardization check
is_standardized = (means.abs() < 0.1).all() and ((stds > 0.9) & (stds < 1.1)).all()

print(f"📊 Feature Standardization Check:")
print(f"  Mean range: [{means.min():.3f}, {means.max():.3f}]")
print(f"  Std range: [{stds.min():.3f}, {stds.max():.3f}]")
print(f"  Is standardized: {'✅ Yes' if is_standardized else '❌ No'}")
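
The check above returns a single yes/no answer. When the answer is "No", it is usually more useful to know which columns fall outside the tolerances. The sketch below lists them, using the same thresholds as the cell above (|mean| < 0.1 and 0.9 < std < 1.1). The function name and demo data are illustrative.

```python
import pandas as pd

def non_standardized_columns(frame: pd.DataFrame,
                             mean_tol: float = 0.1,
                             std_tol: float = 0.1) -> pd.DataFrame:
    """Return mean and std for columns outside the standardization tolerances."""
    means = frame.mean()
    stds = frame.std()  # sample standard deviation (ddof=1), the pandas default
    off = (means.abs() >= mean_tol) | ((stds - 1.0).abs() >= std_tol)
    return pd.DataFrame({"mean": means, "std": stds})[off]

# Synthetic example: one column already scaled, one on its raw scale.
demo = pd.DataFrame({
    "scaled": [-1.0, 0.0, 1.0],
    "raw": [10.0, 20.0, 30.0],
})
offenders = non_standardized_columns(demo)
print(offenders)
```

Columns reported here are candidates for rescaling, or for exclusion from the check when they are identifiers or counters rather than features.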

**2. Check for dataset class balance**

In [None]:
# Class balance analysis
class_counts = df_classes['class'].value_counts()
labeled_only = class_counts[class_counts.index != 'unknown']
imbalance_ratio = labeled_only.max() / labeled_only.min() if len(labeled_only) >= 2 else 1.0

print(f"⚖️ Class Balance: {len(df_classes):,} total samples")
for cls, count in class_counts.items():
    print(f"  {cls}: {count:,} ({count/len(df_classes)*100:.1f}%)")
print(f"  Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"  Status: {'✅ Balanced' if imbalance_ratio <= 1.5 else '⚠️ Imbalanced' if imbalance_ratio <= 3.0 else '❌ Highly imbalanced'}")

# Plot distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
class_counts.plot(kind='bar', ax=ax1, color=['lightblue', 'orange', 'lightcoral'])
ax1.set_title('Class Counts')
ax1.tick_params(axis='x', rotation=45)
class_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%', colors=['lightblue', 'orange', 'lightcoral'])
ax2.set_title('Class Distribution')
ax2.set_ylabel('')
plt.tight_layout()
plt.show()
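
If the labeled classes turn out to be imbalanced, one common mitigation is to weight classes inversely to their frequency during training. The sketch below uses the formula `n_samples / (n_classes * class_count)`, the same heuristic as scikit-learn's `class_weight="balanced"` option. It is an illustration of that option, not a step the notebook performs, and the function name is hypothetical.

```python
import pandas as pd

def balanced_class_weights(labels: pd.Series, ignore=("unknown",)) -> dict:
    """Inverse-frequency weights over labeled classes only."""
    counts = labels[~labels.isin(list(ignore))].value_counts()
    total = counts.sum()
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Synthetic labels: 1 of class '1', 9 of class '2', 30 unlabeled.
demo_labels = pd.Series(["1"] * 1 + ["2"] * 9 + ["unknown"] * 30)
weights = balanced_class_weights(demo_labels)
print(weights)
```

The rarer class receives the larger weight, so misclassifying it costs more in a weighted loss.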

### Dataset Preparations Strategy

Let's prepare the data for the training steps:
1. Filter transactions to include only labeled data (classes 1 and 2)
2. Compress and save the processed dataset for next steps
3. Update external data warehouse

**1. Filter transactions to include only labeled data (classes 1 and 2)**

In [None]:
# Build and display the complete dataset (features joined with class labels)
print("Complete dataset")
df_complete = df_features.merge(df_classes, left_on=df_features.columns[0], right_on=df_classes.columns[0], how='inner')
# The merge keeps both key columns: drop the one from df_classes, then name the features key 'txId'
df_complete = df_complete.drop('txId', axis=1)
df_complete = df_complete.rename(columns={df_complete.columns[0]: 'txId'})
display(df_complete.head(10))

# Display labeled dataset
print("Labeled dataset")
labeledRowSelector = df_complete['class'].isin(['1', '2'])
df_labeled = df_complete[labeledRowSelector].copy().reset_index(drop=True)
display(df_labeled.head(10))

# Display unlabeled dataset
print("Unlabeled dataset")
unlabeledRowSelector = df_complete['class'] == 'unknown'
df_unlabeled = df_complete[unlabeledRowSelector].copy().reset_index(drop=True)
display(df_unlabeled.head(10))

# Display edges dataset with renamed columns
print("Edges dataset")
df_edges = df_edges.rename(columns={df_edges.columns[0]: 'source', df_edges.columns[1]: 'destination'})
display(df_edges.head(10))
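
For training, the string labels in the labeled subset usually need to become a numeric target. The sketch below assumes the dataset's documented encoding, where class `'1'` is illicit and class `'2'` is licit. The column name `is_illicit` and the helper itself are illustrative.

```python
import pandas as pd

# Assumed encoding: '1' = illicit, '2' = licit.
LABEL_TO_TARGET = {"1": 1, "2": 0}

def add_binary_target(labeled: pd.DataFrame, class_col: str = "class") -> pd.DataFrame:
    """Return a copy with an integer 'is_illicit' column (1 = illicit, 0 = licit)."""
    out = labeled.copy()
    target = out[class_col].astype(str).map(LABEL_TO_TARGET)
    if target.isna().any():
        raise ValueError("unexpected class label; expected only '1' or '2'")
    out["is_illicit"] = target.astype(int)
    return out

# Synthetic labeled rows (not real data).
demo = pd.DataFrame({"txId": [10, 11, 12], "class": ["1", "2", "2"]})
demo_target = add_binary_target(demo)
print(demo_target)
```

Raising on an unexpected label guards against passing the unfiltered frame, which still contains `unknown` rows.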

**2. Compress and save the processed dataset for next steps**

In [None]:
processed_dir = Path("../processed") / dataset_dir
# Create the output folder, including ../processed if it does not exist yet
processed_dir.mkdir(parents=True, exist_ok=True)

# Save df_complete to HDF5 with Blosc compression
df_complete.to_hdf(processed_dir / "df_complete.h5", key="df_complete", 
                   complevel=5, complib="blosc", format="table")
print(f"df_complete saved: {df_complete.shape}")

# Save df_labeled to HDF5 with Blosc compression
df_labeled.to_hdf(processed_dir / "df_labeled.h5", key="df_labeled", 
                   complevel=5, complib="blosc", format="table")
print(f"df_labeled saved: {df_labeled.shape}")

# Save df_unlabeled to HDF5 with Blosc compression
df_unlabeled.to_hdf(processed_dir / "df_unlabeled.h5", key="df_unlabeled", 
                   complevel=5, complib="blosc", format="table")
print(f"df_unlabeled saved: {df_unlabeled.shape}")

# Save df_edges to HDF5 with Blosc compression
df_edges.to_hdf(processed_dir / "df_edges.h5", key="df_edges", 
                   complevel=5, complib="blosc", format="table")
print(f"df_edges saved: {df_edges.shape}")

#TODO: Save all data frames in azure blob storage

**3. Update external data warehouse**