# Selected Dataset: Elliptic Data Set

**Selection Date**: 2025-08-29  
**Selected from**: dataset-exploration-risk-scoring-research.md  
**Rank**: #1 out of 10 evaluated datasets  
**Suitability Score**: 92/100  

## Dataset Overview

### **Elliptic Data Set**
- **Source Platform**: Kaggle
- **Direct URL**: https://www.kaggle.com/datasets/ellipticco/elliptic-data-set
- **Dataset Size**: 200,000 transactions × 166 features, ~6GB
- **Problem Relevance**: High - Bitcoin illicit transaction classification
- **Data Quality**: Excellent - professionally curated by Elliptic Co.
- **License Type**: Open (with attribution requirements)
- **Last Updated**: 2019 (stable reference dataset)
- **Preprocessing Needs**: Minimal - ready for ML training

## Key Features and Structure

### Feature Categories
- **Local Features**: 94 transaction-specific features
- **Aggregate Features**: 72 neighborhood/graph-based features  
- **Total Features**: 166 feature dimensions
- **Temporal Component**: Time step information included
- **Labels**: Binary classification target (illicit/licit); transactions without a known label are marked `unknown`
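The feature breakdown above can be made concrete by naming the columns of the header-less features file. The sketch below assumes a column order of transaction id, time step, 93 further local features, then 72 aggregate features (167 columns in total, with the time step counted among the 94 local features). The names `txId`, `time_step`, `local_*`, and `agg_*` are illustrative only; the dataset itself ships anonymized, unnamed features.

```python
import pandas as pd

N_LOCAL = 93      # local features excluding the time step (assumed layout)
N_AGGREGATE = 72  # neighborhood aggregate features


def elliptic_column_names() -> list[str]:
    """Build readable names for the header-less features file (assumed column order)."""
    names = ["txId", "time_step"]
    names += [f"local_{i}" for i in range(1, N_LOCAL + 1)]
    names += [f"agg_{i}" for i in range(1, N_AGGREGATE + 1)]
    return names


columns = elliptic_column_names()

# Tiny stand-in frame with the same width as the real file (values are placeholders)
demo = pd.DataFrame([[0] * len(columns)], columns=columns)

# Label-based column slices are inclusive on both ends
local_block = demo.loc[:, "time_step":"local_93"]
aggregate_block = demo.loc[:, "agg_1":"agg_72"]
print(len(columns), local_block.shape[1], aggregate_block.shape[1])
```

If the assumed order holds, the same list could be passed as `names=columns` to `pd.read_csv(..., header=None)` so that later cells can select feature groups by name instead of by position.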

### Specific Features Include:
- Transaction fees
- Input/output volumes  
- Neighbor aggregates
- Time steps
- BTC amounts
- Graph topology metrics
- Wallet clustering information

## Importing libraries

Import the packages required to download and load the dataset. The `kaggle` package must already be installed.

In [6]:
import kaggle
from pathlib import Path
import pandas as pd
import numpy as np

## Download the original dataset

In [11]:
# Create directory structure for original datasets
original_dir = Path("../original")
original_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

# Dataset information
dataset_name = "ellipticco/elliptic-data-set"
print(f"Downloading dataset: {dataset_name}")

# Download the dataset to the original folder
try:
    # Download and extract the dataset
    kaggle.api.dataset_download_files(
        dataset_name, 
        path=str(original_dir), 
        unzip=True
    )
    print(f"✅ Dataset successfully downloaded to: {original_dir.absolute()}")
    
    # List downloaded files
    downloaded_files = list(original_dir.glob("*"))
    print(f"\nDownloaded files ({len(downloaded_files)}):")
    for file_path in downloaded_files:
        size_mb = file_path.stat().st_size / (1024 * 1024) if file_path.is_file() else 0
        print(f"  - {file_path.name} ({size_mb:.2f} MB)")
        
except Exception as e:
    print(f"❌ Error downloading dataset: {e}")
    print("\nTroubleshooting tips:")
    print("1. Ensure Kaggle API is configured: ~/.kaggle/kaggle.json")
    print("2. Accept the dataset terms on Kaggle website")
    print("3. Verify dataset name: ellipticco/elliptic-data-set")

Downloading dataset: ellipticco/elliptic-data-set


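Because the download depends on Kaggle API credentials, it can help to check for them before starting a long transfer. The following is a minimal sketch, not part of the original notebook: the helper name `kaggle_credentials_available` is hypothetical, and it only checks the two common setups (a `~/.kaggle/kaggle.json` file, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables). Other configurations, such as a custom config directory, are not covered.

```python
import os
from pathlib import Path


def kaggle_credentials_available(home=None, env=None) -> bool:
    """Return True if kaggle.json or the KAGGLE_USERNAME/KAGGLE_KEY variables are present.

    `home` and `env` are injectable so the check can be exercised without
    touching the real home directory or process environment.
    """
    home = Path.home() if home is None else Path(home)
    env = os.environ if env is None else env
    has_file = (home / ".kaggle" / "kaggle.json").is_file()
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    return has_file or has_env


if not kaggle_credentials_available():
    print("Kaggle credentials not found; configure them before downloading.")
```

Running this check first separates a missing-credentials problem from a slow or interrupted download.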

## Load and display the data

In [None]:
# Load and explore the Elliptic dataset files
original_dir = Path("../original/elliptic_bitcoin_dataset")

print("Loading Elliptic dataset files...\n")

# Load the main dataset files (based on typical Elliptic dataset structure)
try:
    # Load transaction features (elliptic_txs_features.csv)
    features_file = original_dir / "elliptic_txs_features.csv"
    if features_file.exists():
        df_features = pd.read_csv(features_file, header=None)
        print(f"✅ Features dataset loaded: {df_features.shape}")
        print("First 10 rows of transaction features:")
        print(df_features.head(10))
        print("-" * 80)
    
    # Load transaction classes/labels (elliptic_txs_classes.csv)
    classes_file = original_dir / "elliptic_txs_classes.csv"
    if classes_file.exists():
        df_classes = pd.read_csv(classes_file)
        print(f"\n✅ Classes dataset loaded: {df_classes.shape}")
        print("First 10 rows of transaction classes:")
        print(df_classes.head(10))
        print("-" * 80)
    
    # Load transaction edges/relationships (elliptic_txs_edgelist.csv)
    edges_file = original_dir / "elliptic_txs_edgelist.csv"
    if edges_file.exists():
        df_edges = pd.read_csv(edges_file)
        print(f"\n✅ Edges dataset loaded: {df_edges.shape}")
        print("First 10 rows of transaction edges:")
        print(df_edges.head(10))
        print("-" * 80)
    
    # Summary of all datasets
    print("\n📊 Dataset Summary:")
    if 'df_features' in locals():
        # The features file also holds the txId column, so report columns rather than features
        print(f"  - Features: {df_features.shape[0]:,} transactions × {df_features.shape[1]} columns (txId + features)")
    if 'df_classes' in locals():
        # The classes file lists every transaction, including those marked "unknown"
        print(f"  - Classes: {df_classes.shape[0]:,} transactions (labeled and unknown)")
    if 'df_edges' in locals():
        print(f"  - Edges: {df_edges.shape[0]:,} transaction relationships")
    
    # Check class distribution if classes are available
    if 'df_classes' in locals():
        print("\n🏷️ Class Distribution:")
        class_counts = df_classes.iloc[:, 1].value_counts() if df_classes.shape[1] > 1 else df_classes.iloc[:, 0].value_counts()
        for class_label, count in class_counts.items():
            print(f"  - {class_label}: {count:,} transactions ({count/len(df_classes)*100:.1f}%)")

except Exception as e:
    print(f"❌ Error loading datasets: {e}")
    print("\nAvailable files in original directory:")
    for file_path in original_dir.glob("*"):
        if file_path.is_file():
            print(f"  - {file_path.name}")
    print("\nTrying to load any CSV files found...")
    
    # Fallback: load any CSV files found
    csv_files = list(original_dir.glob("*.csv"))
    for i, csv_file in enumerate(csv_files):
        try:
            df_temp = pd.read_csv(csv_file)
            print(f"\n📁 File {i+1}: {csv_file.name}")
            print(f"Shape: {df_temp.shape}")
            print("First 10 rows:")
            print(df_temp.head(10))
            print("-" * 80)
        except Exception as file_error:
            print(f"❌ Could not load {csv_file.name}: {file_error}")

Loading Elliptic dataset files...

         0    1         2         3         4          5         6    \
0  230425980    1 -0.171469 -0.184668 -1.201369  -0.121970 -0.043875   
1    5530458    1 -0.171484 -0.184668 -1.201369  -0.121970 -0.043875   
2  232022460    1 -0.172107 -0.184668 -1.201369  -0.121970 -0.043875   
3  232438397    1  0.163054  1.963790 -0.646376  12.409294 -0.063725   
4  230460314    1  1.011523 -0.081127 -1.201369   1.153668  0.333276   
5  230459870    1  0.961040 -0.081127 -1.201369   1.303743  0.333276   
6  230333930    1 -0.171264 -0.184668 -1.201369  -0.121970 -0.043875   
7  230595899    1 -0.171755 -0.184668 -1.201369  -0.046932 -0.043875   
8  232013274    1 -0.123127 -0.184668 -1.201369  -0.121970 -0.043875   
9  232029206    1 -0.005027  0.578941 -0.091383   4.380281 -0.063725   

        7          8         9    ...       157       158       159       160  \
0 -0.113002  -0.061584 -0.162097  ... -0.562153 -0.600999  1.461330  1.461369   
1 -0.11300

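Once the features and classes files are loaded, a natural next step is to join them on the transaction id and give the class codes readable names. The sketch below is illustrative and rests on several assumptions: that the classes file has columns named `txId` and `class`, that the raw class values are `"1"` (illicit), `"2"` (licit), and `"unknown"`, and that column 0 of the header-less features frame is the transaction id. The helper `attach_labels` is hypothetical, and the rows are synthetic stand-ins rather than real Elliptic data.

```python
import pandas as pd

# Assumed raw values in the `class` column: "1" = illicit, "2" = licit, "unknown" = unlabeled
CLASS_NAMES = {"1": "illicit", "2": "licit", "unknown": "unknown"}


def attach_labels(features: pd.DataFrame, classes: pd.DataFrame) -> pd.DataFrame:
    """Left-join readable class labels onto the features frame by transaction id."""
    # Columns 0 and 1 of the header-less frame are assumed to be the id and time step
    features = features.rename(columns={0: "txId", 1: "time_step"})
    labels = classes.assign(label=classes["class"].astype(str).map(CLASS_NAMES))
    return features.merge(labels[["txId", "label"]], on="txId", how="left")


# Synthetic stand-ins, not real Elliptic rows
features = pd.DataFrame([[101, 1, 0.5], [102, 1, -0.2], [103, 2, 1.1]])
classes = pd.DataFrame({"txId": [101, 102, 103], "class": ["1", "2", "unknown"]})

merged = attach_labels(features, classes)

# Supervised training would typically use only the rows with a known label
labeled_only = merged[merged["label"] != "unknown"]
print(merged["label"].value_counts().to_dict())
print(len(labeled_only))
```

A left join keeps every transaction from the features frame, so any id missing from the classes file shows up as a missing label instead of being dropped silently.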