# Selected Dataset: Elliptic Data Set

**Selection Date**: 2025-08-29  
**Selected from**: dataset-exploration-risk-scoring-research.md  
**Rank**: #1 out of 10 evaluated datasets  
**Suitability Score**: 92/100  

## Dataset Overview

### **Elliptic Data Set**
- **Source Platform**: Kaggle
- **Direct URL**: https://www.kaggle.com/datasets/ellipticco/elliptic-data-set
- **Dataset Size**: 203,769 transactions × 166 features, ~6GB
- **Problem Relevance**: High - Bitcoin illicit transaction classification
- **Data Quality**: Excellent - professionally curated by Elliptic Co.
- **License Type**: Open (with attribution requirements)
- **Last Updated**: 2019 (stable reference dataset)
- **Preprocessing Needs**: Minimal - ready for ML training

## Key Features and Structure

### Feature Categories
- **Local Features**: 94 transaction-specific features
- **Aggregate Features**: 72 neighborhood/graph-based features  
- **Total Features**: 166 feature dimensions
- **Temporal Component**: Time step information included
- **Labels**: Binary classification (illicit/licit); the classes file also marks many transactions as `unknown`
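
The features file is read without a header row, so its columns arrive as plain integers. A minimal sketch of attaching descriptive names is shown below. It assumes that column 0 is the transaction ID, that the time step is counted among the 94 local features (column 1), and that the 72 aggregate features come last. The names `txId`, `time_step`, `local_*`, and `agg_*` and the helper `name_feature_columns` are illustrative, not part of the dataset.

```python
import pandas as pd


def name_feature_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the raw features table with descriptive column names.

    Assumed layout (the CSV itself has no header to confirm it):
    column 0 = transaction ID, column 1 = time step,
    next 93 columns = local features, last 72 columns = aggregate features.
    """
    n_local, n_agg = 93, 72
    names = (
        ["txId", "time_step"]
        + [f"local_{i}" for i in range(1, n_local + 1)]
        + [f"agg_{i}" for i in range(1, n_agg + 1)]
    )
    if df.shape[1] != len(names):
        raise ValueError(f"expected {len(names)} columns, got {df.shape[1]}")
    out = df.copy()
    out.columns = names
    return out


# Tiny synthetic stand-in for the headerless CSV (2 rows, 167 columns).
demo = pd.DataFrame([[0] * 167, [1] * 167])
named = name_feature_columns(demo)
print(named.columns[:4].tolist(), named.columns[-1])
```

Raising on an unexpected column count keeps a silent misalignment from propagating if the file layout differs from the assumption.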

### Specific Features Include:
- Transaction fees
- Input/output volumes  
- Neighbor aggregates
- Time steps
- BTC amounts
- Graph topology metrics
- Wallet clustering information

### Importing libraries

Import the required packages (install `kaggle`, `pandas`, and `numpy` first if they are missing).

In [1]:
import kaggle
from pathlib import Path
import pandas as pd
import numpy as np

### Download the original dataset

Create the dataset directory and download the original data from source.

In [2]:
original_dir = Path("../original")
specific_dir = original_dir / "elliptic_bitcoin_dataset"
if specific_dir.exists() and any(specific_dir.iterdir()):
    print(f"Dataset already exists in {specific_dir}, skipping download.")
else:
    # parents=True also creates any missing parent directories
    original_dir.mkdir(parents=True, exist_ok=True)
    dataset_name = "ellipticco/elliptic-data-set"
    kaggle.api.dataset_download_files(
        dataset_name, 
        path=str(original_dir), 
        unzip=True
    )
    
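    # List what was extracted. Only regular files have their size measured,
    # so a directory entry is reported as 0.00 MB.
    downloaded_files = list(original_dir.glob("*"))
    for file_path in downloaded_files:
        size_mb = file_path.stat().st_size / (1024 * 1024) if file_path.is_file() else 0
        print(f"  - {file_path.name} ({size_mb:.2f} MB)")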

Dataset URL: https://www.kaggle.com/datasets/ellipticco/elliptic-data-set
  - elliptic_bitcoin_dataset (0.00 MB)
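
The listing above reports the extracted directory as 0.00 MB because sizes are measured only for files, so it does not confirm that the CSVs arrived. A small sketch of an explicit check follows. The file names are the ones this notebook loads later, while `EXPECTED_FILES` and `missing_files` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path

# File names as used in the loading step of this notebook.
EXPECTED_FILES = [
    "elliptic_txs_features.csv",
    "elliptic_txs_classes.csv",
    "elliptic_txs_edgelist.csv",
]


def missing_files(dataset_dir: Path, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are absent from dataset_dir."""
    return [name for name in expected if not (dataset_dir / name).is_file()]


dataset_dir = Path("../original") / "elliptic_bitcoin_dataset"
absent = missing_files(dataset_dir)
if absent:
    print(f"Missing from {dataset_dir}: {absent}")
else:
    print("All expected files are present.")
```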


## Dataset Visualization

Let's inspect the dataset to decide which preprocessing steps to apply:
1. Missing Values:


In [14]:

# Load each CSV if present; fall back to an empty DataFrame otherwise.
# The features file has no header row, hence header=None.
features_file = specific_dir / "elliptic_txs_features.csv"
if features_file.exists():
    df_features = pd.read_csv(features_file, header=None)
else:
    df_features = pd.DataFrame()

classes_file = specific_dir / "elliptic_txs_classes.csv"
if classes_file.exists():
    df_classes = pd.read_csv(classes_file)
else:
    df_classes = pd.DataFrame()

edges_file = specific_dir / "elliptic_txs_edgelist.csv"
if edges_file.exists():
    df_edges = pd.read_csv(edges_file)
else:
    df_edges = pd.DataFrame()

# Summary of all datasets
print("\n📊 Dataset Summary:")
# shape[1] counts every column, including the transaction ID in column 0
print(f"  - Features: {df_features.shape[0]:,} transactions × {df_features.shape[1]} features")
print(df_features.iloc[:, :4].head(10))
print(f"  - Classes: {df_classes.shape[0]:,} labeled transactions")
print(df_classes.head(10))
print(f"  - Edges: {df_edges.shape[0]:,} transaction relationships")
print(df_edges.head(10))



📊 Dataset Summary:
  - Features: 203,769 transactions × 167 features
           0  1         2         3
0  230425980  1 -0.171469 -0.184668
1    5530458  1 -0.171484 -0.184668
2  232022460  1 -0.172107 -0.184668
3  232438397  1  0.163054  1.963790
4  230460314  1  1.011523 -0.081127
5  230459870  1  0.961040 -0.081127
6  230333930  1 -0.171264 -0.184668
7  230595899  1 -0.171755 -0.184668
8  232013274  1 -0.123127 -0.184668
9  232029206  1 -0.005027  0.578941
  - Classes: 203,769 labeled transactions
        txId    class
0  230425980  unknown
1    5530458  unknown
2  232022460  unknown
3  232438397        2
4  230460314  unknown
5  230459870  unknown
6  230333930  unknown
7  230595899  unknown
8  232013274  unknown
9  232029206        2
  - Edges: 234,355 transaction relationships
       txId1      txId2
0  230425980    5530458
1  232022460  232438397
2  230460314  230459870
3  230333930  230595899
4  232013274  232029206
5  232344069   27553029
6   36411953  230405052
7   34194980 
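
The summary shows shapes and previews but does not yet answer the missing-values question. One way to check is sketched below on small synthetic frames standing in for `df_features` and `df_classes`. The helpers `missing_value_report` and `label_counts` are illustrative names. The sketch treats `unknown` as a label value to count, which is an assumption about how unlabeled transactions should be handled.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Count NaN cells per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


def label_counts(classes: pd.DataFrame, col: str = "class") -> pd.Series:
    """Count rows per label; labels are compared as strings."""
    return classes[col].astype(str).value_counts()


# Synthetic stand-ins for df_features and df_classes.
features_demo = pd.DataFrame(
    {0: [10, 11, 12], 1: [1, 1, 2], 2: [0.5, np.nan, -0.1]}
)
classes_demo = pd.DataFrame(
    {"txId": [10, 11, 12], "class": ["unknown", "2", "unknown"]}
)

print(missing_value_report(features_demo))
print(label_counts(classes_demo))
```

On the real data, the same calls would be `missing_value_report(df_features)` and `label_counts(df_classes)`. An empty report means no NaN cells, while the label counts show how many transactions carry no usable label.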

## Apply pre-processing transformation
Let's apply feature transformations to normalize the data:
 1. Feature Standard Scaling: ....
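
The details of this step are elided in the source, so the following is only a sketch of what standard scaling usually means: subtract each feature's mean and divide by its standard deviation. It assumes columns 0 and 1 (taken here to be the transaction ID and time step) should be left untouched. `standard_scale` and `skip_cols` are hypothetical names. The preview values above sit close to zero, so it may be worth checking column means and standard deviations first in case the features are already scaled.

```python
import numpy as np
import pandas as pd


def standard_scale(df: pd.DataFrame, skip_cols=(0, 1)) -> pd.DataFrame:
    """Z-score every column except those listed in skip_cols.

    skip_cols defaults to columns 0 and 1, assumed here to be the
    transaction ID and the time step.
    """
    out = df.copy()
    cols = [c for c in out.columns if c not in skip_cols]
    feats = out[cols].astype(float)
    mean = feats.mean()
    std = feats.std(ddof=0)      # population standard deviation
    std = std.replace(0, 1.0)    # constant columns: avoid division by zero
    out[cols] = (feats - mean) / std
    return out


# Synthetic stand-in: ID, time step, one varying and one constant feature.
demo = pd.DataFrame(
    {
        0: [101, 102, 103, 104],
        1: [1, 1, 2, 2],
        2: [1.0, 2.0, 3.0, 4.0],
        3: [5.0, 5.0, 5.0, 5.0],
    }
)
scaled = standard_scale(demo)
print(scaled)
```

For a train/test split, the mean and standard deviation would normally be computed on the training rows only and reused for the rest, to avoid leaking information from later time steps.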