# MVP Know Your Transaction (KYT) - Real-Time Transaction Risk Scoring Engine

## Project Overview

This notebook presents a comprehensive implementation of a Real-Time Transaction Risk Scoring Engine for Anti-Money Laundering (AML) compliance in cryptocurrency transactions. The project addresses the critical need for sub-second risk assessment of Bitcoin transactions by combining traditional AML indicators with blockchain-specific risk factors.

## Domain Context: Financial AML for Transactions

### Core Domain Definition
Anti-Money Laundering (AML) for transactions encompasses the comprehensive framework of laws, regulations, procedures, and technological solutions designed to prevent criminals from disguising illegally obtained funds as legitimate income through the global financial system. This domain includes detection, prevention, and reporting of money laundering, terrorist financing, tax evasion, market manipulation, and misuse of public funds.

### Market Context (2025)
- **Global Economic Impact**: An estimated $800 billion to $2 trillion is laundered annually (the commonly cited UNODC estimate of 2-5% of global GDP)
- **Compliance Market**: RegTech market projected to exceed $22 billion by mid-2025
- **AI Adoption**: 56% of financial institutions use AI/ML for AML, projected to reach 90% by 2025
- **Technology Evolution**: Sub-second transaction analysis is becoming standard, with reported false-positive reductions of around 45% from machine learning

### Key Domain Components
- **Regulatory Framework**: FATF recommendations, Five Pillars of AML Compliance
- **Transaction Monitoring**: AI-powered behavioral risk scoring, real-time processing
- **Advanced Analytics**: Graph analytics for network detection, entity-centric analysis
- **Data Integration**: Internal transaction records combined with external OSINT sources

## Problem Definition: Real-Time Transaction Risk Scoring Engine

### Problem Statement
Develop a system that assigns risk scores (0-100) to cryptocurrency transactions in real-time, integrating traditional AML indicators with blockchain-specific risk factors including wallet clustering, transaction graph analysis, and counterparty reputation scoring.

### Technical Requirements
- **Problem Type**: Classification with Regression scoring (dual-objective system)
- **Processing Speed**: Sub-second analysis for high-frequency transactions
- **Difficulty Level**: High - requires complex multi-dimensional data processing
- **Output Format**: Risk scores from 0-100 with binary classification (illicit/licit)

### Data Landscape
The system processes multiple data dimensions:
- Transaction metadata (amounts, timestamps, fees)
- Wallet addresses and clustering information
- Transaction graph relationships and network topology
- Counterparty databases and reputation scores
- Sanctions lists and regulatory databases
- Temporal patterns and behavioral baselines

### Recommended Architecture: Sequential Pipeline
Based on real-time processing requirements, the system implements a two-stage approach:
1. **Stage 1**: Fast binary classification for suspicious/non-suspicious determination
2. **Stage 2**: Detailed risk scoring (0-100) for transactions flagged as suspicious

This architecture keeps average latency low by running the more expensive regression model only on transactions whose Stage 1 probability exceeds the screening threshold.

## Dataset: Elliptic Data Set

### Dataset Overview
- **Source**: Kaggle (https://www.kaggle.com/datasets/ellipticco/elliptic-data-set)
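- **Scale**: 203,769 Bitcoin transactions (graph nodes) and 234,355 payment-flow edges, with 166 feature dimensions per transaction
- **Size**: Roughly 700 MB of professionally curated CSV data across three files (features, classes, edge list)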
- **Quality**: Industry-standard dataset used extensively in academic research
- **Suitability Score**: 92/100 for this specific use case

### Feature Categories
- **Local Features (94)**: Transaction-specific attributes including fees, input/output volumes, BTC amounts
- **Aggregate Features (72)**: Neighborhood and graph-based metrics derived from transaction relationships
- **Temporal Component**: Time step information for temporal pattern analysis
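- **Graph Topology**: Directed edge list of payment flows between transactions, usable for network connectivity metrics (the public release is anonymized and does not include wallet addresses, so wallet clustering would need an external source)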
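
As a rough sketch of how these categories map onto the raw file: the features CSV in the Kaggle release has no header row, with the transaction ID first, then the time step (counted as the first of the 94 local features), then the remaining local and aggregate features. The column names below (`txId`, `time_step`, `local_*`, `agg_*`) are illustrative choices, not names defined by the dataset.

```python
# Hypothetical column names for elliptic_txs_features.csv (no header row).
N_LOCAL = 94      # local features, counting the time step as the first one
N_AGGREGATE = 72  # neighborhood aggregates


def build_feature_columns():
    """Return illustrative names for the 167 columns (ID + 166 features)."""
    local = ["time_step"] + [f"local_{i}" for i in range(1, N_LOCAL)]
    aggregate = [f"agg_{i}" for i in range(1, N_AGGREGATE + 1)]
    return ["txId"] + local + aggregate


columns = build_feature_columns()
print(len(columns), columns[:3], columns[-1])
```

With pandas, a list like this could be passed as `pd.read_csv(path, header=None, names=columns)` so later cells can select the local or aggregate block by name prefix.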

### Label Distribution
- **Illicit Transactions**: ~2% (professionally verified criminal activity)
- **Licit Transactions**: ~21% (verified legitimate transactions)
- **Unknown Transactions**: ~77% (unlabeled data suitable for semi-supervised learning)
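
A minimal sketch of how this distribution could be checked, assuming the classes file uses the codes `1` for illicit, `2` for licit, and `unknown` for unlabeled transactions. It uses only the standard library to stay dependency-free, and the five-row sample is made up for illustration; the helper name `label_distribution` is hypothetical.

```python
import csv
import io
from collections import Counter

# Class codes assumed for elliptic_txs_classes.csv: "1" = illicit, "2" = licit.
CLASS_NAMES = {"1": "illicit", "2": "licit", "unknown": "unknown"}


def label_distribution(csv_file):
    """Return {label_name: fraction} from a file with txId,class columns."""
    counts = Counter(CLASS_NAMES[row["class"]] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {name: counts[name] / total for name in CLASS_NAMES.values()}


# Tiny made-up sample standing in for the real file.
sample = io.StringIO("txId,class\n1,unknown\n2,2\n3,unknown\n4,1\n5,unknown\n")
dist = label_distribution(sample)
print(dist)
```

For supervised training, only the rows mapped to `illicit` or `licit` carry usable targets; the `unknown` rows remain available for semi-supervised approaches.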

### Why This Dataset is Optimal
1. **Professional Curation**: Created by Elliptic, a blockchain analytics company
2. **Rich Feature Set**: 166 dimensions provide comprehensive basis for both classification and risk scoring
3. **Graph Integration**: Essential blockchain-specific features for transaction network analysis
4. **Real-time Applicability**: Preprocessed features enable sub-second inference
5. **Academic Validation**: Extensively tested and validated in research literature

## Implementation Strategy

### Sequential Pipeline Implementation
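```python
def score_transaction(transaction_features, binary_classifier, regression_model,
                      threshold=0.5):
    """Score one transaction; transaction_features has shape (1, n_features).

    threshold=0.5 is a placeholder default and should be tuned on validation data.
    """
    # Stage 1: fast binary classification.
    # predict_proba returns shape (n_samples, n_classes); take row 0 and
    # column 1, i.e. P(suspicious), assuming classes are ordered [0, 1].
    p_suspicious = binary_classifier.predict_proba(transaction_features)[0, 1]

    if p_suspicious > threshold:
        # Stage 2: detailed risk scoring
        risk_score = regression_model.predict(transaction_features)[0]
        return float(min(max(risk_score, 0), 100))  # Clamp to 0-100
    # Low base score for non-suspicious transactions
    return float(p_suspicious * 30)
```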

### Alternative Approaches Considered
- **Multi-task Learning**: Single neural network with dual outputs
- **Parallel Ensemble**: Simultaneous classification and regression branches
- **Hierarchical Risk Scoring**: Tiered classification with tier-specific regression models

### Feature Engineering for Risk Scoring
Risk scores will be derived from weighted combinations of:
- Transaction amount percentiles (30% weight)
- Neighbor risk aggregates (40% weight)
- Temporal anomaly scores (30% weight)
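
The weighting above can be sketched as a plain weighted sum. This assumes each component has already been normalized to the range [0, 1]; the component names and the `combine_risk` helper are hypothetical, and only the 30/40/30 weights come from the list above.

```python
# Weights from the list above; component scores are assumed to be in [0, 1].
WEIGHTS = {"amount_percentile": 0.30, "neighbor_risk": 0.40, "temporal_anomaly": 0.30}


def combine_risk(components):
    """Weighted sum of [0, 1] component scores, scaled to 0-100."""
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


score = combine_risk(
    {"amount_percentile": 0.9, "neighbor_risk": 0.5, "temporal_anomaly": 0.2}
)
print(round(score, 1))  # 0.3*0.9 + 0.4*0.5 + 0.3*0.2 = 0.53 -> 53.0
```

Because the weights sum to 1, the result stays within 0-100 without extra clamping, and each component's contribution to a given score can be read off directly.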

## Expected Outcomes

### Performance Targets
- **Processing Latency**: Sub-second transaction analysis
- **False Positive Rate**: < 10% for mature system implementation
- **Detection Accuracy**: > 95% for known money laundering typologies
- **Risk Score Precision**: Continuous 0-100 scoring with interpretable thresholds
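
Since illicit transactions are a small minority, plain accuracy can look strong even when nothing is detected, so it may help to track false positive rate and recall separately. The sketch below uses made-up labels and a hypothetical `screening_metrics` helper to illustrate the point; it is not the notebook's evaluation code.

```python
def screening_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = illicit, 0 = licit)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 2 illicit out of 100, and a model that never flags anything.
y_true = [1] * 2 + [0] * 98
never_flag = [0] * 100
metrics = screening_metrics(y_true, never_flag)
print(metrics)
```

Here the never-flag model reaches 98% accuracy and a 0% false positive rate while catching no illicit transactions, which is why the detection target is better read alongside recall on the illicit class.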

### Business Value
- **Compliance Enhancement**: Automated SAR generation and regulatory reporting
- **Operational Efficiency**: 50%+ reduction in manual investigation time
- **Risk Mitigation**: Proactive threat prevention vs. reactive response
- **Customer Experience**: Minimal friction for legitimate transactions

This notebook serves as the primary entry point for the MVP KYT implementation, providing both technical implementation and business context for real-time cryptocurrency transaction risk assessment.

## Import Libraries

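The cell below imports the libraries used for data handling, visualization, machine learning, and graph analysis. Optional packages (XGBoost, LightGBM, TensorFlow, NetworkX, and the Kaggle API) are wrapped in `try`/`except` blocks so the notebook still runs when they are not installed.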

```python
# Data manipulation and analysis
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Machine learning - Scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, RobustScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.svm import SVC, SVR
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, roc_curve,
    precision_recall_curve, f1_score, accuracy_score, precision_score, recall_score,
    mean_squared_error, mean_absolute_error, r2_score
)

# Advanced ML libraries
try:
    import xgboost as xgb
    print("✅ XGBoost available")
except ImportError:
    print("⚠️  XGBoost not available - install with: pip install xgboost")

try:
    import lightgbm as lgb
    print("✅ LightGBM available")
except ImportError:
    print("⚠️  LightGBM not available - install with: pip install lightgbm")

# Deep learning (optional)
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import Sequential, Model
    from tensorflow.keras.layers import Dense, Dropout, Input, concatenate
    from tensorflow.keras.optimizers import Adam
    print("✅ TensorFlow available")
except ImportError:
    print("⚠️  TensorFlow not available - install with: pip install tensorflow")

# Graph analysis (for blockchain network analysis)
try:
    import networkx as nx
    print("✅ NetworkX available for graph analysis")
except ImportError:
    print("⚠️  NetworkX not available - install with: pip install networkx")

# Data source integration
try:
    import kaggle
    print("✅ Kaggle API available")
except ImportError:
    print("⚠️  Kaggle API not available - install with: pip install kaggle")

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency, pearsonr, spearmanr

# Utility libraries
import json
import pickle
import joblib
from datetime import datetime, timedelta
import time
import os
import sys

# Configuration
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("\n" + "="*80)
print("🚀 MVP KYT - Real-Time Transaction Risk Scoring Engine")
# Optional packages may be missing at this point; see the warnings printed above.
print("📊 Library imports complete (check warnings above for missing optional packages)")
print("🔒 AML/KYT analysis environment ready")
print("="*80)
```