## MVP Know Your Transaction (KYT) - Real-Time Transaction Risk Scoring Engine

### Project Overview

This notebook presents a comprehensive implementation of a Real-Time Transaction Risk Scoring Engine for Anti-Money Laundering (AML) compliance in cryptocurrency transactions. The project addresses the critical need for sub-second risk assessment of Bitcoin transactions by combining traditional AML indicators with blockchain-specific risk factors.

### Domain Context: Financial AML for Transactions

#### Core Domain Definition
Anti-Money Laundering (AML) for transactions encompasses the comprehensive framework of laws, regulations, procedures, and technological solutions designed to prevent criminals from disguising illegally obtained funds as legitimate income through the global financial system. This domain includes detection, prevention, and reporting of money laundering, terrorist financing, tax evasion, market manipulation, and misuse of public funds.

### Problem Definition: Real-Time Transaction Risk Classification Engine

#### Problem Statement
Develop a system that assigns risk classifications to cryptocurrency transactions in real-time, integrating traditional AML indicators with blockchain-specific risk factors including wallet clustering, transaction graph analysis, and counterparty reputation scoring.

#### Technical Requirements
- **Problem Type**: Classification 
- **Processing Speed**: Sub-second analysis for high-frequency transactions
- **Difficulty Level**: High - requires complex multi-dimensional data processing
- **Output Format**: Binary risk classification (illicit/licit)

#### Data Landscape
The system processes multiple data dimensions:
- Transaction metadata (amounts, timestamps, fees)
- Wallet addresses and clustering information
- Transaction graph relationships and network topology
- Counterparty databases and reputation scores
- Sanctions lists and regulatory databases
- Temporal patterns and behavioral baselines

### References

This notebook implementation is based on the comprehensive research and analysis conducted during the project development phase. The following reference documents were used in the composition of this initial description:

- **Domain Research**: [current-domain.md](domains/current-domain.md) - Contains detailed market analysis, regulatory framework research, and commercial viability assessment for the Financial AML domain
- **Problem Analysis**: [current-problem.md](problems/current-problem.md) - Provides comprehensive problem refinement, technical requirements analysis, and solution approach evaluation
- **Dataset Evaluation**: [current-dataset.md](datasets/current-dataset.md) - Documents dataset selection criteria, suitability scoring, and detailed feature analysis for the Elliptic dataset
- **Dataset Analysis & Preprocessing**: [dataset-analysis-and-preprocessing.ipynb](datasets/scripts/dataset-analysis-and-preprocessing.ipynb) - Comprehensive Jupyter notebook containing Elliptic dataset download, exploratory data analysis, feature engineering, preprocessing pipeline, and ML preparation steps

These reference documents contain the foundational research that informed the technical approach, feature engineering strategy, and implementation decisions reflected in this notebook.

---

This notebook serves as the primary entry point for the MVP KYT implementation, providing both technical implementation and business context for real-time cryptocurrency transaction risk assessment.

### Import Libraries

Imports for data handling, model selection, and the scikit-learn classifiers used in the machine learning workflow.

In [None]:
import os
import joblib
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

### Loading pre-processed datasets

The pre-processing step reduced the dimensionality from 166 features to only 46.

In [15]:

# The complete dataset already pre-processed 
df_complete = pd.read_hdf("./datasets/processed/df_complete.h5", key="df_complete")
print(f"Loaded from HDF5: {df_complete.shape} - All subsequent operations will use compressed data")

# The filtered labeled dataset already pre-processed
df_labeled = pd.read_hdf("./datasets/processed/df_labeled.h5", key="df_labeled")
print(f"Loaded from HDF5: {df_labeled.shape} - All subsequent operations will use compressed data")

# The filtered unlabeled dataset already pre-processed
df_unlabeled = pd.read_hdf("./datasets/processed/df_unlabeled.h5", key="df_unlabeled")
print(f"Loaded from HDF5: {df_unlabeled.shape} - All subsequent operations will use compressed data")

# The edges dataset that maps relationships between transaction nodes
df_edges = pd.read_hdf("./datasets/processed/df_edges.h5", key="df_edges")
print(f"Loaded from HDF5: {df_edges.shape} - All subsequent operations will use compressed data")


# Summary of all datasets
print("\n📊 Dataset Summary:")
# Subtract the two non-feature columns ('txId' and 'class') from the column count
print(f"  - Features: {df_complete.shape[0]:,} transactions × {df_complete.shape[1] - 2} features")
print(f"  - Labeled: {df_labeled.shape[0]:,} transactions")
print(f"  - Unlabeled: {df_unlabeled.shape[0]:,} transactions")
print(f"  - Edges: {df_edges.shape[0]:,} transaction relationships")

Loaded from HDF5: (203769, 48) - All subsequent operations will use compressed data
Loaded from HDF5: (46564, 48) - All subsequent operations will use compressed data
Loaded from HDF5: (157205, 48) - All subsequent operations will use compressed data
Loaded from HDF5: (234355, 2) - All subsequent operations will use compressed data

📊 Dataset Summary:
  - Features: 203,769 transactions × 46 features
  - Labeled: 46,564 transactions
  - Unlabeled: 157,205 transactions
  - Edges: 234,355 transaction relationships
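
The shapes printed above allow a quick consistency check: the labeled and unlabeled subsets should add up to the complete dataset. The sketch below hard-codes the shapes from the output instead of reloading the HDF5 files, and the helper name `check_partition` is hypothetical. Matching row counts are a necessary condition for a clean partition, not proof of one.

```python
# Shapes copied from the load output above (rows, columns).
shape_complete = (203769, 48)
shape_labeled = (46564, 48)
shape_unlabeled = (157205, 48)

def check_partition(complete, labeled, unlabeled):
    """Return True when labeled + unlabeled rows add up to the complete set
    and all three frames share the same number of columns."""
    same_columns = complete[1] == labeled[1] == unlabeled[1]
    rows_add_up = labeled[0] + unlabeled[0] == complete[0]
    return same_columns and rows_add_up

partition_ok = check_partition(shape_complete, shape_labeled, shape_unlabeled)
labeled_share = shape_labeled[0] / shape_complete[0]
print(f"Partition consistent: {partition_ok}")
print(f"Labeled share of all transactions: {labeled_share:.1%}")
```

Less than a quarter of the transactions carry a label, which is the motivation for the unsupervised approaches listed in the strategy section.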


### Machine Learning Strategy

Let's try three machine learning approaches on the data and compare them; each approach will produce its own winning model.

1. Train models with the supervised approach, using only the labeled sub-dataset;
2. Train models with the unsupervised approach, using the complete dataset and ignoring the edge dataset; **[FUTURE DEVELOPMENT]**
3. Train models with the unsupervised approach, using the complete dataset and taking into account the relationships between transactions through the edge dataset. **[FUTURE DEVELOPMENT]**
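
As a rough illustration of what "taking the relationships into account" could mean for the third approach, the sketch below derives in-degree and out-degree features from a toy edge list. The column names `txId1` and `txId2` are an assumption about the layout of `df_edges`, and degree counts are only one of many possible graph features.

```python
import pandas as pd

# Toy edge list; the column names 'txId1' -> 'txId2' are an assumption about df_edges.
edges_demo = pd.DataFrame({'txId1': [1, 1, 2, 3], 'txId2': [2, 3, 3, 4]})

# Count how often each transaction appears as a source and as a target.
out_degree = edges_demo['txId1'].value_counts().rename('out_degree')
in_degree = edges_demo['txId2'].value_counts().rename('in_degree')

# One row per transaction that appears in the edge list; missing counts become 0.
degree_features = (
    pd.concat([out_degree, in_degree], axis=1)
    .fillna(0)
    .astype(int)
    .rename_axis('txId')
    .sort_index()
)
print(degree_features)
```

A table like `degree_features` could later be joined to the transaction features on the transaction identifier.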

#### 1. Train models using the supervised methodology

Let's prepare the dataset for training and validation.

[Description]
[Whys]
[What]
[When]

In [16]:
# Defining some parameters
seed = 7 # random seed, reused for NumPy and the train/test split
np.random.seed(seed)
test_size = 0.20 # test set size

# Prepare data (df_labeled already loaded: 46,564 × 48)
X = df_labeled.drop(['class', 'txId'], axis=1)  # 46 features
y = df_labeled['class']  # Binary target
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=test_size, shuffle=True, random_state=seed, stratify=y) # stratified holdout

# Cross-validation setup
n_splits = 10 # PARAMETER: number of folds
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
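
To see what `stratify=y` and `StratifiedKFold` are for, the sketch below checks that the positive-class rate stays roughly constant across the train split, the test split, and every validation fold. It runs on a synthetic, imbalanced dataset from `make_classification`; the 10% positive rate is an assumption for illustration, not the class balance of `df_labeled`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for df_labeled (assumption: ~10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=7
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, shuffle=True, random_state=7, stratify=y_demo
)

# Positive-class rate in the full set, the training split and the test split.
rate_all, rate_tr, rate_te = y_demo.mean(), y_tr.mean(), y_te.mean()
print(f"positive rate: all={rate_all:.3f} train={rate_tr:.3f} test={rate_te:.3f}")

# Each validation fold of StratifiedKFold keeps roughly the same rate as well.
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
fold_rates = [y_tr[val_idx].mean() for _, val_idx in cv_demo.split(X_tr, y_tr)]
print(f"fold positive rates: min={min(fold_rates):.3f} max={max(fold_rates):.3f}")
```

Without stratification, a rare class can end up under-represented in a fold, which makes per-fold scores noisier.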

Let's define which models to use

[Description]
[Whys]
[What]
[When]

In [17]:
# Defining the individual models
reg = ('LR', LogisticRegression(max_iter=200))
knn = ('KNN', KNeighborsClassifier())
cart = ('CART', DecisionTreeClassifier())
naive = ('NB', GaussianNB())
svm = ('SVM', SVC())

models = []
models.append(reg)
models.append(knn)
models.append(cart)
models.append(naive)
models.append(svm)

# Defining ensemble models
bagging = ('Bag', BaggingClassifier())
forest = ('RF', RandomForestClassifier())
extra = ('ET', ExtraTreesClassifier())
ada = ('Ada', AdaBoostClassifier())
gradient = ('GB', GradientBoostingClassifier())
voting = ('Voting', VotingClassifier(models))

# Creating the pipelines
pipelines = []

# Defining the pipelines, for future experimentation 
pipelines.append(('LR', Pipeline([reg]))) 
#pipelines.append(('KNN', Pipeline([knn])))
#pipelines.append(('CART', Pipeline([cart])))
#pipelines.append(('NB', Pipeline([naive])))
#pipelines.append(('SVM', Pipeline([svm])))
#pipelines.append(('Bag', Pipeline([bagging])))
#pipelines.append(('RF', Pipeline([forest])))
#pipelines.append(('ET', Pipeline([extra])))
#pipelines.append(('Ada', Pipeline([ada])))
#pipelines.append(('GB', Pipeline([gradient])))
#pipelines.append(('Vot', Pipeline([voting])))
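
The step name given to each pipeline matters later: scikit-learn exposes the hyperparameters of a step as `<step name>__<parameter>`, which is the key format a parameter search expects. A minimal sketch, using a separate `pipe_demo` object so nothing in the notebook state is modified:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same construction as in the notebook: a (name, estimator) tuple becomes a one-step pipeline.
reg_demo = ('LR', LogisticRegression(max_iter=200))
pipe_demo = Pipeline([reg_demo])

# Hyperparameters of a step are exposed as '<step name>__<parameter>'.
lr_keys = sorted(k for k in pipe_demo.get_params() if k.startswith('LR__'))
print(lr_keys[:5])

# set_params accepts the same keys that a parameter search would use.
pipe_demo.set_params(LR__C=0.5)
print(pipe_demo.named_steps['LR'].C)
```

If a step were renamed (say from `'LR'` to `'logreg'`), every matching key in the parameter distributions would have to be renamed as well.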

Let's prepare the parameter distributions for a randomized search

[Description]
[Whys]
[What]
[When]

In [18]:
# Define parameter distributions for RandomizedSearchCV
param_distributions = {
    'LR': {
        'LR__C': uniform(0.01, 100),                    # Regularization strength (inverse)
        'LR__solver': ['liblinear', 'saga'],            # Optimization algorithm (both support l1 and l2)
        'LR__penalty': ['l1', 'l2'],                    # Regularization type ('none' dropped: liblinear does not support it)
        'LR__max_iter': [200, 500, 1000]               # Maximum iterations for convergence
    },
    'KNN': {
        'KNN__n_neighbors': randint(3, 21),             # Number of nearest neighbors to consider
        'KNN__weights': ['uniform', 'distance'],        # Weight function for neighbors
        'KNN__metric': ['euclidean', 'manhattan', 'minkowski'],  # Distance metric
        'KNN__p': randint(1, 3)                         # Power parameter for minkowski metric
    },
    'CART': {
        'CART__max_depth': randint(3, 20),               # Maximum tree depth (prevent overfitting)
        'CART__min_samples_split': randint(2, 20),       # Min samples required to split node
        'CART__min_samples_leaf': randint(1, 10),        # Min samples required at leaf node
        'CART__criterion': ['gini', 'entropy']           # Split quality measure
    },
    'NB': {
        'NB__var_smoothing': uniform(1e-12, 1e-6),      # Portion of the largest feature variance added to all variances (numerical stability)
        'NB__priors': [None]                             # Class probabilities (let model learn from data)
    },
    'SVM': {
        'SVM__C': uniform(0.1, 100),                    # Regularization parameter (margin vs errors)
        'SVM__kernel': ['rbf', 'poly', 'sigmoid'],      # Kernel function type
        'SVM__gamma': ['scale', 'auto']                 # Kernel coefficient influence
    },
    'RF': {
        'RF__n_estimators': randint(50, 300),           # Number of trees in forest
        'RF__max_depth': randint(3, 20),                # Maximum depth of each tree
        'RF__min_samples_split': randint(2, 20),        # Min samples to split internal node
        'RF__min_samples_leaf': randint(1, 10),         # Min samples at leaf node
        'RF__max_features': ['sqrt', 'log2', None],     # Features to consider for best split
        'RF__bootstrap': [True, False]                  # Whether to use bootstrap samples
    },
    'ET': {
        'ET__n_estimators': randint(50, 300),           # Number of trees in ensemble
        'ET__max_depth': randint(3, 20),                # Maximum depth of trees
        'ET__min_samples_split': randint(2, 20),        # Min samples to split node
        'ET__min_samples_leaf': randint(1, 10),         # Min samples at leaf
        'ET__max_features': ['sqrt', 'log2', None],     # Random feature subset size
        'ET__bootstrap': [True, False]                  # Bootstrap sampling toggle
    },
    'GB': {
        'GB__n_estimators': randint(50, 200),           # Number of boosting stages
        'GB__learning_rate': uniform(0.01, 0.3),        # Step size shrinkage (prevent overfitting)
        'GB__max_depth': randint(3, 10),                # Maximum depth of individual estimators
        'GB__subsample': uniform(0.5, 0.5)              # Fraction of samples for fitting trees (0.5-1.0)
    },
    'Ada': {
        'Ada__n_estimators': randint(50, 200),          # Maximum number of weak learners
        'Ada__learning_rate': uniform(0.1, 0.9),        # Weight applied to each classifier (0.1-1.0)
        'Ada__algorithm': ['SAMME', 'SAMME.R']          # Boosting algorithm variant ('SAMME.R' is deprecated in recent scikit-learn releases)
    },
    'Bag': {
        'Bag__n_estimators': randint(50, 200),          # Number of base estimators
        'Bag__max_samples': uniform(0.5, 0.5),          # Fraction of samples for each estimator
        'Bag__max_features': uniform(0.5, 0.5)          # Fraction of features for each estimator
    }
}
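
The ranges in these distributions are easy to misread, because `scipy.stats.uniform(loc, scale)` spans `[loc, loc + scale]` and `scipy.stats.randint(low, high)` excludes `high`. The sketch below inspects two of the distributions used above; the variable names are illustrative.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
c_dist = uniform(0.01, 100)
c_low, c_high = c_dist.support()

# randint(low, high) draws integers from low up to high - 1.
k_dist = randint(3, 21)
k_low, k_high = k_dist.support()

# Draw a few reproducible samples to confirm the ranges empirically.
c_samples = c_dist.rvs(size=1000, random_state=7)
k_samples = k_dist.rvs(size=1000, random_state=7)

print(f"uniform(0.01, 100) support: [{c_low}, {c_high}]")
print(f"randint(3, 21) support: [{k_low}, {k_high}]")
print(f"sampled C range: [{c_samples.min():.2f}, {c_samples.max():.2f}]")
print(f"sampled k range: [{k_samples.min()}, {k_samples.max()}]")
```

This is why `uniform(0.5, 0.5)` is the right way to express a fraction between 0.5 and 1.0.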

Let's run the training phase using a randomized search with cross-validation and rank the best models by accuracy.

[Description]
[Whys]
[What]
[When]

In [None]:
# Training parameters
n_iter = 50  # PARAMETER: Number of parameter combinations to try
scoring = 'accuracy' # PARAMETER: Scoring metric
n_jobs = -1  # PARAMETER: Use all available cores
verbosity = 0  # PARAMETER: Verbosity level

optimized_models = []
results = []
names = []

print("🔍 Training Models with RandomizedSearchCV Optimization...")
print(f"Training {len(pipelines)} models: Basic + Ensemble methods")
print("-" * 60)

for name, model in pipelines:
    print(f"Training {name}...", end=" ")
    # Use RandomizedSearchCV for models that have parameter distributions defined
    if name in param_distributions and param_distributions[name]:
        random_search = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_distributions[name],
            n_iter=n_iter,
            cv=cv,
            verbose=verbosity,
            scoring=scoring,
            random_state=seed,
            n_jobs=n_jobs
        )
        random_search.fit(X_train, y_train)
        best_model = random_search.best_estimator_
        best_score = random_search.best_score_
        std_score = random_search.cv_results_['std_test_score'][random_search.best_index_]
        
        print(f"✅ {best_score:.4f} (±{std_score:.4f})")
    else:
        cv_results = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
        best_model = model
        best_score = cv_results.mean()
        std_score = cv_results.std()
        
        print(f"✅ {best_score:.4f} (±{std_score:.4f})")
    
    optimized_models.append((name, best_model))
    results.append(best_score)
    names.append(name)
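
The loop above can be tried in miniature. The sketch below runs `RandomizedSearchCV` on a small synthetic dataset and reads the winning candidate's mean and standard deviation from `cv_results_`, as the notebook does. As an assumption beyond the original setup, it also tracks F1 alongside accuracy, since accuracy alone can look high on an imbalanced label; the winner is still chosen by accuracy.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the labeled training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=7
)

pipe_demo = Pipeline([('LR', LogisticRegression(max_iter=500))])
search_demo = RandomizedSearchCV(
    estimator=pipe_demo,
    param_distributions={'LR__C': uniform(0.01, 10)},
    n_iter=5,                      # kept small so the sketch runs quickly
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=7),
    scoring=['accuracy', 'f1'],    # track both metrics for every candidate
    refit='accuracy',              # pick the winner by accuracy, as in the notebook
    random_state=7,
)
search_demo.fit(X_demo, y_demo)

best = search_demo.best_index_
mean_acc = search_demo.cv_results_['mean_test_accuracy'][best]
std_acc = search_demo.cv_results_['std_test_accuracy'][best]
mean_f1 = search_demo.cv_results_['mean_test_f1'][best]
print(f"best params: {search_demo.best_params_}")
print(f"accuracy {mean_acc:.4f} (±{std_acc:.4f}), F1 {mean_f1:.4f}")
```

With multiple metrics, the `cv_results_` keys are suffixed with the metric name (`mean_test_accuracy` instead of `mean_test_score`).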

Let's save all models

[Description]
[Whys]
[What]
[When]

In [None]:
try:
    # Try to save models from training
    os.makedirs("./models", exist_ok=True)
    for name, model in optimized_models:
        joblib.dump(model, f"./models/{name}.pkl", compress=True)
    print(f"💾 Saved {len(optimized_models)} models: {[name for name, _ in optimized_models]}")
except NameError:
    # Load models if optimized_models doesn't exist
    optimized_models = []
    if os.path.exists("./models"):
        for file in os.listdir("./models"):
            if file.endswith('.pkl'):
                name = file.replace('.pkl', '')
                model = joblib.load(f"./models/{file}")
                optimized_models.append((name, model))
        print(f"📁 Loaded {len(optimized_models)} models: {[name for name, _ in optimized_models]}")
    else:
        print("❌ No models found")
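
Before relying on saved models, it can help to confirm that a `joblib` round trip reproduces the original predictions. The sketch below does this with a small synthetic model and a temporary directory, so it does not touch `./models`. Pickle-based files should only be loaded from trusted sources, and are generally expected to be loaded with the same scikit-learn version that produced them.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small synthetic model standing in for one of the optimized pipelines.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=7)
pipe_demo = Pipeline([('LR', LogisticRegression(max_iter=200))]).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch does not touch ./models.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "LR.pkl")
    joblib.dump(pipe_demo, model_path, compress=True)
    restored = joblib.load(model_path)

# A restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == pipe_demo.predict(X_demo)).all())
print(f"Round trip preserved predictions: {same_predictions}")
```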

Let's select the best models

[Description]
[Whys]
[What]
[When]

In [None]:

print("\n" + "="*60)
print("🏆 FINAL MODEL RANKINGS:")
print("="*60)

# Sort and display results
sorted_results = sorted(zip(names, results), key=lambda x: x[1], reverse=True)
for i, (name, score) in enumerate(sorted_results, 1):
    emoji = "🥇" if i == 1 else "🥈" if i == 2 else "🥉" if i == 3 else "  "
    print(f"{emoji} {i:2d}. {name:<4}: {score:.4f}")

best_model_name = sorted_results[0][0]
best_model_score = sorted_results[0][1]
print(f"\n🏆 Champion Model: {best_model_name} ({best_model_score:.4f})")

# Create best_optimized_models with top 3 models
best_optimized_models = []
for name, score in sorted_results[:3]:
    # Find the corresponding model from optimized_models
    for opt_name, opt_model in optimized_models:
        if opt_name == name:
            best_optimized_models.append((opt_name, opt_model))
            break

print(f"\n🎯 Top 3 models selected: {[name for name, _ in best_optimized_models]}")
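
The nested loop that matches names to models can also be written as a dictionary lookup. The sketch below is one possible refactoring; `top_k_models` is a hypothetical helper, and the names and scores are made-up placeholders rather than results from this notebook.

```python
# Hypothetical names and scores, for illustration only (not results from this notebook).
names_demo = ['LR', 'KNN', 'CART', 'NB']
results_demo = [0.91, 0.95, 0.93, 0.80]
optimized_demo = [(n, f"<fitted {n}>") for n in names_demo]  # placeholders for fitted models

def top_k_models(names, scores, models, k=3):
    """Return the k (name, model) pairs with the highest scores, best first."""
    by_name = dict(models)
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [(name, by_name[name]) for name, _ in ranked[:k]]

top3_demo = top_k_models(names_demo, results_demo, optimized_demo)
print([name for name, _ in top3_demo])
```

With the placeholder scores above, the ranking is `KNN`, `CART`, `LR`.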

Let's validate the best models

[Description]
[Whys]
[What]
[When]
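
One common way to validate selected models is to score them once on the hold-out split that was set aside before cross-validation. The sketch below illustrates that idea on synthetic data and is not the notebook's own validation code; `best_demo` stands in for `best_optimized_models`, and treating label `1` as the class of interest is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for X_train / X_test (assumption: label 1 marks the rarer class).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=7, stratify=y_demo
)

best_demo = [('LR', Pipeline([('LR', LogisticRegression(max_iter=500))]))]

holdout_scores = {}
for name, model in best_demo:
    model.fit(X_tr, y_tr)                 # fit on the full training split
    y_pred = model.predict(X_te)          # score once on the untouched hold-out set
    holdout_scores[name] = {
        'accuracy': accuracy_score(y_te, y_pred),
        'precision': precision_score(y_te, y_pred, zero_division=0),
        'recall': recall_score(y_te, y_pred, zero_division=0),
        'confusion_matrix': confusion_matrix(y_te, y_pred),
    }
    summary = {k: round(float(v), 4) for k, v in holdout_scores[name].items()
               if k != 'confusion_matrix'}
    print(name, summary)
```

Reporting precision and recall next to accuracy shows how many flagged transactions are correct and how many positives are missed.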

Let's apply the model to unseen data

[Description]
[Whys]
[What]
[When]
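
Applying a fitted model to unlabeled transactions could look like the sketch below, which returns a probability per transaction as a risk score. Everything here is synthetic: the frames `df_lab` and `df_unl`, the feature column names, and the `txId` values are hypothetical stand-ins for `df_labeled` and `df_unlabeled`, and label `1` is assumed to be the class being scored.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-ins: the first 800 rows play the labeled set, the rest the unlabeled set.
X_all, y_all = make_classification(n_samples=1000, n_features=10, random_state=7)
feature_cols = [f"f{i}" for i in range(X_all.shape[1])]      # hypothetical column names
df_lab = pd.DataFrame(X_all[:800], columns=feature_cols)
df_unl = pd.DataFrame(X_all[800:], columns=feature_cols)
df_unl.insert(0, 'txId', list(range(len(df_unl))))           # hypothetical identifiers

model_demo = Pipeline([('LR', LogisticRegression(max_iter=500))])
model_demo.fit(df_lab, y_all[:800])

# Score with the same feature columns, in the same order, as used for training.
proba = model_demo.predict_proba(df_unl[feature_cols])
positive_col = list(model_demo.classes_).index(1)            # column of the class labeled 1
scored = df_unl[['txId']].assign(risk_score=proba[:, positive_col])
print(scored.sort_values('risk_score', ascending=False).head())
```

Because the unlabeled rows have no ground truth, scores like these can be ranked and reviewed but not evaluated for accuracy.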

#### 2. Train models using the unsupervised approach

#### 3. Train models using the unsupervised approach with the edge dataset