## MVP Know Your Transaction (KYT) - Real-Time Transaction Risk Scoring Engine

### Project Overview

This notebook presents a comprehensive implementation of a Real-Time Transaction Risk Scoring Engine for Anti-Money Laundering (AML) compliance in cryptocurrency transactions. The project addresses the critical need for sub-second risk assessment of Bitcoin transactions by combining traditional AML indicators with blockchain-specific risk factors.

### Domain Context: Financial AML for Transactions

#### Core Domain Definition
Anti-Money Laundering (AML) for transactions encompasses the comprehensive framework of laws, regulations, procedures, and technological solutions designed to prevent criminals from disguising illegally obtained funds as legitimate income through the global financial system. This domain includes detection, prevention, and reporting of money laundering, terrorist financing, tax evasion, market manipulation, and misuse of public funds.

### Problem Definition: Real-Time Transaction Risk Classification Engine

#### Problem Statement
Develop a system that assigns risk classifications to cryptocurrency transactions in real-time, integrating traditional AML indicators with blockchain-specific risk factors including wallet clustering, transaction graph analysis, and counterparty reputation scoring.

#### Technical Requirements
- **Problem Type**: Binary classification
- **Processing Speed**: Sub-second analysis for high-frequency transactions
- **Difficulty Level**: High - requires complex multi-dimensional data processing
- **Output Format**: Binary risk classification (illicit/licit)
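
One way to check the sub-second requirement is to time each scoring call. The sketch below is only illustrative: `score_transaction` is a hypothetical stand-in for a fitted model's prediction call, and the sample transactions are made up.

```python
import statistics
import time


def score_transaction(tx):
    # Hypothetical stand-in for a fitted model's scoring call
    return 1 if tx["amount"] > 10_000 else 0


def measure_latency(score_fn, transactions):
    """Return (median, worst) per-transaction latency in seconds."""
    timings = []
    for tx in transactions:
        start = time.perf_counter()
        score_fn(tx)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), max(timings)


# Made-up transactions, used only to exercise the timer
sample = [{"amount": a} for a in (50, 12_000, 700, 25_000)]
median_s, worst_s = measure_latency(score_transaction, sample)
print(f"median={median_s:.6f}s worst={worst_s:.6f}s")
```

In practice the same timer would wrap `pipeline.predict` on single-row inputs, which includes the preprocessing cost.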

#### Data Landscape
The system processes multiple data dimensions:
- Transaction metadata (amounts, timestamps, fees)
- Wallet addresses and clustering information
- Transaction graph relationships and network topology
- Counterparty databases and reputation scores
- Sanctions lists and regulatory databases
- Temporal patterns and behavioral baselines

### References

This notebook implementation is based on the comprehensive research and analysis conducted during the project development phase. The following reference documents were used in the composition of this initial description:

- **Domain Research**: [current-domain.md](domains/current-domain.md) - Contains detailed market analysis, regulatory framework research, and commercial viability assessment for the Financial AML domain
- **Problem Analysis**: [current-problem.md](problems/current-problem.md) - Provides comprehensive problem refinement, technical requirements analysis, and solution approach evaluation
- **Dataset Evaluation**: [current-dataset.md](datasets/current-dataset.md) - Documents dataset selection criteria, suitability scoring, and detailed feature analysis for the Elliptic dataset
- **Dataset Analysis & Preprocessing**: [dataset-analysis-and-preprocessing.ipynb](datasets/scripts/dataset-analysis-and-preprocessing.ipynb) - Comprehensive Jupyter notebook containing Elliptic dataset download, exploratory data analysis, feature engineering, preprocessing pipeline, and ML preparation steps

These reference documents contain the foundational research that informed the technical approach, feature engineering strategy, and implementation decisions reflected in this notebook.

---

This notebook serves as the primary entry point for the MVP KYT implementation, providing both technical implementation and business context for real-time cryptocurrency transaction risk assessment.

### Import Libraries

Import every library used in the notebook: data handling, preprocessing, model selection, metrics, classifiers, and Optuna for the hyperparameter search.

In [None]:
import os
import warnings
import joblib
import numpy as np
import pandas as pd
import optuna
from optuna.samplers import TPESampler
from optuna.pruners import MedianPruner
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score 
from sklearn.metrics import recall_score 
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform

# Uncomment the two lines below to silence library warnings and Optuna trial logs
#warnings.filterwarnings('ignore')
#optuna.logging.set_verbosity(optuna.logging.CRITICAL)

### Loading pre-processed datasets

The pre-processing step reduced the dimensionality from 166 features to only 46.

In [None]:

# The complete dataset already pre-processed 
df_complete = pd.read_hdf("./datasets/processed/df_complete.h5", key="df_complete")
print(f"Loaded from HDF5: {df_complete.shape} - All subsequent operations will use compressed data")

# The filtered labeled dataset already pre-processed
df_labeled = pd.read_hdf("./datasets/processed/df_labeled.h5", key="df_labeled")
print(f"Loaded from HDF5: {df_labeled.shape} - All subsequent operations will use compressed data")

# The filtered unlabeled dataset already pre-processed
df_unlabeled = pd.read_hdf("./datasets/processed/df_unlabeled.h5", key="df_unlabeled")
print(f"Loaded from HDF5: {df_unlabeled.shape} - All subsequent operations will use compressed data")

# The edges dataset that maps relationships between transaction nodes
df_edges = pd.read_hdf("./datasets/processed/df_edges.h5", key="df_edges")
print(f"Loaded from HDF5: {df_edges.shape} - All subsequent operations will use compressed data")


# Summary of all datasets
print(f"\n📊 Dataset Summary:")
print(f"  - Features: {df_complete.shape[0]:,} transactions × {df_complete.shape[1] -2} features")
print(f"  - Labeled: {df_labeled.shape[0]:,} transactions")
print(f"  - Unlabeled: {df_unlabeled.shape[0]:,} transactions")
print(f"  - Edges: {df_edges.shape[0]:,} transaction relationships")

### Machine Learning Strategy

Let's apply the machine learning techniques in nine steps:

1. Define overall parameters and make data splits
2. Define all training models to be used
3. Define all pipelines to be used during training
4. Define model parameter distributions for the optimization search
5. Define the objective function and metrics to optimize
6. Execute the optimized training
7. Save all resulting models
8. Validate all models and select the best models
9. Use best models to predict unknown data


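**1. Define overall parameters and make data splits**

Let's prepare the dataset for training and validation. The cell below fixes the random seed, separates the features from the `class` target, holds out 20% of the labeled transactions as a stratified test set, and sets up stratified cross-validation for the hyperparameter search.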
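
To illustrate why the split is stratified, the following sketch runs `train_test_split` on synthetic data and checks that both partitions keep the same class ratio. The 10% positive rate and the `*_demo` names are assumptions made for the example, not properties of the Elliptic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4354)
X_demo = rng.normal(size=(1000, 5))         # synthetic features
y_demo = np.array([1] * 100 + [0] * 900)    # assumed 10% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    shuffle=True,
    random_state=4354,
    stratify=y_demo,    # keep the class ratio in both partitions
)

train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(f"train positives: {train_ratio:.3f}, test positives: {test_ratio:.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.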

In [None]:
# Defining overall parameters
random_seed = 4354 # PARAMETER: random seed
test_size_split = 0.20 # PARAMETER: test set size
n_stratified_splits = 2 # PARAMETER: number of folds
n_pca_components = 0.95 # PARAMETER: fraction of variance PCA must retain

np.random.seed(random_seed)

# Prepare data (df_labeled already loaded: 46,564 × 48)
x_labeled = df_labeled.drop(['class', 'txId'], axis=1)  # 46 features
y_labeled = df_labeled['class']  # Binary target

X_train, X_test, y_train, y_test = train_test_split(x_labeled, y_labeled,
    test_size=test_size_split, 
    shuffle=True, 
    random_state=random_seed, 
    stratify=y_labeled) # stratified holdout

# Cross-validation setup
cv = StratifiedKFold(n_splits=n_stratified_splits, 
                     shuffle=True, 
                     random_state=random_seed)

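**2. Define all training models to be used**

Let's define which models to use. Five individual classifiers (logistic regression, k-nearest neighbors, decision tree, Gaussian naive Bayes, and SVM) are complemented by ensemble methods (bagging, random forest, extra trees, AdaBoost, gradient boosting, and a voting classifier built from the five individual models). All estimators start from scikit-learn defaults; their hyperparameters are tuned in step 6.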
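
Several of these estimators are stochastic, so repeated runs can differ unless a seed is passed to the estimator itself. The sketch below, on synthetic data, shows the effect of fixing `random_state`; the helper `fit_predict` and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)


def fit_predict(seed):
    # Fit a small forest with a fixed seed and return class-1 probabilities
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X_demo, y_demo).predict_proba(X_demo)[:, 1]


# Two fits with the same seed produce identical probabilities
same_seed_equal = np.array_equal(fit_predict(4354), fit_predict(4354))
print(f"identical with same seed: {same_seed_equal}")
```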

In [None]:
# Defining the individual models
reg = ('LR', LogisticRegression())
knn = ('KNN', KNeighborsClassifier())
cart = ('CART', DecisionTreeClassifier())
naive = ('NB', GaussianNB())
svm = ('SVM', SVC())

models = []
models.append(reg)
models.append(knn)
models.append(cart)
models.append(naive)
models.append(svm)

# Defining ensemble models
bagging = ('Bag', BaggingClassifier())
forest = ('RF', RandomForestClassifier())
extra = ('ET', ExtraTreesClassifier())
ada = ('Ada', AdaBoostClassifier())
gradient = ('GB', GradientBoostingClassifier())
voting = ('Voting', VotingClassifier(models))

**3. Define all pipelines to be used during training**

Let's define which ML pipelines to use during training.

Each pipeline chains three steps: standardization, PCA, and one classifier. The goal is to reduce the feature dimensionality while retaining most of the variance in the data, so that the features passed to the classifier stay informative for distinguishing licit from illicit transactions (classes 2 and 1, respectively, in the raw Elliptic labels) and training is faster. With `n_components=0.95`, PCA keeps the smallest number of components that explains at least 95% of the variance. Note that PCA builds new features as linear combinations of the originals; it does not select a subset of them.
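
As a minimal sketch of how a fractional `n_components` behaves, the example below builds ten correlated synthetic features from three underlying signals and lets PCA choose the component count. The data and the `reducer` name are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))      # three underlying signals
mixing = rng.normal(size=(3, 10))       # spread them over ten features
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

reducer = Pipeline([
    ("std", StandardScaler()),           # PCA is scale-sensitive
    ("pca", PCA(n_components=0.95)),     # keep >= 95% of the variance
])
X_reduced = reducer.fit_transform(X_demo)

pca_step = reducer.named_steps["pca"]
explained = pca_step.explained_variance_ratio_.sum()
print(f"{X_demo.shape[1]} features -> {pca_step.n_components_} components "
      f"({explained:.3f} of variance)")
```

Because the component count depends on the data, it is worth logging `n_components_` after fitting rather than assuming a fixed output width.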

In [None]:
# Creating the pipelines
pipelines = []
std  = ('std', StandardScaler())  # Standardization
pca = ('pca', PCA(n_components=n_pca_components))  # Feature reduction

# Defining the pipelines, for future experimentation 
pipelines.append(('LR', Pipeline([std, pca, reg]))) 
pipelines.append(('KNN', Pipeline([std, pca, knn])))
pipelines.append(('CART', Pipeline([std, pca, cart])))
pipelines.append(('NB', Pipeline([std, pca, naive])))
pipelines.append(('SVM', Pipeline([std, pca, svm])))
pipelines.append(('Bag', Pipeline([std, pca, bagging])))
pipelines.append(('RF', Pipeline([std, pca, forest])))
pipelines.append(('ET', Pipeline([std, pca, extra])))
pipelines.append(('Ada', Pipeline([std, pca, ada])))
pipelines.append(('GB', Pipeline([std, pca, gradient])))
#pipelines.append(('Vot', Pipeline([std, pca, voting])))

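**4. Define model parameter distributions for the optimization search**

Let's prepare the search space that Optuna samples from for each model. Each dictionary key uses the scikit-learn `<step>__<parameter>` form so the values can be passed to `Pipeline.set_params`, while the name given to each `trial.suggest_*` call is prefixed with the model name to keep it unique.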
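
The double-underscore convention is easy to get wrong, so here is a small sketch of how `Pipeline.set_params` routes values to a named step. The pipeline `pipe_demo` is a throwaway example, not one of the notebook's pipelines.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_demo = Pipeline([("std", StandardScaler()), ("LR", LogisticRegression())])

# '<step name>__<parameter>' routes the value to the named step
pipe_demo.set_params(LR__C=0.5, LR__max_iter=2000)
lr_c = pipe_demo.get_params()["LR__C"]

# A single underscore is not a valid address and raises ValueError
try:
    pipe_demo.set_params(LR_C=0.5)
    rejected = False
except ValueError:
    rejected = True

print(f"LR__C={lr_c}, single-underscore key rejected: {rejected}")
```

This is why the Optuna trial names (`LR_C`) cannot be passed to the pipeline directly; only the dictionary keys (`LR__C`) are valid parameter addresses.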

In [None]:
# Search space per model: keys are pipeline parameters, trial names are prefixed to stay unique
def suggest_params(trial, name):
    if name == 'LR':
        return {
            'LR__C': trial.suggest_float('LR_C', 1e-4, 1e2, log=True),
            'LR__solver': trial.suggest_categorical('LR_solver', ['lbfgs', 'newton-cg', 'sag', 'saga']),
            'LR__penalty': trial.suggest_categorical('LR_penalty', ['l2']),
            'LR__max_iter': trial.suggest_categorical('LR_max_iter', [1000, 2000, 5000]),
            'LR__tol': trial.suggest_float('LR_tol', 1e-6, 1e-3, log=True)
        }
    elif name == 'KNN':
        return {
            'KNN__n_neighbors': trial.suggest_int('KNN_n_neighbors', 3, 21, step=2),
            'KNN__weights': trial.suggest_categorical('KNN_weights', ['uniform', 'distance']),
            'KNN__metric': trial.suggest_categorical('KNN_metric', ['euclidean', 'manhattan', 'minkowski']),
            'KNN__p': trial.suggest_int('KNN_p', 1, 3)
        }
    elif name == 'CART':
        return {
            'CART__max_depth': trial.suggest_int('CART_max_depth', 3, 20),
            'CART__min_samples_split': trial.suggest_int('CART_min_samples_split', 10, 50),
            'CART__min_samples_leaf': trial.suggest_int('CART_min_samples_leaf', 5, 20),
            'CART__criterion': trial.suggest_categorical('CART_criterion', ['gini', 'entropy'])
        }
    elif name == 'NB':
        return {
            'NB__var_smoothing': trial.suggest_float('NB_var_smoothing', 1e-12, 1e-6, log=True)
        }
    elif name == 'SVM':
        return {
            'SVM__C': trial.suggest_float('SVM_C', 1e-2, 1e3, log=True),
            'SVM__kernel': trial.suggest_categorical('SVM_kernel', ['rbf', 'poly', 'sigmoid']),
            'SVM__gamma': trial.suggest_categorical('SVM_gamma', ['scale', 'auto']),
            'SVM__probability': True
        }
    elif name == 'RF':
        return {
            'RF__n_estimators': trial.suggest_int('RF_n_estimators', 100, 500, step=50),
            'RF__max_depth': trial.suggest_int('RF_max_depth', 10, 25),
            'RF__min_samples_split': trial.suggest_int('RF_min_samples_split', 5, 20),
            'RF__min_samples_leaf': trial.suggest_int('RF_min_samples_leaf', 2, 10),
            'RF__max_features': trial.suggest_categorical('RF_max_features', ['sqrt', 'log2', None]),
            'RF__bootstrap': trial.suggest_categorical('RF_bootstrap', [True, False])
        }
    elif name == 'ET':
        return {
            'ET__n_estimators': trial.suggest_int('ET_n_estimators', 100, 500, step=50),
            'ET__max_depth': trial.suggest_int('ET_max_depth', 10, 25),
            'ET__min_samples_split': trial.suggest_int('ET_min_samples_split', 5, 20),
            'ET__min_samples_leaf': trial.suggest_int('ET_min_samples_leaf', 2, 10),
            'ET__max_features': trial.suggest_categorical('ET_max_features', ['sqrt', 'log2', None]),
            'ET__bootstrap': trial.suggest_categorical('ET_bootstrap', [True, False])
        }
    elif name == 'GB':
        return {
            'GB__n_estimators': trial.suggest_int('GB_n_estimators', 100, 300, step=25),
            'GB__learning_rate': trial.suggest_float('GB_learning_rate', 1e-2, 3e-1, log=True),
            'GB__max_depth': trial.suggest_int('GB_max_depth', 3, 8),
            'GB__subsample': trial.suggest_float('GB_subsample', 0.7, 0.9)
        }
    elif name == 'Ada':
        return {
            'Ada__n_estimators': trial.suggest_int('Ada_n_estimators', 50, 200, step=25),
            'Ada__learning_rate': trial.suggest_float('Ada_learning_rate', 0.5, 1.5),
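            # NOTE: 'SAMME.R' is deprecated in recent scikit-learn releases; drop it if your version rejects it
            'Ada__algorithm': trial.suggest_categorical('Ada_algorithm', ['SAMME', 'SAMME.R'])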
        }
    elif name == 'Bag':
        return {
            'Bag__n_estimators': trial.suggest_int('Bag_n_estimators', 50, 200, step=25),
            'Bag__max_samples': trial.suggest_float('Bag_max_samples', 0.6, 0.9),
            'Bag__max_features': trial.suggest_float('Bag_max_features', 0.7, 1.0)
        }
    else:
        return {}

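**5. Define the objective function and metrics to optimize**

Let's define which metric to optimize in the training search. The scorer below currently returns plain accuracy; a weighted combination of recall, precision, and F1 is kept commented out as an alternative.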
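
On imbalanced data, accuracy alone can hide a model that never flags an illicit transaction. The sketch below enables the weighted formula that is commented out in the cell and compares it with accuracy on a made-up label vector. The weights are copied from that commented line; `pos_label=1` for the illicit class and the names `weighted_aml_score` and `weighted_aml_scorer` are assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             precision_score, recall_score)


def weighted_aml_score(y_true, y_pred):
    # Weights taken from the commented-out formula; pos_label=1 is assumed
    # to be the illicit class
    recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    precision = precision_score(y_true, y_pred, pos_label=1, zero_division=0)
    f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)
    return 0.35 * recall + 0.25 * precision + 0.25 * f1


weighted_aml_scorer = make_scorer(weighted_aml_score, greater_is_better=True)

# Made-up labels: 10 illicit, 90 licit, and a model that predicts all licit
y_true_demo = np.array([1] * 10 + [0] * 90)
always_licit = np.zeros(100, dtype=int)

demo_accuracy = accuracy_score(y_true_demo, always_licit)
demo_weighted = weighted_aml_score(y_true_demo, always_licit)
print(f"accuracy={demo_accuracy:.2f}, weighted score={demo_weighted:.2f}")
```

A scorer built this way can be passed as `scoring=` to `cross_val_score` in place of the string `'accuracy'`.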

In [None]:

# Multi-metric objective function for AML/KYT systems
def aml_objective(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    #precision = precision_score(y_true, y_pred, zero_division=0)
    #recall = recall_score(y_true, y_pred, zero_division=0)
    #f1 = f1_score(y_true, y_pred, zero_division=0)
    #return 0.35*recall + 0.25*precision + 0.25*f1
    return accuracy

aml_scorer = make_scorer(aml_objective, greater_is_better=True)

**6. Execute the optimized training**

Let's execute the training phase using Optuna's TPE sampler with cross-validation and rank the models by their best cross-validated score.

In [None]:
# Optuna optimization (TPE sampler)
op_n_trials = 2
op_n_jobs = 1
optimized_models = []
results = []

print("🚀 Optuna Optimization")
print("-" * 40)

for name, pipe in pipelines:
    print(f"{name}:", end=" ")

    def objective(trial):
        params = suggest_params(trial, name)
        if params:
            pipe.set_params(**params)
        scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring=aml_scorer, n_jobs=op_n_jobs)
        return scores.mean()

    study = optuna.create_study(direction='maximize', sampler=TPESampler(seed=random_seed))

    try:
        study.optimize(objective, n_trials=op_n_trials, show_progress_bar=True)
        best_score = study.best_value
        # study.best_params is keyed by trial names ('LR_C'), which are not valid
        # pipeline parameters; replay the best trial to get the 'LR__C' style keys
        best_params = suggest_params(optuna.trial.FixedTrial(study.best_params), name)
        if best_params:
            pipe.set_params(**best_params)
        # cross_val_score fits clones only, so fit the tuned pipeline itself
        pipe.fit(X_train, y_train)
    except Exception as exc:
        # Skip models whose search or final fit failed
        print(f"❌ failed ({exc})")
        continue

    print(f"✅ {best_score:.3f}")
    optimized_models.append((name, pipe))
    results.append(best_score)

# Visualization
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar([name for name, _ in optimized_models], results, color='skyblue', alpha=0.7)
ax.set_ylabel('AML Score')
ax.set_title('Model Performance')
for bar, score in zip(bars, results):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
            f'{score:.3f}', ha='center', va='bottom')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"🎯 Best: {[name for name, _ in optimized_models][np.argmax(results)]}")

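**7. Save all resulting models**

Let's save all models. Each tuned pipeline is written to disk with `joblib`; if the training cell has not been run in this session, the previously saved models are loaded instead.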
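
A quick round-trip check can confirm that a saved pipeline predicts the same values after reloading. This sketch uses a temporary directory and a toy pipeline; the file name `LR.pkl` mirrors the naming used in the cell, and everything else is assumed for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
pipe_demo = Pipeline([("std", StandardScaler()),
                      ("LR", LogisticRegression())]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "LR.pkl")
    joblib.dump(pipe_demo, path, compress=True)   # same call as in the cell
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions exactly
round_trip_ok = np.array_equal(pipe_demo.predict(X_demo), restored.predict(X_demo))
print(f"round trip ok: {round_trip_ok}")
```

Pickled estimators are tied to the library versions that produced them, so recording the scikit-learn version next to the saved files is a sensible precaution.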

In [None]:
# Defined outside the try block so the load branch can always see it
folderDir = "./models/mvp-kyt-sup-main-v2"
try:
    # Try to save models from training
    os.makedirs(folderDir, exist_ok=True)
    for name, pipe in optimized_models:
        joblib.dump(pipe, f"{folderDir}/{name}.pkl", compress=True)
    print(f"💾 Saved {len(optimized_models)} models: {[name for name, _ in optimized_models]}")
except NameError:
    # Load models if optimized_models doesn't exist
    optimized_models = []
    if os.path.exists(folderDir):
        for file in os.listdir(folderDir):
            if file.endswith('.pkl'):
                name = file.replace('.pkl', '')
                pipe = joblib.load(f"{folderDir}/{file}")
                optimized_models.append((name, pipe))
        print(f"📁 Loaded {len(optimized_models)} models: {[name for name, _ in optimized_models]}")
    else:
        print("❌ No models found")

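**8. Validate all models and select the best models**

Let's validate and select the best models. Each tuned pipeline is scored on the held-out test set, and the three with the highest test accuracy are kept. Because the test set is also used for this selection, the reported accuracies are likely to be slightly optimistic for the chosen models.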
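
Test accuracy can be complemented with per-class error counts. The sketch below, on synthetic imbalanced data, derives the confusion matrix and the precision, recall, and F1 of the positive class; treating label 1 as illicit is an assumption, as are the data and model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (label 1 assumed to be illicit)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true labels, columns are predicted labels, in the order [0, 1]
tn, fp, fn, tp = confusion_matrix(y_te, y_pred, labels=[0, 1]).ravel()
report = {
    "precision": precision_score(y_te, y_pred, zero_division=0),
    "recall": recall_score(y_te, y_pred, zero_division=0),
    "f1": f1_score(y_te, y_pred, zero_division=0),
}
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print({k: round(v, 3) for k, v in report.items()})
```

In an AML setting, false negatives (`fn`) are missed illicit transactions, which is why recall is often weighted more heavily than accuracy.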

In [None]:
# Select best 3 models based on test accuracy
test_results = []
for name, pipe in optimized_models:
    y_pred = pipe.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    test_results.append((name, pipe, accuracy))
    print(f"{name}: Test Accuracy = {accuracy:.4f}")

# Bar chart instead of boxplot for single values
fig, ax = plt.subplots(figsize=(12, 6))
names = [x[0] for x in test_results]
accuracies = [x[2] for x in test_results]
bars = ax.bar(names, accuracies, color='lightcoral', alpha=0.7)
ax.set_ylabel('Test Accuracy')
ax.set_title('Models Test Accuracy Comparison')
for bar, acc in zip(bars, accuracies):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
            f'{acc:.4f}', ha='center', va='bottom')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Sort by test accuracy and get top 3
test_results.sort(key=lambda x: x[2], reverse=True)
best_optimized_models = [(name, model) for name, model, acc in test_results[:3]]

print(f"\n🏆 Final top 3 models: {[name for name, _ in best_optimized_models]}")
print(f"📊 Test accuracies: {[f'{acc:.4f}' for _, _, acc in test_results[:3]]}")

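**9. Use best models to predict unknown data**

Let's apply the best model to the unlabeled data. The top-ranked pipeline predicts a class for every unlabeled transaction and, when the model supports it, a probability that serves as the risk score.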
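
The probability column that represents the illicit class depends on the order of `classes_`, so it is safer to look it up than to hard-code an index. This sketch assumes illicit is encoded as 1 and uses a hypothetical alert threshold of 0.8; both are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

illicit_label = 1   # assumption: illicit transactions are encoded as 1
illicit_col = list(model.classes_).index(illicit_label)
risk_scores = model.predict_proba(X_demo)[:, illicit_col]

alert_threshold = 0.8   # hypothetical cut-off, stricter than the default 0.5
alerts = risk_scores >= alert_threshold
print(f"{alerts.sum()} of {len(alerts)} transactions exceed the alert threshold")
```

Raising the threshold trades recall for precision, so the cut-off would normally be chosen from a precision-recall curve on labeled data.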

In [None]:
# Apply best model to unlabeled data
best_model = best_optimized_models[0][1]  # Champion model
X_unlabeled = df_unlabeled.drop(['class', 'txId'], axis=1)

predictions = best_model.predict(X_unlabeled)
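try:
    # Column 1 is the probability of the second label in best_model.classes_
    probabilities = best_model.predict_proba(X_unlabeled)[:, 1]
except AttributeError:
    # Model exposes no predict_proba; fall back to the hard predictions
    probabilities = predictions.astype(float)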

print(f"🔮 Predictions on {len(X_unlabeled)} unlabeled transactions")
print(f"Illicit predictions: {sum(predictions)} ({sum(predictions)/len(predictions)*100:.1f}%)")
print(f"Risk scores range: {probabilities.min():.3f} - {probabilities.max():.3f}")

# Show first 10 predictions
print(f"\nFirst 10 predictions: {predictions[:10]}")
print(f"First 10 risk scores: {probabilities[:10]}")

### Conclusions

[conclusion]