## MVP Know Your Transaction (KYT) - Real-Time Transaction Risk Scoring Engine

### Project Overview

This notebook presents a comprehensive implementation of a Real-Time Transaction Risk Scoring Engine for Anti-Money Laundering (AML) compliance in cryptocurrency transactions. The project addresses the critical need for sub-second risk assessment of Bitcoin transactions by combining traditional AML indicators with blockchain-specific risk factors.

### Domain Context: Financial AML for Transactions

#### Core Domain Definition
Anti-Money Laundering (AML) for transactions encompasses the comprehensive framework of laws, regulations, procedures, and technological solutions designed to prevent criminals from disguising illegally obtained funds as legitimate income through the global financial system. This domain includes detection, prevention, and reporting of money laundering, terrorist financing, tax evasion, market manipulation, and misuse of public funds.

### Problem Definition: Real-Time Transaction Risk Classification Engine

#### Problem Statement
Develop a system that assigns risk classifications to cryptocurrency transactions in real-time, integrating traditional AML indicators with blockchain-specific risk factors including wallet clustering, transaction graph analysis, and counterparty reputation scoring.

#### Technical Requirements
- **Problem Type**: Classification 
- **Processing Speed**: Sub-second analysis for high-frequency transactions
- **Difficulty Level**: High - requires complex multi-dimensional data processing
- **Output Format**: Binary risk classification (illicit/licit)

#### Data Landscape
The system processes multiple data dimensions:
- Transaction metadata (amounts, timestamps, fees)
- Wallet addresses and clustering information
- Transaction graph relationships and network topology
- Counterparty databases and reputation scores
- Sanctions lists and regulatory databases
- Temporal patterns and behavioral baselines

### References

This notebook implementation is based on the comprehensive research and analysis conducted during the project development phase. The following reference documents were used in the composition of this initial description:

- **Domain Research**: [current-domain.md](domains/current-domain.md) - Contains detailed market analysis, regulatory framework research, and commercial viability assessment for the Financial AML domain
- **Problem Analysis**: [current-problem.md](problems/current-problem.md) - Provides comprehensive problem refinement, technical requirements analysis, and solution approach evaluation
- **Dataset Evaluation**: [current-dataset.md](datasets/current-dataset.md) - Documents dataset selection criteria, suitability scoring, and detailed feature analysis for the Elliptic dataset
- **Dataset Analysis & Preprocessing**: [dataset-analysis-and-preprocessing.ipynb](datasets/scripts/dataset-analysis-and-preprocessing.ipynb) - Comprehensive Jupyter notebook containing Elliptic dataset download, exploratory data analysis, feature engineering, preprocessing pipeline, and ML preparation steps

These reference documents contain the foundational research that informed the technical approach, feature engineering strategy, and implementation decisions reflected in this notebook.

---

This notebook serves as the primary entry point for the MVP KYT implementation, providing both technical implementation and business context for real-time cryptocurrency transaction risk assessment.

### Import Libraries

Import the libraries used for data handling, plotting, model training, and evaluation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

### Loading pre-processed datasets

The pre-processing step reduced the dimensionality from 166 features to only 46.

In [4]:

# The complete dataset already pre-processed 
df_complete = pd.read_hdf("./datasets/processed/df_complete.h5", key="df_complete")
print(f"Loaded from HDF5: {df_complete.shape} - All subsequent operations will use compressed data")

# The filtered labeled dataset already pre-processed
df_labeled = pd.read_hdf("./datasets/processed/df_labeled.h5", key="df_labeled")
print(f"Loaded from HDF5: {df_labeled.shape} - All subsequent operations will use compressed data")

# The filtered unlabeled dataset already pre-processed
df_unlabeled = pd.read_hdf("./datasets/processed/df_unlabeled.h5", key="df_unlabeled")
print(f"Loaded from HDF5: {df_unlabeled.shape} - All subsequent operations will use compressed data")

# The edges dataset that maps relationships between transaction nodes
df_edges = pd.read_hdf("./datasets/processed/df_edges.h5", key="df_edges")
print(f"Loaded from HDF5: {df_edges.shape} - All subsequent operations will use compressed data")


# Summary of all datasets
print("\n📊 Dataset Summary:")
print(f"  - Features: {df_complete.shape[0]:,} transactions × {df_complete.shape[1] - 2} features")
print(f"  - Labeled: {df_labeled.shape[0]:,} transactions")
print(f"  - Unlabeled: {df_unlabeled.shape[0]:,} transactions")
print(f"  - Edges: {df_edges.shape[0]:,} transaction relationships")

Loaded from HDF5: (203769, 48) - All subsequent operations will use compressed data
Loaded from HDF5: (46564, 48) - All subsequent operations will use compressed data
Loaded from HDF5: (157205, 48) - All subsequent operations will use compressed data
Loaded from HDF5: (234355, 2) - All subsequent operations will use compressed data

📊 Dataset Summary:
  - Features: 203,769 transactions × 46 features
  - Labeled: 46,564 transactions
  - Unlabeled: 157,205 transactions
  - Edges: 234,355 transaction relationships
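
The printed shapes suggest that the labeled and unlabeled subsets partition the complete dataset (46,564 + 157,205 = 203,769). A minimal sketch of how that could be verified is shown below. The helper `check_partition` and the toy frames are hypothetical, and the sketch assumes every frame carries a `txId` column, as `df_labeled` does.

```python
import pandas as pd


def check_partition(complete, labeled, unlabeled, key="txId"):
    """Return True when `labeled` and `unlabeled` split `complete` exactly."""
    ids_labeled = set(labeled[key])
    ids_unlabeled = set(unlabeled[key])
    no_overlap = ids_labeled.isdisjoint(ids_unlabeled)
    covers_all = (ids_labeled | ids_unlabeled) == set(complete[key])
    same_row_count = len(labeled) + len(unlabeled) == len(complete)
    return no_overlap and covers_all and same_row_count


# Toy stand-ins for df_complete, df_labeled and df_unlabeled (hypothetical data)
toy_complete = pd.DataFrame({"txId": [10, 11, 12, 13, 14]})
toy_labeled = toy_complete.iloc[:2]
toy_unlabeled = toy_complete.iloc[2:]

print(check_partition(toy_complete, toy_labeled, toy_unlabeled))
```

With the real data, the call would be `check_partition(df_complete, df_labeled, df_unlabeled)`, provided the column assumption holds.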


### Machine Learning Strategy

We try three machine learning approaches on the data and compare them; each approach yields its own best-performing model.

1. Train supervised models using only the labeled subset;
2. Train unsupervised models on the complete dataset, ignoring the edge dataset;
3. Train unsupervised models on the complete dataset, taking into account the relationships between transactions through the edge dataset.

#### 1. Train models using the supervised methodology

Let's prepare the dataset for training and validation

[Description]
[Whys]
[What]
[When]

In [None]:
# Defining some parameters
seed = 7 # random seed
np.random.seed(seed)

# Prepare data (df_labeled already loaded: 46,564 × 48)
test_size = 0.20 # test set size
X = df_labeled.drop(['class', 'txId'], axis=1)  # 46 features
y = df_labeled['class']  # Binary target
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=test_size, shuffle=True, random_state=seed, stratify=y) # stratified holdout

# Cross-validation setup
n_splits = 10 # PARAMETER: number of folds
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
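
The effect of `stratify=y` and `StratifiedKFold` can be checked on synthetic data. The sketch below uses made-up arrays (`X_demo`, `y_demo`) with an assumed 10% positive class. It does not use the real `df_labeled`; it only illustrates that the class share is preserved in the holdout and in each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the labeled data: 1,000 rows, 10% positive (assumption)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 100 + [0] * 900)

# Stratified holdout with the same parameters as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, shuffle=True, random_state=7, stratify=y_demo
)
print(f"positive share: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")

# Stratified folds on the training portion
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_shares = [y_tr[val_idx].mean() for _, val_idx in cv_demo.split(X_tr, y_tr)]
print(f"fold shares range: {min(fold_shares):.3f} to {max(fold_shares):.3f}")
```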

Let's define the models to use

[Description]
[Whys]
[What]
[When]

In [None]:
# Defining some model parameters
num_trees = 100 # PARAMETER: number of trees
max_features = 3 # PARAMETER: max features for RandomForest and ExtraTrees

# Defining the individual models
reg = ('LR', LogisticRegression(max_iter=200))
knn = ('KNN', KNeighborsClassifier())
cart = ('CART', DecisionTreeClassifier())
naive = ('NB', GaussianNB())
svm = ('SVM', SVC())

models = []
models.append(reg)
models.append(knn)
models.append(cart)
models.append(naive)
models.append(svm)

# Defining ensemble models
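# Pass a fresh estimator positionally: `cart` is a (name, estimator) tuple, and the
# keyword was renamed from base_estimator to estimator in scikit-learn 1.2
bagging = ('Bag', BaggingClassifier(DecisionTreeClassifier(), n_estimators=num_trees))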
forest = ('RF', RandomForestClassifier(n_estimators=num_trees, max_features=max_features))
extra = ('ET', ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features))
ada = ('Ada', AdaBoostClassifier(n_estimators=num_trees))
gradient = ('GB', GradientBoostingClassifier(n_estimators=num_trees))
voting = ('Voting', VotingClassifier(models))
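
`StandardScaler` and `MinMaxScaler` are imported above but not used in the model definitions. If the pre-processed features are not already on a common scale (the notebook does not say), distance-based models such as KNN may benefit from a scaling step inside the pipeline. The following is a minimal sketch on synthetic data, not the notebook's actual configuration; `X_demo`, `y_demo`, and the step name `scale` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=7
)
X_demo[:, 0] *= 1000.0

# The scaler is fitted inside each CV fold, so no statistics leak from validation data
scaled_knn = Pipeline([("scale", StandardScaler()), ("KNN", KNeighborsClassifier())])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(f"scaled KNN: {scores.mean():.3f} ({scores.std():.3f})")
```

Placing the scaler inside the `Pipeline`, rather than scaling `X` up front, keeps the cross-validation estimate honest because the scaler only sees the training part of each fold.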

Let's create our pipeline and execute the training

[Description]
[Whys]
[What]
[When]

In [None]:
# Creating the pipelines
pipelines = []
results = []
names = []

# Defining the pipelines
pipelines.append(('LR', Pipeline([reg]))) 
pipelines.append(('KNN', Pipeline([knn])))
pipelines.append(('CART', Pipeline([cart])))
pipelines.append(('NB', Pipeline([naive])))
pipelines.append(('SVM', Pipeline([svm])))
pipelines.append(('Bag', Pipeline([bagging])))
pipelines.append(('RF', Pipeline([forest])))
pipelines.append(('ET', Pipeline([extra])))
pipelines.append(('Ada', Pipeline([ada])))
pipelines.append(('GB', Pipeline([gradient])))
pipelines.append(('Vot', Pipeline([voting])))

# Evaluating each model in the pipeline
scoring = 'accuracy' # PARAMETER: scoring metric
for name, model in pipelines:
    cv_results = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %.3f (%.3f)" % (name, cv_results.mean(), cv_results.std()) 
    print(msg)
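
The loop above compares models on accuracy only. When one class is much rarer than the other, accuracy can look high even if the rare class is mostly missed, so it may be worth reporting precision, recall, and F1 alongside it. The sketch below uses `cross_validate` on synthetic, imbalanced data. The 90/10 class split and the encoding of the positive class as `1` are assumptions, not facts taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1 (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=7
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

# cross_validate returns one array of fold scores per requested metric
cv_out = cross_validate(
    LogisticRegression(max_iter=200), X_demo, y_demo, cv=cv_demo, scoring=metrics
)
for metric in metrics:
    fold_scores = cv_out[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} ({fold_scores.std():.3f})")
```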

Let's validate the model

[Description]
[Whys]
[What]
[When]
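
One possible shape for this validation step is to refit the selected model on the training split and score it once on the held-out test split. The sketch below runs on synthetic data. `RandomForestClassifier` is only a placeholder, since the notebook does not state which model is selected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X, y (assumption: binary target, minority class = 1)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=7, stratify=y_demo
)

# Placeholder model: fit on the training split only
model = RandomForestClassifier(n_estimators=100, random_state=7)
model.fit(X_tr, y_tr)

# Evaluate once on the holdout split
y_pred = model.predict(X_te)
cm = confusion_matrix(y_te, y_pred)
print(f"holdout accuracy: {accuracy_score(y_te, y_pred):.3f}")
print(cm)
print(classification_report(y_te, y_pred, digits=3))
```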

Let's apply the model to unseen data

[Description]
[Whys]
[What]
[When]
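
A hedged sketch of how a fitted classifier might be applied to the unlabeled transactions follows. The toy frames, the feature names `f1` to `f3`, the `risk_score` column, and the 0.5 threshold are all illustrative assumptions; the real feature columns and the chosen model come from the earlier cells.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
feature_cols = ["f1", "f2", "f3"]  # hypothetical feature names

# Toy stand-in for df_labeled
toy_labeled = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_cols)
toy_labeled["class"] = (toy_labeled["f1"] + toy_labeled["f2"] > 0).astype(int)

# Toy stand-in for df_unlabeled (no target column)
toy_unlabeled = pd.DataFrame(rng.normal(size=(50, 3)), columns=feature_cols)
toy_unlabeled.insert(0, "txId", np.arange(1000, 1050))

model = RandomForestClassifier(n_estimators=100, random_state=7)
model.fit(toy_labeled[feature_cols], toy_labeled["class"])

# Probability of the positive class, looked up via classes_ to avoid column mix-ups
proba = model.predict_proba(toy_unlabeled[feature_cols])
pos_col = list(model.classes_).index(1)
scored = pd.DataFrame({"txId": toy_unlabeled["txId"], "risk_score": proba[:, pos_col]})
scored["predicted_class"] = (scored["risk_score"] >= 0.5).astype(int)

print(scored.sort_values("risk_score", ascending=False).head())
```

Keeping the continuous `risk_score` next to the binary prediction lets the threshold be tuned later without re-scoring.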

#### 2. Train models using the unsupervised approach
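
The notebook does not yet say which unsupervised method will be used here. As one illustrative option only, the sketch below runs scikit-learn's `IsolationForest` on synthetic data. The contamination rate and the data are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 500 typical rows plus 10 rows shifted far from the rest (assumption)
rng = np.random.default_rng(7)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
X_odd = rng.normal(loc=6.0, scale=1.0, size=(10, 4))
X_all = np.vstack([X_normal, X_odd])

# contamination is the assumed share of anomalies; 0.02 is illustrative only
detector = IsolationForest(n_estimators=100, contamination=0.02, random_state=7)
flags = detector.fit_predict(X_all)  # -1 = flagged as outlier, 1 = inlier

# score_samples is lower for abnormal rows, so negate it to get "higher = riskier"
anomaly_scores = -detector.score_samples(X_all)

print(f"flagged {np.sum(flags == -1)} of {len(X_all)} rows")
```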

#### 3. Train models using the unsupervised approach with the edge dataset
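
One simple way to bring the edge dataset into a feature table is to derive per-transaction degree counts and join them onto the node features. The sketch below assumes the two edge columns are named `txId1` (source) and `txId2` (target); the actual column names in `df_edges` are not shown in this notebook, and the toy frames are made up.

```python
import pandas as pd

# Toy edge list and node table (hypothetical data and column names)
toy_edges = pd.DataFrame({"txId1": [1, 1, 2, 3], "txId2": [2, 3, 3, 4]})
toy_nodes = pd.DataFrame({"txId": [1, 2, 3, 4, 5]})

# Count outgoing and incoming edges per transaction
out_degree = toy_edges["txId1"].value_counts()
in_degree = toy_edges["txId2"].value_counts()

# Map counts onto the node table; transactions absent from a column get degree 0
toy_nodes["out_degree"] = toy_nodes["txId"].map(out_degree).fillna(0).astype(int)
toy_nodes["in_degree"] = toy_nodes["txId"].map(in_degree).fillna(0).astype(int)

print(toy_nodes)
```

Degree counts are only a starting point; richer graph features would need a dedicated graph library, which is outside this sketch.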