## MVP Know Your Transaction (KYT) - Real-Time Transaction Risk Scoring Engine

### Project Overview

This notebook presents a comprehensive implementation of a Real-Time Transaction Risk Scoring Engine for Anti-Money Laundering (AML) compliance in cryptocurrency transactions. The project addresses the critical need for sub-second risk assessment of Bitcoin transactions by combining traditional AML indicators with blockchain-specific risk factors.

### Domain Context: Financial AML for Transactions

#### Core Domain Definition
Anti-Money Laundering (AML) for transactions encompasses the comprehensive framework of laws, regulations, procedures, and technological solutions designed to prevent criminals from disguising illegally obtained funds as legitimate income through the global financial system. This domain includes detection, prevention, and reporting of money laundering, terrorist financing, tax evasion, market manipulation, and misuse of public funds.

### Problem Definition: Real-Time Transaction Risk Classification Engine

#### Problem Statement
Develop a system that assigns risk classifications to cryptocurrency transactions in real-time, integrating traditional AML indicators with blockchain-specific risk factors including wallet clustering, transaction graph analysis, and counterparty reputation scoring.

#### Technical Requirements
- **Problem Type**: Classification 
- **Processing Speed**: Sub-second analysis for high-frequency transactions
- **Difficulty Level**: High - requires complex multi-dimensional data processing
- **Output Format**: Risk binary classification (illicit/licit)

#### Data Landscape
The system processes multiple data dimensions:
- Transaction metadata (amounts, timestamps, fees)
- Wallet addresses and clustering information
- Transaction graph relationships and network topology
- Counterparty databases and reputation scores
- Sanctions lists and regulatory databases
- Temporal patterns and behavioral baselines

### References

This notebook implementation is based on the comprehensive research and analysis conducted during the project development phase. The following reference documents were used in the composition of this initial description:

- **Domain Research**: [current-domain.md](domains/current-domain.md) - Contains detailed market analysis, regulatory framework research, and commercial viability assessment for the Financial AML domain
- **Problem Analysis**: [current-problem.md](problems/current-problem.md) - Provides comprehensive problem refinement, technical requirements analysis, and solution approach evaluation
- **Dataset Evaluation**: [current-dataset.md](datasets/current-dataset.md) - Documents dataset selection criteria, suitability scoring, and detailed feature analysis for the Elliptic dataset
- **Dataset Analysis & Preprocessing**: [dataset-analysis-and-preprocessing.ipynb](datasets/scripts/dataset-analysis-and-preprocessing.ipynb) - Comprehensive Jupyter notebook containing Elliptic dataset download, exploratory data analysis, feature engineering, preprocessing pipeline, and ML preparation steps

These reference documents contain the foundational research that informed the technical approach, feature engineering strategy, and implementation decisions reflected in this notebook.

---

This notebook serves as the primary entry point for the MVP KYT implementation, providing both technical implementation and business context for real-time cryptocurrency transaction risk assessment.

### Import Libraries

Comprehensive import of all required libraries for machine learning procedures.

In [31]:
import os
import joblib
import numpy as np
import pandas as pd
import optuna
from optuna.samplers import TPESampler
from optuna.pruners import MedianPruner
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score 
from sklearn.metrics import recall_score 
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform

### Loading pre-processed datasets

The pre-processing step reduced the dimensionality from 166 features to only 46.

In [32]:

# The complete dataset already pre-processed 
df_complete = pd.read_hdf("./datasets/processed/df_complete.h5", key="df_complete")
print(f"Loaded from HDF5: {df_complete.shape} - All subsequent operations will use compressed data")

# The filtered labeled dataset already pre-processed
df_labeled = pd.read_hdf("./datasets/processed/df_labeled.h5", key="df_labeled")
print(f"Loaded from HDF5: {df_labeled.shape} - All subsequent operations will use compressed data")

# The filtered unlabeled dataset already pre-processed
df_unlabeled = pd.read_hdf("./datasets/processed/df_unlabeled.h5", key="df_unlabeled")
print(f"Loaded from HDF5: {df_unlabeled.shape} - All subsequent operations will use compressed data")

# The edges dataset that maps relationships between transaction nodes
df_edges = pd.read_hdf("./datasets/processed/df_edges.h5", key="df_edges")
print(f"Loaded from HDF5: {df_edges.shape} - All subsequent operations will use compressed data")


# Summary of all datasets
print(f"\n📊 Dataset Summary:")
print(f"  - Features: {df_complete.shape[0]:,} transactions × {df_complete.shape[1] -2} features")
print(f"  - Labeled: {df_labeled.shape[0]:,} transactions")
print(f"  - Unlabeled: {df_unlabeled.shape[0]:,} transactions")
print(f"  - Edges: {df_edges.shape[0]:,} transaction relationships")

Loaded from HDF5: (203769, 168) - All subsequent operations will use compressed data
Loaded from HDF5: (46564, 168) - All subsequent operations will use compressed data
Loaded from HDF5: (157205, 168) - All subsequent operations will use compressed data
Loaded from HDF5: (234355, 2) - All subsequent operations will use compressed data

📊 Dataset Summary:
  - Features: 203,769 transactions × 166 features
  - Labeled: 46,564 transactions
  - Unlabeled: 157,205 transactions
  - Edges: 234,355 transaction relationships


### Machine Learning Strategy

Let's apply the machine learning technics:

1. Define overall parameters and make data splits
2. Defining all training models to be used
3. Defining all pipelines to be used during training
4. Defining model parameters distribution for a optimization search 
5. Defining the object function and metrics to optimize
6. Execute the optimized training
7. Save all resulting models
8. Validate all models and select the best models
9. Use best models to predict unknown data 


**1.Define overall parameters and make data splits**

Let`s prepare the dataset for training and validation

[Description]
[Whys]
[What]
[When]

In [33]:
# Defining overall parameters
random_seed = 4354 # PARAMETER: random seed
test_size_split = 0.20 # PARAMETER: test set size
n_stratified_splits = 2 # PARAMETER: number of folds
n_pca_components = 0.95 # PARAMETER: PCA components to keep

np.random.seed(random_seed)

# Prepare data (df_labeled already loaded: 46,564 × 48)
x_labeled = df_labeled.drop(['class', 'txId'], axis=1)  # 46 features
y_labeled = df_labeled['class']  # Binary target

X_train, X_test, y_train, y_test = train_test_split(x_labeled, y_labeled,
    test_size=test_size_split, 
    shuffle=True, 
    random_state=random_seed, 
    stratify=y_labeled) # stratified holdout

# Cross-validation setup
cv = StratifiedKFold(n_splits=n_stratified_splits, 
                     shuffle=True, 
                     random_state=random_seed)

**2. Defining all training models to be used**

Let's define which models to use

[Description]
[Whys]
[What]
[When]

In [34]:
# Defining the individual models
reg = ('LR', LogisticRegression())
knn = ('KNN', KNeighborsClassifier())
cart = ('CART', DecisionTreeClassifier())
naive = ('NB', GaussianNB())
svm = ('SVM', SVC())

models = []
models.append(reg)
models.append(knn)
models.append(cart)
models.append(naive)
models.append(svm)

# Defining ensemble models
bagging = ('Bag', BaggingClassifier())
forest = ('RF', RandomForestClassifier())
extra = ('ET', ExtraTreesClassifier())
ada = ('Ada', AdaBoostClassifier())
gradient = ('GB', GradientBoostingClassifier())
voting = ('Voting', VotingClassifier(models))

**3. Defining all pipelines to be used during training**

Let's define which ML pipelines to use during training.

[Description]
[Whys]
[What]
[When]

This section focuses on selecting the most relevant features for distinguishing between licit (class 1) and illicit (class 2) transactions. We are aiming to reduce the feature dimensionality at the same time as maximizing the dissimilarity of the original dataset, thus extracting the most discriminative features and improving the model training performance. Let`s apply the following technics:

In [35]:
# Creating the pipelines
pipelines = []
std  = ('std', StandardScaler())  # Standardization
pca = ('pca', PCA(n_components=n_pca_components))  # Feature reduction

# Defining the pipelines, for future experimentation 
pipelines.append(('LR', Pipeline([std, pca, reg]))) 
pipelines.append(('KNN', Pipeline([std, pca, knn])))
pipelines.append(('CART', Pipeline([std, pca, cart])))
pipelines.append(('NB', Pipeline([std, pca, naive])))
pipelines.append(('SVM', Pipeline([std, pca, svm])))
pipelines.append(('Bag', Pipeline([std, pca, bagging])))
pipelines.append(('RF', Pipeline([std, pca, forest])))
pipelines.append(('ET', Pipeline([std, pca, extra])))
pipelines.append(('Ada', Pipeline([std, pca, ada])))
pipelines.append(('GB', Pipeline([std, pca, gradient])))
#pipelines.append(('Vot', Pipeline([std, pca, voting])))

**4. Defining model parameters distribution for a grid search approach**

Let's prepare the parameter distributions for a random grid search

[Description]
[Whys]
[What]
[When]

In [36]:
# Optuna Parameter Suggestions - All original parameters preserved with comments
def suggest_params(trial, name):
    params = {
        'LR': {
            'LR__C': trial.suggest_float('LR__C', 1e-4, 1e2, log=True),  # Regularization strength
            'LR__solver': trial.suggest_categorical('LR__solver', ['lbfgs', 'newton-cg', 'sag', 'saga']),  # Optimization algorithm
            'LR__penalty': trial.suggest_categorical('LR__penalty', ['l2', 'none']),  # Regularization type
            'LR__max_iter': trial.suggest_categorical('LR__max_iter', [1000, 2000, 5000]),  # Convergence limit
            'LR__tol': trial.suggest_float('LR__tol', 1e-6, 1e-3, log=True)  # Convergence tolerance
        },
        'KNN': {
            'KNN__n_neighbors': trial.suggest_int('KNN__n_neighbors', 3, 21, step=2),  # Number of neighbors
            'KNN__weights': trial.suggest_categorical('KNN__weights', ['uniform', 'distance']),  # Weight function
            'KNN__metric': trial.suggest_categorical('KNN__metric', ['euclidean', 'manhattan', 'minkowski']),  # Distance metric
            'KNN__p': trial.suggest_int('KNN__p', 1, 3)  # Minkowski power parameter
        },
        'CART': {
            'CART__max_depth': trial.suggest_int('CART__max_depth', 3, 20),  # Maximum tree depth
            'CART__min_samples_split': trial.suggest_int('CART__min_samples_split', 10, 50),  # Min samples to split
            'CART__min_samples_leaf': trial.suggest_int('CART__min_samples_leaf', 5, 20),  # Min samples per leaf
            'CART__criterion': trial.suggest_categorical('CART__criterion', ['gini', 'entropy'])  # Split quality measure
        },
        'NB': {
            'NB__var_smoothing': trial.suggest_float('NB__var_smoothing', 1e-12, 1e-6, log=True)  # Variance smoothing
        },
        'SVM': {
            'SVM__C': trial.suggest_float('SVM__C', 1e-2, 1e3, log=True),  # Regularization parameter
            'SVM__kernel': trial.suggest_categorical('SVM__kernel', ['rbf', 'poly', 'sigmoid']),  # Kernel function
            'SVM__gamma': trial.suggest_categorical('SVM__gamma', ['scale', 'auto']),  # Kernel coefficient
            'SVM__probability': True  # Enable probability estimates for AML risk scoring
        },
        'RF': {
            'RF__n_estimators': trial.suggest_int('RF__n_estimators', 100, 500, step=50),  # Number of trees
            'RF__max_depth': trial.suggest_int('RF__max_depth', 10, 25),  # Individual tree depth
            'RF__min_samples_split': trial.suggest_int('RF__min_samples_split', 5, 20),  # Conservative splitting
            'RF__min_samples_leaf': trial.suggest_int('RF__min_samples_leaf', 2, 10),  # Leaf size constraint
            'RF__max_features': trial.suggest_categorical('RF__max_features', ['sqrt', 'log2', None]),  # Feature subset
            'RF__bootstrap': trial.suggest_categorical('RF__bootstrap', [True, False])  # Bootstrap sampling
        },
        'ET': {
            'ET__n_estimators': trial.suggest_int('ET__n_estimators', 100, 500, step=50),  # Number of trees
            'ET__max_depth': trial.suggest_int('ET__max_depth', 10, 25),  # Maximum depth
            'ET__min_samples_split': trial.suggest_int('ET__min_samples_split', 5, 20),  # Min samples to split
            'ET__min_samples_leaf': trial.suggest_int('ET__min_samples_leaf', 2, 10),  # Min samples at leaf
            'ET__max_features': trial.suggest_categorical('ET__max_features', ['sqrt', 'log2', None]),  # Feature subset
            'ET__bootstrap': trial.suggest_categorical('ET__bootstrap', [True, False])  # Bootstrap sampling
        },
        'GB': {
            'GB__n_estimators': trial.suggest_int('GB__n_estimators', 100, 300, step=25),  # Boosting stages
            'GB__learning_rate': trial.suggest_float('GB__learning_rate', 1e-2, 3e-1, log=True),  # Learning rate
            'GB__max_depth': trial.suggest_int('GB__max_depth', 3, 8),  # Individual tree depth
            'GB__subsample': trial.suggest_float('GB__subsample', 0.7, 0.9)  # Subsample fraction
        },
        'Ada': {
            'Ada__n_estimators': trial.suggest_int('Ada__n_estimators', 50, 200, step=25),  # Number of weak learners
            'Ada__learning_rate': trial.suggest_float('Ada__learning_rate', 0.5, 1.5),  # Learning rate
            'Ada__algorithm': trial.suggest_categorical('Ada__algorithm', ['SAMME', 'SAMME.R'])  # AdaBoost algorithm
        },
        'Bag': {
            'Bag__n_estimators': trial.suggest_int('Bag__n_estimators', 50, 200, step=25),  # Number of estimators
            'Bag__max_samples': trial.suggest_float('Bag__max_samples', 0.6, 0.9),  # Sample fraction
            'Bag__max_features': trial.suggest_float('Bag__max_features', 0.7, 1.0)  # Feature fraction
        }
    }
    return params.get(name, {})

**5. Defining the object function and metrics to optimize**

Let`s define which metrics to optimize in the training search

[Description]
[Whys]
[What]
[When]

In [37]:
# Multi-metric objective function for AML/KYT systems
def aml_objective(y_true, y_pred, y_proba=None):
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    auc = roc_auc_score(y_true, y_proba) if y_proba is not None and len(np.unique(y_true)) > 1 else 0.5
    
    # Weighted combination: Recall(35%) + Precision(25%) + F1(25%) + AUC(15%)
    return 0.35*recall + 0.25*precision + 0.25*f1 + 0.15*auc

# Create sklearn-compatible scorer for cross_val_score
def aml_scorer_func(estimator, X, y):
    y_pred = estimator.predict(X)
    try:
        y_proba = estimator.predict_proba(X)[:, 1]
    except:
        y_proba = None
    return aml_objective(y, y_pred, y_proba)

aml_scorer = make_scorer(aml_scorer_func, greater_is_better=True)

**6. Execute the optimized training**

Let's execute the training phase using the random grid search with cross validation and rank the best models by accuracy.

[Description]
[Whys]
[What]
[When]

In [38]:
# Optuna Bayesian Optimization with cross_val_score
op_n_trials = 20  # PARAMETER: Trials per model
op_n_jobs = 2  # PARAMETER: Parallel jobs for cross_val_score
optimized_models = []
results = []

print("🚀 Optuna + cross_val_score Model Optimization")
print("-" * 70)

for name, pipe in pipelines:
    print(f"Training {name}...", end=" ")
    def objective(trial):
        params = suggest_params(trial, name)
        if params: pipe.set_params(**params)
        scores = cross_val_score(pipe, X_train, y_train, 
                                 cv=cv, 
                                 scoring=aml_scorer, 
                                 n_jobs=op_n_jobs)
        return scores.mean()
    study = optuna.create_study(direction='maximize', 
                                sampler=TPESampler(seed=random_seed),
                                pruner=MedianPruner())
    study.optimize(objective, n_trials=op_n_trials, show_progress_bar=True)
    
    # Apply best parameters
    if study.best_params:
        pipe.set_params(**study.best_params)
    
    best_score = study.best_value or 0.0
    n_completed = len([t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE])
    
    print(f"✅ {best_score:.4f} [{n_completed} trials]")
    
    optimized_models.append((name, pipe))
    results.append(best_score)

# Fixed visualization
fig, ax = plt.subplots(figsize=(12, 6))
names = [name for name, _ in optimized_models]
bars = ax.bar(names, results, color='skyblue', edgecolor='black', alpha=0.7)
ax.set_ylabel('Combined AML Score')
ax.set_title('Optuna Model Performance (Multi-Metric Objective)')
ax.grid(axis='y', alpha=0.3)

# Add score labels
for bar, score in zip(bars, results):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
            f'{score:.3f}', ha='center', va='bottom', fontweight='bold')

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"\n🎯 Best performing model: {names[np.argmax(results)]} ({max(results):.4f})")

[I 2025-09-21 17:42:35,292] A new study created in memory with name: no-name-70f26644-78d7-4b8f-8871-100cec9063dd


🚀 Optuna + cross_val_score Model Optimization
----------------------------------------------------------------------
Training LR... 

  0%|          | 0/20 [00:00<?, ?it/s]

[W 2025-09-21 17:42:35,546] Trial 0 failed with parameters: {'LR__C': 0.0015309100761935702, 'LR__solver': 'lbfgs', 'LR__penalty': 'none', 'LR__max_iter': 1000, 'LR__tol': 0.00010994048924701591, 'KNN__n_neighbors': 15, 'KNN__weights': 'distance', 'KNN__metric': 'manhattan', 'KNN__p': 2, 'CART__max_depth': 16, 'CART__min_samples_split': 30, 'CART__min_samples_leaf': 20, 'CART__criterion': 'entropy', 'NB__var_smoothing': 4.065685518531922e-09, 'SVM__C': 3.234473744318124, 'SVM__kernel': 'sigmoid', 'SVM__gamma': 'scale', 'RF__n_estimators': 350, 'RF__max_depth': 24, 'RF__min_samples_split': 13, 'RF__min_samples_leaf': 7, 'RF__max_features': 'sqrt', 'RF__bootstrap': False, 'ET__n_estimators': 200, 'ET__max_depth': 12, 'ET__min_samples_split': 15, 'ET__min_samples_leaf': 4, 'ET__max_features': 'sqrt', 'ET__bootstrap': True, 'GB__n_estimators': 275, 'GB__learning_rate': 0.03905637462116368, 'GB__max_depth': 7, 'GB__subsample': 0.8102200417708098, 'Ada__n_estimators': 100, 'Ada__learning_rat

ValueError: 
All the 2 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/pipeline.py", line 663, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/base.py", line 1358, in wrapper
    estimator._validate_params()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/base.py", line 471, in _validate_params
    validate_parameter_constraints(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self._parameter_constraints,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        self.get_params(deep=False),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        caller_name=self.__class__.__name__,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
    ...<2 lines>...
    )
sklearn.utils._param_validation.InvalidParameterError: The 'penalty' parameter of LogisticRegression must be a str among {'elasticnet', 'l2', 'l1'} or None. Got 'none' instead.

--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/pipeline.py", line 663, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/base.py", line 1358, in wrapper
    estimator._validate_params()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/base.py", line 471, in _validate_params
    validate_parameter_constraints(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self._parameter_constraints,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        self.get_params(deep=False),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        caller_name=self.__class__.__name__,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/raphael-pizzaia/miniconda/envs/venv-analytics/lib/python3.13/site-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
    ...<2 lines>...
    )
sklearn.utils._param_validation.InvalidParameterError: The 'penalty' parameter of LogisticRegression must be a str among {'l2', 'l1', 'elasticnet'} or None. Got 'none' instead.


**7. Save all resulting models**

Let's save all models

[Description]
[Whys]
[What]
[When]

In [None]:
try:
    # Try to save models from training
    folderDir = "./models/mvp-kyt-sup-main-v2"
    os.makedirs(folderDir, exist_ok=True)
    for name, pipe in optimized_models:
        joblib.dump(pipe, f"{folderDir}/{name}.pkl", compress=True)
    print(f"💾 Saved {len(optimized_models)} models: {[name for name, _ in optimized_models]}")
except NameError:
    # Load models if optimized_models doesn't exist
    optimized_models = []
    if os.path.exists(folderDir):
        for file in os.listdir(folderDir):
            if file.endswith('.pkl'):
                name = file.replace('.pkl', '')
                pipe = joblib.load(f"{folderDir}/{file}")
                optimized_models.append((name, pipe))
        print(f"📁 Loaded {len(optimized_models)} models: {[name for name, _ in optimized_models]}")
    else:
        print("❌ No models found")

**8. Validate all models and select the best models**

Let's validate and select the best models

[Description]
[Whys]
[What]
[When]

In [None]:
# Select best 3 models based on test accuracy
test_results = []
for name, pipe in optimized_models:
    y_pred = pipe.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    test_results.append((name, accuracy))
    print(f"{name}: Test Accuracy = {accuracy:.4f}")

# Comparison Boxplot
fig = plt.figure(figsize=(25,6))
fig.suptitle('Models Comparison') 
ax = fig.add_subplot(111) 
plt.boxplot([x[1] for x in test_results]) 
ax.set_xticklabels([x[0] for x in test_results])
plt.show()

# Sort by test accuracy and get top 3
test_results.sort(key=lambda x: x[2], reverse=True)
best_optimized_models = [(name, model) for name, model, acc in test_results[:3]]

print(f"\n🏆 Final top 3 models: {[name for name, _ in best_optimized_models]}")
print(f"📊 Test accuracies: {[f'{acc:.4f}' for _, _, acc in test_results[:3]]}")


**9. Use best models to predict unknown data**

Let`s apply the model into unknown data

[Description]
[Whys]
[What]
[When]

In [None]:
# Apply best model to unlabeled data
best_model = best_optimized_models[0][1]  # Champion model
X_unlabeled = df_unlabeled.drop(['class', 'txId'], axis=1)

predictions = best_model.predict(X_unlabeled)
probabilities = best_model.predict_proba(X_unlabeled)[:, 1]
display(predictions)
print(f"🔮 Predictions on {len(X_unlabeled)} unlabeled transactions")
#print(f"Illicit predictions: {sum(predictions)} ({sum(predictions)/len(predictions)*101:.1f}%)")
print(f"Risk scores range: {probabilities.min():.3f} - {probabilities.max():.3f}")

### Conclusions

[conclusion]