goal: train 10 machine learning predictors: one predictor for each function from the ten protein function categories ("DNA, RNA and nucleotide metabolism", "tail", "head and packaging", "other", "lysis", "connector", "transcription regulation", "moron, auxiliary metabolic gene and host takeover", "unknown function", "integration and excision") 

predictor input: protein features; output: labels (0/1) representing whether the protein serves the specific function

dataset: 360,413 seqs in total - 60% for training, 20% for validation, 20% for testing 
** subset the negatives (samples whose label is 0) to about 5-10 times the number of positives
** change: split 20% for testing first, then train:val = 8:2
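
The overall proportions of a two-stage split depend on whether the validation fraction is taken from the whole dataset or from what remains after the test split. A small sketch of the arithmetic, using a hypothetical helper `overall_fractions` that is not part of the notebook:

```python
def overall_fractions(test_size: float, val_of_remainder: float):
    """Return overall (train, val, test) fractions for a two-stage split.

    test_size: fraction held out first for testing.
    val_of_remainder: fraction of the remaining data used for validation.
    """
    remainder = 1.0 - test_size
    val = remainder * val_of_remainder
    train = remainder - val
    return round(train, 4), round(val, 4), round(test_size, 4)


# Test 20% first, then train:val = 8:2 on the remainder.
print(overall_fractions(0.2, 0.2))  # (0.64, 0.16, 0.2)

# Validation sized as 20% of the whole dataset: 0.2 / (1 - 0.2) = 0.25 of the remainder.
print(overall_fractions(0.2, 0.2 / (1 - 0.2)))  # (0.6, 0.2, 0.2)
```

With a group-aware splitter such as GroupShuffleSplit, these fractions apply to groups rather than individual sequences, so the realized sample proportions can differ from the nominal ones.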

-use clustering results to avoid splitting protein seqs in the same cluster (maybe use GroupShuffleSplit from sklearn)
** change: for each predictor, cluster the positive dataset (those with label "1") alone, and do nothing to the negative dataset. Use the clustering result when splitting so that seqs in the same positive cluster are never split across train/val/test. I already have pairwise alignment scores generated by diamond blastp in unique_diamond_results.daa; to get the clusters, keep hits with bitscore >= 100, load them as edges into a networkx graph, and take the connected components as the clusters.
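
As a dependency-free illustration of the clustering idea, the sketch below groups ids with a union-find structure, which yields the same groups as connected components on the thresholded graph. The function name `cluster_by_bitscore` and the `(query, subject, bitscore)` tuple layout are assumptions for illustration; the notebook itself uses networkx.

```python
from collections import defaultdict


def cluster_by_bitscore(hits, ids, cutoff=100.0):
    """Group ids into clusters linked by hits with bitscore >= cutoff.

    hits: iterable of (query_id, subject_id, bitscore) tuples (assumed layout).
    ids: the positive sequence ids to cluster; hits touching other ids are ignored.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for q, s, score in hits:
        if score >= cutoff and q in parent and s in parent:
            parent[find(q)] = find(s)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return list(clusters.values())


# Toy hits: a-b and b-c pass the cutoff, c-d does not, and "x" is not a positive id.
hits = [("a", "b", 250.0), ("b", "c", 120.0), ("c", "d", 40.0), ("e", "x", 300.0)]
clusters = cluster_by_bitscore(hits, ids=["a", "b", "c", "d", "e"])
print(sorted(sorted(c) for c in clusters))  # [['a', 'b', 'c'], ['d'], ['e']]
```

Note that single-linkage grouping like this is transitive: "a" and "c" share a cluster through "b" even without a direct hit between them.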

** change: add learning curve for training XGBoost

** change: try 100 thresholds from 0.01 to 1
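
A minimal sketch of such a threshold sweep on toy data, in plain Python. The labels, scores and the helper `mcc_at` are made up for illustration; the notebook builds its thresholds with `np.linspace(0.01, 1.0, 100)`.

```python
from math import sqrt


def mcc_at(y_true, y_score, threshold):
    """Matthews correlation coefficient after thresholding the scores."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# 100 thresholds: 0.01, 0.02, ..., 1.00
thresholds = [round(0.01 * k, 2) for k in range(1, 101)]

# Toy labels and predicted probabilities (illustrative only).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.05, 0.10, 0.30, 0.35, 0.60, 0.70, 0.80, 0.95]

# max() keeps the first maximum, so ties resolve to the lowest threshold.
best = max(thresholds, key=lambda t: mcc_at(y_true, y_score, t))
print(best, f"{mcc_at(y_true, y_score, best):.4f}")  # 0.31 0.7746

# At threshold 1.0 no score qualifies, so nothing is predicted positive and MCC is 0.
print(mcc_at(y_true, y_score, 1.0))  # 0.0
```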

the results are printed as text.  
** change: do not print them; save them to results/predictor/{function_name}.txt


dataset:  
2,318,538 seqs in the original dataset  
927,040 seqs after dropping pcat "unknown_no_hit"  
360,413 unique seqs after dropping duplicated seqs  

features: 1711-dim

subset the negatives;  
- for each function, cluster on the positive dataset  
- try threshold from 0.01 to 1  
- [o] change the dataset splitting strategy  
- [o] model param : is_imbalanced  
- add learning curve (can be obtained from XGBoost)
- n_estimators: increased to 10k; may need GPU
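
A hedged sketch of what a parameter dict for these items might look like. The label counts are hypothetical, and the GPU and early-stopping keys assume a recent XGBoost release (see the comments); adjust to the installed version.

```python
# Hypothetical label counts, for illustration only.
n_pos, n_neg = 1_000, 5_000

model_params = {
    "n_estimators": 10_000,
    "learning_rate": 0.05,
    # Weight positives by the negative:positive ratio to offset class imbalance.
    "scale_pos_weight": n_neg / n_pos,
    # Stop once the validation metric has not improved for 100 rounds
    # (accepted as a constructor argument in XGBoost >= 1.6; assumed version).
    "early_stopping_rounds": 100,
    # GPU training: "device" is the XGBoost >= 2.0 spelling (assumed version);
    # older releases used tree_method="gpu_hist" instead.
    "device": "cuda",
    "tree_method": "hist",
}
print(model_params["scale_pos_weight"])  # 5.0
```

With a large `n_estimators`, early stopping on the validation set keeps the number of trees actually used bounded by where the validation metric stops improving.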

# training script

In [1]:
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Dict, Any
import joblib
import os
import json

In [2]:
def get_df(function_name: str):
    ids = pd.read_csv(f"../dataset/pcat/{function_name}.csv")
    features = pd.read_parquet("../dataset/protein_features_unique_no_dipep_tripep.pa")
    features["label"] = features["id"].isin(ids["name"]).astype(int)
    # features = features.drop(columns=["md5"])
    return features

In [3]:
def subset_negatives(
    df: pd.DataFrame, label_col: str = "label", ratio: int = 5, random_state: int = 42
) -> pd.DataFrame:
    """
    Subset the negatives to be `ratio` times the number of positives.
    Args:
        df: DataFrame with a label column
        label_col: Name of the label column
        ratio: Negative:positive ratio
        random_state: For reproducibility
    Returns:
        Subsetted DataFrame
    """
    pos_df = df[df[label_col] == 1]
    neg_df = df[df[label_col] == 0]
    n_pos = len(pos_df)
    n_neg = min(len(neg_df), n_pos * ratio)
    neg_df = neg_df.sample(n=n_neg, random_state=random_state)
    return (
        pd.concat([pos_df, neg_df])
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )

In [4]:
def cluster_positives(
    df, bitscore_file, bitscore_cutoff=100, id_col="id", label_col="label"
):
    pos_ids = set(df[df[label_col] == 1][id_col])

    # Load diamond results (parquet)
    diamond = pd.read_parquet(bitscore_file)
    # If needed, rename columns here:
    # diamond = diamond.rename(columns={"col1": "qseqid", "col2": "sseqid", "col3": "bitscore"})

    # Filter for bitscore cutoff and only positive IDs
    diamond = diamond[
        (diamond["bitscore"] >= bitscore_cutoff)
        & (diamond["qseqid"].isin(pos_ids))
        & (diamond["sseqid"].isin(pos_ids))
    ]

    # Build graph
    G = nx.Graph()
    G.add_edges_from(diamond[["qseqid", "sseqid"]].itertuples(index=False, name=None))

    # Assign cluster IDs
    cluster_mapping = {}
    for i, component in enumerate(nx.connected_components(G)):
        for node in component:
            cluster_mapping[node] = f"cluster_{i}"

    # Assign unconnected positives to their own cluster
    for pid in pos_ids:
        if pid not in cluster_mapping:
            cluster_mapping[pid] = f"cluster_single_{pid}"

    return cluster_mapping

In [5]:
def prepare_data(
    df: pd.DataFrame,
    feature_cols: list,
    cluster_mapping: Dict[str, str],
    label_col: str = "label",
    id_col: str = "id",
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Prepare data for training by separating features and labels.

    Args:
        df: DataFrame containing features and labels
        feature_cols: List of feature column names
        cluster_mapping: Dictionary mapping sequence IDs to cluster IDs
        label_col: Name of the label column
        id_col: Name of the ID column

    Returns:
        X: Feature matrix
        y: Label array
        ids: Array of sequence IDs
        groups: Array of cluster IDs for each sequence
    """
    X = df[feature_cols].values
    y = df[label_col].values
    ids = df[id_col].values

    # Get cluster IDs for each sequence
    groups = np.array([cluster_mapping.get(str(id_), "unknown") for id_ in ids])

    return X, y, ids, groups

In [6]:
def split_data(
    X: np.ndarray,
    y: np.ndarray,
    ids: np.ndarray,
    groups: np.ndarray,
    test_size: float = 0.2,
    val_size: float = 0.2,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Split data into train, validation, and test sets while keeping related sequences together.

    Args:
        X: Feature matrix
        y: Label array
        ids: Array of sequence IDs
        groups: Array of cluster IDs for each sequence
        test_size: Proportion of data to use for testing
        val_size: Proportion of data to use for validation
    """
    # First split: separate test set
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
    train_val_idx, test_idx = next(gss.split(X, y, groups=groups))

    X_train_val, X_test = X[train_val_idx], X[test_idx]
    y_train_val, y_test = y[train_val_idx], y[test_idx]
    groups_train_val = groups[train_val_idx]

    # Second split: separate validation set from training set
    val_size_adjusted = val_size / (
        1 - test_size
    )  # Adjust val_size to account for test set
    gss = GroupShuffleSplit(n_splits=1, test_size=val_size_adjusted, random_state=42)
    train_idx, val_idx = next(
        gss.split(X_train_val, y_train_val, groups=groups_train_val)
    )

    X_train, X_val = X_train_val[train_idx], X_train_val[val_idx]
    y_train, y_val = y_train_val[train_idx], y_train_val[val_idx]

    return X_train, X_val, X_test, y_train, y_val, y_test


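# first split: test 20% (of groups)
# note: with the defaults, val_size_adjusted = 0.2 / (1 - 0.2) = 0.25, so the
# second split gives train:val = 75:25 (about 60/20/20 overall), not 8:2.
# for train:val = 8:2, use test_size=0.2 directly in the second GroupShuffleSplit.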

In [7]:
def train_model(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    model_params: Dict[str, Any] = None,
) -> Tuple[Any, StandardScaler, Dict[str, Any]]:
    """Train an XGBoost classifier with optional hyperparameters."""
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)

    # Set default parameters if none provided; otherwise copy so the
    # caller's dict is not mutated when scale_pos_weight is set below
    if model_params is None:
        model_params = {
            "n_estimators": 200,
            "max_depth": 6,
            "learning_rate": 0.1,
            "subsample": 0.8,
            "colsample_bytree": 0.8,
            "min_child_weight": 1,
            "scale_pos_weight": 1,
            "random_state": 42,
        }
    else:
        model_params = dict(model_params)

    # Calculate scale_pos_weight based on class imbalance
    n_neg = np.sum(y_train == 0)
    n_pos = np.sum(y_train == 1)
    model_params["scale_pos_weight"] = n_neg / n_pos

    # Train model
    model = XGBClassifier(eval_metric=["logloss", "auc"], **model_params)
    model.fit(
        X_train_scaled,
        y_train,
        eval_set=[(X_train_scaled, y_train), (X_val_scaled, y_val)],
        verbose=False,
        # early_stopping_rounds=50,
    )
    evals_result = model.evals_result()
    return model, scaler, evals_result

In [8]:
def evaluate_model(
    model: Any,
    scaler: StandardScaler,
    X: np.ndarray,
    y: np.ndarray,
    set_name: str = "",
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Evaluate model performance on a dataset."""

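    # Tree models do not need scaling in principle, but this model was fit on
    # scaled features, so the same scaler must be applied here.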

    X_scaled = scaler.transform(X)
    y_pred_proba = model.predict_proba(X_scaled)[:, 1]
    y_pred = (y_pred_proba >= threshold).astype(int)

    # Calculate MCC
    def matthews(y_true, y_pred):
        """Return (MCC, summary string of the confusion counts).

        P  = Total number of positives
        N  = Total number of negatives
        Tp = number of true positives
        Fp = number of false positives
        """
        from math import sqrt

        if isinstance(y_true, pd.Series):
            y_true = y_true.values

        P = len([x for x in y_true if x == 1])
        N = len([x for x in y_true if x == 0])

        Tp, Fp = 0, 0
        for i in range(len(y_true)):
            if y_true[i] == 1 and y_pred[i] == 1:
                Tp += 1
            elif y_true[i] == 0 and y_pred[i] == 1:
                Fp += 1

        Tn = N - Fp
        Fn = P - Tp

        try:
            mcc = (Tp * Tn - Fp * Fn) / sqrt(
                (Tn + Fn) * (Tn + Fp) * (Tp + Fn) * (Tp + Fp)
            )
        except ZeroDivisionError:
            mcc = 0
        return (mcc, f"P: {P:_} Tp: {Tp:_} Fp: {Fp:_} N: {N:_} Tn: {Tn:_} Fn: {Fn:_}")

    # Get MCC and confusion matrix values
    mcc, confusion_str = matthews(y, y_pred)

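    # zero_division=0 returns 0 without a warning when no sample is predicted
    # positive (e.g. at threshold=1.0)
    metrics = {
        f"{set_name}_accuracy": accuracy_score(y, y_pred),
        f"{set_name}_precision": precision_score(y, y_pred, zero_division=0),
        f"{set_name}_recall": recall_score(y, y_pred, zero_division=0),
        f"{set_name}_f1": f1_score(y, y_pred, zero_division=0),
        f"{set_name}_roc_auc": roc_auc_score(y, y_pred_proba),
        f"{set_name}_mcc": mcc,
    }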

    # Print metrics
    print(f"\nMetrics for {set_name} (threshold={threshold}):")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")

    # Print confusion matrix values
    print("\nConfusion Matrix Values:")
    print(confusion_str)

    return metrics

In [9]:
# def plot_feature_importance(model: Any, feature_cols: list, top_n: int = 20):
#     """Plot feature importance from XGBoost model."""
#     importance_scores = model.feature_importances_
#     indices = np.argsort(importance_scores)[::-1]

#     plt.figure(figsize=(12, 6))
#     plt.title("Feature Importances")
#     plt.bar(range(top_n), importance_scores[indices[:top_n]])
#     plt.xticks(range(top_n), [feature_cols[i] for i in indices[:top_n]], rotation=90)
#     plt.tight_layout()
#     plt.show()

#     # Print top N most important features
#     print(f"\nTop {top_n} most important features:")
#     for i in range(top_n):
#         print(f"{feature_cols[indices[i]]}: {importance_scores[indices[i]]:.4f}")

In [10]:
def save_model(
    model: Any, scaler: StandardScaler, feature_cols: list, function_name: str
):
    """Save the trained model and scaler."""
    # Create models directory if it doesn't exist
    os.makedirs("../models", exist_ok=True)

    # Create a dictionary containing all necessary components
    model_data = {"model": model, "scaler": scaler, "feature_cols": feature_cols}

    # Save the model data
    joblib.dump(model_data, f"../models/{function_name}_predictor.joblib")

In [11]:
def train_function_predictor(function_name: str, model_params: Dict[str, Any] = None):
    """Main training pipeline for a specific protein function."""
    # # Load cluster mapping
    # with open("../dataset/protein_cluster_mapping.json", "r") as f:
    #     cluster_mapping = json.load(f)

    # get df using function name
    df = get_df(function_name)
    df = subset_negatives(df, ratio=5)

    # Cluster positives using diamond results
    bitscore_file = "../dataset/unique_diamond_results.parquet"
    cluster_mapping = cluster_positives(df, bitscore_file, bitscore_cutoff=100)

    # Assign each negative its own dummy group so negatives are split freely
    neg_ids = df.loc[df["label"] == 0, "id"]
    cluster_mapping.update({nid: f"neg_{nid}" for nid in neg_ids})

    # Get feature columns (excluding 'id' and 'label')
    feature_cols = [col for col in df.columns if col not in ["id", "label"]]

    # Prepare data
    X, y, ids, groups = prepare_data(df, feature_cols, cluster_mapping)

    # Split data
    X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y, ids, groups)

    # Train model
    model, scaler, evals_result = train_model(
        X_train, y_train, X_val, y_val, model_params
    )

    train_metric = evals_result["validation_0"]["logloss"]
    val_metric = evals_result["validation_1"]["logloss"]

    plt.figure()
    plt.plot(train_metric, label="Train Logloss")
    plt.plot(val_metric, label="Val Logloss")
    plt.xlabel("Boosting Round")
    plt.ylabel("Logloss")
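    # Read n_estimators from the fitted model so this also works when
    # model_params is None
    n_estimators = model.get_params().get("n_estimators")
    plt.title(
        f"Learning Curve: {function_name}\n"
        f"n_estimators: {n_estimators}, no early stopping"
    )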
    plt.legend()
    plt.ylim(0, 1.0)
    os.makedirs(f"../results/predictor", exist_ok=True)
    plt.savefig(f"../results/predictor/{function_name}_learning_curve.png")
    plt.close()

    # Collect results in a list of strings
    results_lines = []

    # Feature importances
    importance_scores = model.feature_importances_
    indices = np.argsort(importance_scores)[::-1]
    results_lines.append("Top 20 most important features:")
    for i in range(20):
        results_lines.append(
            f"{feature_cols[indices[i]]}: {importance_scores[indices[i]]:.4f}"
        )
    results_lines.append("")

    # Train metrics
    train_metrics = evaluate_model(model, scaler, X_train, y_train, "train")
    results_lines.append("=== Training Set ===")
    for k, v in train_metrics.items():
        results_lines.append(f"{k}: {v}")
    results_lines.append("")

    # Validation metrics
    val_metrics = evaluate_model(model, scaler, X_val, y_val, "val")
    results_lines.append("=== Validation Set ===")
    for k, v in val_metrics.items():
        results_lines.append(f"{k}: {v}")
    results_lines.append("")

    # Test metrics for all thresholds
    results_lines.append("=== Test Set ===")
    thresholds = np.linspace(0.01, 1.0, 100)
    best_mcc = -float("inf")
    best_threshold = None

    for threshold in thresholds:
        test_metrics = evaluate_model(model, scaler, X_test, y_test, "test", threshold)
        results_lines.append(f"Threshold: {threshold:.2f}")
        for k, v in test_metrics.items():
            results_lines.append(f"{k}: {v}")
        results_lines.append("")
        # Track best MCC (selected on the test set, so this is an optimistic estimate)
        mcc = test_metrics["test_mcc"]
        if mcc > best_mcc:
            best_mcc = mcc
            best_threshold = threshold

    # Add the best MCC summary line
    results_lines.append(f"Best MCC: {best_mcc:.4f} at threshold {best_threshold:.2f}")
    results_lines.append("")

    # Save all results to file
    os.makedirs("../results/predictor", exist_ok=True)
    result_file = f"../results/predictor/{function_name}.txt"
    with open(result_file, "w") as f:
        f.write("\n".join(results_lines))

    # Save model
    save_model(model, scaler, feature_cols, function_name)

# execution

In [12]:
# example for testing early stopping
# import xgboost

# print("XGBoost version:", xgboost.__version__)

# from xgboost import XGBClassifier
# from sklearn.datasets import load_iris
# from sklearn.model_selection import train_test_split

# X, y = load_iris(return_X_y=True)
# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# model = XGBClassifier(n_estimators=10)
# model.fit(
#     X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=5, verbose=True
# )

In [13]:
# model_params = {
#     "n_estimators": 10000,
#     "max_depth": 8,
#     "learning_rate": 0.05,
#     "subsample": 0.7,
#     "colsample_bytree": 0.7,
#     "min_child_weight": 2,
#     "random_state": 42,
#     "early_stopping_rounds": 100,
# }

# train_function_predictor(
#     "moron_auxiliary_metabolic_gene_and_host_takeover", model_params
# )

In [14]:
function_names = [
    "lysis",
    "tail",
    "connector",
    "dna_rna_and_nucleotide_metabolism",
    "head_and_packaging",
    "other",
    "transcription_regulation",
    "moron_auxiliary_metabolic_gene_and_host_takeover",
    "unknown_function",
    "integration_and_excision",
]

model_params = {
    "n_estimators": 5000,
    "max_depth": 8,
    "learning_rate": 0.05,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
    "min_child_weight": 2,
    "random_state": 42,
    # "early_stopping_rounds": 100,
}

for function_name in function_names:
    train_function_predictor(function_name, model_params)


Metrics for train (threshold=0.5):
train_accuracy: 1.0000
train_precision: 1.0000
train_recall: 1.0000
train_f1: 1.0000
train_roc_auc: 1.0000
train_mcc: 1.0000

Confusion Matrix Values:
P: 15_298 Tp: 15_298 Fp: 0 N: 61_405 Tn: 61_405 Fn: 0

Metrics for val (threshold=0.5):
val_accuracy: 0.9568
val_precision: 0.8864
val_recall: 0.5972
val_f1: 0.7137
val_roc_auc: 0.9596
val_mcc: 0.7070

Confusion Matrix Values:
P: 2_026 Tp: 1_210 Fp: 155 N: 20_454 Tn: 20_299 Fn: 816
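
The validation metrics above can be cross-checked directly from the printed confusion counts. A short sketch using only the standard library and the same MCC formula as `matthews` in the training script:

```python
from math import sqrt

# Confusion counts copied from the validation output above.
tp, fp, tn, fn = 1_210, 155, 20_299, 816

# Same MCC formula as in the training script.
mcc = (tp * tn - fp * fn) / sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(f"mcc={mcc:.4f} precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
# mcc=0.7070 precision=0.8864 recall=0.5972 accuracy=0.9568
```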

Metrics for test (threshold=0.01):
test_accuracy: 0.9144
test_precision: 0.6517
test_recall: 0.7651
test_f1: 0.7039
test_roc_auc: 0.9429
test_mcc: 0.6571

Confusion Matrix Values:
P: 3_138 Tp: 2_401 Fp: 1_283 N: 20_451 Tn: 19_168 Fn: 737

Metrics for test (threshold=0.02):
test_accuracy: 0.9221
test_precision: 0.7011
test_recall: 0.7228
test_f1: 0.7118
test_roc_auc: 0.9429
test_mcc: 0.6668

Confusion Matrix Values:
P: 3_138 Tp: 2_268 Fp: 967 N: 20_451 Tn: 19_484 Fn: 870

Metrics for test (threshold=0.03):
test_accuracy: 0.9

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Metrics for test (threshold=1.0):
test_accuracy: 0.8649
test_precision: 0.0000
test_recall: 0.0000
test_f1: 0.0000
test_roc_auc: 0.9557
test_mcc: 0.0000

Confusion Matrix Values:
P: 2_464 Tp: 0 Fp: 0 N: 15_768 Tn: 15_768 Fn: 2_464

Metrics for train (threshold=0.5):
train_accuracy: 1.0000
train_precision: 1.0000
train_recall: 1.0000
train_f1: 1.0000
train_roc_auc: 1.0000
train_mcc: 1.0000

Confusion Matrix Values:
P: 65_751 Tp: 65_751 Fp: 1 N: 153_458 Tn: 153_457 Fn: 0

Metrics for val (threshold=0.5):
val_accuracy: 0.8485
val_precision: 0.9122
val_recall: 0.5447
val_f1: 0.6821
val_roc_auc: 0.9422
val_mcc: 0.6246

Confusion Matrix Values:
P: 21_777 Tp: 11_862 Fp: 1_142 N: 51_210 Tn: 50_068 Fn: 9_915

Metrics for test (threshold=0.01):
test_accuracy: 0.8284
test_precision: 0.6026
test_recall: 0.9148
test_f1: 0.7266
test_roc_auc: 0.9375
test_mcc: 0.6373

Confusion Matrix Values:
P: 17_003 Tp: 15_554 Fp: 10_258 N: 51_214 Tn: 40_956 Fn: 1_449

Metrics for test (threshold=0.02):
test_accur

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Metrics for test (threshold=1.0):
test_accuracy: 0.7848
test_precision: 0.0000
test_recall: 0.0000
test_f1: 0.0000
test_roc_auc: 0.9239
test_mcc: 0.0000

Confusion Matrix Values:
P: 16_141 Tp: 0 Fp: 0 N: 58_868 Tn: 58_868 Fn: 16_141

Metrics for train (threshold=0.5):
train_accuracy: 1.0000
train_precision: 1.0000
train_recall: 1.0000
train_f1: 1.0000
train_roc_auc: 1.0000
train_mcc: 1.0000

Confusion Matrix Values:
P: 19_094 Tp: 19_094 Fp: 0 N: 101_305 Tn: 101_305 Fn: 0

Metrics for val (threshold=0.5):
val_accuracy: 0.8537
val_precision: 0.8799
val_recall: 0.3116
val_f1: 0.4603
val_roc_auc: 0.8905
val_mcc: 0.4692

Confusion Matrix Values:
P: 8_436 Tp: 2_629 Fp: 359 N: 33_706 Tn: 33_347 Fn: 5_807

Metrics for test (threshold=0.01):
test_accuracy: 0.8446
test_precision: 0.5007
test_recall: 0.6654
test_f1: 0.5714
test_roc_auc: 0.8594
test_mcc: 0.4861

Confusion Matrix Values:
P: 6_217 Tp: 4_137 Fp: 4_125 N: 33_724 Tn: 29_599 Fn: 2_080

Metrics for test (threshold=0.02):
test_accuracy: 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Metrics for test (threshold=1.0):
test_accuracy: 0.8443
test_precision: 0.0000
test_recall: 0.0000
test_f1: 0.0000
test_roc_auc: 0.8594
test_mcc: 0.0000

Confusion Matrix Values:
P: 6_217 Tp: 0 Fp: 0 N: 33_724 Tn: 33_724 Fn: 6_217

Metrics for train (threshold=0.5):
train_accuracy: 1.0000
train_precision: 1.0000
train_recall: 1.0000
train_f1: 1.0000
train_roc_auc: 1.0000
train_mcc: 1.0000

Confusion Matrix Values:
P: 7_773 Tp: 7_773 Fp: 0 N: 44_765 Tn: 44_765 Fn: 0

Metrics for val (threshold=0.5):
val_accuracy: 0.9247
val_precision: 0.7892
val_recall: 0.4970
val_f1: 0.6099
val_roc_auc: 0.9270
val_mcc: 0.5894

Confusion Matrix Values:
P: 2_004 Tp: 996 Fp: 266 N: 14_920 Tn: 14_654 Fn: 1_008

Metrics for test (threshold=0.01):
test_accuracy: 0.8386
test_precision: 0.6984
test_recall: 0.6521
test_f1: 0.6744
test_roc_auc: 0.8988
test_mcc: 0.5679

Confusion Matrix Values:
P: 5_145 Tp: 3_355 Fp: 1_449 N: 14_925 Tn: 13_476 Fn: 1_790

Metrics for test (threshold=0.02):
test_accuracy: 0.8406
t

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Metrics for test (threshold=1.0):
test_accuracy: 0.7436
test_precision: 0.0000
test_recall: 0.0000
test_f1: 0.0000
test_roc_auc: 0.8988
test_mcc: 0.0000

Confusion Matrix Values:
P: 5_145 Tp: 0 Fp: 0 N: 14_925 Tn: 14_925 Fn: 5_145

Metrics for train (threshold=0.5):
train_accuracy: 1.0000
train_precision: 1.0000
train_recall: 1.0000
train_f1: 1.0000
train_roc_auc: 1.0000
train_mcc: 1.0000

Confusion Matrix Values:
P: 9_166 Tp: 9_166 Fp: 0 N: 47_865 Tn: 47_865 Fn: 0

Metrics for val (threshold=0.5):
val_accuracy: 0.8708
val_precision: 0.9098
val_recall: 0.3327
val_f1: 0.4872
val_roc_auc: 0.8620
val_mcc: 0.5029

Confusion Matrix Values:
P: 3_607 Tp: 1_200 Fp: 119 N: 15_948 Tn: 15_829 Fn: 2_407

Metrics for test (threshold=0.01):
test_accuracy: 0.8459
test_precision: 0.5369
test_recall: 0.5325
test_f1: 0.5347
test_roc_auc: 0.8121
test_mcc: 0.4424

Confusion Matrix Values:
P: 3_185 Tp: 1_696 Fp: 1_463 N: 15_977 Tn: 14_514 Fn: 1_489

Metrics for test (threshold=0.02):
test_accuracy: 0.8596

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Metrics for test (threshold=1.0):
test_accuracy: 0.8338
test_precision: 0.0000
test_recall: 0.0000
test_f1: 0.0000
test_roc_auc: 0.8121
test_mcc: 0.0000

Confusion Matrix Values:
P: 3_185 Tp: 0 Fp: 0 N: 15_977 Tn: 15_977 Fn: 3_185

Metrics for train (threshold=0.5):
train_accuracy: 1.0000
train_precision: 1.0000
train_recall: 1.0000
train_f1: 1.0000
train_roc_auc: 1.0000
train_mcc: 1.0000

Confusion Matrix Values:
P: 4_039 Tp: 4_039 Fp: 0 N: 22_240 Tn: 22_240 Fn: 0

Metrics for val (threshold=0.5):
val_accuracy: 0.9063
val_precision: 0.8673
val_recall: 0.5809
val_f1: 0.6958
val_roc_auc: 0.9407
val_mcc: 0.6610

Confusion Matrix Values:
P: 1_687 Tp: 980 Fp: 150 N: 7_458 Tn: 7_308 Fn: 707

Metrics for test (threshold=0.01):
test_accuracy: 0.8664
test_precision: 0.6039
test_recall: 0.8222
test_f1: 0.6963
test_roc_auc: 0.9219
test_mcc: 0.6253

Confusion Matrix Values:
P: 1_693 Tp: 1_392 Fp: 913 N: 7_397 Tn: 6_484 Fn: 301

Metrics for test (threshold=0.02):
test_accuracy: 0.8806
test_precis

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Metrics for train (threshold=0.5):
train_accuracy: 1.0000
train_precision: 1.0000
train_recall: 1.0000
train_f1: 1.0000
train_roc_auc: 1.0000
train_mcc: 1.0000

Confusion Matrix Values:
P: 4_323 Tp: 4_323 Fp: 0 N: 22_293 Tn: 22_293 Fn: 0

Metrics for val (threshold=0.5):
val_accuracy: 0.9332
val_precision: 0.9643
val_recall: 0.7614
val_f1: 0.8509
val_roc_auc: 0.9768
val_mcc: 0.8180

Confusion Matrix Values:
P: 2_485 Tp: 1_892 Fp: 70 N: 7_440 Tn: 7_370 Fn: 593

Metrics for test (threshold=0.01):
test_accuracy: 0.9264
test_precision: 0.5171
test_recall: 0.7943
test_f1: 0.6264
test_roc_auc: 0.9486
test_mcc: 0.6043

Confusion Matrix Values:
P: 627 Tp: 498 Fp: 465 N: 7_442 Tn: 6_977 Fn: 129

Metrics for test (threshold=0.02):
test_accuracy: 0.9363
test_precision: 0.5677
test_recall: 0.7560
test_f1: 0.6484
test_roc_auc: 0.9486
test_mcc: 0.6218

Confusion Matrix Values:
P: 627 Tp: 474 Fp: 361 N: 7_442 Tn: 7_081 Fn: 153

Metrics for test (threshold=0.03):
test_accuracy: 0.9437
test_precision:

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
