# Phenotypic traits prediction with pretrained Bacformer tutorial

This tutorial outlines how one can use genome embeddings from pretrained Bacformer to predict phenotypic labels.

We provide a dataset containing precomputed genome embeddings and 139 diverse phenotypic labels. We show how to train and evaluate
a phenotype predictor on top of a genome embedding using a simple logistic regression model.

If your task is highly challenging or requires embedding genomes, we recommend following `finetune_phenotypic_traits_prediction_tutorial.ipynb`, which outlines how to finetune the entire
Bacformer model for phenotypic traits prediction.

Before you start, make sure you have the [datasets](https://huggingface.co/docs/datasets/en/installation) and [scikit-learn](https://scikit-learn.org/stable/install.html) packages installed.

## Step 1: Import required dependencies

In [38]:
import numpy as np
import pandas as pd
from datasets import load_dataset

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

## Step 2: Load the dataset and subset to a selected phenotype

We load the precomputed genome embeddings together with the phenotypic trait labels (139 of them).

As an example, we select the `gideon_Catalase` phenotype. The trait being predicted is `Catalase`, which denotes whether a bacterium produces the catalase enzyme that breaks down hydrogen peroxide (H₂O₂) into water and oxygen, thereby protecting the cell from oxidative stress. The `gideon` prefix denotes the source of the phenotype [1].

Predicting `gideon_Catalase` is a binary classification problem: the labels `+` and `-` are mapped to `1` and `0` below.

In [39]:
# load the dataset and convert to pandas DF
df = load_dataset("macwiatrak/bacformer-genome-embeddings-with-phenotypic-traits-labels", split="train").to_pandas()

# select the phenotype
phenotype = "gideon_Catalase"
# remove the genomes with NaN values for the phenotype of interest
phenotype_df = df[df[phenotype].notna()].copy()

# get features matrix X and the label vector Y
X = np.vstack(phenotype_df["bacformer_genome_embedding"].to_list())
y = phenotype_df[phenotype].map({'+': 1, '-': 0}).values
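
Before moving on, it can help to confirm that the filtering and label mapping behave as expected. The sketch below repeats the same steps on a tiny made-up DataFrame (the embeddings and labels are invented for illustration; only the column names follow the tutorial) and then counts the classes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: 4 genomes with 3-dimensional embeddings.
# Column names mirror the tutorial; all values are made up.
toy_df = pd.DataFrame(
    {
        "bacformer_genome_embedding": [
            np.array([0.1, 0.2, 0.3]),
            np.array([0.4, 0.5, 0.6]),
            np.array([0.7, 0.8, 0.9]),
            np.array([1.0, 1.1, 1.2]),
        ],
        "gideon_Catalase": ["+", "-", None, "+"],
    }
)

phenotype = "gideon_Catalase"
# Drop genomes without a label for this phenotype.
toy_phenotype_df = toy_df[toy_df[phenotype].notna()].copy()

X_toy = np.vstack(toy_phenotype_df["bacformer_genome_embedding"].to_list())
y_toy = toy_phenotype_df[phenotype].map({"+": 1, "-": 0}).to_numpy()

# Any label other than '+' or '-' would become NaN after .map(); fail early if so.
assert not np.isnan(y_toy.astype(float)).any()

# Class balance, useful context when interpreting AUROC later.
class_counts = pd.Series(y_toy).value_counts().to_dict()
print(X_toy.shape, class_counts)
```

Running the same two checks on the real `X` and `y` would show how many genomes carry a label for the chosen phenotype and how balanced the two classes are.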

## Step 3: Perform train / val / test split

Perform a stratified train / val / test split with a `60 / 20 / 20` ratio. We use the validation set for the hyperparameter search.


In [40]:
# ------------------------------------------------------------------
# Step 3: 60 / 20 / 20 stratified split (train → 0.6, val → 0.2, test → 0.2)
# ------------------------------------------------------------------
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val,
    test_size=0.25,  # 0.25 × 0.80 = 0.20
    random_state=42,
    stratify=y_train_val
)
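
The two-stage split relies on the arithmetic 0.25 × 0.80 = 0.20 to arrive at a 60 / 20 / 20 partition. As a rough check, the sketch below applies the same two calls to synthetic data (1000 random rows with a 30% positive rate, both chosen only for illustration) and reports the resulting set sizes and positive rates.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, 30% positives (illustrative values only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = np.array([1] * 300 + [0] * 700)

# Stage 1: hold out 20% as the test set.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
# Stage 2: 25% of the remaining 80% becomes the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=42, stratify=y_tv
)

sizes = (len(y_tr), len(y_va), len(y_te))
pos_rates = tuple(float(y.mean()) for y in (y_tr, y_va, y_te))
print("train / val / test sizes:", sizes)
print("positive rate per split:", pos_rates)
```

Because both calls pass `stratify`, the positive rate should stay close to the overall rate in every split.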

## Step 4: Perform hyperparameter search on the validation set

In [41]:
# ------------------------------------------------------------------
# Step 4: Hyperparameter search on the validation set
# ------------------------------------------------------------------
param_grid = np.logspace(-4, 4, 9)      # 1e-4 … 1e4
best_auc, best_C, best_model = -np.inf, None, None

for C in param_grid:
    model = Pipeline(
        steps=[
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(
                C=C, solver="liblinear", max_iter=2000, penalty="l2"
            ))
        ]
    )
    model.fit(X_train, y_train)

    val_probs = model.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, val_probs)

    if auc > best_auc:
        best_auc, best_C, best_model = auc, C, model

print(f"Best C on validation: {best_C}  |  AUROC_val = {best_auc:.4f}")

Best C on validation: 0.1  |  AUROC_val = 0.9973
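
The manual loop above can also be expressed with scikit-learn's `GridSearchCV` and a `PredefinedSplit` that pins the validation rows to a single fold. This is an alternative sketch on synthetic data from `make_classification`, not the tutorial's own method; `refit=False` is set so that, as in the loop, nothing is refit on train + val.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the genome embeddings (illustrative only).
X_demo, y_demo = make_classification(n_samples=400, n_features=16, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# -1 marks rows that are always used for training; 0 marks the single validation fold.
test_fold = np.concatenate(
    [np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)]
)
X_all = np.vstack([X_tr, X_va])
y_all = np.concatenate([y_tr, y_va])

pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(solver="liblinear", max_iter=2000)),
    ]
)
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": np.logspace(-4, 4, 9)},
    scoring="roc_auc",
    cv=PredefinedSplit(test_fold),
    refit=False,  # only score the grid; do not refit on train + val
)
search.fit(X_all, y_all)

best_C = search.best_params_["clf__C"]
best_val_auc = search.best_score_
print(f"Best C: {best_C}  |  AUROC_val = {best_val_auc:.4f}")
```

With `refit=False` the search object holds no fitted model, so the chosen `C` would still need to be used to fit a final pipeline on the training set.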


## Step 5: Final evaluation on the held-out test set

In [42]:
# ------------------------------------------------------------------
# Step 5: Final evaluation on the held-out test set
# ------------------------------------------------------------------
test_probs = best_model.predict_proba(X_test)[:, 1]
test_auc  = roc_auc_score(y_test, test_probs)

print(f"AUROC_test = {test_auc:.4f}")

AUROC_test = 0.9837
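
A single AUROC value gives no sense of its uncertainty. One common way to estimate it is a percentile bootstrap over the test set. The sketch below is an optional addition, not part of the original tutorial: `bootstrap_auroc_ci` is a hypothetical helper, and the labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap interval for AUROC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        # Resample test examples with replacement.
        idx = rng.integers(0, n, size=n)
        # AUROC is undefined when a resample contains a single class; skip it.
        if np.unique(y_true[idx]).size < 2:
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Synthetic labels and scores standing in for y_test and test_probs.
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, size=300)
scores_demo = y_demo * 0.8 + rng.normal(scale=0.5, size=300)

point_auc = roc_auc_score(y_demo, scores_demo)
ci_low, ci_high = bootstrap_auroc_ci(y_demo, scores_demo, n_boot=200)
print(f"AUROC = {point_auc:.4f}  (95% CI: {ci_low:.4f} to {ci_high:.4f})")
```

In the tutorial's setting the call would take `y_test` and `test_probs` in place of the synthetic arrays.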


----------------------
#### Voilà, you made it 👏! 

There are 139 phenotypic traits to choose from and experiment with!
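
To try several traits in one go, the pipeline can be wrapped in a small function and called per phenotype column. The sketch below is only illustrative: `evaluate_phenotype` is a hypothetical helper, the trait columns `trait_a` and `trait_b` are synthetic, `C` is fixed rather than tuned on a validation set, and it assumes every trait is encoded as `+` / `-` like `gideon_Catalase`, which may not hold for all columns in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_phenotype(df, phenotype, C=0.1, seed=42):
    """Hypothetical helper: fit on 80% of labelled genomes, return test AUROC."""
    sub = df[df[phenotype].notna()]
    X = np.vstack(sub["bacformer_genome_embedding"].to_list())
    # Assumption: labels are encoded as '+' / '-'.
    y = sub[phenotype].map({"+": 1, "-": 0}).to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y
    )
    model = Pipeline(
        steps=[
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(C=C, solver="liblinear", max_iter=2000)),
        ]
    )
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


# Synthetic stand-in: 200 "genomes" with 8-dimensional embeddings and two made-up traits.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))
toy_df = pd.DataFrame(
    {
        "bacformer_genome_embedding": list(emb),
        "trait_a": np.where(emb[:, 0] > 0, "+", "-"),
        "trait_b": np.where(emb[:, 1] + emb[:, 2] > 0, "+", "-"),
    }
)

results = {p: evaluate_phenotype(toy_df, p) for p in ["trait_a", "trait_b"]}
for name, auc in results.items():
    print(f"{name}: AUROC_test = {auc:.4f}")
```

For a fair comparison across real traits, the validation-based search over `C` from Step 4 would need to be repeated per phenotype.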

If you run into problems or have questions, please open an issue on GitHub: https://github.com/macwiatrak/Bacformer/issues.

If your task is highly challenging and the precomputed genome embeddings do not yield sufficient performance, or if it requires embedding genomes, we recommend following `finetune_phenotypic_traits_prediction_tutorial.ipynb`, which outlines how to finetune the entire
Bacformer model for phenotypic traits prediction.

## References

[1] Weimann, Aaron, et al. "From genomes to phenotypes: Traitar, the microbial trait analyzer