# Using biotrainer-autoeval for pLM evaluation

This notebook shows how to use the biotrainer `autoeval` module to evaluate a protein language model (pLM) automatically. We use the [DWT](https://github.com/Rostlab/dwt) framework, which provides curated, easily interpretable datasets for pLM benchmarking.

In [None]:
# Install biotrainer if you haven't
# !pip install biotrainer

In [None]:
import torch

# Define variables
embedder_name = "Rostlab/prot_t5_xl_uniref50"  # Replace with your pLM's Hugging Face id. For alternatives, see the "Advanced options" sections below
framework = "dwt"
min_seq_length = 0  # Default
max_seq_length = 2000  # Default

In [None]:
# Run the pipeline
from biotrainer.autoeval import autoeval_pipeline

current_progress = None
# Use the embedder defined above; "one_hot_encoding" is a lightweight alternative for a quick test run
for progress in autoeval_pipeline(embedder_name=embedder_name,
                                  framework=framework,
                                  min_seq_length=min_seq_length,
                                  max_seq_length=max_seq_length):
    print(progress)  # The pipeline is a generator function to inform the user about the current progress.
    current_progress = progress

In [None]:
# Let's look at the results
if current_progress is None or current_progress.final_report is None:
    print("No results found.")  # Something went wrong
else:
    final_report = current_progress.final_report
    scl_results = final_report["results"]["DWT-scl"]["test_results"]
    sec_struct_results = final_report["results"]["DWT-secondary_structure"]["test_results"]

    print(f"DWT-scl results: {scl_results}\n")
    print(f"DWT-secondary_structure results: {sec_struct_results}\n")
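
To look at all tasks at once instead of indexing each one by name, the report can be walked generically. The sketch below assumes only the nesting used above (`final_report["results"][task_name]["test_results"]`). The `mock_report` dictionary, its metric names and values, and the helper `collect_test_results` are hypothetical stand-ins for illustration, not real biotrainer output.

```python
def collect_test_results(report):
    """Return a {task_name: test_results} mapping for every task in the report."""
    collected = {}
    for task_name, task_report in report.get("results", {}).items():
        # Skip tasks without test results instead of raising a KeyError
        if "test_results" in task_report:
            collected[task_name] = task_report["test_results"]
    return collected


# Hypothetical report with made-up values, used only to illustrate the structure
mock_report = {
    "results": {
        "DWT-scl": {"test_results": {"accuracy": 0.5}},
        "DWT-secondary_structure": {"test_results": {"accuracy": 0.25}},
        "unfinished-task": {},
    }
}

for task_name, test_results in collect_test_results(mock_report).items():
    print(f"{task_name} results: {test_results}")
```

With a real run, `collect_test_results(final_report)` would take the place of the mock. The helper does not inspect the contents of `test_results`, so it should work whatever metrics the report contains.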

## Advanced options 1: Using a custom embedding function

If you run biotrainer-autoeval directly after training your model, the model is probably available only locally and not on Hugging Face. In that case, you can provide custom embedding functions for both per-sequence and per-residue embeddings, which makes the pipeline independent of the biotrainer embedding module. Each function takes a list of sequences (strings) as input and must return, for each sequence, the sequence together with its embedding. This ensures that every sequence is always mapped to the correct embedding.

In [None]:
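embedding_dim = 1024  # Placeholder: set this to your model's embedding dimension

def custom_embedding_function_per_sequence(sequences):
    # Generator: yields one (sequence, embedding) pair per input sequence
    for seq in sequences:
        yield seq, torch.zeros(embedding_dim)  # TODO Replace with your model's per-sequence embedding

def custom_embedding_function_per_residue(sequences):
    for seq in sequences:
        yield seq, torch.zeros(len(seq), embedding_dim)  # TODO Replace with your model's per-residue embeddings (assumed: one row per residue)
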
for progress in autoeval_pipeline(embedder_name=embedder_name,
                                  framework=framework,
                                  custom_embedding_function_per_sequence=custom_embedding_function_per_sequence,
                                  custom_embedding_function_per_residue=custom_embedding_function_per_residue,
                                  min_seq_length=min_seq_length,
                                  max_seq_length=max_seq_length):
    print(progress)  # The pipeline is a generator function to inform the user about the current progress.
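
Before starting a full pipeline run, it may be worth checking a custom embedding function on a few sequences. The following is a minimal sketch of such a check, based only on the contract described above: one `(sequence, embedding)` pair per input sequence. `check_embedding_function` is a hypothetical helper, not part of biotrainer, and the assumption that a per-residue embedding has one row per residue should be verified against your model. The toy function returns plain Python lists so the sketch runs without a model; `len()` behaves the same way on the first dimension of a torch tensor.

```python
def check_embedding_function(embedding_function, sequences, per_residue):
    """Raise a ValueError if the function violates the (sequence, embedding) contract."""
    # Collect the yielded (sequence, embedding) pairs, keyed by sequence
    results = dict(embedding_function(sequences))
    missing = set(sequences) - set(results)
    if missing:
        raise ValueError(f"No embedding returned for: {sorted(missing)}")
    unexpected = set(results) - set(sequences)
    if unexpected:
        raise ValueError(f"Embeddings returned for unknown sequences: {sorted(unexpected)}")
    if per_residue:
        for seq, embedding in results.items():
            # Assumption: per-residue embeddings have one row per residue
            if len(embedding) != len(seq):
                raise ValueError(f"Expected {len(seq)} rows for {seq!r}, got {len(embedding)}")
    return True


def toy_per_residue_function(sequences):
    # Toy stand-in for a model: one 2-dimensional zero vector per residue
    for seq in sequences:
        yield seq, [[0.0, 0.0] for _ in seq]


print(check_embedding_function(toy_per_residue_function, ["MKT", "MKTAYIAK"], per_residue=True))
```

The check compares sets rather than positions, because each embedding is returned together with its sequence and the order of the results should therefore not matter.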

## Advanced options 2: Using precomputed embeddings files

Alternatively, you can use precomputed embeddings files, for example if you have already computed them. Make sure that the files contain embeddings for all framework sequences and that each embedding is stored under its sequence hash, following biotrainer conventions.

In [None]:
from pathlib import Path
from biotrainer.autoeval import get_unique_framework_sequences

_, per_residue_seqs, per_sequence_seqs = get_unique_framework_sequences(framework=framework,
                                                                        min_seq_length=min_seq_length,
                                                                        max_seq_length=max_seq_length)
# per_residue_seqs and per_sequence_seqs are dictionaries that map sequence hashes to BiotrainerSequenceRecord objects.
# Use each hash as the id when storing the corresponding embedding.

per_residue_path = Path()  # TODO Your per-residue embeddings path
per_sequence_path = Path()  # TODO Your per-sequence embeddings path
for progress in autoeval_pipeline(embedder_name=embedder_name,
                                  framework=framework,
                                  precomputed_per_residue_embeddings=per_residue_path,
                                  precomputed_per_sequence_embeddings=per_sequence_path,
                                  min_seq_length=min_seq_length,
                                  max_seq_length=max_seq_length):
    print(progress)  # The pipeline is a generator function to inform the user about the current progress.
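
Since the files must contain an embedding for every framework sequence, it can help to compare the ids stored in your files against the expected hashes before starting a run. The sketch below shows that comparison with toy data. `find_missing_hashes` is a hypothetical helper and the hash strings are made up. Reading the stored ids out of your embeddings files is deliberately left out, because it depends on the file format.

```python
def find_missing_hashes(expected_hashes, stored_ids):
    """Return the expected sequence hashes that have no stored embedding, sorted."""
    return sorted(set(expected_hashes) - set(stored_ids))


# Toy stand-ins: in the notebook, the expected hashes would be per_residue_seqs.keys()
# (or per_sequence_seqs.keys()), and the stored ids would come from your embeddings file.
expected = {"hash_a": "record a", "hash_b": "record b", "hash_c": "record c"}
stored = ["hash_a", "hash_c"]

missing = find_missing_hashes(expected.keys(), stored)
print(f"{len(missing)} embedding(s) missing: {missing}")
```

An empty result means that every expected hash has a stored counterpart. It does not verify that the stored embeddings themselves are correct.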