After training a model, you can use the resulting out.yml file together with your input sequences to make predictions.

## IMPORTANT BEFORE YOU START

The out config contains absolute paths to files and directories. This notebook will therefore likely work if you run training and inference in the same environment (node, cluster, or machine), but it will most likely fail if you move the training results to a different machine to perform inference/predictions. Don't worry, though: there's an easy fix (see below).

In [1]:
from biotrainer.utilities import read_config_file
from biotrainer.inference import Inferencer

In [2]:
out_config_path = '../examples/residue_to_class/output/out.yml'
out_config = read_config_file(out_config_path)

Let's find out how well the model performs on the test set.

In [3]:
print(f"For the {out_config['model_choice']}, the metrics on the test set are:")
for metric, value in out_config['test_iterations_results']['metrics'].items():
    print(f"\t{metric} : {value}")

For the CNN, the metrics on the test set are:
	accuracy : 0.2857142984867096
	loss : 2.120563268661499


**Does the absolute path of the model look correct?**

As stated above, the out.yml file contains absolute paths to files and directories from the biotrainer run. If you move files between machines, these paths may break. To fix this, you just need to replace the beginning of each path stored in the out config with the location where the results are stored now. An example is provided below, but it needs to be adapted to **your local folder structure**!



In [4]:
print(f"Absolute path of biotrainer run output as from config: {out_config['output_dir']}")

Absolute path of biotrainer run output as from config: /mnt/home/cdallago/biotrainer/examples/residue_to_class/output
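Before rewriting anything, it can help to see which config entries point to locations that are missing on the current machine. The sketch below uses a hypothetical helper, `find_missing_paths`, and a stand-in dictionary instead of the real out config. It assumes that paths are stored as top-level string values; the key `working_dir` is made up for illustration.

```python
import os


def find_missing_paths(config: dict) -> dict:
    """Return top-level entries that look like absolute paths but do not exist here."""
    missing = {}
    for key, value in config.items():
        if isinstance(value, str) and os.path.isabs(value) and not os.path.exists(value):
            missing[key] = value
    return missing


# Stand-in config for illustration; in the notebook you would pass out_config instead.
example_config = {
    "model_choice": "CNN",                                 # not a path, ignored
    "output_dir": "/this/path/should/not/exist/output",    # absolute and missing
    "working_dir": os.path.abspath(os.sep),                # hypothetical key; the root always exists
}

for key, path in find_missing_paths(example_config).items():
    print(f"{key} points to a missing location: {path}")
```

If this reports `output_dir` for your own out config, the root path needs to be swapped as described above.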


In [5]:
new_output_path_root = "../examples/"
old_output_path_root = "/mnt/home/cdallago/biotrainer/examples/"

# Rewrite every top-level string value that contains the old root path
for key, value in out_config.items():
    if isinstance(value, str) and old_output_path_root in value:
        out_config[key] = value.replace(old_output_path_root, new_output_path_root)

In [6]:
print(f"Absolute path of biotrainer run output after swapping the root path: {out_config['output_dir']}")

Absolute path of biotrainer run output after swapping the root path: ../examples/residue_to_class/output
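The loop above only touches top-level string values. If your out config also stores paths inside nested dictionaries or lists (an assumption; check your own file), a recursive variant may be needed. The following is a minimal sketch with a hypothetical helper, `replace_path_root`, applied to a made-up nested config; the keys `num_epochs`, `files` and `log` are illustrative only.

```python
def replace_path_root(value, old_root: str, new_root: str):
    """Recursively replace old_root with new_root in strings, dicts and lists."""
    if isinstance(value, str):
        return value.replace(old_root, new_root)
    if isinstance(value, dict):
        return {k: replace_path_root(v, old_root, new_root) for k, v in value.items()}
    if isinstance(value, list):
        return [replace_path_root(v, old_root, new_root) for v in value]
    return value  # numbers, booleans and None are returned unchanged


# Made-up nested config for illustration; not the real structure of out.yml.
nested_config = {
    "output_dir": "/mnt/home/cdallago/biotrainer/examples/residue_to_class/output",
    "num_epochs": 200,
    "files": {"log": "/mnt/home/cdallago/biotrainer/examples/residue_to_class/output/log.txt"},
}

fixed = replace_path_root(
    nested_config,
    old_root="/mnt/home/cdallago/biotrainer/examples/",
    new_root="../examples/",
)
print(fixed["output_dir"])    # ../examples/residue_to_class/output
print(fixed["files"]["log"])  # ../examples/residue_to_class/output/log.txt
```

The helper returns a new structure rather than modifying the input, so the original config stays available if the replacement turns out to be wrong.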


First, we need to create embeddings for the sequences we are interested in:

In [7]:
from bio_embeddings.embed import ProtTransT5XLU50Embedder

In [8]:
embedder = ProtTransT5XLU50Embedder()

In [9]:
sequences = [
    "PROVTEIN",
    "SEQVENCESEQVENCE"
]

In [10]:
embeddings = embedder.embed_many(sequences)
# Note that for per-sequence embeddings, you would have to reduce the embeddings now:
# embeddings = [[embedder.reduce_per_protein(embedding)] for embedding in embeddings]
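To illustrate the difference between per-residue and per-sequence embeddings, the sketch below uses random NumPy arrays as stand-ins for real embeddings and reduces them by averaging over the residue axis. Both the mean pooling and the feature size of 1024 are assumptions made for illustration; consult the bio_embeddings documentation for what `reduce_per_protein` actually does for your embedder.

```python
import numpy as np

example_sequences = ["PROVTEIN", "SEQVENCESEQVENCE"]
rng = np.random.default_rng(0)

# Stand-ins for per-residue embeddings: one row per residue, 1024 features per row (assumed).
per_residue = [rng.normal(size=(len(seq), 1024)) for seq in example_sequences]

# Assumed reduction: average over the residue axis to get one vector per sequence.
per_sequence = [embedding.mean(axis=0) for embedding in per_residue]

for embedding, reduced in zip(per_residue, per_sequence):
    print(embedding.shape, "->", reduced.shape)
```

A per-residue embedding grows with sequence length, while the reduced embedding has the same size for every sequence.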

Next, we create an Inferencer object from the out config of our training run:

In [11]:
inferencer = Inferencer(**out_config)

In [12]:
predictions = inferencer.from_embeddings(embeddings)

We can now inspect the predictions:

In [13]:
for sequence, prediction in zip(sequences, predictions):
    print(sequence)
    print(prediction)

PROVTEIN
DVVDDDDD
SEQVENCESEQVENCE
DVCDVVDDDVVDVVDD
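Since this is a residue-to-class model, each prediction should contain one class label per residue. As a quick sanity check, the sketch below verifies this and tallies the labels per sequence. It copies the sequences and predictions from the output above so that it runs on its own; in the notebook you would use the existing `sequences` and `predictions` variables.

```python
from collections import Counter

# Copied from the cells above so the example is self-contained.
sequences = ["PROVTEIN", "SEQVENCESEQVENCE"]
predictions = ["DVVDDDDD", "DVCDVVDDDVVDVVDD"]

for sequence, prediction in zip(sequences, predictions):
    # One predicted label per residue is expected for residue-level tasks.
    if len(sequence) != len(prediction):
        raise ValueError(f"Length mismatch for {sequence}: {len(sequence)} vs {len(prediction)}")
    label_counts = Counter(prediction)
    print(sequence, dict(label_counts))
```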
