After training a model, you can use the out.yml file it produces, together with your input sequences, to make predictions.

## IMPORTANT BEFORE YOU START

The out config contains absolute paths to directories. This notebook will therefore likely work if you run it in the same environment (node, cluster or machine) as the training, but it will most likely fail if you move the training results to a different machine for inference. Don't worry, though: there is an easy fix (see below).

In [1]:
from biotrainer.utilities import read_config_file
from biotrainer.inference import Inferencer

In [2]:
out_config_path = '../residue_to_class/output/out.yml'
out_config = read_config_file(out_config_path)

Let's find out how well the model performs on the test set.

In [3]:
print(f"For the {out_config['model_choice']}, the metrics on the test set are:")
for metric in out_config['test_iterations_results']['metrics']:
    print(f"\t{metric} : {out_config['test_iterations_results']['metrics'][metric]}")

For the CNN, the metrics on the test set are:
	- f1_score class 0 : 0.0
	- f1_score class 1 : 0.0
	- f1_score class 2 : 0.0
	- f1_score class 3 : 0.0
	- f1_score class 4 : 0.0
	- precission class 0 : 0.0
	- precission class 1 : 0.0
	- precission class 2 : 0.0
	- precission class 3 : 0.0
	- precission class 4 : 0.0
	- recall class 0 : 0.0
	- recall class 1 : 0.0
	- recall class 2 : 0.0
	- recall class 3 : 0.0
	- recall class 4 : 0.0
	accuracy : 0.0
	loss : 1.617287278175354
	macro-f1_score : 0.0
	macro-precision : 0.0
	macro-recall : 0.0
	matthews-corr-coeff : -0.3000600337982178
	micro-f1_score : 0.0
	micro-precision : 0.0
	micro-recall : 0.0
	spearmans-corr-coeff : -0.14046210050582886
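
With this many entries, it can help to separate per-class metrics from aggregate ones before reading them. The following is a minimal sketch, assuming the metric keys follow the "name class k" pattern seen in the output above; the helper `split_metrics` and the toy dictionary are hypothetical and not part of biotrainer.

```python
def split_metrics(metrics):
    """Separate per-class entries (keys containing ' class ') from aggregate ones."""
    per_class, aggregate = {}, {}
    for name, value in metrics.items():
        if " class " in name:
            # "f1_score class 0" -> metric_name "f1_score", class_label "0"
            metric_name, class_label = name.split(" class ", 1)
            per_class.setdefault(class_label, {})[metric_name] = value
        else:
            aggregate[name] = value
    return per_class, aggregate


# Toy subset shaped like the output above (illustration only)
example_metrics = {
    "f1_score class 0": 0.0,
    "recall class 0": 0.0,
    "f1_score class 1": 0.0,
    "accuracy": 0.0,
    "loss": 1.617287278175354,
}

per_class, aggregate = split_metrics(example_metrics)
for name in sorted(aggregate):
    print(f"{name:<10} {aggregate[name]:.4f}")
for class_label in sorted(per_class):
    print(f"class {class_label}: {per_class[class_label]}")
```

With the real config, you would pass out_config['test_iterations_results']['metrics'] instead of the toy dictionary.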


**Does the absolute path of the model look correct?**

As stated above, the out.yml file contains absolute paths to files and directories from the biotrainer run. If you move files between machines, these paths may break. To fix this, you just need to replace the beginning of each path stored in the out config with the location where the results are stored now. An example is provided below, but it needs to be adapted to **your local folder structure**!



In [4]:
print(f"Absolute path of biotrainer run output as from config: {out_config['output_dir']}")

Absolute path of biotrainer run output as from config: /home/sebie/PycharmProjects/biotrainerFork/examples/residue_to_class/output


In [5]:
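# Adapt both values to your setup. The replacement only takes effect if
# old_output_path_root actually occurs in the paths stored in the out config.
new_output_path_root = "../examples/"
old_output_path_root = "/mnt/home/cdallago/biotrainer/examples/"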

for key in out_config:
    if isinstance(out_config[key], str) and old_output_path_root in out_config[key]:
        out_config[key] = out_config[key].replace(old_output_path_root, new_output_path_root)

In [6]:
print(f"Absolute path of biotrainer run output after swapping the root path: {out_config['output_dir']}")

Absolute path of biotrainer run output after swapping the root path: /home/sebie/PycharmProjects/biotrainerFork/examples/residue_to_class/output
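
In the run shown above, the path printed after the swap is identical to the one printed before it, which is what happens when old_output_path_root does not occur in the stored path. The sketch below illustrates the same idea on a toy dictionary. The helper `swap_path_root` and the example keys and paths are hypothetical. Unlike the cell above, it only rewrites values that *start* with the old root, which is a slightly stricter variant of the substring replacement.

```python
def swap_path_root(config, old_root, new_root):
    """Return a copy of config where string values starting with old_root use new_root."""
    updated = dict(config)
    for key, value in config.items():
        if isinstance(value, str) and value.startswith(old_root):
            updated[key] = new_root + value[len(old_root):]
    return updated


# Toy config (hypothetical values, not a real out.yml)
example_config = {
    "output_dir": "/old/machine/examples/residue_to_class/output",
    "model_choice": "CNN",
}

fixed = swap_path_root(example_config, "/old/machine/examples/", "../examples/")
print(fixed["output_dir"])

# A root that does not match leaves the config untouched
unchanged = swap_path_root(example_config, "/some/other/root/", "../examples/")
print(unchanged == example_config)
```

Comparing the config before and after the swap, as in the last two lines, is a quick way to notice that the old root was not adapted to your setup.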


First, we need to create embeddings for the sequences we are interested in.

In [7]:
from bio_embeddings.embed import OneHotEncodingEmbedder

In [8]:
embedder = OneHotEncodingEmbedder()

In [9]:
sequences = [
    "PROVTEIN",
    "SEQVENCESEQVENCE"
]

In [10]:
embeddings = list(embedder.embed_many(sequences))
# Note that for per-sequence embeddings, you would have to reduce the embeddings now:
# embeddings = [[embedder.reduce_per_protein(embedding)] for embedding in embeddings]
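
To make the reduction step concrete: a per-residue embedding has one vector per residue, while a per-sequence embedding is a single vector for the whole sequence. The sketch below assumes the reduction is a plain mean over the residue axis, which is a common choice; check your embedder's reduce_per_protein for what it actually does. The helper `reduce_by_mean` and the toy values are hypothetical.

```python
def reduce_by_mean(per_residue_embedding):
    """Average an L x D embedding over its L residues, giving one vector of length D."""
    length = len(per_residue_embedding)
    dims = len(per_residue_embedding[0])
    return [
        sum(row[d] for row in per_residue_embedding) / length
        for d in range(dims)
    ]


# Toy one-hot style embedding: 4 residues, 3 dimensions
toy_embedding = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

reduced = reduce_by_mean(toy_embedding)
print(reduced)  # [0.5, 0.25, 0.25]
```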

Next, we create an Inferencer object from the out config of our training run.

In [11]:
inferencer = Inferencer(**out_config)

Got 1 split(s): hold_out




In [12]:
predictions = inferencer.from_embeddings(embeddings, split_name="hold_out")

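We can now inspect the predictions. Note that the loop below pairs each sequence with one top-level value of the returned dictionary; in the printed output, the per-sequence predictions are the dictionary keyed by sequence index ('0', '1').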

In [13]:
for sequence, prediction in zip(sequences, predictions.values()):
    print(sequence)
    print(prediction)

PROVTEIN
None
SEQVENCESEQVENCE
{'0': 'FFFDFDFF', '1': 'FFEFFFFFDEFFFFEF'}
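
The printed dictionary maps a sequence index (as a string) to one predicted class letter per residue. A minimal sketch of pairing it with the input sequences follows. The dictionary is copied from the output above; how to pull it out of `predictions` depends on your biotrainer version and is not shown here. The helper `pair_with_predictions` is hypothetical.

```python
sequences = ["PROVTEIN", "SEQVENCESEQVENCE"]

# Copied from the printed output above
per_sequence_predictions = {"0": "FFFDFDFF", "1": "FFEFFFFFDEFFFFEF"}


def pair_with_predictions(sequences, predictions_by_index):
    """Pair each sequence with its prediction string, looked up by position index."""
    pairs = []
    for idx, sequence in enumerate(sequences):
        predicted = predictions_by_index[str(idx)]
        # Residue-level predictions should have one label per residue
        if len(predicted) != len(sequence):
            raise ValueError(f"Length mismatch for sequence {idx}")
        pairs.append((sequence, predicted))
    return pairs


pairs = pair_with_predictions(sequences, per_sequence_predictions)
for sequence, predicted in pairs:
    print(sequence)
    print(predicted)
```

The length check is a cheap guard against accidentally pairing a sequence with the wrong prediction.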


**If your model uses dropout, you can also use inferencer.from_embeddings_with_monte_carlo_dropout to obtain predictions with Monte Carlo dropout, a method for quantifying the uncertainty of your model's predictions.**

In [14]:
predictions_mcd = inferencer.from_embeddings_with_monte_carlo_dropout(embeddings, n_forward_passes=30, confidence_level=0.05, split_name="hold_out")
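
To illustrate what the mean and the confidence bounds represent, here is a minimal sketch for a single class probability of a single residue. It assumes the bounds are a normal-approximation interval around the mean of the forward passes and that confidence_level=0.05 denotes a two-sided 95% interval; biotrainer's exact computation may differ. The helper `mcd_summary` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mcd_summary(samples, confidence_level=0.05):
    """Mean and normal-approximation confidence bounds over stochastic forward passes."""
    n = len(samples)
    center = mean(samples)
    # Two-sided critical value, about 1.96 for confidence_level=0.05
    z = NormalDist().inv_cdf(1 - confidence_level / 2)
    # Standard error of the mean across the forward passes
    margin = z * stdev(samples) / math.sqrt(n)
    return center, center - margin, center + margin


# Toy probabilities for one class from six forward passes with dropout active
toy_samples = [0.18, 0.20, 0.19, 0.21, 0.17, 0.19]

mcd_mean, lower, upper = mcd_summary(toy_samples)
print(f"mean={mcd_mean:.4f}, lower={lower:.4f}, upper={upper:.4f}")
```

Under these assumptions, more forward passes shrink the interval, because the standard error scales with one over the square root of the number of passes.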

In [16]:
# Show predictions for first sequence:
for idx, residue in enumerate(sequences[0]):
    print(f"Residue: {residue}, MCD Prediction: {predictions_mcd['0'][idx]}")
    # prediction: Class prediction based on the mean over 30 forward passes
    # mcd_mean: Average over 30 forward passes
    # mcd_lower_bound: Lower bound of confidence interval using normal distribution with the given confidence level
    # mcd_upper_bound: Upper bound of confidence interval using normal distribution with the given confidence level

Residue: P, MCD Prediction: {'prediction': 'F', 'mcd_mean': tensor([0.1800, 0.2024, 0.2106, 0.2173, 0.1897]), 'mcd_lower_bound': tensor([0.1788, 0.2015, 0.2093, 0.2159, 0.1884]), 'mcd_upper_bound': tensor([0.1812, 0.2033, 0.2118, 0.2188, 0.1910])}
Residue: R, MCD Prediction: {'prediction': 'F', 'mcd_mean': tensor([0.1844, 0.2049, 0.2002, 0.2186, 0.1919]), 'mcd_lower_bound': tensor([0.1831, 0.2037, 0.1988, 0.2174, 0.1907]), 'mcd_upper_bound': tensor([0.1858, 0.2062, 0.2015, 0.2198, 0.1931])}
Residue: O, MCD Prediction: {'prediction': 'F', 'mcd_mean': tensor([0.1966, 0.2042, 0.1946, 0.2105, 0.1940]), 'mcd_lower_bound': tensor([0.1954, 0.2028, 0.1932, 0.2092, 0.1928]), 'mcd_upper_bound': tensor([0.1979, 0.2056, 0.1961, 0.2119, 0.1952])}
Residue: V, MCD Prediction: {'prediction': 'D', 'mcd_mean': tensor([0.1924, 0.2105, 0.2058, 0.2018, 0.1894]), 'mcd_lower_bound': tensor([0.1909, 0.2078, 0.2035, 0.2000, 0.1880]), 'mcd_upper_bound': tensor([0.1940, 0.2132, 0.2082, 0.2036, 0.1908])}
Residue: