#### In this notebook, we extract annotations from SeqVec embeddings using trained models that predict secondary structure and subcellular localization

In [1]:
from bio_embeddings.embed import SeqVecEmbedder

We initialize the SeqVec embedder with the model files, which were previously [downloaded from here](http://maintenance.dallago.us/public/embeddings/embedding_models/seqvec/).

In [2]:
embedder = SeqVecEmbedder(weights_file="/mnt/nfs/models/seqvec/weights_file", options_file="/mnt/nfs/models/seqvec/options_file")

We select an amino acid (AA) sequence, in this case that of [Aspartate aminotransferase, mitochondrial](https://www.uniprot.org/uniprot/P12345).


In [3]:
target_sequence = "MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFAFFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGYLAHAIHQVTK"
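Before embedding, it can be worth confirming that the string contains only amino acid letters. The check below is a minimal sketch: `find_invalid_residues` and the accepted alphabet (the 20 standard residues plus `X`, `U`, `O`, `B` and `Z`) are assumptions made for illustration and are not part of `bio_embeddings`.

```python
# Hypothetical pre-flight check; not part of bio_embeddings.
# Assumed alphabet: 20 standard amino acids plus X (unknown), U, O, B and Z.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "XUOBZ")


def find_invalid_residues(sequence):
    """Return (position, character) pairs for characters outside VALID_RESIDUES.

    Positions are 1-based, matching the usual residue numbering.
    """
    return [
        (position, residue)
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue not in VALID_RESIDUES
    ]


example = "MALLHSARVLSGVAS"  # first residues of the sequence above
print(find_invalid_residues(example))      # no offending characters
print(find_invalid_residues("MAL LH5AR"))  # flags the space and the digit
```

An empty list means the sequence can be passed on as-is; otherwise the reported positions show where to look.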

We compute the embedding of the sequence above. Since there is only one sequence, we use the simple `embed` method rather than `embed_many` or `embed_batch`, which are meant for embedding multiple sequences.

In [4]:
embedding = embedder.embed(target_sequence)

The `bio_embeddings` pipeline includes models trained on embeddings to predict secondary structure and subcellular localization. We use these models in the following cells.

To speed up processing, we have downloaded the model weights of the supervised subcellular localization and secondary structure prediction models from [here](http://maintenance.dallago.us/public/embeddings/feature_models/seqvec/).

In [5]:
from bio_embeddings.extract.basic import BasicAnnotationExtractor

In [6]:
annotations_extractor = BasicAnnotationExtractor(
    secondary_structure_checkpoint_file="../../models/seqvec/secstruct_checkpoint.pt",
    subcellular_location_checkpoint_file="../../models/seqvec/subcell_checkpoint.pt"
)
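Since the extractor depends on two checkpoint files on disk, a quick existence check can make a wrong path easier to diagnose. This is a minimal sketch: `missing_files` is a hypothetical helper, not part of `bio_embeddings`, and the paths are simply the ones used in the cell above.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the given paths (as strings) that do not point to existing files."""
    return [str(path) for path in paths if not Path(path).is_file()]


# Same relative paths as in the cell above; adjust them to your own layout.
checkpoints = (
    "../../models/seqvec/secstruct_checkpoint.pt",
    "../../models/seqvec/subcell_checkpoint.pt",
)

missing = missing_files(*checkpoints)
if missing:
    print("Missing checkpoint files:", ", ".join(missing))
else:
    print("All checkpoint files found.")
```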

In [7]:
annotations = annotations_extractor.get_annotations(embedding)

Let's see which annotations are available from SeqVec:

In [8]:
annotations._fields

('DSSP3', 'DSSP8', 'disorder', 'localization', 'membrane')
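The `_fields` attribute suggests that `annotations` behaves like a named tuple. The sketch below uses a stand-in `ToyAnnotations` type with the same field names and placeholder values (both hypothetical) to show how such an object can be inspected field by field.

```python
from collections import namedtuple

# Stand-in type with the same field names as printed above.
# ToyAnnotations and its placeholder values are hypothetical.
ToyAnnotations = namedtuple(
    "ToyAnnotations", ["DSSP3", "DSSP8", "disorder", "localization", "membrane"]
)

toy = ToyAnnotations(
    DSSP3="CCH",
    DSSP8="CCH",
    disorder="---",
    localization="placeholder",
    membrane="placeholder",
)

# _fields lists the names; _asdict() pairs each name with its value.
for name, value in toy._asdict().items():
    print(f"{name}: {value}")

# Fields can be read by attribute or by position.
assert toy.DSSP3 == toy[0]
```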

Let's print the subcellular localization predicted from the SeqVec embedding:

In [9]:
print(f"The subcellular localization predicted from the embedding is: {annotations.localization.value}")

The subcellular localization predicted from the embedding is: Mitochondrion


For per-residue annotations such as secondary structure, we can use a helper function to format the extracted annotations as a single string:

In [10]:
from bio_embeddings.utilities.helpers import convert_list_of_enum_to_string
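To make the helper's role concrete, here is a rough stand-in for what a function like `convert_list_of_enum_to_string` plausibly does: join the `.value` of each enum member into one string. `ToyDSSP3`, `enums_to_string` and `runs` are hypothetical names, and the actual implementation in `bio_embeddings` may differ. The `runs` helper additionally collapses the string into contiguous segments, which can be easier to scan than a per-residue listing.

```python
from enum import Enum
from itertools import groupby


class ToyDSSP3(Enum):
    """Hypothetical three-state alphabet: helix, strand, coil."""
    HELIX = "H"
    STRAND = "E"
    COIL = "C"


def enums_to_string(members):
    """Concatenate the .value of each enum member into a single string."""
    return "".join(member.value for member in members)


def runs(states):
    """Collapse a state string into (state, start, end) runs, 1-based inclusive."""
    segments = []
    position = 1
    for state, group in groupby(states):
        length = len(list(group))
        segments.append((state, position, position + length - 1))
        position += length
    return segments


prediction = [
    ToyDSSP3.COIL, ToyDSSP3.COIL,
    ToyDSSP3.HELIX, ToyDSSP3.HELIX,
    ToyDSSP3.STRAND,
]
as_string = enums_to_string(prediction)
print(as_string)
print(runs(as_string))
```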

In [11]:
print("The predicted secondary structure (red) of the sequence is:")

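# Print each residue (black) followed by its predicted DSSP3 state (red), using ANSI color codes.
for residue, state in zip(target_sequence, convert_list_of_enum_to_string(annotations.DSSP3)):
    print(f"\x1b[30m{residue}\x1b[31m{state}")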

The predicted secondary structure (red) of the sequence is:
MC
AC
LC
LC
HC
SC
AC
RC
VC
LC
SC
GC
VC
AC
SC
AC
FC
HC
PC
GC
LC
AC
AC
AC
AC
SC
AC
RC
AC
SC
SC
WH
WH
AH
HH
VC
EC
MC
GC
PC
PC
DC
PC
IH
LH
GH
VH
TH
EH
AH
YH
KH
RH
DC
TC
NC
SC
KC
KC
ME
NE
LC
GC
VC
GC
AC
YE
RC
DC
DC
NC
GC
[30