## <u>Demo and Tutorial</u> 
Welcome to our demo and tutorial! This notebook will be updated over time with more examples and use cases. Make sure abmap and ANARCI are installed before running it.

In [1]:
import torch
import abmap

#### <u>Foundational Models</u>
**AbMAP** leverages foundational protein language models (PLMs) like ESM2 to embed antibody sequences into a continuous embedding space. We've pre-trained several models on the outputs of these foundational models; they can be loaded easily and used for downstream tasks (e.g. property prediction, structure prediction). These are not "fine-tuned" versions of a PLM such as ESM2, but separate models trained on the PLM's outputs.

We've included pre-trained models built on the outputs of four different foundational models (ESM2, ESM1b, Bepler-Berger, and ProtBert). For each model, we have two checkpoints: one trained on heavy chain (VH) antibody sequences from SAbDab and the other trained on light chain (VL) sequences. You can load them in the next cell. We recommend running this notebook on a GPU for faster inference, but it is not required.

In [2]:
# -------- Load AbMAP 
# Using Bepler-Berger as foundational model (best for structure prediction)
abmap_H = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_beplerberger_H.pt', plm_name='beplerberger')
abmap_L = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_beplerberger_L.pt', plm_name='beplerberger')

# Using ESM2 (best for functional prediction, e.g. affinity, paratope prediction, etc.)
# abmap_H = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_esm2_H.pt', plm_name='esm2')
# abmap_L = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_esm2_L.pt', plm_name='esm2')

# Using ESM1b
# abmap_H = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_esm1b_H.pt', plm_name='esm1b')
# abmap_L = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_esm1b_L.pt', plm_name='esm1b')

# Using ProtBert
# abmap_H = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_protbert_H.pt', plm_name='protbert')
# abmap_L = abmap.load_abmap(pretrained_path='../pretrained_models/AbMAP_protbert_L.pt', plm_name='protbert')

beplerberger loaded to cpu
Loaded the Pre-trained Model!
beplerberger loaded to cpu
Loaded the Pre-trained Model!


#### <u>AbMAP embedding process</u>
Embedding a sequence with AbMAP takes three steps, plus an optional fourth:
1. Get the embedding from a **foundational PLM**
2. **Augment** the embedding to focus on CDRs, the regions most important for binding and function
3. **Enhance** the embedding with knowledge of antibody structure and function
4. **Fine-tune** on other functional properties of interest (optional)
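
Step 2 can be pictured with a toy example. The sketch below assumes the augmentation works by mutating residues inside the CDRs, embedding the mutants, and contrasting the original embedding against the mutant average, then keeping only the CDR positions. That reading is an assumption based on the "contrastive augmentation (PLM, mutagenesis, CDR focus)" comment in the next cell, not a description of AbMAP's internals. `toy_embed`, `mutate_cdrs`, `contrastive_augment`, the mutation rate, and the hard-coded CDR positions are all hypothetical stand-ins; in AbMAP the real work is done by `create_cdr_specific_embedding`.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(seq):
    """Hypothetical stand-in for a PLM: one number per residue."""
    return [float(AMINO_ACIDS.index(aa)) for aa in seq]


def mutate_cdrs(seq, cdr_positions, rng, rate=0.5):
    """Return a copy of seq with random substitutions restricted to CDR positions."""
    residues = list(seq)
    for i in cdr_positions:
        if rng.random() < rate:
            residues[i] = rng.choice(AMINO_ACIDS)
    return "".join(residues)


def contrastive_augment(seq, cdr_positions, k=20, seed=0):
    """Subtract the mean embedding of k CDR mutants from the original embedding,
    then keep only the CDR positions."""
    rng = random.Random(seed)
    original = toy_embed(seq)
    mutants = [toy_embed(mutate_cdrs(seq, cdr_positions, rng)) for _ in range(k)]
    mean_mutant = [sum(column) / k for column in zip(*mutants)]
    difference = [o - m for o, m in zip(original, mean_mutant)]
    return [difference[i] for i in cdr_positions]


seq = "EVQLVESGGGLVQ"             # short toy sequence
cdr_positions = [4, 5, 6, 7]      # hypothetical CDR indices (ANARCI would supply real ones)
augmented = contrastive_augment(seq, cdr_positions, k=20)
```

Because the mutations never touch framework positions, the framework contribution cancels in the difference, which is why only the CDR positions are kept. A larger `k` averages over more mutants at the cost of more PLM calls.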

In [3]:
# ----- Get embedding for one sequence (steps 1-3)
demo_seq = 'EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYWMHWVRQAPGKGLVWVSRINSDGSSTSYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAGSYRSLFDYWGQGTLVTVSS'

# Contrastive augmentation (PLM, mutagenesis, CDR focus)
x = abmap.ProteinEmbedding(demo_seq, chain_type='H', embed_device=torch.device('cpu')) # push sequence through foundational model
x.create_cdr_specific_embedding(embed_type='beplerberger', k=20) # a smaller k runs faster; we recommend k >= 6

# Pass the augmented embedding to AbMAP to get the final embedding
with torch.no_grad():
    embed_var = abmap_H.embed(x.cdr_embedding.unsqueeze(0), embed_type='variable') # residue-level embeddings
    embed_fl = abmap_H.embed(x.cdr_embedding.unsqueeze(0), embed_type='fixed') # fixed-length



In [4]:
# Variable-length CDR-only embedding and fixed-length embedding
embed_var.shape, embed_fl.shape

(torch.Size([1, 31, 256]), torch.Size([1, 512]))

#### <u>Embed a whole FASTA file</u>
We've also made it easy to create embeddings for every sequence in a FASTA file. The embeddings are stored in a directory for later use. These AbMAP embeddings can serve as input features for training a downstream model on other tasks. They can also be used to compare sequences in a continuous embedding space, for example to cluster sequences by structural and functional similarity.

In [5]:
# -------- Get embeddings for a fasta file of sequences
fasta_path = '../data/test.fasta'
output_path = '../data/test_embeddings'

# outputs saved to output_path directory
abmap.augment_from_fasta(fastaPath=fasta_path,
                         outputPath=output_path,
                         chain_type='H',
                         embed_type='beplerberger',
                         num_mutations=10,
                         device='cpu') # change to 0 if using 'cuda'

beplerberger loaded to cpu
[2023-05-18-15:09:58] # Storing to ../data/test_embeddings...


100%|██████████| 3/3 [00:00<00:00, 6456.09it/s]
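
As a rough illustration of comparing sequences in embedding space, the sketch below computes cosine similarity between fixed-length vectors and picks the nearest neighbour of a query. The short toy vectors and the helper names `cosine_similarity` and `most_similar` are illustrative assumptions, not part of abmap, and cosine similarity is only one reasonable choice of metric. In practice the vectors would be the fixed-length AbMAP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the (name, score) pair of the candidate closest to query."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 4-d vectors standing in for fixed-length embeddings
embeddings = {
    "ab1": [1.0, 0.0, 0.0, 0.0],
    "ab2": [0.9, 0.1, 0.0, 0.0],
    "ab3": [0.0, 0.0, 1.0, 0.0],
}
query = embeddings["ab1"]
others = {name: vec for name, vec in embeddings.items() if name != "ab1"}
best_name, best_score = most_similar(query, others)
```

The same pairwise scores can be fed to any clustering routine that accepts a similarity or distance matrix.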


#### <u>Fine-tuning AbMAP on your own functional data</u>

In [6]:
# ---------- Fine-tuning functional data (step 4)
# Create a dataloader for your data; see abmap/dataloader.py for examples.
# The dataloader should take in your embeddings (now stored in a directory) and your functional labels.

# Define a model; see abmap/models.py for examples
model = abmap.PropertyPredictorAttn()

# TODO - train model with PyTorch Lightning
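
The dataloader described in the comments above could start from a small map-style dataset that pairs each stored embedding with its label. The sketch below is only a starting point under stated assumptions: the class name `EmbeddingLabelDataset`, the one-pickle-file-per-sequence layout, and the `.p` suffix are all hypothetical, so check the files actually written to the output directory before adapting it. It uses plain Python so that it runs without a GPU or trained model; an object with `__len__` and `__getitem__` follows the protocol that `torch.utils.data.DataLoader` expects from map-style datasets.

```python
import os
import pickle
import tempfile


class EmbeddingLabelDataset:
    """Hypothetical map-style dataset pairing stored embeddings with labels."""

    def __init__(self, embedding_dir, labels, suffix=".p"):
        self.embedding_dir = embedding_dir
        self.suffix = suffix  # assumed file suffix, adjust to the real files
        # Keep only the ids that have an embedding file on disk
        self.items = sorted(
            (seq_id, label)
            for seq_id, label in labels.items()
            if os.path.exists(self._path(seq_id))
        )

    def _path(self, seq_id):
        return os.path.join(self.embedding_dir, seq_id + self.suffix)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        seq_id, label = self.items[index]
        with open(self._path(seq_id), "rb") as handle:
            embedding = pickle.load(handle)
        return embedding, label


# Demo with toy vectors written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    for seq_id, vector in {"ab1": [0.1, 0.2], "ab2": [0.3, 0.4]}.items():
        with open(os.path.join(tmp_dir, seq_id + ".p"), "wb") as handle:
            pickle.dump(vector, handle)

    # "ab3" has a label but no embedding file, so it is skipped
    dataset = EmbeddingLabelDataset(tmp_dir, {"ab1": 1.5, "ab2": -0.5, "ab3": 2.0})
    n_items = len(dataset)
    first_embedding, first_label = dataset[0]
```

Silently skipping ids without an embedding file keeps the demo short; for real data it may be safer to raise an error so that missing embeddings are noticed.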