<a href="https://colab.research.google.com/github/sid-betalol/evodiff/blob/main/examples/inpaint_multiple_regions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EvoDiff Examples

This notebook covers the following topic:


* Inpainting of multiple regions in a given sequence or given list of sequences


## Installation

To download and run our code, first open this notebook in a clean conda environment. We recommend creating it with Python `v3.8.5` by running `conda create --name evodiff python=3.8.5`. In that new environment, install our code by running:

In [None]:
# import sys
# !{sys.executable} -m pip install evodiff

You will also need to install PyTorch. We tested our models on `v2.0.1`. Change the line below to install the PyTorch version that works for your system.

In [None]:
# conda install pytorch torchvision torchaudio cpuonly -c pytorch

You also need PyTorch Geometric and PyTorch Scatter installed.

In [None]:
# conda install pyg -c pyg

In [None]:
# conda install -c conda-forge torch-scatter

In [None]:
!pip install git+https://github.com/sid-betalol/evodiff.git

In [None]:
!pip install git+https://github.com/microsoft/protein-sequence-models.git

In [None]:
!pip install alembic aniso8601 biotite blosum docker fair-esm fasteners graphene graphql-core graphql-relay GridDataFormats gunicorn lmdb Mako mda-xdrlib MDAnalysis mlflow mmtf-python mrcfile pdb-tools querystring-parser smmap biopython==1.81

In [None]:
!pip install torchvision torchaudio

In [None]:
import torch
print(torch.__version__)
print(torch.version.cuda)
torch_version = torch.__version__
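# torch.version.cuda is None on CPU-only builds, so fall back to 'cpu' instead of raising
cuda_version = (torch.version.cuda or 'cpu').replace('.', '')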
base_url = f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"

In [None]:
!pip install -q torch-scatter -f $base_url
!pip install -q torch-sparse -f $base_url
!pip install -q torch-cluster -f $base_url
!pip install -q torch-spline-conv -f $base_url
!pip install -q torch-geometric
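
Before moving on, it can help to confirm that the installs above actually landed in the current kernel. The following is a minimal sketch, not part of EvoDiff: `find_missing` is a hypothetical helper, and the module names in `required` are assumed import names for the packages installed above.

```python
import importlib.util

def find_missing(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Assumed import names for the packages installed above.
required = ["torch", "torch_scatter", "torch_geometric", "evodiff", "sequence_models"]
missing = find_missing(required)
print("Missing modules:", missing if missing else "none")
```

If anything is listed as missing, rerun the matching install cell and restart the kernel before continuing.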

## Conditional generation

### Inpainting IDRs with EvoDiff-Seq

First, let's load the model we want to use.

In [None]:
from evodiff.pretrained import OA_DM_38M

checkpoint = OA_DM_38M()
model, collater, tokenizer, scheme = checkpoint

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

In [None]:
from evodiff.conditional_generation import inpaint_multiple_regions

In [None]:
sequence1 = 'DQTERTVRSFEGRRTAPYLDSRNVLTIGYGHLLNRPGANKSWEGRLTSALPREFKQRLTELAASQLHETDVRLATARAQALYGSGAYFESVPVSLNDLWFDSVFNLGERKLLNWSGLRTKLESRDWGAAAKDLGRHTFGREPVSRRMAESMRMRRGIDLNHYNI'
sequence2 = sequence1[:100]

Helper functions for masking and tokenizing the sequences

In [None]:
def mask_sequences(sequences, start_ids, end_ids):
    """Replace each region [start, end) of every sequence with '#' mask characters."""
    masked_sequences = []
    for sequence, starts, ends in zip(sequences, start_ids, end_ids):
        masked_sequence = sequence
        for start, end in zip(starts, ends):
            # The mask is exactly as long as the region it replaces,
            # so later indices never shift and no offset is needed.
            masked_sequence = masked_sequence[:start] + '#' * (end - start) + masked_sequence[end:]
        masked_sequences.append(masked_sequence)
    return masked_sequences

def tokenize_sequences(sequences, tokenizer, device=device):
    tokenized_sequences = [torch.tensor(tokenizer.tokenizeMSA(seq)) for seq in sequences]
    tokenized_sequences = [seq.to(device) for seq in tokenized_sequences]
    return tokenized_sequences

def prepare_indices(start_ids, end_ids, device=device):
    start_idxs = torch.tensor(start_ids).to(device)
    end_idxs = torch.tensor(end_ids).to(device)
    return start_idxs, end_idxs
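
To see what the masking step does, here is a small self-contained sketch on a toy sequence. It mirrors the slicing logic of `mask_sequences` for a single sequence; `mask_regions` and the toy data are illustrative names introduced here, not part of EvoDiff.

```python
def mask_regions(sequence, starts, ends, mask_char="#"):
    """Replace each half-open region [start, end) with mask characters."""
    masked = sequence
    for start, end in zip(starts, ends):
        masked = masked[:start] + mask_char * (end - start) + masked[end:]
    return masked

toy_sequence = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues, purely illustrative
toy_masked = mask_regions(toy_sequence, starts=[2, 10], ends=[5, 14])
print(toy_masked)  # AC###GHIKL####RSTVWY
```

Each region is half-open, so `start=2, end=5` masks positions 2, 3, and 4, and the masked sequence keeps its original length.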

Prepare your sequences for inpainting

In [None]:
sequences = [sequence1, sequence2]
start_ids = [[20, 80], [10, 80]]
end_ids = [[50, 100], [20, 90]]

# Mask the sequences
masked_sequences = mask_sequences(sequences, start_ids, end_ids)

# Tokenize the masked sequences
tokenized_sequences = tokenize_sequences(masked_sequences, tokenizer, device)
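
Index mistakes are easy to make when listing several regions per sequence. The sketch below checks the inputs before masking, assuming regions are meant to be half-open `[start, end)`, in bounds, and non-overlapping, as the slicing in `mask_sequences` suggests. `validate_regions` is a hypothetical helper, and the placeholder strings only stand in for real sequences of the same length.

```python
def validate_regions(sequences, start_ids, end_ids):
    """Raise ValueError if any region is empty, out of bounds, or overlapping."""
    if not (len(sequences) == len(start_ids) == len(end_ids)):
        raise ValueError("sequences, start_ids, and end_ids must have equal length")
    for i, (seq, starts, ends) in enumerate(zip(sequences, start_ids, end_ids)):
        if len(starts) != len(ends):
            raise ValueError(f"sequence {i}: starts and ends differ in length")
        previous_end = 0
        # Check regions in order of their start position.
        for start, end in sorted(zip(starts, ends)):
            if not (0 <= start < end <= len(seq)):
                raise ValueError(f"sequence {i}: region [{start}, {end}) is invalid")
            if start < previous_end:
                raise ValueError(f"sequence {i}: region [{start}, {end}) overlaps")
            previous_end = end
    return True

# Placeholder sequences (lengths chosen for illustration) with the index lists used above.
toy_sequences = ["A" * 120, "A" * 100]
print(validate_regions(toy_sequences, [[20, 80], [10, 80]], [[50, 100], [20, 90]]))
```

Calling the helper on the real `sequences`, `start_ids`, and `end_ids` before `mask_sequences` would surface a bad index early instead of producing a silently wrong mask.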

In [None]:
# masked_sequences

In [None]:
(untokenized_seqs, sequences, untokenized_idrs,
 sequences_idrs, save_starts, save_ends) = inpaint_multiple_regions(
    model, tokenized_sequences, start_ids, end_ids, sequences, tokenizer
)

In [None]:
# untokenized_seqs

In [None]:
# sequences

In [None]:
# untokenized_idrs

In [None]:
# sequences_idrs
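
One sanity check after inpainting is to confirm that residues outside the masked regions were left untouched. The sketch below works on plain strings and toy data. It assumes the generated sequence is available as a string of the same length as the original, which should be verified against the actual outputs above. `changed_outside_regions` is a hypothetical helper, not an EvoDiff function.

```python
def changed_outside_regions(original, generated, starts, ends):
    """Return indices outside every [start, end) region where the two strings differ."""
    if len(original) != len(generated):
        raise ValueError("sequences must have the same length")
    masked = set()
    for start, end in zip(starts, ends):
        masked.update(range(start, end))
    return [
        i for i, (a, b) in enumerate(zip(original, generated))
        if i not in masked and a != b
    ]

toy_original = "ACDEFGHIKLMNPQRSTVWY"
toy_generated = "ACWWWGHIKLAAAARSTVWY"  # differs only inside [2, 5) and [10, 14)
unexpected = changed_outside_regions(toy_original, toy_generated, [2, 10], [5, 14])
print(unexpected)  # []
```

An empty list means every difference falls inside a masked region. Any index reported points to a residue that changed where it should not have.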