# Lab 1.2 Using Boltz-1 model to predict protein structure

In [7]:
import os
import shutil
import numpy as np

## Helper functions

In [8]:

def preprare_directory(temp, delete_old=True):
    """
    Create a new directory and delete the old one if it exists
    :param temp: str: path to the directory
    :param delete_old: bool, whether to delete the old directory. Defaults to True.
    """
    if delete_old:  
        if os.path.exists(temp):
            # Remove the directory and all its contents
            shutil.rmtree(temp)
    # Recreate the directory
    os.makedirs(temp, exist_ok=True)

## Preparing the input file

[Boltz-1](https://github.com/jwohlwend/boltz/tree/main) is a state-of-the-art open-source model that predicts the 3D structure of proteins, RNA, DNA, and small molecules. It handles modified residues, covalent ligands, and glycans, and it can condition the generation on pocket residues.

In this lab, we will primarily look at protein structure prediction. Two example files are provided in the `notebooks/input` directory, one for a monomer and one for a multimer. Let's look at the multimer example first: 


```yaml
# notebooks/input/multimer.yaml
version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: A
      sequence: MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
  - protein:
      id: B
      sequence: MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
```
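Input files like this one can also be generated from Python, which is convenient when many complexes need to be predicted. The following is a minimal sketch that uses plain string formatting and covers only the `protein` entries shown above. The function name `write_boltz_yaml`, the shortened sequences, and the temporary output path are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def write_boltz_yaml(chains, path):
    """Write a YAML input file with one `protein` entry per chain.

    chains: dict mapping chain ID (e.g. "A") to amino-acid sequence
    path: where to write the YAML file
    """
    lines = ["version: 1", "sequences:"]
    for chain_id, sequence in chains.items():
        lines.append("  - protein:")
        lines.append(f"      id: {chain_id}")
        lines.append(f"      sequence: {sequence}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n")
    return path


# Shortened placeholder sequences, written to a temporary directory
example_path = write_boltz_yaml(
    {"A": "MAHHHHHH", "B": "MRYAFAAE"},
    Path(tempfile.mkdtemp()) / "example.yaml",
)
print(example_path.read_text())
```

The layout mirrors `multimer.yaml`: one list item per chain under `sequences`, each with an `id` and a `sequence`.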

To predict the structure, we will use the `boltz predict` command. 
Additional information about the command can be found using: 

```bash
! boltz predict --help
```

In [3]:
! boltz predict --help

Usage: boltz predict [OPTIONS] DATA

  Run predictions with Boltz-1.

Options:
  --out_dir PATH               The path where to save the predictions.
  --cache PATH                 The directory where to download the data and
                               model. Default is ~/.boltz.
  --checkpoint PATH            An optional checkpoint, will use the provided
                               Boltz-1 model by default.
  --devices INTEGER            The number of devices to use for prediction.
                               Default is 1.
  --accelerator [gpu|cpu|tpu]  The accelerator to use for prediction. Default
                               is gpu.
  --recycling_steps INTEGER    The number of recycling steps to use for
                               prediction. Default is 3.
  --sampling_steps INTEGER     The number of sampling steps to use for
                               prediction. Default is 200.
  --diffusion_samples INTEGER  The number of diffusion samples to use for
          

## Run prediction

### CLI

Now let's run the prediction with the following parameters: 

- input file: `input/keytruda.yaml` (Humira Fab in complex with TNFa)
- `out_dir`: Output directory to store the result. 
- `devices`: number of GPUs. If you are providing a single YAML file, set to `1`.
- `output_format`: Save the result in PDB format. 
- `use_msa_server`: Use the MSA server (the ColabFold MMseqs2 server) to get the MSA. You can optionally provide your own MSA file, but for simplicity, we will use the server here. 
- `num_workers`: number of data-loading workers, set to `4` here.

In [10]:
! boltz predict input/keytruda.yaml --out_dir output --devices 1 --output_format pdb --use_msa_server --num_workers 4

Checking input data.
Running predictions for 1 structure
Processing input data.
  0%|                                                     | 0/1 [00:00<?, ?it/s]Generating MSA for input/keytruda.yaml with 2 protein entities.

  0%|                                      | 0/300 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%|                              | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE:   0%|                            | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]

  0%|                                      | 0/300 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%|                              | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE:   0%|                            | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]
100%|█████████████████████████████████████████████| 1/1 [00:03<00:00,  3.45s/it]
GPU available: True (

Note: 
1. The first time this command is run, it downloads the model weights and caches them, which can take a while (~15 min, depending on the network speed). Later runs read the cached weights and are much faster. For the example multimer, the prediction should take ~2 minutes on an L4 GPU. 
2. You can open another terminal and run the following command to monitor the GPU usage: 
    ```bash
    watch -n 1 nvidia-smi
    ```
3. If the run is successful, the output contains `Number of failed examples: 0`. 
4. When the complex is large, you might encounter out-of-memory errors.

### Python

In [1]:
! rm -rf output/boltz_results_keytruda

In [2]:
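import subprocess

input_yaml_path = "input/keytruda.yaml"
# boltz writes into a `boltz_results_<input name>` folder inside --out_dir,
# so pass the parent directory to match the paths used in the CLI run above
out_dir = "output"
result_dir = "output/boltz_results_keytruda"

command = [
    "boltz",
    "predict",
    input_yaml_path,
    "--out_dir", out_dir,
    "--devices", "1",
    "--output_format", "pdb",
    "--use_msa_server",
]

# capture stdout and stderr as text so the log can be inspected afterwards
output = subprocess.run(command, capture_output=True, text=True)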

In [4]:
print(output.stdout)

Checking input data.
Running predictions for 1 structure
Processing input data.
Generating MSA for input/keytruda.yaml with 2 protein entities.

Predicting: |                                                                                                                     | 0/? [00:00<?, ?it/s]
Predicting:   0%|                                                                                                                 | 0/1 [00:00<?, ?it/s]
Predicting DataLoader 0:   0%|                                                                                                    | 0/1 [00:00<?, ?it/s]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:29<00:00,  0.03it/s]Number of failed examples: 0

Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:29<00:00,  0.03it/s]



In [5]:
# make sure it is successful
assert "Number of failed examples: 0" in output.stdout
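The `assert` above only distinguishes "zero failures" from everything else. A slightly more informative check extracts the reported count from the log. This is a minimal sketch that assumes the log line has the form seen in the output above; the helper name `count_failed_examples` and the sample log string are hypothetical.

```python
import re


def count_failed_examples(stdout_text):
    """Return the number of failed examples reported in the log, or None if absent."""
    match = re.search(r"Number of failed examples:\s*(\d+)", stdout_text)
    return int(match.group(1)) if match else None


# Stand-in log text; in the notebook, pass `output.stdout` instead
sample_log = "Predicting DataLoader 0: 100%| 1/1 [00:29<00:00]Number of failed examples: 0\n"
print(count_failed_examples(sample_log))  # 0
```

A return value of `None` means the line was not found at all, which usually points to the run stopping early; in that case `output.stderr` and `output.returncode` are worth inspecting.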


## Analyze the results

In the `notebooks/output/boltz_results_keytruda/predictions/keytruda` directory, you should see: 
- `keytruda_model_0.pdb` file: predicted structure
- `confidence_keytruda_model_0.json` file: confidence scores for the predicted structure
- `plddt_keytruda_model_0.npz` file: per-residue pLDDT confidence scores
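
The confidence file is plain JSON, so it can be inspected with the standard library. The sketch below lists only the top-level numeric entries and makes no assumption about which keys are present, other than the file holding a JSON object. The helper name `summarize_confidence` and the stand-in file with made-up keys are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def summarize_confidence(json_path):
    """Return the top-level numeric entries of a confidence JSON file."""
    data = json.loads(Path(json_path).read_text())
    return {
        key: value
        for key, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


# Stand-in file with made-up keys, only to show the call; in the notebook, pass
# "output/boltz_results_keytruda/predictions/keytruda/confidence_keytruda_model_0.json"
demo_path = Path(tempfile.mkdtemp()) / "confidence_demo.json"
demo_path.write_text(json.dumps({"example_score": 0.5, "example_nested": {"0": 0.4}}))
demo_summary = summarize_confidence(demo_path)
print(demo_summary)
```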

As with ESMFold, suppose we want to visualize the predicted structure and color the residues by their pLDDT scores. To do this, we first need to extract the pLDDT scores. 

In [13]:
# Load the .npz file that contains the pLDDT scores
npz_file_path = "output/boltz_results_keytruda/predictions/keytruda/plddt_keytruda_model_0.npz"
data = np.load(npz_file_path)
# extract the pLDDT scores
plddt_scores = data['plddt']

# how many scores are there?
print(plddt_scores.shape)

(230,)


Note that the number of pLDDT scores equals the length of input sequence A (111 aa) plus the length of input sequence B (119 aa): 230 aa in total.
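
Since the array covers both chains, it can be useful to split it back into per-chain scores. The sketch below assumes the scores are stored chain by chain in input order (A first, then B); the helper name `split_plddt_by_chain` is hypothetical, and the demo uses stand-in values instead of real predictions.

```python
import numpy as np


def split_plddt_by_chain(plddt, chain_lengths):
    """Split a concatenated pLDDT array into one array per chain.

    chain_lengths: dict mapping chain ID to number of residues, in input order
    """
    total = sum(chain_lengths.values())
    if total != len(plddt):
        raise ValueError(f"expected {total} scores, got {len(plddt)}")
    # np.split cuts the array at the cumulative chain boundaries
    boundaries = np.cumsum(list(chain_lengths.values()))[:-1]
    return dict(zip(chain_lengths, np.split(plddt, boundaries)))


# Stand-in scores; in the notebook, pass `plddt_scores` instead
demo_scores = np.linspace(0.6, 0.99, 230)
per_chain = split_plddt_by_chain(demo_scores, {"A": 111, "B": 119})
for chain_id, scores in per_chain.items():
    print(chain_id, len(scores), round(float(scores.mean()), 3))
```

The length check guards against passing chain lengths that do not match the prediction, which would otherwise silently shift every score in the second chain.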

In [16]:
# take a look at the scores
plddt_scores

array([0.8550867 , 0.9642465 , 0.97675884, 0.98382616, 0.9833848 ,
       0.97937065, 0.95455503, 0.9659277 , 0.9658207 , 0.96840876,
       0.9607582 , 0.9691442 , 0.97348726, 0.97164476, 0.96110845,
       0.9779537 , 0.97456515, 0.9781869 , 0.98064953, 0.9777327 ,
       0.9826192 , 0.97426087, 0.98586166, 0.9820484 , 0.9802414 ,
       0.9773198 , 0.89338213, 0.9710161 , 0.9527947 , 0.92729926,
       0.8791078 , 0.8422709 , 0.89715123, 0.896601  , 0.9462743 ,
       0.9352464 , 0.9674897 , 0.963935  , 0.9828695 , 0.98241174,
       0.98397726, 0.9830535 , 0.9639764 , 0.9210799 , 0.93787575,
       0.9609377 , 0.9730338 , 0.974687  , 0.95802474, 0.95752645,
       0.9782369 , 0.9702632 , 0.9470639 , 0.938419  , 0.96240956,
       0.9535309 , 0.93689287, 0.95661026, 0.9379674 , 0.8982222 ,
       0.9437908 , 0.96925235, 0.972739  , 0.96875596, 0.9801584 ,
       0.9852443 , 0.97913873, 0.98217916, 0.9735309 , 0.97811425,
       0.94203824, 0.96640307, 0.9757999 , 0.9776363 , 0.98578

If we take a look at the min and max values of the pLDDT scores, we can see that the scores are between 0 and 1. 

In [17]:

print(max(plddt_scores), min(plddt_scores))

0.9878665 0.62562263


For consistency with the ESMFold method, we will multiply the pLDDT scores by 100 to put them on a 0 to 100 scale. This line of code is shown in the `load_protein_boltz` function later in the notebook. 

## Visualizing the results with py3Dmol

From what we have seen above, to visualize the protein, we need to match the pLDDT scores to the correct residues in the PDB file. 

Now open the `multimer_model_0.pdb` file and take a look at the raw PDB file. 

Splitting each line on whitespace and counting columns from 0: 

- `column 0`: the record type, `ATOM` for atom lines
- `column 1`: the atom serial number, starting from 1
- `column 4`: the chain ID, A or B
- `column 5`: the residue number, counting from 1 to 112 for chain A and from 1 to 116 for chain B

This tells us our strategy could be: 
- read the PDB file line by line
- split each line by column
- Look at `column 0`; if it is not "ATOM", ignore the line
- Extract the chain ID from `column 4` and the residue number from `column 5`. Residue numbers restart at 1 for each chain, while the pLDDT array covers all chains back to back, so keep a running residue count across chains and use it to look up the pLDDT score
- Style each atom in this residue based on the pLDDT score
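
The residue-matching step above can be sketched on its own, without any viewer. The version below reads the fixed-width PDB columns instead of splitting on whitespace, which stays correct when adjacent fields run together. It assumes the pLDDT scores follow the order in which residues appear in the file; the names `residue_index_per_atom` and `format_atom_line` are hypothetical, and the demo lines are placeholders with coordinates omitted.

```python
def residue_index_per_atom(pdb_text):
    """Map each ATOM serial number to a 0-based residue index across all chains.

    Fixed-width PDB columns (1-based): serial 7-11, chain ID 22,
    residue number 23-26, insertion code 27.
    """
    serial_to_index = {}
    seen = {}  # (chain, residue number, insertion code) -> running index
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        serial = int(line[6:11])
        key = (line[21], int(line[22:26]), line[26:27])
        if key not in seen:
            seen[key] = len(seen)
        serial_to_index[serial] = seen[key]
    return serial_to_index


def format_atom_line(serial, chain, resseq):
    """Build a minimal fixed-width ATOM line for the demo (coordinates omitted)."""
    return f"ATOM  {serial:>5}  CA  ALA {chain}{resseq:>4}"


demo_pdb = "\n".join([
    format_atom_line(1, "A", 1),
    format_atom_line(2, "A", 1),
    format_atom_line(3, "A", 2),
    "TER",
    format_atom_line(4, "B", 1),
    format_atom_line(5, "B", 2),
    "END",
])
demo_mapping = residue_index_per_atom(demo_pdb)
print(demo_mapping)  # {1: 0, 2: 0, 3: 1, 4: 2, 5: 3}
```

Note that residue 1 of chain B maps to index 2 rather than 0, which is the offset needed to index into the concatenated pLDDT array.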

Based on this, we will rewrite the `load_protein_boltz` function to visualize the results. 

In [18]:
import py3Dmol
def load_protein_boltz(pdb_file_path, plddt_file_path, width=800, height=600):

    """
    Load a protein structure from a PDB file and display it using py3Dmol
    pdb_file_path: str, path to the PDB file
    plddt_file_path: str, path to the npz file containing the pLDDT scores
    width: int, width of the viewer in pixels
    height: int, height of the viewer in pixels
    return: py3Dmol.view object
    """
    
    # load the pdb file
    with open(pdb_file_path) as ifile:
        pdb_data = "".join([x for x in ifile])

    # load the plddt scores
    scores = np.load(plddt_file_path)['plddt']
    
    view = py3Dmol.view(width=width, height=height)
    view.addModelsAsFrames(pdb_data)
    

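    # Residue numbers restart at 1 for each chain, while the pLDDT array holds
    # one score per residue with the chains concatenated (assumed to be in file
    # order), so keep a running index over (chain ID, residue number) pairs
    residue_indices = {}

    for line in pdb_data.split("\n"):

        # split each line by columns
        split = line.split()

        # not a valid line, ignore
        if len(split) == 0 or split[0] != "ATOM":
            continue

        # look up (or assign) the 0-based index of this residue across all chains
        residue_key = (split[4], split[5])
        if residue_key not in residue_indices:
            residue_indices[residue_key] = len(residue_indices)
        residue_idx = residue_indices[residue_key]

        # get the pLDDT score for the current residue, scale it to 0-100
        plddt_score = scores[residue_idx] * 100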

        if plddt_score > 90:
            color = "blue"
        elif 70 <= plddt_score <= 90:
            color = "cyan"
        elif 50 <= plddt_score < 70:
            color = "yellow"
        else:
            color = "orange"
        
        # Atom serial numbers typically start from 1, similar to requirement of `view.setStyle`, hence idx should be used directly
        idx = int(split[1])
        
        # Style should be set per atom id
        view.setStyle({'model': -1, 'serial': idx}, {"cartoon": {'color': color}})
    view.zoomTo()
    return view
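
Before opening the viewer, it can help to know how many residues fall into each color band used by `load_protein_boltz`. The sketch below counts them with `np.digitize`; the helper name `count_confidence_bands` is hypothetical, and the demo scores are stand-in values. One small difference from the function above: a score of exactly 90 is counted as blue here, whereas the function colors it cyan.

```python
import numpy as np

# Band edges on the 0-100 scale, mirroring the colors used in load_protein_boltz
BAND_EDGES = [50, 70, 90]
BAND_COLORS = ["orange", "yellow", "cyan", "blue"]


def count_confidence_bands(plddt):
    """Count residues per color band; expects scores on a 0-1 scale."""
    scaled = np.asarray(plddt) * 100
    # band i holds scores with BAND_EDGES[i-1] <= score < BAND_EDGES[i]
    band_index = np.digitize(scaled, BAND_EDGES)
    counts = np.bincount(band_index, minlength=len(BAND_COLORS))
    return dict(zip(BAND_COLORS, counts.tolist()))


# Stand-in scores; in the notebook, pass `plddt_scores` instead
demo_counts = count_confidence_bands([0.30, 0.55, 0.75, 0.95, 0.99])
print(demo_counts)  # {'orange': 1, 'yellow': 1, 'cyan': 1, 'blue': 2}
```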


In [19]:
view = load_protein_boltz(
    pdb_file_path="output/boltz_results_keytruda/predictions/keytruda/keytruda_model_0.pdb", 
    plddt_file_path="output/boltz_results_keytruda/predictions/keytruda/plddt_keytruda_model_0.npz")
view


<py3Dmol.view at 0x7f2e98251b50>

In [9]:
! rm -rf output/boltz_results_keytruda

In [11]:
len("DIQMTQSPSSLSASVGDRVTITCRASQGIRNYLAWYQQKPGKAPKLLIYAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDVATYYCQRYNRAPYTFGQGTKVEIK") + len(
    "EVQLVESGGGLVQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLEWVSAITWNSGHIDYADSVEGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCAKVSYLSTASSLDYWGQGTLVTVSS"
)

228

In [10]:
! boltz predict input/keytruda.yaml --out_dir output --devices 1 --output_format pdb --use_msa_server --num_workers 4

Checking input data.
Running predictions for 1 structure
Processing input data.


  0%|                                                     | 0/1 [00:00<?, ?it/s]Generating MSA for input/keytruda.yaml with 2 protein entities.

  0%|                                      | 0/300 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%|                              | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE:   0%|                            | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████████████████| 300/300 [elapsed: 00:00 remaining: 00:00]

  0%|                                      | 0/300 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%|                              | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE:   0%|                            | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]
100%|█████████████████████████████████████████████| 1/1 [00:03<00:00,  3.16s/it]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False,