# Lab 1.2 Using Boltz-1 model to predict protein structure

In [8]:
import os
import shutil
import numpy as np

## Helper functions

In [2]:

def prepare_directory(temp, delete_old=True):
    """
    Create a new directory, deleting the old one first if it exists
    :param temp: str, path to the directory
    :param delete_old: bool, whether to delete the old directory. Defaults to True.
    """
    if delete_old and os.path.exists(temp):
        # Remove the directory and all its contents
        shutil.rmtree(temp)
    # (Re)create the directory
    os.makedirs(temp, exist_ok=True)

## Preparing the input file

[Boltz-1](https://github.com/jwohlwend/boltz/tree/main) is a state-of-the-art open-source model that predicts the 3D structures of proteins, RNA, DNA, and small molecules. It handles modified residues, covalent ligands, and glycans, and it can condition generation on pocket residues.

In this lab, we will focus on protein structure prediction. Two example files are provided in the `notebooks/input` directory, one for a monomer and one for a multimer. Let's look at the multimer example first:


```yaml
# notebooks/input/multimer.yaml
version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: A
      sequence: MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
  - protein:
      id: B
      sequence: MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
```
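When several inputs need to be prepared, a file in the format shown above can be generated from Python. The following is a minimal sketch and not part of Boltz: `write_boltz_yaml` is a hypothetical helper, the sequences are short toy placeholders, and it only covers the `protein` entries used in this example.

```python
from pathlib import Path
import tempfile


def write_boltz_yaml(chains, path):
    """Write a YAML input file in the format shown above (hypothetical helper).

    chains: dict mapping chain ID -> amino-acid sequence
    path: where to write the YAML file
    return: the YAML text that was written
    """
    lines = ["version: 1", "sequences:"]
    for chain_id, sequence in chains.items():
        lines.append("  - protein:")
        lines.append(f"      id: {chain_id}")
        lines.append(f"      sequence: {sequence}")
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text)
    return text


# Toy sequences for illustration only; replace them with real sequences
demo_chains = {"A": "MAHHHHHHVAVD", "B": "MRYAFAAEATTC"}
demo_path = Path(tempfile.mkdtemp()) / "demo_multimer.yaml"
yaml_text = write_boltz_yaml(demo_chains, demo_path)
print(yaml_text)
```

Writing the file by hand keeps the sketch free of extra dependencies; a YAML library would be a reasonable alternative for more complex inputs.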

To predict the structure, we will use the `boltz predict` command. 
Additional information about the command can be found using: 

```bash
! boltz predict --help
```

In [1]:
! boltz predict --help

Usage: boltz predict [OPTIONS] DATA

  Run predictions with Boltz-1.

Options:
  --out_dir PATH               The path where to save the predictions.
  --cache PATH                 The directory where to download the data and
                               model. Default is ~/.boltz.
  --checkpoint PATH            An optional checkpoint, will use the provided
                               Boltz-1 model by default.
  --devices INTEGER            The number of devices to use for prediction.
                               Default is 1.
  --accelerator [gpu|cpu|tpu]  The accelerator to use for prediction. Default
                               is gpu.
  --recycling_steps INTEGER    The number of recycling steps to use for
                               prediction. Default is 3.
  --sampling_steps INTEGER     The number of sampling steps to use for
                               prediction. Default is 200.
  --diffusion_samples INTEGER  The number of diffusion samples to use for
          

Now let's run the prediction: 

In [5]:
! boltz predict input/multimer.yaml --out_dir output --devices 1 --output_format pdb --use_msa_server 

Checking input data.
Running predictions for 1 structure
Processing input data.
  0%|                                                     | 0/1 [00:00<?, ?it/s]Generating MSA for input/multimer.yaml with 2 protein entities.

  0%|                                      | 0/300 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%|                              | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE:   0%|                            | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████████████████| 300/300 [elapsed: 00:00 remaining: 00:00]

  0%|                                      | 0/300 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%|                              | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE:   0%|                            | 0/300 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████████████████| 300/300 [elapsed: 00:01 remaining: 00:00]
100%|█████████████████████████████████████████████| 1/1 [00:03<00:00,  3.18s/it]
GPU available: True (

The first time this command is run, it downloads the model weights and caches them, which can take a while (~15 min, depending on network speed). Subsequent runs read the cached weights and are much faster. For the example multimer, the prediction should take ~2 minutes on an A10 GPU.
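To see whether the weights are already cached, you can list the contents of the cache directory. This sketch assumes the default cache location `~/.boltz` reported by `boltz predict --help`; `list_cached_files` is a hypothetical helper, and it makes no assumption about what the cached files are called.

```python
from pathlib import Path


def list_cached_files(cache_dir="~/.boltz"):
    """Return (file name, size in bytes) pairs for files in the cache directory.

    Returns an empty list if the directory does not exist yet.
    """
    cache = Path(cache_dir).expanduser()
    if not cache.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in cache.iterdir() if p.is_file())


cached = list_cached_files()
if cached:
    for name, size in cached:
        print(f"{name}: {size / 1e6:.1f} MB")
else:
    print("No cached files found; the first prediction will download them.")
```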

## Visualizing the results

In the `output/boltz_results_multimer/predictions/multimer` directory, you should see: 
- `multimer_model_0.pdb` file: predicted structure
- `confidence_multimer_model_0.json` file: confidence scores for the predicted structure
- `plddt_multimer_model_0.npz` file: per-residue pLDDT confidence scores
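A quick way to check what a run produced is to walk the prediction directory. The sketch below is illustrative: `summarize_predictions` is a hypothetical helper, and because the exact keys of the confidence JSON are not described here, it only reports whichever key names it finds.

```python
import json
from pathlib import Path


def summarize_predictions(pred_dir):
    """Return {file name: short description} for every file in pred_dir."""
    summary = {}
    pred_dir = Path(pred_dir)
    if not pred_dir.is_dir():
        return summary
    for path in sorted(pred_dir.iterdir()):
        if not path.is_file():
            continue
        if path.suffix == ".json":
            with open(path) as f:
                content = json.load(f)
            # Report only the key names; their meaning is not assumed here
            if isinstance(content, dict):
                summary[path.name] = sorted(content)
            else:
                summary[path.name] = type(content).__name__
        else:
            summary[path.name] = f"{path.stat().st_size} bytes"
    return summary


summary = summarize_predictions("output/boltz_results_multimer/predictions/multimer")
for name, info in summary.items():
    print(name, "->", info)
```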

In [14]:
# Load the .npz file
npz_file_path = "output/boltz_results_multimer/predictions/multimer/plddt_multimer_model_0.npz"
data = np.load(npz_file_path)
# extract the pLDDT scores
plddt_scores = data['plddt']

# how many scores are there?
print(plddt_scores.shape)

(228,)


Note that the number of pLDDT scores equals the combined length of the input sequences: chain A (112 aa) + chain B (116 aa) = 228 aa.
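Because the scores form one flat array, per-chain statistics require splitting it at the chain boundary. The following sketch assumes the scores are ordered chain A first, then chain B (matching the input file). `split_scores_by_chain` is a hypothetical helper, and the placeholder values stand in for the real `plddt_scores` array, which can be passed in the same way.

```python
def split_scores_by_chain(scores, chain_lengths):
    """Split a flat per-residue score sequence into one slice per chain.

    scores: list or array with one score per residue, chains concatenated in order
    chain_lengths: dict mapping chain ID -> number of residues, in the same order
    """
    total = sum(chain_lengths.values())
    if len(scores) != total:
        raise ValueError(f"expected {total} scores, got {len(scores)}")
    per_chain = {}
    start = 0
    for chain_id, length in chain_lengths.items():
        per_chain[chain_id] = scores[start:start + length]
        start += length
    return per_chain


# Chain lengths from the text above; the score values are placeholders
chain_lengths = {"A": 112, "B": 116}
placeholder_scores = [0.5] * 112 + [0.9] * 116

per_chain = split_scores_by_chain(placeholder_scores, chain_lengths)
for chain_id, chain_scores in per_chain.items():
    mean_score = sum(chain_scores) / len(chain_scores)
    print(chain_id, len(chain_scores), round(mean_score, 2))
```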


In [21]:
# take a look at the scores
plddt_scores

array([0.46982896, 0.55617976, 0.6161727 , 0.62426984, 0.625071  ,
       0.64067435, 0.6172121 , 0.6357447 , 0.679019  , 0.6918191 ,
       0.6303525 , 0.73059523, 0.80158144, 0.81276375, 0.8190963 ,
       0.88588583, 0.9153666 , 0.8892622 , 0.91429484, 0.936362  ,
       0.94336843, 0.93883973, 0.9566945 , 0.95690846, 0.95039654,
       0.97129804, 0.9766546 , 0.96616733, 0.9644903 , 0.9770143 ,
       0.9731744 , 0.9707906 , 0.9719828 , 0.9794927 , 0.98068047,
       0.98146737, 0.9810993 , 0.9639594 , 0.9802025 , 0.98173046,
       0.9791391 , 0.9763158 , 0.97638154, 0.97025955, 0.8907247 ,
       0.9218055 , 0.93321705, 0.94455576, 0.96194464, 0.9781572 ,
       0.97837377, 0.97654164, 0.97614   , 0.9809093 , 0.98262906,
       0.98282546, 0.9779602 , 0.96412313, 0.9740423 , 0.9815273 ,
       0.96756667, 0.9758676 , 0.9721914 , 0.98177   , 0.9814992 ,
       0.9808858 , 0.97542274, 0.96175027, 0.9764862 , 0.96714586,
       0.93430984, 0.9478642 , 0.9740845 , 0.95587224, 0.93942

Now open the `multimer_model_0.pdb` file and take a look at the raw PDB file. 

Split each `ATOM` line on whitespace and count the resulting columns from 0:

- `Column 1`: the atom serial number, starting from 1
- `Column 4`: the chain ID, A or B
- `Column 5`: the residue number, counting from 1 to 112 for chain A and from 1 to 116 for chain B
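The column layout can be checked on a single line. In this sketch the `ATOM` record is made up for illustration (it is not taken from the predicted file), and `parse_atom_line` is a hypothetical helper that relies on whitespace splitting, which works as long as neighboring fields are separated by at least one space.

```python
def parse_atom_line(line):
    """Extract atom serial, chain ID, and residue number from an ATOM line.

    Returns None for lines that are not ATOM records.
    """
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        return None
    return {
        "serial": int(fields[1]),          # column 1: atom serial number
        "chain": fields[4],                # column 4: chain ID
        "residue_number": int(fields[5]),  # column 5: residue number within the chain
    }


# Made-up ATOM record, for illustration only
example_line = "ATOM      1  N   MET A   1      10.000  20.000  30.000  1.00 46.98           N"
parsed = parse_atom_line(example_line)
print(parsed)
```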

Based on this, we will rewrite the `load_protein_boltz` function to visualize the results. 

## Visualizing the results with py3Dmol

In [28]:
import py3Dmol
def load_protein_boltz(pdb_file_path, plddt_file_path, width=800, height=600):
    """
    Load a protein structure from a PDB file and display it using py3Dmol,
    coloring each residue by its pLDDT score
    pdb_file_path: str, path to the PDB file
    plddt_file_path: str, path to the npz file containing the pLDDT scores
    width: int, width of the viewer in pixels
    height: int, height of the viewer in pixels
    return: py3Dmol.view object
    """

    # load the pdb file
    with open(pdb_file_path) as ifile:
        pdb_data = ifile.read()

    # load the plddt scores
    scores = np.load(plddt_file_path)['plddt']

    view = py3Dmol.view(width=width, height=height)
    view.addModelsAsFrames(pdb_data)

    # Residue numbers restart at 1 for each chain, but the scores are one flat
    # array covering all chains. Assuming the scores follow the residue order of
    # the PDB file, map each (chain ID, residue number) pair to its position.
    residue_to_idx = {}

    for line in pdb_data.split("\n"):

        # split each line by columns
        split = line.split()

        # not a valid line, ignore
        if len(split) == 0 or split[0] != "ATOM":
            continue

        # look up (or assign) the 0-based score index of this residue
        residue_key = (split[4], int(split[5]))
        if residue_key not in residue_to_idx:
            residue_to_idx[residue_key] = len(residue_to_idx)
        residue_idx = residue_to_idx[residue_key]

        # get the pLDDT score for the current residue, scale it to 0-100
        plddt_score = scores[residue_idx] * 100

        if plddt_score > 90:
            color = "blue"
        elif plddt_score >= 70:
            color = "cyan"
        elif plddt_score >= 50:
            color = "yellow"
        else:
            color = "orange"

        # atom serial numbers start from 1 and are used as-is in the selection
        idx = int(split[1])

        # Style should be set per atom id
        view.setStyle({'model': -1, 'serial': idx}, {"cartoon": {'color': color}})
    view.zoomTo()
    return view
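The color bands used in the viewer can also be tallied numerically, which gives a quick overview without rendering the structure. This is a sketch that reuses the same thresholds; `plddt_band` is a hypothetical helper, and the sample values are a handful of the scores printed earlier rather than the full array.

```python
from collections import Counter


def plddt_band(score):
    """Map a pLDDT score on the 0-1 scale to the color band used in the viewer."""
    plddt = score * 100
    if plddt > 90:
        return ">90 (blue)"
    if plddt >= 70:
        return "70-90 (cyan)"
    if plddt >= 50:
        return "50-70 (yellow)"
    return "<50 (orange)"


# A few of the scores printed earlier; pass the full plddt_scores array in practice
sample_scores = [0.46982896, 0.55617976, 0.6161727, 0.73059523,
                 0.80158144, 0.9153666, 0.9766546]

band_counts = Counter(plddt_band(s) for s in sample_scores)
for band, count in band_counts.items():
    print(f"{band}: {count} residues")
```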


In [29]:
view = load_protein_boltz(
    pdb_file_path="output/boltz_results_multimer/predictions/multimer/multimer_model_0.pdb", 
    plddt_file_path="output/boltz_results_multimer/predictions/multimer/plddt_multimer_model_0.npz")
view


<py3Dmol.view at 0x7fcd201063f0>

In [2]:
import subprocess

# Define the CLI command and arguments
command = [
    "boltz", 
    "predict", 
    "input/multimer.yaml", 
    "--out_dir", "output", 
    "--devices", "1", 
    "--output_format", "pdb", 
    "--use_msa_server"
]

try:
    # Run the command
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    
    # Print the standard output
    print("Command Output:\n", result.stdout)
    
    # Print the standard error if there is any
    if result.stderr:
        print("Command Error:\n", result.stderr)

except subprocess.CalledProcessError as e:
    print(f"Command failed with return code {e.returncode}")
    print("Error Output:\n", e.stderr)

Command Output:
 Checking input data.
Running predictions for 1 structure
Processing input data.
Generating MSA for input/multimer.yaml with 2 protein entities.

Predicting: |                                                                                                                             | 0/? [00:00<?, ?it/s]
Predicting:   0%|                                                                                                                         | 0/1 [00:00<?, ?it/s]
Predicting DataLoader 0:   0%|                                                                                                            | 0/1 [00:00<?, ?it/s]
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00,  0.05it/s]Number of failed examples: 0

Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00,  0.05it/s]

Com
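With `capture_output=True`, `subprocess.run` holds all output until the command finishes, so a long prediction shows nothing while it runs. One alternative is to stream the output line by line with `subprocess.Popen`. The sketch below illustrates this with a trivial Python child process instead of `boltz`; `run_streaming` is a hypothetical helper, and a command list like the one above could be passed to it in the same way.

```python
import subprocess
import sys


def run_streaming(command):
    """Run a command, echoing each output line as it arrives.

    Returns (exit code, list of output lines). stderr is merged into stdout.
    """
    lines = []
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
            lines.append(line.rstrip("\n"))
    # Leaving the with-block waits for the process, so returncode is set here
    return proc.returncode, lines


# Stand-in command so the sketch runs without Boltz installed
returncode, output_lines = run_streaming(
    [sys.executable, "-c", "print('hello from a child process')"]
)
print("exit code:", returncode)
```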

In [13]:
from datetime import datetime
run_time = datetime.now().strftime("run-date-%y%m%d-time-%H%M")
run_time

'run-date-241201-time-2140'
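The notebook does not show how `run_time` is used afterwards. One plausible use, sketched here as an assumption, is to give each run its own output directory so that repeated predictions do not overwrite each other. `make_run_dir` is a hypothetical helper, and the base directory is a temporary one so the sketch has no side effects on the lab files.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def make_run_dir(base_dir, now=None):
    """Create and return a timestamped run directory under base_dir."""
    now = now or datetime.now()
    run_name = now.strftime("run-date-%y%m%d-time-%H%M")
    run_dir = Path(base_dir) / run_name
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# A fixed timestamp makes the example reproducible; omit `now` to use the current time
base_dir = tempfile.mkdtemp()
run_dir = make_run_dir(base_dir, now=datetime(2024, 12, 1, 21, 40))
print(run_dir.name)
```

The resulting path could then be passed to `--out_dir` in place of the fixed `output` directory.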