# Lab 1.1 Using the Hosted API from the ESMFold NIM Playground to predict protein structure

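[ESMFold](https://www.science.org/doi/10.1126/science.ade2574) is a state-of-the-art protein structure prediction model developed by Meta. It predicts a structure directly from a single amino-acid sequence, without relying on multiple sequence alignments (MSAs) or structural templates. It accepts single-chain proteins only.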
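
Because the model takes one single-chain sequence at a time, it can help to sanity-check the input before sending it. The sketch below is illustrative only: `validate_sequence` is a hypothetical helper (not part of the NIM API), and it assumes the input should contain only the 20 standard one-letter amino-acid codes. The hosted service may accept or reject other characters.

```python
# Hypothetical helper, not part of the ESMFold NIM API.
# Assumption: only the 20 standard one-letter amino-acid codes are allowed.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence: str) -> str:
    """Return a cleaned, upper-case sequence or raise ValueError."""
    # Drop spaces and newlines, then normalize case
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("Sequence is empty")
    invalid = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    if invalid:
        # Chain separators such as ':' or '/' are rejected here as well
        raise ValueError(f"Unsupported characters in sequence: {invalid}")
    return cleaned


print(validate_sequence("qvqlq esgg\n"))  # QVQLQESGG
```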

## Helper functions

In [1]:
import os
import shutil

import requests
from dotenv import load_dotenv
from loguru import logger

# Load environment variables (e.g. the API key) from a local .env file
load_dotenv()

True

### Directory setup

In [2]:
def preprare_directory(temp, delete_old=True):
    """
    Create a directory, optionally deleting any existing one first
    :param temp: str, path to the directory
    :param delete_old: bool, whether to delete the existing directory and its contents. Defaults to True.
    """
    if delete_old and os.path.exists(temp):
        # Remove the directory and all its contents
        shutil.rmtree(temp)
    # Create the directory (no-op if it already exists)
    os.makedirs(temp, exist_ok=True)

### Interact with the hosted API endpoint at ESMFold Playground 

The ESMFold playground can be accessed [here](https://build.stg.ngc.nvidia.com/meta/esmfold?snippet_tab=Python).


In [3]:
class ESMFoldPlayground:
    def __init__(self, NGC_API_KEY, query_url=None):
        """
        Initialize the ESMFoldPlayground class
        NGC_API_KEY: str, the API key to use
        query_url: str, the url to send the request to, default is the ESMFold NIM endpoint
        """
        self.NGC_API_KEY = NGC_API_KEY
        self.query_url = query_url if query_url is not None else "https://health.api.nvidia.com/v1/biology/nvidia/esmfold"

    
    def predict(self, sequence, output_dir=None, output_file_name="predicted_protein.pdb", delete_old_dir=True):
        """
        Send a single sequence to the ESMFold endpoint and return the predicted structure
        sequence: str, a single amino-acid sequence (one chain)
        output_dir: str, the directory to save the output to. If delete_old_dir is True, any existing contents are deleted and the directory is recreated. Defaults to None, in which case the output PDB file is not saved.
        output_file_name: str, the name of the output PDB file. Defaults to "predicted_protein.pdb". Only used when output_dir is not None.
        delete_old_dir: bool, whether to delete the existing output directory. Defaults to True.
        return: the JSON response as a dict
        """

        # prepare output directory
        if output_dir is not None:
            preprare_directory(output_dir, delete_old=delete_old_dir)
        
        # prepare data
        data = {
            "sequence": sequence,
        }

        # prepare headers
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.NGC_API_KEY}"
        }
        
        # send request
        response = requests.post(self.query_url, headers=headers, json=data)
        
        # check response
        if response.status_code == 200:
            logger.success("Request successful")
            # get the result
            result = response.json()
            # Write PDB file
            if output_dir is not None:
                fp = os.path.join(output_dir, output_file_name)
                with open(fp, "w") as f:
                    f.write(result["pdbs"][0])
        else:
            logger.error(f"Request failed with status code {response.status_code}. Output file will not be saved.")
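            # loguru formats messages with str.format, so the placeholder is required for the text to appear
            logger.error("Response: {}", response.text)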
            
        return response.json()

## Try out the hosted API endpoint

### Initialize the ESMFoldPlayground class

In [4]:
# get NGC API key
NGC_API_KEY = os.getenv("NVIDIA_NIM_API_KEY")

# initialize the ESMFoldPlayground class
esmfold_playground = ESMFoldPlayground(
    NGC_API_KEY=NGC_API_KEY
)
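
`os.getenv` returns `None` when the variable is missing, in which case the request would likely only fail later with an authentication error. A minimal sketch of an early check follows. `require_env` is a hypothetical helper name, and `NVIDIA_NIM_API_KEY` is the variable name used in this lab.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file "
            "or export it before running this notebook."
        )
    return value


# Example (raises RuntimeError if the key is missing):
# NGC_API_KEY = require_env("NVIDIA_NIM_API_KEY")
```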

### Running the prediction

In [5]:
# output directory
output_dir = "output/esmfold_result"

# output file name inside the output directory
output_file_name = "predicted_protein.pdb"

# first prepare the output directory by deleting the old one if it exists
preprare_directory(output_dir, delete_old=True)


In [6]:
%%time 

# source of VHH sequence: sdAb_5763_Ca from SdAb-Db: https://www.sdab-db.ca/?Display&ID=sdAb_5763_Ca
VHH_seq = "QVQLQESGGGLVQAGGSLRLSCAASGTISPLPAMGWYRQAPGKEREFVAGIDTGAITNYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCAVFPAAYDYYERYYTYWGQGTQVTVSS"

# run prediction
result = esmfold_playground.predict(
    sequence=VHH_seq,
    output_dir=output_dir,
    output_file_name=output_file_name,
    delete_old_dir=True # delete the old directory if it exists
)


2024-12-04 15:54:19.660 | SUCCESS  | __main__:predict:42 - Request successful


CPU times: user 1.92 ms, sys: 4.8 ms, total: 6.71 ms
Wall time: 1.08 s


### Analyze the result

In [7]:
result.keys()

dict_keys(['pdbs'])

In [8]:
# only one sequence is accepted per request, so exactly one PDB is returned
assert len(result["pdbs"]) == 1
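
Since each request carries one sequence, several proteins have to be submitted one at a time. The sketch below shows one way to loop over named sequences. `predict_many` is a hypothetical helper, and `predict_fn` stands in for any callable that takes a sequence and returns a dict with a `pdbs` list, like the response above. A fake predictor is used so the example runs offline.

```python
from typing import Callable, Dict


def predict_many(predict_fn: Callable[[str], dict], sequences: Dict[str, str]) -> Dict[str, str]:
    """Call predict_fn once per sequence and collect the single PDB string of each response."""
    results = {}
    for name, seq in sequences.items():
        response = predict_fn(seq)
        pdbs = response.get("pdbs", [])
        if len(pdbs) != 1:
            # Skip responses that do not look like the successful one shown above
            print(f"Skipping {name}: unexpected response keys {list(response.keys())}")
            continue
        results[name] = pdbs[0]
    return results


# Stand-in for the real API call so the sketch runs without network access
def fake_predict(seq: str) -> dict:
    return {"pdbs": [f"REMARK fake structure for {len(seq)} residues"]}


print(predict_many(fake_predict, {"a": "QVQ", "b": "ESGGG"}))
```

With the client from this lab, `predict_fn` could be `lambda s: esmfold_playground.predict(sequence=s)`, which leaves `output_dir` at `None` so nothing is written to disk.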

In [9]:
# the predicted structure is returned as a PDB-formatted string, which can be written to a text file with a `.pdb` extension
print(result['pdbs'][0])


PARENT N/A                                                                      
MODEL     1                                                                     
ATOM      1  N   GLN A   1      -8.799 -19.340  -7.255  1.00 72.44           N  
ATOM      2  CA  GLN A   1      -8.818 -18.523  -6.046  1.00 73.39           C  
ATOM      3  C   GLN A   1      -7.953 -17.276  -6.211  1.00 74.37           C  
ATOM      4  CB  GLN A   1      -8.342 -19.336  -4.841  1.00 68.76           C  
ATOM      5  O   GLN A   1      -6.820 -17.359  -6.689  1.00 70.67           O  
ATOM      6  CG  GLN A   1      -8.819 -18.787  -3.502  1.00 63.29           C  
ATOM      7  CD  GLN A   1      -8.388 -19.647  -2.329  1.00 61.69           C  
ATOM      8  NE2 GLN A   1      -8.805 -19.262  -1.127  1.00 50.77           N  
ATOM      9  OE1 GLN A   1      -7.687 -20.650  -2.501  1.00 60.58           O  
ATOM     10  N   VAL A   2      -8.425 -16.021  -5.933  1.00 79.17           N  
ATOM     11  CA  VAL A   2  

### Visualize the results

We can now inspect the predicted protein structure and its pLDDT confidence scores. The scores are stored in the B-factor column of the returned PDB.
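
Because the pLDDT values sit in the B-factor field, they can be read straight from the PDB text. The sketch below is a minimal example that assumes standard fixed-width PDB columns (atom name in columns 13-16, B-factor in columns 61-66) and takes the C-alpha atom as the representative for each residue. `plddt_per_residue` is a hypothetical helper, and the two-atom snippet is copied from the output printed above.

```python
def plddt_per_residue(pdb_text: str) -> list:
    """Return the B-factor (pLDDT) of each C-alpha atom, in file order."""
    scores = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, B-factor 61-66
        if line.startswith("ATOM") and len(line) >= 66 and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


# Two atoms from the prediction printed earlier in this lab
example = (
    "ATOM      1  N   GLN A   1      -8.799 -19.340  -7.255  1.00 72.44           N  \n"
    "ATOM      2  CA  GLN A   1      -8.818 -18.523  -6.046  1.00 73.39           C  \n"
)
scores = plddt_per_residue(example)
print(scores)                     # [73.39]
print(sum(scores) / len(scores))  # mean pLDDT over the residues found
```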

To align with the [EBI/AlphaFold database](https://alphafold.ebi.ac.uk/), we use the following colors for the pLDDT score:

![plddt](https://res.cloudinary.com/dpfqlyh21/image/upload/v1705026011/obsidian/izrfmiepbzpnzm2aoqwh.png)
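
The color bands can also be written as a small lookup, which makes it easy to summarize how many atoms or residues fall into each band. The sketch below uses the thresholds applied in this lab (above 90, 70 to 90, 50 to 70, below 50) with plain color names. `plddt_band` and the band labels are illustrative, not an official API.

```python
from collections import Counter


def plddt_band(score: float) -> tuple:
    """Map a pLDDT score (0-100) to an illustrative (label, color) pair."""
    if score > 90:
        return ("very high", "blue")
    if score >= 70:
        return ("confident", "cyan")
    if score >= 50:
        return ("low", "yellow")
    return ("very low", "orange")


# pLDDT values of a few atoms from the prediction printed earlier in this lab
scores = [72.44, 73.39, 74.37, 68.76, 50.77]
print(Counter(plddt_band(s)[0] for s in scores))  # Counter({'confident': 3, 'low': 2})
```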

We display the predicted structure with these colors using py3Dmol.

In [10]:
import py3Dmol
def load_protein_esmfold(pdb_file_path, width=800, height=600):

    """
    Load a protein structure from a PDB file and display it using py3Dmol.
    pdb_file_path: str, path to the PDB file
    width: int, width of the viewer in pixels
    height: int, height of the viewer in pixels
    return: py3Dmol.view object
    """
    
    with open(pdb_file_path) as ifile:
        pdb_data = ifile.read()
    
    view = py3Dmol.view(width=width, height=height)
    view.addModelsAsFrames(pdb_data)
    
    for line in pdb_data.split("\n"):
        # Only ATOM records carry per-atom scores; skip short or malformed lines
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        # PDB is a fixed-width format: the B-factor (here, pLDDT) occupies columns 61-66.
        # Slicing by column is more robust than whitespace splitting, which breaks when adjacent fields touch.
        b_factor = float(line[60:66])
        if b_factor > 90:
            color = "blue"
        elif b_factor >= 70:
            color = "cyan"
        elif b_factor >= 50:
            color = "yellow"
        else:
            color = "orange"
        
        # The atom serial number occupies columns 7-11 and starts from 1
        idx = int(line[6:11])
        
        # Style should be set per atom id
        view.setStyle({'model': -1, 'serial': idx}, {"cartoon": {'color': color}})
    view.zoomTo()
    return view

In [11]:
view = load_protein_esmfold(os.path.join(output_dir, output_file_name))
view.show()