# Lab 1.4 A simple research assistant for protein structure prediction

## Background

### Architecture

![Research Assistant Architecture](../images/research-assistant-architecture.png)

Agents are models with three capabilities: planning, memory, and tool calling. 

A group of agents can be assembled into a crew to perform more complex tasks. 

In the architecture diagram above, we build a crew of 5 agents, each performing a different function. 

### Features 
1. This architecture allows users to flexibly add or remove model agents.
   - My initial draft also included code for a third model, IgFold, but I commented it out because of its non-commercial license restriction. If you have access to it, see the Gray Lab's repository [here](https://github.com/Graylab/IgFold) for the additional dependencies to install. 
   - If you add more than one model, consider first deploying them in a separate container, then interacting with them via APIs.
2. The model selection mechanism can be customized based on: 
   - empirical rules (as implemented here), or 
   - dynamically pulling benchmark data from other internal sources (requires additional agents to be designed)
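
As an illustration of the rule-based option, a selection rule can be written as a plain function over the parsed chains. The sketch below is hypothetical: the function name `select_model`, the `max_esmfold_length` cutoff of 400 residues, and the rules themselves (complexes and long chains go to Boltz, everything else to ESMFold) are assumptions made for illustration and are not necessarily the rules used in `research_assistant`.

```python
from typing import Dict


def select_model(chains: Dict[str, str], max_esmfold_length: int = 400) -> str:
    """Pick a model name from simple hand-written rules (illustrative only)."""
    if not chains:
        raise ValueError("at least one chain is required")

    # Assumed rule 1: complexes (more than one chain) go to Boltz.
    if len(chains) > 1:
        return "boltz"

    # Assumed rule 2: long single chains also go to Boltz.
    # The cutoff is an arbitrary placeholder, not a benchmarked value.
    (sequence,) = chains.values()
    if len(sequence) > max_esmfold_length:
        return "boltz"

    return "esmfold"


print(select_model({"A": "MALWMRLLPLLALLALWGPDPAAA"}))
print(select_model({"A": "MALWMRLLPLLALLALWGPDPAAA", "B": "MAFWGAPVLLLALWGPDPAAA"}))
```

Keeping the rules in one small function makes it straightforward to replace them later with a lookup against benchmark data.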


### Limitations

1. This agent is designed to work with one protein at a time. The protein can have one chain or multiple chains. A prompt looks like this: 

    ```bash
    predict the structure of the following protein: 
    >myProteinName chain A
    MALWMRLLPLLALLALWGPDPAAA

    >myProteinName chain B
    MAFWGAPVLLLALWGPDPAAA
    ```
    or something like this: 

    ```bash
    predict the structure of the xxx protein: MAFWGAPVLLLALWGPDPAAA
    ```
2. It is designed to simplify tasks where you want to predict the structure of a protein but aren't sure which model to use or how to format the input.  
3. It is not designed to run predictions in batches. This could potentially be done with a crew of agents, but requires a different design.  
4. Some models, like Boltz, can predict structures for other modalities (e.g. small molecules, RNAs). For simplicity, I did not include this in the current implementation. To add this feature, you will have to modify both the `Preprocess` tool, so that it can handle different modalities, and the `Boltz` tool, so that it formats the input into the correct YAML file.
5. The current implementation uses a sequential process for simplicity. This might not be efficient for this workflow. Crewai provides a [flows feature](https://docs.crewai.com/concepts/flows#unstructured-state-management) that might be better suited. 
6. As of 12/2/24, NIMs in crewai do not support pydantic output. A workaround is to output structured strings (e.g. `str(pydanticObject)`, as implemented here). 
7. In this implementation, we installed the Boltz model directly using `pip`. This should be avoided in production because, as you add more models, you are likely to run into dependency conflicts. A better approach is to deploy the models in a separate container, then interact with them via APIs. 
8. When you input larger complexes or longer sequences, Boltz might run out of memory. In this case, the `predictions` folder will be empty. 
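
The two prompt shapes shown in limitation 1 can be normalized into a chain-to-sequence mapping before any model is called. The following is a minimal sketch of such a step, not the actual `Preprocess` tool: the function name `parse_chains`, the fallback chain ID `A`, the 10-letter minimum for a bare sequence, and the convention that the last word of a header is the chain ID are all assumptions.

```python
import re
from typing import Dict, Optional

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_chains(prompt: str) -> Dict[str, str]:
    """Extract {chain_id: sequence} from a FASTA-like or single-sequence prompt."""
    chains: Dict[str, str] = {}
    current: Optional[str] = None

    for raw in prompt.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            # Assumed header convention: ">name chain A" -> chain ID "A".
            current = line.split()[-1]
            chains[current] = ""
        elif current is not None and line and set(line.upper()) <= AMINO_ACIDS:
            # Sequence lines following a header are appended to that chain.
            chains[current] += line.upper()

    if chains:
        return chains

    # Fallback: no FASTA header, so take the longest run of uppercase
    # amino-acid letters found anywhere in the prompt.
    candidates = re.findall(r"[ACDEFGHIKLMNPQRSTVWY]{10,}", prompt)
    if not candidates:
        raise ValueError("no protein sequence found in prompt")
    return {"A": max(candidates, key=len)}


print(parse_chains("predict the structure of the xxx protein: MAFWGAPVLLLALWGPDPAAA"))
```

The sketch handles only the 20 standard amino-acid letters; supporting other modalities would need a richer parser, as limitation 4 points out.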


## Code walk through

The code used to build the research assistant is in the `research_assistant` directory. We will walk through the code to understand how it works. 

## Run the research assistant

> Important: crewai projects have their own dependencies and project folders.

> For the rest of this notebook, all commands should be executed inside the `research_assistant` directory. 

### Set up env file

Run the following command to copy the `.env` file from the main project folder to the `research_assistant` folder:

In [2]:
! cp ../.env ../research_assistant/.env

### Preparing for the environment

> Crewai uses its own virtual environment. We need to set it up first. 

To start: 

1. Open a terminal and `cd` into the `research_assistant` directory. **All crewai commands should be executed in this directory.**
    ```bash
    cd /home/ubuntu/2025-01-biologic-summit-workshop/research_assistant
    ```
2. Build the virtual environment of crewai first. Note that you must be in the `research_assistant` directory to run this command: 
    ```bash
    crewai install
    ```
3. Open the `research_assistant/.venv/pyvenv.cfg` file and set this line to `true` so that crewai can use the packages from the conda environment: 
    ```bash
    include-system-site-packages = true
    ```
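
If you prefer not to edit the file by hand, the same change can be scripted. This is a minimal sketch using only the standard library; the helper name `enable_system_site_packages` is hypothetical, and it assumes `pyvenv.cfg` uses the usual `key = value` lines.

```python
from pathlib import Path


def enable_system_site_packages(cfg_path: str) -> bool:
    """Set `include-system-site-packages = true` in a pyvenv.cfg file.

    Returns True if the file was modified, False if it was already set.
    """
    path = Path(cfg_path)
    lines = path.read_text().splitlines()
    found = False
    changed = False

    for i, line in enumerate(lines):
        key, sep, value = line.partition("=")
        if sep and key.strip() == "include-system-site-packages":
            found = True
            if value.strip().lower() != "true":
                lines[i] = "include-system-site-packages = true"
                changed = True

    # Append the setting if the key is missing entirely.
    if not found:
        lines.append("include-system-site-packages = true")
        changed = True

    if changed:
        path.write_text("\n".join(lines) + "\n")
    return changed
```

From inside the `research_assistant` directory, it would be called as `enable_system_site_packages(".venv/pyvenv.cfg")` after `crewai install` has created the virtual environment.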

### Running the research assistant

1. Open `research_assistant/src/research_assistant/main.py`. This file contains the code to interact with the research assistant. It defines 3 example inputs. Change this line to swap between them: 

    ```python
        inputs = {
            'inputs': example_input3 # change this line to swap between the 3 inputs
        }
    ```
2. Read the comments for each sequence and the expected processing/selection results. You can also change the sequence in each example input to test your own sequence. 

3. Back in the terminal, make sure you are still in the `research_assistant` directory: 
    ```bash
    cd /home/ubuntu/2025-01-biologic-summit-workshop/research_assistant
    ```
    and run the following command to kick off the research assistant: 
    ```bash
    crewai run
    ```


## Analyze the output

Navigate to the `research_assistant/output` directory to see the output of the research assistant. 

> Note: the research assistant writes its output to the `research_assistant/output` directory, NOT the `2025-01-biologic-summit-workshop/output` directory used in the previous notebook!
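
As noted under Limitations, a Boltz run that ran out of memory leaves its predictions folder empty, so it helps to check for PDB files before trying to visualize anything. The sketch below uses only the standard library. The helper names `latest_run` and `find_predictions` are hypothetical, and the code assumes that run folders are named `run-date-YYMMDD-time-HHMM` (as in the paths used later in this notebook), so that sorting by name is chronological.

```python
from pathlib import Path
from typing import List, Optional


def latest_run(result_dir: str) -> Optional[Path]:
    """Return the most recent `run-date-*` folder, or None if there is none."""
    root = Path(result_dir)
    if not root.is_dir():
        return None
    # Assumed naming scheme: run-date-YYMMDD-time-HHMM sorts chronologically.
    runs = sorted(p for p in root.glob("run-date-*") if p.is_dir())
    return runs[-1] if runs else None


def find_predictions(run_dir: str) -> List[Path]:
    """Return every .pdb file below run_dir; an empty list signals a failed run."""
    return sorted(Path(run_dir).rglob("*.pdb"))


if __name__ == "__main__":
    # Path is relative to the research_assistant directory.
    run = latest_run("output/boltz_result")
    if run is None:
        print("no runs found")
    else:
        pdbs = find_predictions(str(run))
        print(f"{run.name}: {len(pdbs)} PDB file(s)")
```

An empty result for a Boltz run is a hint to retry with a shorter sequence or on a machine with more memory.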

### Visualize the ESMFold prediction

In [3]:
import py3Dmol
def load_protein_esmfold(pdb_file_path, width=800, height=600):

    """
    Load a protein structure from a PDB file and display it using py3Dmol.
    pdb_file_path: str, path to the PDB file
    width: int, width of the viewer in pixels
    height: int, height of the viewer in pixels
    return: py3Dmol.view object
    """
    
    with open(pdb_file_path) as ifile:
        pdb_data = ifile.read()
    
    view = py3Dmol.view(width=width, height=height)
    view.addModelsAsFrames(pdb_data)
    
    for line in pdb_data.split("\n"):
        split = line.split()
        if len(split) == 0 or split[0] != "ATOM":
            continue
        # The B-factor occupies fixed columns 61-66 of a PDB ATOM record; slicing by
        # column is more robust than whitespace splitting, which breaks when fields run together
        b_factor = float(line[60:66])
        if b_factor > 90:
            color = "blue"
        elif 70 <= b_factor <= 90:
            color = "cyan"
        elif 50 <= b_factor < 70:
            color = "yellow"
        else:
            color = "orange"
        
        # Atom serial numbers typically start from 1, hence idx should be used directly
        idx = int(split[1])
        
        # Style should be set per atom id
        view.setStyle({'model': -1, 'serial': idx}, {"cartoon": {'color': color}})
    view.zoomTo()
    return view
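
PDB `ATOM` records are fixed-width, so the fields used by the function above (atom serial, chain ID, residue number, B-factor) can also be read by column position rather than by whitespace splitting. The sketch below shows this together with the color bands used in this notebook. The names `AtomRecord`, `parse_atom_line`, and `plddt_color` are hypothetical helpers, and the example line is synthetic.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    serial: int
    chain_id: str
    residue_number: int
    b_factor: float


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one ATOM record using the fixed PDB column layout."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return AtomRecord(
        serial=int(line[6:11]),           # columns 7-11
        chain_id=line[21],                # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        b_factor=float(line[60:66]),      # columns 61-66
    )


def plddt_color(score: float) -> str:
    """Map a 0-100 pLDDT score to the color bands used in this notebook."""
    if score > 90:
        return "blue"
    if score >= 70:
        return "cyan"
    if score >= 50:
        return "yellow"
    return "orange"


# Synthetic ATOM line built with the fixed column widths.
example = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    1, " N", "MET", "A", 1, 27.340, 24.430, 2.614, 1.00, 85.50
)
record = parse_atom_line(example)
print(record, plddt_color(record.b_factor))
```

Separating parsing from coloring also makes each piece easy to test without opening a viewer.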

In [11]:
import os

# ESMFold output from the research assistant 
# this is the output folder inside the research_assistant, NOT the output folder in the main project directory!!!
esmfold_pdb_fp = "../research_assistant/output/esmfold_result/run-date-241203-time-2351/BSA.pdb"


load_protein_esmfold(esmfold_pdb_fp)


<py3Dmol.view at 0x7d0b47f1cfe0>

### Visualize the Boltz prediction

In [4]:
import os

import numpy as np
import py3Dmol

def load_protein_boltz(pdb_file_path, plddt_file_path, width=800, height=600, color_by_chain=False):

    """
    Load a protein structure from a PDB file and display it using py3Dmol
    pdb_file_path: str, path to the PDB file
    plddt_file_path: str, path to the npz file containing the pLDDT scores
    width: int, width of the viewer in pixels
    height: int, height of the viewer in pixels
    color_by_chain: bool, whether to color the chains differently. Defaults to False.
    return: py3Dmol.view object
    """
    
    # load the pdb file
    with open(pdb_file_path) as ifile:
        pdb_data = ifile.read()

    # load the plddt scores
    scores = np.load(plddt_file_path)['plddt']
    
    view = py3Dmol.view(width=width, height=height)
    view.addModelsAsFrames(pdb_data)
    
    # if not coloring by chain, then color by pLDDT scores
    if not color_by_chain:
        print("color by pLDDT scores....")
        for line in pdb_data.split("\n"):

            # split each line by columns
            split = line.split()

            # not a valid line, ignore
            if len(split) == 0 or split[0] != "ATOM":
                continue

            # get residue id, pdb is 1-indexed, python is 0-indexed, therefore -1 to convert
            residue_idx = int(split[5]) - 1 

            # get the pLDDT score for the current residue, scale it to 0-100
            plddt_score = scores[residue_idx] * 100

            if plddt_score > 90:
                color = "blue"
            elif 70 <= plddt_score <= 90:
                color = "cyan"
            elif 50 <= plddt_score < 70:
                color = "yellow"
            else:
                color = "orange"
            
            # Atom serial numbers typically start from 1, similar to requirement of `view.setStyle`, hence idx should be used directly
            idx = int(split[1])
            
            # Style should be set per atom id
            view.setStyle({'model': -1, 'serial': idx}, {"cartoon": {'color': color}})

    # else color by chain
    else:
        print("color by chain....")
        chain_ids = set()
        # collect the distinct chain IDs present in the structure
        for line in pdb_data.split("\n"):
            split = line.split()

            # not an ATOM record, ignore
            if len(split) == 0 or split[0] != "ATOM":
                continue

            # the chain ID is the 5th whitespace-separated field
            chain_ids.add(split[4])

        # assign palette colors in sorted chain order so the coloring is
        # reproducible across runs; the palette wraps around after 10 chains
        palette = ["blue", "red", "yellow", "orange", "purple", "brown", "pink", "green", "cyan", "magenta"]
        for idx, cid in enumerate(sorted(chain_ids)):
            view.setStyle({'chain': cid}, {'cartoon': {'color': palette[idx % len(palette)]}})
    view.zoomTo()
    return view

In [13]:
# Boltz output from the research assistant 
boltz_pdb_fp = "../research_assistant/output/boltz_result/run-date-241203-time-2351/boltz_results_BSA/predictions/BSA/BSA_model_0.pdb"
boltz_plddt_fp = "../research_assistant/output/boltz_result/run-date-241203-time-2351/boltz_results_BSA/predictions/BSA/plddt_BSA_model_0.npz"

load_protein_boltz(boltz_pdb_fp, boltz_plddt_fp, color_by_chain=False)

color by pLDDT scores....


<py3Dmol.view at 0x7d0b4c733050>