# **Predict and Visualize Protein Structures with AlphaFold and LocalColabFold**

## **Overview**
This notebook provides a guide to visualizing protein structures predicted by **AlphaFold** using **LocalColabFold** within an **AWS SageMaker Notebook Instance**. AlphaFold, developed by DeepMind, is a revolutionary AI system that predicts protein structures with remarkable accuracy. LocalColabFold makes AlphaFold accessible for local execution, and this notebook will walk you through setting up your environment, running a sample prediction, and interactively visualizing the resulting 3D protein structures using the **py3Dmol** library. This visualization allows for a deeper understanding of the predicted protein conformations.

---

## **Learning Objectives**
By the end of this notebook, you will be able to:
- Understand the basics of **AlphaFold** and its significance in protein structure prediction.
- Learn about **LocalColabFold** and its advantages for local protein structure prediction.
- Set up an **AWS SageMaker Notebook Instance** with necessary configurations for LocalColabFold.
- Install **LocalColabFold** on your SageMaker instance.
- Run a protein structure prediction using **LocalColabFold**.
- Use **py3Dmol** to interactively visualize the predicted protein structures in 3D within the notebook.
- Use **ipywidgets** to build an interactive viewer for exploring multiple predicted protein structures.

---

## **Tasks to Complete**
1. **Set up AWS SageMaker Notebook Instance:**
    - Launch an AWS SageMaker Notebook Instance with a GPU-enabled instance type.
    - Configure storage for the instance.

2. **Install LocalColabFold:**
    - Open a terminal within the SageMaker Notebook Instance.
    - Download and execute the `install_colabbatch_linux.sh` script to install LocalColabFold.
    - Configure environment variables as instructed after installation.

3. **Run LocalColabFold Prediction:**
    - Execute a sample protein structure prediction using the `colabfold_batch` command in the terminal.
    - Observe the output files generated in the `results/` directory.

4. **Visualize Prediction Results:**
    - Install the `py3Dmol` library within the notebook.
    - Utilize the provided Python code to:
        - Load and parse PDB files generated by LocalColabFold.
        - Create an interactive 3D viewer using `py3Dmol`.
        - Implement a dropdown widget using `ipywidgets` to select and visualize different predicted protein structures (ranked by confidence).

---

## **Prerequisites**
Before running this notebook, ensure you have the following:
- **An AWS Account:** Required to launch and manage an AWS SageMaker Notebook Instance.
- **Basic understanding of cloud computing and AWS SageMaker.**
- **Familiarity with Linux command line interface.**
- **Basic knowledge of protein structure and AlphaFold (recommended).**
- **Python libraries:** `ipywidgets` and `py3Dmol` (installation of `py3Dmol` is included in the notebook). The `glob` and `re` modules used here are part of the Python standard library and need no installation.

---

## **Get Started**
1. **Launch an AWS SageMaker Notebook Instance:** Follow the instructions in the "Setup an AWS SageMaker Notebook Instance" section to create and configure your instance, ensuring you select a GPU-enabled instance type (e.g., `ml.g4dn.xlarge` or `ml.p3.2xlarge` for CUDA support) with 50 GB of storage.
2. **Open a Terminal:** Open a terminal within your SageMaker Notebook Instance environment ("File" -> "New" -> "Terminal").
3. **Install LocalColabFold:** Execute the installation commands provided in the "Install LocalColabFold" section in the terminal.
4. **Run Protein Prediction:** Follow the "Run LocalColabFold" section to execute a sample prediction from the terminal.
5. **Open and Run the Notebook:** Open this notebook within your SageMaker Notebook Instance, select the `conda_tensorflow2_p310` kernel, and execute the cells sequentially, starting from the "Visualization of Prediction Results" section, to visualize the predicted protein structures interactively.

---

## What is AlphaFold?
Developed by DeepMind (a subsidiary of Alphabet/Google), [AlphaFold](https://deepmind.google/technologies/alphafold/) is an artificial intelligence (AI) system that predicts the 3D structure of proteins from their amino acid sequences. It addresses the long-standing "protein folding problem," which has puzzled scientists for decades. Proteins are essential to nearly all biological processes, and their functions are determined by their intricate 3D shapes. Traditional methods like X-ray crystallography or cryo-EM are time-consuming and costly, making computational prediction a game-changer.

## How AlphaFold Works
AlphaFold uses deep learning and neural networks trained on:
- Protein Data Bank (PDB): A repository of experimentally determined protein structures.
- Evolutionary Data: Multiple sequence alignments (MSAs) to infer evolutionary relationships.
- Physical Constraints: Geometric and chemical rules (e.g., bond angles, steric clashes).

The system employs a transformer-based architecture to model interactions between amino acids, generating highly accurate predictions (often near-experimental accuracy).
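
To make the "evolutionary data" input more concrete, the sketch below scores each column of a tiny multiple sequence alignment by how conserved it is. The alignment is made up for illustration, and `column_conservation` is a hypothetical helper, not part of AlphaFold; AlphaFold feeds the full MSA into its network rather than computing a score like this. The point is only to show the kind of signal an MSA carries: positions that barely change across homologs are likely to matter for structure or function.

```python
from collections import Counter

# Toy alignment (made-up sequences, one row per homolog, '-' marks a gap).
toy_msa = [
    "GIVEQCCTSICSLYQ",
    "GIVEQCCASVCSLYQ",
    "GIVDQCCTSICSLYQ",
    "GIVEQCC-SICSLFQ",
]

def column_conservation(msa):
    """Return, per column, the fraction of sequences sharing the most common residue."""
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa != "-"]  # gaps are not residues
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        # Divide by the number of sequences so gaps lower the score.
        scores.append(most_common_count / len(column))
    return scores

conservation = column_conservation(toy_msa)
for position, score in enumerate(conservation, start=1):
    print(f"column {position:2d}: {score:.2f}")
```

Fully conserved columns score 1.0, while columns with substitutions or gaps score lower.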

## Applications of AlphaFold
- Drug Discovery:
    - Accelerates identification of drug targets by predicting structures of disease-related proteins (e.g., cancer, Alzheimer’s).
    - Enables structure-based drug design (e.g., targeting SARS-CoV-2 spike protein).
- Understanding Genetic Diseases:
    - Predicts how mutations (e.g., in cystic fibrosis or sickle cell anemia) disrupt protein function.
- Enzyme Engineering:
    - Designs enzymes for industrial applications (e.g., biofuel production, plastic degradation).
- Synthetic Biology:
    - Facilitates creation of artificial proteins for novel functions.
- Antibiotic Development:
    - Predicts structures of bacterial proteins to combat antibiotic resistance.
- Basic Research:
    - Provides structural insights for poorly characterized proteins, expanding biological knowledge.

AlphaFold’s success was validated at CASP14 (the 14th Critical Assessment of protein Structure Prediction, held in 2020), where it achieved unprecedented accuracy, approaching that of experimental methods for many targets.

# LocalColabFold: Democratizing AlphaFold’s Power
## What is LocalColabFold?
[LocalColabFold](https://github.com/YoshitakaMo/localcolabfold) is an open-source, community-driven adaptation of ColabFold, which itself combines AlphaFold with faster, user-friendly tools. It allows researchers to run protein structure predictions locally (on their own hardware) without relying on cloud services like Google Colab.

Key Features:
- Accessibility:
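    - Runs the structure prediction on your own hardware instead of a hosted notebook service.
    - Suits sensitive data (e.g., proprietary or medical sequences). Note that, by default, `colabfold_batch` still sends query sequences to the public ColabFold MMseqs2 server to build MSAs, so a local MSA database is needed to keep sequences fully in-house.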
- Speed & Efficiency:
    - Uses MMseqs2 (instead of the jackhmmer and HHblits searches in AlphaFold's original pipeline) for rapid multiple sequence alignments (MSAs).
    - Reduced computational footprint compared to AlphaFold’s original implementation.
- Ease of Use:
    - Simplified setup via Conda or Docker.
    - Compatible with GPUs for faster predictions.
## Applications of LocalColabFold
- Academic Research:
    - Enables small labs to predict structures for hypothesis testing.
    - Useful for teaching structural biology concepts.
- Personalized Medicine:
    - Predicts structures of patient-specific protein variants.
- Structural Genomics:
    - Scales predictions for large protein datasets (e.g., metagenomic studies).
- Collaborative Projects:
    - Integrates with high-performance computing (HPC) clusters for batch processing.
## Limitations
- **Computational Resources**: LocalColabFold still requires a GPU for optimal performance.
- **Accuracy**: Slightly lower than AlphaFold for certain proteins due to simplified MSAs.
- **Multimer Support**: Early versions struggled with protein complexes, but updates have improved this.



## Install LocalColabFold

Open a terminal via "File" -> "New" -> "Terminal". In the newly opened Terminal tab, run the commands below to download `install_colabbatch_linux.sh` from the LocalColabFold repository and execute it.
```bash
cd  /home/ec2-user/
wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_linux.sh
bash install_colabbatch_linux.sh
```

If installation is successful, you will see the following message:

```text
Installation of ColabFold finished.
Add /home/ec2-user/localcolabfold/colabfold-conda/bin to your PATH environment variable to run 'colabfold_batch'.
i.e. for Bash:
        export PATH="/home/ec2-user/localcolabfold/colabfold-conda/bin:$PATH"
For more details, please run 'colabfold_batch --help'.
```


Add environment variables to ~/.bashrc by running the following commands from the terminal.
```bash
echo 'export PATH="/home/ec2-user/localcolabfold/colabfold-conda/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/home/ec2-user/localcolabfold/colabfold-conda/lib/:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
```
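
A quick way to confirm the `PATH` change took effect is to look up the executable from Python. The sketch below is a minimal check, assuming the install location printed by the installer above; `find_colabfold_batch` is a hypothetical helper name. It also searches the install directory explicitly, which may help inside a notebook kernel that was started before `~/.bashrc` was updated.

```python
import os
import shutil

# Install location taken from the installer message above.
COLABFOLD_BIN = "/home/ec2-user/localcolabfold/colabfold-conda/bin"

def find_colabfold_batch(extra_dir=COLABFOLD_BIN):
    """Return the full path of colabfold_batch, or None if it cannot be found."""
    # Search the install directory first, then whatever PATH this process inherited.
    search_path = os.pathsep.join([extra_dir, os.environ.get("PATH", "")])
    return shutil.which("colabfold_batch", path=search_path)

location = find_colabfold_batch()
if location:
    print(f"colabfold_batch found at {location}")
else:
    print("colabfold_batch not found; re-check the installation and PATH steps.")
```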

## Run LocalColabFold

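Run the following commands from the terminal. Run them in `/home/ec2-user/` so that the output lands in `/home/ec2-user/results/`, the directory the visualization code reads from.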

```bash
echo ">Example_Protein|ChainA" > example.fasta 
echo "GIVEQCCTSICSLYQLENYCN" >> example.fasta
colabfold_batch --templates --amber example.fasta results/
ls -l results/
```
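
The same steps can also be scripted, which is convenient when preparing many inputs. The following is a sketch under stated assumptions: `write_fasta` and `build_command` are hypothetical helpers, and the residue check is deliberately strict (the 20 standard amino acids, single chain only). The command it assembles uses the same flags as the terminal call above and is only printed unless the last line is uncommented.

```python
import subprocess
from pathlib import Path

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def write_fasta(path, header, sequence):
    """Write a single-sequence FASTA file after checking the residue letters."""
    sequence = sequence.strip().upper()
    unknown = set(sequence) - VALID_RESIDUES
    if not sequence or unknown:
        raise ValueError(f"invalid sequence; unexpected characters: {sorted(unknown)}")
    Path(path).write_text(f">{header}\n{sequence}\n")
    return Path(path)

def build_command(fasta_path, out_dir):
    """Assemble the same colabfold_batch call used in the terminal above."""
    return ["colabfold_batch", "--templates", "--amber", str(fasta_path), str(out_dir)]

fasta = write_fasta("example.fasta", "Example_Protein|ChainA", "GIVEQCCTSICSLYQLENYCN")
command = build_command(fasta, "results/")
print(" ".join(command))

# Uncomment to launch the prediction (needs colabfold_batch on PATH and a GPU):
# subprocess.run(command, check=True)
```

Validating the sequence before launching avoids spending GPU time on an input that contains a typo.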


## Visualization of Prediction Results

```python
# Install the py3Dmol library for 3D molecular visualization in Jupyter notebooks.
%pip install py3Dmol
```

### Visualize multiple PDB files using py3Dmol and Interactive Dropdown Viewer

```python
# Standard-library helpers for finding files and parsing their names.
import glob
import os
import re

# py3Dmol renders the 3D structure; ipywidgets provides the dropdown.
import py3Dmol
from ipywidgets import interact, Dropdown

# Directory where colabfold_batch wrote its output (PDB files).
results_dir = "/home/ec2-user/results/"

# 1. Collect the predicted PDB files and sort them by rank.
#    The rank is parsed from the "rank_<number>" part of each filename.
pdb_files = sorted(
    glob.glob(os.path.join(results_dir, "*_rank_*_alphafold2_ptm_model_*.pdb")),
    key=lambda path: int(re.search(r"rank_(\d+)", path).group(1)),
)
if not pdb_files:
    raise FileNotFoundError(f"No ranked PDB files found in {results_dir}")

# 2. Map short display names (file names) to full paths for the dropdown.
pdb_dict = {os.path.basename(path): path for path in pdb_files}

# 3. Redraw the viewer whenever a different file is selected in the dropdown.
@interact(File=Dropdown(options=pdb_dict))
def show_structure(File):
    # Create a py3Dmol view with a fixed width and height (in pixels).
    view = py3Dmol.view(width=600, height=400)
    # Load the selected PDB file into the view.
    with open(File) as f:
        view.addModel(f.read(), "pdb")
    # Draw the backbone as a cartoon, rainbow-colored along the chain.
    view.setStyle({"cartoon": {"color": "spectrum"}})
    # Zoom to fit the structure, then render it in the cell output.
    view.zoomTo()
    view.show()
```
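
Beyond looking at the structures, it can help to put a number on each model's confidence. AlphaFold-style PDB files store the per-residue pLDDT score (0 to 100, higher is more confident) in the B-factor column, so it can be read with plain string slicing. The sketch below is a minimal illustration: `read_plddt` and `summarize` are hypothetical helpers, and the two-residue record is synthetic, built only to demonstrate the parsing.

```python
from statistics import mean

def read_plddt(pdb_text):
    """Return one pLDDT value per residue, read from the B-factor column of CA atoms."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores

def summarize(pdb_text):
    """Summarize the confidence of one predicted model."""
    scores = read_plddt(pdb_text)
    if not scores:
        raise ValueError("no CA atoms found in the PDB text")
    return {"residues": len(scores), "mean_plddt": mean(scores), "min_plddt": min(scores)}

# Synthetic two-residue record in fixed-width PDB layout, for demonstration only.
demo = "\n".join(
    f"ATOM  {i:5d}  CA  GLY A{i:4d}    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b:6.2f}"
    for i, b in enumerate([91.5, 62.5], start=1)
)
print(summarize(demo))

# For a real prediction, pass the text of one of the files found earlier:
# print(summarize(open(pdb_files[0]).read()))
```

As a rough guide, pLDDT above 90 indicates very high confidence, 70 to 90 confident, 50 to 70 low, and below 50 very low.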

## **Conclusion**
This notebook provides a practical workflow for predicting protein structures with AlphaFold via LocalColabFold on AWS SageMaker and visualizing the results.

By completing this guide, you have set up your environment, run a protein structure prediction with LocalColabFold, and used interactive 3D visualization tools to explore the predicted protein conformations. This hands-on experience prepares you to apply AlphaFold to your own research and structural biology explorations, using accessible tools within a cloud-based environment.

## Clean up

Once you have completed the tutorial, stop (or delete) the SageMaker Notebook Instance so that the GPU instance no longer incurs charges, and delete any files or resources you no longer need.
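
As a starting point, the file cleanup can be done from a notebook cell. The sketch below assumes the results directory used in this tutorial; `remove_results` is a hypothetical helper. The commented `boto3` lines show one way to stop the instance programmatically, where the instance name is a placeholder you would replace with your own; stopping it from the SageMaker console works just as well.

```python
import shutil
from pathlib import Path

def remove_results(results_dir):
    """Delete a prediction output directory; return True if something was removed."""
    path = Path(results_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True

# Uncomment once the predictions are no longer needed:
# remove_results("/home/ec2-user/results/")

# Optionally stop the instance with boto3 (placeholder name, replace with yours):
# import boto3
# boto3.client("sagemaker").stop_notebook_instance(NotebookInstanceName="my-notebook-instance")
```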