# Introduction

This notebook introduces some basic API functionality of ESMFold, a protein structure prediction model built on the ESM-2 protein language model and used to produce the ESM Metagenomic Atlas. It lets us fold a protein sequence and inspect the resulting predicted structure.

We will use the following Python libraries:

- PyTorch
- esm
- prothelpers
- py3Dmol
- Biopython (Bio.PDB)
- transformers (Hugging Face)

***

We will walk through each stage step by step, carrying out the following tasks:

1. Download Protein PDB data
2. Parse the protein data
3. Visualise protein data
4. Extract the experimental structure and sequence of one chain
5. Set up a pre-trained ESMFold model and its AutoTokenizer from Hugging Face
6. Tokenise the sequence (from the experimental structure in step 4) to convert it into a numerical format that ESMFold can use for prediction
7. Use the instantiated ESMFold model to make a prediction
8. Submit the tokenised sequence to the ESMFold model to predict the 3D structure (may take ~5 minutes without hardware acceleration)
9. View the result visualisation
10. Evaluate the accuracy of the prediction by comparing it to the experimental structure
11. Calculate TM-score and RMSD to evaluate the overall prediction
12. Finish
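
As a rough illustration of step 11, RMSD is the square root of the mean squared distance between matched atoms. The sketch below assumes the two structures are already superimposed and contain the same atoms in the same order. The function name `rmsd` and any coordinates passed to it are hypothetical, and a real comparison would first align the structures (for example with a Kabsch superposition).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumes the coordinate sets are already superimposed and matched
    atom-for-atom; no alignment is performed here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the matched pair of atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Lower values indicate closer agreement. Because RMSD is sensitive to a few badly placed atoms, it is usually read alongside TM-score rather than on its own.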


### 1. Download Protein PDB data

In [1]:
from Bio.PDB import PDBList, MMCIFParser
import os
import py3Dmol
from prothelpers.structure import atoms_to_pdb
import warnings

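# PDB ID of the structure to download
target_id = "1N8Z"

# Create the output directory if it does not already exist
os.makedirs("data", exist_ok=True)

# Download the structure in mmCIF format; returns the path of the saved file
pdbl = PDBList()
filename = pdbl.retrieve_pdb_file(target_id, pdir="data", file_format="mmCif")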

ModuleNotFoundError: No module named 'Bio'

### 2. Parse the protein data

In [None]:
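# Parse the mmCIF file, silencing Biopython's parser warnings
parser = MMCIFParser()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    structure_1N8Z = parser.get_structure(target_id, filename)
# Convert the first model of the structure for visualisation
pdb_string = atoms_to_pdb(structure_1N8Z[0])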
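
To make the parsed data a little more concrete, the sketch below lists the chain identifiers present in PDB-format text, which is a quick way to confirm which chains are available before styling them. It assumes that atoms_to_pdb returns standard fixed-column PDB text, where the chain ID sits in column 22 of each ATOM or HETATM record. The helper name `chain_ids` is hypothetical and not part of prothelpers.

```python
def chain_ids(pdb_text):
    """Return the sorted chain identifiers found in PDB-format text."""
    chains = set()
    for line in pdb_text.splitlines():
        # Coordinate records start with ATOM or HETATM;
        # the chain ID is column 22, i.e. string index 21.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains.add(line[21])
    return sorted(chains)
```

Under that assumption, calling chain_ids(pdb_string) would list the chains that can be selected by name in the visualisation.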

### 3. Visualise protein data

In [None]:
view = py3Dmol.view(width=600, height=400)
view.addModel(pdb_string, "pdb")  # state the format explicitly
# Colour each chain; chains A and C are semi-transparent so chain B stands out
view.setStyle({"chain": "A"}, {"cartoon": {"color": "orange", "opacity": 0.5}})
view.setStyle({"chain": "B"}, {"cartoon": {"color": "blue"}})
view.setStyle({"chain": "C"}, {"cartoon": {"color": "green", "opacity": 0.5}})
view.zoomTo()
view.show()

### 4. Extract the experimental structure and sequence of one chain

In [2]:
from prothelpers.structure import get_aa_seq

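# Save chain B of the first model as the experimental reference structure
experimental_structure = atoms_to_pdb(structure_1N8Z[0]["B"])
with open("data/experimental.pdb", "w") as f:
    f.write(experimental_structure)

# Extract the amino-acid sequence of the same chain
experimental_sequence = get_aa_seq(structure_1N8Z[0]["B"])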

ModuleNotFoundError: No module named 'prothelpers'
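
Extracting a sequence amounts to mapping each residue's three-letter code to its one-letter code. The sketch below shows that idea on plain strings. It is an assumption about what get_aa_seq does conceptually, not its actual implementation, and the names `THREE_TO_ONE` and `residues_to_sequence` are hypothetical.

```python
# Standard three-letter to one-letter codes for the 20 canonical amino acids
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_to_sequence(residue_names, unknown="X"):
    """Join residue names into a one-letter sequence.

    Names that are not canonical amino acids (water, ligands, modified
    residues) are replaced by the `unknown` placeholder.
    """
    return "".join(THREE_TO_ONE.get(name.upper(), unknown) for name in residue_names)
```

For example, residues_to_sequence(["MET", "LYS", "THR"]) gives "MKT".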

### 5. Set up a pre-trained ESMFold model and its AutoTokenizer from Hugging Face

In [None]:
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained(
    "facebook/esmfold_v1", low_cpu_mem_usage=True
)

In [None]:
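import torch

# Run on CPU, with the language-model stem in full float32 precision
device = torch.device("cpu")
model.esm = model.esm.float()
# TF32 only affects CUDA matmuls, so this has no effect on CPU
torch.backends.cuda.matmul.allow_tf32 = False

model = model.to(device)
# Chunk the trunk's attention to reduce peak memory, at some cost in speed
model.trunk.set_chunk_size(64)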
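
The cell above pins the model to the CPU. Where a GPU is available, a common pattern is to pick the device at run time. The sketch below assumes PyTorch may or may not be installed and returns "cpu" whenever no CUDA device can be used. The helper name `pick_device_name` is hypothetical.

```python
import importlib.util


def pick_device_name():
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"
```

The result could then be passed to torch.device, for example device = torch.device(pick_device_name()).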

### 6. Tokenise sequence
Convert the sequence extracted from the experimental structure in step 4 into a numerical format that ESMFold can use for prediction.

In [6]:
tokenized_input = tokenizer(
    [experimental_sequence], return_tensors="pt", add_special_tokens=False
)["input_ids"]
tokenized_input = tokenized_input.to(device)

print(f"The human-readable sequence is {experimental_sequence}")
print(f"The tokenized representation of the sequence is {tokenized_input}")

NameError: name 'tokenizer' is not defined
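
Conceptually, tokenisation maps each amino-acid letter to an integer index. The sketch below illustrates this with a toy vocabulary. The alphabet order, the unknown-token index, and the function name `toy_tokenize` are assumptions made for illustration and do not reproduce the real facebook/esmfold_v1 vocabulary.

```python
# Illustrative alphabet only; the real tokenizer defines its own ordering
TOY_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TOY_VOCAB = {aa: idx for idx, aa in enumerate(TOY_ALPHABET)}
UNKNOWN_ID = len(TOY_ALPHABET)  # index reserved for letters outside the alphabet


def toy_tokenize(sequence):
    """Map each residue letter to its integer index in the toy vocabulary."""
    return [TOY_VOCAB.get(aa, UNKNOWN_ID) for aa in sequence.upper()]
```

With this toy vocabulary, toy_tokenize("ACD") returns [0, 1, 2]. The real tokenizer returns a tensor of IDs instead, which is what the model consumes.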