## Protein Models and Visualization

AlphaFold is an AI system that predicts a protein’s 3D structure from its amino acid sequence. Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI) have partnered to create AlphaFold DB to make these predictions freely available to the scientific community. 
The latest database release contains over 200 million entries, providing broad coverage of UniProt (the standard repository of protein sequences and annotations).

### The AlphaFold API can be queried with a protein's UniProt accession number

UniProt accession number: when an entry is integrated into UniProtKB, it is assigned a unique accession number, called the 'Primary (citable) accession number'.
We will start by retrieving the available UniProt accession numbers for a protein using the UniProt API at https://rest.uniprot.org/uniprotkb/search.
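
Before sending anything, it can help to see the URL that such a search produces. The sketch below only builds the URL with the standard library and does not contact the server. The helper name `build_search_url` is hypothetical, and the parameter values are the ones used in the search cell in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org/uniprotkb/search"

def build_search_url(protein_name, size=5):
    """Return the full search URL for a protein name (illustrative helper)."""
    params = {
        "query": protein_name,   # free-text search term
        "fields": "accession",   # only ask for the accession number
        "format": "json",        # JSON response
        "size": size,            # maximum number of hits
    }
    return f"{BASE_URL}?{urlencode(params)}"

print(build_search_url("P53"))
# https://rest.uniprot.org/uniprotkb/search?query=P53&fields=accession&format=json&size=5
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON before writing any parsing code.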


In [None]:
import requests

def search_uniprot(protein_name):
    base_url = "https://rest.uniprot.org/uniprotkb/search"
    params = {
        'query': protein_name,
        'fields': 'accession',  # Request UniProt ID (accession number)
        'format': 'json',        # Get results in JSON format
        'size': 5                # Limit the number of returned hits
    }
    
    response = requests.get(base_url, params=params)
    
    if response.status_code == 200:
        data = response.json()
        uniprot_ids = [entry['primaryAccession'] for entry in data.get('results', [])]
        return uniprot_ids
    else:
        print(f"Error: Received status code {response.status_code}")
        return []

def check_alphafold(uniprot_ids):
    """Return the PDB file URLs that AlphaFold DB lists for the given UniProt IDs."""
    base_url = "https://alphafold.ebi.ac.uk/api/prediction/"
    pdb_urls = []
    
    for uniprot_id in uniprot_ids:
        url = f"{base_url}{uniprot_id}"
        response = requests.get(url)
        
        if response.status_code == 200:
            data = response.json()
            # The response is a list of entries; keep those that link to a PDB file
            for entry in data:
                if 'pdbUrl' in entry:
                    pdb_urls.append(entry['pdbUrl'])
        else:
            print(f"Error: Received status code {response.status_code} for UniProt ID {uniprot_id}")
    
    return pdb_urls

protein_name = "P53"
uniprot_ids = search_uniprot(protein_name)
print(uniprot_ids)

pdb_urls = check_alphafold(uniprot_ids)
print(pdb_urls)
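
The parsing step inside `check_alphafold` can be tried without a network connection. The sketch below pulls that step out into a hypothetical helper, `extract_pdb_urls`, and runs it on a made-up payload. Only the `pdbUrl` key comes from the code above; the other key and the example.org URLs are placeholders, not real AlphaFold DB data.

```python
def extract_pdb_urls(entries):
    """Collect the 'pdbUrl' value from every entry that has one."""
    return [entry["pdbUrl"] for entry in entries if "pdbUrl" in entry]

# Made-up payload with the same shape the loop above expects: a list of dicts.
mock_entries = [
    {"note": "placeholder entry 1", "pdbUrl": "https://example.org/model_1.pdb"},
    {"note": "placeholder entry without a structure file"},
    {"note": "placeholder entry 3", "pdbUrl": "https://example.org/model_3.pdb"},
]

print(extract_pdb_urls(mock_entries))
```

Keeping the parsing separate from the request makes it easier to step through in the debugger, since the input can be inspected and replaced by hand.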

## Check for structure files on the AlphaFold EMBL server

Start by installing py3Dmol with `pip install py3Dmol`.

py3Dmol is a simple [IPython/Jupyter](http://jupyter.org/) widget to embed an interactive [3Dmol.js](http://3dmol.org) viewer in a notebook.
We will retrieve structures from the URLs we captured above and view them using py3Dmol.

In [None]:
import py3Dmol


def check_alphafold(uniprot_ids):
    base_url = "https://alphafold.ebi.ac.uk/api/prediction/"
    pdb_urls = []
    
    for uniprot_id in uniprot_ids:
        url = f"{base_url}{uniprot_id}"
        response = requests.get(url)
        
        if response.status_code == 200:
            data = response.json()
            for entry in data:
                if 'pdbUrl' in entry:
                    pdb_urls.append(entry['pdbUrl'])
        else:
            print(f"Error: Received status code {response.status_code} for UniProt ID {uniprot_id}")
    
    return pdb_urls

def visualize_protein_with_py3dmol(pdb_url):
    response = requests.get(pdb_url)
    if response.status_code == 200:
        pdb_data = response.text
        view = py3Dmol.view(width=800, height=600)
        view.addModel(pdb_data, "pdb")
        view.setStyle({'stick': {}})
        view.zoomTo()
        view.show()
    else:
        print(f"Error: Received status code {response.status_code} for PDB URL {pdb_url}")


pdb_urls = check_alphafold(uniprot_ids)
print(pdb_urls)

if pdb_urls:
    visualize_protein_with_py3dmol(pdb_urls[0])


## pLDDT
The predicted local distance difference test (pLDDT) is a per-residue measure of local confidence. It is scaled from 0 to 100, with higher scores indicating higher confidence and usually a more accurate prediction.

pLDDT measures confidence in the local structure, estimating how well the prediction would agree with an experimental structure. On this basis, a pLDDT above 90 would be taken as the highest accuracy category, in which both the backbone and side chains are typically predicted with high accuracy. In contrast, a pLDDT above 70 usually corresponds to a correct backbone prediction with misplacement of some side chains. 
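
The thresholds above can be turned into a small lookup. This is a minimal sketch: the 90 and 70 cut-offs come from the text, while the additional cut-off at 50 and the band names follow the colour legend commonly shown on AlphaFold DB pages and should be read as an assumption here. The function name `plddt_band` is hypothetical.

```python
def plddt_band(score):
    """Return a confidence label for a single pLDDT score (0 to 100)."""
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "very high"   # backbone and side chains typically accurate
    if score > 70:
        return "confident"   # backbone usually correct, some side chains misplaced
    if score > 50:           # assumed cut-off, not stated in the text above
        return "low"
    return "very low"

for value in (95.2, 81.0, 63.5, 30.1):
    print(value, plddt_band(value))
```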

The pLDDT score is stored in the PDB file in the B-factor (temperature factor) column, the field that an experimental structure normally uses for a crystallographic measure. It is calculated for each residue and repeated on every atom line of that residue.
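
Because the PDB format is fixed-width, the score can be read by character position instead of by splitting on whitespace. The sketch below assumes the standard ATOM record layout (chain ID in column 22, residue number in columns 23 to 26, B-factor in columns 61 to 66). The function name `plddt_by_residue` is hypothetical, and it returns only residue positions and scores.

```python
def plddt_by_residue(pdb_text):
    """Map (chain ID, residue number) to the pLDDT of that residue's first atom."""
    scores = {}
    for line in pdb_text.splitlines():
        # Only ATOM records carry per-atom data; skip short or other lines
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        chain_id = line[21]                # column 22
        residue_number = int(line[22:26])  # columns 23-26
        plddt = float(line[60:66])         # columns 61-66 (B-factor field)
        # Every atom in a residue repeats the same score, so keep the first
        scores.setdefault((chain_id, residue_number), plddt)
    return scores
```

With a structure saved as `protein.pdb`, this could be called as `plddt_by_residue(open("protein.pdb").read())`. Slicing by column stays correct when neighbouring fields run together without a space, which whitespace splitting does not handle.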

The script below was intended to capture the pLDDT scores and compare them across three categories. But something went wrong.

In [None]:
import pandas as pd
import re


def save_pdb_to_disk(pdb_url, filename="protein.pdb"):
    response = requests.get(pdb_url)
    if response.status_code == 200:
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"PDB file saved as {filename}")
    else:
        print(f"Error: Received status code {response.status_code} for PDB URL {pdb_url}")

def extract_plddt_from_pdb(filename="protein.pdb"):
    plddt_scores = []
    seen_residues = set()
    with open(filename, 'r') as file:
        for line in file:
            if line.startswith("ATOM"):
                parts = re.split(r'\s+', line.strip())
                if len(parts) >= 11:
                    try:
                        residue = parts[3]
                        residue_number = parts[5]
                        plddt = float(parts[10])
                        residue_id = (residue, residue_number)
                        if residue_id not in seen_residues:
                            seen_residues.add(residue_id)
                            plddt_scores.append({'Residue': residue, 'ResidueNumber': residue_number, 'PLDDT': plddt})
                    except ValueError:
                        continue
    return pd.DataFrame(plddt_scores)

def calculate_average_plddt(df):
    # Define residue categories
    acidic_residues = {'D', 'E'}
    basic_residues = {'K', 'R', 'H'}
    nonpolar_residues = {'A', 'V', 'L', 'I', 'M', 'F', 'W', 'P', 'G'}
    
    # Calculate averages using a loop and only taking the first atom of each residue
    acidic_scores = []
    basic_scores = []
    nonpolar_scores = []
    
    for _, row in df.iterrows():
        if row['Residue'] in acidic_residues:
            acidic_scores.append(row['PLDDT'])
        elif row['Residue'] in basic_residues:
            basic_scores.append(row['PLDDT'])
        elif row['Residue'] in nonpolar_residues:
            nonpolar_scores.append(row['PLDDT'])
    
    acidic_avg = sum(acidic_scores) / len(acidic_scores) if acidic_scores else float('nan')
    basic_avg = sum(basic_scores) / len(basic_scores) if basic_scores else float('nan')
    nonpolar_avg = sum(nonpolar_scores) / len(nonpolar_scores) if nonpolar_scores else float('nan')
    
    print("Average PLDDT values:")
    print(f"Acidic residues: {acidic_avg:.2f}")
    print(f"Basic residues: {basic_avg:.2f}")
    print(f"Nonpolar residues: {nonpolar_avg:.2f}")

pdb_urls = check_alphafold(uniprot_ids)
print(pdb_urls)

if pdb_urls:
    save_pdb_to_disk(pdb_urls[0])
    plddt_df = extract_plddt_from_pdb()
    calculate_average_plddt(plddt_df)

We wanted numbers, not NaNs! Use the debugger to find out what is going on. Add a few breakpoints
at logical locations and check the variables at those points. What is the source of the error?


Exercise 1: Once you have found the error, cut and paste into the code block below and try adjusting the code to fix it. Spend a few minutes outlining a plan in code or comments. Then paste your partial solution into GPT to ask for suggestions.


In [None]:
# insert corrected code here

Exercise 2:
Create a plot of the average pLDDT for each residue type individually across all proteins you retrieved.