# Exercise 01 

## Aim of the exercise

The goal of this exercise is to retrieve and analyse structure entries from the Protein Data Bank (PDB).

We will learn how to:

- query the PDB
- extract PDB IDs (each structure is identified by its ID code)
- extract information about structures
- perform different analyses (e.g. find the structure with the lowest resolution value)

Partially based on [Drazen Petrov](https://orcid.org/0000-0001-6221-7369)'s exercises and on [TeachOpenCADD](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0351-x).

In [None]:
# Check if running on Google Colab
try:
    from google.colab import drive

    is_google_colab = True
except ImportError:
    is_google_colab = False

# If on Google Colab, install the package
if is_google_colab:
    %pip install numpy==1.26.4 scipy==1.14.1 pandas==2.2.2 matplotlib==3.9.2 biopandas==0.4.1 pypdb==2.4 tqdm==4.66.1 py3dmol==2.0.4

# NOTE: Ignore specific warning message from ipykernel=5.5.6
import warnings
warnings.filterwarnings(
    "ignore",
    message=r"`should_run_async`.*",
    category=DeprecationWarning,
    module=r"ipykernel\.ipkernel",
)

In [None]:
# import needed libraries
import math

import requests
import json
from tqdm import tqdm

import scipy
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl

import pypdb
import biopandas

import py3Dmol


mpl.rcParams["figure.dpi"] = 120
mpl.rcParams["figure.figsize"] = [5, 3]

In [None]:
# Print versions of each library
print(f"Running on Google Colab: {is_google_colab}")
print(f"numpy=={np.__version__}")
print(f"scipy=={scipy.__version__}")
print(f"pandas=={pd.__version__}")
print(f"matplotlib=={mpl.__version__}")
print(f"biopandas=={biopandas.__version__}")
# print(f"pypdb=={pypdb.__version__}") # NOTE: pypdb does not have a __version__ attribute
# print(f"tqdm=={tqdm.__version__}") # NOTE: tqdm does not have a __version__ attribute
print(f"py3dmol=={py3Dmol.__version__}")

## PDB Protein Data Bank

The [RCSB PDB](https://www.rcsb.org/) (Research Collaboratory for Structural Bioinformatics Protein Data Bank) is a comprehensive database for the 3D structural information of biological macromolecules. The aim of RCSB PDB is to provide open access to 3D structural data of biological macromolecules to advance research and understanding of molecular biology and biochemistry. The RCSB PDB also provides a variety of tools and resources. Users can perform simple and advanced searches based on annotations relating to sequence, structure and function. These molecules are visualized, downloaded, and analyzed by users who range from students to specialized scientists. 



### Protein of interest

Today we will take a look at the human tyrosine-protein kinase ABL1. This protein is involved in cell differentiation, cell division, cell adhesion, stress response and apoptosis. It is also a target for cancer therapy.

The UniProt ID of this protein is P00519. You can find more information about it at [https://www.uniprot.org/uniprot/P00519](https://www.uniprot.org/uniprot/P00519).

To search the PDB for this protein, copy the UniProt ID (P00519) into the search box at [https://www.rcsb.org/](https://www.rcsb.org/).
Feel free to explore the website and the information available for this protein.

### Programmatic access to PDB

While searching via the website is straightforward, running repeated searches to systematically analyze structures of interest is only practical with programmatic access.

Therefore, we will use the [PDB Search API](https://www.rcsb.org/docs/programmatic-access/web-services) to perform queries to the PDB database.

How does it work? The API lets you search the PDB database with a JSON query in a URL and retrieve results in JSON format for further extraction.
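As a rough illustration of "a JSON query in a URL", the sketch below serialises a small query dictionary and percent-encodes it as the `json` parameter. It only builds and decodes the URL and does not contact the server. The query is a shortened version of the one used in this exercise; the variable names are our own choice.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint used in this exercise
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

# A minimal full-text query for the UniProt ID P00519
query = {
    "query": {
        "type": "terminal",
        "service": "full_text",
        "parameters": {"value": "P00519"},
    },
    "return_type": "entry",
}

# Serialise the dictionary to JSON and percent-encode it as the `json` URL parameter
url = f"{SEARCH_URL}?{urlencode({'json': json.dumps(query)})}"
print(url)

# Decoding the URL again recovers the original query unchanged
decoded = json.loads(parse_qs(urlsplit(url).query)["json"][0])
```

Encoding matters because raw JSON contains characters such as `{`, `"` and spaces that are not safe in a URL.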

The API is well documented at [https://search.rcsb.org/index.html#search-api](https://search.rcsb.org/index.html#search-api). You can also find example queries there: [https://search.rcsb.org/index.html#examples](https://search.rcsb.org/index.html#examples).

We will use [pypdb](https://github.com/williamgilpin/pypdb) to easily access and download PDB data based on metadata like protein and ligand names.


In [None]:
# prepare search parameters using the uniprot ID of ABL1 (P00519)
search_dict = {
    "query": {
        "type": "terminal",
        "label": "full_text",
        "service": "full_text",
        "parameters": {"value": "P00519"},
    },
    "return_type": "entry",
    "request_options": {
        "paginate": {"start": 0, "rows": 10000},
        "results_content_type": ["experimental"],
        "sort": [{"sort_by": "score", "direction": "desc"}],
        "scoring_strategy": "combined",
    },
}

In [None]:
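# performing the search
response = requests.get(
    "https://search.rcsb.org/rcsbsearch/v2/query",
    params={"json": json.dumps(search_dict)},  # requests URL-encodes the JSON query
    timeout=60,
)
response.raise_for_status()  # stop early if the request failed
data = response.json()
# printing the keys of the retrieved dictionary
data.keys()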

In [None]:
# showing the total number of hits (how does this compare to the search performed on the website directly?)
data["total_count"], len(data["result_set"])
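The two numbers can differ when a query has more hits than the `rows` value requested in `paginate`. In that case the remaining hits can be requested page by page by shifting `start`. The helper below is a minimal sketch of how such `paginate` blocks could be generated; the function name and the numbers (250 hits, 100 rows per page) are made up for illustration.

```python
def page_requests(total_count: int, rows: int = 100) -> list[dict]:
    """Return one `paginate` block per page needed to cover `total_count` hits."""
    return [
        {"start": start, "rows": rows}
        for start in range(0, total_count, rows)
    ]


# Hypothetical example: 250 hits fetched in pages of 100
pages = page_requests(total_count=250, rows=100)
print(pages)
```

Each block would replace the `paginate` entry in `request_options` for one request.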

In [None]:
data["result_set"][0]

In [None]:
def extract_pdb_ids(search_result: dict) -> list:
    """Extracts the PDB IDs from the search result."""
    pdb_IDs = []
    for entry in search_result["result_set"]:
        pdb_IDs.append(entry["identifier"])
    return pdb_IDs


found_pdb_ids = extract_pdb_ids(data)

In [None]:
# here we look at how many hits we got
len(found_pdb_ids)

In [None]:
# here we look at the first 5 pdb codes
found_pdb_ids[:5]

### Extracting information of one protein

In [None]:
# let's take a look at some information about one of the structures from the list (PDB ID 1BBZ)
pdb_1bbz_info = pypdb.get_info("1BBZ")
for key, value in pdb_1bbz_info.items():
    # print(key, value) # this line would print all the data available (long output)!
    print(key, end=", ")  # here we just print the keys

Now our problem is that we have too much information. We are interested in parameters such as resolution, method, date, number of atoms, etc.
 
Let's try to extract this information from the dictionary.
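Not every entry contains every field (for example, NMR entries usually carry no refinement block), so nested lookups can fail. One possible way to handle this is a small helper that walks a path of keys and list indices and falls back to a default. This is only a sketch: `get_nested` is our own hypothetical helper, and the example record is made up and heavily shortened.

```python
from typing import Any


def get_nested(data: Any, path: list, default: Any = None) -> Any:
    """Follow `path` (dict keys and list indices) through `data`; return `default` if a step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical, heavily shortened record; the values are made up for illustration
example_info = {
    "rcsb_id": "XXXX",
    "exptl": [{"method": "SOLUTION NMR"}],
}

method = get_nested(example_info, ["exptl", 0, "method"])
resolution = get_nested(example_info, ["refine", 0, "ls_dres_high"])
print(method, resolution)
```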

Now try to perform the same search directly on the PDB website: https://www.rcsb.org/

In [None]:
# let's extract some of the (interesting) information
# of course, what is interesting depends on the project you are involved in
# this is one example of such a function
def extract_interesting_info(pdb_info: dict) -> dict:
    """Extracts interesting information from the PDB info."""
    info = {
        "pdb_id": pdb_info["rcsb_id"],
        "desc": pdb_info["struct"].get("pdbx_descriptor"),
        "title": pdb_info["struct"]["title"],
        "method": pdb_info["exptl"][0]["method"],
        "date": pdb_info["rcsb_accession_info"]["deposit_date"],
        "num_atoms": pdb_info["rcsb_entry_info"]["deposited_atom_count"],
        "resolution": None,
        "rwork": None,
        "rfree": None,
    }

    # refinement statistics (typically only present for X-ray structures)
    try:
        refine = pdb_info["refine"][0]
        info.update(
            {
                "resolution": refine["ls_dres_high"],
                "rwork": refine["ls_rfactor_rwork"],
                "rfree": refine["ls_rfactor_rfree"],
            }
        )
    except (KeyError, IndexError, TypeError) as e:
        # keep the None defaults and report what was missing
        print(f"{info['pdb_id']}: missing refinement data ({e!r})")

    return info

In [None]:
extract_interesting_info(pdb_1bbz_info)

### Extracting information of the found proteins

We will use the `extract_interesting_info` function defined above to extract the information for each structure.

In [None]:
# let’s collect data for all retrieved pdb codes
pdb_data = []
for pdb_id in tqdm(found_pdb_ids):
    pdb_data.append(extract_interesting_info(pypdb.get_info(pdb_id)))

In [None]:
# now we store this data in a pandas dataframe
pdbs = pd.DataFrame(pdb_data)
pdbs.head()

#### Let's make some plots

Looking at the deposition years and resolution of the structures.

In [None]:
# a little bit of preprocessing
# let's convert the date column to datetime format
pdbs["date"] = pd.to_datetime(pdbs["date"])
pdbs["year"] = pdbs["date"].dt.year

In [None]:
df = pdbs
df[["date"]].groupby(df["date"].dt.year).count().plot(kind="bar")

In [None]:
structures_per_year = pdbs["year"].value_counts().reset_index().sort_values("year")
structures_per_year

In [None]:
# build the same table with explicit column names (also works for pandas<2.0)
# structures_per_year = pdbs['year'].value_counts().reset_index().sort_values('year') # <- this works for pandas>=2.0
structures_per_year = pdbs["year"].value_counts().reset_index()
structures_per_year.columns = ["year", "count"]
structures_per_year.sort_values("year", inplace=True)
structures_per_year

In [None]:
# let's plot the number of structures per year
structures_per_year.plot(x="year", y="count", style="o-")

In [None]:
# let's look at how the resolution changed over the years (sort by year first)
pdbs = pdbs.sort_values(["year"], ascending=True, na_position="last")
pdbs[["date", "year", "resolution"]].head(10)

In [None]:
pdbs.plot(x="year", y="resolution", style="o")
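A scatter of individual structures can be hard to read, so it may help to summarise the resolution per year. The sketch below shows one way to do this with `groupby`; the table is a made-up toy example, not real PDB data.

```python
import pandas as pd

# Made-up values for illustration only (not real PDB entries)
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2005, 2005, 2005, 2010],
        "resolution": [2.4, 2.0, 1.9, None, 2.1, 1.5],
    }
)

# Median resolution and number of structures with a resolution per year;
# missing values (e.g. NMR structures) are skipped by both aggregations
summary = toy.groupby("year")["resolution"].agg(["median", "count"])
print(summary)
```

The same call should work on the `pdbs` table, since it has the same two columns.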

In [None]:
# let's observe the type of methods used to obtain the structures
pdbs["method"].unique()

In [None]:
pdbs[pdbs.method == "X-RAY DIFFRACTION"].head()

In [None]:
pdbs[pdbs.method == "X-RAY DIFFRACTION"].head().resolution

In [None]:
pdbs.hist(column="resolution")

<img src="https://biopandas.github.io/biopandas/img/logos/logo_size_1.png" width="200" align="left"/>

## biopandas 

[BioPandas](https://biopandas.github.io/biopandas/) simplifies the handling of protein structure files, such as PDB files, for computational biologists. It utilizes pandas DataFrames, widely used in data science, to work with biological macromolecule structures from PDB and MOL2 files in structural biology.

We will use it to fetch the structure with the lowest resolution value (a smaller value in Å means a better-resolved structure).
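One detail to watch when picking that structure: `idxmin()` returns an index *label*, not a row position. After a table has been sorted, labels and positions no longer agree, so the label should be used with `.loc` rather than `.iloc`. The toy table below (made-up IDs and values) illustrates the difference.

```python
import pandas as pd

# Made-up values for illustration only (not real PDB entries)
toy = pd.DataFrame(
    {
        "pdb_id": ["AAAA", "BBBB", "CCCC"],
        "year": [2010, 2000, 2005],
        "resolution": [1.2, 2.5, 3.0],
    }
)

# Sorting reorders the rows but keeps the original index labels
toy = toy.sort_values("year")

label = toy["resolution"].idxmin()  # index label of the smallest value
by_label = toy.loc[label, "pdb_id"]  # label-based lookup: the intended row
by_position = toy.iloc[label]["pdb_id"]  # position-based lookup: a different row here
print(label, by_label, by_position)
```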

In [None]:
from biopandas.pdb import PandasPdb
# The following warning is a biopandas issue that will be fixed in the next release

In [None]:
pdbs["resolution"].min()

In [None]:
pdbs["resolution"].idxmin()

In [None]:
ID_min = pdbs["resolution"].idxmin()
# idxmin returns an index label, so use .loc (pdbs was re-sorted, labels != positions)
pdbs.loc[ID_min]


In [None]:
pdbs.loc[ID_min, "pdb_id"]

In [None]:
pdb_ID = pdbs.loc[ID_min, "pdb_id"]
ppdb = PandasPdb().fetch_pdb(pdb_ID)
ppdb

In [None]:
ppdb.df["ATOM"].head()

In [None]:
ppdb.df["ATOM"]["b_factor"].plot(kind="hist")

In [None]:
ppdb.df["ATOM"]["b_factor"].plot(kind="line")

In [None]:
ppdb.df["ATOM"].x_coord[0]

In [None]:
def get_coord(pdb: PandasPdb, at: int) -> np.ndarray:
    """Get the coordinates of an atom."""
    r = []
    for coord in ("x_coord", "y_coord", "z_coord"):
        r.append(pdb.df["ATOM"][coord][at])
    return np.array(r)


def calc_dist(pdb: PandasPdb, at1: int, at2: int) -> tuple[float, float]:
    """Calculate the distance between two atoms."""
    r1, r2 = get_coord(pdb, at1), get_coord(pdb, at2)
    r = r1 - r2
    d = math.sqrt(sum(r**2))
    d_alternative = np.linalg.norm(r)
    return d, d_alternative


calc_dist(ppdb, 0, 1)

In [None]:
# help(ppdb)
# execute to see what other functions are available

In [None]:
get_coord(ppdb, 0)
# ppdb.distance()

In [None]:
ppdb.distance(get_coord(ppdb, 0)).head()
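The Euclidean distance used in `calc_dist` can also be computed for many atoms at once with plain NumPy. The sketch below shows the same kind of calculation on a small array of made-up coordinates; it is not meant to reproduce the BioPandas implementation.

```python
import numpy as np

# Made-up coordinates in Å for illustration only (one row per atom: x, y, z)
coords = np.array(
    [
        [0.0, 0.0, 0.0],
        [3.0, 4.0, 0.0],
        [1.0, 2.0, 2.0],
    ]
)

# Distance of every atom to a reference atom (here: the first one)
reference = coords[0]
distances = np.linalg.norm(coords - reference, axis=1)
print(distances)
```

With real data, the coordinate array could be taken from the `x_coord`, `y_coord` and `z_coord` columns of `ppdb.df["ATOM"]`.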

## py3Dmol

py3Dmol is a wrapper around the 3Dmol.js JavaScript library.

In [None]:
import py3Dmol

view = py3Dmol.view(query="pdb:4XEY")
view.setStyle({"cartoon": {"color": "spectrum"}})
view.zoomTo()
view.show()

# Exercises

* Perform a PDB query on a protein of your choice (e.g. using a UniProt ID or a text query) and retrieve the data from the PDB. Use a query that returns some tens or hundreds of structures.
* How many structures have you retrieved? How many of them are X-ray and how many are NMR structures?
* Sort the structures by resolution.
* What are the minimum and maximum resolution?
* Visualize the structure b-factors (X-ray) or visualize all the models in one PDB entry (NMR). **Use py3Dmol for this task.**

## For the project
* Choose one of the X-ray and one of the NMR structures and use PyMOL to visualize them (save the visualizations as PNG files). This task (PyMOL) only works on a local machine:
   - download the PDB file from the PDB database
   - visualize secondary structure elements and describe the structure in terms of secondary structure, motifs, domain, ...
   - zoom to the ligand or heteroatoms (if present) and analyze the amino acids involved in the interaction
   - for the X-ray structure, visualize b-factors by either changing the size of the atoms (spheres, see https://sourceforge.net/p/pymol/mailman/message/29616429/) or by color and cartoon thickness (see https://www.michaelchimenti.com/2014/09/five-cool-features-in-pymol-that-you-may-have-missed/)
   - for the NMR structure, visualize the bundle. An NMR bundle is a set of structures that satisfy experimental data. This set of structures is reported within one PDB file.
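To check how many structures an NMR bundle contains, one option is to count the `MODEL` records in the PDB file. The function below is a minimal sketch that works on the file content as text; `count_models` is our own hypothetical helper, and the example text is made up and heavily shortened.

```python
def count_models(pdb_text: str) -> int:
    """Count MODEL records in PDB-format text; files without MODEL records hold a single model."""
    n_models = sum(1 for line in pdb_text.splitlines() if line.startswith("MODEL "))
    return max(n_models, 1)


# Made-up, heavily shortened PDB-like text for illustration only
toy_bundle = """\
MODEL        1
ATOM      1  N   MET A   1       0.000   0.000   0.000  1.00  0.00           N
ENDMDL
MODEL        2
ATOM      1  N   MET A   1       0.100   0.000   0.000  1.00  0.00           N
ENDMDL
END
"""

print(count_models(toy_bundle))
```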


# Space for the Exercises
Please provide your solutions below this cell.

You can also share information about your project with me; I can help you as it progresses.