# Finding Potential Structural Relatives by Sequence Similarity using proteusPy
Eric G. Suchanek, PhD 4/17/24

Working under the assumption that similar sequence implies similar structure, I generated a query for structures with high sequence similarity to 2q7q, the entry containing the lowest-energy disulfide bond in the RCSB database. The query returns a list of PDB IDs. I then use several proteusPy functions to find structures with similar disulfide bonds.

In [1]:
#
import pandas as pd
import pyvista as pv
from pyvista import set_plot_theme

from proteusPy import Disulfide, DisulfideList, Load_PDB_SS

# pyvista setup for notebooks
pv.set_jupyter_backend("trame")

#set_plot_theme("dark")
LIGHT = True

ProteusPy V0.97.0dev1


### Load the RCSB Disulfide Database
We load the database and get its properties as follows:

In [2]:
PDB_SS = Load_PDB_SS(verbose=True)
PDB_SS.describe()

--> DisulfideLoader: Downloading Disulfide Database from Drive...


Downloading...
From (original): https://drive.google.com/uc?id=1igF-sppLPaNsBaUS7nkb13vtOGZZmsFp
From (redirected): https://drive.google.com/uc?id=1igF-sppLPaNsBaUS7nkb13vtOGZZmsFp&confirm=t&uuid=607f1574-2326-4d94-b243-c9dd69f1499f
To: /Users/egs/miniforge3/envs/proteusPy/lib/python3.11/site-packages/proteusPy/data/PDB_SS_ALL_LOADER.pkl
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 340M/340M [00:15<00:00, 22.6MB/s]


--> DisulfideLoader: Downloading Disulfide Subset Database from Drive...


Downloading...
From: https://drive.google.com/uc?id=1puy9pxrClFks0KN9q5PPV_ONKvL-hg33
To: /Users/egs/miniforge3/envs/proteusPy/lib/python3.11/site-packages/proteusPy/data/PDB_SS_SUBSET_LOADER.pkl
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.64M/9.64M [00:00<00:00, 16.1MB/s]


-> load_PDB_SS(): Reading /Users/egs/miniforge3/envs/proteusPy/lib/python3.11/site-packages/proteusPy/data/PDB_SS_ALL_LOADER.pkl... 
-> load_PDB_SS(): Done reading /Users/egs/miniforge3/envs/proteusPy/lib/python3.11/site-packages/proteusPy/data/PDB_SS_ALL_LOADER.pkl... 
PDB IDs present:                    35818
Disulfides loaded:                  120494
Average structure resolution:       2.34 Å
Lowest Energy Disulfide:            2q7q_75D_140D
Highest Energy Disulfide:           1toz_456A_467A
Cα distance cutoff:                 8.00 Å
Total RAM Used:                     30.72 GB.


In [3]:
best_ss = PDB_SS["2q7q_75D_140D"]
best_ss.pprint()
best_ss.display(style="sb", light=LIGHT)

<Disulfide 2q7q_75D_140D, Source: 2q7q, Resolution: 1.6 Å 
Χ1-Χ5: -59.36°, -59.28°, -83.66°, -59.82° -59.91°, -25.17°, 0.49 kcal/mol 
Cα Distance: 5.50 Å 
Torsion length: 145.62 deg>


[Interactive pyvista view of disulfide 2q7q_75D_140D]

I generated a query at https://www.ebi.ac.uk/pdbe/entry/pdb/2q7q to return PDB IDs for structures with high sequence similarity to 2q7q, the protein with the lowest-energy disulfide bond in the RCSB database. This yielded a `.csv` file, which we import below:

In [4]:
ss_df = pd.read_csv("2q7q_seqsim.csv")
ss_df.head(5)

Unnamed: 0,pdb_id,organism_scientific_name,tax_id,organism_synonyms,rank,genus,superkingdom,journal,journal_volume,journal_first_page,...,molecule_name,all_molecule_name,modified_residue_flag,molecule_type,mutation_type,entry_uniprot_accession,uniprot_id,molecule_synonym,gene_name,entity_id
0,2q7q,Paracoccus denitrificans,266,"Parde,Paracoccus Denitrificans,Micrococcus Den...","species,genus,family,order,class,phylum,superk...",Paracoccus,Bacteria,J. Mol. Biol.,276.0,,...,Methylamine dehydrogenase heavy chain,,N,Protein,Conflict,"P29894,P22619",DHMH_PARDE,"Methylamine dehydrogenase (amicyanin),Methylam...",mauB,1
1,2bbk,Paracoccus denitrificans,266,"Parde,Paracoccus Denitrificans,Micrococcus Den...","species,genus,family,order,class,phylum,superk...",Paracoccus,Bacteria,J. Mol. Biol.,276.0,,...,Methylamine dehydrogenase light chain,,Y,Protein,,"P29894,P22619",DHML_PARDE,"Methylamine dehydrogenase (amicyanin),MADH,Met...",mauA,2
2,2agy,Alcaligenes faecalis,511,"Achromobacter Sp. Atcc8750,Alcaligenes Sp. Bp1...","species,genus,family,order,class,phylum,superk...",Alcaligenes,Bacteria,Science,312.0,,...,Aralkylamine dehydrogenase light chain,,Y,Protein,,"P84887,P84888",AAUA_ALCFA,"Aromatic amine dehydrogenase,AADH,Aralkylamine...",aauA,1
3,2agy,Alcaligenes faecalis,511,"Achromobacter Sp. Atcc8750,Alcaligenes Sp. Bp1...","species,genus,family,order,class,phylum,superk...",Alcaligenes,Bacteria,Science,312.0,,...,Aralkylamine dehydrogenase heavy chain,,N,Protein,,"P84887,P84888",AAUB_ALCFA,"Aromatic amine dehydrogenase,Aralkylamine dehy...",aauB,2
4,2ah1,Alcaligenes faecalis,511,"Achromobacter Sp. Atcc8750,Alcaligenes Sp. Bp1...","species,genus,family,order,class,phylum,superk...",Alcaligenes,Bacteria,Science,312.0,,...,Aralkylamine dehydrogenase light chain,,Y,Protein,,"P84888,P84887",AAUA_ALCFA,"Aromatic amine dehydrogenase,AADH,Aralkylamine...",aauA,1


Sadly, all of the nearest sequence neighbors are bacterial. Let's extract the unique PDB IDs next.

In [5]:
relative_list = ss_df["pdb_id"].unique()
relative_list

array(['2q7q', '2bbk', '2agy', '2ah1', '2ah0', '2agl', '2agx', '2hjb',
       '1mae', '2oiz', '2ojy', '2i0s', '2iup', '2iur', '2agw', '2hxc',
       '2i0r', '2iuv', '2i0t', '2mad', '2agz', '2hkr', '2hj4', '2ok4',
       '2hkm', '1maf', '2ok6', '2iuq', '3orv', '2h47', '2h3x', '3l4m',
       '3l4o', '2j57', '2j55', '2j56', '3pxt', '3sle', '3c75', '3rn0',
       '3sjl', '3pxw', '3pxs', '3rlm', '3rmz', '4fa1', '4fa9', '3sxt',
       '4l3h', '3rn1', '3sws', '4o1q', '4l3g', '4k3i', '4l1q', '4fan',
       '4fa5', '4fav', '4fb1', '3svw', '4y5r', '4fa4', '2iaa', '1mg3',
       '1mg2', '2gc4', '2gc7', '2mta'], dtype=object)
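
The query returns one row per molecule, so an entry such as 2agy appears once for each chain. As a minimal sketch of the same de-duplication without pandas, the snippet below uses only the standard library and keeps first-seen order, as `unique()` does. The four-row excerpt is cut down from the table above to two columns, and the helper name `unique_pdb_ids` is illustrative only.

```python
import csv
import io

# Four rows cut down from the DataFrame shown above (only two of its columns).
SAMPLE_CSV = """pdb_id,molecule_name
2q7q,Methylamine dehydrogenase heavy chain
2agy,Aralkylamine dehydrogenase light chain
2agy,Aralkylamine dehydrogenase heavy chain
2ah1,Aralkylamine dehydrogenase light chain
"""


def unique_pdb_ids(handle):
    """Return the pdb_id column in first-seen order, without duplicates."""
    reader = csv.DictReader(handle)
    # dict.fromkeys preserves insertion order, so the row order of the file is kept.
    return list(dict.fromkeys(row["pdb_id"].strip().lower() for row in reader))


ids = unique_pdb_ids(io.StringIO(SAMPLE_CSV))
print(ids)
```

Keeping the original order matters if the rows are sorted by similarity, since the first IDs are then the closest relatives.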

We now need to convert the list of PDB IDs into actual disulfides from the database. We do this with the `DisulfideLoader.build_ss_from_idlist()` method, then print some relevant statistics.


In [6]:
# Collect disulfides from the database for the listed PDB IDs
relatives = PDB_SS.build_ss_from_idlist(relative_list)

print(
    f"There are: {relatives.length} related structures.\nAverage Energy: {relatives.average_energy:.2f} kcal/mol\nAverage Ca distance: {relatives.Average_Distance:.2f} Å"
)
print(
    f"Average resolution: {relatives.Average_Resolution:.2f} Å \nAverage torsion distance: {relatives.Average_Torsion_Distance:.2f}°"
)

There are: 68 related structures.
Average Energy: 2.23 kcal/mol
Average Ca distance: 3.99 Å
Average resolution: 1.88 Å 
Average torsion distance: 155.99°


Now let's look at the lowest and highest energy structures in this list of relatives.

In [7]:
ssmin, ssmax = relatives.minmax_energy
duolist = DisulfideList([ssmin, ssmax], "mM")
# duolist.display(style='sb', light=LIGHT)
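
The `minmax_energy` property is unpacked above as a (lowest, highest) pair. A rough sketch of that selection on plain records is shown here; it is not proteusPy's implementation. The `EnergyRecord` class and the two `hypothetical_*` entries are invented for illustration, and only the 2q7q_75D_140D energy comes from the output earlier in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyRecord:
    """Illustrative stand-in for a disulfide: a name and an energy in kcal/mol."""
    name: str
    energy: float


def minmax_by_energy(records):
    """Return the (lowest, highest) energy records from a non-empty sequence."""
    if not records:
        raise ValueError("need at least one record")
    lowest = min(records, key=lambda r: r.energy)
    highest = max(records, key=lambda r: r.energy)
    return lowest, highest


records = [
    EnergyRecord("2q7q_75D_140D", 0.49),   # energy reported by pprint() above
    EnergyRecord("hypothetical_a", 2.23),  # made-up value for illustration
    EnergyRecord("hypothetical_b", 5.10),  # made-up value for illustration
]
lowest, highest = minmax_by_energy(records)
print(lowest.name, highest.name)
```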

In [8]:
duolist.display_overlay(light=LIGHT)

[Interactive pyvista overlay of the two disulfides]

The overlay shows the two disulfides: the lowest- and highest-energy members of the relatives list.

We can find disulfides that are conformationally related by using the `DisulfideList.nearest_neighbors()` method with a dihedral angle cutoff. This cutoff is a measure of angular similarity across all five sidechain dihedral angles. Here we search around the dihedral angles of the lowest-energy relative with a cutoff of 10°.

In [9]:
close_neighbors = relatives.nearest_neighbors(
    ssmin.chi1, ssmin.chi2, ssmin.chi3, ssmin.chi4, ssmin.chi5, 10.0
)
close_neighbors.length

18
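
To make the cutoff concrete, here is a minimal sketch of one way such a filter could work. It assumes the distance is Euclidean in the five-dimensional dihedral space, with each angular difference wrapped into [-180°, 180°). That assumption is mine, and proteusPy's own metric may differ. The function names and the two shifted candidates are hypothetical. The target angles are the Χ1-Χ5 values printed for 2q7q_75D_140D, and their Euclidean norm reproduces the reported torsion length.

```python
import math


def wrap_angle(delta):
    """Wrap an angular difference in degrees into [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0


def torsion_distance(a, b):
    """Euclidean distance between two 5-tuples of dihedral angles (degrees)."""
    return math.sqrt(sum(wrap_angle(x - y) ** 2 for x, y in zip(a, b)))


def neighbors_within(target, candidates, cutoff):
    """Names of candidates whose torsion distance to target is <= cutoff."""
    return [
        name
        for name, chis in candidates.items()
        if torsion_distance(target, chis) <= cutoff
    ]


# Chi1..Chi5 of 2q7q_75D_140D, copied from the pprint() output above.
best = (-59.36, -59.28, -83.66, -59.82, -59.91)

# Norm of the five dihedrals; compare with the "Torsion length" line above.
torsion_length = math.sqrt(sum(x * x for x in best))
print(round(torsion_length, 2))

# Hypothetical candidates: every angle shifted by 3 or by 30 degrees.
candidates = {
    "shifted_3deg": tuple(x + 3.0 for x in best),
    "shifted_30deg": tuple(x + 30.0 for x in best),
}
print(neighbors_within(best, candidates, 10.0))
print(neighbors_within(best, candidates, 5.0))
```

A uniform 3° shift gives a distance of about 6.7°, so that candidate passes a 10° cutoff but fails a 5° one. The same effect means a tighter cutoff returns fewer neighbors from the same list.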

In [11]:
close_neighbors.display_overlay(light=LIGHT)

[Interactive pyvista overlay of the 18 close neighbors]

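We now have the 18 close neighbors of the lowest-energy disulfide among the relatives. Next, we repeat the search over the entire database (`PDB_SS.SSList`) with a tighter cutoff of 5°.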

In [12]:
ssTotList = PDB_SS.SSList
global_neighbors = ssTotList.nearest_neighbors(
    ssmin.chi1, ssmin.chi2, ssmin.chi3, ssmin.chi4, ssmin.chi5, 5.0
)
global_neighbors.length

25

In [13]:
global_neighbors.display_overlay(light=LIGHT)

[Interactive pyvista overlay of the 25 global neighbors]