In [None]:
# DisulfideBond Playground
# Playing with the DisulfideBond class
# Author: Eric G. Suchanek, PhD.
# (c) 2022 Eric G. Suchanek, PhD., All Rights Reserved
# License: MIT
# Last Modification: 1/22/23
# Cα Cβ Sγ

# important preamble

import pyvista as pv
from pyvista import set_plot_theme

from proteusPy import *
from proteusPy.data import *
from proteusPy.Disulfide import *

# override any default PDB globals
# location for PDB repository
PDB_ROOT = '/Users/egs/PDB/'

# location of cleaned PDB files - these are not stored in the repo
PDB_GOOD = '/Users/egs/PDB/good/'

# from within the repo 
PDB_REPO = '../pdb/'

# location of the compressed Disulfide .pkl files
MODELS = f'{PDB_ROOT}models/'

# pyvista setup for notebooks
pv.set_jupyter_backend('ipyvtklink')
set_plot_theme('document')

# Set the figure sizes and axis limits.
DPI = 220
WIDTH = 6.0
HEIGHT = 3.0
TORMIN = -179.0
TORMAX = 180.0
GRIDSIZE = 20


## Analysis of Disulfide Bonds in Proteins Within the RCSB Protein Data Bank
*Eric G. Suchanek, PhD. (suchanek@mac.com)* <br> <br>

## Summary
I describe the results of a structural analysis of Disulfide bonds contained in 36,362 proteins within the RCSB Protein Data Bank (https://www.rcsb.org). These protein structures contained 294,478 Disulfide Bonds. The analysis uses Python functions from my ``proteusPy`` package (https://github.com/suchanek/proteusPy/), which is built upon the excellent ``BioPython`` library (https://www.biopython.org).

This work represents a reprise of my original Disulfide modeling analysis conducted in 1986 ([publications](#publications) item 1) as part of my dissertation. Given that the original Disulfide database contained only 2xx Disulfide Bonds, I felt it would be interesting to revisit the RCSB and mine the thousands of new structures. The initial results are described in the cells below.

### Requirements
 - Biopython patched version, or my delta applied
 - proteusPy: https://github.com/suchanek/proteusPy/

## Introduction
Disulfide bonds are important covalent stabilizing elements in proteins. They are formed when two Sulphur-containing Cysteine (Cys) amino acid residues are close enough and in the correct geometry to form an S-S covalent bond between their terminal sidechain Sγ atoms. Disulfide bonds most commonly occur between alpha helices and greatly enhance a protein's stability to denaturation.
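
The "close enough" criterion can be made concrete with a simple distance check between two Sγ atoms. The sketch below is illustrative only: the 2.04 Å S-S bond length is an approximate textbook value, and the function names, the tolerance and the coordinates are made up for this example rather than taken from ``proteusPy``.

```python
import math

# Typical S-S covalent bond length in Angstroms (approximate textbook value).
SS_BOND_LENGTH = 2.04

def distance(a, b):
    """Euclidean distance between two 3D points given as (x, y, z) tuples."""
    return math.dist(a, b)

def could_be_bonded(sg1, sg2, tolerance=0.3):
    """Return True if two Sγ positions lie within `tolerance` Å of the
    typical S-S bond length. The tolerance is an arbitrary choice."""
    return abs(distance(sg1, sg2) - SS_BOND_LENGTH) <= tolerance

# Hypothetical Sγ coordinates (Å), not taken from any PDB entry.
sg_a = (0.0, 0.0, 0.0)
sg_b = (2.0, 0.3, 0.0)
sg_c = (5.0, 0.0, 0.0)

print(could_be_bonded(sg_a, sg_b))
print(could_be_bonded(sg_a, sg_c))
```

A distance test alone says nothing about the "correct geometry" requirement; the dihedral angles examined later in this notebook cover that part.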


## Download PDB Files containing Disulfide Bonds

The RCSB query yielded xx disulfides.

In [None]:

# Download_Disulfides(pdb_home=PDB_BASE, model_home=MODELS, reset=False)


## Extract the Disulfides from the PDB files
The function ``Extract_Disulfides()`` processes all the .ent files in ``PDB_DIR`` and creates two .pkl files representing the Disulfide bonds contained in the scanned directory. In addition, a .csv file containing problem IDs is written if any are found. The .pkl files are consumed by the ``DisulfideLoader`` class and are considered private. You'll see numerous warnings during the scan. Files that cannot be parsed are removed and their IDs are logged to the problem_id.csv file. The default file locations are stored in the file globals.py and are used by ``Extract_Disulfides()`` when no arguments are passed. The Disulfide parser is very stringent and will reject disulfide bonds with missing or disordered atoms.


Outputs are saved in ``MODEL_DIR``:
1) ``SS_PICKLE_FILE``: the ``DisulfideList`` of ``Disulfide`` objects initialized from the PDB file scan, needed by the ``DisulfideLoader()`` class.
2) ``SS_DICT_PICKLE_FILE``: the ``dict`` of ``Disulfide`` objects, also needed by the ``DisulfideLoader()`` class.
3) ``PROBLEM_ID_FILE``: a .csv file containing the problem IDs.
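
To illustrate why both a list and a dict are written, here is a minimal sketch of the two containers and a pickle round trip. The record layout and the IDs ``1abc`` and ``2xyz`` are hypothetical stand-ins; the real .pkl files hold ``Disulfide`` objects and their exact layout is not shown in this notebook.

```python
import pickle

# Hypothetical, simplified records; real Disulfide objects carry far more data.
records = [
    {"name": "1abc_5A_20A", "pdb_id": "1abc", "proximal": 5, "distal": 20},
    {"name": "1abc_30A_44A", "pdb_id": "1abc", "proximal": 30, "distal": 44},
    {"name": "2xyz_7B_19B", "pdb_id": "2xyz", "proximal": 7, "distal": 19},
]

# Flat list: convenient for positional access and slicing.
ss_list = list(records)

# Dict keyed by PDB ID: convenient for fetching all bonds in one structure.
ss_dict = {}
for rec in records:
    ss_dict.setdefault(rec["pdb_id"], []).append(rec)

# Serialize both containers, then restore them as a loader would.
list_blob = pickle.dumps(ss_list)
dict_blob = pickle.dumps(ss_dict)

restored_list = pickle.loads(list_blob)
restored_dict = pickle.loads(dict_blob)

print(len(restored_list), sorted(restored_dict))
```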

In general, the process only needs to be run once for a full scan. Setting the ``numb`` argument to -1 scans the entire directory. Entering a positive number allows parsing a subset of the dataset, which is useful when debugging. Setting ``verbose`` enables verbose messages. Setting ``quiet`` to ``True`` disables all warnings.

NB: An extraction of the initial disulfide bond-containing files (> 36,000 files) takes about 1.25 hours on a 2020 MacBook Pro with M1 Pro chip, 16GB RAM, 1TB SSD. The resulting .pkl files consume approximately 1GB of disk space, and an equivalent amount of RAM when loaded.

In [None]:
#Extract_Disulfides(numb=1000, pdbdir=PDB, datadir=MODELS, verbose=False, quiet=True), subset=True


## Load the Disulfide Data
Now that the Disulfides have been extracted and the Disulfide .pkl files have been created, we can load them into memory using the ``DisulfideLoader()`` class. This class stores the Disulfides internally as a ``DisulfideList`` and a dict. Array indexing operations, including slicing, have been overloaded, enabling straightforward access to the Disulfide bonds, both in aggregate and by residue. After loading the .pkl files, the class creates a Pandas ``DataFrame`` object consisting of the Disulfide ID, all sidechain dihedral angles, the local coordinates for the Disulfide and the computed Disulfide bond torsional energy.

NB: Loading the data takes 2.5 minutes on my MacBook Pro. Be patient if it seems to take a long time to load.
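
The overloaded indexing described above can be sketched with a toy container. ``MiniLoader`` is a hypothetical class written for this illustration and is not the ``DisulfideLoader`` implementation; in particular, lookup by a PDB ID string is an assumption about how such a container might behave.

```python
class MiniLoader:
    """Toy container supporting integer, slice and PDB-ID string indexing."""

    def __init__(self, names):
        self._names = list(names)
        # Group bond names by the PDB ID prefix (text before the first "_").
        self._by_pdb = {}
        for name in self._names:
            self._by_pdb.setdefault(name.split("_")[0], []).append(name)

    def __len__(self):
        return len(self._names)

    def __getitem__(self, key):
        if isinstance(key, (int, slice)):
            # Delegate positional access and slicing to the underlying list.
            return self._names[key]
        if isinstance(key, str):
            # Return every bond recorded for this PDB ID.
            return self._by_pdb[key]
        raise TypeError(f"unsupported index type: {type(key).__name__}")

# Bond names borrowed from the distance table shown later in this notebook.
loader = MiniLoader(["4nzd_90A_1C", "4nzd_1A_90C", "1c3a_135A_203B"])
print(loader[0])
print(loader[:2])
print(loader["1c3a"])
```

Dispatching on the key type inside a single ``__getitem__`` keeps the calling code uniform: the same bracket syntax serves positional access, slicing and lookup by structure.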

In [None]:
# when running from the repo the local copy of the Disulfides is in ../pdb/models
# PDB_BASE = '../pdb/'

# location of the compressed Disulfide .pkl files
# MODELS = f'{PDB_BASE}models/'

PDB_SS = DisulfideLoader(verbose=True, subset=True)


The Disulfide and DisulfideList classes include rendering capabilities using the excellent PyVista interface to the VTK package (http://pyvista.org). The following cell displays the first Disulfide bond in the database in ball-and-stick style. Atoms are colored by atom type:
- Grey = Carbon
- Blue = Nitrogen
- Red = Oxygen
- Yellow = Sulfur
- White = Previous residue carbonyl carbon and next residue amino Nitrogen. (more on this below).

The display is interactive. Click and drag to rotate; use the mouse wheel to zoom. The X-Y-Z widget in the upper right of the window allows orientation against the X, Y and Z axes.

In [None]:
ss = PDB_SS[0]

ss.display(style='cpk')

## Examine the Disulfide $C_\alpha-C_\alpha$ Distances
The Disulfide Bond $C_\alpha$-$C_\alpha$ distances are constrained by the bond lengths and bond angles of the disulfide bond itself. We can get an overall sense of the protein structure data quality by looking at the distance distribution and removing any disulfides that have unreasonable or physically impossible distances.
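
A rough upper bound on this distance follows from adding up the bond lengths along the path Cα-Cβ-Sγ-Sγ'-Cβ'-Cα'. The bond lengths below are approximate textbook values that I am assuming for this estimate; they are not read from the database.

```python
# Approximate textbook bond lengths in Angstroms (assumed values).
CA_CB = 1.53   # Cα-Cβ
CB_SG = 1.81   # Cβ-Sγ
SG_SG = 2.04   # Sγ-Sγ

# A fully extended (collinear) chain gives the largest possible separation;
# real bond angles are bent, so observed distances must be smaller.
max_ca_ca = 2 * (CA_CB + CB_SG) + SG_SG

print(f"Upper bound on Cα-Cα distance: {max_ca_ca:.2f} Å")
```

Under these assumptions the bound is about 8.7 Å, which is consistent with the 9 Å cutoff applied later in this notebook.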

In [7]:
# retrieve the torsions dataframe
from proteusPy.Disulfide import Torsion_DF_Cols

_SSdf = PDB_SS.getTorsions()

# entire database
SS_df = _SSdf.copy()

SS_df = SS_df[Torsion_DF_Cols].copy()

# summary statistics for the Cα-Cα distances
SS_df['ca_distance'].describe()


count    20388.000000
mean         5.619891
std          1.298061
min          2.885307
25%          5.128760
50%          5.694448
75%          6.239526
max         86.762092
Name: ca_distance, dtype: float64

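The average $C_\alpha$-$C_\alpha$ distance for the entire dataset is 5.62 $\AA$, with a minimum distance of 2.71 $\AA$ and a maximum of 158.8 $\AA$. (The summary above appears to have been computed on the subset loaded with ``subset=True``, so its extremes differ.) Since a distance that large is not physically possible for a disulfide bond, we should examine the data further to check for additional outliers.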

Only 236 disulfides have distances > 10 $\AA$. Later we will exclude the long disulfides and keep only those below a 9 $\AA$ cutoff. For now let's have a look at the longest disulfide. The ``DisulfideLoader.getTorsions()`` function returns a ``DataFrame`` containing these distances. We can sort by distance using Pandas.

In [10]:
# The distances are held in the overall Torsions array. We get this and sort
tors_df = PDB_SS.getTorsions()
tors_df.sort_values(by=['ca_distance'], ascending=False, inplace=True)

tors_df.head(10)

Unnamed: 0,source,ss_id,proximal,distal,chi1,chi2,chi3,chi4,chi5,energy,ca_distance,phi_prox,psi_prox,phi_dist,psi_dist,torsion_length
17134,4nzd,4nzd_90A_1C,90,1,-37.429263,131.229281,-119.919428,18.145501,-61.88687,7.606388,86.762092,-180.0,-180.0,-180.0,-180.0,192.774446
1597,1c3a,1c3a_135A_203B,135,203,-73.669379,83.55427,144.315112,-10.124321,-46.169513,8.918361,75.611323,-180.0,-180.0,-180.0,-180.0,188.333403
17131,4nzd,4nzd_1A_90C,1,90,-60.790678,-33.636931,-110.683889,134.532096,-36.558873,5.86656,64.932556,-180.0,-180.0,-180.0,-180.0,191.084559
19733,2zvk,2zvk_693U_27B,693,27,-27.113506,175.661831,-62.353988,58.841591,-83.39667,5.162518,61.467054,-180.0,-180.0,-180.0,-180.0,214.237201
13424,5d0n,5d0n_128A_417A,128,417,54.461605,116.741367,-28.426965,162.2803,-95.013077,11.051428,54.910093,-106.615081,-36.531046,-139.824468,163.292236,229.706522
2679,4wmy,4wmy_31A_48A,31,48,-63.624179,33.65565,108.982314,132.674109,-147.286368,6.760177,29.415026,-180.0,-180.0,-180.0,-180.0,237.389084
16924,4unv,4unv_45A_101A,45,101,-67.2454,-88.809865,71.684304,-146.729846,-67.532919,3.203989,15.306822,-71.405759,149.932418,-129.873676,150.052832,208.897387
14230,7rar,7rar_14C_45C,14,45,-78.888758,5.574249,-62.167205,-77.168749,-179.358398,4.76716,12.592702,-67.661592,138.281501,-54.312192,125.689716,219.644532
12783,2wqw,2wqw_206A_227A,206,227,-62.829584,15.567788,100.779284,-84.654993,-163.097907,4.330667,11.374002,-65.53532,-21.741384,-63.94396,-27.440377,219.348517
12784,2wqw,2wqw_206B_227B,206,227,-59.958576,14.045489,99.641583,-82.592835,-169.117848,3.765888,11.37374,-67.800883,-17.027946,-61.865848,-27.412084,221.682599


In [None]:
ssdistmin_id = tors_df.iloc[-1]['ss_id']
ssdistmax_id = tors_df.iloc[0]['ss_id']

ssdistmin = PDB_SS.get_by_name(ssdistmin_id)
ssdistmax = PDB_SS.get_by_name(ssdistmax_id)

duolist = DisulfideList([ssdistmin, ssdistmax], 'dist_minmax')
duolist.display(style='cpk')

The above shows these long disulfides are structurally impossible and should be removed from consideration. They most likely resulted from errors in modeling. We can look at the overall distribution of distances with a simple histogram.

In [13]:
import plotly.express as px
#title = r'$\text{C}_\alpha \text{Distance}$'
title = 'Cα Distance (Å)'
labels = {'value': 'Cα Distance', 'variable': 'Probability'}
fig = px.histogram(SS_df['ca_distance'], histnorm='probability', labels=labels,
                   title=title)
                   
fig.show()


We can easily remove the bad Disulfides from consideration by slicing the torsions DataFrame:

In [14]:
# there are a few structures with bad SSBonds. Their
# CA distances are > 9.0. We remove them from consideration
# below

_far = _SSdf['ca_distance'] >= 9.0
_near = _SSdf['ca_distance'] < 9.0
SS_df = _SSdf[_near]

SS_df_Far = _SSdf[_far]
SS_df_Far.describe()

Unnamed: 0,proximal,distal,chi1,chi2,chi3,chi4,chi5,energy,ca_distance,phi_prox,psi_prox,phi_dist,psi_dist,torsion_length
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,154.9,138.6,-47.708772,45.358313,14.185112,10.520285,-104.941844,6.143314,42.374541,-127.901864,-68.708646,-134.982014,-51.581768,212.309825
std,203.052511,128.848059,39.1572,80.785273,100.613582,108.677012,54.546535,2.468049,29.467719,56.121515,132.702625,54.856012,149.768607,16.816578
min,1.0,1.0,-78.888758,-88.809865,-119.919428,-146.729846,-179.358398,3.203989,11.37374,-180.0,-180.0,-180.0,-180.0,188.333403
25%,34.5,45.75,-66.340094,7.692059,-62.307292,-81.236814,-159.145022,4.43979,13.271232,-180.0,-180.0,-180.0,-180.0,196.805181
50%,109.0,95.5,-61.810131,24.611719,21.628669,4.01059,-89.204873,5.514539,42.162559,-143.307541,-108.265523,-159.912234,-103.720189,216.792859
75%,188.25,221.0,-43.061592,108.444593,100.494859,114.215979,-63.298383,7.394835,64.06618,-68.702102,-18.206306,-80.426389,87.414266,221.173082
max,693.0,417.0,54.461605,175.661831,144.315112,162.2803,-36.558873,11.051428,86.762092,-65.53532,149.932418,-54.312192,163.292236,237.389084


As can be seen above, there were XXX structurally impossible disulfides in the entire dataset. Moving forward, the analysis will use Disulfides with $C_\alpha$-$C_\alpha$ distances < 9 $\AA$.

In [15]:
import plotly.express as px
#title = r'$\text{C}_\alpha \text{Distance}$'
title = 'Cα Distance (Å)'

labels = {'value': 'Cα Distance >= 9 Å', 'variable': 'Probability'}
fig = px.histogram(SS_df_Far['ca_distance'], histnorm='probability', 
                   labels=labels, title=title)
                   
fig.show()


In [17]:
# use the < 9 'near' disulfides only.
SS_df = _SSdf[_near]

#title = r'$\text{C}_\alpha \text{Distance}$'
title = 'Cα Distance (Å)'

labels = {'value': 'Cα Distance < 9Å', 'variable': 'Probability'}
fig = px.histogram(SS_df['ca_distance'], histnorm='probability', 
                   labels=labels, title=title)
                   
fig.show()


## Examining Disulfide Energies
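
The energy column holds an approximate torsional energy computed from the five dihedral angles. The functional form sketched below is my assumption about that calculation, since it is not spelled out in this notebook; it is a cosine-series potential with one term per rotatable bond. As a sanity check, it reproduces the tabulated energy of 3.203989 for ``4unv_45A_101A`` in the distance-sorted table above to three decimal places.

```python
import math

def torsional_energy(chi1, chi2, chi3, chi4, chi5):
    """Approximate disulfide torsional energy from five dihedrals in degrees."""
    def c(angle_deg, multiplicity):
        return math.cos(math.radians(multiplicity * angle_deg))

    energy = 2.0 * (c(chi1, 3) + c(chi5, 3))       # rotations about Cα-Cβ
    energy += c(chi2, 3) + c(chi4, 3)              # rotations about Cβ-Sγ
    energy += 3.5 * c(chi3, 2) + 0.6 * c(chi3, 3)  # rotation about Sγ-Sγ
    return energy + 10.1

# Dihedrals of 4unv_45A_101A, copied from the sorted table above.
e = torsional_energy(-67.2454, -88.809865, 71.684304, -146.729846, -67.532919)
print(round(e, 3))
```

The twofold $\chi_{3}$ term carries the largest coefficient, so the S-S dihedral has the greatest influence on the total.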


In [19]:
labels = {'value': 'Energy', 'variable': 'Count'}
px.histogram(SS_df['energy'], labels=labels, histnorm='probability')


### Find the lowest and highest energy disulfides and display them

In [20]:
# takes 5 min 26 sec on the entire dataset
#All_SS_list = PDB_SS.getlist()
All_SS_list = PDB_SS.SSList

ssMin, ssMax = All_SS_list.minmax_energy()


In [None]:
duolist = DisulfideList([ssMin, ssMax], 'energy_duo')
duolist.display(style='sb')

In [None]:
SS_df = _SSdf[_near]
SS_df = SS_df[Torsion_DF_Cols]

SS_df = SS_df.sort_values(by=['energy'])
ssid_list = SS_df['ss_id'].values

good_SS_list = DisulfideList([], 'low_energy')
bad_SS_list = DisulfideList([], 'high_energy')

# after sorting by energy, the first 12 IDs are the lowest-energy disulfides
for ssid in ssid_list[:12]:
    good_SS_list.append(PDB_SS.get_by_name(ssid))

# the last 12 IDs are the highest-energy disulfides, taken highest first
for ssid in ssid_list[:-13:-1]:
    bad_SS_list.append(PDB_SS.get_by_name(ssid))


In [None]:
bad_SS_list.display(style='sb')


## Examine the Disulfide Torsions
The disulfide bond's overall conformation is defined by the sidechain dihedral angles $\chi_{1}$-$\chi_{5}$. Since the S-S bond has electron delocalization, it exhibits some double-bond character, with strong energy minima at $\chi_{3}$ values of $+90°$ and $-90°$. *Left-handed* Disulfides have $\chi_{3}$ < 0.0° and *Right-handed* Disulfides have $\chi_{3}$ > 0.0°.

These torsion values, along with the approximate torsional energy, are stored in the ``DisulfideLoader()`` class and individually within each Disulfide object. We access them via the ``DisulfideLoader.getTorsions()`` function.
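
For readers unfamiliar with dihedral angles, the following sketch computes a signed dihedral from four atom positions and applies the handedness rule described above. It is a generic geometric calculation, not the ``proteusPy`` code, and the coordinates are made up so that $\chi_{3}$ comes out at $+90°$.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees (between -180 and 180) for four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Normalize the central bond so projections onto it are simple dot products.
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

def handedness(chi3):
    """Mirror the split used in this notebook: chi3 <= 0 is left-handed."""
    return "left" if chi3 <= 0.0 else "right"

# Idealized Cβ-Sγ-Sγ'-Cβ' geometry (made-up coordinates) with chi3 = +90°.
cb1, sg1, sg2, cb2 = (1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)
chi3 = dihedral(cb1, sg1, sg2, cb2)
print(round(chi3, 1), handedness(chi3))
```

Mirroring the last atom through the plane of the first three flips the sign of the angle, which is exactly what separates the left-handed from the right-handed population.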


In [8]:
# make two dataframes containing left handed and right handed disulfides

_left = SS_df['chi3'] <= 0.0
_right = SS_df['chi3'] > 0.0

# left handed and right handed torsion dataframes
SS_df_Left = SS_df[_left]
SS_df_Right = SS_df[_right]


## Examine Torsion Length Vector
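
The ``torsion_length`` column is not defined in this notebook. The sketch below assumes it is the Euclidean norm of the five dihedral angles treated as a vector; that reading reproduces the tabulated value of 208.897387 for ``4unv_45A_101A`` in the distance-sorted table above, but it remains an inference from the data.

```python
import math

def torsion_length(chis):
    """Euclidean norm of the five dihedral angles (degrees)."""
    return math.sqrt(sum(chi * chi for chi in chis))

# chi1..chi5 of 4unv_45A_101A, copied from the distance-sorted table above.
chis = (-67.2454, -88.809865, 71.684304, -146.729846, -67.532919)
print(round(torsion_length(chis), 3))
```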

In [None]:
# The torsion lengths are held in the overall torsions DataFrame. We sort by them, longest first
tors_df.sort_values(by=['torsion_length'], ascending=False, inplace=True)

sstormin_id = tors_df.iloc[-1]['ss_id']
sstormax_id = tors_df.iloc[0]['ss_id']
sstormin = PDB_SS.get_by_name(sstormin_id)
sstormax = PDB_SS.get_by_name(sstormax_id)

duolist = DisulfideList([sstormin, sstormax], 'tor_minmax')
duolist.display(style='cpk')