### Python Preamble

In [1]:
# Disulfide Bond Analysis
# Author: Eric G. Suchanek, PhD.
# Last revision: 12/16/22 -egs-
# Cα Cβ Sγ

import math
import numpy
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

import proteusPy
from proteusPy import *
from proteusPy.disulfide import *
from proteusPy.proteusGlobals import *

import pandas as pd

import pyvista as pv
from pyvista import set_plot_theme

pv.set_jupyter_backend('ipygany')
set_plot_theme('dark')

# the locations below represent the actual location on the dev drive.
# location for PDB repository
PDB_BASE = '/Users/egs/PDB/'

# location of cleaned PDB files
PDB = '/Users/egs/PDB/good/'

# location of the compressed Disulfide .pkl files
MODELS = f'{PDB_BASE}models/'


# Analysis of Disulfide Bonds in Proteins Within the RCSB Protein Data Bank
*Eric G. Suchanek, PhD. (suchanek@mac.com)* <br> <br>

## Summary
I describe the results of a structural analysis of Disulfide bonds contained in 36,362 proteins within the RCSB Protein Data Bank, https://www.rcsb.org. These protein structures contained 294,478 Disulfide Bonds. The analysis utilizes Python functions from my ``ProteusPy`` package https://github.com/suchanek/proteusPy/, which is built upon the excellent ``BioPython`` library (https://www.biopython.org).

This work represents a reprise of my original Disulfide modeling analysis conducted in 1986 ([publications](#publications) item 1) as part of my dissertation. Given that the original Disulfide database contained only 2xx Disulfide Bonds, I felt it would be interesting to revisit the RCSB and mine the thousands of new structures. The initial results are described in the cells below.

### Requirements
 - Biopython patched version, or my delta applied
 - proteusPy: https://github.com/suchanek/proteusPy/

## Introduction
Disulfide bonds are important covalent stabilizing elements in proteins. They are formed when two Sulphur-containing Cysteine (Cys) amino acid residues are close enough and in the correct geometry to form an S-S covalent bond with their terminal sidechain Sγ atoms. Disulfide bonds most commonly occur between alpha helices and greatly enhance a protein's stability to denaturation.
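
As a rough illustration of the "close enough" criterion, the sketch below checks whether two Sγ positions lie near a typical S-S bond length. The length of about 2.04 Å, the tolerance, the function names and the coordinates are all assumptions made for this example. None of them are taken from proteusPy, and a real check would also have to consider the bond geometry.

```python
import math

# Assumed approximate S-S bond length in angstroms (illustrative only).
SS_BOND_LENGTH = 2.04
# Hypothetical tolerance chosen for this sketch, not a proteusPy setting.
TOLERANCE = 0.2


def sg_distance(sg1, sg2):
    """Euclidean distance between two Sγ coordinates given as (x, y, z)."""
    return math.dist(sg1, sg2)


def could_be_bonded(sg1, sg2, ideal=SS_BOND_LENGTH, tol=TOLERANCE):
    """Return True when the Sγ-Sγ distance lies within tol of the ideal length."""
    return abs(sg_distance(sg1, sg2) - ideal) <= tol


# Made-up coordinates for demonstration purposes.
print(could_be_bonded((0.0, 0.0, 0.0), (2.0, 0.3, 0.0)))  # a close pair
print(could_be_bonded((0.0, 0.0, 0.0), (5.5, 0.0, 0.0)))  # a distant pair
```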


## Download PDB Files containing Disulfide Bonds

The RCSB query yielded 

In [2]:

# DownloadDisulfides(pdb_home=PDB_BASE, model_home=MODELS, reset=False)


## Extract the Disulfides from the PDB files
The function ``ExtractDisulfides()`` processes all the .ent files in ``PDB_DIR`` and creates two .pkl files representing the Disulfide bonds contained in the scanned directory. In addition, a .csv file containing problem IDs is written if any are found. The .pkl files are consumed by the ``DisulfideLoader`` class and are considered private. You'll see numerous warnings during the scan. Files that cannot be parsed are removed and their IDs are logged to the problem_id.csv file. The default file locations are stored in the file globals.py and are used by ``ExtractDisulfides()`` when no arguments are passed. The Disulfide parser is very stringent and will reject disulfide bonds with missing or disordered atoms.


Outputs are saved in ``MODEL_DIR``:
1) ``SS_PICKLE_FILE``: the ``DisulfideList`` of ``Disulfide`` objects initialized from the PDB file scan, needed by the ``DisulfideLoader()`` class.
2) ``SS_DICT_PICKLE_FILE``: the dict of ``Disulfide`` objects, also needed by the ``DisulfideLoader()`` class.
3) ``PROBLEM_ID_FILE``: a .csv containing the problem IDs.

In general, the process only needs to be run once for a full scan. Setting the ``numb`` argument to -1 scans the entire directory. Entering a positive number allows parsing a subset of the dataset, which is useful when debugging. Setting ``verbose`` enables verbose messages. Setting ``quiet`` to ``True`` disables all warnings.

NB: An extraction of the initial disulfide bond-containing files (> 36000 files) takes about 1.25 hours on a 2020 MacBook Pro with M1 Pro chip, 16GB RAM, 1TB SSD. The resulting .pkl files consume approximately 1GB of disk space, and an equivalent amount of RAM when loaded.

In [3]:


# ExtractDisulfides(numb=1000, pdbdir=PDB, modeldir=MODELS, verbose=False, quiet=True)


## Load the Disulfide Data
Now that the Disulfides have been extracted and the Disulfide .pkl files have been created, we can load them into memory using the ``DisulfideLoader()`` class. This class stores the Disulfides internally as a ``DisulfideList`` and a dict. Array indexing operations, including slicing, have been overloaded, enabling straightforward access to the Disulfide bonds, both in aggregate and by residue. After loading the .pkl files, the class creates a Pandas ``DataFrame`` object consisting of the Disulfide ID, all sidechain dihedral angles, the local coordinates for the Disulfide, and the computed Disulfide bond torsional energy.

NB: Loading the data takes 2.5 minutes on my MacBook Pro. Be patient if it seems to take a long time to load.

In [4]:
# when running from the repo the local copy of the Disulfides is in ../pdb/models
# PDB_BASE = '../pdb/'

# location of the compressed Disulfide .pkl files
# MODELS = f'{PDB_BASE}models/'

PDB_SS = None
PDB_SS = DisulfideLoader(verbose=True, modeldir=MODELS)


Reading disulfides from: /Users/egs/PDB/models/PDB_all_ss.pkl
Disulfides Read: 8210
Reading disulfide dict from: /Users/egs/PDB/models/PDB_all_ss_dict.pkl
Reading Torsion DF /Users/egs/PDB/models/PDB_SS_torsions.csv.
Read torsions DF.
PDB IDs parsed: 1000
Total Space Used: 1844317 bytes.
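
The overloaded indexing described above can be pictured with a small stand-in container. This is only a sketch of the general technique (dispatching on the key type inside ``__getitem__``). The class ``ToyBondStore``, its fields and the IDs are hypothetical and do not reflect how ``DisulfideLoader`` is actually implemented.

```python
class ToyBondStore:
    """Hypothetical stand-in for a loader-like container; not proteusPy code."""

    def __init__(self, bonds):
        # Each bond is a (structure_id, proximal, distal) tuple.
        self._bonds = list(bonds)
        # Secondary index: structure_id -> list of bonds in that structure.
        self._by_id = {}
        for bond in self._bonds:
            self._by_id.setdefault(bond[0], []).append(bond)

    def __len__(self):
        return len(self._bonds)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing returns a new container holding the selected bonds.
            return ToyBondStore(self._bonds[key])
        if isinstance(key, int):
            # Integer access returns a single bond.
            return self._bonds[key]
        if isinstance(key, str):
            # String access returns every bond recorded under that ID.
            return self._by_id[key]
        raise TypeError(f"unsupported key type: {type(key).__name__}")


# Made-up records for demonstration purposes.
store = ToyBondStore([
    ("1abc", 5, 55),
    ("1abc", 14, 38),
    ("2xyz", 26, 84),
])
print(store[0])          # a single bond
print(len(store[0:2]))   # number of bonds in the slice
print(store["1abc"])     # all bonds for one ID
```

Keeping both a list and a dict mirrors the two internal structures mentioned above: the list serves positional access and slicing, while the dict serves lookups by identifier.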


## Examine the Disulfide Torsions
The disulfide bond's overall conformation is defined by the sidechain dihedral angles $\chi_{1}$-$\chi_{5}$. Since the S-S bond has electron delocalization, it exhibits some double-bond character with strong minima at $+90°$ and $-90°$. The *Left-handed* Disulfides have $\chi_{3}$ < 0.0° and the *Right-handed* have a $\chi_{3}$ > 0.0°.
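
For readers unfamiliar with dihedral angles, the sketch below computes the signed torsion defined by four points using the standard projection-and-``atan2`` construction, and then assigns handedness from the sign of $\chi_{3}$. It uses only the standard library. The helper names are assumptions made for this example and are not the proteusPy functions. Treating an exact zero as right-handed is also an assumption.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for the four points p0-p1-p2-p3."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Normalize the central bond vector.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bond vectors onto the plane perpendicular to b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    # The angle between the projections is the dihedral angle.
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def handedness(chi3):
    """Label a disulfide by the sign of chi3 (zero treated as right-handed)."""
    return "left" if chi3 < 0.0 else "right"


# Made-up coordinates: the outer bonds are perpendicular when viewed
# along the central bond, so the magnitude of the angle is 90 degrees.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(angle, handedness(angle))
```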

These torsion values, along with the approximate torsional energy, are stored in the ``DisulfideLoader()`` class. We access them via the ``DisulfideLoader.getTorsions()`` function.

We can get a quick look at their overall statistics using the Pandas ``DataFrame.describe()`` function.


In [5]:
# Retrieve the torsions dataframe, then split it by Cα distance
# into near, far and zero-distance subsets.

_cols = ['chi1', 'chi2', 'chi3', 'chi4', 'chi5', 'energy', 'ca_distance']

_SSdf = PDB_SS.getTorsions()

# There are a few structures with bad SSBonds. Their
# Cα distances are >= 7.0. We remove them from consideration
# below.

_far = _SSdf['ca_distance'] >= 7.0
_near = _SSdf['ca_distance'] < 7.0
_zed = _SSdf['ca_distance'] == 0.0

SS_df_zed = _SSdf[_zed].copy()
SS_df_Far = _SSdf[_far].copy()
SS_df = _SSdf[_near].copy()

SS_df = SS_df[_cols].copy()

#print(SS_df.head(5))
SS_df.describe()


| | chi1 | chi2 | chi3 | chi4 | chi5 | energy | ca_distance |
|---|---|---|---|---|---|---|---|
| count | 8132.0 | 8132.0 | 8132.0 | 8132.0 | 8132.0 | 8132.0 | 8132.0 |
| mean | -46.595732 | -4.547231 | -4.875888 | -25.863117 | -28.12194 | 3.698889 | 5.600403 |
| std | 103.720743 | 109.696516 | 93.590165 | 111.606315 | 99.680978 | 2.313802 | 0.800254 |
| min | -179.990815 | -179.990782 | -179.502078 | -179.940602 | -179.994072 | 0.494753 | 2.929105 |
| 25% | -90.535282 | -86.640862 | -88.420862 | -99.138569 | -73.936822 | 2.057033 | 5.120277 |
| 50% | -64.205495 | -54.626617 | -65.24418 | -69.024871 | -59.42788 | 3.269689 | 5.693751 |
| 75% | -42.113963 | 99.81319 | 93.236462 | 83.293847 | 57.516152 | 4.584666 | 6.253852 |
| max | 179.94597 | 179.987671 | 179.554652 | 179.977181 | 179.990016 | 17.289549 | 6.996572 |
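
The ``mean`` row reported by ``describe()`` is an ordinary arithmetic mean. Because torsion angles wrap around at ±180°, a circular mean can serve as a useful cross-check when values cluster near the wrap point. The following is a minimal sketch using only the standard library. The function name and the sample angles are illustrative assumptions and are not part of the dataset.

```python
import math


def circular_mean_deg(angles):
    """Circular mean of angles given in degrees, returned in (-180, 180]."""
    # Average the unit vectors and take the direction of the resultant.
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))


# Made-up angles clustered around the +/-180 degree wrap point.
sample = [170.0, -170.0, 175.0, -175.0]

arithmetic = sum(sample) / len(sample)
circular = circular_mean_deg(sample)

print(f"arithmetic mean: {arithmetic:.1f}")
print(f"circular mean magnitude: {abs(circular):.1f}")
```

For this sample the arithmetic mean lands on the opposite side of the circle from the data, while the circular mean stays with the cluster.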


In [6]:
SS_df_Far.describe()

| | Unnamed: 0 | proximal | distal | chi1 | chi2 | chi3 | chi4 | chi5 | energy | ca_distance |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 78.0 | 78.0 | 78.0 | 78.0 | 78.0 | 78.0 | 78.0 | 78.0 | 78.0 | 78.0 |
| mean | 5237.923077 | 212.448718 | 256.641026 | -50.734121 | 14.782819 | 41.202433 | -22.5991 | -8.708395 | 7.833334 | 8.322996 |
| std | 1993.564013 | 255.736961 | 246.891636 | 100.744113 | 149.383546 | 130.957592 | 157.060426 | 91.700897 | 2.332494 | 8.121318 |
| min | 140.0 | 15.0 | 34.0 | -178.962687 | -179.82705 | -178.856259 | -178.756423 | -179.999305 | 0.928377 | 7.000319 |
| 25% | 4292.5 | 23.0 | 88.25 | -158.833949 | -153.131362 | -117.024771 | -169.438982 | -68.925731 | 6.198417 | 7.048314 |
| 50% | 5651.0 | 114.0 | 152.0 | -66.134241 | 109.315126 | 111.692322 | -130.779599 | -25.976132 | 7.849847 | 7.083456 |
| 75% | 6584.0 | 317.0 | 290.0 | 52.506347 | 152.028464 | 151.77109 | 157.950613 | 66.651257 | 9.430485 | 7.175573 |
| max | 8207.0 | 1155.0 | 1173.0 | 172.223331 | 179.347674 | 179.976934 | 178.655554 | 177.513969 | 12.761936 | 75.611323 |


## Examining the Torsion Angles



In [7]:
labels = {'value': 'Angle', 'variable': 'Torsion'}
cols = ['chi1', 'chi5']
data = SS_df[cols].copy()
data.set_index('chi1')

px.histogram(SS_df, x='chi1', y='chi5', labels=labels, histfunc='avg', nbins=90)
#px.histogram(SS_df[cols], labels=labels)


### Examining torsions by Disulfide Handedness
We split the dataset into these two families easily with Pandas.

In [None]:
# make two dataframes containing left handed and right handed disulfides

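# chi3 < 0 is left-handed; chi3 == 0 is grouped with the right-handed set,
# matching the list-based split in the next cell
_left = SS_df['chi3'] < 0.0
_right = SS_df['chi3'] >= 0.0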

# left handed and right handed torsion dataframes
SS_df_Left = SS_df[_left]
SS_df_Right = SS_df[_right]


In [None]:
# build two lists, one for left-handed and one for right-handed disulfides
ss_list = PDB_SS.getlist()
left_handed_SS = DisulfideList([], 'left_handed')
right_handed_SS = DisulfideList([], 'right_handed')

for ss in ss_list:
    if ss.chi3 < 0:
        left_handed_SS.append(ss)
    else:
        right_handed_SS.append(ss)


print(f'Left Handed: {len(left_handed_SS)}, Right Handed: {len(right_handed_SS)}')



In [None]:
px.histogram(SS_df, x='chi2', y='chi4', labels=labels, histfunc='avg', nbins=90)

In [None]:
# map the column name to a readable axis label
labels2 = {'ca_distance': 'Cα Distance (Å)'}
px.histogram(SS_df, x='ca_distance', labels=labels2, nbins=100)

We can use a hexbin plot to generate the equivalent of a heatmap to explore the symmetry for the Chi1-Chi5 and Chi2-Chi4 dihedral angles. Since the disulfide bond is symmetric about Chi3 (the S-S bond), one would expect the distribution of Chi1 to be similar to Chi5, and the same with Chi2 and Chi4.

In [None]:

# Set the figure sizes and axis limits.
DPI = 220
WIDTH = 6.0
HEIGHT = 3.0
TORMIN = -179.0
TORMAX = 180.0
GRIDSIZE = 20


In [None]:
x = SS_df['ca_distance']
plt.rcParams.update({'font.size': 8})
plt.rcParams['text.usetex'] = True

fig = plt.figure(figsize=(WIDTH, HEIGHT), dpi=DPI)
fig.suptitle(r'C$\alpha$ Distance Distribution', fontsize=8)

ax1= fig.subplots()
ax1.set_xlabel(r'Distance')
ax1.set_ylabel(r'Frequency')

ax1.hist(x)

plt.show()


In [None]:
# hexbin plots of chi1 vs chi5 for the left-handed and right-handed disulfides

x = SS_df_Left['chi1']
y = SS_df_Left['chi5']

x2 = SS_df_Right['chi1']
y2 = SS_df_Right['chi5']

plt.rcParams.update({'font.size': 8})
plt.rcParams['text.usetex'] = True

fig = plt.figure(figsize=(WIDTH, HEIGHT), dpi=DPI)

ax1, ax2 = fig.subplots(1, 2, sharey=True)

fig.suptitle(r'$\chi_{1} - \chi_{5}$ Symmetry', fontsize=8)

ax1.tick_params(axis='both', which='major', labelsize=6)
ax2.tick_params(axis='both', which='major', labelsize=6)

ax1.set_xlim(TORMIN-1, TORMAX+1)
ax1.set_ylim(TORMIN-1, TORMAX+1)

ax1.xaxis.set_ticks(numpy.arange(TORMIN-1, TORMAX+1, 60))

ax1.set_xlabel(r'$\chi_{1 (left-handed)}$')
ax1.set_ylabel(r'$\chi_{5}$')

ax1.hexbin(x, y, gridsize=GRIDSIZE, cmap='nipy_spectral')

ax2.set_xlim(TORMIN-1, TORMAX+1)
ax2.set_ylim(TORMIN-1, TORMAX+1)

ax2.xaxis.set_ticks(numpy.arange(TORMIN-1, TORMAX+1, 60))

ax2.set_xlabel(r'$\chi_{1 (right-handed)}$')

ax2.hexbin(x2, y2, gridsize=GRIDSIZE, cmap='nipy_spectral')

plt.show()



The distributions look extremely similar for both the left-handed and right-handed populations. Let's look at Chi2 and Chi4 to see if this holds true.

In [None]:
x = SS_df_Left['chi2']
y = SS_df_Left['chi4']

x2 = SS_df_Right['chi2']
y2 = SS_df_Right['chi4']

plt.rcParams.update({'font.size': 8})
plt.rcParams['text.usetex'] = True

fig = plt.figure(figsize=(WIDTH, HEIGHT), dpi=DPI)

fig.suptitle(r'$\chi_{2} - \chi_{4}$ Symmetry', fontsize=8)

ax1, ax2 = fig.subplots(1, 2, sharey=False)

ax1.tick_params(axis='both', which='major', labelsize=6)
ax2.tick_params(axis='both', which='major', labelsize=6)

ax1.set_xlim(TORMIN-1, TORMAX+1)
ax1.set_ylim(TORMIN-1, TORMAX+1)
ax1.xaxis.set_ticks(numpy.arange(TORMIN-1, TORMAX+1, 60))

ax1.set_xlabel(r'$\chi_{2 (left-handed)}$')
ax1.set_ylabel(r'$\chi_{4}$')

ax1.hexbin(x, y, gridsize=GRIDSIZE, cmap='nipy_spectral')

ax2.set_xlim(TORMIN-1, TORMAX+1)
ax2.set_ylim(TORMIN-1, TORMAX+1)
ax2.xaxis.set_ticks(numpy.arange(TORMIN-1, TORMAX+1, 60))

ax2.set_xlabel(r'$\chi_{2 (right-handed)}$')
ax2.hexbin(x2, y2, gridsize=GRIDSIZE, cmap='nipy_spectral')

plt.show()



The distributions for the left-handed and right-handed groups show distinct differences. Both show the predicted minimum around -75 degrees, but the right-handed group also shows a population of disulfides in the +90 degree range.

In [None]:
#

fig = plt.figure(figsize=(WIDTH, HEIGHT), dpi=DPI)
ax1 = plt.subplot(111)

x = SS_df['chi3']
y = SS_df['energy']

ymax = 10.0
plt.hexbin(x, y, gridsize=40, cmap='nipy_spectral')
ax1.set_xlim(TORMIN-1, TORMAX+1)
ax1.set_ylim(y.min(), ymax)
ax1.set_xlabel(r'$\chi_{3}$')
ax1.set_ylabel(r'Energy kcal/mol')
ax1.set_title(r'Energy vs $\chi_{3}$')
ax1.xaxis.set_ticks(numpy.arange(TORMIN-1, TORMAX+1, 30))

plt.show()


In [None]:
# Cα distance vs chi3 over the full chi3 range
TORMAX = 180
fig = plt.figure(figsize=(WIDTH, HEIGHT), dpi=DPI)
ax1 = plt.subplot(111)

x = SS_df['chi3']
y = SS_df['ca_distance']

ymax = 8.0
plt.hexbin(x, y, gridsize=80, cmap='nipy_spectral')
ax1.set_xlim(TORMIN-1, TORMAX)
ax1.set_ylim(y.min(), y.max())
ax1.set_xlabel(r'$\chi_{3}$')
ax1.set_ylabel(r'C${\alpha}$ Distance ($\AA$)')
ax1.set_title(r'C${\alpha}$ Distance vs $\chi_{3}$')
ax1.xaxis.set_ticks(numpy.arange(TORMIN-1, TORMAX+1, 30))

plt.show()


In [None]:
from sklearn.mixture import GaussianMixture
n_clusters = 8

# fit the mixture model on all five torsion angles
tor_df = SS_df[['chi1', 'chi2', 'chi3', 'chi4', 'chi5']].copy()
gmm_model = GaussianMixture(n_components=n_clusters)
gmm_model.fit(tor_df)
cluster_labels = gmm_model.predict(tor_df)

# attach the cluster labels to a copy of the torsions
X = tor_df.copy()
X['cluster'] = cluster_labels
for k in range(n_clusters):
    data = X[X['cluster'] == k]
    plt.scatter(data['chi1'], data['chi5'], s=2)

plt.show()
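
The clustering cells in this notebook operate on raw torsion values in degrees, where +179° and -179° appear far apart even though they differ by only 2°. One common option, shown here as a sketch and not as the method used in this analysis, is to encode each angle as a (sin, cos) pair before clustering. The function name is an assumption made for this example, and the encoded rows could then be passed to a clustering model in place of the raw angles.

```python
import math


def encode_torsions(row):
    """Map each angle (degrees) to a (sin, cos) pair, flattened into one list."""
    encoded = []
    for angle in row:
        r = math.radians(angle)
        encoded.extend((math.sin(r), math.cos(r)))
    return encoded


# Two made-up angles on either side of the wrap point.
a, b = 179.0, -179.0

raw_gap = abs(a - b)
encoded_gap = math.dist(encode_torsions([a]), encode_torsions([b]))

print(f"raw difference: {raw_gap:.1f} degrees")
print(f"distance after encoding: {encoded_gap:.4f}")

# A full five-angle row becomes a ten-element feature vector.
print(len(encode_torsions([-60.0, -60.0, -85.0, -60.0, -60.0])))
```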



In [None]:
# takes over an hour for full dataset
from sklearn.cluster import SpectralClustering
n_clusters = 8

tor_df = SS_df[['chi1', 'chi2', 'chi3', 'chi4', 'chi5']].copy()
X = tor_df.copy()

scm_model = SpectralClustering(n_clusters=n_clusters, random_state=25,
                                n_neighbors=8, affinity='nearest_neighbors')
# takes 51 min with full dataset
X['cluster'] = scm_model.fit_predict(X[['chi1', 'chi5']])

fig, ax = plt.subplots()
ax.set(title='Spectral Clustering')
# plot the same two angles that were clustered; s sets the marker size
sns.scatterplot(x='chi1', y='chi5', data=X, hue='cluster', ax=ax, s=2)




In [None]:
# takes over an hour for full dataset
from sklearn.cluster import AffinityPropagation
n_clusters = 8

tor_df = SS_df[['chi1', 'chi2', 'chi3', 'chi4', 'chi5']].copy()
X = tor_df.copy()

aff_model = AffinityPropagation(max_iter=600, random_state=25)
# takes 51 min with full dataset
X['cluster'] = aff_model.fit_predict(X[['chi1', 'chi5']])

fig, ax = plt.subplots()
ax.set(title='Affinity Propagation')
sns.scatterplot(x='chi1', y='chi5', data=X, hue='cluster', ax=ax, s=2)


In [None]:
# takes over an hour for full dataset
from sklearn.cluster import AgglomerativeClustering
n_clusters = 4

tor_df = SS_df[['chi1', 'chi2', 'chi3', 'chi4', 'chi5']].copy()
X = tor_df.copy()

agg_model = AgglomerativeClustering(n_clusters=n_clusters)

X['cluster'] = agg_model.fit_predict(X[['chi1', 'chi5']])

fig, ax = plt.subplots()
ax.set(title='Agglomerative Clustering')
sns.scatterplot(x='chi1', y='chi5', data=X, hue='cluster', ax=ax, s=2)


In [None]:
labels = {'value': 'Chi3', 'variable': 'Variable'}

x = SS_df['chi3']
y = SS_df['energy']
df = SS_df[['chi3', 'energy']].copy()

energy_hist = numpy.histogram2d(x, y=y, bins=360)

fig = px.histogram(df, labels=labels)

fig.show()

In [None]:
cols = ['chi1', 'chi5']
px.histogram(SS_df[cols], labels=labels)

In [None]:
cols = ['chi2', 'chi4']
px.histogram(SS_df[cols], labels=labels)

In [None]:
px.histogram(SS_df['chi3'], labels=labels)

## Summary
Conformational analysis of 294,222 disulfide bonds in 36,362 proteins contained in the RCSB confirms the predominant conformational classes first described in my initial analysis:
- Left-Handed Spiral
- Right-Handed Hook
- Left-Handed Spiral
  

## Publications
* https://doi.org/10.1021/bi00368a023
* https://doi.org/10.1021/bi00368a024
* https://doi.org/10.1016/0092-8674(92)90140-8
* https://doi.org/10.2174/092986708783330566

In [None]:
ss1 = PDB_SS[0]
ss2 = PDB_SS[1]


p = pv.Plotter()
p = ss1.render(p)
p = ss2.render(p)

p.show()

In [None]:

from proteusPy.disulfide import render_ss
set_plot_theme('dark')

ss1 = PDB_SS[0]
ss2 = PDB_SS[1]

print(f'{ss1.print_compact()}')
print(f'{ss2.print_compact()}')


In [8]:
#pv.set_jupyter_backend('none')

ss1 = PDB_SS[0]
p = pv.Plotter()
render_ss(ss1, p)
p.show()




In [None]:
import vtk
def make_render_win():
    renderer = vtk.vtkRenderer()
    renderer.SetBackground(0,0,0)
    renwin = vtk.vtkRenderWindow()
    renwin.SetWindowName("Disulfide Window")
    renwin.SetSize(1024, 1024)
    renwin.AddRenderer(renderer)

    return renwin



In [None]:
win = make_render_win()
win