# Map S-sulphenylated peptide fragments onto 3D structures

The goal of this study is to systematically map protein S-sulphenylation sites onto 3D protein structures in the Protein Data Bank (PDB).

Data source:

Site-specific mapping and quantification of protein S-sulphenylation in cells.
Yang J, Gupta V, Carroll KS, Liebler DC. Nat Commun. 2014; 5:4776.
[DOI: 10.1038/ncomms5776](https://doi.org/10.1038/ncomms5776)

Excerpts from abstract:
Protein S-sulphenylation, the reversible oxidation of protein cysteinyl thiols to sulphenic acids (R-SOH), has emerged as a potential mechanism to regulate protein functions, signal transduction and effects of oxidative stress.
This study identified 1,000 S-sulphenylation sites on more than 700 proteins. Quantitative analysis of human cells stimulated with hydrogen peroxide or epidermal growth factor measured hundreds of site selective redox changes.

In [13]:
import warnings
warnings.filterwarnings("ignore")  # suppress warnings (possibly caused by a NumPy version mismatch)
from pyspark.sql import SparkSession
from mmtfPyspark.datasets import pdbToUniProt
import xlrd
import pandas as pd

In [14]:
spark = SparkSession.builder.master("local[4]").appName("S-Sulphenylation").getOrCreate()

### Read dataset (SulfenM) from the Supplementary Data 2 Excel file

In [15]:
df = pd.read_excel('https://media.nature.com/original/nature-assets/ncomms/2014/140901/ncomms5776/extref/ncomms5776-s3.xlsx', sheet_name='IDs from SulfenM')

### Standardize representation of protein modification
Modified residues are written as (amino acid, delta mass). The modified cysteine, marked `C#` in the source data, therefore becomes `(C,333)`.

In [16]:
%%time
df['modPeptide'] = df['Modified Peptide Sequence'].map(lambda s: s.replace('C#', '(C,333)'))

CPU times: user 1.43 ms, sys: 200 µs, total: 1.63 ms
Wall time: 1.52 ms
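
The notation change above can also be tried outside the DataFrame. The following is a minimal sketch, not part of the original notebook: the helper names `to_standard_notation` and `modified_offsets` are hypothetical, and it assumes that `#` always directly follows the residue it marks.

```python
from typing import List


def to_standard_notation(peptide: str, delta_mass: int = 333) -> str:
    """Rewrite every 'C#' marker as '(C,<delta_mass>)'."""
    return peptide.replace("C#", f"(C,{delta_mass})")


def modified_offsets(peptide: str) -> List[int]:
    """Return 1-based positions, within the unmodified peptide,
    of the residues that are followed by a '#' marker."""
    offsets = []
    residue_count = 0
    for ch in peptide:
        if ch == "#":
            # '#' follows the residue it marks, so the current count is its position
            offsets.append(residue_count)
        else:
            residue_count += 1
    return offsets


example = "SQEATEAAPSC#VGDMADTPR"  # first peptide in the dataset
standard = to_standard_notation(example)
offsets = modified_offsets(example)
print(standard, offsets)
```

The offset within the peptide is not used in the rest of this notebook; it is shown only to make explicit which residue the marker refers to.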


In [17]:
%%time
df['modPos'] = df['Modified Site'].map(lambda s: s.replace('C', ''))

CPU times: user 1.28 ms, sys: 104 µs, total: 1.39 ms
Wall time: 1.32 ms
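
Removing the letter `C` leaves the position as a string. As an alternative sketch, assuming that every `Modified Site` entry names exactly one residue in the form letter plus number (this is an assumption about the dataset, and `parse_site` is a hypothetical helper), a regular expression can validate the label and return the position as an integer:

```python
import re
from typing import Tuple

# One uppercase residue letter followed by the sequence position, e.g. 'C248'
SITE_PATTERN = re.compile(r"^([A-Z])(\d+)$")


def parse_site(site: str) -> Tuple[str, int]:
    """Split a site label such as 'C248' into ('C', 248).

    Raises ValueError for labels that do not match the expected form,
    instead of silently producing a malformed position.
    """
    match = SITE_PATTERN.match(site.strip())
    if match is None:
        raise ValueError(f"Unexpected site label: {site!r}")
    return match.group(1), int(match.group(2))


residue, position = parse_site("C248")
print(residue, position)
```

An integer position avoids relying on implicit string-to-number casting when the column is later compared with a numeric residue number.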


In [18]:
df.head()

| | ID | Gene Name | Protein Description | Modified Site | Modification | Peptide Sequence | Modified Peptide Sequence | modPeptide | modPos |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 41526 | septin 9 | septin-9 isoform a | C248 | 333@C248 | SQEATEAAPSCVGDMADTPR | SQEATEAAPSC#VGDMADTPR | SQEATEAAPS(C,333)VGDMADTPR | 248 |
| 1 | 41527 | septin 10 | septin-10 isoform 1 | C22 | 333@C22 | TTCMSSQGSDDEQIKR | TTC#MSSQGSDDEQIKR | TT(C,333)MSSQGSDDEQIKR | 22 |
| 2 | ABCB9 | ATP-binding cassette, sub-family B (MDR/TAP), ... | ATP-binding cassette sub-family B member 9 iso... | C548 | 333@C548 | SSCVNILENFYPLEGGR | SSC#VNILENFYPLEGGR | SS(C,333)VNILENFYPLEGGR | 548 |
| 3 | ABCE1 | ATP-binding cassette, sub-family E (OABP), mem... | ATP-binding cassette sub-family E member 1 | C38 | 333@C38 | LCIEVTPQSK | LC#IEVTPQSK | L(C,333)IEVTPQSK | 38 |
| 4 | ABCE1 | ATP-binding cassette, sub-family E (OABP), mem... | ATP-binding cassette sub-family E member 1 | C88 | 333@C88 | YCANAFK | YC#ANAFK | Y(C,333)ANAFK | 88 |
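
The first two `ID` values (41526 and 41527) are numeric, while the others look like gene symbols. One possible explanation, offered here as an assumption rather than a confirmed fact about the file, is that spreadsheet software converted symbols such as SEPT9 into dates and stored them as serial numbers. The sketch below decodes such serials under the assumption that the file uses the common 1900 date system; `excel_serial_to_date` is a hypothetical helper.

```python
from datetime import date, timedelta

# In the 1900 date system, serials for dates after February 1900
# count days from 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: int) -> date:
    """Convert a spreadsheet date serial to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)


for serial in (41526, 41527):
    print(serial, excel_serial_to_date(serial).isoformat())
```

If the decoded dates line up with the gene names in the same rows (septin 9, septin 10), the numeric IDs would need to be restored to their symbols before being used as keys.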


In [19]:
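df['ID'] = df['ID'].apply(str)  # ID mixes numbers and text; cast to str so Spark sees one type
df['modPose'] = df['modPos'].apply(str)  # string copy of modPos, stored under the name 'modPose'
ds = spark.createDataFrame(df)  # convert the pandas DataFrame to a Spark DataFrame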

In [20]:
ds.show(10)

+------+--------------------+--------------------+-------------+------------+--------------------+-------------------------+--------------------+------+-------+
|    ID|           Gene Name| Protein Description|Modified Site|Modification|    Peptide Sequence|Modified Peptide Sequence|          modPeptide|modPos|modPose|
+------+--------------------+--------------------+-------------+------------+--------------------+-------------------------+--------------------+------+-------+
| 41526|            septin 9| septin-9 isoform a |         C248|    333@C248|SQEATEAAPSCVGDMADTPR|     SQEATEAAPSC#VGDMA...|SQEATEAAPS(C,333)...|   248|    248|
| 41527|           septin 10|septin-10 isoform 1 |          C22|     333@C22|    TTCMSSQGSDDEQIKR|        TTC#MSSQGSDDEQIKR|TT(C,333)MSSQGSDD...|    22|     22|
| ABCB9|ATP-binding casse...|ATP-binding casse...|         C548|    333@C548|   SSCVNILENFYPLEGGR|       SSC#VNILENFYPLEGGR|SS(C,333)VNILENFY...|   548|    548|
| ABCE1|ATP-binding casse...|ATP-b

## Get PDB to UniProt Residue Mappings

Download PDB to UniProt mappings and filter out residues that were not observed in the 3D structure.

In [11]:
up = pdbToUniProt.get_cached_residue_mappings().filter("pdbResNum IS NOT NULL")
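
To illustrate what the `pdbResNum IS NOT NULL` filter keeps, here is a plain-Python sketch on a few made-up rows. The column names follow the mapping table used in this notebook; all values are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping rows; pdbResNum is None when the residue
# has no coordinates in the 3D structure.
rows = [
    {"structureChainId": "1ABC.A", "pdbResNum": "10", "uniprotId": "P00000", "uniprotNum": 12},
    {"structureChainId": "1ABC.A", "pdbResNum": None, "uniprotId": "P00000", "uniprotNum": 13},
    {"structureChainId": "1ABC.A", "pdbResNum": "12", "uniprotId": "P00000", "uniprotNum": 14},
]

# Equivalent of filter("pdbResNum IS NOT NULL"): keep observed residues only
observed = [row for row in rows if row["pdbResNum"] is not None]

for row in observed:
    print(row["structureChainId"], row["pdbResNum"], row["uniprotNum"])
```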

In [21]:
st = up.join(ds, (up.uniprotId == ds.ID) & (up.uniprotNum == ds.modPos))

In [22]:
st.show()

+----------------+---------+---------+---------+----------+---+---------+-------------------+-------------+------------+----------------+-------------------------+----------+------+-------+
|structureChainId|pdbResNum|pdbSeqNum|uniprotId|uniprotNum| ID|Gene Name|Protein Description|Modified Site|Modification|Peptide Sequence|Modified Peptide Sequence|modPeptide|modPos|modPose|
+----------------+---------+---------+---------+----------+---+---------+-------------------+-------------+------------+----------------+-------------------------+----------+------+-------+
+----------------+---------+---------+---------+----------+---+---------+-------------------+-------------+------------+----------------+-------------------------+----------+------+-------+
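
The join above returns no rows. One possible cause, stated as an assumption and not a confirmed diagnosis, is that the `ID` column holds gene symbols (for example ABCB9) while `uniprotId` holds UniProt accessions. The sketch below checks the `ID` values shown earlier against the published UniProt accession format; `looks_like_uniprot_accession` is a hypothetical helper, and P04637 serves only as an example of a well-formed accession.

```python
import re

# UniProt accession format: 6 or 10 characters, e.g. P04637
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def looks_like_uniprot_accession(identifier: str) -> bool:
    """Return True if the identifier matches the UniProt accession pattern."""
    return UNIPROT_ACCESSION.match(identifier) is not None


# ID values taken from the first rows of the dataset shown earlier
ids = ["41526", "41527", "ABCB9", "ABCE1"]
matches = {identifier: looks_like_uniprot_accession(identifier) for identifier in ids}
print(matches)
print("P04637", looks_like_uniprot_accession("P04637"))
```

If none of the IDs match, the gene symbols would first have to be mapped to UniProt accessions before joining on `uniprotId`.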

