# Cysteine Oxidative PTMs in 3D Protein Structures

Under oxidative stress, cysteine residues can undergo oxidative post-translational modifications (PTMs).

This notebook retrieves oxidized forms of L-cysteine in the PDB using the following terms from the [PSI-MOD Ontology](https://www.ebi.ac.uk/ols/ontologies/mod):
* [MOD:00210](https://www.ebi.ac.uk/ols/ontologies/mod/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMOD_00210) - oxidation to L-cysteine sulfenic acid (RSOH)
* [MOD:00267](https://www.ebi.ac.uk/ols/ontologies/mod/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMOD_00267) - oxidation to L-cysteine sulfinic acid (RSO2H)
* [MOD:00460](https://www.ebi.ac.uk/ols/ontologies/mod/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMOD_00460) - oxidation to L-cysteine sulfonic acid (RSO3H)


PTMs in the PDB are identified with BioJava-ModFinder:

BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank. Gao J, Prlić A, Bi C, Bluhm WF, Dimitropoulos D, Xu D, Bourne PE, Rose PW. Bioinformatics 2017, 33: 2047–2049. [doi: 10.1093/bioinformatics/btx101](https://doi.org/10.1093/bioinformatics/btx101)

In [1]:
import pandas as pd
from ipywidgets import interact, IntSlider
import py3Dmol
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, collect_list, col, concat_ws
from mmtfPyspark.datasets import pdbToUniProt, pdbPtmDataset

Retrieve the three oxidized forms of L-cysteine listed above (MOD:00210, MOD:00267, and MOD:00460) by filtering the PTM dataset on their PSI-MOD identifiers.

## Summary of oxidized Cysteines in the PDB

In [2]:
# get PTM data
spark = SparkSession.builder.appName("CysOxidationInPdb").getOrCreate()
pt = pdbPtmDataset.get_ptm_dataset()
pt = pt.filter("psimodId = 'MOD:00210' OR psimodId = 'MOD:00267' OR psimodId = 'MOD:00460'")
print("Total number of oxidized cysteines in PDB: ", pt.count())
print("   L-cysteine sulfenic acid (RSOH) :", pt.filter("psimodId = 'MOD:00210'").count())
print("   L-cysteine sulfinic acid (RSO2H):", pt.filter("psimodId = 'MOD:00267'").count())
print("   L-cysteine sulfonic acid (RSO3H):", pt.filter("psimodId = 'MOD:00460'").count())                                                 

Total number of oxidized cysteines in PDB:  4046
   L-cysteine sulfenic acid (RSOH) : 1817
   L-cysteine sulfinic acid (RSO2H): 1710
   L-cysteine sulfonic acid (RSO3H): 519
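Each `count()` call above is a separate Spark action. As a minimal sketch of how the same tally could be produced in one pass, the plain-Python example below counts a few sample rows by PSI-MOD Id and builds an equivalent `IN` filter expression. The `PSIMOD_NAMES` dictionary, both helper functions, and the sample rows are illustrative assumptions and not part of mmtfPyspark; in Spark, a comparable tally would presumably come from `pt.groupBy("psimodId").count()`.

```python
from collections import Counter

# PSI-MOD identifiers used in this notebook, with the names listed above
PSIMOD_NAMES = {
    "MOD:00210": "L-cysteine sulfenic acid (RSOH)",
    "MOD:00267": "L-cysteine sulfinic acid (RSO2H)",
    "MOD:00460": "L-cysteine sulfonic acid (RSO3H)",
}


def build_filter(psimod_ids):
    """Build a SQL-style expression equivalent to the chained OR conditions."""
    quoted = ", ".join("'{}'".format(mod_id) for mod_id in psimod_ids)
    return "psimodId IN ({})".format(quoted)


def count_by_modification(rows):
    """Tally rows per PSI-MOD Id in one pass, skipping Ids that are not of interest."""
    return Counter(r["psimodId"] for r in rows if r["psimodId"] in PSIMOD_NAMES)


# Sample rows: the first three mirror entries shown in this notebook,
# the last one uses a placeholder Id (hypothetical) that should be ignored.
sample_rows = [
    {"pdbChainId": "1ACD.A", "pdbResNum": "117", "psimodId": "MOD:00267"},
    {"pdbChainId": "1BI5.A", "pdbResNum": "164", "psimodId": "MOD:00267"},
    {"pdbChainId": "1MJB.A", "pdbResNum": "304", "psimodId": "MOD:00210"},
    {"pdbChainId": "XXXX.A", "pdbResNum": "1", "psimodId": "MOD:99999"},
]

filter_expr = build_filter(PSIMOD_NAMES)
counts = count_by_modification(sample_rows)

print(filter_expr)
for mod_id, name in PSIMOD_NAMES.items():
    print("   {}: {}".format(name, counts[mod_id]))
```

Keeping the identifiers in one dictionary means the filter and the printed summary cannot drift apart when a term is added or removed.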


## Table of oxidized Cysteines in the PDB

In [3]:
pd.options.display.max_rows = None # show all rows
pt.limit(5).toPandas()

| | pdbChainId | pdbResNum | residue | psimodId | residId | ccId | category | modificationId |
|---|---|---|---|---|---|---|---|---|
| 0 | 1ACD.A | 117 | CSD | MOD:00267 | AA0262 | CSD | modified residue | 119 |
| 1 | 1ACD.A | 117 | CSD | MOD:00267 | AA0262 | CSD | modified residue | 120 |
| 2 | 1BI5.A | 164 | CSD | MOD:00267 | AA0262 | CSD | modified residue | 119 |
| 3 | 1BI5.A | 164 | CSD | MOD:00267 | AA0262 | CSD | modified residue | 120 |
| 4 | 1BQ6.A | 164 | CSD | MOD:00267 | AA0262 | CSD | modified residue | 119 |


Download PDB to UniProt mappings and filter out residues that were not observed in the 3D structure.

In [4]:
up = pdbToUniProt.get_cached_residue_mappings().filter("pdbResNum IS NOT NULL")

Join the PTM data with the UniProt mappings wherever the PDB chain Id and the PDB residue number match.

In [5]:
# join datasets
pt = pt.withColumnRenamed("pdbResNum", "resNum") # avoid two columns with identical names
pt = pt.withColumnRenamed("psimodId", "ptms")
st = up.join(pt, (up.pdbResNum == pt.resNum) & (up.structureChainId == pt.pdbChainId))
st = st.sort(st.uniprotId, st.uniprotNum)
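The join above pairs each PTM record with the UniProt mapping that has the same chain Id and PDB residue number, and drops everything that has no partner. A minimal plain-Python sketch of that inner join follows; the `inner_join` helper is an assumption made for this example, and the second mapping row is hypothetical.

```python
def inner_join(mappings, ptms):
    """Inner-join mapping rows with PTM rows on (chain Id, PDB residue number)."""
    # Index the PTM rows by the join key so each mapping row is matched in O(1).
    by_key = {}
    for p in ptms:
        by_key.setdefault((p["pdbChainId"], p["resNum"]), []).append(p)

    joined = []
    for m in mappings:
        key = (m["structureChainId"], m["pdbResNum"])
        for p in by_key.get(key, []):
            joined.append({**m, **p})  # combine the columns of both rows
    return joined


mappings = [
    {"structureChainId": "1MJB.A", "pdbResNum": "304", "uniprotId": "Q08649", "uniprotNum": 304},
    # hypothetical unmodified neighbor: it has no PTM record and is dropped by the join
    {"structureChainId": "1MJB.A", "pdbResNum": "305", "uniprotId": "Q08649", "uniprotNum": 305},
]
ptms = [
    {"pdbChainId": "1MJB.A", "resNum": "304", "ptms": "MOD:00210"},
]

joined = inner_join(mappings, ptms)
for row in joined:
    print(row["structureChainId"], row["pdbResNum"], row["uniprotId"], row["uniprotNum"], row["ptms"])
```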

## Aggregate PTM data on a per residue and per chain basis

In [6]:
# Step 1: collect the distinct modifications of each residue into one comma-separated string
st = st.groupBy("structureChainId","pdbResNum","uniprotId","uniprotNum").agg(collect_set("ptms").alias("ptms"))
st = st.withColumn("ptms", concat_ws(",", col("ptms")))
# Step 2: collect the modified residues of each chain into parallel lists
st = st.groupBy("structureChainId","uniprotId").agg(
    collect_list("ptms").alias("ptms"),
    collect_list("pdbResNum").alias("pdbResNum"),
    collect_list("uniprotNum").alias("uniprotNum"))
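The aggregation runs in two steps: the distinct modifications of each residue are first collapsed into one comma-separated string, and the residues of each chain are then gathered into parallel lists. The plain-Python sketch below follows the same idea on hypothetical rows (the chain `XXXX.A`, the UniProt Id `P00000`, and the `aggregate` helper are placeholders for illustration). Residues and modifications are sorted here to make the result deterministic, which Spark's `collect_set` and `collect_list` do not guarantee.

```python
from collections import defaultdict


def aggregate(rows):
    """Aggregate PTM rows per residue, then per chain."""
    # Step 1: distinct modifications per residue (duplicates collapse in the set)
    per_residue = defaultdict(set)
    for r in rows:
        key = (r["structureChainId"], r["uniprotId"], r["pdbResNum"], r["uniprotNum"])
        per_residue[key].add(r["ptms"])

    # Step 2: parallel lists of labels and residue numbers per chain
    per_chain = {}
    for key in sorted(per_residue):
        chain, uniprot_id, pdb_num, uniprot_num = key
        entry = per_chain.setdefault(
            (chain, uniprot_id), {"ptms": [], "pdbResNum": [], "uniprotNum": []}
        )
        entry["ptms"].append(",".join(sorted(per_residue[key])))
        entry["pdbResNum"].append(pdb_num)
        entry["uniprotNum"].append(uniprot_num)
    return per_chain


# Hypothetical rows: residue 10 is reported twice with the same modification
rows = [
    {"structureChainId": "XXXX.A", "uniprotId": "P00000", "pdbResNum": "10", "uniprotNum": 12, "ptms": "MOD:00267"},
    {"structureChainId": "XXXX.A", "uniprotId": "P00000", "pdbResNum": "10", "uniprotNum": 12, "ptms": "MOD:00267"},
    {"structureChainId": "XXXX.A", "uniprotId": "P00000", "pdbResNum": "25", "uniprotNum": 27, "ptms": "MOD:00210"},
]

chains = aggregate(rows)
for (chain, uniprot_id), entry in chains.items():
    print(chain, uniprot_id, entry)
```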

Keep only a single structural representative for each unique combination of UniProt Id and modified UniProt positions.

In [7]:
st = st.drop_duplicates(["uniprotId","uniprotNum"])
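Because `uniprotNum` holds a list of positions per chain at this point, two chains count as duplicates only when they share the UniProt Id and the exact same list of modified positions. The sketch below illustrates that rule in plain Python by keeping the first chain encountered for each key; Spark does not promise which duplicate it keeps, so the "first wins" choice, the `keep_first_representative` helper, and the placeholder rows are assumptions for illustration.

```python
def keep_first_representative(chains):
    """Keep one chain per (UniProt Id, modified positions) combination."""
    seen = set()
    representatives = []
    for c in chains:
        # lists are not hashable, so the positions are converted to a tuple
        key = (c["uniprotId"], tuple(c["uniprotNum"]))
        if key not in seen:
            seen.add(key)
            representatives.append(c)
    return representatives


# Hypothetical chains: A and B carry the same modified positions, C differs
chains = [
    {"structureChainId": "XXXX.A", "uniprotId": "P00000", "uniprotNum": [12, 27]},
    {"structureChainId": "XXXX.B", "uniprotId": "P00000", "uniprotNum": [12, 27]},
    {"structureChainId": "YYYY.A", "uniprotId": "P00000", "uniprotNum": [12]},
]

representatives = keep_first_representative(chains)
print([c["structureChainId"] for c in representatives])
```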

## Show Table with PDB mappings

PDB residue numbers do not always match UniProt residue numbers. The table below shows the mapping for each protein chain.

In [8]:
# convert Spark dataframe back to a Pandas dataframe
sp = st.toPandas()
sp.head()

| | structureChainId | uniprotId | ptms | pdbResNum | uniprotNum |
|---|---|---|---|---|---|
| 0 | 1MJB.A | Q08649 | [MOD:00210] | [304] | [304] |
| 1 | 4YKG.A | P35340 | [MOD:00267] | [345] | [345] |
| 2 | 2ID4.A | P13134 | [MOD:00210] | [190] | [190] |
| 3 | 2VH3.B | P85511 | [MOD:00210] | [65] | [65] |
| 4 | 4IZJ.D | P89521 | [MOD:00210] | [669] | [669] |
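In the five rows shown, the PDB and UniProt residue numbers happen to coincide. To find chains where the two numbering schemes differ, one could compare the two lists element-wise. The sketch below assumes the column layout of the table above; the `numbering_differs` helper and the second sample row are hypothetical. Residue numbers are compared as strings because PDB residue numbers can carry insertion codes.

```python
def numbering_differs(row):
    """Return True if any PDB residue number differs from its UniProt position."""
    return any(str(p) != str(u) for p, u in zip(row["pdbResNum"], row["uniprotNum"]))


rows = [
    # taken from the table above: both numbering schemes agree
    {"structureChainId": "1MJB.A", "pdbResNum": ["304"], "uniprotNum": [304]},
    # hypothetical chain whose PDB numbering includes an insertion code
    {"structureChainId": "XXXX.A", "pdbResNum": ["52A"], "uniprotNum": [60]},
]

mismatched = [r["structureChainId"] for r in rows if numbering_differs(r)]
print(mismatched)
```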


In [9]:
def view_modifications(df, cutoff_distance, *args):

    def view3d(show_labels=True,show_bio_assembly=False, show_surface=False, i=0):
        pdb_id, chain_id = df.iloc[i]['structureChainId'].split('.')
        res_num = df.iloc[i]['pdbResNum']
        labels = df.iloc[i]['ptms']
        
        # print header
        print ("PDB Id: " + pdb_id + " chain Id: " + chain_id)
        
        # print any specified additional columns from the dataframe
        for a in args:
            print(a + ": " + str(df.iloc[i][a]))
        
        mod_res = {'chain': chain_id, 'resi': res_num}  
        
        # select neighboring residues by distance
        surroundings = {'chain': chain_id, 'resi': res_num, 'byres': True, 'expand': cutoff_distance}
        
        viewer = py3Dmol.view(query='pdb:' + pdb_id, options={'doAssembly': show_bio_assembly})
    
        # polymer style
        viewer.setStyle({'cartoon': {'color': 'spectrum', 'width': 0.6, 'opacity':0.8}})
        # non-polymer style
        viewer.setStyle({'hetflag': True}, {'stick':{'radius': 0.3, 'singleBond': False}})
        
        # style for modifications
        viewer.addStyle(surroundings,{'stick':{'colorscheme':'orangeCarbon', 'radius': 0.15}})
        viewer.addStyle(mod_res, {'stick':{'colorscheme':'redCarbon', 'radius': 0.4}})
        viewer.addStyle(mod_res, {'sphere':{'colorscheme':'gray', 'opacity': 0.7}})
        
        # set residue labels    
        if show_labels:
            for residue, label in zip(res_num, labels):
                viewer.addLabel(str(residue) + ": " + label,
                                {'fontColor':'black', 'fontSize': 9, 'backgroundColor': 'lightgray'},
                                {'chain': chain_id, 'resi': residue})

        viewer.zoomTo(surroundings)
        
        if show_surface:
            viewer.addSurface(py3Dmol.SES,{'opacity':0.8,'color':'lightblue'})

        return viewer.show()
       
    s_widget = IntSlider(min=0, max=len(df)-1, description='Structure', continuous_update=False)
    
    return interact(view3d, show_labels=True, show_bio_assembly=False, show_surface=False, i=s_widget)

## Visualize Results
Residues with reported modifications are shown in an all-atom representation as red sticks with transparent spheres. Each modified residue is labeled with its PDB residue number and the type of modification. Residues surrounding a modified residue (within 6 Å) are highlighted as orange sticks. Small molecules within the structure are rendered as gray sticks.

* Move slider to browse through the results
* To rotate the structure, hold down the left mouse button and move the mouse.
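The highlighting relies on atom selections that are passed to the viewer as plain dictionaries. The sketch below rebuilds the two selections and the label texts used by `view_modifications` without opening a viewer, so the selection logic can be inspected on its own. The `build_selections` helper is an illustrative assumption and not part of py3Dmol.

```python
def build_selections(chain_id, res_nums, labels, cutoff_distance):
    """Build the selection dictionaries and label texts for one chain."""
    # the modified residues themselves
    mod_res = {"chain": chain_id, "resi": list(res_nums)}
    # whole residues with any atom within cutoff_distance of a modified residue
    surroundings = dict(mod_res, byres=True, expand=cutoff_distance)
    # one label per modified residue: "<PDB residue number>: <modification>"
    label_texts = ["{}: {}".format(r, l) for r, l in zip(res_nums, labels)]
    return mod_res, surroundings, label_texts


# values taken from the first row of the mapping table
mod_res, surroundings, label_texts = build_selections("A", ["304"], ["MOD:00210"], 6)
print(mod_res)
print(surroundings)
print(label_texts)
```

Separating the selection from the rendering makes it easy to check that the cutoff and the residue list are what is intended before a structure is downloaded.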

In [10]:
view_modifications(sp, 6, 'uniprotId');

PDB Id: 1MJB chain Id: A
uniprotId: Q08649


## List the UniProt Ids of proteins with Cysteine oxidations in the PDB

In [11]:
rs = sp[['uniprotId']].drop_duplicates()
rs.head()

| | uniprotId |
|---|---|
| 0 | Q08649 |
| 1 | P35340 |
| 2 | P13134 |
| 3 | P85511 |
| 4 | P89521 |


In [12]:
spark.stop()