## Browse Post-translational Modifications from dbPTM mapped onto 3D Structure

Post-translational modifications (PTMs) modulate protein function. Mapping the locations of modified amino acid residues onto 3D protein structures can provide insight into how PTMs exert their effects.

This notebook retrieves about 30 types of PTMs (~900,000 residues) from [dbPTM](http://dbptm.mbc.nctu.edu.tw/) and maps them to 3D protein structures from the [Protein Data Bank](https://www.wwpdb.org/).

To visualize the results, run the notebook all the way past the spark.stop command.

The dataset used in this notebook is a compressed and indexed version of the data from:

dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins.
Huang KY, Su MG, Kao HJ, Hsieh YC, Jhong JH, Cheng KH, Huang HD, Lee TY.
Nucleic Acids Res. (2016) 44(D1):D435-46. [doi: 10.1093/nar/gkv1240](https://doi.org/10.1093/nar/gkv1240).

In [1]:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import collect_set, collect_list, col, concat_ws
from ipywidgets import interact, IntSlider, widgets
from IPython.display import display
from mmtfPyspark.datasets import pdbToUniProt, dbPtmDataset
import py3Dmol

In [2]:
spark = SparkSession.builder.appName("BrowseDbPTM").getOrCreate()

## Read dbPTM dataset

In [3]:
db_ptm = dbPtmDataset.get_ptm_dataset()
print("Total number of PTMs:", db_ptm.count())
db_ptm.limit(5).toPandas()

Total number of PTMs: 906354


| | uniProtName | uniProtId | uniProtSeqNum | ptmType | pubMedIds | sequenceSegment |
|---|---|---|---|---|---|---|
| 0 | 14310_ARATH | P48347 | 209 | Phosphorylation | [23328941, 23572148] | AFDDAIAELDSLNEESYKDST |
| 1 | 14310_ARATH | P48347 | 233 | Phosphorylation | [23572148] | QLLRDNLTLWTSDLNEEGDER |
| 2 | 14310_ARATH | P48347 | 234 | Phosphorylation | [18463617] | LLRDNLTLWTSDLNEEGDERT |
| 3 | 14310_ARATH | P48347 | 244 | Phosphorylation | [23572148, 20466843, 20733066, 24243849, 19880... | SDLNEEGDERTKGADEPQDEN |
| 4 | 14312_ARATH | Q9C5W6 | 41 | Phosphorylation | [22631563, 24243849, 19880383] | ETMKKVARVNSELTVEERNLL |


#### Create a unique list of all PTM types and an "All" type to represent all PTM types

In [4]:
ptm_types = db_ptm.select('ptmType').distinct().sort('ptmType').toPandas()['ptmType'].tolist()
ptm_types = ['All'] + ptm_types

## Select PTM Type
The default is set to N-linked Glycosylation. To use a different PTM type, rerun this notebook from the top!
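The way the selected value is used can be sketched in plain Python, without Spark or widgets. This is only an illustration: the rows and the helper names `list_ptm_types` and `select_ptms` are hypothetical and not part of the notebook or of mmtfPyspark.

```python
# Toy PTM records (hypothetical values, not taken from dbPTM).
ptm_rows = [
    {"uniProtId": "UP_A", "uniProtSeqNum": 209, "ptmType": "Phosphorylation"},
    {"uniProtId": "UP_B", "uniProtSeqNum": 90, "ptmType": "N-linked Glycosylation"},
    {"uniProtId": "UP_B", "uniProtSeqNum": 119, "ptmType": "N-linked Glycosylation"},
]

def list_ptm_types(rows):
    """Return the sorted distinct PTM types, preceded by the catch-all 'All'."""
    return ["All"] + sorted({row["ptmType"] for row in rows})

def select_ptms(rows, ptm_type):
    """Return every row for 'All', otherwise only the rows of the given type."""
    if ptm_type == "All":
        return list(rows)
    return [row for row in rows if row["ptmType"] == ptm_type]

options = list_ptm_types(ptm_rows)
selected = select_ptms(ptm_rows, "N-linked Glycosylation")
print(options)
print(len(selected))
```

The 'All' entry is not a PTM type in the data; it is a sentinel that skips the filter entirely.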

In [5]:
field = widgets.Dropdown(options=ptm_types,description='Select PTM:',value='N-linked Glycosylation')

In [6]:
display(field)

Dropdown(description='Select PTM:', index=15, options=('All', 'Acetylation', 'Amidation', 'C-linked Glycosylat…

In [7]:
if field.value == 'All':
    df = db_ptm
else:
    query = "ptmType = '" + field.value + "'"
    print("query:", query)
    df = db_ptm.filter(query)
    print("Number of PTMs that match query:", df.count())

query: ptmType = 'N-linked Glycosylation'
Number of PTMs that match query: 7915


## Get PDB to UniProt Residue Mappings

Download PDB to UniProt mappings and filter out residues that were not observed in the 3D structure.
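As a rough illustration of what the null filter does, the following plain-Python sketch drops mapping rows that have no PDB residue number. The rows are made up for the example, and `None` stands in for a SQL NULL.

```python
# Toy residue mappings (hypothetical values). A pdbResNum of None marks a
# residue that is part of the sequence but has no observed position in the
# 3D structure.
toy_mappings = [
    {"structureChainId": "XXXX.A", "pdbResNum": "34", "uniprotId": "UP_A", "uniprotNum": 35},
    {"structureChainId": "XXXX.A", "pdbResNum": None, "uniprotId": "UP_A", "uniprotNum": 36},
    {"structureChainId": "XXXX.A", "pdbResNum": "36", "uniprotId": "UP_A", "uniprotNum": 37},
]

# Equivalent in spirit to the Spark filter "pdbResNum IS NOT NULL".
observed = [m for m in toy_mappings if m["pdbResNum"] is not None]
print(len(observed))
```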

In [8]:
up = pdbToUniProt.get_cached_residue_mappings().filter("pdbResNum IS NOT NULL")

Show some sample data

In [9]:
mappings = up.count()
print("Mappings:", mappings)
up.sample(False, 5/mappings).toPandas().head()

Mappings: 98498569


| | structureChainId | pdbResNum | pdbSeqNum | uniprotId | uniprotNum |
|---|---|---|---|---|---|
| 0 | 1R8Y.A | 34 | 34 | Q9QXF8 | 35 |
| 1 | 5ON6.t | 109 | 108 | Q12690 | 109 |
| 2 | 4HYH.A | 115 | 115 | O14757 | 115 |
| 3 | 2FS2.B | 92 | 93 | P76084 | 92 |
| 4 | 2WRA.A | 49 | 49 | B4EH87 | 50 |


Find the intersection between the PTM dataset and PDB to UniProt mappings
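The intersection is an inner join on two keys: the UniProt ID and the residue position in the UniProt sequence. A minimal plain-Python sketch of that idea, using made-up rows and a hypothetical helper `join_ptms_to_structures`, might look like this:

```python
def join_ptms_to_structures(mappings, ptms):
    """Inner-join residue mappings with PTM records on (UniProt ID, position)."""
    # Index the PTM records by their site so each mapping is a dictionary lookup.
    ptms_by_site = {}
    for ptm in ptms:
        key = (ptm["uniProtId"], ptm["uniProtSeqNum"])
        ptms_by_site.setdefault(key, []).append(ptm)

    joined = []
    for m in mappings:
        key = (m["uniprotId"], m["uniprotNum"])
        # Mappings without a matching PTM are dropped, as in an inner join.
        for ptm in ptms_by_site.get(key, []):
            joined.append({**m, **ptm})
    return joined

# Toy data (hypothetical values).
toy_mappings = [
    {"structureChainId": "XXXX.A", "pdbResNum": "34", "uniprotId": "UP_A", "uniprotNum": 35},
    {"structureChainId": "XXXX.B", "pdbResNum": "34", "uniprotId": "UP_A", "uniprotNum": 35},
    {"structureChainId": "XXXX.A", "pdbResNum": "40", "uniprotId": "UP_A", "uniprotNum": 41},
]
toy_ptms = [
    {"uniProtId": "UP_A", "uniProtSeqNum": 35, "ptmType": "Phosphorylation"},
    {"uniProtId": "UP_B", "uniProtSeqNum": 35, "ptmType": "Acetylation"},
]

joined = join_ptms_to_structures(toy_mappings, toy_ptms)
print(len(joined))
```

Because one UniProt site can map to several chains and structures, a single PTM record can produce many joined rows.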

In [10]:
up = up.withColumnRenamed("uniprotId","unpId")
st = up.join(df, (up.unpId == df.uniProtId) & (up.uniprotNum == df.uniProtSeqNum)).drop("unpId")

## Aggregate PTM data on a per residue and per chain basis
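The aggregation runs in two stages: first the distinct PTM types are collected for each residue and joined into one comma-separated string, then the residue numbers and PTM strings are collected into parallel lists for each chain. The plain-Python sketch below illustrates those two stages on made-up rows. The helper `aggregate_ptms` is hypothetical, and it sorts the PTM types for reproducibility, which the Spark `collect_set` call does not guarantee.

```python
from collections import defaultdict

def aggregate_ptms(rows):
    """Group PTM types per residue, then residues and PTM strings per chain."""
    # Stage 1: one set of PTM types per (chain, residue, protein).
    per_residue = defaultdict(set)
    for r in rows:
        key = (r["structureChainId"], r["pdbResNum"], r["uniProtId"], r["uniProtName"])
        per_residue[key].add(r["ptmType"])

    # Stage 2: parallel lists of residue numbers and PTM strings per chain.
    per_chain = defaultdict(lambda: {"pdbResNum": [], "ptms": []})
    for (chain, res_num, unp_id, unp_name), types in per_residue.items():
        entry = per_chain[(chain, unp_id, unp_name)]
        entry["pdbResNum"].append(res_num)
        entry["ptms"].append(",".join(sorted(types)))
    return dict(per_chain)

# Toy joined rows (hypothetical values), including one duplicate record.
toy_rows = [
    {"structureChainId": "XXXX.A", "pdbResNum": "90", "uniProtId": "UP_A",
     "uniProtName": "NAME_A", "ptmType": "Phosphorylation"},
    {"structureChainId": "XXXX.A", "pdbResNum": "90", "uniProtId": "UP_A",
     "uniProtName": "NAME_A", "ptmType": "Acetylation"},
    {"structureChainId": "XXXX.A", "pdbResNum": "90", "uniProtId": "UP_A",
     "uniProtName": "NAME_A", "ptmType": "Phosphorylation"},
    {"structureChainId": "XXXX.A", "pdbResNum": "119", "uniProtId": "UP_A",
     "uniProtName": "NAME_A", "ptmType": "Phosphorylation"},
]

aggregated = aggregate_ptms(toy_rows)
print(aggregated)
```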

In [11]:
st = st.groupBy("structureChainId","pdbResNum","uniProtId","uniProtName").agg(collect_set("ptmType").alias("ptms"))
st = st.withColumn("ptms", concat_ws(",", col("ptms")))
st = st.groupBy("structureChainId","uniProtId","uniProtName").agg(collect_list("pdbResNum").alias("pdbResNum"), collect_list("ptms").alias("ptms"))

Convert aggregated data to Pandas and display some results

In [12]:
pst = st.toPandas()
pst.head()

| | structureChainId | uniProtId | uniProtName | pdbResNum | ptms |
|---|---|---|---|---|---|
| 0 | 4WJL.B | Q8N608 | DPP10_HUMAN | [748, 90, 119, 342, 257, 111] | [N-linked Glycosylation, N-linked Glycosylatio... |
| 1 | 1XKU.A | P21793 | PGS2_BOVIN | [274, 182, 233] | [N-linked Glycosylation, N-linked Glycosylatio... |
| 2 | 2Q7N.B | P15018 | LIF_HUMAN | [34, 116, 96, 73, 9, 63] | [N-linked Glycosylation, N-linked Glycosylatio... |
| 3 | 1E4M.M | P29736 | MYRA_SINAL | [482, 361, 346, 244, 60, 292, 343, 90, 265, 21... | [N-linked Glycosylation, N-linked Glycosylatio... |
| 4 | 3Q7D.A | Q05769 | PGH2_MOUSE | [68, 144, 410] | [N-linked Glycosylation, N-linked Glycosylatio... |


Set up custom visualization

In [13]:
def view_modifications(df, cutoff_distance, *args):

    def view3d(show_labels=True,show_bio_assembly=False, show_surface=False, i=0):
        pdb_id, chain_id = df.iloc[i]['structureChainId'].split('.')
        res_num = df.iloc[i]['pdbResNum']
        labels = df.iloc[i]['ptms']
        
        # print header
        print ("PDB Id: " + pdb_id + " chain Id: " + chain_id)
        
        # print any specified additional columns from the dataframe
        for a in args:
            print(a + ": " + str(df.iloc[i][a]))
        
        mod_res = {'chain': chain_id, 'resi': res_num}  
        
        # select neighboring residues by distance
        surroundings = {'chain': chain_id, 'resi': res_num, 'byres': True, 'expand': cutoff_distance}
        
        viewer = py3Dmol.view(query='pdb:' + pdb_id, options={'doAssembly': show_bio_assembly})
    
        # polymer style
        viewer.setStyle({'cartoon': {'color': 'spectrum', 'width': 0.6, 'opacity':0.8}})
        # non-polymer style
        viewer.setStyle({'hetflag': True}, {'stick':{'radius': 0.3, 'singleBond': False}})
        
        # style for modifications
        viewer.addStyle(surroundings,{'stick':{'colorscheme':'orangeCarbon', 'radius': 0.15}})
        viewer.addStyle(mod_res, {'stick':{'colorscheme':'redCarbon', 'radius': 0.4}})
        viewer.addStyle(mod_res, {'sphere':{'colorscheme':'gray', 'opacity': 0.7}})
        
        # set residue labels    
        if show_labels:
            for residue, label in zip(res_num, labels):
                viewer.addLabel(residue + ": " + label,
                                {'fontColor':'black', 'fontSize': 8, 'backgroundColor': 'lightgray'},
                                {'chain': chain_id, 'resi': residue})

        viewer.zoomTo(surroundings)
        
        if show_surface:
            viewer.addSurface(py3Dmol.SES,{'opacity':0.8,'color':'lightblue'})

        return viewer.show()
       
    s_widget = IntSlider(min=0, max=len(df)-1, description='Structure', continuous_update=False)
    
    return interact(view3d, show_labels=True, show_bio_assembly=False, show_surface=False, i=s_widget)

## Visualize Results
Residues with reported modifications are shown in an all-atom representation as red sticks with transparent spheres. Each modified residue position is labeled with the PDB residue number and the type of modification. Residues surrounding a modified residue (within 6 Å) are rendered as thin orange sticks. Small molecules within the structure are rendered as gray sticks.
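The selections and labels behind this rendering can be sketched without a viewer. The dictionary keys (`chain`, `resi`, `byres`, `expand`) are the ones used in the visualization cell above; the helper names `build_selections` and `build_labels` and the sample values are hypothetical.

```python
def build_selections(chain_id, res_nums, cutoff_distance):
    """Return the selection for modified residues and for their surroundings."""
    modified = {"chain": chain_id, "resi": res_nums}
    # Expand the selection by the cutoff distance and keep whole residues.
    surroundings = {**modified, "byres": True, "expand": cutoff_distance}
    return modified, surroundings

def build_labels(res_nums, ptms):
    """Pair each residue number with its PTM string, e.g. '90: Phosphorylation'."""
    return [f"{res}: {ptm}" for res, ptm in zip(res_nums, ptms)]

# Toy values (hypothetical).
modified, surroundings = build_selections("B", ["90", "119"], 6)
labels = build_labels(["90", "119"], ["N-linked Glycosylation", "N-linked Glycosylation"])
print(surroundings)
print(labels)
```

Keeping the two selections separate allows the surroundings to be styled as thin sticks while the modified residues themselves get thicker sticks and spheres.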

In [14]:
view_modifications(pst, 6, 'uniProtId', 'uniProtName');

PDB Id: 4WJL chain Id: B
uniProtId: Q8N608
uniProtName: DPP10_HUMAN


Most PTMs occur at the protein surface. To visualize the surface, check the show_surface checkbox above.

In [15]:
spark.stop()