## Query Post-translational Modifications from dbPTM and map to 3D Structure

Post-translational modifications (PTMs) modulate protein function. Mapping the locations of modified amino acid residues onto 3D protein structures can provide insight into the effects of PTMs.

The user can query PTMs from the [dbPTM](http://dbptm.mbc.nctu.edu.tw/) database using a list of UniProt Ids (e.g., P13569), UniProt Names (e.g., CFTR_HUMAN), or PDB Id.ChainIds (e.g., 5UAK.A), and map the hits onto 3D structures in the PDB.

This notebook uses a compressed and indexed version of data from dbPTM in the ORC file format for fast data queries, retrieval, and parallel processing with [mmtf-pyspark](https://github.com/sbl-sdsc/mmtf-pyspark).

[dbPTM](http://dbptm.mbc.nctu.edu.tw/) contains about 30 types of PTMs for more than 900,000 amino acid residues.

Reference:

dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins.
Huang KY, Su MG, Kao HJ, Hsieh YC, Jhong JH, Cheng KH, Huang HD, Lee TY.
Nucleic Acids Res. (2016) 44(D1):D435-46. [doi: 10.1093/nar/gkv1240](https://doi.org/10.1093/nar/gkv1240).

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, collect_list, col, concat_ws
from ipywidgets import interact, IntSlider, widgets
from IPython.display import display
from mmtfPyspark.datasets import pdbToUniProt, dbPtmDataset
import py3Dmol

In [2]:
spark = SparkSession.builder.appName("dbPTMTo3D").getOrCreate()

## Query PTMs by an Identifier

In [3]:
field = widgets.Dropdown(options=['uniProtId','uniProtName','structureChainId'],description='Select field:')
selection = widgets.Textarea(description='Enter id(s):', value='P13569')

Select a query field and enter a comma-separated list of ids.

uniProtId: P13569

uniProtName: CFTR_HUMAN

structureChainId: 5UAK.A, 5TFB.A

In [4]:
display(field)
display(selection)

Dropdown(description='Select field:', options=('uniProtId', 'uniProtName', 'structureChainId'), value='uniProt…

Textarea(value='P13569', description='Enter id(s):')

## Create query string

In [5]:
query = field.value + " IN " + str(selection.value.split(',')).replace("[",'(').replace("]",')').replace(" ", "")
print("Query: " + query)

Query: uniProtId IN ('P13569')
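
The string manipulation above can also be written as a small helper. This is a minimal sketch and not part of mmtf-pyspark: the function name `build_in_clause` is hypothetical, and it assumes identifiers never contain quotes or commas. Unlike the one-liner, it also trims newlines around each id and skips empty entries.

```python
def build_in_clause(field, raw_ids):
    """Build a SQL-style filter such as: uniProtId IN ('P13569','Q9C5W6')."""
    # Split on commas, trim surrounding whitespace, and drop empty entries
    ids = [s.strip() for s in raw_ids.split(",") if s.strip()]
    if not ids:
        raise ValueError("no identifiers given")
    # Reject quotes so an id cannot break out of its string literal
    for identifier in ids:
        if "'" in identifier or '"' in identifier:
            raise ValueError("invalid identifier: " + identifier)
    quoted = ",".join("'" + identifier + "'" for identifier in ids)
    return field + " IN (" + quoted + ")"


print(build_in_clause("structureChainId", "5UAK.A, 5TFB.A"))
```

The returned string can be passed to a DataFrame `filter` call in the same way as `query`.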


## Read dbPTM dataset

In [6]:
db_ptm = dbPtmDataset.get_ptm_dataset()
print("Total number of PTMs: ", db_ptm.count())
db_ptm.limit(5).toPandas()

Total number of PTMs:  906354


|   | uniProtName | uniProtId | uniProtSeqNum | ptmType | pubMedIds | sequenceSegment |
|---|---|---|---|---|---|---|
| 0 | 14310_ARATH | P48347 | 209 | Phosphorylation | [23328941, 23572148] | AFDDAIAELDSLNEESYKDST |
| 1 | 14310_ARATH | P48347 | 233 | Phosphorylation | [23572148] | QLLRDNLTLWTSDLNEEGDER |
| 2 | 14310_ARATH | P48347 | 234 | Phosphorylation | [18463617] | LLRDNLTLWTSDLNEEGDERT |
| 3 | 14310_ARATH | P48347 | 244 | Phosphorylation | [23572148, 20466843, 20733066, 24243849, 19880... | SDLNEEGDERTKGADEPQDEN |
| 4 | 14312_ARATH | Q9C5W6 | 41 | Phosphorylation | [22631563, 24243849, 19880383] | ETMKKVARVNSELTVEERNLL |


Filter by UniProt identifiers

In [7]:
if field.value in ['uniProtId','uniProtName']:
    df = db_ptm.filter(query)
    print("Filtered by query:", query)
    print("Number of PTMs matching query:", df.count())
else:
    df = db_ptm
    
df.limit(5).toPandas()

Filtered by query: uniProtId IN ('P13569')
Number of PTMs matching query: 45


|   | uniProtName | uniProtId | uniProtSeqNum | ptmType | pubMedIds | sequenceSegment |
|---|---|---|---|---|---|---|
| 0 | CFTR_HUMAN | P13569 | 45 | Phosphorylation | [22135298] | SDIYQIPSVDSADNLSEKLER |
| 1 | CFTR_HUMAN | P13569 | 50 | Phosphorylation | [22135298] | IPSVDSADNLSEKLEREWDRE |
| 2 | CFTR_HUMAN | P13569 | 94 | Phosphorylation | [22135298] | YGIFLYLGEVTKAVQPLLLGR |
| 3 | CFTR_HUMAN | P13569 | 256 | Phosphorylation | [22135298] | KYRDQRAGKISERLVITSEMI |
| 4 | CFTR_HUMAN | P13569 | 291 | Phosphorylation | [20068231] | MEKMIENLRQTELKLTRKAAY |


## Get PDB to UniProt Residue Mappings

In [8]:
up = pdbToUniProt.get_cached_residue_mappings().filter("pdbResNum IS NOT NULL")
print("Number of PDB to UniProt mappings:", up.count())
up.limit(5).toPandas()

Number of PDB to UniProt mappings: 98498569


|   | structureChainId | pdbResNum | pdbSeqNum | uniprotId | uniprotNum |
|---|---|---|---|---|---|
| 0 | 1A5E.A | 1 | 1 | P42771 | 1 |
| 1 | 1A5E.A | 2 | 2 | P42771 | 2 |
| 2 | 1A5E.A | 3 | 3 | P42771 | 3 |
| 3 | 1A5E.A | 4 | 4 | P42771 | 4 |
| 4 | 1A5E.A | 5 | 5 | P42771 | 5 |


Filter by structureChainIds

In [9]:
if field.value == 'structureChainId':
    print("Filtered by query: ", query)
    up = up.filter(query)

Find the intersection between the PTM dataset and PDB to UniProt mappings

In [10]:
up = up.withColumnRenamed("uniprotId","unpId")
st = up.join(df, (up.unpId == df.uniProtId) & (up.uniprotNum == df.uniProtSeqNum)).drop("unpId")
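
To make the join condition concrete, here is a plain-Python sketch of the same inner join on a handful of rows. The row values are a toy subset copied from the sample outputs in this notebook, the helper name `join_ptms_to_structures` is hypothetical, and storing `pdbResNum` as a string is an assumption. It illustrates the matching logic only, not how Spark executes the join.

```python
from collections import defaultdict

# Toy subset of PDB-to-UniProt residue mappings
mappings = [
    {"structureChainId": "2PZE.A", "pdbResNum": "582", "uniprotId": "P13569", "uniprotNum": 582},
    {"structureChainId": "5TF8.A", "pdbResNum": "549", "uniprotId": "P13569", "uniprotNum": 549},
    {"structureChainId": "1A5E.A", "pdbResNum": "1", "uniprotId": "P42771", "uniprotNum": 1},
]

# Toy subset of PTM records
ptms = [
    {"uniProtId": "P13569", "uniProtSeqNum": 582, "ptmType": "Phosphorylation"},
    {"uniProtId": "P13569", "uniProtSeqNum": 549, "ptmType": "Phosphorylation"},
    {"uniProtId": "P13569", "uniProtSeqNum": 45, "ptmType": "Phosphorylation"},
]


def join_ptms_to_structures(mappings, ptms):
    """Inner join on (UniProt id, UniProt residue number)."""
    # Index PTMs by the join key so each mapping row is matched with one lookup
    index = defaultdict(list)
    for ptm in ptms:
        index[(ptm["uniProtId"], ptm["uniProtSeqNum"])].append(ptm)

    joined = []
    for mapping in mappings:
        key = (mapping["uniprotId"], mapping["uniprotNum"])
        for ptm in index.get(key, []):
            joined.append({**mapping, **ptm})
    return joined


hits = join_ptms_to_structures(mappings, ptms)
for hit in hits:
    print(hit["structureChainId"], hit["pdbResNum"], hit["ptmType"])
```

In this toy data, the PTM at position 45 has no residue mapping and the 1A5E.A residue has no PTM, so both are dropped, as in an inner join.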

Show some sample data

In [11]:
hits = st.count()
print("Hits:", hits)
st.sample(False, min(1.0, 5/max(hits, 1))).toPandas().head()

Hits: 317


|   | structureChainId | pdbResNum | pdbSeqNum | uniprotNum | uniProtName | uniProtId | uniProtSeqNum | ptmType | pubMedIds | sequenceSegment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2PZE.A | 582 | 165 | 582 | CFTR_HUMAN | P13569 | 582 | Phosphorylation | [12588899, 14695900, 16381945, 23193290, 22135... | DSPFGYLDVLTEKEIFESCVC |
| 1 | 2PZF.B | 515 | 97 | 515 | CFTR_HUMAN | P13569 | 515 | Phosphorylation | [17053785, 16381945, 23193290, 22135298, 22817... | NIIFGVSYDEYRYRSVIKACQ |
| 2 | 5TF8.A | 549 | 132 | 549 | CFTR_HUMAN | P13569 | 549 | Phosphorylation | [20068231] | IVLGEGGITLSGGQRARISLA |
| 3 | 2PZF.A | 549 | 131 | 549 | CFTR_HUMAN | P13569 | 549 | Phosphorylation | [20068231] | IVLGEGGITLSGGQRARISLA |
| 4 | 2BBT.B | 670 | 282 | 670 | CFTR_HUMAN | P13569 | 670 | Phosphorylation | [25330774] | SILTETLHRFSLEGDAPVSWT |


## Aggregate PTM data at the residue and chain level

In [12]:
st = st.groupBy("structureChainId","pdbResNum","uniProtId","uniProtName").agg(collect_set("ptmType").alias("ptms"))
st = st.withColumn("ptms", concat_ws((","), col("ptms")))
st = st.groupBy("structureChainId","uniProtId","uniProtName").agg(collect_list("pdbResNum").alias("pdbResNum"), collect_list("ptms").alias("ptms"))
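
The two-step aggregation can be illustrated without Spark. The sketch below is plain Python with hypothetical toy rows and a hypothetical function name `aggregate_ptms`. It first collects the distinct PTM types per residue, then builds two parallel lists per chain. Results are sorted here only to make the output reproducible; Spark does not guarantee the order of collected values.

```python
from collections import defaultdict

# Hypothetical joined rows: (structureChainId, pdbResNum, ptmType)
rows = [
    ("2PZE.A", "524", "Palmitoylation"),
    ("2PZE.A", "524", "Palmitoylation"),  # same PTM type reported twice
    ("2PZE.A", "512", "Phosphorylation"),
    ("5TFB.A", "582", "Phosphorylation"),
]


def aggregate_ptms(rows):
    # Step 1: per residue, keep distinct PTM types (analogous to collect_set)
    per_residue = defaultdict(set)
    for chain, residue, ptm in rows:
        per_residue[(chain, residue)].add(ptm)

    # Step 2: per chain, build parallel lists of residues and joined PTM
    # names (analogous to collect_list combined with concat_ws)
    per_chain = defaultdict(lambda: {"pdbResNum": [], "ptms": []})
    for chain, residue in sorted(per_residue):
        per_chain[chain]["pdbResNum"].append(residue)
        per_chain[chain]["ptms"].append(",".join(sorted(per_residue[(chain, residue)])))
    return dict(per_chain)


result = aggregate_ptms(rows)
for chain, data in result.items():
    print(chain, data["pdbResNum"], data["ptms"])
```

Position `k` in `pdbResNum` corresponds to position `k` in `ptms`, which is what the visualization relies on when it pairs residues with labels.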

Convert aggregated data to Pandas and display some results

In [13]:
pst = st.toPandas()
pst.head()

|   | structureChainId | uniProtId | uniProtName | pdbResNum | ptms |
|---|---|---|---|---|---|
| 0 | 5TFB.A | P13569 | CFTR_HUMAN | [582, 604, 515, 511, 549, 524, 512, 605] | [Phosphorylation, Phosphorylation, Phosphoryla... |
| 1 | 2PZE.A | P13569 | CFTR_HUMAN | [524, 512, 511, 515, 641, 549, 582, 605, 604] | [Palmitoylation, Phosphorylation, Phosphorylat... |
| 2 | 2PZF.B | P13569 | CFTR_HUMAN | [604, 605, 511, 549, 582, 515, 524, 512] | [Phosphorylation, Phosphorylation, Phosphoryla... |
| 3 | 4WZ6.A | P13569 | CFTR_HUMAN | [603, 581, 659, 510, 669, 548, 604, 640, 523, ... | [Phosphorylation, Phosphorylation, Phosphoryla... |
| 4 | 5TFI.A | P13569 | CFTR_HUMAN | [511, 582, 515, 605, 604, 549, 524, 512] | [Phosphorylation, Phosphorylation, Phosphoryla... |


Set up custom visualization

In [14]:
def view_modifications(df, cutoff_distance, *args):

    def view3d(show_labels=True,show_bio_assembly=False, show_surface=False, i=0):
        pdb_id, chain_id = df.iloc[i]['structureChainId'].split('.')
        res_num = df.iloc[i]['pdbResNum']
        labels = df.iloc[i]['ptms']
        
        # print header
        print ("PDB Id: " + pdb_id + " chain Id: " + chain_id)
        
        # print any specified additional columns from the dataframe
        for a in args:
            print(a + ": " + str(df.iloc[i][a]))
        
        mod_res = {'chain': chain_id, 'resi': res_num}  
        
        # select neighboring residues by distance
        surroundings = {'chain': chain_id, 'resi': res_num, 'byres': True, 'expand': cutoff_distance}
        
        viewer = py3Dmol.view(query='pdb:' + pdb_id, options={'doAssembly': show_bio_assembly})
    
        # polymer style
        viewer.setStyle({'cartoon': {'color': 'spectrum', 'width': 0.6, 'opacity':0.8}})
        # non-polymer style
        viewer.setStyle({'hetflag': True}, {'stick':{'radius': 0.3, 'singleBond': False}})
        
        # style for modifications
        viewer.addStyle(surroundings,{'stick':{'colorscheme':'orangeCarbon', 'radius': 0.15}})
        viewer.addStyle(mod_res, {'stick':{'colorscheme':'redCarbon', 'radius': 0.4}})
        viewer.addStyle(mod_res, {'sphere':{'colorscheme':'gray', 'opacity': 0.7}})
        
        # set residue labels    
        if show_labels:
            for residue, label in zip(res_num, labels):
                viewer.addLabel(str(residue) + ": " + label, \
                                {'fontColor':'black', 'fontSize': 8, 'backgroundColor': 'lightgray'}, \
                                {'chain': chain_id, 'resi': residue})

        viewer.zoomTo(surroundings)
        
        if show_surface:
            viewer.addSurface(py3Dmol.SES,{'opacity':0.8,'color':'lightblue'})

        return viewer.show()
       
    s_widget = IntSlider(min=0, max=len(df)-1, description='Structure', continuous_update=False)
    
    return interact(view3d, show_labels=True, show_bio_assembly=False, show_surface=False, i=s_widget)
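
The two selections used in `view3d` are ordinary dictionaries that py3Dmol passes on to 3Dmol.js. The sketch below builds them outside the viewer so they can be inspected; the helper name `build_selections` and the residue numbers are hypothetical. It relies on the 3Dmol.js selection keys `expand` (grow the selection by a distance) and `byres` (extend the selection to whole residues).

```python
def build_selections(chain_id, residues, cutoff_distance):
    """Return (modified residues, surroundings) as 3Dmol.js-style selection dicts."""
    # Selection covering only the modified residues of one chain
    mod_res = {"chain": chain_id, "resi": list(residues)}
    # Same selection, grown by cutoff_distance and completed to whole residues
    surroundings = dict(mod_res, byres=True, expand=cutoff_distance)
    return mod_res, surroundings


# A structureChainId such as "5TFB.A" splits into a PDB id and a chain id
pdb_id, chain_id = "5TFB.A".split(".")
mod_res, surroundings = build_selections(chain_id, ["582", "604"], 6)
print(pdb_id, mod_res, surroundings)
```

Keeping the two selections separate lets the viewer style the modified residues and their neighborhood differently while zooming to the larger selection.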

## Visualize Results
Residues with reported modifications are shown in an all-atom representation as red sticks with transparent spheres. Each modified residue position is labeled with the PDB residue number and the type of modification. Residues surrounding the modified residues (within 6 Å) are highlighted as orange sticks. Small molecules within the structure are rendered as gray sticks.

In [15]:
view_modifications(pst, 6, 'uniProtId', 'uniProtName');

PDB Id: 5TFB chain Id: A
uniProtId: P13569
uniProtName: CFTR_HUMAN


Most PTMs occur at the protein surface. To visualize the surface, check the show_surface checkbox above.

In [16]:
spark.stop()