## Query Post-translational Modifications Present in 3D Structures of the PDB

Post-translational modifications (PTMs) modulate protein function. By mapping the locations of modified amino acid residues onto 3D protein structures, insights into the effect of PTMs can be obtained.

In this notebook, PTMs present in PDB structures can be queried by residId, psimodId, uniProtId, or structureChainId. See the examples below.

Reference:

BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank (2017) Bioinformatics 33: 2047–2049. [doi: 10.1093/bioinformatics/btx101](https://doi.org/10.1093/bioinformatics/btx101)

In [1]:
# imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, collect_list, col, concat_ws
from ipywidgets import interact, IntSlider, widgets
from IPython.display import display
from mmtfPyspark.datasets import pdbToUniProt, pdbPtmDataset
import py3Dmol
import timeit

In [2]:
start_time = timeit.default_timer()

In [3]:
spark = SparkSession.builder.master("local[4]").appName("QueryPdbPTM").getOrCreate()

## Query PTMs by an Identifier

In [4]:
field = widgets.Dropdown(options=['residId', 'psimodId','uniProtId','structureChainId','all'],description='Select field:')
selection = widgets.Textarea(description='Enter id(s):', value='AA0151')

Select a query field and enter a comma-separated list of ids. Examples:

residId: AA0151 (N4-(N-acetylamino)glucosyl-L-asparagine)

psimodId: MOD:00046, MOD:00047, MOD:00048 (O-phospho-L-serine, O-phospho-L-threonine, O4'-phospho-L-tyrosine)

uniProtId: P13569

structureChainId: 4ZTN.C

In [5]:
display(field)
display(selection)

Dropdown(description='Select field:', options=('residId', 'psimodId', 'uniProtId', 'structureChainId', 'all'),…

Textarea(value='AA0151', description='Enter id(s):')

## Create query string

In [6]:
# turn the comma-separated input into a SQL-style IN clause, e.g. residId IN ('AA0151')
query = field.value + " IN " + str(selection.value.split(',')).replace("[",'(').replace("]",')').replace(" ", "")
print("Query: " + query)

Query: residId IN ('AA0151')
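
The query string above is derived from the text form of a Python list. As a minimal sketch, the same kind of clause can be built more explicitly. `build_in_clause` is a hypothetical helper, not part of mmtfPyspark, and it assumes that the ids contain no quote characters.

```python
def build_in_clause(field, raw_ids):
    """Build a SQL-style filter such as "residId IN ('AA0151')".

    Hypothetical helper for illustration; assumes ids contain no quotes.
    """
    # split on commas, trim surrounding whitespace, and drop empty entries
    ids = [s.strip() for s in raw_ids.split(",") if s.strip()]
    if not ids:
        raise ValueError("no ids given")
    quoted = ", ".join("'" + i + "'" for i in ids)
    return field + " IN (" + quoted + ")"


print(build_in_clause("residId", "AA0151"))
print(build_in_clause("psimodId", "MOD:00046, MOD:00047, MOD:00048"))
```

Unlike the one-liner in the cell, this sketch rejects empty input instead of producing a clause that matches nothing.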


## Read dataset of PTMs present in PDB

In [7]:
db_ptm = pdbPtmDataset.get_ptm_dataset()
print("Total number of PTM records: ", db_ptm.count())

Total number of PTM records:  1138955


Filter by PTM identifiers

In [8]:
if field.value in ['residId','psimodId']:
    df = db_ptm.filter(query)
    print("Filtered by query:", query)
    print("Number of PTMs matching query:", df.count())
    df.show(5)
else:
    df = db_ptm

Filtered by query: residId IN ('AA0151')
Number of PTMs matching query: 58646
+----------+---------+-------+---------+-------+----+----------+--------------+
|pdbChainId|pdbResNum|residue| psimodId|residId|ccId|  category|modificationId|
+----------+---------+-------+---------+-------+----+----------+--------------+
|    1A0H.B|      373|    ASN|MOD:00831| AA0151| NAG|attachment|             2|
|    1A0H.B|      584|    NAG|MOD:00831| AA0151| NAG|attachment|             2|
|    1A0H.E|      373|    ASN|MOD:00831| AA0151| NAG|attachment|             2|
|    1A0H.E|      584|    NAG|MOD:00831| AA0151| NAG|attachment|             2|
|    1A14.N|     476A|    NAG|MOD:00831| AA0151| NAG|attachment|             2|
+----------+---------+-------+---------+-------+----+----------+--------------+
only showing top 5 rows



## Get PDB to UniProt Residue Mappings

In [9]:
up = pdbToUniProt.get_cached_residue_mappings().filter("pdbResNum IS NOT NULL")
print("Number of PDB to UniProt mappings:", up.count())

Number of PDB to UniProt mappings: 98498569


Filter by uniProtId or structureChainId

In [10]:
if field.value in ['uniProtId','structureChainId']:
    up = up.filter(query)
    print("Filtered by query: ", query)
    print("Number of records matching query:", up.count())

Find the intersection between the PTM dataset and PDB to UniProt mappings

In [11]:
df = df.withColumnRenamed("pdbResNum","resNum")
st = up.join(df, (up.structureChainId == df.pdbChainId) & (up.pdbResNum == df.resNum)).drop("pdbChainId").drop("resNum")
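
To illustrate what this join keeps, here is a minimal pure-Python sketch of an inner join on the pair (chain id, residue number). The rows are a toy subset copied from the sample outputs in this notebook, and `inner_join` is a hypothetical helper, not how Spark implements joins.

```python
# toy subset of the PDB to UniProt mappings
mappings = [
    {"structureChainId": "2NY2.A", "pdbResNum": "339", "uniprotId": "P19551", "uniprotNum": 337},
    {"structureChainId": "3I6S.A", "pdbResNum": "376", "uniprotId": "O82777", "uniprotNum": 376},
]
# toy subset of the PTM records
ptms = [
    {"pdbChainId": "2NY2.A", "pdbResNum": "339", "residue": "ASN", "residId": "AA0151"},
    {"pdbChainId": "1A0H.B", "pdbResNum": "373", "residue": "ASN", "residId": "AA0151"},
]


def inner_join(mappings, ptms):
    # index the PTM rows by (chain id, residue number)
    index = {}
    for p in ptms:
        index.setdefault((p["pdbChainId"], p["pdbResNum"]), []).append(p)
    joined = []
    for m in mappings:
        for p in index.get((m["structureChainId"], m["pdbResNum"]), []):
            row = dict(m)
            # copy the PTM columns, skipping the duplicated join keys
            row.update({k: v for k, v in p.items() if k not in ("pdbChainId", "pdbResNum")})
            joined.append(row)
    return joined


result = inner_join(mappings, ptms)
print(result)
```

Only rows whose key appears in both inputs survive, so a PTM record without a residue mapping in the toy subset (here 1A0H.B, residue 373) is dropped.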

Show some sample data

In [12]:
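hits = st.count()
print("Hits:", hits)
# sample roughly 10 rows; guard against division by zero when there are no hits
fraction = min(10/hits, 1.0) if hits > 0 else 0.0
st.sample(False, fraction).toPandas().head()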

Hits: 27737


| | structureChainId | pdbResNum | pdbSeqNum | uniprotId | uniprotNum | residue | psimodId | residId | ccId | category | modificationId |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2NY2.A | 339 | 164 | P19551 | 337 | ASN | MOD:00831 | AA0151 | NAG | attachment | 2 |
| 1 | 3I6S.A | 376 | 264 | O82777 | 376 | ASN | MOD:00831 | AA0151 | NAG | attachment | 2 |
| 2 | 3R5O.A | 332 | 332 | P80025 | 449 | ASN | MOD:00831 | AA0151 | NAG | attachment | 2 |
| 3 | 3ZK4.B | 293 | 274 | Q8VX11 | 318 | ASN | MOD:00831 | AA0151 | NAG | attachment | 2 |
| 4 | 4DKU.A | 262 | 146 | A0A0M3KKW9 | 146 | ASN | MOD:00831 | AA0151 | NAG | attachment | 2 |
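
The sampling fraction in the cell above is chosen so that about 10 rows are expected regardless of the number of hits. Below is a minimal sketch of that arithmetic. `sample_fraction` is a hypothetical helper that also handles the case of zero hits. Note that Spark's `sample` draws rows randomly, so the number of rows returned is only approximately the target.

```python
def sample_fraction(hits, target=10):
    """Fraction of rows to sample so that about `target` rows are expected."""
    if hits <= 0:
        # nothing to sample; avoids dividing by zero
        return 0.0
    # cap at 1.0 when there are fewer hits than the target
    return min(target / hits, 1.0)


for hits in (0, 5, 27737):
    f = sample_fraction(hits)
    print(hits, f, "expected rows:", round(hits * f, 1))
```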


## Aggregate PTM data at the chain level

In [13]:
st = st.groupBy("structureChainId", "uniprotId").agg(
    collect_list("pdbResNum").alias("pdbResNum"),
    collect_list("residId").alias("residId"),
    collect_list("psimodId").alias("psimodId"),
    collect_list("ccId").alias("ccId"))
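
As a rough illustration of this aggregation, the pure-Python sketch below groups residue-level rows by (structureChainId, uniprotId) and collects each remaining column into a list. `aggregate_by_chain` is a hypothetical helper, and the rows are toy values taken from the sample output in this notebook. Unlike this sketch, Spark does not guarantee the order of the elements that `collect_list` returns.

```python
from collections import defaultdict

# toy residue-level rows (one row per modified residue)
rows = [
    {"structureChainId": "3ABG.B", "uniprotId": "Q12737", "pdbResNum": "472",
     "residId": "AA0151", "psimodId": "MOD:00831", "ccId": "NAG"},
    {"structureChainId": "3ABG.B", "uniprotId": "Q12737", "pdbResNum": "482",
     "residId": "AA0151", "psimodId": "MOD:00831", "ccId": "NAG"},
    {"structureChainId": "2HOD.B", "uniprotId": "P02675", "pdbResNum": "364",
     "residId": "AA0151", "psimodId": "MOD:00831", "ccId": "NAG"},
]


def aggregate_by_chain(rows, columns=("pdbResNum", "residId", "psimodId", "ccId")):
    # one dict of lists per (chain, UniProt id) group
    groups = defaultdict(lambda: {c: [] for c in columns})
    for r in rows:
        key = (r["structureChainId"], r["uniprotId"])
        for c in columns:
            groups[key][c].append(r[c])
    return dict(groups)


chains = aggregate_by_chain(rows)
for key, values in chains.items():
    print(key, values)
```

Because every list in a group is filled from the same row in the same pass, the lists stay aligned by position, which the visualization relies on when it zips them together to build labels.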

Convert aggregated data to Pandas and display some results

In [14]:
pst = st.toPandas()
pst.head(10)

| | structureChainId | uniprotId | pdbResNum | residId | psimodId | ccId |
|---|---|---|---|---|---|---|
| 0 | 1LGC.J | | [512] | [AA0151] | [MOD:00831] | [NAG] |
| 1 | 2HOD.B | P02675 | [364] | [AA0151] | [MOD:00831] | [NAG] |
| 2 | 2R5L.A | A3F9D6 | [95, 205, 241, 332] | [AA0151, AA0151, AA0151, AA0151] | [MOD:00831, MOD:00831, MOD:00831, MOD:00831] | [NAG, NAG, NAG, NAG] |
| 3 | 3ABG.B | Q12737 | [472, 482] | [AA0151, AA0151] | [MOD:00831, MOD:00831] | [NAG, NAG] |
| 4 | 3AL4.C | C3W5S1 | [17, 29, 93, 282] | [AA0151, AA0151, AA0151, AA0151] | [MOD:00831, MOD:00831, MOD:00831, MOD:00831] | [NAG, NAG, NAG, NAG] |
| 5 | 3MJ7.B | P97792 | [87] | [AA0151] | [MOD:00831] | [NAG] |
| 6 | 3W4R.A | Q2V6H4 | [87, 305] | [AA0151, AA0151] | [MOD:00831, MOD:00831] | [NAG, NAG] |
| 7 | 4GJ9.A | P00797 | [67] | [AA0151] | [MOD:00831] | [NAG] |
| 8 | 4KN0.A | P14207 | [115, 195] | [AA0151, AA0151] | [MOD:00831, MOD:00831] | [NAG, NAG] |
| 9 | 4OF3.B | B1Q236 | [93] | [AA0151] | [MOD:00831] | [NAG] |


In [15]:
print("Time to process data:", int(round(timeit.default_timer() - start_time,0)), "seconds")

Time to process data: 113 seconds


Set up custom visualization

In [16]:
def view_modifications(df, cutoff_distance, *args):

    def view3d(show_labels=True,show_bio_assembly=False, show_surface=False, i=0):
        pdb_id, chain_id = df.iloc[i]['structureChainId'].split('.')
        res_num = df.iloc[i]['pdbResNum']
        lab1 = df.iloc[i]['residId']
        lab2 = df.iloc[i]['psimodId']
        lab3 = df.iloc[i]['ccId']
        
        # print header
        print ("PDB Id: " + pdb_id + " chain Id: " + chain_id)
        
        # print any specified additional columns from the dataframe
        for a in args:
            if df.iloc[i][a]:
                # convert to str so non-string columns can be printed as well
                print(a + ": " + str(df.iloc[i][a]))
        
        mod_res = {'chain': chain_id, 'resi': res_num}  
        
        # select neighboring residues by distance
        surroundings = {'chain': chain_id, 'resi': res_num, 'byres': True, 'expand': cutoff_distance}
        
        viewer = py3Dmol.view(query='pdb:' + pdb_id, options={'doAssembly': show_bio_assembly})
    
        # polymer style
        viewer.setStyle({'cartoon': {'color': 'spectrum', 'width': 0.6, 'opacity':0.8}})
        # non-polymer style
        viewer.setStyle({'hetflag': True}, {'stick':{'radius': 0.3, 'singleBonds': False}})
        
        # style for modifications
        viewer.addStyle(surroundings,{'stick':{'colorscheme':'orangeCarbon', 'radius': 0.15}})
        viewer.addStyle(mod_res, {'stick':{'colorscheme':'redCarbon', 'radius': 0.4}})
        viewer.addStyle(mod_res, {'sphere':{'colorscheme':'gray', 'opacity': 0.7}})
        
        # set residue labels    
        if show_labels:
            for residue, l1, l2, l3 in zip(res_num, lab1, lab2, lab3):
                viewer.addLabel(residue + ": " + l1 + " " + l2 + " " + l3, \
                                {'fontColor':'black', 'fontSize': 10, 'backgroundColor': 'lightgray'}, \
                                {'chain': chain_id, 'resi': residue})

        viewer.zoomTo(surroundings)
        
        if show_surface:
            viewer.addSurface(py3Dmol.SES,{'opacity':0.8,'color':'lightblue'})

        return viewer.show()
       
    s_widget = IntSlider(min=0, max=len(df)-1, description='Structure', continuous_update=False)
    
    return interact(view3d, show_labels=True, show_bio_assembly=False, show_surface=False, i=s_widget)
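
The selection and label logic inside `view3d` can be tried without a viewer. The sketch below mirrors how the function splits a structureChainId, builds the two selection dictionaries, and composes the label text. `build_selections` and `build_labels` are hypothetical helper names introduced here for illustration. They are not part of py3Dmol.

```python
def build_selections(structure_chain_id, res_nums, cutoff_distance):
    """Return the PDB id plus selections for modified residues and their surroundings."""
    # e.g. "3ABG.B" -> PDB id "3ABG", chain "B"
    pdb_id, chain_id = structure_chain_id.split(".")
    mod_res = {"chain": chain_id, "resi": res_nums}
    # 'expand' grows the selection by a distance; 'byres' extends it to whole residues
    surroundings = {"chain": chain_id, "resi": res_nums,
                    "byres": True, "expand": cutoff_distance}
    return pdb_id, mod_res, surroundings


def build_labels(res_nums, resid_ids, psimod_ids, cc_ids):
    """One label per modified residue: '<resNum>: <residId> <psimodId> <ccId>'."""
    return [f"{r}: {a} {b} {c}"
            for r, a, b, c in zip(res_nums, resid_ids, psimod_ids, cc_ids)]


pdb_id, mod_res, surroundings = build_selections("3ABG.B", ["472", "482"], 6)
labels = build_labels(["472", "482"], ["AA0151", "AA0151"],
                      ["MOD:00831", "MOD:00831"], ["NAG", "NAG"])
print(pdb_id, mod_res, surroundings)
print(labels)
```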

## Visualize Results
Residues with reported modifications are shown in an all-atom representation as red sticks with transparent spheres. Each modified residue position is labeled with its PDB residue number and the modification identifiers (residId, psimodId, and ccId). Residues surrounding a modified residue (within 6 Å) are highlighted as orange sticks. Small molecules within the structure are rendered as gray sticks.

In [17]:
view_modifications(pst, 6, 'uniprotId');

interactive(children=(Checkbox(value=True, description='show_labels'), Checkbox(value=False, description='show…

Most PTMs occur at the protein surface. To visualize the surface, check the show_surface checkbox above.

In [18]:
spark.stop()