## Find Missense Mutations in PDB
This notebook is a prototype for visualizing the positions of missense mutations from [dbSNP](https://www.ncbi.nlm.nih.gov/projects/SNP/) (GRCh37 build) for cases where a protein structure contains the mutated amino acid.

In [1]:
# Disable Numba: temporary workaround for https://github.com/sbl-sdsc/mmtf-pyspark/issues/288
import os
os.environ['NUMBA_DISABLE_JIT'] = "1"

In [2]:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import collect_set, collect_list, concat_ws
from mmtfPyspark.datasets import dbSnpDataset
import pandas as pd
from ipywidgets import interact, IntSlider, widgets
from IPython.display import display
import py3Dmol

In [3]:
field = widgets.Dropdown(options=('none','snp_id', 'pdbChainId','uniprotId','sqlQuery'),description='Select field:')
selection = widgets.Textarea(description='Enter id(s):')
significance = widgets.SelectMultiple(description='Significance:', \
                                      options=('All', 'Benign', 'Likely benign', 'Likely pathogenic', \
                                               'Pathogenic', 'drug-response', 'untested', \
                                               'Uncertain significance', 'other', 'null'), \
                                      value=('Benign', 'Likely benign', 'Likely pathogenic', \
                                               'Pathogenic', 'drug-response'))

## Select clinical significance
Select one or more significance levels from ClinVar (macOS: hold the Command key to select multiple criteria).

Default: Benign, Likely benign, Likely pathogenic, Pathogenic, drug-response.

In [4]:
display(significance)

SelectMultiple(description='Significance:', index=(1, 2, 3, 4, 5), options=('All', 'Benign', 'Likely benign', …

## Optionally, filter dataset
Select a query field and enter a comma-separated list of identifiers.

Example queries below are for missense mutations in CFTR, the cystic fibrosis gene (see [CFTR2](https://www.cftr2.org/mutations_history)).

* none: no filtering (Default)
* snp_id: 397508256, 397508796 (also called the rsId, e.g. rs397508256)
* pdbChainId: 5UAK.A
* uniprotId: P13569
* sqlQuery: any valid SQL filter condition (e.g., chr = 7 AND pos = 117149089)
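
The snp_id examples above are entered without the `rs` prefix. As a minimal sketch, assuming the `snp_id` column holds plain integers (as the sample output suggests), a hypothetical helper `normalize_snp_ids` could accept either form. It is not part of the notebook or of mmtf-pyspark.

```python
def normalize_snp_ids(text):
    """Parse a comma-separated string of SNP ids, with or without 'rs' prefix."""
    ids = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # skip empty entries, e.g. after a trailing comma
        if token.lower().startswith("rs"):
            token = token[2:]  # drop the 'rs' prefix of an rsId
        ids.append(int(token))  # raises ValueError for non-numeric input
    return ids


print(normalize_snp_ids("rs397508256, 397508796"))
```

Rejecting non-numeric input early (via `int`) is a design choice of this sketch; it surfaces typos before a query is sent to Spark.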

In [5]:
# show widgets
display(field)
display(selection)

Dropdown(description='Select field:', options=('none', 'snp_id', 'pdbChainId', 'uniprotId', 'sqlQuery'), value…

Textarea(value='', description='Enter id(s):')

### Create query strings

In [6]:
if significance.value and 'All' not in significance.value:
    sig_query = "clinsig IN " + str(significance.value).replace(",)", ")")
    print("Query:", sig_query)

if field.value == 'sqlQuery':
    query = selection.value
    print("Query:", query)

elif field.value != 'none':
    # strip whitespace so that "id1, id2" does not produce the value ' id2'
    ids = tuple(s.strip() for s in selection.value.split(","))
    query = field.value + " IN " + str(ids).replace(",)", ")")
    print("Query:", query)

Query: clinsig IN ('Benign', 'Likely benign', 'Likely pathogenic', 'Pathogenic', 'drug-response')
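
The query strings above rely on the text form of a Python tuple. A more explicit alternative is sketched below, assuming a hypothetical helper `in_clause` (not part of the notebook) that quotes each value itself and refuses values containing a single quote rather than trying to escape them.

```python
def in_clause(column, values):
    """Build a filter such as: column IN ('a', 'b')."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        raise ValueError("no values given for " + column)
    for v in cleaned:
        if "'" in v:
            # refuse rather than guess at the SQL dialect's escaping rules
            raise ValueError("single quote not allowed in value: " + v)
    quoted = ", ".join("'" + v + "'" for v in cleaned)
    return column + " IN (" + quoted + ")"


print(in_clause("clinsig", ["Benign", "Likely benign"]))
print(in_clause("snp_id", "397508256, 397508796".split(",")))
```

The resulting string can be passed to `ds.filter(...)` in the same way as `sig_query` and `query`.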


### Initialize Spark

In [7]:
spark = SparkSession.builder.master("local[4]").appName("SNPsInPDB").getOrCreate()

In [8]:
# Enable Arrow-based columnar data transfers between Spark and Pandas dataframes
# (Spark 3.0+ renames this setting to spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

## Read file with dbSNP info
The following dataset was created from the SNP3D_PDB_GRCH37 dataset by mapping non-synonymous SNPs to human proteins with >= 95% sequence identity in the PDB.

In [9]:
ds = dbSnpDataset.get_cached_dataset()
ds.count()

1171630

## Find mutated residues in PDB structures
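This filter keeps only the SNPs whose variant residue (`master_var`) matches the residue found at the mapped position in the PDB structure (`pdb_res`), i.e., cases where the structure contains the mutated amino acid.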

In [10]:
ds = ds.filter("master_var == pdb_res")

In [11]:
print("Missense mutations in PDB", ds.count())

Missense mutations in PDB 1291
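
To make the effect of the filter concrete, here is a small plain-Python sketch. The first row uses values from the sample results of this notebook; the second row is hypothetical and stands for a structure that carries the reference residue instead of the variant.

```python
rows = [
    # values from the sample results (snp_id 200688856, chain 3WN5.C)
    {"snp_id": 200688856, "master_res": "S", "master_var": "R", "pdb_res": "R"},
    # hypothetical row: the structure has the reference residue, not the variant
    {"snp_id": 0, "master_res": "S", "master_var": "R", "pdb_res": "S"},
]

# same condition as ds.filter("master_var == pdb_res")
in_structure = [r for r in rows if r["master_var"] == r["pdb_res"]]

print([r["snp_id"] for r in in_structure])
```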


## Filter by clinical significance

In [12]:
if significance.value and 'All' not in significance.value:
    ds = ds.filter(sig_query)
    print("Results: ", ds.count())

Results:  108


## Filter by optional Ids

In [13]:
if field.value in ['snp_id','pdbChainId','uniprotId','sqlQuery']:
    print("Filtered by query: ", query)
    ds = ds.filter(query)
    ds.show(5)

### Show some sample results

In [14]:
ds.toPandas().head(20)

| | chr | pos | snp_id | master_acc | master_gi | master_pos | master_res | master_var | pdb_gi | pdb_res | pdb_pos | blast_ident | clinsig | pdbChainId | tax_id | pdbResNum | uniprotId | uniprotNum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 161599779 | 200688856 | NP_001257965 | 402478626 | 19 | S | R | 723586676 | R | 19 | 95.288 | Pathogenic | 3WN5.C | 9606 | 15 | P08637 | 36 |
| 1 | 17 | 1673276 | 1136287 | NP_001316832 | 1049480189 | 72 | T | M | 15988024 | M | 52 | 99.246 | Benign | 1IMV.A | 9606 | 52 | P36955 | 72 |
| 2 | 17 | 1673276 | 1136287 | NP_002606 | 39725934 | 72 | T | M | 15988024 | M | 52 | 99.246 | Benign | 1IMV.A | 9606 | 52 | P36955 | 72 |
| 3 | 17 | 64210757 | 4581 | NP_000033 | 153266841 | 266 | V | L | 6573461 | L | 247 | 99.693 | Benign | 1C1Z.A | 9606 | 247 | P02749 | 266 |
| 4 | 8 | 18258316 | 1208 | NP_000006 | 116295260 | 268 | R | K | 149243115 | K | 272 | 99.654 | drug-response | 2PFR.A | 9606 | 268 | P11245 | 268 |
| 5 | 1 | 161599693 | 448740 | NP_001257965 | 402478626 | 48 | N | S | 723586676 | S | 48 | 95.288 | Pathogenic | 3WN5.C | 9606 | 44 | P08637 | 65 |
| 6 | X | 8504833 | 808119 | NP_000207 | 119395746 | 534 | V | I | 109156983 | I | 511 | 99.848 | Benign | 1ZLG.A | 9606 | 534 | P23352 | 534 |
| 7 | 1 | 161599693 | 448740 | NP_001257966 | 402478628 | 48 | N | S | 10121061 | S | 48 | 98.889 | Pathogenic | 1E4K.C | 9606 | 44 | O75015 | 65 |
| 8 | 6 | 5260936 | 2224391 | NP_001158313 | 258679535 | 11 | S | A | 1209040466 | A | 11 | 98.571 | Benign | 5USR.B | 9606 | 11 | Q9HD34 | 11 |
| 9 | 6 | 5260936 | 2224391 | NP_001305711 | 974005181 | 11 | S | A | 1209040466 | A | 11 | 98.551 | Benign | 5USR.B | 9606 | 11 | Q9HD34 | 11 |


## Aggregate data on the residue and chain level

In [15]:
# Level 1: one row per residue, collecting the distinct variants and significances
ds = ds.groupBy("pdbChainId", "pdbResNum", "master_res", "uniprotId").agg(
    collect_set("master_var").alias("master_var"),
    collect_set("clinsig").alias("clinsig"))
ds = ds.withColumn("master_var", concat_ws("", ds.master_var))
ds = ds.withColumn("clinsig", concat_ws(",", ds.clinsig))
ds = ds.withColumn("snps", concat_ws("->", ds.master_res, ds.master_var))
ds = ds.drop("master_res")
# Level 2: one row per chain, with parallel lists of residues, SNPs, and significances
ds = ds.groupBy("pdbChainId", "uniprotId").agg(
    collect_list("pdbResNum").alias("pdbResNums"),
    collect_list("snps").alias("snps"),
    collect_list("clinsig").alias("clinsig"))
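
The two-level aggregation can be illustrated without Spark. The sketch below is a plain-Python approximation, not the Spark implementation: the function name `aggregate` is hypothetical, residue numbers are simplified to integers, and sets are sorted for reproducible output (Spark's `collect_set` gives no ordering guarantee). The input rows are taken from the sample results, including the SNP that maps to chain 1IMV.A through two RefSeq entries.

```python
# (pdbChainId, pdbResNum, master_res, uniprotId, master_var, clinsig)
rows = [
    ("3WN5.C", 15, "S", "P08637", "R", "Pathogenic"),
    ("3WN5.C", 44, "N", "P08637", "S", "Pathogenic"),
    ("1IMV.A", 52, "T", "P36955", "M", "Benign"),
    ("1IMV.A", 52, "T", "P36955", "M", "Benign"),  # same SNP, second RefSeq entry
]


def aggregate(rows):
    # Level 1: one entry per residue; sets drop duplicate variants/significances
    per_residue = {}
    for chain, res_num, ref, uniprot, var, sig in rows:
        key = (chain, uniprot, res_num, ref)
        variants, sigs = per_residue.setdefault(key, (set(), set()))
        variants.add(var)
        sigs.add(sig)

    # Level 2: one entry per chain, holding three parallel lists
    per_chain = {}
    for (chain, uniprot, res_num, ref), (variants, sigs) in per_residue.items():
        entry = per_chain.setdefault(
            (chain, uniprot), {"pdbResNums": [], "snps": [], "clinsig": []})
        entry["pdbResNums"].append(res_num)
        entry["snps"].append(ref + "->" + "".join(sorted(variants)))
        entry["clinsig"].append(",".join(sorted(sigs)))
    return per_chain


result = aggregate(rows)
for key, value in result.items():
    print(key, value)
```

Keeping the three lists parallel is what lets the viewer functions later iterate over them together with `zip`.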

In [16]:
df = ds.toPandas()
df.head(20)

| | pdbChainId | uniprotId | pdbResNums | snps | clinsig |
|---|---|---|---|---|---|
| 0 | 3BIC.A | P22033 | [671, 499] | [I->V, A->T] | [Benign, Likely benign] |
| 1 | 1C1Z.A | P02749 | [247] | [V->L] | [Benign] |
| 2 | 5J8R.A | Q05209 | [61] | [K->R] | [Pathogenic] |
| 3 | 4MZV.A | P16422 | [115] | [M->T] | [Likely benign] |
| 4 | 3WN5.C | P08637 | [15, 61, 44] | [S->R, N->D, N->S] | [Pathogenic, Pathogenic, Pathogenic] |
| 5 | 1IMV.A | P36955 | [52] | [T->M] | [Benign] |
| 6 | 4KKD.A | P48740 | [666] | [G->E] | [Pathogenic] |
| 7 | 3RMU.A | Q96PE7 | [104] | [R->L] | [Likely benign] |
| 8 | 5BV8.A | P04275 | [1381, 1324] | [T->A, G->S] | [Likely benign, Pathogenic] |
| 9 | 4FXK.B | P0C0L4 | [1176] | [S->N] | [Benign] |


In [17]:
def view_modifications(df, cutoff_distance, *args):

    def view3d(show_bio_assembly=False, show_surface=False, show_labels=True, i=0):
        pdb_id, chain_id = df.iloc[i]['pdbChainId'].split('.')
        res_num = list(df.iloc[i]['pdbResNums'])
        labels = df.iloc[i]['snps']
        sigs = df.iloc[i]['clinsig']
        
        sig_dir = {'Benign':'green', 'Likely benign':'turquoise', 'Likely pathogenic':'palevioletred', \
                    'Pathogenic':'red', 'drug-response':'plum', 'untested':'white', \
                    'Uncertain significance': 'lightgray', 'other':'white', 'null':'white'}
        
        # print header
        print("PDB Id: " + pdb_id + " chain Id: " + chain_id)

        # print any specified additional columns from the dataframe
        for a in args:
            print(a + ": " + str(df.iloc[i][a]))

        all_residues = {'resi': res_num, 'chain': chain_id}
        
        # select neighboring residues by distance
        surroundings = {'chain': chain_id, 'resi': res_num, 'byres': True, 'expand': cutoff_distance}
        
        viewer = py3Dmol.view(query='pdb:' + pdb_id, options={'doAssembly': show_bio_assembly})

        # polymer style
        viewer.setStyle({'cartoon': {'colorscheme': 'chain', 'width': 0.6, 'opacity':0.9}})

        # non-polymer style
        viewer.setStyle({'hetflag': True}, {'stick':{'radius': 0.3, 'singleBond': False}})
        
        # residues surrounding mutation positions
        viewer.addStyle(surroundings,{'stick':{'colorscheme':'orangeCarbon', 'radius': 0.15}})
        
        # mutation positions
        for label, res, sig in zip(labels, res_num, sigs):
            sig1 = sig.split(',')[0] # if multiple values, use the first one
            col = (sig_dir[sig1])
            mod_res = {'resi': res, 'chain': chain_id} 
            c_col = col + "Carbon"
            viewer.addStyle(mod_res, {'stick':{'colorscheme':c_col, 'radius': 0.2}})
            viewer.addStyle(mod_res, {'sphere':{'color':col, 'opacity': 0.6}})
                
            if show_labels:
                viewer.addLabel(label + " " + sig, {'fontSize':10,'fontColor':col,'backgroundColor':'ivory'}, {'resi': res, 'chain': chain_id})
        
        viewer.zoomTo(all_residues)
        
        if show_surface:
            viewer.addSurface(py3Dmol.SES,{'opacity':0.8,'color':'lightblue'})

        return viewer.show()
       
    s_widget = IntSlider(min=0, max=len(df)-1, description='Structure', continuous_update=False)
    
    return interact(view3d, show_bio_assembly=False, show_surface=False, show_labels=True, i=s_widget)
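
The color for each mutation comes from the first entry of its comma-separated significance string. The sketch below isolates that step in a hypothetical helper `color_for`, using the same color table as `sig_dir`. Unlike the direct dictionary lookup in the function above, it falls back to a default color for values outside the table; whether such values occur in the dataset is not stated here, so the fallback is an assumption.

```python
SIG_COLORS = {
    'Benign': 'green', 'Likely benign': 'turquoise',
    'Likely pathogenic': 'palevioletred', 'Pathogenic': 'red',
    'drug-response': 'plum', 'untested': 'white',
    'Uncertain significance': 'lightgray', 'other': 'white', 'null': 'white',
}


def color_for(sig, default='white'):
    """Return the color for the first significance in a comma-separated string."""
    first = sig.split(',')[0].strip()
    return SIG_COLORS.get(first, default)  # default for values not in the table


print(color_for('Pathogenic'))
print(color_for('Likely benign,Pathogenic'))
```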

## Visualize locations of missense mutations
Mutated residues are rendered as sticks and transparent spheres, colored by ClinVar significance. Each mutation is labeled with its amino acid substitution (e.g., S->R) and ClinVar significance. Residues surrounding mutation sites (within 6 Å) are rendered as thin orange sticks. Small molecules within the structure are rendered as gray sticks.
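
The viewer selects the mutated residues and their surroundings with two plain dictionaries. The sketch below rebuilds them outside the viewer so they can be inspected; the helper name `build_selections` is hypothetical, while the keys (`resi`, `chain`, `byres`, `expand`) are the ones used in `view_modifications`.

```python
def build_selections(pdb_chain_id, res_nums, cutoff_distance):
    """Split 'PDBID.CHAIN' and build the two selection dictionaries."""
    pdb_id, chain_id = pdb_chain_id.split('.')
    # the mutated residues themselves
    mutated = {'resi': list(res_nums), 'chain': chain_id}
    # the same residues, grown by cutoff_distance (in Å) and completed to whole residues
    surroundings = {'chain': chain_id, 'resi': list(res_nums),
                    'byres': True, 'expand': cutoff_distance}
    return pdb_id, mutated, surroundings


pdb_id, mutated, surroundings = build_selections('3WN5.C', [15, 61, 44], 6)
print(pdb_id)
print(mutated)
print(surroundings)
```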

In [18]:
view_modifications(df, 6, 'uniprotId');

interactive(children=(Checkbox(value=False, description='show_bio_assembly'), Checkbox(value=False, descriptio…

In [19]:
def view_surface(df, cutoff_distance, *args):
    # note: cutoff_distance is not used by this view

    def view3d(show_bio_assembly=False, show_surface=False, show_labels=True, i=0):
        pdb_id, chain_id = df.iloc[i]['pdbChainId'].split('.')
        res_num = list(df.iloc[i]['pdbResNums'])
        labels = df.iloc[i]['snps']
        sigs = df.iloc[i]['clinsig']
        
        sig_dir = {'Benign':'green', 'Likely benign':'turquoise', 'Likely pathogenic':'palevioletred', \
                   'Pathogenic':'red', 'drug-response':'plum', 'untested':'white', \
                   'Uncertain significance': 'lightgray', 'other':'white', 'null':'white'}
        
        # print header
        print ("PDB Id: " + pdb_id + " chain Id: " + chain_id)
        
        # print any specified additional columns from the dataframe
        for a in args:
            print(a + ": " + str(df.iloc[i][a]))
            
        viewer = py3Dmol.view(query='pdb:' + pdb_id, options={'doAssembly': show_bio_assembly})

        all_residues = {'resi': res_num, 'chain': chain_id}
        
        # polymer style
        viewer.setStyle({'sphere': {'colorscheme': 'chain', 'opacity':0.6}})
        
        # non-polymer style
        viewer.setStyle({'hetflag': True}, {'stick':{'radius': 0.3, 'singleBond': False}})

        # mutation style
        for label, res, sig in zip(labels, res_num, sigs):
            sig1 = sig.split(',')[0] # if multiple values, use the first one
            col = (sig_dir[sig1])
            mod_res = {'resi': res, 'chain': chain_id} 
            viewer.setStyle(mod_res, {'sphere':{'color':col}})
        
            if show_labels:
                viewer.addLabel(label + " " + sig, {'fontSize':10,'fontColor':col,'backgroundColor':'ivory'}, {'resi': res, 'chain': chain_id})
        
        viewer.zoomTo(all_residues)        
        
        if show_surface:
            viewer.addSurface(py3Dmol.SES,{'opacity':0.8,'color':'lightblue'})

        return viewer.show()
       
    s_widget = IntSlider(min=0, max=len(df)-1, description='Structure', continuous_update=False)
    
    return interact(view3d, show_bio_assembly=False, show_surface=False, show_labels=True, i=s_widget)

In [20]:
view_surface(df, 6, 'uniprotId');

interactive(children=(Checkbox(value=False, description='show_bio_assembly'), Checkbox(value=False, descriptio…

In [21]:
spark.stop()