Merge pull request #15 from pwrose/master
added presentations
pwrose committed May 6, 2018
2 parents 3632f4a + 0bc8331 commit 47af07a
Showing 4 changed files with 330 additions and 0 deletions.
Binary file added 0-introduction/MMTF2018-Introduction.pdf
Binary file not shown.
Binary file added 1-3D-visualization/MMTF2018-3D-Visualization.pdf
Binary file not shown.
1 change: 1 addition & 0 deletions 4-mmtf-pyspark-advanced/3-MutationsToStructure.ipynb
@@ -4181,6 +4181,7 @@
"metadata": {},
"outputs": [],
"source": [
"psubstring_index(df.s, '.', 2).alias('s')\n",
"pdb_ids = positions.select(\"structureId\").rdd.flatMap(lambda x: x).collect()\n",
"chain_ids = positions.select(\"chainId\").rdd.flatMap(lambda x: x).collect()\n",
"group_numbers = positions.select(\"pdbPosition\").rdd.flatMap(lambda x: x).collect()\n",
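Both notebooks in this commit use Spark's `substring_index` to split dot-separated identifiers such as `5YQG.C`. As a rough illustration of its semantics, here is a pure-Python sketch. `substring_index_py` is a hypothetical helper written for this note, not part of pyspark. It assumes the documented behavior: a positive count keeps everything before the count-th delimiter, a negative count keeps everything after the count-th delimiter counted from the right, and a zero count or an empty delimiter gives an empty string.

```python
def substring_index_py(text, delim, count):
    """Pure-Python approximation of Spark's substring_index (hypothetical helper)."""
    # Assumption: a zero count or an empty delimiter yields an empty string.
    if count == 0 or delim == "":
        return ""
    parts = text.split(delim)
    if count > 0:
        # Keep the first `count` fields, re-joined with the delimiter.
        return delim.join(parts[:count])
    # Negative count: keep the last `abs(count)` fields.
    return delim.join(parts[count:])


# Splitting a structureChainId into its two halves.
structure_id = substring_index_py("5YQG.C", ".", 1)   # "5YQG"
chain_id = substring_index_py("5YQG.C", ".", -1)      # "C"
print(structure_id, chain_id)
```

If the string has fewer delimiters than the requested count, the whole string comes back unchanged, which is why a count of 1 is enough to recover the PDB ID from a `structureChainId`.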
329 changes: 329 additions & 0 deletions 4-mmtf-pyspark-advanced/Solution-1.ipynb
@@ -0,0 +1,329 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Solution-1\n",
"This tutorial shows how to find protein-protein interactions in mouse (Mus musculus) structures in the PDB by combining two steps: \n",
"\n",
"1. A SIFTS taxonomy query through PDBj Mine to select the structures\n",
"2. Polymer interaction fingerprints calculated for the selected structures"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession\n",
"from pyspark.sql.functions import substring_index\n",
"from mmtfPyspark.datasets import pdbjMineDataset\n",
"from mmtfPyspark.webfilters import PdbjMineSearch\n",
"from mmtfPyspark.interactions import InteractionFilter, InteractionFingerprinter\n",
"from mmtfPyspark.io import mmtfReader\n",
"from mmtfPyspark.structureViewer import view_binding_site\n",
"from ipywidgets import interact\n",
"from IPython.display import Markdown, display\n",
"import py3Dmol"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Configure Spark"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"spark = SparkSession.builder.master(\"local[4]\").appName(\"2-JoiningDatasets\").getOrCreate()\n",
"sc = spark.sparkContext"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##\n",
"[See examples](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/demos/datasets/PDBMetaDataDemo.ipynb)\n",
"[SIFTS demo](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/demos/datasets/SiftsDataDemo.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For our first task, we need to run a taxonomy query. To figure out how to query for taxonomy, the command below lists the first 10 entries of the SIFTS taxonomy table. As you can see, we can use the scientific_name field to query for a specific organism."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+-----+------+--------------------+----------------+\n",
"|pdbid|chain|tax_id| scientific_name|structureChainId|\n",
"+-----+-----+------+--------------------+----------------+\n",
"| 101M| A| 9755| PHYCD| 101M.A|\n",
"| 101M| A| 9755| Physeter catodon| 101M.A|\n",
"| 101M| A| 9755|Physeter catodon ...| 101M.A|\n",
"| 101M| A| 9755|Physeter catodon ...| 101M.A|\n",
"| 101M| A| 9755|Physeter macrocep...| 101M.A|\n",
"| 101M| A| 9755| Sperm whale| 101M.A|\n",
"| 101M| A| 9755| sperm whale| 101M.A|\n",
"| 102L| A| 10665| BPT4| 102L.A|\n",
"| 102L| A| 10665| Bacteriophage T4| 102L.A|\n",
"| 102L| A| 10665|Enterobacteria ph...| 102L.A|\n",
"+-----+-----+------+--------------------+----------------+\n",
"\n"
]
}
],
"source": [
"taxonomyQuery = \"SELECT * FROM sifts.pdb_chain_taxonomy LIMIT 10\"\n",
"taxonomy = pdbjMineDataset.get_dataset(taxonomyQuery)\n",
"taxonomy.show()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+-----+------+---------------+----------------+\n",
"|pdbid|chain|tax_id|scientific_name|structureChainId|\n",
"+-----+-----+------+---------------+----------------+\n",
"| 12E8| H| 10090| Mus musculus| 12E8.H|\n",
"| 12E8| L| 10090| Mus musculus| 12E8.L|\n",
"| 12E8| M| 10090| Mus musculus| 12E8.M|\n",
"| 12E8| P| 10090| Mus musculus| 12E8.P|\n",
"| 15C8| H| 10090| Mus musculus| 15C8.H|\n",
"| 15C8| L| 10090| Mus musculus| 15C8.L|\n",
"| 1914| A| 10090| Mus musculus| 1914.A|\n",
"| 1A0Q| H| 10090| Mus musculus| 1A0Q.H|\n",
"| 1A0Q| L| 10090| Mus musculus| 1A0Q.L|\n",
"| 1A14| H| 10090| Mus musculus| 1A14.H|\n",
"+-----+-----+------+---------------+----------------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"taxonomyQuery = \"SELECT * FROM sifts.pdb_chain_taxonomy WHERE scientific_name = 'Mus musculus'\"\n",
"taxonomy = pdbjMineDataset.get_dataset(taxonomyQuery)\n",
"taxonomy.show(10)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"path = \"../resources/mmtf_full_sample/\"\n",
"\n",
"pdb = mmtfReader.read_sequence_file(path, sc)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"pdb = pdb.filter(PdbjMineSearch(taxonomyQuery))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"interactionFilter = InteractionFilter(distanceCutoff=4.5, minInteractions=10)\n",
"\n",
"interactions = InteractionFingerprinter.get_polymer_interactions(pdb, interactionFilter).cache()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------------+------------+-------------+--------------------+--------------------+--------------------+-----------+\n",
"|structureChainId|queryChainId|targetChainId| groupNumbers| sequenceIndices| sequence|structureId|\n",
"+----------------+------------+-------------+--------------------+--------------------+--------------------+-----------+\n",
"| 5YQG.C| F| C|[132, 133, 177, 1...|[62, 137, 138, 18...|GPGSEFMGDREQLLQRA...| 5YQG|\n",
"| 5YQG.D| F| D|[125, 132, 133, 1...|[55, 62, 130, 137...|GPGSEFMGDREQLLQRA...| 5YQG|\n",
"| 1GCQ.A| C| A|[165, 167, 168, 1...|[8, 10, 11, 13, 1...|GSTYVQALFDFDPQEDG...| 1GCQ|\n",
"| 1GCQ.C| A| C|[606, 607, 608, 6...|[15, 16, 17, 18, ...|GSHMPKMEVFQEYYGIP...| 1GCQ|\n",
"| 1GCQ.B| C| B|[162, 163, 164, 1...|[5, 6, 7, 8, 22, ...|GSTYVQALFDFDPQEDG...| 1GCQ|\n",
"| 1GCQ.C| B| C|[592, 593, 594, 5...|[1, 2, 3, 4, 5, 6...|GSHMPKMEVFQEYYGIP...| 1GCQ|\n",
"| 1GL4.A| B| A|[401, 403, 405, 4...|[48, 50, 52, 74, ...|APLAQQTCANNRHQCSV...| 1GL4|\n",
"| 1GL4.B| A| B|[1799, 1800, 1801...|[38, 39, 40, 41, ...|APLAAPSKPIMVTVEEQ...| 1GL4|\n",
"| 1GML.A| D| A|[213, 219, 316, 3...|[4, 10, 107, 109,...|MEDSCVLRGVMINKDVT...| 1GML|\n",
"| 1GML.D| A| D|[212, 213, 214, 2...|[3, 4, 5, 6, 10, ...|MEDSCVLRGVMINKDVT...| 1GML|\n",
"| 4I2A.A| C| A|[255, 256, 257, 2...|[144, 145, 146, 1...|MGSSHHHHHHSSGLVPR...| 4I2A|\n",
"| 4M48.A| H| A|[337, 338, 498, 5...|[70, 274, 275, 43...|MNSISDERETWSGKVDF...| 4M48|\n",
"| 4M48.H| A| H|[100, 101, 102, 1...|[51, 68, 70, 71, ...|MNFGLRLVFLVLILKGV...| 4M48|\n",
"| 4M48.L| H| L|[117, 119, 120, 1...|[53, 54, 56, 58, ...|MDFQVQIFSFLLISASV...| 4M48|\n",
"| 4M48.H| L| H|[102, 103, 104, 1...|[55, 57, 61, 62, ...|MNFGLRLVFLVLILKGV...| 4M48|\n",
"| 3ZY7.A| B| A|[10, 100, 103, 10...|[9, 10, 11, 12, 4...|GSMIPSITAYDALGLKI...| 3ZY7|\n",
"| 3ZY7.B| A| B|[10, 100, 103, 10...|[8, 9, 10, 11, 12...|GSMIPSITAYDALGLKI...| 3ZY7|\n",
"| 3ZZY.B| D| B|[193, 199, 202, 2...|[37, 43, 46, 47, ...|SVQSGNLALAASAAAVD...| 3ZZY|\n",
"| 4NN5.A| C| A|[130, 133, 134, 1...|[14, 15, 19, 23, ...|YNFSNCNFTSITKIYCN...| 4NN5|\n",
"| 4NN5.C| A| C|[106, 107, 108, 1...|[68, 69, 70, 71, ...|AAAVTSRGDVTVVCHDL...| 4NN5|\n",
"+----------------+------------+-------------+--------------------+--------------------+--------------------+-----------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"interactions = interactions.withColumn(\"structureId\", substring_index(interactions.structureChainId, '.', 1)).cache()\n",
"interactions.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize protein-protein interactions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Extract id columns as lists (required for visualization)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"structure_ids = interactions.select(\"structureId\").rdd.flatMap(lambda x: x).collect()\n",
"query_chain_ids = interactions.select(\"queryChainId\").rdd.flatMap(lambda x: x).collect()\n",
"target_chain_ids = interactions.select(\"targetChainId\").rdd.flatMap(lambda x: x).collect()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Disable scrollbar for the visualization below"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%javascript \n",
"IPython.OutputArea.prototype._should_scroll = function(lines) {return false;}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Show interface residues within 4.5 Å of the partner chain"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ipywidgets import IntSlider\n",
"\n",
"def view_protein_protein_interactions(structure_ids, query_chain_ids, target_chain_ids, distance=4.5):\n",
"\n",
"    def view3d(i=0):\n",
"        print(f\"PDB: {structure_ids[i]}, query: {query_chain_ids[i]}, target: {target_chain_ids[i]}\")\n",
"\n",
"        # Residues of each chain with an atom within `distance` of the partner chain\n",
"        # ('within' and 'byres' as described in the 3Dmol.js atom selection spec)\n",
"        query = {'chain': query_chain_ids[i], 'byres': True,\n",
"                 'within': {'distance': distance, 'sel': {'chain': target_chain_ids[i]}}}\n",
"        target = {'chain': target_chain_ids[i], 'byres': True,\n",
"                  'within': {'distance': distance, 'sel': {'chain': query_chain_ids[i]}}}\n",
"        viewer = py3Dmol.view(query='pdb:' + structure_ids[i])\n",
"        viewer.setStyle(query, {'stick': {'colorscheme': 'orangeCarbon'}})\n",
"        viewer.setStyle(target, {'stick': {'colorscheme': 'lightblueCarbon'}})\n",
"        viewer.zoomTo(query)\n",
"\n",
"        return viewer.show()\n",
"\n",
"    s_widget = IntSlider(min=0, max=len(structure_ids)-1, description='Structure', continuous_update=False)\n",
"    return interact(view3d, i=s_widget)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"view_protein_protein_interactions(structure_ids, query_chain_ids, target_chain_ids, distance=4.5);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spark.stop()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
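
A Python dict literal that repeats a key, such as two `'chain'` entries in a py3Dmol selection, keeps only the last value, so one of the chains is silently dropped. The sketch below shows that behavior and one possible way to express "residues of one chain near another chain" as a nested selection. `interface_selection` is a hypothetical helper written for this note. The `within` and `byres` keys are assumed to follow the 3Dmol.js atom selection spec and should be checked against the py3Dmol version in use.

```python
def interface_selection(chain, partner, distance=4.5):
    """Build a 3Dmol-style selection dict (hypothetical helper, not a py3Dmol API).

    Selects whole residues of `chain` that have at least one atom within
    `distance` of any atom in `partner`.
    """
    return {
        "chain": chain,
        "byres": True,  # expand matched atoms to full residues
        "within": {"distance": distance, "sel": {"chain": partner}},
    }


# A repeated key in a dict literal keeps only the last value.
collapsed = {"chain": "F", "chain": "C"}
print(collapsed)

# One selection per side of the interface, using chain IDs from a single row.
query = interface_selection("F", "C")
target = interface_selection("C", "F")
print(query)
print(target)
```

Keeping the partner chain inside the nested `sel` dict avoids the key collision and makes the direction of each selection explicit.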
