# AdvancedSearchDataset Demo

This demo shows how to run an [RCSB PDB Advanced Search](https://www.rcsb.org/pdb/staticHelp.do?p=help/advancedSearch.html) and return a dataset of matching identifiers. Depending on the type of query, the returned dataset contains one of the following fields:
* pdbId for structure-based queries (e.g., 4HHB)
* pdbChainId for entity-based queries (e.g., 4HHB.A)
* ligandId for ligand-based queries (e.g., HEM)
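
Because the identifier column differs by query type, downstream code may need to check which one it received. The following is a minimal sketch, assuming the column names that appear in the outputs of this demo (`pdbId`, `pdbChainId`, `ligandId`); `ID_COLUMNS` and `result_kind` are hypothetical names for illustration and are not part of mmtfPyspark.

```python
# Identifier columns as they appear in the outputs of this demo,
# mapped to the kind of query that produces them.
ID_COLUMNS = {
    "pdbId": "structure",
    "pdbChainId": "entity",
    "ligandId": "ligand",
}


def result_kind(columns):
    """Return the query kind implied by a list of column names (hypothetical helper)."""
    for name in columns:
        if name in ID_COLUMNS:
            return ID_COLUMNS[name]
    raise ValueError(f"no known identifier column in {list(columns)}")


print(result_kind(["pdbChainId"]))  # prints "entity"
```

With a Spark DataFrame `ds`, the list of column names is available as `ds.columns`, so the call would be `result_kind(ds.columns)`.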

[Index of available queries](https://www.rcsb.org/pdb/staticHelp.do?p=help/advancedsearch/index.html)

RCSB PDB advanced queries are described in XML. To obtain the XML for a query, go to www.rcsb.org and follow these steps:

1. Run an advanced query on the RCSB PDB website
2. On the results page, use your browser's find function to locate the text "Query Details"
3. Click on the blue Query Details button and copy the query in XML format
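
Once the element names for a query type are known from the copied XML, the same string can also be assembled programmatically instead of being pasted by hand. This is only a sketch: `build_query` is a hypothetical helper, not part of mmtfPyspark, and it assumes that every parameter is a simple child element of `<orgPdbQuery>`, as in the queries used in this demo.

```python
import xml.etree.ElementTree as ET


def build_query(query_type, **params):
    """Assemble an <orgPdbQuery> XML string (hypothetical helper).

    Each keyword argument becomes one child element, in the order given.
    """
    root = ET.Element("orgPdbQuery")
    ET.SubElement(root, "queryType").text = query_type
    for tag, value in params.items():
        ET.SubElement(root, tag).text = str(value)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


structure_query = build_query(
    "org.pdb.query.simple.StoichiometryQuery",
    stoichiometry="A3B3C3",
)
print(structure_query)
```

The element names and values still have to match what the Query Details page shows; the helper only saves the manual string concatenation and escapes special characters in the values.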

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.datasets import advancedSearchDataset

## Configure Spark

In [2]:
spark = SparkSession.builder.appName("AdvancedSearchDatasetDemo").getOrCreate()

## Structure-based query
Here we search for PDB entries that form a biological assembly with A3B3C3 stoichiometry (trimer of trimers).

In [3]:
structure_query = ("<orgPdbQuery>"
                   "<queryType>org.pdb.query.simple.StoichiometryQuery</queryType>"
                   "<stoichiometry>A3B3C3</stoichiometry>"
                   "</orgPdbQuery>"
                   )
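
A query string pasted from the website is easy to truncate or mistype, so it can help to check that it is well-formed XML before submitting it. The sketch below uses only the Python standard library; `check_query` is a hypothetical helper, and it assumes the `<orgPdbQuery>` root and `<queryType>` child seen in the queries of this demo.

```python
import xml.etree.ElementTree as ET


def check_query(xml_text):
    """Return the queryType of a query string (hypothetical helper).

    Raises ET.ParseError for malformed XML and ValueError if the
    expected root element or <queryType> child is missing.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "orgPdbQuery":
        raise ValueError(f"unexpected root element: {root.tag}")
    query_type = root.findtext("queryType")
    if not query_type:
        raise ValueError("missing <queryType> element")
    return query_type


query = ("<orgPdbQuery>"
         "<queryType>org.pdb.query.simple.StoichiometryQuery</queryType>"
         "<stoichiometry>A3B3C3</stoichiometry>"
         "</orgPdbQuery>")
print(check_query(query))  # prints org.pdb.query.simple.StoichiometryQuery
```

This only catches structural mistakes; whether the query type and its parameters are accepted is still decided by the RCSB PDB service.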

In [4]:
ds = advancedSearchDataset.get_dataset(structure_query)
ds.show()

+-----+
|pdbId|
+-----+
| 1A5K|
| 1A5L|
| 1A5M|
| 1A5N|
| 1A5O|
| 1DGR|
| 1DGW|
| 1E0F|
| 1EF2|
| 1EJR|
| 1EJS|
| 1EJT|
| 1EJU|
| 1EJV|
| 1EJW|
| 1EJX|
| 1FWA|
| 1FWB|
| 1FWC|
| 1FWD|
+-----+
only showing top 20 rows



## Entity-based query
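Here we use a BLAST sequence search to find protein chains similar to a query sequence, with an E-value cutoff of 0.001 and a sequence identity cutoff of 40%.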

In [5]:
entity_query = ("<orgPdbQuery>"
                 "<queryType>org.pdb.query.simple.SequenceQuery</queryType>"
                 "<sequence>DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ</sequence>"
                 "<searchTool>blast</searchTool>"
                 "<maskLowComplexity>yes</maskLowComplexity>"
                 "<eValueCutoff>0.001</eValueCutoff>"
                 "<sequenceIdentityCutoff>40</sequenceIdentityCutoff>"
                 "</orgPdbQuery>"
                )

In [6]:
ds = advancedSearchDataset.get_dataset(entity_query)
ds.show()

+----------+
|pdbChainId|
+----------+
|    1DF8.B|
|    1DF8.A|
|    1HQQ.D|
|    1HQQ.C|
|    1HQQ.B|
|    1HQQ.A|
|    1HXL.B|
|    1HXL.A|
|    1HXZ.B|
|    1HXZ.A|
|    1HY2.D|
|    1HY2.C|
|    1HY2.B|
|    1HY2.A|
|    1I9H.B|
|    1I9H.A|
|    1KFF.D|
|    1KFF.C|
|    1KFF.B|
|    1KFF.A|
+----------+
only showing top 20 rows
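
Each `pdbChainId` value combines a PDB ID and a chain ID separated by a period. A minimal sketch of splitting them in plain Python follows, using a few values from the output above; `split_chain_id` is a hypothetical helper name, not part of mmtfPyspark.

```python
def split_chain_id(pdb_chain_id):
    """Split an identifier such as '1DF8.B' into (pdb_id, chain_id)."""
    pdb_id, chain_id = pdb_chain_id.split(".", 1)
    return pdb_id, chain_id


# A few values taken from the output shown above
chain_ids = ["1DF8.B", "1DF8.A", "1HQQ.D"]

pairs = [split_chain_id(c) for c in chain_ids]
unique_pdb_ids = sorted({pdb_id for pdb_id, _ in pairs})

print(pairs)           # [('1DF8', 'B'), ('1DF8', 'A'), ('1HQQ', 'D')]
print(unique_pdb_ids)  # ['1DF8', '1HQQ']
```

For large result sets it is probably better to stay in Spark, for example with `pyspark.sql.functions.split`, which takes a regular expression, so the period has to be escaped as `"\\."`.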



## Ligand-based query
Here we run a substructure search with a SMILES string to find ligands that contain the given chemical fragment.

In [7]:
ligand_query = ("<orgPdbQuery>"
                "<queryType>org.pdb.query.simple.ChemSmilesQuery</queryType>"
                "<smiles>CC(C)C1=C(Br)C(=O)C(C)=C(Br)C1=O</smiles>"
                "<target>Ligand</target>"
                "<searchType>Substructure</searchType>"
                "<polymericType>Any</polymericType>"
                "</orgPdbQuery>"
               )

In [8]:
ds = advancedSearchDataset.get_dataset(ligand_query)
ds.show()

+--------+
|ligandId|
+--------+
|     BNT|
+--------+



In [9]:
spark.stop()