# AdvancedSearchDataset Demo

This demo shows how to run an [RCSB PDB Advanced Search](https://www.rcsb.org/pdb/staticHelp.do?p=help/advancedSearch.html) and return a dataset of matching identifiers. Depending on the type of query, the returned dataset contains one of the following fields:
* structureId for structure-based queries (e.g., 4HHB)
* structureChainId for entity-based queries (e.g., 4HHB.A)
* ligandId for ligand-based queries (e.g., HEM)
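
Downstream code often needs the PDB ID and the chain ID separately. The following is a minimal sketch, assuming that structureChainId values always follow the `<structureId>.<chainId>` pattern shown in the example above; `ChainRef` and `split_structure_chain_id` are hypothetical helpers and not part of mmtfPyspark.

```python
from typing import NamedTuple


class ChainRef(NamedTuple):
    structure_id: str  # e.g. "4HHB"
    chain_id: str      # e.g. "A"


def split_structure_chain_id(structure_chain_id: str) -> ChainRef:
    """Split an identifier such as '4HHB.A' into its PDB ID and chain ID."""
    # Split on the first dot, which separates the two parts in the example above
    structure_id, sep, chain_id = structure_chain_id.partition(".")
    if not sep or not structure_id or not chain_id:
        raise ValueError(
            f"expected '<structureId>.<chainId>', got {structure_chain_id!r}"
        )
    return ChainRef(structure_id, chain_id)


print(split_structure_chain_id("4HHB.A"))
# ChainRef(structure_id='4HHB', chain_id='A')
```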

[Index of available queries](https://www.rcsb.org/pdb/staticHelp.do?p=help/advancedsearch/index.html)

RCSB PDB advanced queries are described in XML. To create the query XML, go to the www.rcsb.org website and follow these steps:

1. Run an advanced query on the RCSB PDB website
2. On the results page, use your browser's find function to locate the text "Query Details"
3. Click the blue Query Details button and copy the query in XML format
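
XML copied from the website can pick up stray characters or lose a closing tag, so it may help to check it before submitting, or to assemble it programmatically. The sketch below uses only the Python standard library. `build_query` and `check_query` are hypothetical helpers, not part of mmtfPyspark, and the check assumes that every query is an `<orgPdbQuery>` element containing a `<queryType>` child, as in the queries used in this demo.

```python
import xml.etree.ElementTree as ET


def build_query(query_type: str, **params: str) -> str:
    """Return an <orgPdbQuery> XML string (hypothetical helper)."""
    root = ET.Element("orgPdbQuery")
    ET.SubElement(root, "queryType").text = query_type
    # Each keyword argument becomes a child element, in the order given
    for tag, value in params.items():
        ET.SubElement(root, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


def check_query(xml_text: str) -> str:
    """Return xml_text unchanged, or raise ValueError if it looks malformed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        raise ValueError(f"query is not well-formed XML: {err}") from err
    # Assumption: all queries share this root element and a queryType child
    if root.tag != "orgPdbQuery" or root.find("queryType") is None:
        raise ValueError("expected <orgPdbQuery> containing <queryType>")
    return xml_text


query = build_query("org.pdb.query.simple.StoichiometryQuery",
                    stoichiometry="A3B3C3")
print(check_query(query))
```

Building the string with `ElementTree` also escapes special characters such as `&` or `<` in parameter values, which manual string concatenation does not.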

In [None]:
from pyspark.sql import SparkSession
from mmtfPyspark.datasets import advancedSearchDataset

## Configure Spark

In [None]:
spark = SparkSession.builder.master("local[4]").appName("AdvancedSearchDatasetDemo").getOrCreate()

## Structure-based query
Here we search for PDB entries that form a biological assembly with A3B3C3 stoichiometry (a trimer of trimers).

In [None]:
structure_query = ("<orgPdbQuery>"
                   "<queryType>org.pdb.query.simple.StoichiometryQuery</queryType>"
                   "<stoichiometry>A3B3C3</stoichiometry>"
                   "</orgPdbQuery>"
                   )

In [None]:
ds = advancedSearchDataset.get_dataset(structure_query)
ds.show()

## Entity-based query
Here we use a BLAST search to find protein chains similar to a query sequence, with an E-value cutoff of 0.001 and a sequence identity cutoff of 40%.

In [None]:
entity_query = ("<orgPdbQuery>"
                 "<queryType>org.pdb.query.simple.SequenceQuery</queryType>"
                 "<sequence>DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ</sequence>"
                 "<searchTool>blast</searchTool>"
                 "<maskLowComplexity>yes</maskLowComplexity>"
                 "<eValueCutoff>0.001</eValueCutoff>"
                 "<sequenceIdentityCutoff>40</sequenceIdentityCutoff>"
                 "</orgPdbQuery>"
                )

In [None]:
ds = advancedSearchDataset.get_dataset(entity_query)
ds.show()
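
A long sequence pasted into a query like the one above can easily pick up whitespace or stray characters. The sketch below is one way to check the `<sequence>` element before submitting. `unexpected_residues` is a hypothetical helper, and restricting the check to the 20 standard one-letter amino-acid codes is an assumption; the search service may also accept other codes.

```python
import xml.etree.ElementTree as ET

# One-letter codes of the 20 standard amino acids
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def unexpected_residues(query_xml: str) -> set:
    """Return the characters in <sequence> that are not standard codes."""
    # findtext returns the default if the query has no <sequence> element
    sequence = ET.fromstring(query_xml).findtext("sequence", default="")
    return set(sequence) - STANDARD_AMINO_ACIDS


# Shortened example query; the sequence is a prefix of the one used above
example_query = (
    "<orgPdbQuery>"
    "<queryType>org.pdb.query.simple.SequenceQuery</queryType>"
    "<sequence>DPSKDSKAQVSAAEAGITGTWYNQ</sequence>"
    "</orgPdbQuery>"
)
print(unexpected_residues(example_query))  # an empty set means nothing unexpected
```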

## Ligand-based query
Here we use a substructure search to find ligands that contain the fragment given as a SMILES string.

In [None]:
ligand_query = ("<orgPdbQuery>"
                "<queryType>org.pdb.query.simple.ChemSmilesQuery</queryType>"
                "<smiles>CC(C)C1=C(Br)C(=O)C(C)=C(Br)C1=O</smiles>"
                "<target>Ligand</target>"
                "<searchType>Substructure</searchType>"
                "<polymericType>Any</polymericType>"
                "</orgPdbQuery>"
               )

In [None]:
ds = advancedSearchDataset.get_dataset(ligand_query)
ds.show()

In [None]:
spark.stop()