# 3-Webfiltering
Webfilters access metadata from external resources to filter PDB structures. For details see [filters](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/mmtfPyspark/webfilters) and [demos](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/demos/webfilters).

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.webfilters import AdvancedQuery, ChemicalStructureQuery, PdbjMineSearch
from mmtfPyspark.structureViewer import view_structure

#### Configure Spark

In [2]:
spark = SparkSession.builder.appName("3-Webfiltering").getOrCreate()

#### Read PDB structures

In [3]:
path = "../resources/mmtf_reduced_sample"

pdb = mmtfReader.read_sequence_file(path).cache()

# NOTE: RCSB PDB's AdvancedQuery web service has been deprecated!!!

## Use RCSB PDB's AdvancedQuery web service to filter structures
The AdvancedQuery filter can run any query supported by RCSB PDB's [Advanced Search](https://www.rcsb.org/pdb/staticHelp.do?p=help/advancedSearch.html). 

RCSB PDB advanced queries are described in XML. Here is how you can create the query XML:

1. Run an advanced query on the [RCSB PDB website](http://www2.rcsb.org/pdb/search/advSearch.do?search=new)
2. On the results page, use your browser to find the text: **Query Details**
3. Click on the blue **Query Details** button and copy the query in XML format

Here we run a 'Protein Stoichiometry' query for bioassemblies with the stoichiometry `A3B3C3` (trimer of trimers). 

In [4]:
# query = (
#     "<orgPdbQuery>"
#          "<queryType>org.pdb.query.simple.StoichiometryQuery</queryType>"
#          "<stoichiometry>A3B3C3</stoichiometry>"
#     "</orgPdbQuery>"
# )

# trimer_of_trimers = pdb.filter(AdvancedQuery(query))
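
The query string in the cell above can also be assembled programmatically instead of being copied from the website. The following is a minimal sketch using only the Python standard library. The element names and the query type are taken from the cell above; the helper name `stoichiometry_query` is hypothetical and not part of mmtfPyspark. Because the web service is deprecated, the sketch only builds the XML string and does not send it anywhere.

```python
import xml.etree.ElementTree as ET


def stoichiometry_query(stoichiometry):
    """Build the query XML for a stoichiometry search (hypothetical helper)."""
    root = ET.Element("orgPdbQuery")
    # Element names and query type are copied from the commented-out cell above.
    ET.SubElement(root, "queryType").text = "org.pdb.query.simple.StoichiometryQuery"
    ET.SubElement(root, "stoichiometry").text = stoichiometry
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


xml_query = stoichiometry_query("A3B3C3")
print(xml_query)
```

Building the string through `ElementTree` escapes special characters in the text values, which hand-concatenated strings do not.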

In [5]:
#view_structure(trimer_of_trimers.keys().collect(), bioAssembly=True);

## Use Chemical Structure Filter to search by SMILES String
The ChemicalStructureQuery filter is a wrapper around an RCSB PDB Advanced Search. Here, we search for ligands that contain a biotin substructure.

In [6]:
# smiles = "OC(=O)CCCC[C@@H]1SC[C@@H]2NC(=O)N[C@H]12"
# result = pdb.filter(ChemicalStructureQuery(smiles, ChemicalStructureQuery.SUBSTRUCTURE))

In [7]:
# result.keys().collect()

In [8]:
# view_structure(result.keys().collect(), bioAssembly=True);

## Use PDBj's Mine Search to filter by PDB metadata
The PdbjMine search lets you query all metadata contained in the PDB archive. The metadata are stored in the PDBx/mmCIF files.

The example below shows the audit_author and citation_author categories. Each category represents a relational table that can be queried using SQL, and its fields represent the table's columns (see
https://pdbj.org/mine-rdb-docs for available tables and columns).

Metadata example from [4P6I.cif](https://files.rcsb.org/view/4P6I.cif):
 <pre>
 loop_
 _audit_author.name 
 _audit_author.pdbx_ordinal 
 ...
 'Doudna, J.A.'    4 
 # 
 loop_
 _citation_author.citation_id 
 _citation_author.name 
 _citation_author.ordinal 
 ...
 primary 'Doudna, J.A.'    6
 </pre>

Here we query the name fields in audit_author and citation_author tables.

In [9]:
query = (
    "SELECT pdbid from audit_author "
    "WHERE name LIKE 'Doudna%J.A.%' "
    "UNION "
    "SELECT pdbid from citation_author "
    "WHERE citation_id = 'primary' AND name LIKE 'Doudna%J.A.%'"
)
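
To see what the query above selects without calling the web service, the same SQL can be run against a small in-memory SQLite database. This is only an illustrative sketch: the table layouts are an assumption based on the mmCIF excerpt and the column names used in the query, the `4P6I` rows mirror that excerpt, and the IDs `0AAA`, `0BBB`, and `0CCC` are invented placeholders. PDBj Mine itself is not SQLite, so behavior may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified table layouts (only the columns used here).
conn.executescript("""
    CREATE TABLE audit_author (pdbid TEXT, name TEXT, pdbx_ordinal INTEGER);
    CREATE TABLE citation_author (pdbid TEXT, citation_id TEXT, name TEXT, ordinal INTEGER);
""")
conn.executemany(
    "INSERT INTO audit_author VALUES (?, ?, ?)",
    [("4P6I", "Doudna, J.A.", 4), ("0AAA", "Other, A.B.", 1)],
)
conn.executemany(
    "INSERT INTO citation_author VALUES (?, ?, ?, ?)",
    [
        ("4P6I", "primary", "Doudna, J.A.", 6),
        ("0BBB", "primary", "Doudna, J.A.", 1),
        ("0CCC", "1", "Doudna, J.A.", 2),  # not the primary citation
    ],
)

toy_query = (
    "SELECT pdbid from audit_author "
    "WHERE name LIKE 'Doudna%J.A.%' "
    "UNION "
    "SELECT pdbid from citation_author "
    "WHERE citation_id = 'primary' AND name LIKE 'Doudna%J.A.%'"
)

# UNION removes duplicates, so 4P6I appears once although both tables match.
matching_ids = sorted(row[0] for row in conn.execute(toy_query))
print(matching_ids)
conn.close()
```

In this toy data, `0CCC` is excluded because its citation is not the primary one, and `0AAA` is excluded because the author name does not match.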

In [10]:
doudna_structures = pdb.filter(PdbjMineSearch(query))
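
Conceptually, a web filter of this kind can be thought of as a callable that keeps only those `(key, structure)` pairs whose key is among the IDs returned by the service. The sketch below illustrates that idea on a plain Python list. `IdSetFilter` is a hypothetical stand-in, not the actual `PdbjMineSearch` implementation, and the assumption that keys may carry a suffix after the four-character PDB ID is ours.

```python
class IdSetFilter:
    """Hypothetical stand-in for a web filter: keeps pairs whose key matches a set of PDB IDs."""

    def __init__(self, pdb_ids):
        # Normalize case so "4p6i" and "4P6I" compare equal.
        self.pdb_ids = {pdb_id.upper() for pdb_id in pdb_ids}

    def __call__(self, pair):
        key, _structure = pair
        # Assumption: a key may carry a suffix, so compare only the first four characters.
        return key[:4].upper() in self.pdb_ids


# Toy (key, structure) pairs; the structures are placeholder strings.
toy_entries = [("4P6I", "structure-a"), ("0AAA", "structure-b"), ("4P6I.A", "structure-c")]
kept = list(filter(IdSetFilter(["4p6i"]), toy_entries))
kept_keys = [key for key, _ in kept]
print(kept_keys)
```

Because the object is an ordinary callable, the same pattern works with Python's built-in `filter` here and with any API that accepts a predicate function.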

In [11]:
view_structure(doudna_structures.keys().collect());


## Use PDBj's Mine to Search by SIFTS metadata
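As above, each category represents a table, and its fields represent database columns (see
https://pdbj.org/mine-rdb-docs for available tables and columns). Here we query the `sifts.pdb_chain_enzyme` table for entries annotated with EC number 2.7.11.1.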

In [12]:
query = "SELECT * from sifts.pdb_chain_enzyme WHERE ec_number = '2.7.11.1'"

kinases = pdb.filter(PdbjMineSearch(query))
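
To reuse the query above for other enzyme classes, the EC number can be turned into a parameter. The following is a minimal sketch: the helper `sifts_ec_query` is hypothetical, and the validation pattern (four dot-separated fields, each a number or `-` after the first) is a simplifying assumption that rejects some valid EC notations. Validating the value before inserting it into the SQL text guards against malformed input.

```python
import re

# Simplified EC number pattern: e.g. "2.7.11.1" or "2.7.11.-" (assumption, not exhaustive).
EC_PATTERN = re.compile(r"\d+(\.(\d+|-)){3}")


def sifts_ec_query(ec_number):
    """Build the SIFTS enzyme query for one EC number (hypothetical helper)."""
    if not EC_PATTERN.fullmatch(ec_number):
        raise ValueError(f"not a recognized EC number: {ec_number!r}")
    # Table and column names are taken from the query in the cell above.
    return f"SELECT * from sifts.pdb_chain_enzyme WHERE ec_number = '{ec_number}'"


ec_query = sifts_ec_query("2.7.11.1")
print(ec_query)
```

The resulting string could then be passed to `PdbjMineSearch` in the same way as the hand-written query.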

In [13]:
view_structure(kinases.keys().collect(), bioAssembly=True);


In [14]:
spark.stop()