# 3-Webfiltering
Webfilters use web services to query metadata from the PDB. Webfilters can be used to:
1. Download specific structures or chains that match a query
2. Filter an existing set of structures by a query

For details see [filters](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/mmtfPyspark/webfilters) and [demos](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/demos/webfilters).

The following webfilters are supported:
* AdvancedQuery - runs an RCSB PDB advanced query
* SequenceSimilarity - runs a sequence similarity search using MMseqs2 on PDB protein and nucleic acid sequences
* ChemicalStructureQuery - runs a substructure and similarity query on chemical compounds using a SMILES string
* PdbjMineSearch - runs an SQL query on PDBj's PDB database

In [28]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.webfilters import AdvancedQuery, SequenceSimilarity, ChemicalStructureQuery, PdbjMineSearch
from mmtfPyspark.structureViewer import view_structure

#### Configure Spark

In [29]:
spark = SparkSession.builder.appName("3-Webfiltering").getOrCreate()

## 1. Download specific structures or chains that match a query

### AdvancedQuery
The AdvancedQuery runs RCSB PDB's [Advanced Search](https://www.rcsb.org/search/advanced). 

RCSB PDB advanced queries are expressed in JSON format. The following example shows how to create a query in JSON format and how to use it to download the set of structures that match the query.

In [30]:
def download_from_query(webfilter):
    """Download the structures, or polymer chains, that match a webfilter query."""
    from mmtfPyspark.mappers import StructureToPolymerChains

    # IDs returned by the web service for this query
    structure_ids = webfilter.get_structure_ids()

    # keep only the part of each ID before the first '_' (the PDB entry ID)
    pdb_ids = [sid.split('_')[0] for sid in structure_ids]

    structures = mmtfReader.download_full_mmtf_files(pdb_ids)

    # for polymer entities, return only the polymer chains that match the query
    if webfilter.get_result_type() == 'polymer_entity':
        structures = structures.flatMap(StructureToPolymerChains())
        structures = structures.filter(webfilter)

    return structures
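
The ID handling in `download_from_query` can be tried out without Spark. The sketch below assumes, for illustration only, that entity-level IDs look like `<PDB ID>_<entity number>`. The example ID strings and the helper name `to_pdb_ids` are hypothetical and not part of mmtfPyspark. Unlike the function above, it also drops duplicate entry IDs, which may be useful when several entities of one entry match a query.

```python
def to_pdb_ids(structure_ids):
    """Reduce IDs such as '4HHB_1' to unique PDB entry IDs, keeping first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(sid.split('_')[0] for sid in structure_ids))


# Hypothetical IDs, chosen only to show the format
example_ids = ['4HHB_1', '4HHB_2', '1A3N_1', '6VXX']
print(to_pdb_ids(example_ids))  # ['4HHB', '1A3N', '6VXX']
```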

#### Advanced Query for Structures
This query returns structures with the UniProt accession P0DTC2 (SARS-CoV-2 Spike glycoprotein) and the oligomeric state Homo 3-mer.

In [34]:
structure_query1 = '{"query":{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession","operator":"in","negation":false,"value":["P0DTC2"]}},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name","operator":"exact_match","value":"UniProt","negation":false}}],"label":"nested-attribute"},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_struct_symmetry.oligomeric_state","operator":"exact_match","negation":false,"value":"Homo 3-mer"}}],"label":"text"},"return_type":"entry","request_options":{"pager":{"start":0,"rows":25},"scoring_strategy":"combined","sort":[{"sort_by":"score","direction":"desc"}]}}'
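
A long single-line JSON string is hard to read and edit. As a possible alternative, the same query can be assembled as a Python dictionary and serialized with `json.dumps`. This is a minimal sketch: the helper `terminal` and the names `query_dict` and `query_json` are hypothetical, and the attribute names are copied from the query string above.

```python
import json


def terminal(attribute, operator, value):
    """Build one 'terminal' text-search node (hypothetical helper)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": attribute,
            "operator": operator,
            "negation": False,
            "value": value,
        },
    }


# Common prefix of the two nested reference-sequence attributes
prefix = "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers"

query_dict = {
    "query": {
        "type": "group",
        "logical_operator": "and",
        "nodes": [
            {
                "type": "group",
                "logical_operator": "and",
                "nodes": [
                    terminal(prefix + ".database_accession", "in", ["P0DTC2"]),
                    terminal(prefix + ".database_name", "exact_match", "UniProt"),
                ],
                "label": "nested-attribute",
            },
            terminal("rcsb_struct_symmetry.oligomeric_state", "exact_match", "Homo 3-mer"),
        ],
        "label": "text",
    },
    "return_type": "entry",
    "request_options": {
        "pager": {"start": 0, "rows": 25},
        "scoring_strategy": "combined",
        "sort": [{"sort_by": "score", "direction": "desc"}],
    },
}

# Serialize to the JSON text form that a query string uses
query_json = json.dumps(query_dict)
```

The resulting `query_json` string carries the same content as `structure_query1`, so it should be usable in its place.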

In [35]:
structures1 = download_from_query(AdvancedQuery(structure_query1))

In [36]:
view_structure(structures1.keys().collect(), bioAssembly=True);

                                                                                


#### Advanced Query for Entities (returns polymer chains)
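This full-text query returns polymer entities that match "hemoglobin" and whose source organism is human (taxonomy ID 9606). Results are grouped at 95% sequence identity, and one representative per group is returned.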

In [7]:
entity_query1 = '{"query":{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_entity_source_organism.taxonomy_lineage.id","operator":"exact_match","negation":false,"value":"9606"}},{"type":"terminal","service":"full_text","parameters":{"value":"hemoglobin"}}]},"return_type":"polymer_entity","request_options":{"group_by_return_type":"representatives","group_by":{"aggregation_method":"sequence_identity","ranking_criteria_type":{"sort_by":"rcsb_entry_info.resolution_combined","direction":"asc"},"similarity_cutoff":95},"pager":{"start":0,"rows":25},"sort":[{"sort_by":"score","direction":"desc"}],"scoring_strategy":"combined"}}'

In [8]:
entities1 = download_from_query(AdvancedQuery(entity_query1))

In [9]:
view_structure(entities1.keys().collect());

                                                                                


### Sequence Search

In [37]:
sequence = 'GMTKAREFLGTGWKFPVAAGADGAMVL'

In [38]:
entities2 = download_from_query(SequenceSimilarity(sequence, evalue_cutoff=0.1))

In [39]:
view_structure(entities2.keys().collect(), bioAssembly=True);

                                                                                


### ChemicalStructureQuery
This query downloads structures that contain chemical components with a biotin substructure.

4Q94 - BTN in polymer sequence -> bug

In [40]:
smiles = "OC(=O)CCCC[C@@H]1SC[C@@H]2NC(=O)N[C@H]12"
structures2 = download_from_query(ChemicalStructureQuery(smiles, ChemicalStructureQuery.SUBSTRUCTURE_STEREOSPECIFIC, percentSimilarity=90))

In [42]:
view_structure(structures2.keys().collect(), bioAssembly=True);

                                                                                


### PdbjMineSearch - mmCIF metadata
PdbjMineSearch queries the metadata in the PDB archive. These metadata are stored in the PDBx/mmCIF files.

The example below shows the audit_author and citation_author categories. Each category represents a relational table that can be queried using SQL, and its fields represent the table's columns (see
https://pdbj.org/mine-rdb-docs for available tables and columns).

Metadata example from [4P6I.cif](https://files.rcsb.org/view/4P6I.cif):
 <pre>
 loop_
 _audit_author.name 
 _audit_author.pdbx_ordinal 
 ...
 'Doudna, J.A.'    4 

 loop_
 _citation_author.citation_id 
 _citation_author.name 
 _citation_author.ordinal 
 ...
 primary 'Doudna, J.A.'    6
 </pre>

Here we query the name fields in audit_author and citation_author tables.

In [43]:
author_query = (
    "SELECT pdbid from audit_author "
    "WHERE name LIKE 'Doudna%J.A.%' "
    "UNION "
    "SELECT pdbid from citation_author "
    "WHERE citation_id = 'primary' AND name LIKE 'Doudna%J.A.%'"
)
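
To see how the category-to-table mapping plays out, the query logic can be tried on a local stand-in. The sketch below uses an in-memory SQLite database, not PDBj Mine itself, with a reduced set of columns. The 4P6I rows follow the mmCIF excerpt above; the `XXXX` rows and the name `toy_query` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per mmCIF category, one column per field (reduced, assumed schema)
conn.executescript("""
    CREATE TABLE audit_author (pdbid TEXT, name TEXT, pdbx_ordinal INTEGER);
    CREATE TABLE citation_author (pdbid TEXT, citation_id TEXT, name TEXT, ordinal INTEGER);
""")

# 4P6I rows mirror the excerpt above; 'XXXX' rows are made-up non-matching entries
conn.executemany(
    "INSERT INTO audit_author VALUES (?, ?, ?)",
    [("4P6I", "Doudna, J.A.", 4), ("XXXX", "Someone, E.", 1)],
)
conn.executemany(
    "INSERT INTO citation_author VALUES (?, ?, ?, ?)",
    [("4P6I", "primary", "Doudna, J.A.", 6), ("XXXX", "primary", "Someone, E.", 1)],
)

toy_query = (
    "SELECT pdbid from audit_author "
    "WHERE name LIKE 'Doudna%J.A.%' "
    "UNION "
    "SELECT pdbid from citation_author "
    "WHERE citation_id = 'primary' AND name LIKE 'Doudna%J.A.%'"
)

# UNION removes duplicates, so 4P6I appears once although both tables match
pdb_ids = sorted(row[0] for row in conn.execute(toy_query))
print(pdb_ids)  # ['4P6I']
conn.close()
```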

In [44]:
structures3 = download_from_query(PdbjMineSearch(author_query))

In [45]:
view_structure(structures3.keys().collect());

                                                                                


### PdbjMineSearch - SIFTS metadata
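PdbjMineSearch can also query SIFTS annotations. Each category represents a table, and fields represent database columns (see
https://pdbj.org/mine-rdb-docs for available tables and columns). The query below retrieves up to 25 rows from the sifts.pdb_chain_enzyme table with EC number 2.7.11.1 (non-specific serine/threonine protein kinase).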

In [46]:
kinase_query = "SELECT * from sifts.pdb_chain_enzyme WHERE ec_number = '2.7.11.1' LIMIT 25"

kinases = download_from_query(PdbjMineSearch(kinase_query))

In [49]:
view_structure(kinases.keys().collect());

                                                                                


## 2. Filter an existing set of structures by a query

### Read PDB C-alpha structures
This reads a subset of about 10,000 C-alpha structures from an MMTF Hadoop Sequence File.

In [50]:
path = "../resources/mmtf_reduced_sample"

pdb_reduced = mmtfReader.read_sequence_file(path)
# pdb_reduced = mmtfReader.read_sequence_file(path).cache() # cache the structures if they are used more than once

### PdbjMineSearch - SIFTS metadata

In [51]:
kinase_query = "SELECT * from sifts.pdb_chain_enzyme WHERE ec_number = '2.7.11.1'"
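
The kinase queries in this notebook differ only in their `LIMIT` clause. One way to avoid repeating the SQL text is a small builder function. This is a sketch only: `ec_query` is a hypothetical helper, not part of mmtfPyspark, and its validation accepts only four-field EC numbers written with digits or `-`.

```python
import re


def ec_query(ec_number, limit=None):
    """Build a SIFTS enzyme query string (hypothetical helper)."""
    # Accept only values such as '2.7.11.1' or '2.7.11.-', so that
    # arbitrary text never ends up inside the SQL string
    if not re.fullmatch(r"\d+(\.(\d+|-)){3}", ec_number):
        raise ValueError(f"not a valid EC number: {ec_number!r}")

    query = f"SELECT * from sifts.pdb_chain_enzyme WHERE ec_number = '{ec_number}'"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


print(ec_query('2.7.11.1'))
print(ec_query('2.7.11.1', limit=25))
```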

In [52]:
kinases_subset = pdb_reduced.filter(PdbjMineSearch(kinase_query))
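
Here `filter` keeps the (ID, structure) pairs for which the webfilter returns `True`. The sketch below illustrates that idea with plain Python and the built-in `filter` instead of a Spark RDD. It assumes a webfilter behaves like a callable that checks each pair's key against the IDs returned by the query; `IdSetFilter` and the records are hypothetical stand-ins, not mmtfPyspark's implementation.

```python
class IdSetFilter:
    """Keep (key, value) pairs whose key is in a given set of IDs (simplified stand-in)."""

    def __init__(self, ids):
        # In a real webfilter, these IDs would come from the web service query
        self.ids = set(ids)

    def __call__(self, pair):
        key, _ = pair
        return key in self.ids


# Made-up records standing in for (structure ID, structure) pairs
records = [("1ABC", "structure 1"), ("2DEF", "structure 2"), ("3GHI", "structure 3")]

matches = IdSetFilter(["2DEF", "9XYZ"])
kept = list(filter(matches, records))
print([key for key, _ in kept])  # ['2DEF']
```

Because the query runs once and the check is a set lookup, the same filter object can be applied to several datasets, as the following cells do with the reduced and full samples.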

In [54]:
view_structure(kinases_subset.keys().collect());

                                                                                


### Read PDB full (all-atom) structures
This reads a subset of about 10,000 all-atom structures from an MMTF Hadoop Sequence File.

In [55]:
path = "../resources/mmtf_full_sample"

pdb_full = mmtfReader.read_sequence_file(path)
# pdb_full = mmtfReader.read_sequence_file(path).cache() # cache the structures if they are used more than once

In [56]:
kinases_subset = pdb_full.filter(PdbjMineSearch(kinase_query))

In [57]:
view_structure(kinases_subset.keys().collect());

                                                                                


In [27]:
spark.stop()