# SIFTS Query Demo

![pdbj](https://pdbj.org/content/default.svg)

This demo runs a complex PDBj Mine 2 RDB query over PDB and SIFTS data, then filters MMTF structures by PDB ID and chain name.
Additional filtering can be done on the client side, and the retrieved data can be reused, so the query serves for data gathering as well as filtering.

[PDBj Mine Search Website](https://pdbj.org/mine)

## Imports

In [2]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.webFilters import PdbjMine
from mmtfPyspark.datasets import PdbjMineService
from mmtfPyspark.io import mmtfReader

## Configure Spark Context

In [3]:
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("SIFTSQuery")
    
sc = SparkContext(conf=conf)

## Read in MMTF files from local directory

In [4]:
path = "../../resources/mmtf_full_sample/"

pdb = mmtfReader.read_sequence_file(path, sc)

## Apply a SQL search on PDBj using a filter

Retrieve PDB chain sequences that match the Pfam accession "PF00046" (Homeobox), have a resolution better than 2.0 Angstrom, and have a sequence length of at least 58 residues.

In [6]:
# Join the SIFTS Pfam mapping with refinement and sequence data. A chain matches
# an entity when it appears in that entity's comma-separated strand ID list.
sql = ("SELECT concat(s.pdbid, '.', s.chain) as \"structureChainId\", s.*, r.ls_d_res_high as reso, "
       "LENGTH(p.pdbx_seq_one_letter_code_can) as len, "
       "('>' || s.pdbid || s.chain) as header, "
       "replace(p.pdbx_seq_one_letter_code_can,E'\n','') as aaseq "
       "FROM sifts.pdb_chain_pfam s "
       "JOIN refine r on r.pdbid = s.pdbid "
       "JOIN entity_poly p on p.pdbid = s.pdbid "
       "AND s.chain = ANY(regexp_split_to_array(p.pdbx_strand_id,',')) "
       "WHERE pfam_id = 'PF00046' "
       "AND r.ls_d_res_high < 2.0 "
       "AND LENGTH(p.pdbx_seq_one_letter_code_can) >= 58 "
       "ORDER BY reso, len, s.chain")


search = PdbjMine(sql)
count = pdb.filter(search).keys().count()
print(f"Number of entries using sql to filter: {count}")

Number of entries using sql to filter: 1
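
The query above hard-codes the Pfam accession and both thresholds. The following is a minimal sketch of how the same SQL text could be generated for other values. `build_pfam_query` is a hypothetical helper, not part of mmtfPyspark, and because the values are interpolated directly into the SQL string, the sketch validates them first.

```python
import re


def build_pfam_query(pfam_id: str, max_resolution: float, min_length: int) -> str:
    """Hypothetical helper: rebuild the demo query with adjustable thresholds."""
    # The values end up inside the SQL text, so reject anything unexpected.
    if not re.fullmatch(r"PF\d{5}", pfam_id):
        raise ValueError(f"unexpected Pfam accession: {pfam_id!r}")
    max_resolution = float(max_resolution)
    min_length = int(min_length)

    return (
        "SELECT concat(s.pdbid, '.', s.chain) as \"structureChainId\", s.*, "
        "r.ls_d_res_high as reso, "
        "LENGTH(p.pdbx_seq_one_letter_code_can) as len, "
        "('>' || s.pdbid || s.chain) as header, "
        # '\\n' sends a literal backslash-n, which the E'' string turns into a newline.
        "replace(p.pdbx_seq_one_letter_code_can, E'\\n', '') as aaseq "
        "FROM sifts.pdb_chain_pfam s "
        "JOIN refine r on r.pdbid = s.pdbid "
        "JOIN entity_poly p on p.pdbid = s.pdbid "
        "AND s.chain = ANY(regexp_split_to_array(p.pdbx_strand_id, ',')) "
        f"WHERE pfam_id = '{pfam_id}' "
        f"AND r.ls_d_res_high < {max_resolution} "
        f"AND LENGTH(p.pdbx_seq_one_letter_code_can) >= {min_length} "
        "ORDER BY reso, len, s.chain"
    )


sql_homeobox = build_pfam_query("PF00046", 2.0, 58)
```

The resulting string could then be passed to `PdbjMine(...)` in the same way as the hand-written `sql` variable.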


## Apply a SQL search on PDBj and get a dataset

In [7]:
dataset = PdbjMineService.getDataset(sql)
dataset.show(1)
search = PdbjMine(dataset = dataset)
count = pdb.filter(search).keys().count()
print(f"Number of entries using dataset to filter: {count}")

+----------------+-----+-----+----------+-------+----+---+------+--------------------+
|structureChainId|pdbid|chain|sp_primary|pfam_id|reso|len|header|               aaseq|
+----------------+-----+-----+----------+-------+----+---+------+--------------------+
|          4rdu.A| 4rdu|    A|    P56178|PF00046|1.85| 65|>4rduA|GHMVRKPRTIYSSFQLA...|
+----------------+-----+-----+----------+-------+----+---+------+--------------------+
only showing top 1 row

Number of entries using dataset to filter: 1
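
As noted in the introduction, the retrieved rows can be filtered further on the client side and reused. The sketch below illustrates both ideas with plain Python dictionaries that mimic the `reso`, `len`, `header`, and `aaseq` columns. The rows and sequences are made-up placeholders rather than real PDB entries, and `filter_rows` and `to_fasta` are hypothetical helper names.

```python
# Placeholder rows with the same column names as the dataset shown above.
# These IDs and sequences are invented for illustration only.
rows = [
    {"header": ">1abcA", "reso": 1.85, "len": 8, "aaseq": "GHMVRKPR"},
    {"header": ">2xyzB", "reso": 2.40, "len": 10, "aaseq": "MKRARTAFTS"},
    {"header": ">3defC", "reso": 1.50, "len": 4, "aaseq": "GSHM"},
]


def filter_rows(rows, max_reso, min_len):
    """Keep rows with resolution below max_reso and at least min_len residues."""
    return [r for r in rows if r["reso"] < max_reso and r["len"] >= min_len]


def to_fasta(rows):
    """Reuse the header and aaseq columns to build FASTA-formatted text."""
    return "".join(f"{r['header']}\n{r['aaseq']}\n" for r in rows)


fasta = to_fasta(filter_rows(rows, max_reso=2.0, min_len=6))
print(fasta)
```

Assuming `dataset` is a Spark DataFrame, comparable dictionaries could be obtained with `[row.asDict() for row in dataset.collect()]`.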


## Terminate Spark Context

In [6]:
sc.stop()