# Keyword Search Demo

![pdbj](https://pdbj.org/content/default.svg)

This demo runs a keyword search against the PDBj Mine 2 relational database (RDB) and uses the returned PDB IDs to filter MMTF structures.
The search looks up a keyword in the 'keyword' column of the brief_summary table and returns the selected columns for the matching entries.

[PDBj Mine Search Website](https://pdbj.org/mine)

## Imports

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.webfilters import PdbjMineSearch
from mmtfPyspark.datasets import pdbjMineDataset
from mmtfPyspark.io import mmtfReader

## Configure Spark

In [2]:
spark = SparkSession.builder.appName("KeywordSearchDemo").getOrCreate()

## Read in MMTF files from local directory

In [3]:
path = "../../resources/mmtf_reduced_sample/"

pdb = mmtfReader.read_sequence_file(path)

## Apply a SQL search on PDBj using a filter

In [4]:
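# Ask PDBj Mine for the PDB IDs of all entries matching the keyword 'porin'
sql = "select pdbid from keyword_search('porin')"

# Keep only the local structures whose PDB ID is among the query results
pdb = pdb.filter(PdbjMineSearch(sql))
print(pdb.keys().collect())
print("\n")
print(f"Number of entries matching query: {pdb.count()}")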

['2POR', '4FQE', '4B0M', '3VY8', '2VQG', '3PGU', '2GUF', '2FGQ', '4RJW', '4RLC', '1ZE3', '4QLP', '1QJP', '5LDV', '3BWU', '2X9K', '4KR8', '2MLT', '2WJR', '3DWO', '2XET', '4FUV', '4MKO', '3SZV', '2PV2']


Number of entries matching query: 25
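
The filtering step above can be pictured as a set-membership test on PDB IDs. The following is a minimal pure-Python sketch of that idea, not the `PdbjMineSearch` implementation: `make_id_filter`, the sample entries, and the ID `1ABC` are hypothetical stand-ins, and a plain list takes the place of the Spark RDD. It assumes the filter only keeps structures that are both present locally and returned by the query, which would explain why an ID such as `3POR` can be a query hit without appearing in the filtered list.

```python
def make_id_filter(matching_ids):
    """Return a predicate that keeps (pdb_id, structure) pairs whose ID matched the query.

    Hypothetical helper for illustration; not part of mmtfPyspark.
    """
    ids = {pdb_id.upper() for pdb_id in matching_ids}

    def keep(entry):
        pdb_id, _structure = entry
        return pdb_id.upper() in ids

    return keep


# Stand-in for the local MMTF sample: (PDB ID, structure) pairs.
local_entries = [
    ("2POR", "structure A"),
    ("1ABC", "structure B"),  # hypothetical ID that the query does not return
    ("4FQE", "structure C"),
]

# Stand-in for the IDs returned by the keyword query.
query_hits = ["3POR", "2OMF", "2POR", "4FQE"]

keep = make_id_filter(query_hits)
filtered = [entry for entry in local_entries if keep(entry)]
matched_ids = [pdb_id for pdb_id, _ in filtered]
print(matched_ids)
```

Only IDs present in both collections survive, so the count reported by the filter reflects the local sample rather than the full set of query hits.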


## Apply a SQL search on PDBj and get a dataset

In [5]:
# Retrieve selected columns for the entries matching 'porin', best hit_score first
sql = "select pdbid, resolution, biol_species, db_uniprot, db_pfam, hit_score from keyword_search('porin') order by hit_score desc"

dataset = pdbjMineDataset.get_dataset(sql)
dataset.show(10)

+-----------+----------+--------------------+--------------------+-----------+---------+
|structureId|resolution|        biol_species|          db_uniprot|    db_pfam|hit_score|
+-----------+----------+--------------------+--------------------+-----------+---------+
|       3POR|       2.5|Rhodobacter capsu...|['P31243', 'PORI_...|['PF13609']| 0.095809|
|       2OMF|       2.4|Escherichia coli K12|['P02931', 'OMPF_...|['PF00267']|0.0954989|
|       2POR|       1.8|Rhodobacter capsu...|['P31243', 'PORI_...|['PF13609']|0.0951392|
|       1BT9|       3.0|    Escherichia coli|['P02931', 'OMPF_...|['PF00267']| 0.094717|
|       1GFO|       3.3|    Escherichia coli|['P02931', 'OMPF_...|['PF00267']| 0.094717|
|       1GFP|       2.7|    Escherichia coli|['P02931', 'OMPF_...|['PF00267']| 0.094717|
|       1GFQ|       2.8|    Escherichia coli|['P02931', 'OMPF_...|['PF00267']| 0.094717|
|       1H6S|       3.0|RHODOPSEUDOMONAS ...|['P39767', 'PORI_...|['PF13609']| 0.094717|
|       1GFM|       3
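
Both queries in this demo share the same shape: a column list, a call to `keyword_search`, and an optional ordering. To repeat the search for other keywords, one option is a small query builder. The sketch below is an illustration only: `keyword_query` is a hypothetical helper, not part of mmtfPyspark, and doubling single quotes is assumed to be an adequate way to escape the keyword for this database.

```python
def keyword_query(keyword, columns=("pdbid",), order_by=None):
    """Build a keyword_search query string (hypothetical helper, not part of mmtfPyspark)."""
    # Double any single quotes so the keyword stays inside the SQL string literal.
    escaped = keyword.replace("'", "''")
    sql = f"select {', '.join(columns)} from keyword_search('{escaped}')"
    if order_by:
        sql += f" order by {order_by}"
    return sql


# Reproduces the query passed to the filter.
filter_sql = keyword_query("porin")

# Reproduces the query passed to the dataset call.
dataset_sql = keyword_query(
    "porin",
    columns=("pdbid", "resolution", "biol_species", "db_uniprot", "db_pfam", "hit_score"),
    order_by="hit_score desc",
)

print(filter_sql)
print(dataset_sql)
```

The resulting strings can be passed to `PdbjMineSearch` and `pdbjMineDataset.get_dataset` in the same way as the hand-written queries.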

## Terminate Spark Context

In [6]:
spark.stop()