# Wild Type Query Demo

This demo selects protein sequences that contain no mutations relative to their reference UniProt sequences.

Two options control the query. In the code, they correspond to the `includeExpressionTags` and `percentSequenceCoverage` arguments of `WildTypeQuery`.

Expression tags: Some PDB entries include expression tags that were added during the experiment. Select "No" to filter out sequences with expression tags.

Percent coverage of UniProt sequence: A PDB entry may contain only a portion of the referenced UniProt sequence. This option sets the minimum percentage of the UniProt sequence that a PDB entry must contain to be selected.
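The two options can be illustrated with a small, self-contained sketch. It does not show how `WildTypeQuery` is implemented; the function names, the residue counts, and the assumption that coverage is the number of matched residues divided by the UniProt sequence length are all hypothetical and serve only to make the option semantics concrete.

```python
def percent_coverage(observed_residues: int, uniprot_length: int) -> float:
    """Percentage of the UniProt sequence present in a PDB entry (assumed definition)."""
    if uniprot_length <= 0:
        raise ValueError("uniprot_length must be positive")
    return 100.0 * observed_residues / uniprot_length


def passes_options(has_expression_tag: bool,
                   observed_residues: int,
                   uniprot_length: int,
                   include_expression_tags: bool = True,
                   min_coverage: float = 95.0) -> bool:
    """Hypothetical predicate mirroring the two options described above."""
    # With include_expression_tags=False ("No"), tagged sequences are rejected.
    if has_expression_tag and not include_expression_tags:
        return False
    # The entry must contain at least min_coverage percent of the UniProt sequence.
    return percent_coverage(observed_residues, uniprot_length) >= min_coverage


# 290 of 300 residues is about 96.7 % coverage, which meets a 95 % threshold.
print(passes_options(False, 290, 300))
# A tagged sequence is rejected when expression tags are excluded.
print(passes_options(True, 300, 300, include_expression_tags=False))
```

Under these assumptions, a stricter coverage threshold or excluding expression tags can only shrink the set of selected entries.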


## Imports

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.webfilters import WildTypeQuery

## Configure Spark

In [2]:
spark = SparkSession.builder.appName("WildTypeQueryDemo").getOrCreate()

## Read in Hadoop Sequence Files and filter by WildType

In [3]:
path = "../../resources/mmtf_reduced_sample/"

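# Keep wild-type entries (expression tags included) covering at least 95 % of the UniProt sequence
wild_type = WildTypeQuery(includeExpressionTags=True,
                          percentSequenceCoverage=WildTypeQuery.SEQUENCE_COVERAGE_95)

pdb = mmtfReader.read_sequence_file(path).filter(wild_type).cache()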

## Show top 5 structures and count results

In [4]:
pdb.keys().top(5)

['6F72', '6ES9', '6ENS', '6EKV', '6EKT']

In [5]:
print(f"Number of structures after filtering : {pdb.count()}")

Number of structures after filtering : 2836


## Terminate Spark

In [6]:
spark.stop()