# Filtering PDB Structures
This tutorial demonstrates how to filter the PDB to create subsets of structures. For details, see the [filters](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/mmtfPyspark/filters) and [demos](https://github.com/sbl-sdsc/mmtf-pyspark/tree/master/demos/filters).

## Import pyspark and mmtfPyspark

In [47]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsGroup, ContainsLProteinChain, PolymerComposition, Resolution
from mmtfPyspark import structureViewer

## Configure Spark

In [40]:
conf = SparkConf().setMaster("local[*]").setAppName("1-Input")
sc = SparkContext(conf = conf)

## Read PDB structures

In [41]:
path = "/Users/peter/MMTF_Files/reduced_pisces25_2.2_drugs"
pdb = mmtfReader.read_sequence_file(path, sc)

## Filter by Quality Metrics
Structures can be filtered by [Resolution](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/resolution) and [R-free](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/r-value-and-r-free). Each filter takes a minimum and a maximum value. The example below returns structures with a resolution in the inclusive range [0.0, 1.5] Å.

In [42]:
pdb = pdb.filter(Resolution(0.0, 1.5))
pdb.count()

3085
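To make the inclusive-range semantics concrete, here is a minimal pure-Python sketch of a range filter. `InclusiveRange` and the sample records are hypothetical and are not mmtfPyspark code; the sketch only assumes that a filter is a callable that receives a record and returns `True` to keep it.

```python
class InclusiveRange:
    """Hypothetical stand-in for a range filter such as Resolution (not library code)."""

    def __init__(self, minimum, maximum):
        self.minimum = minimum
        self.maximum = maximum

    def __call__(self, record):
        # record is assumed to be a (structure_id, value) pair
        _, value = record
        # In this sketch, records without a value are dropped
        if value is None:
            return False
        # Both bounds are included in the accepted range
        return self.minimum <= value <= self.maximum


# Made-up (ID, resolution in Å) pairs, for illustration only
records = [("A", 1.2), ("B", 1.5), ("C", 2.1), ("D", None)]
kept = [sid for sid, _ in filter(InclusiveRange(0.0, 1.5), records)]
print(kept)
```

Record "B" sits exactly on the upper bound and is kept, which is what "inclusive" means here.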

## Filter by Polymer Chain Types
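Several filters select structures by the type of their polymer chains. Because each step below reassigns `pdb`, the filters in this tutorial are applied cumulatively, so every count is a subset of the previous one.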

### Create a subset of structures that contain at least one L-protein chain

In [43]:
pdb = pdb.filter(ContainsLProteinChain())
pdb.count()

3056

### Create a subset of structures that exclusively contain L-protein chains (e.g., exclude protein-nucleic acid complexes)

In [44]:
pdb = pdb.filter(ContainsLProteinChain(exclusive=True))
pdb.count()

2924
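The difference between the default and the `exclusive=True` behavior can be illustrated with a small pure-Python sketch. The helper `contains_chain_type` and the chain-type labels are hypothetical and chosen for this example; they are not the library's implementation or its internal representation.

```python
def contains_chain_type(chain_types, wanted, exclusive=False):
    """Hypothetical helper: chain_types holds one type label per polymer chain."""
    if exclusive:
        # Every chain must be of the wanted type, and there must be at least one chain
        return len(chain_types) > 0 and all(t == wanted for t in chain_types)
    # Default: a single matching chain is enough
    return any(t == wanted for t in chain_types)


protein_only = ["L-protein", "L-protein"]
protein_dna_complex = ["L-protein", "DNA"]

# Both structures pass the "at least one" test
print(contains_chain_type(protein_only, "L-protein"))
print(contains_chain_type(protein_dna_complex, "L-protein"))

# Only the protein-only structure passes the exclusive test
print(contains_chain_type(protein_only, "L-protein", exclusive=True))
print(contains_chain_type(protein_dna_complex, "L-protein", exclusive=True))
```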

### Keep protein structures that exclusively contain chains made of the 22 amino acids (the 20 standard amino acids plus selenocysteine and pyrrolysine)

In [45]:
pdb = pdb.filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_22, exclusive=True))
pdb.count()

2287
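A composition filter of this kind depends on which residue alphabet is allowed. As a rough illustration, the sketch below checks whether every chain sequence stays within a given set of one-letter codes. The sets and the helper `chains_within_alphabet` are defined for this example only; they are assumptions, not the library's constants or code.

```python
# One-letter codes of the 20 standard amino acids
AMINO_ACIDS_20 = set("ACDEFGHIKLMNPQRSTVWY")
# Extended set: adds selenocysteine (U) and pyrrolysine (O)
AMINO_ACIDS_22 = AMINO_ACIDS_20 | {"U", "O"}


def chains_within_alphabet(sequences, alphabet):
    """Hypothetical helper: True if every sequence uses only letters in alphabet."""
    # Note: an empty list of sequences trivially returns True in this sketch
    return all(set(seq) <= alphabet for seq in sequences)


# Made-up sequences, for illustration only
standard_only = ["MKVLA", "GSHM"]
with_selenocysteine = ["MKU"]
with_unknown_residue = ["MKX"]

print(chains_within_alphabet(standard_only, AMINO_ACIDS_20))
print(chains_within_alphabet(with_selenocysteine, AMINO_ACIDS_20))
print(chains_within_alphabet(with_selenocysteine, AMINO_ACIDS_22))
print(chains_within_alphabet(with_unknown_residue, AMINO_ACIDS_22))
```

A chain containing selenocysteine is rejected by the 20-letter set but accepted by the 22-letter set, so the choice of alphabet changes which structures survive the filter.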

## Filter by Chemical Group
Keep structures that contain an ATP group.

In [None]:
pdb = pdb.filter(ContainsGroup("ATP"))

In [46]:
sc.stop()