# Create Representative Set Demo

Creates an MMTF-Hadoop Sequence File for a PISCES representative set of protein chains.


## References

Please cite the following in any work that uses lists provided by PISCES:

G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.
[PISCES](http://dunbrack.fccc.edu/PISCES.php)


## Imports

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader, mmtfWriter
from mmtfPyspark.mappers import StructureToPolymerChains
from mmtfPyspark.filters import PolymerComposition
from mmtfPyspark.webfilters import Pisces

## Configure Spark

In [2]:
spark = SparkSession.builder.appName("CreateRepresentativeSetDemo").getOrCreate()

## Read in Hadoop Sequence Files

In [3]:
path = "../../resources/mmtf_full_sample/"

pdb = mmtfReader.read_sequence_file(path)

## Filter by representative protein chains at 40% sequence identity and 2.0 Å resolution

In [4]:
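sequenceIdentity = 40  # sequence identity cutoff in percent
resolution = 2.0       # resolution cutoff in Å

# Filter entries, split them into polymer chains, filter the chains,
# then keep only chains composed of the 20 standard amino acids
pdb = pdb.filter(Pisces(sequenceIdentity, resolution)) \
         .flatMap(StructureToPolymerChains()) \
         .filter(Pisces(sequenceIdentity, resolution)) \
         .filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20))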
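
The cell above applies the `Pisces` filter twice, once before and once after `StructureToPolymerChains`. The plain-Python sketch below (no Spark, no mmtfPyspark) illustrates one plausible reading of that pattern: the first pass discards whole entries that contain no representative chain, so fewer structures need to be split, and the second pass keeps only the representative chains themselves. The chain list, the toy structures, and the matching rule (4-character PDB ID at the entry level, `PDBID.chain` at the chain level) are assumptions made for illustration, not the library's confirmed implementation.

```python
# Hypothetical stand-in for a PISCES chain list (real lists come from the PISCES server).
representative_chains = {"1ABC.A", "2XYZ.B"}

# Entry-level IDs derived from the chain list: the PDB ID before the dot.
representative_entries = {chain_id.split(".")[0] for chain_id in representative_chains}

# Toy "structures": PDB ID -> list of (chain name, sequence). All values are made up.
structures = {
    "1ABC": [("A", "MKV"), ("B", "MKV")],
    "2XYZ": [("A", "GGS"), ("B", "ACD")],
    "3DEF": [("A", "TTT")],
}

# Stage 1: entry-level filter drops structures with no representative chain.
entries = {
    pdb_id: chains
    for pdb_id, chains in structures.items()
    if pdb_id in representative_entries
}

# Stage 2: flatten each remaining structure into (chain key, sequence) pairs,
# mimicking what a flatMap over polymer chains would produce.
chains = [
    (f"{pdb_id}.{name}", sequence)
    for pdb_id, chain_list in entries.items()
    for name, sequence in chain_list
]

# Stage 3: chain-level filter keeps only the representative chains.
selected = sorted(key for key, _ in chains if key in representative_chains)
print(selected)  # ['1ABC.A', '2XYZ.B']
```

In this toy data, stage 1 removes `3DEF` before any chain splitting happens, and stage 3 removes `1ABC.B` and `2XYZ.A`, which belong to representative entries but are not themselves in the chain list.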

## Show top 10 chain IDs

In [5]:
pdb.keys().top(10)

['8ABP.A',
 '7ODC.A',
 '7A3H.A',
 '6RLX.D',
 '6FLK.B',
 '6FG8.B',
 '6FG8.A',
 '6F8P.A',
 '6F72.A',
 '6F0P.A']
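
The output above is ordered from largest to smallest key, because `top(n)` on an RDD returns the `n` largest elements in descending order. For string keys this means lexicographic comparison, which is why `8ABP.A` comes first. The sketch below reproduces that ordering in plain Python with `heapq.nlargest`; the key list is a small hypothetical sample, not the contents of the sample data set.

```python
import heapq

# Hypothetical chain keys in arbitrary order.
keys = ["6F0P.A", "8ABP.A", "1ABC.B", "7ODC.A", "6FG8.B", "6FG8.A"]

# The three largest keys under lexicographic string comparison, largest first.
top3 = heapq.nlargest(3, keys)
print(top3)  # ['8ABP.A', '7ODC.A', '6FG8.B']
```

Because the comparison is lexicographic, the result says nothing about structure quality or size; it is simply a convenient way to peek at a few keys.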

## Terminate Spark

In [6]:
spark.stop()