# Create Representative Set Demo

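This demo shows how to create a representative set of protein chains: read Hadoop sequence files, filter structures by BlastClust sequence cluster, flatMap the structures to polymer chains, filter the chains by the same cluster criterion, and finally filter by polymer composition (`AMINO_ACIDS_20`).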

![RCSB PDB](https://cdn.rcsb.org/rcsb-pdb/v2/common/images/Logo_wwpdb.png)

## Imports

In [2]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import MmtfReader, MmtfWriter
from mmtfPyspark.mappers import structureToPolymerChains
from mmtfPyspark.filters import polymerComposition
from mmtfPyspark.webfilters import blastCluster

## Configure Spark

In [3]:
# Run Spark locally, using all available cores
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("CreateRepresentativeSetDemo")
sc = SparkContext(conf=conf)

## Read in Hadoop Sequence Files

In [4]:
path = "../../resources/mmtf_full_sample/"

pdb = MmtfReader.readSequenceFile(path, sc)

## Filter by representative protein chains at 40% sequence identity

In [7]:
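sequenceIdentity = 40

# Filter at the structure level, split each structure into its polymer chains,
# filter again at the chain level, then filter by polymer composition
pdb = pdb.filter(blastCluster(sequenceIdentity)) \
         .flatMap(structureToPolymerChains()) \
         .filter(blastCluster(sequenceIdentity)) \
         .filter(polymerComposition(polymerComposition.AMINO_ACIDS_20))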
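
The cluster filter is applied twice because it acts on different units before and after the flatMap. The following is a minimal plain-Python sketch of that idea, not the mmtfPyspark implementation. It assumes that the structure-level filter keeps a structure when at least one of its chains is a cluster representative, and that the chain-level filter then keeps only the representative chains themselves. All IDs, polymer types, and helper names are hypothetical.

```python
# Toy stand-in for the Spark pipeline: plain lists instead of RDDs.
# Every ID and polymer type below is made up for illustration.

# Hypothetical structures: structure ID -> {chain ID: polymer type}
structures = {
    "1ABC": {"A": "protein", "B": "protein"},
    "2XYZ": {"A": "protein", "B": "dna"},
    "3DEF": {"A": "protein"},
}

# Hypothetical cluster representatives, keyed as "structureId.chainId"
representatives = {"1ABC.A", "2XYZ.B"}


def has_representative_chain(structure_id, chains):
    """Structure-level test (assumed behavior): any chain is a representative."""
    return any(f"{structure_id}.{c}" in representatives for c in chains)


def split_into_chains(structure_id, chains):
    """Mimics the flatMap step: one (chainKey, polymerType) pair per chain."""
    return [(f"{structure_id}.{c}", kind) for c, kind in chains.items()]


# Step 1: drop structures that contain no representative chain
kept_structures = [
    (sid, chains)
    for sid, chains in structures.items()
    if has_representative_chain(sid, chains)
]

# Step 2: flatten the remaining structures into individual chains
all_chains = [
    chain
    for sid, chains in kept_structures
    for chain in split_into_chains(sid, chains)
]

# Step 3: keep only the chains that are themselves representatives
rep_chains = [(cid, kind) for cid, kind in all_chains if cid in representatives]

# Step 4: keep only protein chains (stand-in for the composition filter)
protein_reps = [(cid, kind) for cid, kind in rep_chains if kind == "protein"]

print(protein_reps)
```

In this toy data, step 1 discards `3DEF` before any splitting happens, which is why filtering early can reduce the work done by the flatMap. Step 3 is still needed because a surviving structure such as `1ABC` also carries non-representative chains.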

## Show top 10 chains

In [10]:
pdb.top(10)

[('1FYF.B', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874dd8b8d0>),
 ('1FYF.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874d58c4e0>),
 ('1FYE.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874d58c828>),
 ('1FYD.B', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874da36898>),
 ('1FYD.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874db837b8>),
 ('1FYC.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874db83940>),
 ('1FYB.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874d2bfb00>),
 ('1FYA.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874d19bcf8>),
 ('1FY9.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874d0541d0>),
 ('1FY7.A', <mmtf.api.mmtf_writer.MMTFEncoder at 0x7f874cf0a518>)]
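
Note that `top(10)` returns the ten largest elements in descending order, which is why the keys above run backwards from `1FYF.B`. It does not return the first ten records of the RDD; `take(10)` does that. The ordering can be sketched in plain Python with `heapq.nlargest`; the records below are hypothetical placeholders, not real MMTF data.

```python
import heapq

# Hypothetical (chainKey, payload) pairs standing in for the RDD records
chains = [
    ("1FY7.A", "payload-1"),
    ("1FYF.B", "payload-2"),
    ("1AAA.A", "payload-3"),
    ("1FYD.A", "payload-4"),
]

# Comparable to rdd.top(2, key=lambda pair: pair[0]):
# the two largest keys, in descending order
top_two = heapq.nlargest(2, chains, key=lambda pair: pair[0])

# Comparable to rdd.take(2): simply the first two records
first_two = chains[:2]

print(top_two)
print(first_two)
```

Ordering by the key alone, as in this sketch, avoids comparing the payload objects when two keys tie. PySpark's `top` accepts the same kind of `key` argument.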

## Save representative set

In [13]:
write_path = f'./pdb_representatives_{sequenceIdentity}'

MmtfWriter.writeSequenceFile(write_path, sc, pdb)
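
Spark's Hadoop-based save methods generally refuse to write into a directory that already exists, so rerunning this cell may fail if `write_path` is left over from an earlier run. Assuming `MmtfWriter.writeSequenceFile` behaves the same way, a small guard like the hypothetical helper below could be called before writing. It is a sketch for local file systems only and is not part of mmtfPyspark.

```python
import shutil
from pathlib import Path


def prepare_output_dir(path, overwrite=False):
    """Hypothetical helper: ensure `path` does not exist before writing.

    Raises FileExistsError if the directory exists and overwrite is False;
    otherwise removes the existing directory tree. Returns the path as a str.
    """
    target = Path(path)
    if target.exists():
        if not overwrite:
            raise FileExistsError(
                f"{target} already exists; pass overwrite=True to replace it"
            )
        # Assumes `target` is a directory, as Hadoop-style output is
        shutil.rmtree(target)
    return str(target)
```

With this helper, the write call could be preceded by `write_path = prepare_output_dir(write_path, overwrite=True)`. Use `overwrite=True` with care, since it deletes the previous output.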

## Terminate Spark

In [14]:
sc.stop()