# Poly-peptide Chain Statistics Example

Example demonstrating how to extract protein chains from PDB entries. It uses a flatMap function to transform each structure into its polymer chains.

## Imports

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.filters import PolymerComposition
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.mappers import StructureToPolymerChains

## Configure Spark

In [2]:
spark = SparkSession.builder.master("local[4]") \
                                 .appName("PolyPeptideChainStatisticsExample") \
                                 .getOrCreate()

## Read MMTF files, flatMap to polymer chains, filter by polymer composition, and get the number of groups

In [4]:
path = "../../resources/mmtf_full_sample/"

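# Read a Hadoop sequence file of MMTF structures, split each structure into
# its polymer chains, keep chains made up only of the 20 standard amino acids,
# and map each chain to its number of groups (residues).
# The positional arguments to StructureToPolymerChains are assumed here to be
# useChainIdInsteadOfChainName=False and excludeDuplicates=True.
chainLengths = mmtfReader.read_sequence_file(path) \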
                         .flatMap(StructureToPolymerChains(False, True)) \
                         .filter(PolymerComposition(PolymerComposition.AMINO_ACIDS_20)) \
                         .map(lambda t: t[1].num_groups)
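
To make the flatMap, filter, and map steps concrete without a Spark cluster, the following is a minimal plain-Python sketch of the same pipeline shape. It is not how mmtfPyspark implements these steps: the toy `structures` data, the helper names `to_chains` and `is_standard_peptide`, and the `"1ABC.A"`-style keys are all assumptions made for illustration.

```python
from itertools import chain

# The 20 standard amino acids as three-letter group names.
STANDARD_AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}

# Hypothetical toy data: structure ID -> {chain name: list of group names}.
structures = {
    "1ABC": {"A": ["MET", "LYS", "THR"], "B": ["DA", "DT", "DG", "DC"]},
    "2XYZ": {"A": ["GLY", "ALA", "MSE", "VAL"],
             "B": ["SER", "PRO", "GLY", "LEU", "TRP"]},
}


def to_chains(item):
    """Split one (structure_id, chains) pair into (key, groups) pairs."""
    structure_id, chains = item
    return [(f"{structure_id}.{name}", groups) for name, groups in chains.items()]


def is_standard_peptide(item):
    """Keep a chain only if every group is one of the 20 standard amino acids."""
    _, groups = item
    return all(group in STANDARD_AMINO_ACIDS for group in groups)


# flatMap: each structure yields several chains, flattened into one list.
all_chains = list(chain.from_iterable(to_chains(s) for s in structures.items()))

# filter: drop the DNA chain (1ABC.B) and the chain containing MSE (2XYZ.A).
peptide_chains = [c for c in all_chains if is_standard_peptide(c)]

# map: reduce each remaining chain to its number of groups.
chain_lengths = [len(groups) for _, groups in peptide_chains]
print(chain_lengths)
```

The difference between `map` and `flatMap` is visible in `all_chains`: two structures go in, four chains come out, because each structure's list of chains is flattened into a single collection.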

## Print out poly-peptide chain statistics

In [5]:
stats = chainLengths.stats()
print(f"Total number of chains: {stats.count()}")
print(f"Total number of groups: {stats.sum()}")
print(f"Min chain length: {stats.min()}")
print(f"Mean chain length: {stats.mean()}")
print(f"Max chain length: {stats.max()}")

Total number of chains: 7489
Total number of groups: 1655462
Min chain length: 3
Mean chain length: 221.05247696621703
Max chain length: 1231
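
As a rough illustration of what these summary statistics represent, here is a small plain-Python sketch that computes the same five quantities for a list of chain lengths. The `summarize` helper and its sample input are hypothetical and are not part of PySpark or mmtfPyspark. The last lines check that the mean reported above agrees with the reported sum divided by the reported count.

```python
def summarize(lengths):
    """Return count, sum, min, mean, and max for a non-empty list of lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    count = len(lengths)
    total = sum(lengths)
    return {
        "count": count,
        "sum": total,
        "min": min(lengths),
        "mean": total / count,
        "max": max(lengths),
    }


# Hypothetical sample input, not taken from the dataset.
summary = summarize([3, 5, 10])
print(summary)

# Consistency check on the figures printed by the notebook:
# total number of groups divided by total number of chains.
reported_mean = 1655462 / 7489
print(round(reported_mean, 2))
```

The check yields about 221.05, which matches the mean chain length in the output above.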


## Terminate Spark

In [6]:
spark.stop()