# Getting Started with mmtf-pyspark on CyVerse/VICE

mmtf-pyspark is a prototype framework for the interactive mining of 3D structures in the Protein Data Bank (PDB).

In [None]:
from pyspark.sql import SparkSession   
from mmtfPyspark import structureViewer
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.filters import ContainsLProteinChain, ExperimentalMethods, Resolution

# A Simple Example
This example shows how to set up mmtf-pyspark in a Jupyter Notebook, read the PDB archive, and apply filters to create a subset of structures.

### Configure Spark
mmtf-pyspark uses [Apache Spark](https://spark.apache.org/), an analytics engine for large-scale data processing. As a first step, start a Spark session.

In [None]:
spark = SparkSession.builder.master("local[20]").appName("GettingStarted").getOrCreate()
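
The `local[20]` master string asks Spark to run locally with 20 worker threads. On a machine with fewer cores, a smaller value may be a better fit. The helper below is a minimal sketch and not part of mmtf-pyspark: the name `local_master` and the default cap of 20 are assumptions chosen to match this notebook.

```python
import os

def local_master(max_threads=20, available=None):
    """Build a Spark master URL such as 'local[4]'.

    Hypothetical helper: max_threads mirrors the value used in this notebook.
    """
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    # Use at least one thread and never more than the cap or the core count.
    threads = max(1, min(max_threads, available))
    return f"local[{threads}]"

master = local_master()
print(master)
```

The resulting string could be passed to `SparkSession.builder.master(master)` in place of the hard-coded value.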

### Read PDB Archive
mmtf-pyspark reads the PDB in [MMTF format](https://mmtf.rcsb.org). This notebook uses a local copy of the PDB archive stored in the CyVerse Data Store.

In [None]:
pdb = mmtfReader.read_full_sequence_file()
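
If the read step fails, the archive location is a likely cause. The sketch below assumes that `read_full_sequence_file()` resolves its default location from the `MMTF_FULL` environment variable; please verify this against the API documentation for your installed version. The helper name `find_mmtf_full` is hypothetical.

```python
import os

def find_mmtf_full(env=None):
    """Return the archive path from MMTF_FULL, or None if unset or missing.

    Assumption: the reader takes its default location from MMTF_FULL.
    """
    env = os.environ if env is None else env
    path = env.get("MMTF_FULL")
    if path and os.path.exists(path):
        return path
    return None

if find_mmtf_full() is None:
    print("MMTF_FULL is not set or does not point to an existing path")
```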

### Apply filters
Create a subset of high-resolution protein structures.
1. Resolution <= 2 Å
2. X-ray structures
3. Structures containing exclusively protein chains (no DNA or RNA)

In [None]:
pdb = pdb.filter(Resolution(0.0, 2.0)) \
         .filter(ExperimentalMethods(ExperimentalMethods.X_RAY_DIFFRACTION)) \
         .filter(ContainsLProteinChain(exclusive=True))
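
To make the filter semantics concrete, here is a plain-Python sketch of the three conditions listed above, applied to made-up records. It does not use mmtf-pyspark and is not how the library implements its filters: the `Entry` class, its fields, and the sample values are all hypothetical, and "protein" stands in loosely for an L-protein chain.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    # Hypothetical stand-in for a structure record; not an mmtf-pyspark type.
    pdb_id: str
    resolution: float   # in angstroms
    method: str
    chain_types: tuple  # e.g. ("protein", "DNA")

def is_high_res_xray_protein(e, max_resolution=2.0):
    """True if the entry passes all three conditions."""
    return (
        0.0 <= e.resolution <= max_resolution
        and e.method == "X-RAY DIFFRACTION"
        and len(e.chain_types) > 0
        and all(t == "protein" for t in e.chain_types)  # exclusive: proteins only
    )

# Made-up sample data for illustration only.
entries = [
    Entry("AAAA", 1.5, "X-RAY DIFFRACTION", ("protein",)),
    Entry("BBBB", 2.8, "X-RAY DIFFRACTION", ("protein",)),
    Entry("CCCC", 1.8, "SOLUTION NMR", ("protein",)),
    Entry("DDDD", 1.2, "X-RAY DIFFRACTION", ("protein", "DNA")),
]
selected = [e.pdb_id for e in entries if is_high_res_xray_protein(e)]
print(selected)
```

Only the first record satisfies every condition; the others fail on resolution, experimental method, and chain composition respectively.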

### Create a list of the matching PDB IDs

In [None]:
pdb_ids = pdb.keys().collect()

In [None]:
print("Number of high-resolution protein structures:", len(pdb_ids))

### View structures
Use the slider to browse through the structures.

In [None]:
structureViewer.view_structure(pdb_ids, bioAssembly=False, style='cartoon', color='spectrum');

### Stop Spark
Always run the notebook to the end so that Spark is stopped. Multiple Spark sessions running at the same time can cause problems.

In [None]:
spark.stop()
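
When the same steps run in a script instead of a notebook, an error midway could leave the session running. One way to guard against that is a small context manager that always calls `stop()`. This is a sketch only: `stopping` and `FakeSession` are hypothetical names, and the fake session exists purely to demonstrate the pattern without starting Spark.

```python
from contextlib import contextmanager

@contextmanager
def stopping(session):
    """Yield the session and call its stop() method on exit, even on error."""
    try:
        yield session
    finally:
        session.stop()

class FakeSession:
    # Stand-in used only to demonstrate the pattern without starting Spark.
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

fake = FakeSession()
try:
    with stopping(fake):
        raise RuntimeError("simulated failure partway through")
except RuntimeError:
    pass
print(fake.stopped)
```

With a real session, the object returned by `getOrCreate()` would take the place of `fake`.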

# How to Get Started with mmtf-pyspark

## Try out the Demos
Navigate to the **mmtf-pyspark/demos** directory and sub-directories in the left-hand panel to try out the various features of mmtf-pyspark. 

## Work through the Online Tutorial
For a comprehensive introduction to mmtf-pyspark, work through the presentations, notebooks, and problems in our online workshop (https://github.com/sbl-sdsc/mmtf-workshop-2018).

## Read the mmtf-pyspark Documentation
See the mmtf-pyspark [API documentation](https://mmtf-pyspark.readthedocs.io/en/latest/).

## Try out other applications built on mmtf-pyspark
* [mmtf-genomics](https://github.com/sbl-sdsc/mmtf-genomics): Methods for mapping genomic data onto 3D protein structures.

* [mmtf-proteomics](https://github.com/sbl-sdsc/mmtf-proteomics): Methods for mapping proteomics data onto 3D protein structures.

## Get Help
Please post your questions or feature requests [here](https://github.com/sbl-sdsc/mmtf-pyspark/issues/new).