# Reading Structures
mmtf-pyspark operates on 3D structures in the MMTF file format.

Protein Data Bank structures are available in two MMTF data representations:
* full
  * All-atom representation
  * 0.001 Å coordinate precision, 0.01 B-factor and occupancy precision
* reduced
  * C-alpha atoms only for polypeptides
  * P-backbone atoms only for polynucleotides
  * All-atom representation for all other residue types
  * 0.1 Å coordinate precision, 0.1 B-factor and occupancy precision
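
To get a feel for what these precisions mean in practice, the following sketch rounds a made-up coordinate to the step size of each representation. The `quantize` helper and the example value are assumptions for illustration only; this is not the MMTF encoder itself.

```python
def quantize(value, divisor):
    """Round value to the nearest 1/divisor (e.g. divisor=1000 gives 0.001 steps)."""
    return round(value * divisor) / divisor

x = 12.3456                   # hypothetical x coordinate in Å

full_x = quantize(x, 1000)    # full representation: 0.001 Å steps
reduced_x = quantize(x, 10)   # reduced representation: 0.1 Å steps

print(full_x)                 # 12.346
print(reduced_x)              # 12.3
```

The reduced representation therefore trades coordinate detail for a much smaller data volume, which is usually acceptable for analyses that only need the overall chain trace.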

## Import pyspark and mmtfPyspark

In [1]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.io import mmtfReader

## Configure Spark

In [2]:
conf = SparkConf().setMaster("local[*]").setAppName("1-Input")
sc = SparkContext(conf = conf)

In [3]:
sc.defaultParallelism

4

## Download Structures
For a small list of PDB entries (tens to about a hundred), the download methods are the quickest way to import structures. Here we download a list of four structures in the full representation.

In [4]:
pdbids = ['1LQ9','1LXJ','4XPX','1P1J']
structures = mmtfReader.download_full_mmtf_files(pdbids, sc)

Structures are represented as key-value pairs:
* key: structure identifier (e.g., PDB ID)
* value: MmtfStructure (structure data)

We can print the keys and values using the collect() method. Note that the structures are loaded in an arbitrary order, so you cannot rely on the order of the results.

In [5]:
%%time
structures.keys().collect()

CPU times: user 8.73 ms, sys: 3.96 ms, total: 12.7 ms
Wall time: 1.88 s


['1P1J', '1LQ9', '1LXJ', '4XPX']
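
Because the order is arbitrary, any comparison against the requested IDs should ignore order. A minimal sketch in plain Python, using the key list printed above; `missing_ids` is a hypothetical helper, not part of mmtfPyspark.

```python
def missing_ids(requested, returned):
    """Return the requested IDs that do not appear among the returned keys."""
    return sorted(set(requested) - set(returned))

requested = ['1LQ9', '1LXJ', '4XPX', '1P1J']
returned = ['1P1J', '1LQ9', '1LXJ', '4XPX']    # order as returned by collect()

# Sorting both lists makes the comparison independent of the load order.
same_entries = sorted(requested) == sorted(returned)
missing = missing_ids(requested, returned)
```

In the notebook, `returned` would be the result of `structures.keys().collect()`.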

In [6]:
structures.values().collect()

[<mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x10fc64f28>,
 <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x10fc64e80>,
 <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x10fc8b240>,
 <mmtfPyspark.utils.mmtfStructure.MmtfStructure at 0x10fca6eb8>]

Spark represents these key-value pairs as a Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel. To see how the dataset is distributed, we can print the number of partitions.

In [7]:
structures.getNumPartitions()

4
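
The number of partitions does not say how evenly the structures are spread across them. One way to inspect this is sketched below with the standard RDD methods `glom()`, `map()`, and `collect()`; the helper name `partition_sizes` is an assumption for illustration.

```python
def partition_sizes(rdd):
    """Return the number of elements held by each partition of an RDD."""
    # glom() turns every partition into a single list, so len() counts
    # the elements of that partition; collect() brings the counts back.
    return rdd.glom().map(len).collect()
```

Calling `partition_sizes(structures)` in the notebook would return one count per partition. Note that this is an action, so it triggers loading of the data.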

## Reading structures from an MMTF Hadoop Sequence File
Next, we read PDB structures from a local copy of an MMTF Hadoop Sequence file. For the following examples to work, the MMTF_FULL and MMTF_REDUCED environment variables need to be set. See the installation instructions for details.
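
Before running the cells below, it can help to confirm that both variables are visible to the Python process. This is a minimal sketch using only the standard library; the helper name `missing_mmtf_variables` is an assumption, and the check only tests that the variables are set, not that the paths exist.

```python
import os

def missing_mmtf_variables(environ=os.environ):
    """Return the names of the required MMTF variables that are unset or empty."""
    required = ("MMTF_FULL", "MMTF_REDUCED")
    return [name for name in required if not environ.get(name)]

missing = missing_mmtf_variables()
if missing:
    print("Not set:", ", ".join(missing))
```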

If you have a long list (thousands) of PDB IDs, you can read the structures from a local copy of the MMTF Hadoop Sequence file. However, this is very inefficient for just a few structures, as in the example below.

In [8]:
structures = mmtfReader.read_full_sequence_file(sc, pdbids)

Hadoop Sequence file path: MMTF_FULL=/Users/peter/MMTF_Files/full_pisces25_2.2_drugs


Let's print the keys again and see how long this takes. You can see that Spark loads the data only when and if it's required.

In [9]:
%%time
structures.keys().collect()

CPU times: user 6.6 ms, sys: 3.07 ms, total: 9.67 ms
Wall time: 4.08 s


['1LQ9', '1LXJ', '4XPX', '1P1J']
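
The `%%time` magic is only available in IPython and Jupyter. Outside a notebook, the same kind of measurement can be sketched with a small helper; `timed` is a hypothetical name, not part of mmtfPyspark or Spark.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start
```

For example, `keys, seconds = timed(lambda: structures.keys().collect())` measures the action itself. Since Spark defers loading until an action runs, the time is spent in the `collect()` call rather than in the earlier read call.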

Now, let's read the entire PDB archive from the MMTF Hadoop Sequence file.

In [10]:
structures = mmtfReader.read_full_sequence_file(sc)

Hadoop Sequence file path: MMTF_FULL=/Users/peter/MMTF_Files/full_pisces25_2.2_drugs


In [13]:
structures = mmtfReader.read_sequence_file("/Users/peter/MMTF_Files/reduced_pisces25_2.2_drugs", sc)

In [14]:
%%time
structures.count()

CPU times: user 7.21 ms, sys: 3.96 ms, total: 11.2 ms
Wall time: 17.4 s


10707

Now, let's count the number of structures again. Should this be faster this time since we already loaded the entire PDB? 

No. The data from the Hadoop Sequence file are streamed through parallel threads and are not kept in memory. If you need the data again, as for the second count here, they are reloaded from scratch.

In [12]:
%%time
structures.count()

CPU times: user 9.41 ms, sys: 3.86 ms, total: 13.3 ms
Wall time: 41.2 s


10707
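
If the same structures are needed for several actions, Spark's standard `cache()` and `unpersist()` RDD methods can keep them in memory between actions. The helper below is a sketch under the assumption that the dataset fits into the available memory; `count_twice_cached` is a hypothetical name, and no timings are claimed for it here.

```python
def count_twice_cached(rdd):
    """Count an RDD twice, keeping it cached between the two passes."""
    rdd.cache()            # mark the RDD for in-memory storage; nothing is read yet
    first = rdd.count()    # first action: reads the data and fills the cache
    second = rdd.count()   # second action: can reuse the cached partitions
    rdd.unpersist()        # release the cached partitions when done
    return first, second
```

In the notebook this would be called as `count_twice_cached(structures)`. Whether the second pass is actually faster depends on how much of the data Spark can hold in memory.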

# Very Important: Stop Spark!!!
It is very important to run the notebook all the way to the sc.stop() statement to terminate Spark. Otherwise, you will end up running multiple instances of Spark that interfere with each other. If necessary, kill any running Spark processes using the Activity Monitor on Mac or the Task Manager on Windows.

In [28]:
sc.stop()