# Benchmark using Spark Dataframe
This benchmark demonstrates the efficiency of using columnar data formats. Here we run four benchmarks on the complete PDB to UniProt residue-level mapping, which had a total of 105,594,955 records as of July 28, 2018.

1. Count number of records
2. Run a query
3. Join datasets
4. Convert to a Pandas dataframe

In [1]:
from pyspark.sql import SparkSession

In [2]:
import time
start = time.time()

## Setup Spark

In [3]:
%%time
spark = SparkSession.builder.master("local[4]").appName("Example").getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")

CPU times: user 30.8 ms, sys: 22.8 ms, total: 53.6 ms
Wall time: 18.3 s
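
The `%%time` magic used in these cells is only available in IPython/Jupyter. It reports the CPU time of the Python driver process, which stays in the millisecond range because Spark does its work in the JVM, so the wall time is the number to compare. To run the same benchmark as a plain script, a small context manager could stand in for `%%time`. This is a minimal sketch using only the standard library; the name `timed` and the `results` dictionary are hypothetical helpers, not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print CPU and wall-clock time for the enclosed block (hypothetical helper)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this Python process only
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        if results is not None:
            results[label] = (wall, cpu)
        print(f"{label}: CPU {cpu * 1e3:.1f} ms, wall {wall:.2f} s")


# Stand-in workload: the process waits, so wall time far exceeds CPU time,
# much like a driver waiting for Spark to finish a job.
timings = {}
with timed("sleep", timings):
    time.sleep(0.2)
```

In a script, the body of each benchmark cell would go inside a `with timed("..."):` block instead of the `time.sleep` call.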


# 1. Count number of records
Read the PDB to UniProt mapping file, stored in the ORC columnar data format, and drop records with missing values before counting.

In [4]:
%%time
# Dataset in ORC format
ds = spark.read.orc("./data/pdb2uniprot_residues.orc.lzo").dropna()

# Dataset in Parquet format
#ds = spark.read.parquet("./data/pdb2uniprot_residues.parquet.gzip").dropna()

print("total number of records:", ds.count())

total number of records: 96162206
CPU times: user 5.05 ms, sys: 2.57 ms, total: 7.62 ms
Wall time: 10.7 s
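
The count above is lower than the 105,594,955 records quoted in the introduction because `dropna()` removes rows that contain a null value. Assuming the file read here is the same complete mapping, the difference can be worked out as follows (a small arithmetic sketch, with both numbers copied from this notebook):

```python
total_records = 105_594_955     # full mapping, as quoted in the introduction
non_null_records = 96_162_206   # count reported after dropna()

# Rows that contained at least one null value and were dropped
dropped = total_records - non_null_records
print(f"rows removed by dropna(): {dropped:,} ({dropped / total_records:.1%})")
# rows removed by dropna(): 9,432,749 (8.9%)
```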


# 2. Run a query
## Find Mitogen-activated protein kinase 14
Here we query the PDB to UniProt mappings for UniProt ID Q16539 (MK14_HUMAN) and retrieve the residue-level mappings for the residues that are observed in the PDB structures.

In [5]:
%%time
mk14_human = ds.filter("uniprotId == 'Q16539'").cache()

print("Number of distinct chains :", mk14_human.select("structureChainId").distinct().count())
print("Number of residue mappings:", mk14_human.count())
mk14_human.show(5)

Number of distinct chains : 243
Number of residue mappings: 82277
+----------------+---------+---------+---------+----------+
|structureChainId|pdbResNum|pdbSeqNum|uniprotId|uniprotNum|
+----------------+---------+---------+---------+----------+
|          2ZB1.A|        4|        4|   Q16539|         4|
|          2ZB1.A|        5|        5|   Q16539|         5|
|          2ZB1.A|        6|        6|   Q16539|         6|
|          2ZB1.A|        7|        7|   Q16539|         7|
|          2ZB1.A|        8|        8|   Q16539|         8|
+----------------+---------+---------+---------+----------+
only showing top 5 rows

CPU times: user 5.87 ms, sys: 2.92 ms, total: 8.79 ms
Wall time: 8.85 s


# 3. Join operation

In [6]:
# create a random dataset of ~10,000 chains
sample = (
    ds.sample(withReplacement=False, fraction=0.0001, seed=1)
    .select("structureChainId")
    .withColumnRenamed("structureChainId", "id")
    .distinct()
    .cache()
)

print("Sample size:", sample.count())
sample.show(5)

Sample size: 9330
+------+
|    id|
+------+
|4AAQ.L|
|2WGG.A|
|2GBF.A|
|3IE2.D|
|4XET.A|
+------+
only showing top 5 rows
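
The sample size of 9330 is consistent with the chosen `fraction`. `sample` does not return an exact fraction of the rows, and `distinct()` then collapses rows that come from the same chain. The sketch below estimates the expected number of sampled rows, assuming each of the 96,162,206 rows is kept independently with probability 0.0001 (an assumption about the sampler, not something measured here):

```python
import math

n_rows = 96_162_206        # records after dropna(), from the count above
fraction = 0.0001          # fraction passed to ds.sample
observed_distinct = 9_330  # distinct chain ids reported above

# Mean and standard deviation of the number of sampled rows
# under the independent-sampling assumption
expected_rows = n_rows * fraction
std_rows = math.sqrt(n_rows * fraction * (1 - fraction))

print(f"expected sampled rows: {expected_rows:.0f} +/- {std_rows:.0f}")
print(f"distinct chains observed: {observed_distinct}")
```

The number of distinct chains can only be equal to or smaller than the number of sampled rows, since a chain contributes a single id even when more than one of its residues is sampled.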



Now we use this sample to run an inner join that selects the residue-level records for these ~10,000 chains.

In [7]:
%%time
subset = ds.join(sample, ds.structureChainId == sample.id).drop(sample.id)

print("Number of residues in subset:", subset.count())
subset.show(5)

Number of residues in subset: 3566391
+----------------+---------+---------+---------+----------+
|structureChainId|pdbResNum|pdbSeqNum|uniprotId|uniprotNum|
+----------------+---------+---------+---------+----------+
|          1CJ0.B|       -8|        1|   P07511|        15|
|          1CJ0.B|       -7|        2|   P07511|        16|
|          1CJ0.B|       -6|        3|   P07511|        17|
|          1CJ0.B|       -5|        4|   P07511|        18|
|          1CJ0.B|       -4|        5|   P07511|        19|
+----------------+---------+---------+---------+----------+
only showing top 5 rows

CPU times: user 6.85 ms, sys: 3 ms, total: 9.85 ms
Wall time: 9.57 s
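
Because `sample` holds distinct chain ids, the inner join above keeps each matching row of `ds` exactly once and drops all others. The plain-Python sketch below illustrates only this semantics on a few rows copied from the outputs in this notebook. The set `sample_ids` is a toy stand-in for the `sample` DataFrame, and this is not how Spark executes the join internally.

```python
# (structureChainId, pdbResNum, pdbSeqNum, uniprotId, uniprotNum)
residues = [
    ("1CJ0.B", -8, 1, "P07511", 15),
    ("1CJ0.B", -7, 2, "P07511", 16),
    ("2ZB1.A", 4, 4, "Q16539", 4),
    ("2ZB1.A", 5, 5, "Q16539", 5),
]

# Toy stand-in for the `sample` DataFrame: a set of distinct chain ids
sample_ids = {"1CJ0.B"}

# Inner join on the chain id: keep rows whose id appears in the sample
subset = [row for row in residues if row[0] in sample_ids]

for row in subset:
    print(row)
```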


# 4. Convert from Spark to Pandas dataframe

In [8]:
%%time
mk14_human.toPandas().head()

CPU times: user 741 ms, sys: 121 ms, total: 862 ms
Wall time: 1.79 s


   structureChainId  pdbResNum  pdbSeqNum  uniprotId  uniprotNum
0            2ZB1.A          4          4     Q16539           4
1            2ZB1.A          5          5     Q16539           5
2            2ZB1.A          6          6     Q16539           6
3            2ZB1.A          7          7     Q16539           7
4            2ZB1.A          8          8     Q16539           8


In [9]:
spark.stop()

In [10]:
end = time.time()
print("Spark dataframe total time", end-start, "sec.")

Spark dataframe total time 57.3903911113739 sec.
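
To see where the total time goes, the wall times reported by `%%time` above can be added up and compared with the overall total. This is a small arithmetic sketch with the numbers copied from the cell outputs; the step labels are our own shorthand.

```python
# Wall times in seconds, copied from the %%time outputs above
wall_times = {
    "setup Spark": 18.3,
    "count records": 10.7,
    "query": 8.85,
    "join": 9.57,
    "toPandas": 1.79,
}
total_reported = 57.39  # overall total printed above, rounded

timed_sum = sum(wall_times.values())
untimed = total_reported - timed_sum  # cells without %%time

for step, seconds in wall_times.items():
    print(f"{step:<14} {seconds:6.2f} s  ({seconds / total_reported:5.1%})")
print(f"{'timed cells':<14} {timed_sum:6.2f} s")
print(f"{'untimed cells':<14} {untimed:6.2f} s")
```

The untimed remainder covers the cells that were run without `%%time`, such as creating the sample dataset and stopping Spark.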
