# Benchmark using Pandas dataframe
This benchmark demonstrates the efficiency of columnar data formats. We run three benchmarks on the complete PDB-to-UniProt residue-level mapping, which contained 105,594,955 records as of July 28, 2018.

1. Count number of records
2. Run a query
3. Join datasets

In [1]:
import pandas as pd

In [2]:
import time
start = time.time()

# 1. Count number of records
Read the PDB-to-UniProt mapping file in the Parquet columnar data format (at the time of this benchmark, Pandas did not support the ORC format).

In [3]:
%%time
df = pd.read_parquet("../data/pdb2uniprot_residues.parquet.gzip").dropna()
df = df.astype({'uniprotNum': 'int32'})

CPU times: user 48.9 s, sys: 20.6 s, total: 1min 9s
Wall time: 1min 11s
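
The `.dropna()` call above presumably matters for the cast that follows it: a numeric column containing missing values is held as `float64`, and it cannot be cast to `int32` until those rows are gone. The following is a minimal sketch on invented data (the chain ID and values are illustrative and are not taken from the mapping file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mapping table; every value here is made up.
toy = pd.DataFrame({
    "structureChainId": ["1ABC.A", "1ABC.A", "1ABC.A", "1ABC.A"],
    "uniprotNum": [1.0, 2.0, np.nan, 4.0],  # NaN: a hypothetical unmapped residue
})

# NaN cannot be stored in a plain integer column, so the dtype is float64.
print(toy["uniprotNum"].dtype)

# Drop rows with any missing value, then narrow the column to int32.
clean = toy.dropna()
clean = clean.astype({"uniprotNum": "int32"})

print(len(toy), "->", len(clean))
print(clean["uniprotNum"].dtype)
```

This may also explain why the record count reported by the notebook after loading is lower than the total quoted in the introduction.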


In [4]:
%%time
print("total number of records: ", df.shape[0])

total number of records:  96162206
CPU times: user 198 µs, sys: 62 µs, total: 260 µs
Wall time: 217 µs


In [5]:
df.head()

| | structureChainId | pdbResNum | pdbSeqNum | uniprotId | uniprotNum |
|---|---|---|---|---|---|
| 0 | 1A5E.A | 1 | 1 | P42771 | 1 |
| 1 | 1A5E.A | 2 | 2 | P42771 | 2 |
| 2 | 1A5E.A | 3 | 3 | P42771 | 3 |
| 3 | 1A5E.A | 4 | 4 | P42771 | 4 |
| 4 | 1A5E.A | 5 | 5 | P42771 | 5 |


# 2. Run a query
## Find Mitogen-activated protein kinase 14
Here we query the PDB-to-UniProt mappings for UniProt ID Q16539 (MK14_HUMAN) and retrieve the residue-level mappings for residues that are observed in the PDB structures.

In [6]:
%%time
mk14_human = df.query("uniprotId == 'Q16539'")

print("Number of distinct chains :", mk14_human['structureChainId'].nunique())
print("Number of residue mappings:", mk14_human.shape[0])

Number of distinct chains : 243
Number of residue mappings: 82277
CPU times: user 1.88 s, sys: 515 ms, total: 2.4 s
Wall time: 2.42 s
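
As a small illustration of what the query cell computes, the sketch below runs the same `DataFrame.query` expression on an invented four-row table and compares it with the equivalent boolean-mask filter. The chain ID `9XYZ.B` and the row values are assumptions made up for the example.

```python
import pandas as pd

# Invented rows shaped like the mapping table.
toy = pd.DataFrame({
    "structureChainId": ["2ZB1.A", "2ZB1.A", "1A5E.A", "9XYZ.B"],
    "uniprotId": ["Q16539", "Q16539", "P42771", "Q16539"],
    "uniprotNum": [4, 5, 1, 4],
})

# String expression, as used in the notebook.
by_query = toy.query("uniprotId == 'Q16539'")

# Equivalent boolean-mask filter.
by_mask = toy[toy["uniprotId"] == "Q16539"]

print("Same result:", by_query.equals(by_mask))
print("Number of distinct chains :", by_query["structureChainId"].nunique())
print("Number of residue mappings:", by_query.shape[0])
```

Both forms select the same rows; which one is faster on the full table is not measured here.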


In [7]:
mk14_human.head()

| | structureChainId | pdbResNum | pdbSeqNum | uniprotId | uniprotNum |
|---|---|---|---|---|---|
| 159186 | 2ZB1.A | 4 | 4 | Q16539 | 4 |
| 159187 | 2ZB1.A | 5 | 5 | Q16539 | 5 |
| 159188 | 2ZB1.A | 6 | 6 | Q16539 | 6 |
| 159189 | 2ZB1.A | 7 | 7 | Q16539 | 7 |
| 159190 | 2ZB1.A | 8 | 8 | Q16539 | 8 |


# 3. Join operation

In [8]:
# create a random dataset of ~10,000 chains
sample = df.sample(frac=0.0001, random_state=1)['structureChainId'].drop_duplicates()

print("Sample size:", sample.shape[0])
sample.head()

Sample size: 9429


16583128    4TNI.b
27415911    1DZE.A
90344974    2F2H.C
12942022    2R0I.A
98525621    4BBS.I
Name: structureChainId, dtype: object
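
The sample size is somewhat below the number of rows drawn: `frac=0.0001` of 96,162,206 rows is roughly 9,616 rows, and `drop_duplicates()` then collapses rows that happen to come from the same chain. A minimal sketch of that two-step sampling on invented data (the `C000.A`-style chain IDs are hypothetical):

```python
import pandas as pd

# 1,000 invented rows: 100 hypothetical chains with 10 residues each.
toy = pd.DataFrame({
    "structureChainId": [f"C{i:03d}.A" for i in range(100) for _ in range(10)],
    "pdbSeqNum": list(range(1, 11)) * 100,
})

# Draw 5% of the rows reproducibly, keeping only the chain ID column.
drawn = toy.sample(frac=0.05, random_state=1)["structureChainId"]

# Several drawn rows may belong to the same chain; keep each chain once.
chains = drawn.drop_duplicates()

print("rows drawn:     ", drawn.shape[0])
print("distinct chains:", chains.shape[0])
```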

Now we use this sample dataset to run a database-style inner join against ~10,000 chain IDs.

In [9]:
%%time
#subset = df.merge(sample, left_on='structureChainId', right_on='id').drop(columns='id')
subset = df.merge(sample, on='structureChainId')
print("Number of residues in subset:", subset.shape[0])

Number of residues in subset: 3703322
CPU times: user 5.57 s, sys: 1.06 s, total: 6.63 s
Wall time: 6.73 s
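
To make the join semantics concrete, the sketch below merges an invented table with a named `Series` of keys, as the cell above does, and compares the result with an `isin` filter. The two agree here only because the keys are unique, which `drop_duplicates()` guarantees for the sample; the row values are made up.

```python
import pandas as pd

# Invented rows shaped like the mapping table.
toy = pd.DataFrame({
    "structureChainId": ["1A5E.A", "1A5E.A", "2ZB1.A", "1E7L.A"],
    "uniprotNum": [1, 2, 4, 1],
})

# Unique join keys; the Series name must match the join column.
keys = pd.Series(["1A5E.A", "1E7L.A"], name="structureChainId")

# Inner join (the default for merge): keep rows whose chain ID is in keys.
joined = toy.merge(keys, on="structureChainId")

# With unique keys, a membership filter selects the same rows.
filtered = toy[toy["structureChainId"].isin(keys)]

print("rows after merge:", joined.shape[0])
print("rows after isin: ", filtered.shape[0])
```

Unlike the filter, the merge returns a fresh 0-based index, which matches the index shown for `subset`.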


In [10]:
subset.head()

| | structureChainId | pdbResNum | pdbSeqNum | uniprotId | uniprotNum |
|---|---|---|---|---|---|
| 0 | 1E7L.A | 1 | 1 | P13340 | 1 |
| 1 | 1E7L.A | 2 | 2 | P13340 | 2 |
| 2 | 1E7L.A | 3 | 3 | P13340 | 3 |
| 3 | 1E7L.A | 4 | 4 | P13340 | 4 |
| 4 | 1E7L.A | 5 | 5 | P13340 | 5 |


In [11]:
end = time.time()
print("Pandas dataframe total time", end-start, "sec.")

Pandas dataframe total time 87.95784878730774 sec.
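
The total above comes from a pair of `time.time()` calls wrapped around the whole notebook. If per-step totals are wanted outside of `%%time`, one option is a small context manager built on `time.perf_counter`, a monotonic clock intended for measuring intervals. The helper name `timed` and the `timings` dictionary are hypothetical and not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; in the notebook this would be a read, query, or join.
with timed("sum", timings):
    total = sum(range(1_000_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} sec.")
```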


In [None]:
%load_ext watermark
%watermark 