# Speed comparison between PyPairs and the R version - PyPairs

Here we run the sandbag step of the original Pairs method on the Oscope dataset for increasingly large subsets of genes and record the execution time of each run. All timings are single-core (`processes=1`). For the results, see: [2.3 Differences in code - Python](./2.3%20Differences%20in%20code%20-%20R.ipynb)

<div id="toc"></div>

## Necessary Imports

In [2]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

In [2]:
import sys
code = "./../../code/"
data = "./../../data/"
sys.path.append(code)
import pandas
import pypairs as pairs
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import numpy as np
from pathlib import Path
from tqdm import tqdm_notebook as tqdm
import helper
import timeit

init_notebook_mode(connected=True)

## Loading Oscope Dataset

In [3]:
# Load matrix
oscope_gencounts = pandas.read_csv(Path(data + "data/GSE64016_H1andFUCCI_normalized_EC_human.csv"))

# Set index right
oscope_gencounts.set_index("Unnamed: 0", inplace=True)

# Subset sorted
oscope_gencounts_sorted = oscope_gencounts.iloc[:, [oscope_gencounts.columns.get_loc(c) for c in oscope_gencounts.columns if "G1_" in c or "G2_" in c or "S_" in c]]

# Define annotation
is_G1 = [oscope_gencounts_sorted.columns.get_loc(c) for c in oscope_gencounts_sorted.columns if "G1_" in c]
is_S = [oscope_gencounts_sorted.columns.get_loc(c) for c in oscope_gencounts_sorted.columns if "S_" in c]
is_G2M = [oscope_gencounts_sorted.columns.get_loc(c) for c in oscope_gencounts_sorted.columns if "G2_" in c]

annotation = {
    "G1": list(is_G1),
    "S": list(is_S),
    "G2M": list(is_G2M)
}

no_genes = len(oscope_gencounts_sorted.index) - 1

print("Total number of genes in oscope dataset {}".format(no_genes))

Total number of genes in oscope dataset 19083
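The annotation step above maps each cell-cycle phase to the positional indices of its sample columns. As a minimal sketch of the same idea on a toy list of column names (the names below are hypothetical and only mimic the `G1_` / `S_` / `G2_` prefixes used in the dataset), one could write:

```python
# Hypothetical column names; the real matrix has 247 labelled samples.
columns = ["H1_Exp1.001", "G1_Cell1", "S_Cell1", "G2_Cell1", "G1_Cell2"]

# Phase name -> column-name prefix, mirroring the notebook's G1/S/G2M split.
prefixes = {"G1": "G1_", "S": "S_", "G2M": "G2_"}

# Keep only columns that carry one of the phase prefixes (drops unlabelled H1 cells).
labelled = [c for c in columns if c.startswith(tuple(prefixes.values()))]

# For each phase, collect the positions of its columns within the labelled subset.
toy_annotation = {
    phase: [i for i, c in enumerate(labelled) if c.startswith(prefix)]
    for phase, prefix in prefixes.items()
}

print(toy_annotation)
```

This sketch uses `str.startswith` rather than the substring test (`"S_" in c`) from the cell above; the prefix check is slightly stricter, which may matter if a label ever appears in the middle of a column name.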


## Running sandbag with increasing number of genes

Note: this cell has a long runtime. The measured timings are saved with the `%store` magic; see [Results](#Results).

In [4]:
t = []
genes = [10,100,500,1000,5000,10000,19000]
for g in tqdm(genes):
    
    sub = helper.random_subset(range(0, no_genes), g)
    subset = oscope_gencounts_sorted.iloc[sub, :]
    
    start = timeit.default_timer()
    oscope_marker_pairs = pairs.sandbag(x=subset, phases=annotation, fraction=0.65, processes=1, verbose=True)
    time_sandbag = timeit.default_timer() - start
    t.append(time_sandbag)


[__set_matrix] Original Matrix 'x' has shape 10 x 247
[__set_matrix] Removed 2 genes that were not expressed in any samples. 8 genes remaining.
[__set_matrix] Removed 0 samples that were not annotated in 'phases'. 247 samples remaining.
[__set_matrix] Matrix truncation done. Working with 8 genes for 247 samples.
[sandbag] Identifying marker pairs... Done!
[sandbag] Identified 0 marker pairs (phase: count): {'G1': 0, 'S': 0, 'G2M': 0}
[__set_matrix] Original Matrix 'x' has shape 100 x 247
[__set_matrix] Removed 9 genes that were not expressed in any samples. 91 genes remaining.
[__set_matrix] Removed 0 samples that were not annotated in 'phases'. 247 samples remaining.
[__set_matrix] Matrix truncation done. Working with 91 genes for 247 samples.
[sandbag] Identifying marker pairs... Done!
[sandbag] Identified 0 marker pairs (phase: count): {'G1': 0, 'S': 0, 'G2M': 0}
[__set_matrix] Original Matrix 'x' has shape 500 x 247
[__set_matrix] Removed 49 genes that were not expressed in any sam

In [8]:
%store t

Stored 't' (list)
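The timing loop above depends on `helper.random_subset`, whose code is not shown in this notebook. The following is a minimal sketch of the pattern, assuming the helper draws `k` distinct indices without replacement; `random_subset` and `time_call` here are hypothetical stand-ins, and `sorted` replaces `pairs.sandbag` as the workload so the example runs on its own:

```python
import random
import timeit


def random_subset(population, k, seed=None):
    """Draw k distinct items (assumed behaviour of helper.random_subset)."""
    rng = random.Random(seed)
    return rng.sample(list(population), k)


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = timeit.default_timer()
    result = func(*args, **kwargs)
    return result, timeit.default_timer() - start


sizes = [10, 100, 1000]
timings = []
for size in sizes:
    idx = random_subset(range(10_000), size, seed=0)
    # Placeholder workload; the notebook times pairs.sandbag on the gene subset.
    _, elapsed = time_call(sorted, idx)
    timings.append(elapsed)

print(dict(zip(sizes, timings)))
```

Only the call itself sits between the two timer reads, so the subset sampling is excluded from the measurement, as in the loop above. Note that `timeit.default_timer` returns seconds.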


## Results

Python times are fetched with the `%store` magic; R times were copied manually.

In [9]:
%store -r

In [10]:
t_python = t
t_r = [0.01, 0.08, 1.49, 6.37, 180.56, 803.64, 2761.00]
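To get a rough feel for how runtime grows with the number of genes, one option is to fit a straight line to the timings on a log-log scale; the slope then approximates the exponent `k` in `time ~ genes^k`. The sketch below does this for the R timings listed above using an ordinary least-squares fit. It is an illustration only, not part of the original analysis, and the same function could be applied to `t_python` once it has been restored from the store.

```python
import math

genes = [10, 100, 500, 1000, 5000, 10000, 19000]
t_r = [0.01, 0.08, 1.49, 6.37, 180.56, 803.64, 2761.00]


def loglog_slope(sizes, times):
    """Least-squares slope of log10(times) against log10(sizes)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx


slope_r = loglog_slope(genes, t_r)
print("Empirical scaling exponent (R): {:.2f}".format(slope_r))
```

A slope near 2 would be consistent with work that grows with the number of gene pairs, but with only seven points, and very small timings at the low end, the estimate should be read as indicative at best.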

In [11]:
# Create traces
trace0 = go.Scatter(
    x= [10,100,500,1000,5000,10000,19000],
    y= t_python,
    mode='markers+lines',
    marker=dict(
        symbol='circle',
        size=10,
        color='green',
    ),
    name='PyPairs'
)

trace1 = go.Scatter(
    x= [10,100,500,1000,5000,10000,19000],
    y= t_r,
    mode='markers+lines',
    marker=dict(
        symbol='square',
        size=10,
        color='blue',
    ),
    name='R Version'
)

layout = go.Layout(
    title='Speed comparison: R implementation vs PyPairs',
    xaxis=dict(
        title='No. of genes',
    ),
    yaxis=dict(
        title='Time in s',  # timeit.default_timer() measures seconds
    )
)

# Use a separate name so the `data` path string from the imports cell is not overwritten
fig = go.Figure(data=[trace0, trace1], layout=layout)

iplot(fig)