PERF: Sparse ops speedup #13082

sinhrks · 2016-05-04T12:32:14Z

tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Follow-up for #13036. The performance improvement is modest because the number of list.append calls being replaced was not very large.

import numpy as np
import pandas as pd

np.random.seed(1)
N = 1000000
a = np.array([np.nan] * N)
b = np.array([np.nan] * N)

# Use integer division: np.random.randint needs an integer size
indexer_a = np.unique(np.random.randint(0, N, N // 10))
indexer_b = np.unique(np.random.randint(0, N, N // 10))
a[indexer_a] = np.random.randint(0, 100, len(indexer_a))
b[indexer_b] = np.random.randint(0, 100, len(indexer_b))
sa = pd.SparseArray(a)
sb = pd.SparseArray(b)
%timeit sa.sp_index.intersect(sb.sp_index)

# before
#100 loops, best of 3: 3.04 ms per loop

# after
#100 loops, best of 3: 2.11 ms per loop

def make_sparse_array(length, num_blocks, block_size, fill_value):
    a = np.array([fill_value] * length)
    for block in range(num_blocks):
        i = np.random.randint(0, length)
        a[i:i + block_size] = np.random.randint(0, 100, len(a[i:i + block_size]))
    return pd.SparseArray(a, fill_value=fill_value)

N = 1000000
B = 10000
BS = 10

a = make_sparse_array(length=N, num_blocks=B, block_size=BS, fill_value=np.nan)
b = make_sparse_array(length=N, num_blocks=B, block_size=BS, fill_value=np.nan)

%timeit a + b
# before
#10 loops, best of 3: 70.8 ms per loop

# after
#10 loops, best of 3: 66 ms per loop

jreback · 2016-05-04T12:40:06Z

pandas/src/sparse.pyx

@@ -124,9 +124,11 @@ cdef class IntIndex(SparseIndex):

            # TODO: would a two-pass algorithm be faster?
            if yindices[yi] == xind:
-                new_list.append(xind)
+                new_indices[result_indexer] = xind


Add a test where x is y (the two inputs are the same).
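
For readers following the diff, here is a minimal pure-Python sketch of the idea: write matches into a preallocated buffer instead of calling list.append, then trim the unused tail. It is not the Cython code from pandas/src/sparse.pyx. The function name `intersect_sorted` is hypothetical, and the sketch assumes both inputs are sorted and free of duplicates. The last call covers the requested case where both inputs are the same.

```python
def intersect_sorted(xindices, yindices):
    """Two-pointer intersection of two sorted, duplicate-free index lists.

    Illustrative sketch only; the name and signature are assumptions,
    not part of pandas.
    """
    # The result can never be longer than the shorter input, so allocate once.
    new_indices = [0] * min(len(xindices), len(yindices))
    result_indexer = 0
    yi = 0
    for xind in xindices:
        # Advance y until it catches up with the current x index.
        while yi < len(yindices) and yindices[yi] < xind:
            yi += 1
        if yi >= len(yindices):
            break
        if yindices[yi] == xind:
            # Indexed write into the buffer replaces new_list.append(xind).
            new_indices[result_indexer] = xind
            result_indexer += 1
    # Trim the unused tail of the buffer.
    return new_indices[:result_indexer]


x = [1, 4, 7, 9]
y = [2, 4, 9, 12]
print(intersect_sorted(x, y))       # [4, 9]
print(intersect_sorted(x, x) == x)  # True: same input on both sides
```

In pure Python the buffer brings little benefit by itself; the gain in the PR comes from doing the indexed write on a typed array in Cython.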

jreback · 2016-05-04T20:53:36Z

lgtm. ping on green.

sinhrks · 2016-05-05T05:05:32Z

green except for codecov.

jreback · 2016-05-05T12:03:46Z

thanks!

and hopefully turned off annoying codecov warnings.

sinhrks added the Performance (Memory or execution speed performance) and Sparse (Sparse Data Type) labels May 4, 2016

sinhrks added this to the 0.18.2 milestone May 4, 2016

sinhrks force-pushed the sparse_perf branch from b73a450 to 7fb5346 Compare May 4, 2016 12:33

jreback reviewed May 4, 2016
sinhrks force-pushed the sparse_perf branch from 7fb5346 to 28a7908 Compare May 4, 2016 20:45

PERF: Sparse ops speedup

acf5933

sinhrks force-pushed the sparse_perf branch from 28a7908 to acf5933 Compare May 4, 2016 21:51

jreback closed this in c5f4d9c May 5, 2016

sinhrks deleted the sparse_perf branch May 5, 2016 12:11

This pull request was closed.
