Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Sparse ops speedup #13082

Closed
wants to merge 1 commit into from
Closed

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented May 4, 2016

  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

Follow-up for #13036. Perf improvements are not significant because the number of previous list.append call is not so large.

import numpy as np
import pandas as pd

np.random.seed(1)
N = 1000000
a = np.array([np.nan] * N)
b = np.array([np.nan] * N)

indexer_a = np.unique(np.random.randint(0, N, N / 10))
indexer_b = np.unique(np.random.randint(0, N, N / 10))
a[indexer_a] = np.random.randint(0, 100, len(indexer_a))
b[indexer_b] = np.random.randint(0, 100, len(indexer_b))
sa = pd.SparseArray(a)
sb = pd.SparseArray(b)
%timeit a.sp_index.intersect(sb.sp_index0)

# before
#100 loops, best of 3: 3.04 ms per loop

# after
#100 loops, best of 3: 2.11 ms per loop
def make_sparse_array(length, num_blocks, block_size, fill_value):
    a = np.array([fill_value] * length)
    for block in range(num_blocks):
        i = np.random.randint(0, length)
        a[i:i + block_size] = np.random.randint(0, 100, len(a[i:i + block_size]))
    return pd.SparseArray(a, fill_value=fill_value)

N = 1000000
B = 10000
BS = 10

a = make_sparse_array(length=N, num_blocks=B,  block_size=BS, fill_value=np.nan) 
b = make_sparse_array(length=N, num_blocks=B,  block_size=BS, fill_value=np.nan) 

%timeit a + b
# before
#10 loops, best of 3: 70.8 ms per loop

# after
#10 loops, best of 3: 66 ms per loop

@sinhrks sinhrks added Performance Memory or execution speed performance Sparse Sparse Data Type labels May 4, 2016
@sinhrks sinhrks added this to the 0.18.2 milestone May 4, 2016
@@ -124,9 +124,11 @@ cdef class IntIndex(SparseIndex):

# TODO: would a two-pass algorithm be faster?
if yindices[yi] == xind:
new_list.append(xind)
new_indices[result_indexer] = xind
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test where x is y (the input is the same)

@jreback
Copy link
Contributor

jreback commented May 4, 2016

lgtm. ping on green.

@sinhrks
Copy link
Member Author

sinhrks commented May 5, 2016

green except for codecov.

@jreback jreback closed this in c5f4d9c May 5, 2016
@jreback
Copy link
Contributor

jreback commented May 5, 2016

thanks!

and hopefully turned off annoying codecov warnings.

@sinhrks sinhrks deleted the sparse_perf branch May 5, 2016 12:11
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants