Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG-REPORT] unexpected behavior of searchsorted #1674

Open
matteobachetti opened this issue Nov 1, 2021 · 11 comments
Open

[BUG-REPORT] unexpected behavior of searchsorted #1674

matteobachetti opened this issue Nov 1, 2021 · 11 comments
Labels
bug needed: more information There is not enough information to reproduce the issue, or understand the problem

Comments

@matteobachetti
Copy link

matteobachetti commented Nov 1, 2021

Hello,

Thanks for the excellent library. I ran into a couple of problems when using np.searchsorted (or using the seachsorted method of expressions)

Description
I paste here the relevant MWEs:

In [1]: import vaex; vaex.__version__
Out[1]: 
{'vaex': '4.5.0',
 'vaex-core': '4.5.1',
 'vaex-viz': '0.5.0',
 'vaex-hdf5': '0.10.0',
 'vaex-server': '0.6.1',
 'vaex-astro': '0.9.0',
 'vaex-jupyter': '0.6.0',
 'vaex-ml': '0.14.0'}

In [2]: import numpy as np

In [3]: import vaex

In [4]: df = vaex.from_arrays(x=np.arange(0, 100, 0.1))

In [5]: df.x.searchsorted(55).evaluate()   # <-- I use a single value here, and it works
Out[5]: 550

In [6]: df.x.searchsorted(np.array([55])).evaluate()  # <-- The value to search is an array now
Out[6]: 
array([550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550,
       550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550,
       550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550,
(...)

In practice, instead of the expected array([550]), I get an array of the same shape as df.x.
If I try to search for an array longer than 1, it fails:

In [7]: df.x.searchsorted(np.array([55, 57])).evaluate()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-b69e14c97073> in <module>
----> 1 df.x.searchsorted(np.array([55, 57])).evaluate()

(...)
ValueError: could not broadcast input array from shape (2,) into shape (1024,)

Even stranger thing: when the original dataframe is longer than 103 elements, it fails in yet another way, even searching for a single number

In [8]: df = vaex.from_arrays(x=np.arange(0, 103, 0.1))

In [9]: df.x.searchsorted(55).evaluate()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-258641766582> in <module>
----> 1 df.x.searchsorted(55).evaluate()
(...)
ValueError: Unknown type: 550

Software information

  • Vaex version (import vaex; vaex.__version__):
{'vaex': '4.5.0',
'vaex-core': '4.5.1',
'vaex-viz': '0.5.0',
'vaex-hdf5': '0.10.0',
'vaex-server': '0.6.1',
'vaex-astro': '0.9.0',
'vaex-jupyter': '0.6.0',
'vaex-ml': '0.14.0'}
  • Vaex was installed via: pip
  • OS: Mac OS Big Sur on a Intel Macbook Pro
@JovanVeljanoski
Copy link
Member

Thanks! I think this is a bug. Not sure how to fix it to be honest. Would you like to make a simple unit text exposing this issue?

@matteobachetti
Copy link
Author

I think something like this should be enough to show the problem. I wrote the same tests with simple Numpy arrays and vaex DataFrames, for comparison.

import vaex
import numpy as np

class TestVaex(object):
    @classmethod
    def setup_class(cls):
        cls.data = np.arange(0, 200, 0.1)
        cls.df = vaex.from_arrays(x=cls.data)

    def test_numpy_searchsorted_single_number(self):
        assert np.searchsorted(self.data, 55)== 550

    def test_numpy_searchsorted_one_element(self):
        assert np.equal(np.searchsorted(self.data, [55]),  [550])

    def test_numpy_searchsorted_two_elements(self):
        assert np.all(np.equal(np.searchsorted(self.data, [55, 55]),  [550, 550]))

    def test_vaex_searchsorted_single_number(self):
        assert self.df.x.searchsorted(55).evaluate() == 550

    def test_vaex_searchsorted_one_element(self):
        assert np.equal(self.df.x.searchsorted([55]).evaluate(), [550])

    def test_vaex_searchsorted_two_elements(self):
        assert np.all(np.equal(self.df.x.searchsorted([55, 55]).evaluate(),  [550, 550]))

Thanks a lot for your help, and again, for this beautiful library.

@maartenbreddels
Copy link
Member

Hi Matteo

thanks for your report
To be honest, I don't know why we put in this function in b18f35d and what its use is within Vaex.
Could you explain what you are trying to do?

Regards,

Maarten

@matteobachetti
Copy link
Author

Let's suppose I have an array A with many numbers, and a new number X. I want to know what is the index of the number closest to X in the array. I could proceed with something like min(abs(A-X)), but this takes a lot of time. If the array is sorted, there are much smarter algorithms that can be used. numpy.searchsorted is a quick way to find the index of the element closest to X in A. At least, this is how I use it :).

@JovanVeljanoski
Copy link
Member

To be honest, I don't know why we put in this function in b18f35d and what its use is within Vaex.

I think that is on me :) Long long time ago, i was just testing out which numpy functions appeared compatible with vaex and adding them in.. Don't remember my exact usecase at the time, but what @matteobachetti describes almost rings a bell..

@maartenbreddels maartenbreddels added the needed: more information There is not enough information to reproduce the issue, or understand the problem label Nov 19, 2021
@maartenbreddels
Copy link
Member

Hmm, I think we need a well described use case for this, and a good set of tests to define the behaviour. To me it's not clear how this would be used in vaex, but currently it seems like it is broken, or not usable with vaex.

@matteobachetti
Copy link
Author

@maartenbreddels my use case is the following: I have a non-uniformly sampled time series with times input_times that I need to cut and analyze in segments. In order to select the time values in a given time range [t0, t1], I use searchsorted as
idx0, idx1 = np.searchsorted(input_times, [t0, t1])
and the times I want are input_times[idx:idx1].
The tests above more or less show all uses. Let me know if I can give any more inputs

@maartenbreddels
Copy link
Member

I think it was a mistake to blindly add this, but I still think it's a useful function to have. I think it would make more sense in as an aggregator though.

@yohplala
Copy link
Contributor

yohplala commented Jan 14, 2022

Hello all,
I tried to implement a vaex searchosrted probably the 'naive' way.

1st implementatoin, not taking into account the fact that the input array is sorted.

import vaex as vx
import numpy as np

def vx_searchsorted1(vdf, val):
    col = vdf.get_column_names()[0]
    return [len(vdf[vdf[col]<v]) for v in val]

# functional test
vdf = vx.from_arrays(x=[1,2,2,2,3,4,5,6,6])
val = [2,6]
res1 = vx_searchsorted1(vdf, val)
assert res1 == [1,7]

# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
%timeit vx_searchsorted1(vdf, val)
# 264 ms ± 6.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2nd implementation, taking into account the fact that input array is sorted: no perf improvement

def vx_searchsorted2(vdf, val):
    col = vdf.get_column_names()[0]
    res = []
    start = 0
    for v in val:
        temp = vdf[start:]
        start += len(temp[temp[col]<v].extract())
        res.append(start)
    return res

# functional test
vdf = vx.from_arrays(x=[1,2,2,2,3,4,5,6,6])
val = [2,6]
res2 = vx_searchsorted2(vdf, val)
assert res2 == [1,7]

# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
%timeit vx_searchsorted2(vdf, val)
# 294 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comparing to numpy, it hurts...

import numpy as np
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
ar = vdf['x'].to_numpy()
%timeit np.searchsorted(ar, val)
# 2.97 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Hmmm, could there be a more appropriate implementation to reach the microsecond world with vaex?

@yohplala
Copy link
Contributor

yohplala commented Jan 16, 2022

Hi,
Exploring a 3rd idea, I am posting it for whom may be interested.
With this 3rd idea, there is a single vaex pass on the input array, whatever the number of values to check.
But no real improvement in execution time though.
(making use of register_function, I tried with multiprocessing=True but it does not change the execution time)

import vaex as vx
import numpy as np
import pyarrow as pa
from copy import copy

@vx.register_function(multiprocessing=True)
def any_in_between(lower, upper, vals):
    if isinstance(lower, pa.Array) or isinstance(lower, pa.lib.ChunkedArray):
        lower = lower.to_numpy()
    if isinstance(upper, pa.Array) or isinstance(upper, pa.lib.ChunkedArray):
        upper = upper.to_numpy()
    return np.any([(lower < val) & (val <= upper) for val in vals], axis=0)

def vx_searchsorted3(vdf, vals):
    vdf2 = copy(vdf)
    upper = vdf2.get_column_names()[0]
    vdf2['__lower'] = vdf2[upper]
    vdf2.shift(1, column='__lower', fill_value=vdf2['__lower'][:0].values[0], inplace=True)
    len_vdf = len(vdf2)
    vdf2['__row_index'] = vx.vrange(0, len_vdf, dtype='float64')
    vdf2['__nan'] = vx.vconstant(np.nan, len_vdf, dtype='float64')

    vdf2['__insert'] = vdf2.func.where(
                  vdf2.func.any_in_between(vdf2['__lower'], vdf2[upper], vals),
                  vdf2['__row_index'], vdf2['__nan'])
    return vdf2['__insert'].dropnan().to_numpy().astype('int64')

# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000

# (not using the same machine, I am posting the execution time of the other methods for comparison)

%timeit vx_searchsorted1(vdf, val)
#181 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit vx_searchsorted2(vdf, val)
#265 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vx_searchsorted3(vdf, val)
#229 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This latter test raises some questions. I had troubles to compare int to pyarrow arrays, hence the need to convert them to numpy array. As this is only a 'view' as per pyarrow documentation, I am assuming this has no great impact on execution time. I wonder is there is better coding practise though?

Any feedback is welcome! :)
Bests

@saluto
Copy link

saluto commented Dec 8, 2022

I would find searchsorted (as in numpy) very helpful for various applications. My current usecase is in NLP: Given two dataframes, each with two sorted integer columns "start" and "end" (describing character spans within a text), I want to find overlaps between any pair of integer ranges from the two dataframes.

Unfortunately, I cannot even get a simple example working:

import vaex as vx
import numpy as np


vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2]).values  # => [1, 1, 1]
np.searchsorted([1, 3, 7], [2])  # => [1]

vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2, 4]).values  # => ValueError: could not broadcast input array from shape (2,) into shape (3,)
np.searchsorted([1, 3, 7], [2, 4])  # => [1, 2]

vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2, 4, 5]).values  # => [1, 2, 2]
np.searchsorted([1, 3, 7], [2, 4, 5])  # => [1, 2, 2]

vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2, 4, 5, 8]).values  # => ValueError: could not broadcast input array from shape (4,) into shape (3,)
np.searchsorted([1, 3, 7], [2, 4, 5, 8])  # => [1, 2, 2, 3]

Or am I doing something wrong?

Is there an alternative to searchsorted for efficiently finding overlapping ranges, as described above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needed: more information There is not enough information to reproduce the issue, or understand the problem
Projects
None yet
Development

No branches or pull requests

5 participants