[BUG-REPORT] unexpected behavior of searchsorted #1674

matteobachetti · 2021-11-01T12:53:07Z

Hello,

Thanks for the excellent library. I ran into a couple of problems when using np.searchsorted (or using the seachsorted method of expressions)

Description
I paste here the relevant MWEs:

In [1]: import vaex; vaex.__version__
Out[1]: 
{'vaex': '4.5.0',
 'vaex-core': '4.5.1',
 'vaex-viz': '0.5.0',
 'vaex-hdf5': '0.10.0',
 'vaex-server': '0.6.1',
 'vaex-astro': '0.9.0',
 'vaex-jupyter': '0.6.0',
 'vaex-ml': '0.14.0'}

In [2]: import numpy as np

In [3]: import vaex

In [4]: df = vaex.from_arrays(x=np.arange(0, 100, 0.1))

In [5]: df.x.searchsorted(55).evaluate()   # <-- I use a single value here, and it works
Out[5]: 550

In [6]: df.x.searchsorted(np.array([55])).evaluate()  # <-- The value to search is an array now
Out[6]: 
array([550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550,
       550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550,
       550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550, 550,
(...)

In practice, instead of the expected array([550]), I get an array of the same shape as df.x.
If I try to search for an array longer than 1, it fails:

In [7]: df.x.searchsorted(np.array([55, 57])).evaluate()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-b69e14c97073> in <module>
----> 1 df.x.searchsorted(np.array([55, 57])).evaluate()

(...)
ValueError: could not broadcast input array from shape (2,) into shape (1024,)

Even stranger thing: when the original dataframe is longer than 103 elements, it fails in yet another way, even searching for a single number

In [8]: df = vaex.from_arrays(x=np.arange(0, 103, 0.1))

In [9]: df.x.searchsorted(55).evaluate()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-258641766582> in <module>
----> 1 df.x.searchsorted(55).evaluate()
(...)
ValueError: Unknown type: 550

Software information

Vaex version (import vaex; vaex.__version__):

{'vaex': '4.5.0',
'vaex-core': '4.5.1',
'vaex-viz': '0.5.0',
'vaex-hdf5': '0.10.0',
'vaex-server': '0.6.1',
'vaex-astro': '0.9.0',
'vaex-jupyter': '0.6.0',
'vaex-ml': '0.14.0'}

Vaex was installed via: pip
OS: Mac OS Big Sur on a Intel Macbook Pro

The text was updated successfully, but these errors were encountered:

JovanVeljanoski · 2021-11-06T08:57:22Z

Thanks! I think this is a bug. Not sure how to fix it to be honest. Would you like to make a simple unit text exposing this issue?

matteobachetti · 2021-11-07T09:57:23Z

I think something like this should be enough to show the problem. I wrote the same tests with simple Numpy arrays and vaex DataFrames, for comparison.

import vaex
import numpy as np

class TestVaex(object):
    @classmethod
    def setup_class(cls):
        cls.data = np.arange(0, 200, 0.1)
        cls.df = vaex.from_arrays(x=cls.data)

    def test_numpy_searchsorted_single_number(self):
        assert np.searchsorted(self.data, 55)== 550

    def test_numpy_searchsorted_one_element(self):
        assert np.equal(np.searchsorted(self.data, [55]),  [550])

    def test_numpy_searchsorted_two_elements(self):
        assert np.all(np.equal(np.searchsorted(self.data, [55, 55]),  [550, 550]))

    def test_vaex_searchsorted_single_number(self):
        assert self.df.x.searchsorted(55).evaluate() == 550

    def test_vaex_searchsorted_one_element(self):
        assert np.equal(self.df.x.searchsorted([55]).evaluate(), [550])

    def test_vaex_searchsorted_two_elements(self):
        assert np.all(np.equal(self.df.x.searchsorted([55, 55]).evaluate(),  [550, 550]))

Thanks a lot for your help, and again, for this beautiful library.

maartenbreddels · 2021-11-17T08:54:21Z

Hi Matteo

thanks for your report
To be honest, I don't know why we put in this function in b18f35d and what its use is within Vaex.
Could you explain what you are trying to do?

Regards,

Maarten

matteobachetti · 2021-11-17T09:29:29Z

Let's suppose I have an array A with many numbers, and a new number X. I want to know what is the index of the number closest to X in the array. I could proceed with something like min(abs(A-X)), but this takes a lot of time. If the array is sorted, there are much smarter algorithms that can be used. numpy.searchsorted is a quick way to find the index of the element closest to X in A. At least, this is how I use it :).

JovanVeljanoski · 2021-11-17T09:49:46Z

To be honest, I don't know why we put in this function in b18f35d and what its use is within Vaex.

I think that is on me :) Long long time ago, i was just testing out which numpy functions appeared compatible with vaex and adding them in.. Don't remember my exact usecase at the time, but what @matteobachetti describes almost rings a bell..

maartenbreddels · 2021-11-19T11:50:07Z

Hmm, I think we need a well described use case for this, and a good set of tests to define the behaviour. To me it's not clear how this would be used in vaex, but currently it seems like it is broken, or not usable with vaex.

matteobachetti · 2021-11-19T12:25:40Z

@maartenbreddels my use case is the following: I have a non-uniformly sampled time series with times input_times that I need to cut and analyze in segments. In order to select the time values in a given time range [t0, t1], I use searchsorted as
idx0, idx1 = np.searchsorted(input_times, [t0, t1])
and the times I want are input_times[idx:idx1].
The tests above more or less show all uses. Let me know if I can give any more inputs

maartenbreddels · 2021-11-23T08:30:23Z

I think it was a mistake to blindly add this, but I still think it's a useful function to have. I think it would make more sense in as an aggregator though.

yohplala · 2022-01-14T09:50:44Z

Hello all,
I tried to implement a vaex searchosrted probably the 'naive' way.

1st implementatoin, not taking into account the fact that the input array is sorted.

import vaex as vx
import numpy as np

def vx_searchsorted1(vdf, val):
    col = vdf.get_column_names()[0]
    return [len(vdf[vdf[col]<v]) for v in val]

# functional test
vdf = vx.from_arrays(x=[1,2,2,2,3,4,5,6,6])
val = [2,6]
res1 = vx_searchsorted1(vdf, val)
assert res1 == [1,7]

# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
%timeit vx_searchsorted1(vdf, val)
# 264 ms ± 6.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2nd implementation, taking into account the fact that input array is sorted: no perf improvement

def vx_searchsorted2(vdf, val):
    col = vdf.get_column_names()[0]
    res = []
    start = 0
    for v in val:
        temp = vdf[start:]
        start += len(temp[temp[col]<v].extract())
        res.append(start)
    return res

# functional test
vdf = vx.from_arrays(x=[1,2,2,2,3,4,5,6,6])
val = [2,6]
res2 = vx_searchsorted2(vdf, val)
assert res2 == [1,7]

# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
%timeit vx_searchsorted2(vdf, val)
# 294 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comparing to numpy, it hurts...

import numpy as np
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
ar = vdf['x'].to_numpy()
%timeit np.searchsorted(ar, val)
# 2.97 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Hmmm, could there be a more appropriate implementation to reach the microsecond world with vaex?

yohplala · 2022-01-16T18:55:10Z

Hi,
Exploring a 3rd idea, I am posting it for whom may be interested.
With this 3rd idea, there is a single vaex pass on the input array, whatever the number of values to check.
But no real improvement in execution time though.
(making use of register_function, I tried with multiprocessing=True but it does not change the execution time)

import vaex as vx
import numpy as np
import pyarrow as pa
from copy import copy

@vx.register_function(multiprocessing=True)
def any_in_between(lower, upper, vals):
    if isinstance(lower, pa.Array) or isinstance(lower, pa.lib.ChunkedArray):
        lower = lower.to_numpy()
    if isinstance(upper, pa.Array) or isinstance(upper, pa.lib.ChunkedArray):
        upper = upper.to_numpy()
    return np.any([(lower < val) & (val <= upper) for val in vals], axis=0)

def vx_searchsorted3(vdf, vals):
    vdf2 = copy(vdf)
    upper = vdf2.get_column_names()[0]
    vdf2['__lower'] = vdf2[upper]
    vdf2.shift(1, column='__lower', fill_value=vdf2['__lower'][:0].values[0], inplace=True)
    len_vdf = len(vdf2)
    vdf2['__row_index'] = vx.vrange(0, len_vdf, dtype='float64')
    vdf2['__nan'] = vx.vconstant(np.nan, len_vdf, dtype='float64')

    vdf2['__insert'] = vdf2.func.where(
                  vdf2.func.any_in_between(vdf2['__lower'], vdf2[upper], vals),
                  vdf2['__row_index'], vdf2['__nan'])
    return vdf2['__insert'].dropnan().to_numpy().astype('int64')

# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000

# (not using the same machine, I am posting the execution time of the other methods for comparison)

%timeit vx_searchsorted1(vdf, val)
#181 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit vx_searchsorted2(vdf, val)
#265 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vx_searchsorted3(vdf, val)
#229 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This latter test raises some questions. I had troubles to compare int to pyarrow arrays, hence the need to convert them to numpy array. As this is only a 'view' as per pyarrow documentation, I am assuming this has no great impact on execution time. I wonder is there is better coding practise though?

Any feedback is welcome! :)
Bests

saluto · 2022-12-08T16:28:37Z

I would find searchsorted (as in numpy) very helpful for various applications. My current usecase is in NLP: Given two dataframes, each with two sorted integer columns "start" and "end" (describing character spans within a text), I want to find overlaps between any pair of integer ranges from the two dataframes.

Unfortunately, I cannot even get a simple example working:

import vaex as vx
import numpy as np


vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2]).values  # => [1, 1, 1]
np.searchsorted([1, 3, 7], [2])  # => [1]

vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2, 4]).values  # => ValueError: could not broadcast input array from shape (2,) into shape (3,)
np.searchsorted([1, 3, 7], [2, 4])  # => [1, 2]

vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2, 4, 5]).values  # => [1, 2, 2]
np.searchsorted([1, 3, 7], [2, 4, 5])  # => [1, 2, 2]

vx.from_arrays(x=[1, 3, 7]).x.searchsorted([2, 4, 5, 8]).values  # => ValueError: could not broadcast input array from shape (4,) into shape (3,)
np.searchsorted([1, 3, 7], [2, 4, 5, 8])  # => [1, 2, 2, 3]

Or am I doing something wrong?

Is there an alternative to searchsorted for efficiently finding overlapping ranges, as described above?

JovanVeljanoski added the bug label Nov 6, 2021

maartenbreddels added the needed: more information There is not enough information to reproduce the issue, or understand the problem label Nov 19, 2021

yohplala mentioned this issue Jan 17, 2022

[FEATURE-REQUEST] Parameter dream list for register_function #1835

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG-REPORT] unexpected behavior of searchsorted #1674

[BUG-REPORT] unexpected behavior of searchsorted #1674

matteobachetti commented Nov 1, 2021 •

edited

Loading

JovanVeljanoski commented Nov 6, 2021

matteobachetti commented Nov 7, 2021

maartenbreddels commented Nov 17, 2021

matteobachetti commented Nov 17, 2021

JovanVeljanoski commented Nov 17, 2021

maartenbreddels commented Nov 19, 2021

matteobachetti commented Nov 19, 2021

maartenbreddels commented Nov 23, 2021

yohplala commented Jan 14, 2022 •

edited

Loading

yohplala commented Jan 16, 2022 •

edited

Loading

saluto commented Dec 8, 2022

[BUG-REPORT] unexpected behavior of searchsorted #1674

[BUG-REPORT] unexpected behavior of searchsorted #1674

Comments

matteobachetti commented Nov 1, 2021 • edited Loading

JovanVeljanoski commented Nov 6, 2021

matteobachetti commented Nov 7, 2021

maartenbreddels commented Nov 17, 2021

matteobachetti commented Nov 17, 2021

JovanVeljanoski commented Nov 17, 2021

maartenbreddels commented Nov 19, 2021

matteobachetti commented Nov 19, 2021

maartenbreddels commented Nov 23, 2021

yohplala commented Jan 14, 2022 • edited Loading

yohplala commented Jan 16, 2022 • edited Loading

saluto commented Dec 8, 2022

matteobachetti commented Nov 1, 2021 •

edited

Loading

yohplala commented Jan 14, 2022 •

edited

Loading

yohplala commented Jan 16, 2022 •

edited

Loading