[BUG-REPORT] unexpected behavior of searchsorted #1674
Comments
Thanks! I think this is a bug. Not sure how to fix it, to be honest. Would you like to write a simple unit test exposing this issue?
I think something like this should be enough to show the problem. I wrote the same tests with simple Numpy arrays and vaex DataFrames, for comparison.
Thanks a lot for your help, and again, for this beautiful library.
Hi Matteo, thanks for your report. Regards, Maarten
Let's suppose I have an array
I think that is on me :) A long time ago, I was just testing which NumPy functions appeared compatible with vaex and adding them in. I don't remember my exact use case at the time, but what @matteobachetti describes almost rings a bell.
Hmm, I think we need a well-described use case for this, and a good set of tests to define the behaviour. To me it's not clear how this would be used in vaex, but currently it seems to be broken, or not usable with vaex.
@maartenbreddels my use case is the following: I have a non-uniformly sampled time series with times
I think it was a mistake to blindly add this, but I still think it's a useful function to have. I think it would make more sense as an aggregator, though.
Hello all,
1st implementation, not taking into account the fact that the input array is sorted.
import vaex as vx
import numpy as np
def vx_searchsorted1(vdf, val):
    # Use the first column; for each value, count the rows strictly smaller than it.
    col = vdf.get_column_names()[0]
    return [len(vdf[vdf[col]<v]) for v in val]
# functional test
vdf = vx.from_arrays(x=[1,2,2,2,3,4,5,6,6])
val = [2,6]
res1 = vx_searchsorted1(vdf, val)
assert res1 == [1,7]
# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
%timeit vx_searchsorted1(vdf, val)
# 264 ms ± 6.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2nd implementation, taking into account the fact that the input array is sorted: no perf improvement.
def vx_searchsorted2(vdf, val):
    col = vdf.get_column_names()[0]
    res = []
    start = 0
    for v in val:
        # Only look at the rows after the previous insertion point.
        temp = vdf[start:]
        start += len(temp[temp[col]<v].extract())
        res.append(start)
    return res
# functional test
vdf = vx.from_arrays(x=[1,2,2,2,3,4,5,6,6])
val = [2,6]
res2 = vx_searchsorted2(vdf, val)
assert res2 == [1,7]
# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
%timeit vx_searchsorted2(vdf, val)
# 294 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Comparing to numpy, it hurts...
import numpy as np
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
ar = vdf['x'].to_numpy()
%timeit np.searchsorted(ar, val)
# 2.97 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Hmmm, could there be a more appropriate implementation to reach the microsecond world with vaex?
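One way to read the gap between the two timings: the counting approaches above evaluate a comparison over every row for each searched value, whereas a binary search only needs to read about log2(n) elements. Below is a minimal pure-Python sketch of that idea. The `get` accessor and the name `searchsorted_by_probe` are hypothetical, and whether vaex can serve single-row reads cheaply enough for this to pay off is not established in this thread.

```python
from bisect import bisect_left

def searchsorted_by_probe(get, n, v):
    """Left insertion index of v in a sorted sequence of length n.

    `get(i)` is a hypothetical accessor returning element i; only
    O(log n) elements are read. Also returns the number of probes.
    """
    lo, hi = 0, n
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if get(mid) < v:
            lo = mid + 1   # v belongs strictly to the right of mid
        else:
            hi = mid       # v belongs at mid or to its left
    return lo, probes

# Same shape of data as the perf test above, without materialising it.
data = range(10_000_000)
idx, probes = searchsorted_by_probe(data.__getitem__, len(data), 800_000)

# Small functional check against the standard library.
small = [1, 2, 2, 2, 3, 4, 5, 6, 6]
small_res = [searchsorted_by_probe(small.__getitem__, len(small), v)[0] for v in [2, 6]]
```

On the small array this gives the same `[1, 7]` as the functional tests above, and on the ten-million-element range it needs at most 24 probes.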
Hi,
import vaex as vx
import numpy as np
import pyarrow as pa
from copy import copy
@vx.register_function(multiprocessing=True)
def any_in_between(lower, upper, vals):
    # Convert Arrow arrays to NumPy before comparing.
    if isinstance(lower, pa.Array) or isinstance(lower, pa.lib.ChunkedArray):
        lower = lower.to_numpy()
    if isinstance(upper, pa.Array) or isinstance(upper, pa.lib.ChunkedArray):
        upper = upper.to_numpy()
    # True for rows where at least one value falls in (lower, upper].
    return np.any([(lower < val) & (val <= upper) for val in vals], axis=0)

def vx_searchsorted3(vdf, vals):
    vdf2 = copy(vdf)
    upper = vdf2.get_column_names()[0]
    # '__lower' is the column shifted by one row, i.e. the previous value.
    vdf2['__lower'] = vdf2[upper]
    vdf2.shift(1, column='__lower', fill_value=vdf2['__lower'][:0].values[0], inplace=True)
    len_vdf = len(vdf2)
    vdf2['__row_index'] = vx.vrange(0, len_vdf, dtype='float64')
    vdf2['__nan'] = vx.vconstant(np.nan, len_vdf, dtype='float64')
    # Keep the row index where a value falls in the interval, NaN elsewhere.
    vdf2['__insert'] = vdf2.func.where(
        vdf2.func.any_in_between(vdf2['__lower'], vdf2[upper], vals),
        vdf2['__row_index'], vdf2['__nan'])
    return vdf2['__insert'].dropnan().to_numpy().astype('int64')
# perf test
vdf = vx.from_arrays(x=range(10_000_000))
val = np.arange(10)*800_000
# (not using the same machine, I am posting the execution time of the other methods for comparison)
%timeit vx_searchsorted1(vdf, val)
# 181 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit vx_searchsorted2(vdf, val)
# 265 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vx_searchsorted3(vdf, val)
# 229 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This latter test raises some questions. I had trouble comparing
Any feedback is welcome! :)
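To make the interval test in the third implementation easier to follow, here is a pure-Python sketch of the same idea: row i is reported when some searched value v satisfies x[i-1] < v <= x[i]. It is only an illustration of the technique, not the vaex code above. The helper name `interval_insert_points` is made up, and bounding the first row by its own value is an assumption of this sketch.

```python
def interval_insert_points(x, vals):
    """Row indices i for which some v in vals satisfies x[i-1] < v <= x[i]."""
    hits = []
    for i, upper in enumerate(x):
        # Assumption of this sketch: the first row uses its own value as lower bound.
        lower = x[i - 1] if i > 0 else x[0]
        if any(lower < v <= upper for v in vals):
            hits.append(i)
    return hits

x = [1, 2, 2, 2, 3, 4, 5, 6, 6]
inside = interval_insert_points(x, [2, 6])    # values inside the data range
at_start = interval_insert_points(x, [1])     # value equal to the first element
past_end = interval_insert_points(x, [7])     # value larger than every element
```

Under these assumptions, `inside` is `[1, 7]`, matching the functional tests earlier in the thread, while `at_start` and `past_end` are both empty: an interval test of this form reports nothing for insertion points 0 and len(x), and repeated search values collapse into a single row, so the output can be shorter than the list of searched values.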
I would find Unfortunately, I cannot even get a simple example working:
Or am I doing something wrong? Is there an alternative to
Hello,
Thanks for the excellent library. I ran into a couple of problems when using np.searchsorted (or using the searchsorted method of expressions).
Description
I paste here the relevant MWEs:
In practice, instead of the expected array([550]), I get an array of the same shape as df.x.
If I try to search for an array longer than 1, it fails:
An even stranger thing: when the original dataframe is longer than 103 elements, it fails in yet another way, even when searching for a single number
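For reference, this is the plain NumPy behaviour the report compares against, shown on a made-up sorted array (the values are illustrative and are not the data from the original MWEs). One index is returned per searched value, regardless of the length of the array being searched.

```python
import numpy as np

# Illustrative sorted data: 0, 2, 4, ..., 1998 (1000 elements).
x = np.arange(0, 2000, 2)

single = np.searchsorted(x, [1100])              # array([550])
several = np.searchsorted(x, [3, 1100, 5000])    # array([   2,  550, 1000])
right = np.searchsorted(x, 1100, side="right")   # 551

# With the default side="left", the result equals the number of
# elements strictly smaller than the searched value.
counts = np.array([(x < v).sum() for v in [3, 1100, 5000]])
```

The output has the shape of the searched values, not the shape of `x`, which is the expectation stated above.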
Software information
import vaex; vaex.__version__):