
[BUG-REPORT] df.shift(period) with negative period returns incorrect result when df has more than 2**20 rows #1607

Closed
mateusglucas opened this issue Oct 4, 2021 · 3 comments

Comments

@mateusglucas
Contributor

Description
When df.shift(period) is called with a negative period on a dataframe with more than 2**20 = 1048576 rows, it returns an incorrect result.

Software information

  • Vaex version: 4.5.0
  • Vaex was installed via: pip3
  • OS: Ubuntu 20.04.3 LTS

Example

import pandas as pd
import numpy as np
import vaex as vx

N = 2**20 + 1  # 1048577 rows, one more than 2**20
a = pd.DataFrame()
a['A'] = np.linspace(1, N, N)  # values 1.0, 2.0, ..., N

b = vx.from_pandas(a)

print(b.shift(-1))
# Output:
# #          A
# 0          1048577.0
# 1          2.0
# 2          3.0
# 3          4.0
# 4          5.0
# ...        ...
# 1,048,572  1048573.0
# 1,048,573  1048574.0
# 1,048,574  1048575.0
# 1,048,575  1048576.0
# 1,048,576  --
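
For comparison, here is a minimal sketch of the result I would expect, computed with pandas instead of vaex. It assumes vaex's shift(-1) is meant to follow the same semantics as pandas' `Series.shift(-1)`, i.e. every value moves up one row and the last row becomes missing. The name `expected` is only used for this illustration.

```python
import numpy as np
import pandas as pd

N = 2**20 + 1  # same size as in the report: 1048577 rows
a = pd.DataFrame({'A': np.linspace(1, N, N)})

# pandas shifts the whole column at once, so no chunk boundary is involved
expected = a['A'].shift(-1)

print(expected.head(3).tolist())  # first rows of the expected result
print(expected.tail(2).tolist())  # last value, then the missing row
```

Under that assumption, row 0 of the expected result is 2.0, whereas the vaex output in the report shows 1048577.0 there.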
@mateusglucas
Contributor Author

I've found something at line 261 of shift.py:

chunk_size = chunk_size or 1024**2 # TODO: should we have a constant somewhere

Increasing 1024**2 to a value greater than or equal to the number of rows produces the correct result.

mateusglucas added a commit to mateusglucas/vaex that referenced this issue Oct 5, 2021
@mateusglucas changed the title on Oct 5, 2021
from: [BUG-REPORT] df.shift(period) with negative period returns incorrect result when df have more than 2**20 rows
to: [BUG-REPORT] df.shift(period) with negative period returns incorrect result when df has more than 2**20 rows
@mateusglucas
Contributor Author

I think this issue is caused by the continue on line 162 of shift.py. When a dataframe has more than 1024**2 rows, this line prevents the first chunk from being yielded.
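
To make the chunk-boundary problem concrete, here is a small NumPy sketch of a shift that is evaluated chunk by chunk and still yields every chunk, including the first one. This is not the code in shift.py. The helpers `shift_whole` and `shift_chunked` are hypothetical names written for this illustration, and the tiny `chunk_size` stands in for the 1024**2 default.

```python
import numpy as np

def shift_whole(values, periods, fill=np.nan):
    """Reference: shift a 1-D array in one piece (out[i] = values[i - periods])."""
    out = np.full(len(values), fill, dtype=float)
    if periods > 0:
        out[periods:] = values[:-periods]
    elif periods < 0:
        out[:periods] = values[-periods:]
    else:
        out[:] = values
    return out

def shift_chunked(values, periods, chunk_size, fill=np.nan):
    """Yield the shifted output one chunk at a time, reading across chunk boundaries."""
    n = len(values)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        out = np.full(stop - start, fill, dtype=float)
        # Output rows [start, stop) read source rows [start - periods, stop - periods),
        # clipped to the valid range [0, n).
        src_start = max(start - periods, 0)
        src_stop = min(stop - periods, n)
        if src_start < src_stop:
            dst = src_start + periods - start  # offset inside this output chunk
            out[dst:dst + (src_stop - src_start)] = values[src_start:src_stop]
        # Every chunk is yielded, even one that is only partly (or not at all) filled.
        yield out

values = np.arange(1.0, 12.0)  # 11 rows, deliberately not a multiple of chunk_size
chunked = np.concatenate(list(shift_chunked(values, -1, chunk_size=4)))
print(chunked)
```

With a negative period, each output chunk needs rows from the following input chunk, so an implementation that buffers input has to delay its output by one chunk. Skipping a yield during that delay without emitting the chunk later would produce the kind of misplaced rows shown in the report.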

mateusglucas added a commit to mateusglucas/vaex that referenced this issue Oct 7, 2021
maartenbreddels pushed a commit that referenced this issue Oct 11, 2021
* Fix df.shift() with negative period

This fixes the issue  #1607

* try fix shift with negative period

* fix: shift and diff of large datasets

- shift() with a negative period and diff() now return the correct values for datasets larger than 1024**2 rows
@mateusglucas
Contributor Author

Closed via #1608
