BUG(?): rolling sum with pyarrow types results in float64 instead of preserving integer type #61144

Open
3 tasks done
MarcoGorelli opened this issue Mar 18, 2025 · 3 comments
Labels: Arrow (pyarrow functionality), Bug, Dtype Conversions (unexpected or buggy dtype conversions), Window (rolling, ewma, expanding)

Comments

@MarcoGorelli
Member

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [23]: pd.Series([1,2,3], dtype='Int64[pyarrow]').rolling(2).sum()
Out[23]:
0    NaN
1    3.0
2    5.0
dtype: float64

Issue Description

Given that 'Int64[pyarrow]' supports missing values, should the above not result in

0    <NA>
1    3
2    5
dtype: Int64[pyarrow]

to avoid the usual issues around floating point numbers?
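To make the floating-point concern concrete: float64 has a 53-bit significand, so not every integer above 2**53 can be represented exactly, even though such values fit comfortably in a 64-bit integer. A minimal sketch in plain Python follows; the specific values are chosen only for illustration and do not come from the report.

```python
# float64 represents every integer exactly only up to 2**53.
big = 2**53 + 1              # 9007199254740993, well within int64 range

as_float = float(big)        # rounds to the nearest representable float64
round_trip = int(as_float)   # converting back does not recover the value

print(big)         # 9007199254740993
print(round_trip)  # 9007199254740992

# The same effect inside a two-element window sum (illustrative values).
window = [2**53, 1]
int_sum = window[0] + window[1]                  # exact integer arithmetic
float_sum = float(window[0]) + float(window[1])  # result must fit in float64

print(int_sum == int(float_sum))  # False
```

A rolling sum that stays in integer arithmetic would avoid this kind of silent rounding.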

Expected Behavior

The result should preserve the integer type.

Other tools for reference:

In [16]: pl.Series([1,2,3]).rolling_sum(2)
Out[16]:
shape: (3,)
Series: '' [i64]
[
        null
        3
        5
]

In [17]: duckdb.sql("""
    ...: from values (1),(2),(3) df(a)
    ...: select case when count(a) over w >= 2 then sum(a) over w else null end as a
    ...: window w as (rows between 1 preceding and current row)
    ...: """)
Out[17]:
┌────────┐
│   a    │
│ int128 │
├────────┤
│   NULL │
│      3 │
│      5 │
└────────┘

Installed Versions

INSTALLED VERSIONS

commit : 57fd502
python : 3.10.12
python-bits : 64
OS : Linux
OS-release : 5.15.167.4-microsoft-standard-WSL2
Version : #1 SMP Tue Nov 5 00:21:55 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1979.g57fd50221e
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 8.33.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : 1.4.2
fastparquet : 2024.11.0
fsspec : 2025.2.0
html5lib : 1.1
hypothesis : 6.127.5
gcsfs : 2025.2.0
jinja2 : 3.1.5
lxml.etree : 5.3.1
matplotlib : 3.10.1
numba : 0.61.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pytz : 2025.1
pyxlsb : 1.0.10
s3fs : 2025.2.0
scipy : 1.15.2
sqlalchemy : 2.0.38
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.2
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None

@MarcoGorelli added the Bug and Needs Triage labels Mar 18, 2025
@mroeschke
Member

Agreed, we should be able to preserve the input type for rolling sum aggregations. Also min and max.

@mroeschke added the Dtype Conversions, Window, and Arrow labels and removed the Needs Triage label Mar 18, 2025
@snitish
Contributor

snitish commented Mar 19, 2025

@mroeschke @MarcoGorelli Given that groupby operations are able to correctly handle extension (and other non-float64) dtypes, I'm guessing we will need to do something similar for window aggregations:

  1. Update all window cython aggregations to be able to handle numpy types other than float64_t (i.e. don't hardcode float64_t)
  2. Update all window aggregations to handle masked arrays (i.e. allow extra mask and result_mask params)
  3. Create a separate _window_op() path for extension arrays similar to _groupby_op()
  4. Implement _window_op() for each extension array type

This seems like quite a bit of work, so please suggest a simpler way if you can think of one. We could also do this in stages: implement step 1 first, as it is independent of the other three and would resolve other issues like #23002 and, partially, #11446.
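As a rough illustration of what steps 1 and 2 could look like, here is a pure-Python sketch of a rolling sum that stays in integer arithmetic and carries a mask alongside the values. The function name `rolling_sum_masked` and its signature are hypothetical and are not pandas' actual Cython kernels; it assumes `min_periods` equals the window size, as in the reproducer.

```python
def rolling_sum_masked(values, mask, window):
    """Hypothetical sketch: integer rolling sum with a missing-value mask.

    values: list of ints (entries where mask is True are ignored)
    mask:   list of bools, True where the input is missing
    Returns (result, result_mask); result_mask is True where the output
    is missing because the window holds fewer than `window` valid values.
    """
    n = len(values)
    result = [0] * n
    result_mask = [True] * n
    running = 0  # integer sum of the valid values currently in the window
    valid = 0    # count of valid values currently in the window
    for i in range(n):
        # Add the element entering the window.
        if not mask[i]:
            running += values[i]
            valid += 1
        # Remove the element leaving the window, if any.
        j = i - window
        if j >= 0 and not mask[j]:
            running -= values[j]
            valid -= 1
        # Emit a value only when the window is fully populated.
        if valid >= window:
            result[i] = running
            result_mask[i] = False
    return result, result_mask


result, result_mask = rolling_sum_masked([1, 2, 3], [False, False, False], 2)
print(result)       # [0, 3, 5]
print(result_mask)  # [True, False, False]
```

Because no float64 intermediate is involved, the output can be wrapped back into an integer extension array together with `result_mask`, which is the role steps 3 and 4 would play.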

@mroeschke
Member

The shorter way would be to call astype on the result of the Cython aggregations, but your steps are definitely the more thorough approach that we would want in the longer term.
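For reference, a user-level analogue of that shortcut might look like the sketch below. The helper `rolling_sum_keep_dtype` is hypothetical, and the demo uses the nullable "Int64" dtype so that it depends only on pandas; the same call is assumed, not verified here, to apply to the pyarrow-backed dtype from the report. Note that the sum is still computed in float64 before the cast, so this restores the dtype but not the precision of very large integers.

```python
import pandas as pd


def rolling_sum_keep_dtype(s, window):
    """Hypothetical helper: rolling sum cast back to the input dtype."""
    out = s.rolling(window).sum()  # currently computed as float64
    return out.astype(s.dtype)     # NaN becomes <NA> for nullable dtypes


s = pd.Series([1, 2, 3], dtype="Int64")
res = rolling_sum_keep_dtype(s, 2)
print(res.dtype)
print(res.tolist())
```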
