BUG: OverflowError: value too large to convert to int when manipulating very large dataframes #59531
Comments
Definitely some ints in https://github.com/pandas-dev/pandas/blob/795cce2a12b6ff77b998d16fcd3ffd22add0711f/pandas/_libs/lib.pyx#L522C1-L558C1:
Confirmed on main. Looks like switching the argument to …
Sure, happy to take a look. I think I have something that works, but it is proving awkward to test due to lack of RAM on my dev environment. Leave it with me.
Yea - I was afraid of that. I think we may have to fix this without a test unless there is a creative way that I'm missing.
Progress! I used a node on our HPC (512GB) to pickle the test dataset, and despite being 102GB it loads on my machine with 64GB. I can now manipulate it!
I'll just run some more tests to make sure it doesn't throw any more exceptions elsewhere.
Update maybe_indices_to_slice to use uint64_t allowing manipulation of massive data frames (see pandas-dev#59531)
The rest of my analysis code now works, so I'm creating a PR for review. Thanks! :)
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
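A full-size reproduction needs a DataFrame of over 4 billion rows (hundreds of GB), which is impractical as a test. As a low-memory sketch, one can call the internal routine directly with a `max_len` beyond the 32-bit signed integer range; this assumes, as discussed below, that the overflow comes from a C `int` argument in `pandas._libs.lib.maybe_indices_to_slice`. The exact behavior depends on the pandas build, so the overflow branch is hedged in a try/except:

```python
# Low-memory reproduction sketch (assumption: the overflow comes from a
# C `int`-typed max_len argument of pandas._libs.lib.maybe_indices_to_slice).
# The real trigger was a ~4.3 billion row DataFrame; calling the internal
# routine directly avoids allocating hundreds of GB.
import numpy as np
from pandas._libs import lib

indices = np.arange(3, dtype=np.intp)

# Works on all versions: max_len fits in a 32-bit signed int
small = lib.maybe_indices_to_slice(indices, 10)
print(small)

# On affected versions (e.g. pandas 2.2.2), a max_len beyond 2**31 - 1
# raises "OverflowError: value too large to convert to int"
try:
    lib.maybe_indices_to_slice(indices, 4_261_028_590)
    print("no overflow - running a fixed build")
except OverflowError as exc:
    print("OverflowError:", exc)
```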
Issue Description
When working with a very large data frame (4,261,028,590 rows) I get:
".... in pandas._libs.lib.maybe_indices_to_slice
OverflowError: value too large to convert to int"
While similar to other reports, this occurs in 'pandas._libs.lib.maybe_indices_to_slice'.
Other manipulations on the df also fail. Perhaps rows are represented internally as int32s?
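A quick sanity check supports the int32 hypothesis: the reported row count does not fit in a signed 32-bit integer but fits easily in 64 bits, which would explain "value too large to convert to int" wherever a C `int` holds a length (illustrative arithmetic only, not pandas internals):

```python
# The reported row count vs. C integer ranges
n_rows = 4_261_028_590
int32_max = 2**31 - 1   # 2_147_483_647
int64_max = 2**63 - 1

print(n_rows > int32_max)   # True: overflows a signed 32-bit length
print(n_rows <= int64_max)  # True: fits comfortably in 64 bits
```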
Expected Behavior
Return an array where 'count' is greater than zero, allowing the df to be filtered down further.
Installed Versions
pd.show_versions()
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.12.4.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:09:52 PDT 2024; root:xnu-10063.121.3~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : 24.2
Cython : 3.0.11
pytest : 7.4.4
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.8.4
numba : 0.60.0
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.13.1
sqlalchemy : 2.0.30
tables : 3.9.2
tabulate : 0.9.0
xarray : 2023.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None