
BUG: OverflowError: value too large to convert to int when manipulating very large dataframes #59531

Closed
2 of 3 tasks
benjamindonnachie opened this issue Aug 16, 2024 · 6 comments · Fixed by #61080
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

Comments

benjamindonnachie (Contributor) commented Aug 16, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

test = pd.DataFrame(
    {"count": np.random.randint(0, 100, size=4261028590)},
    index=pd.DatetimeIndex(np.empty(4261028590)),
)

stripped = test[test['count'] > 0]
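For context, a rough back-of-the-envelope footprint for the frame above (an assumption, not a measurement: one int64 "count" column plus a datetime64[ns] index, 8 bytes each per row):

```python
# Hypothetical size estimate for the reproducer above (not measured):
# one int64 column plus a datetime64[ns] index, 8 bytes each per row.
n_rows = 4261028590
approx_gib = n_rows * (8 + 8) / 2**30
print(f"~{approx_gib:.1f} GiB")  # ~63.5 GiB, before any temporaries
```

So reproducing this directly needs a machine with well over 64 GB of RAM, which matters later in the thread.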

Issue Description

When working with a very large DataFrame (4,261,028,590 rows) I get "... in pandas._libs.lib.maybe_indices_to_slice
OverflowError: value too large to convert to int":

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4093, in __getitem__
    return self._getitem_bool_array(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4155, in _getitem_bool_array
    return self._take_with_is_copy(indexer, axis=0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4153, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4133, in take
    new_data = self._mgr.take(
               ^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 893, in take
    new_labels = self.axes[axis].take(indexer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/datetimelike.py", line 839, in take
    maybe_slice = lib.maybe_indices_to_slice(indices, len(self))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 522, in pandas._libs.lib.maybe_indices_to_slice
OverflowError: value too large to convert to int

While similar to other reports, this one occurs in pandas._libs.lib.maybe_indices_to_slice.

Other manipulations on the DataFrame also fail. Perhaps row counts are represented internally as int32s?
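The int32 suspicion checks out arithmetically: the row count exceeds the range of a signed 32-bit C int (the conversion Cython reports OverflowError for), while fitting easily in a 64-bit Py_ssize_t:

```python
n_rows = 4261028590             # row count from the reproducer

INT32_MAX = 2**31 - 1           # 2147483647: largest signed 32-bit C int
INT64_MAX = 2**63 - 1           # largest Py_ssize_t value on 64-bit platforms

print(n_rows > INT32_MAX)       # True: converting to a C int must overflow
print(n_rows <= INT64_MAX)      # True: Py_ssize_t is wide enough
```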

Expected Behavior

Return the rows where 'count' is greater than zero, allowing the DataFrame to be filtered down further.

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.4.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:09:52 PDT 2024; root:xnu-10063.121.3~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : 24.2
Cython : 3.0.11
pytest : 7.4.4
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.8.4
numba : 0.60.0
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.13.1
sqlalchemy : 2.0.30
tables : 3.9.2
tabulate : 0.9.0
xarray : 2023.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

@benjamindonnachie benjamindonnachie added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 16, 2024
benjamindonnachie (Contributor, Author) commented Aug 16, 2024

Definitely some C ints in https://github.com/pandas-dev/pandas/blob/795cce2a12b6ff77b998d16fcd3ffd22add0711f/pandas/_libs/lib.pyx#L522C1-L558C1:

def maybe_indices_to_slice(ndarray[intp_t, ndim=1] indices, int max_len):
    cdef:
        Py_ssize_t i, n = len(indices)
        intp_t k, vstart, vlast, v

    if n == 0:
        return slice(0, 0)

    vstart = indices[0]
    if vstart < 0 or max_len <= vstart:
        return indices

    if n == 1:
        return slice(vstart, <intp_t>(vstart + 1))

    vlast = indices[n - 1]
    if vlast < 0 or max_len <= vlast:
        return indices

    k = indices[1] - indices[0]
    if k == 0:
        return indices
    else:
        for i in range(2, n):
            v = indices[i]
            if v - indices[i - 1] != k:
                return indices

        if k > 0:
            return slice(vstart, <intp_t>(vlast + 1), k)
        else:
            if vlast == 0:
                return slice(vstart, None, k)
            else:
                return slice(vstart, <intp_t>(vlast - 1), k)
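For readers without a Cython toolchain, the routine above can be sketched as a pure-Python port (an illustration, not the pandas implementation). Python ints are arbitrary-precision, so the port cannot overflow regardless of max_len; the Cython version only fails because max_len is declared as a C int, which is why widening it is sufficient:

```python
# Hypothetical pure-Python port of pandas' maybe_indices_to_slice,
# for illustration: collapse a 1-D index array into a slice when the
# indices form an arithmetic progression within [0, max_len).
import numpy as np

def maybe_indices_to_slice(indices: np.ndarray, max_len: int):
    n = len(indices)
    if n == 0:
        return slice(0, 0)

    vstart = indices[0]
    if vstart < 0 or max_len <= vstart:
        return indices          # out of bounds: fall back to fancy indexing

    if n == 1:
        return slice(vstart, vstart + 1)

    vlast = indices[n - 1]
    if vlast < 0 or max_len <= vlast:
        return indices

    k = indices[1] - indices[0]
    if k == 0:
        return indices          # repeated index: not representable as a slice

    # All consecutive differences must equal k for a slice to be equivalent.
    if not (np.diff(indices) == k).all():
        return indices

    if k > 0:
        return slice(vstart, vlast + 1, k)
    if vlast == 0:
        return slice(vstart, None, k)   # stop=-1 would mean "end", so use None
    return slice(vstart, vlast - 1, k)
```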

rhshadrach (Member) commented Aug 16, 2024

Confirmed on main. Looks like switching the argument to Py_ssize_t resolves it. @benjamindonnachie would you be interested in submitting a PR to fix this?

@rhshadrach rhshadrach added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 16, 2024
benjamindonnachie (Contributor, Author) commented Aug 17, 2024

Sure, happy to take a look. I think I have something that works, but it is proving awkward to test due to lack of RAM in my dev environment. Leave it with me.

rhshadrach (Member) commented:
Yea - I was afraid of that. I think we may have to fix this without a test unless there is a creative way that I'm missing.

benjamindonnachie (Contributor, Author) commented:
Progress! I used a node on our HPC cluster (512 GB RAM) to pickle the test dataset, and despite being 102 GB it loads on my machine with 64 GB. I can now manipulate it!

>>> import pandas as pd
>>> print (pd.__version__)
3.0.0.dev0+1351.g523afa840a.dirty
>>> test = pd.read_pickle("summary.pkl")
>>> print(len(test))
4261028590
>>> stripped = test[test['count'] > 0]
>>> print(len(stripped))
1621743

I'll just run some more tests to make sure it doesn't throw any more exceptions elsewhere.

benjamindonnachie added a commit to benjamindonnachie/pandas that referenced this issue Aug 17, 2024
Updates maybe_indices_to_slice to use uint64 allowing massive dataframes to be manipulated (see pandas-dev#59531)
benjamindonnachie added a commit to benjamindonnachie/pandas that referenced this issue Aug 17, 2024
Update maybe_indices_to_slice to use unit64_t allowing manipulation of massive data frames (See pandas-dev#59531)
benjamindonnachie added a commit to benjamindonnachie/pandas that referenced this issue Aug 17, 2024
Update maybe_indices_to_slice to use uint64_t allowing manipulation of massive data frames (See pandas-dev#59531)
benjamindonnachie (Contributor, Author) commented:
The rest of my analysis code now works, so I'm creating a PR for review. Thanks! :)
