Can't take max of arrays at least as large as 2 ** 32 #495

Closed
wecassidy opened this issue Jun 27, 2021 · 9 comments
@wecassidy

wecassidy commented Jun 27, 2021

Describe the bug
Calling sparse.COO.max on an array larger than 2 ** 32 - 1 fails with a TypeError like so:

>>> a.shape
(4294967296,)
>>> a.max()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 444, in max
    return np.maximum.reduce(self, out=out, axis=axis, keepdims=keepdims)
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 307, in __array_ufunc__
    result = SparseArray._reduce(ufunc, *inputs, **kwargs)
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 278, in _reduce
    return self.reduce(method, **kwargs)
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 360, in reduce
    out = self._reduce_calc(method, axis, keepdims, **kwargs)
  File "C:\<path_redacted>\sparse\_coo\core.py", line 692, in _reduce_calc
    data, inv_idx, counts = _grouped_reduce(a.data, a.coords[0], method, **kwargs)
  File "C:\<path_redacted>\sparse\_coo\core.py", line 1566, in _grouped_reduce
    result = method.reduceat(x, inv_idx, **kwargs)
TypeError: Cannot cast array data from dtype('uint64') to dtype('int64') according to the rule 'safe'

To Reproduce
Create an array a at least as large as 2 ** 32 with at least one nonzero element, then call a.max(). For example:

>>> b = sparse.DOK((2 ** 32,))
>>> b[0] = 1
>>> a = sparse.COO(b)
>>> a.nnz
1
>>> a.max() # TypeError

Expected behavior
Return the maximum value of the array (1 in the example above).

System

  • OS and version: Windows 10
  • sparse version: 0.12.0+44.g765e297 (bug is also present in 0.12.0, installed from pip)
  • NumPy version: 1.18.5
  • Numba version: 0.53.1

Additional context
sparse.COO.max works on an array of size 2 ** 32 if it is empty (i.e. a.nnz == 0).

@hameerabbasi
Collaborator

Are you on 32-bit Windows by any chance?

@wecassidy
Author

I'm on 64-bit Windows.

I just checked and this bug is not present on Manjaro 21.0.7 with Linux 5.12.9-1-MANJARO (x86_64).

@hameerabbasi
Collaborator

Mentoring instructions: Replace all uses of np.[as]array(list) with np.[as]array(list, dtype=np.int64).
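
For context, a minimal sketch of why the explicit dtype matters. Without `dtype`, NumPy picks the platform default integer for a list of Python ints, which was 32-bit on Windows in NumPy 1.x; that is general NumPy behaviour, not something specific to sparse. The variable names below are illustrative and not taken from the sparse codebase.

```python
import numpy as np

coords = [0, 1, 2]  # illustrative index list, not taken from sparse

# Platform-dependent: int32 on Windows with NumPy 1.x, int64 on most Linux builds.
default_idx = np.asarray(coords)

# Explicit: always int64, regardless of platform.
fixed_idx = np.asarray(coords, dtype=np.int64)

print(default_idx.dtype, fixed_idx.dtype)
```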

@GPhilo

GPhilo commented Jul 5, 2022

Hello, I ran into the same problem. Was there any solution to this?

@GPhilo

GPhilo commented Jul 5, 2022

A quick update, since I'm now digging into the library. I see that the COO constructor has an idx_dtype parameter that, I believe, should force COO to use a specific dtype for its indices. However, if data is None in the constructor call, the array is converted via as_coo, which in turn relies on DOK's as_format, which calls COO.from_iter. from_iter doesn't take idx_dtype, so it never forwards it to the final call to COO's constructor.

The result is, effectively, that idx_dtype gets ignored.

A proposal for improving this would be:

  • as_coo should take idx_dtype (and possibly more parameters of the constructor, maybe directly **kwargs?) and forward them down as appropriate.
  • as_format should take **kwargs and should forward them to whichever constructor/factory it uses internally
  • from_iter should take **kwargs and forward them to the COO constructor.

I don't know which, if any, parameter combinations should be forbidden to ensure there is no infinite recursion in the constructor, but I believe someone with more knowledge of the codebase might know what and where to check so this doesn't happen.
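
The forwarding idea in the proposal above could look roughly like the sketch below. `ToyCOO` and `ToyDOK` are hypothetical stand-ins written for illustration; they are not sparse's classes, and the real signatures of as_coo, as_format and from_iter may differ.

```python
import numpy as np


class ToyCOO:
    """Hypothetical stand-in for a COO-like container; not sparse's class."""

    def __init__(self, coords, data, idx_dtype=None):
        # Honour the requested index dtype; fall back to int64 when none is given.
        if idx_dtype is None:
            idx_dtype = np.int64
        self.coords = np.asarray(coords, dtype=idx_dtype)
        self.data = np.asarray(data)

    @classmethod
    def from_iter(cls, items, **kwargs):
        # Forward extra keyword arguments (e.g. idx_dtype) to the constructor.
        coords = [k for k, _ in items]
        data = [v for _, v in items]
        return cls(coords, data, **kwargs)


class ToyDOK:
    """Hypothetical dict-of-keys container."""

    def __init__(self):
        self.items = {}

    def as_format(self, fmt, **kwargs):
        # Pass keyword arguments through to whichever factory is used.
        if fmt != "coo":
            raise NotImplementedError(fmt)
        return ToyCOO.from_iter(sorted(self.items.items()), **kwargs)


dok = ToyDOK()
dok.items[0] = 1.0
coo = dok.as_format("coo", idx_dtype=np.int64)
print(coo.coords.dtype)  # int64
```

Because each layer only passes keyword arguments downward and the constructor never calls back into as_format, this toy version cannot recurse; whether the same holds in sparse itself would need checking against the real constructor.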

@GPhilo

GPhilo commented Jul 5, 2022

I traced the issue to its source and came up with a hack to make this work, should anyone else also run into this problem.
Basically, when this reshape is called, because idx_dtype is ignored (as mentioned in the comment above), it uses the default int32 idx_type. Since int32 can't store the new shape, this check evaluates to true and idx_type gets converted to the result of np.min_scalar_type(max(shape)), which is np.uint64, and that's what causes the problem.

My hack to solve this is to hardcode np.int64 instead of letting numpy choose:

idx_type = np.int64

This solves the problem when calling max().
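
The mechanism described above can be reproduced with NumPy alone. This is a minimal sketch that assumes a 64-bit platform, where the index type is int64; the array `x` and the index values are made up for illustration and do not come from sparse.

```python
import numpy as np

shape = (2 ** 32,)

# 2 ** 32 does not fit in uint32, so the smallest fitting dtype is unsigned 64-bit.
chosen = np.min_scalar_type(max(shape))
print(chosen)  # uint64

x = np.array([1.0, 3.0, 2.0])

# reduceat needs indices that cast safely to the platform index type;
# uint64 -> int64 is not a safe cast, so this raises TypeError.
try:
    np.maximum.reduceat(x, np.array([0], dtype=chosen))
    raised = False
except TypeError:
    raised = True

# With int64 indices the same reduction succeeds.
result = np.maximum.reduceat(x, np.array([0], dtype=np.int64))
print(raised, result)
```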

@hameerabbasi
Collaborator

Thanks @GPhilo for digging into this, I'll try to set some time aside this weekend to fix it and cut a release.

@Violin9906

It has been more than 2 years and this issue still seems to exist. Any update on this?

@hameerabbasi
Collaborator

This doesn't happen anymore on sparse 0.15.1, which is the latest release. Closing.

4 participants