[FEA] Improve support or failure modes for numpy and other libraries with C APIs in cudf.pandas #15397

vyasr · 2024-03-26T22:59:54Z

Currently we proxy numpy.ndarray to ensure that cupy arrays are produced instead where possible. However, this behavior only partially addresses the possible ways to interact with numpy. Similarly to how we handle pandas in cudf.pandas, we may want to install a proxied library for numpy itself in such cases to ensure that we get the desired behavior.

Unfortunately, this is far more challenging with numpy than with pandas due to the fact that numpy exposes a C API. This issue is not unique to numpy, but is also present for other libraries (like torch) that rely on being able to translate Python objects from standard vocabulary types like numpy arrays down to C representations to leverage in their own C code. In general, our approach for proxying is incompatible with code that relies on converting Python objects into their C representations since we cannot mimic the latter, only the former.

We should collect cases where we observe failures like this and see if we can, at minimum, improve the ways in which the code fails. If possible, we can also try to come up with more robust strategies for consuming libraries to use such that they won't accidentally go down such bad code paths with cudf.pandas objects.

In an ideal world, we would come up with a solution that actually enables support for such use cases, but at present I don't see how we could manage doing so without doing something crazy like hooking every function call with sys.setprofile, and even if that worked (I'm not sure that it would) the cure might be worse than the disease because we'd most likely slow down every function call in Python enough to overcome any gains from using cudf.pandas.

The text was updated successfully, but these errors were encountered:

Matt711 · 2024-06-05T22:56:27Z

We should collect cases where we observe failures like this and see if we can, at minimum, improve the ways in which the code fails.

This is directly related to #15910, so I should also try to catch cases in our proxying scheme where we fallback to Pandas due to NumPy (or other libraries with C APIs).

If possible, we can also try to come up with more robust strategies for consuming libraries to use such that they won't accidentally go down such bad code paths with cudf.pandas objects.

I think once we run the pandas test suite with cudf.pandas debugging modes turned on we should get a better of what kinds of failures we're running in to.

Apart of #15397. Closes #14537. Creates `ProxyNDarray` which inherits from `np.ndarray`. Authors: - Matthew Murray (https://github.com/Matt711) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Matthew Roeschke (https://github.com/mroeschke) URL: #16286

vyasr · 2024-10-02T00:57:17Z

With the solution in #16601, we are actually returning a (subclass of a) numpy array. Our primary goal was to make instance checks isinstance(df['a'].values, np.ndarray) pass, but I think we may have also implicitly resolved this issue in the process because the result will be true numpy arrays. All the cases where we thought that we would need to proxy numpy itself stemmed from thinking that we needed to modify other attributes of the numpy module or nested attributes of the numpy array proxy type to also be compatible with the pandas accelerator mode. Therefore, I would think that libraries using the NumPy C API or any other part of the numpy API beyond just the array class will also work now (to the extent that cudf itself works with numpy) with the caveat that there will be more host-device transfers happening than we would like. Does that sound right to others @wence- @galipremsagar @beckernick? Are there good tests cases on which we can verify this hypothesis? CC @Matt711 for #16955.

Matt711 · 2024-10-02T01:32:51Z

Thanks for not losing track of this issue @vyasr. Your summary sounds correct to me. We do have a few tests that verify that cudf.pandas works with functions using the NumPy C API. They are in the third-party integration test suite (specifically test_numpy.py and test_pytorch). But I we should add more testing. Specifically, I think we should

Add more tests in test_numpy.py and possibly test_pytorch.py that use the C API (ie. operate on the data buffer directly)
Add more libraries to the third-party integration test suite. We've discussed catboost offline before, so we can start there.

I think doing both of those things will go far, but it may not cover the case where a user writes a function in pure C, compiles it so a shared object, and uses it in python. It may be overkill, but maybe we should add a few tests where the custom function is in pure C?

vyasr · 2024-10-04T17:10:19Z

I don't think that we need to go that far. I think your points 1 and 2 are great and would be sufficient. I would love to see a catboost test if you could throw something together. As long as we cover enough third-party libraries I'll feel good about us covering a wide range of possible uses of the C API. The case you're describing is fundamentally no different than any of the others.

vyasr added the feature request New feature or request label Mar 26, 2024

galipremsagar added the cudf.pandas Issues specific to cudf.pandas label Apr 15, 2024

galipremsagar added this to the Proxying - cudf.pandas milestone Apr 15, 2024

wence- mentioned this issue Jun 14, 2024

[QST] TypeError: Argument 'real' has incorrect type (expected numpy.ndarray, got ndarray) #16029

Closed

Matt711 self-assigned this Jul 15, 2024

Matt711 mentioned this issue Jul 16, 2024

Make proxy NumPy arrays pass isinstance check in cudf.pandas #16286

Merged

3 tasks

vyasr mentioned this issue Oct 2, 2024

[DOC] Document limitation using cudf.pandas proxy arrays #16955

Merged

3 tasks

Matt711 linked a pull request Nov 7, 2024 that will close this issue

Add catboost to the third-party integration tests #17267

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Improve support or failure modes for numpy and other libraries with C APIs in cudf.pandas #15397

[FEA] Improve support or failure modes for numpy and other libraries with C APIs in cudf.pandas #15397

vyasr commented Mar 26, 2024

Matt711 commented Jun 5, 2024

vyasr commented Oct 2, 2024

Matt711 commented Oct 2, 2024

vyasr commented Oct 4, 2024 •

edited

Loading

[FEA] Improve support or failure modes for numpy and other libraries with C APIs in cudf.pandas #15397

[FEA] Improve support or failure modes for numpy and other libraries with C APIs in cudf.pandas #15397

Comments

vyasr commented Mar 26, 2024

Matt711 commented Jun 5, 2024

vyasr commented Oct 2, 2024

Matt711 commented Oct 2, 2024

vyasr commented Oct 4, 2024 • edited Loading

vyasr commented Oct 4, 2024 •

edited

Loading