-
Notifications
You must be signed in to change notification settings - Fork 78
Decode ragged string columns #3228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Would appreciate a check over the tricky Python C API stuff here. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3228 +/- ##
==========================================
- Coverage 89.62% 89.61% -0.02%
==========================================
Files 28 28
Lines 31916 31983 +67
Branches 5876 5888 +12
==========================================
+ Hits 28604 28660 +56
- Misses 1882 1888 +6
- Partials 1430 1435 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
I think we want to just use the numpy 2.0 StringDtype. I had a look through at one point and it does imply making a full copy of the string data, but I think that's fine. |
|
I think this entails something like this: I haven't used this API yet though, so haven't seen good examples of it in practise. Maybe that would be a good first step - let's track down an idiomatic example of it in use that we can copy? |
2ec7dba to
03801ed
Compare
|
I searched github and couldn't find a single example of how to do this, but it wasn't too hard to figure out. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
Will it be possible to conditionally compile this if we're on numpy 2? I had hoped we could do this, as keeping numpy 1 support for a while longer would be good.
9af734e to
4b0c48d
Compare
|
How's this approach for conditional compilation? |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty clean, yeah. There's a bunch of subtleties though and I'd have to check the code out and poke at it to really understand
4b0c48d to
009f521
Compare
|
I've checked what the error message I remember seeing was: So by pinning to This I think makes our "build artifacts on 2, pin to 2, but conditionally compile 2.0 features" strategy the right thing to do, with a note in the docs about what to do if you encounter the above error. |
|
Seems numba doesn't support the new dtype: There isn't an issue to track this support over at the numba github - considering filling one. The solution for now is to convert to a supported dtype before passing to numba. |
|
Some more details about the new string type: 1- Main array is struct of (length, pointer) |
fd7658c to
46d7137
Compare
|
I've run tests in a loop for sometime without observing memory leaks. Think this is ready for review. |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a close look at this, and gone through the numpy 2.0 code to see how it works. I'm happy this is correct and to merge (modulo on minor nit). I added a few tests also.
|
Ok, if you're happy I'll squash and merge. |
e50dd37 to
6d69081
Compare
A start on exposing the ragged columns as arrays #2632