@benjeffery
Member

A start on exposing the ragged columns as arrays #2632

@benjeffery
Member Author

Would appreciate a check over the tricky Python C API stuff here.

@codecov

codecov bot commented Jun 19, 2025

Codecov Report

Attention: Patch coverage is 82.60870% with 12 lines in your changes missing coverage. Please review.

Project coverage is 89.61%. Comparing base (b118e2c) to head (6d69081).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
python/_tskitmodule.c 77.35% 6 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3228      +/-   ##
==========================================
- Coverage   89.62%   89.61%   -0.02%     
==========================================
  Files          28       28              
  Lines       31916    31983      +67     
  Branches     5876     5888      +12     
==========================================
+ Hits        28604    28660      +56     
- Misses       1882     1888       +6     
- Partials     1430     1435       +5     
Flag Coverage Δ
c-tests 86.59% <ø> (ø)
lwt-tests 80.38% <ø> (ø)
python-c-tests 88.15% <77.35%> (-0.07%) ⬇️
python-tests 98.80% <75.00%> (-0.06%) ⬇️
python-tests-numpy1 52.44% <50.00%> (?)

Files with missing lines Coverage Δ
python/tskit/trees.py 98.85% <100.00%> (+<0.01%) ⬆️
python/_tskitmodule.c 88.15% <77.35%> (-0.07%) ⬇️

@jeromekelleher
Member

I think we want to just use the numpy 2.0 StringDType. I had a look through at one point and it does imply making a full copy of the string data, but I think that's fine.

@jeromekelleher
Member

I think this entails something like this:

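allocator = NpyString_acquire_allocator(...)
for (j = 0; j < num_records; j++)
    /* Pass a pointer to the first byte of record j, not the byte itself */
    NpyString_pack(allocator, packed_string, data + offset[j], offset[j + 1] - offset[j])
NpyString_release_allocator(allocator)
return packed_string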

I haven't used this API yet though, so haven't seen good examples of it in practice. Maybe that would be a good first step - let's track down an idiomatic example of it in use that we can copy?
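
For reference, the offset arithmetic in the pseudocode above can be sketched in plain Python. This is only an illustration of the ragged layout, not the C implementation: `decode_ragged_strings` is a hypothetical helper, and it assumes UTF-8 data with an offset array that starts at 0 and ends at `len(data)`.

```python
def decode_ragged_strings(data: bytes, offset: list[int]) -> list[str]:
    """Split a flat byte buffer into one string per record.

    Record j occupies data[offset[j]:offset[j + 1]], so offset holds
    num_records + 1 entries and must never decrease.
    """
    if not offset or offset[0] != 0 or offset[-1] != len(data):
        raise ValueError("offset must start at 0 and end at len(data)")
    strings = []
    for j in range(len(offset) - 1):
        start, end = offset[j], offset[j + 1]
        if end < start:
            raise ValueError(f"offset decreases at record {j}")
        # The slice copies the bytes, mirroring the full copy mentioned earlier.
        strings.append(data[start:end].decode("utf-8"))
    return strings


# Four records packed into one buffer; the second record is empty.
print(decode_ragged_strings(b"ACGTT", [0, 1, 1, 3, 5]))
```

An empty record shows up as two equal consecutive offsets, which is why the length is computed as `offset[j + 1] - offset[j]` rather than stored separately.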

@benjeffery benjeffery force-pushed the ragged-string-columns branch 3 times, most recently from 2ec7dba to 03801ed Compare June 20, 2025 10:16
@benjeffery
Member Author

I searched github and couldn't find a single example of how to do this, but it wasn't too hard to figure out.
We're now pinning to numpy>2.

Member

@jeromekelleher jeromekelleher left a comment

Cool!

Will it be possible to conditionally compile this if we're on numpy 2? I had hoped we could do this, as keeping numpy 1 support for a while longer would be good.

@benjeffery benjeffery force-pushed the ragged-string-columns branch 2 times, most recently from 9af734e to 4b0c48d Compare June 20, 2025 12:26
@benjeffery
Member Author

How's this approach for conditional compilation?
Will probably need a separate CI test, as my attempt here to add it to the existing one has hit issues with python versions.

Member

@jeromekelleher jeromekelleher left a comment

Looks pretty clean, yeah. There are a bunch of subtleties though, and I'd have to check the code out and poke at it to really understand.

@benjeffery benjeffery force-pushed the ragged-string-columns branch from 4b0c48d to 009f521 Compare June 20, 2025 13:31
@benjeffery
Member Author

I've checked what the error message I remember seeing was:

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.

So by pinning to numpy>2 we are making tskit incompatible with any python wheels that were built with numpy 1.X.
The solution for this will be to build tskit from source with numpy 1.X.

This I think makes our "build artifacts on 2, pin to 2, but conditionally compile 2.0 features" strategy the right thing to do, with a note in the docs about what to do if you encounter the above error.
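
On the Python side, a caller could guard use of the string columns with a version check along these lines. This is a rough sketch only: `string_columns_available` is a hypothetical helper, and how tskit itself gates the feature is not shown in this thread. It looks at the installed numpy version, which is an assumption about where the check would live.

```python
def string_columns_available(numpy_version: str) -> bool:
    """Return True if the given numpy version string has major version >= 2."""
    # Only the major component is parsed, so "2.1.0rc1" is handled too.
    major = int(numpy_version.split(".")[0])
    return major >= 2


try:
    import numpy as np

    print(string_columns_available(np.__version__))
except ImportError:
    # numpy is optional for this illustration.
    print("numpy is not installed")
```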

@benjeffery
Member Author

Seems numba doesn't support the new dtype:
numba.core.errors.NumbaValueError: Unsupported array dtype: StringDType()

There isn't an issue to track this support over at the numba GitHub - considering filing one.

The solution for now is to convert to a supported dtype before passing to numba.

@benjeffery
Member Author

Some more details about the new string type:

1. The main array is a struct of (length, pointer).
2. The short string optimisation stores the string directly in place of the pointer.
3. The arena allocator means that strings are contiguous on the heap.
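
A toy Python model of those three points, purely for intuition. None of these names come from numpy: `ToyStringArray`, `Entry` and `INLINE_MAX` are made up here, and the 15-byte inline threshold is an assumption for illustration (the real limit depends on the platform).

```python
from dataclasses import dataclass
from typing import Optional

INLINE_MAX = 15  # assumed short-string threshold, illustration only


@dataclass
class Entry:
    length: int
    inline: Optional[bytes]      # payload stored in place of the pointer
    arena_offset: Optional[int]  # stands in for the heap pointer


class ToyStringArray:
    def __init__(self) -> None:
        self.entries: list[Entry] = []  # the "main array" of fixed-size slots
        self.arena = bytearray()        # one contiguous heap region

    def append(self, text: str) -> None:
        raw = text.encode("utf-8")
        if len(raw) <= INLINE_MAX:
            # Short string: keep the bytes inside the entry itself.
            self.entries.append(Entry(len(raw), raw, None))
        else:
            # Long string: bump-allocate at the end of the arena.
            self.entries.append(Entry(len(raw), None, len(self.arena)))
            self.arena += raw

    def __getitem__(self, j: int) -> str:
        entry = self.entries[j]
        if entry.inline is not None:
            raw = entry.inline
        else:
            start = entry.arena_offset
            raw = bytes(self.arena[start:start + entry.length])
        return raw.decode("utf-8")


arr = ToyStringArray()
for text in ["A", "x" * 40, "CG", "y" * 20]:
    arr.append(text)
print(len(arr.arena), arr[1] == "x" * 40)
```

Only the two long strings touch the arena, and they sit back to back in it, which is the contiguity that point 3 describes.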

@benjeffery benjeffery force-pushed the ragged-string-columns branch from fd7658c to 46d7137 Compare June 24, 2025 09:17
@benjeffery
Member Author

I've run tests in a loop for some time without observing memory leaks. I think this is ready for review.
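
One way to do a loop check like this is with `tracemalloc`, sketched below. This is not necessarily how the check above was done, and `memory_growth` is a hypothetical helper. Note that `tracemalloc` only sees allocations made through Python's allocators, so memory obtained with a raw C `malloc` would not show up.

```python
import tracemalloc


def memory_growth(func, iterations=1000, warmup=10):
    """Return the change in traced memory (bytes) over repeated calls to func."""
    # Warm up first so one-off caches are not counted as growth.
    for _ in range(warmup):
        func()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        func()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# A deliberately leaky function next to a clean one, for comparison.
kept = []
print("leaky:", memory_growth(lambda: kept.append(bytearray(1024))))
print("clean:", memory_growth(lambda: bytearray(1024)))
```

Growth that scales with the iteration count suggests a leak; a small, roughly constant figure usually does not.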

@benjeffery benjeffery marked this pull request as ready for review June 24, 2025 12:11
@benjeffery benjeffery changed the title WIP - decode ragged string columns Decode ragged string columns Jun 24, 2025
Member

@jeromekelleher jeromekelleher left a comment

I've had a close look at this, and gone through the numpy 2.0 code to see how it works. I'm happy this is correct and to merge (modulo one minor nit). I also added a few tests.

@benjeffery
Member Author

Ok, if you're happy I'll squash and merge.

@benjeffery benjeffery force-pushed the ragged-string-columns branch from e50dd37 to 6d69081 Compare June 26, 2025 08:20
@benjeffery benjeffery enabled auto-merge June 26, 2025 08:38
@benjeffery benjeffery added this pull request to the merge queue Jun 26, 2025
Merged via the queue into tskit-dev:main with commit 9c87a27 Jun 26, 2025
17 of 19 checks passed
@benjeffery benjeffery deleted the ragged-string-columns branch June 26, 2025 11:06