Decode ragged string columns #3228

benjeffery · 2025-06-19T13:39:54Z

A start on exposing the ragged columns as arrays #2632

benjeffery · 2025-06-19T13:40:17Z

Would appreciate a check over the tricky Python C API stuff here.

codecov · 2025-06-19T13:45:15Z

Codecov Report

Attention: Patch coverage is 82.60870% with 12 lines in your changes missing coverage. Please review.

Project coverage is 89.61%. Comparing base (b118e2c) to head (6d69081).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
python/_tskitmodule.c	77.35%	6 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3228      +/-   ##
==========================================
- Coverage   89.62%   89.61%   -0.02%     
==========================================
  Files          28       28              
  Lines       31916    31983      +67     
  Branches     5876     5888      +12     
==========================================
+ Hits        28604    28660      +56     
- Misses       1882     1888       +6     
- Partials     1430     1435       +5

Flag	Coverage Δ
c-tests	`86.59% <ø> (ø)`
lwt-tests	`80.38% <ø> (ø)`
python-c-tests	`88.15% <77.35%> (-0.07%)`	⬇️
python-tests	`98.80% <75.00%> (-0.06%)`	⬇️
python-tests-numpy1	`52.44% <50.00%> (?)`

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
python/tskit/trees.py	`98.85% <100.00%> (+<0.01%)`	⬆️
python/_tskitmodule.c	`88.15% <77.35%> (-0.07%)`	⬇️

jeromekelleher · 2025-06-19T13:52:20Z

I think we want to just use the numpy 2.0 StringDtype. I had a look through at one point and it does imply making a full copy of the string data, but I think that's fine.

jeromekelleher · 2025-06-19T13:58:20Z

I think this entails something like this:

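allocator = NpyString_acquire_allocator(...)
for (j = 0; j < num_records; j++)
    /* pack record j: pointer to its first byte, then its length */
    NpyString_pack(allocator, &packed_strings[j], data + offset[j], offset[j + 1] - offset[j])
NpyString_release_allocator(allocator)
return packed_strings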

I haven't used this API yet though, so I haven't seen good examples of it in practice. Maybe that would be a good first step - let's track down an idiomatic example of it in use that we can copy?
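
For readers unfamiliar with the ragged layout, here is a minimal pure-Python sketch of the same loop. It assumes a flat byte buffer plus an offsets array with num_records + 1 entries, where record j spans data[offset[j]:offset[j + 1]]. The name `decode_ragged` and the UTF-8 default are illustrative assumptions, not part of tskit or numpy.

```python
from typing import List, Sequence


def decode_ragged(
    data: bytes, offset: Sequence[int], encoding: str = "utf-8"
) -> List[str]:
    """Split a flat byte buffer into strings using an offsets array.

    Hypothetical helper for illustration only. `offset` has
    num_records + 1 entries; record j is data[offset[j]:offset[j + 1]].
    """
    if len(offset) == 0:
        raise ValueError("offset must contain at least one entry")
    if offset[-1] > len(data):
        raise ValueError("last offset points past the end of data")
    strings = []
    for j in range(len(offset) - 1):
        start, stop = offset[j], offset[j + 1]
        if stop < start:
            raise ValueError("offsets must be non-decreasing")
        # Each record is copied out of the shared buffer, mirroring the
        # full copy that packing into a string array implies.
        strings.append(data[start:stop].decode(encoding))
    return strings


# Four records, the third of which is empty.
states = decode_ragged(b"ACGTTA", [0, 1, 2, 2, 6])
print(states)
```

An empty record is simply two equal consecutive offsets, so no sentinel value is needed.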

benjeffery · 2025-06-20T10:18:50Z

I searched GitHub and couldn't find a single example of how to do this, but it wasn't too hard to figure out.
We're now pinning to numpy>2.

jeromekelleher

Cool!

Will it be possible to conditionally compile this if we're on numpy 2? I had hoped we could do this, as keeping numpy 1 support for a while longer would be good.

python/_tskitmodule.c

benjeffery · 2025-06-20T12:45:01Z

How's this approach for conditional compilation?
Will probably need a separate CI test, as my attempt here to add it to the existing one has hit issues with python versions.

jeromekelleher

Looks pretty clean, yeah. There's a bunch of subtleties though, and I'd have to check the code out and poke at it to really understand.

python/_tskitmodule.c

benjeffery · 2025-06-23T15:24:05Z

I've checked what the error message I remember seeing was:

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.

So by pinning to numpy>2 we are making tskit incompatible with any python wheels that were built with numpy 1.X.
The solution for this will be to build tskit from source with numpy 1.X.

This I think makes our "build artifacts on 2, pin to 2, but conditionally compile 2.0 features" strategy the right thing to do, with a note in the docs about what to do if you encounter the above error.
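
On the Python side, a caller could guard use of the new dtype with a version check along these lines. This is a sketch only: the function names are hypothetical, and it relies on nothing beyond the fact that StringDType first shipped in NumPy 2.0.

```python
def numpy_major_version(version: str) -> int:
    """Return the major component of a version string such as '2.0.0'."""
    return int(version.split(".")[0])


def has_string_dtype(version: str) -> bool:
    """Hypothetical helper: True if this NumPy version provides StringDType.

    StringDType was introduced in NumPy 2.0, so any 1.x version lacks it.
    """
    return numpy_major_version(version) >= 2


for v in ("1.26.4", "2.0.0", "2.3.1"):
    print(v, has_string_dtype(v))
```

In real code the argument would be `numpy.__version__`. An alternative that avoids parsing is to test `hasattr(numpy.dtypes, "StringDType")`, although the `numpy.dtypes` module itself is absent from older 1.x releases.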

benjeffery · 2025-06-23T16:28:34Z

Seems numba doesn't support the new dtype:
numba.core.errors.NumbaValueError: Unsupported array dtype: StringDType()

There isn't an issue tracking this support over at the numba GitHub repository - considering filing one.

The solution for now is to convert to a supported dtype before passing to numba.
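
One possible conversion, sketched here as an assumption rather than what tskit does, is to map each distinct string to a small integer code and pass the integer codes to numba instead. The helper name `encode_states` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def encode_states(states: Sequence[str]) -> Tuple[List[int], List[str]]:
    """Map strings to integer codes in order of first appearance.

    Hypothetical helper. Returns (codes, alphabet) such that
    alphabet[codes[j]] == states[j] for every j.
    """
    alphabet: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for state in states:
        if state not in index:
            index[state] = len(alphabet)
            alphabet.append(state)
        codes.append(index[state])
    return codes, alphabet


codes, alphabet = encode_states(["A", "T", "A", "", "T"])
print(codes, alphabet)
```

The codes can then be wrapped in an integer numpy array, which numba handles, and the alphabet kept on the Python side to translate results back.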

benjeffery · 2025-06-23T16:51:27Z

Some more details about the new string type:

1. Each element of the main array is a struct of (length, pointer).
2. The short-string optimisation stores the string directly in place of the pointer.
3. An arena allocator means that the strings are contiguous on the heap.
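
The three points above can be illustrated with a toy model. This is emphatically not numpy's implementation: the class name, the 15-byte inline threshold, and the use of an arena offset in place of a raw pointer are all assumptions made for the sketch.

```python
from typing import List, Tuple, Union


class ToyStringArena:
    """Toy model of a (length, pointer) string array with an arena.

    Illustration only. Short strings are stored inline in the entry;
    longer ones are appended to one contiguous bytearray (the arena)
    and the entry records their offset.
    """

    def __init__(self, inline_max: int = 15) -> None:
        self.inline_max = inline_max  # assumed threshold, not numpy's API
        self.arena = bytearray()
        self.entries: List[Tuple[int, Union[bytes, int]]] = []

    def pack(self, value: bytes) -> None:
        if len(value) <= self.inline_max:
            # Short-string case: the payload sits where the pointer would.
            self.entries.append((len(value), value))
        else:
            offset = len(self.arena)
            self.arena += value  # contiguous storage on the "heap"
            self.entries.append((len(value), offset))

    def load(self, j: int) -> bytes:
        length, payload = self.entries[j]
        if isinstance(payload, bytes):
            return payload
        return bytes(self.arena[payload:payload + length])


toy = ToyStringArena()
toy.pack(b"A")
toy.pack(b"G" * 20)
toy.pack(b"T" * 30)
print(len(toy.arena), toy.load(0), len(toy.load(1)), len(toy.load(2)))
```

Only the two long strings reach the arena, where they sit back to back; the one-byte string never leaves its entry.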

benjeffery · 2025-06-24T12:11:14Z

I've run the tests in a loop for some time without observing memory leaks. I think this is ready for review.

jeromekelleher

I've had a close look at this, and gone through the numpy 2.0 code to see how it works. I'm happy this is correct and to merge (modulo one minor nit). I also added a few tests.

python/CHANGELOG.rst

python/tskit/trees.py

benjeffery · 2025-06-25T18:04:58Z

Ok, if you're happy I'll squash and merge.

benjeffery force-pushed the ragged-string-columns branch 3 times, most recently from 2ec7dba to 03801ed Compare June 20, 2025 10:16

jeromekelleher reviewed Jun 20, 2025

benjeffery force-pushed the ragged-string-columns branch 2 times, most recently from 9af734e to 4b0c48d Compare June 20, 2025 12:26

jeromekelleher reviewed Jun 20, 2025

benjeffery force-pushed the ragged-string-columns branch from 4b0c48d to 009f521 Compare June 20, 2025 13:31

benjeffery force-pushed the ragged-string-columns branch from fd7658c to 46d7137 Compare June 24, 2025 09:17

benjeffery marked this pull request as ready for review June 24, 2025 12:11

benjeffery changed the title ~~WIP - decode ragged string columns~~ Decode ragged string columns Jun 24, 2025

jeromekelleher reviewed Jun 25, 2025

jeromekelleher mentioned this pull request Jun 25, 2025

Mutation's "inherited" state #2631

Closed

jeromekelleher approved these changes Jun 25, 2025

Add sites_ancestral_state and mutation_derived_state array properties

6d69081

benjeffery force-pushed the ragged-string-columns branch from e50dd37 to 6d69081 Compare June 26, 2025 08:20

benjeffery enabled auto-merge June 26, 2025 08:38

benjeffery added this pull request to the merge queue Jun 26, 2025

Merged via the queue into tskit-dev:main with commit 9c87a27 Jun 26, 2025
17 of 19 checks passed

benjeffery deleted the ragged-string-columns branch June 26, 2025 11:06

benjeffery mentioned this pull request Jun 26, 2025

Numpy access to ancestral and derived state columns in TreeSequence #2632

Closed
