Batch ancestor matching #917

benjeffery · 2024-05-14T22:26:01Z

Stacked on #896

codecov · 2024-05-14T22:41:01Z

Codecov Report

Attention: Patch coverage is 96.08696% with 9 lines in your changes missing coverage. Please review.

Project coverage is 93.31%. Comparing base (9d8f934) to head (0e617c2).

Files	Patch %	Lines
tsinfer/inference.py	96.08%	4 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #917      +/-   ##
==========================================
+ Coverage   87.04%   93.31%   +6.27%     
==========================================
  Files           5       18      +13     
  Lines        1767     6279    +4512     
  Branches      310     1131     +821     
==========================================
+ Hits         1538     5859    +4321     
- Misses        140      285     +145     
- Partials       89      135      +46

Flag	Coverage Δ
C	`93.31% <96.08%> (+6.27%)`	⬆️
python	`95.76% <96.08%> (?)`

jeromekelleher

LGTM. I'm in favour of dropping LMDB for simpler file-system based methods 👍

jeromekelleher · 2024-05-15T09:03:52Z

tsinfer/inference.py

+            os.makedirs(match_data_dir, exist_ok=True)
+            for file in os.listdir(match_data_dir):
+                with open(os.path.join(match_data_dir, file), "rb") as f:
+                    (group, num_sites), batch_results = pickle.load(f)


It's worthwhile making a dataclass for this payload; we'll inevitably need to add more things to it.
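A minimal sketch of what such a payload dataclass could look like. Only `group`, `num_sites`, and `batch_results` come from the diff above; the class name `BatchMatchPayload`, the helpers `write_payload` and `read_payload`, and the field types are hypothetical and not taken from the PR.

```python
import dataclasses
import pickle


@dataclasses.dataclass
class BatchMatchPayload:
    """Hypothetical record replacing the ((group, num_sites), batch_results) tuple."""

    group: int
    num_sites: int
    batch_results: list
    # Fields added later should get default values so older files still load.


def write_payload(path, payload):
    # Store a plain dict so the file does not depend on the class's import path.
    with open(path, "wb") as f:
        pickle.dump(dataclasses.asdict(payload), f)


def read_payload(path):
    # Rebuild the dataclass from the stored dict; unknown keys raise TypeError.
    with open(path, "rb") as f:
        return BatchMatchPayload(**pickle.load(f))
```

Callers then read `payload.group` and `payload.num_sites` by name instead of unpacking nested tuples, so adding a field does not break every unpacking site.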

hyanwong · 2024-05-15T09:41:30Z

I'm in favour of dropping LMDB for simpler file-system based methods 👍

If we do this, will it be possible to run a lightweight version of tsinfer on pyodide? Since pyodide already includes zarr, I'm thinking that maybe it's only LMDB that's the blocker on getting tsinfer to run in-browser for tutorials etc?

jeromekelleher · 2024-05-15T10:14:06Z

Yes, seems likely. We'd have to drop the original SampleData format though.

benjeffery · 2024-05-24T16:45:13Z

The latest commit implements part of the scheme suggested in #921. The partition matching function remains to be done.

benjeffery · 2024-06-11T12:03:42Z

This is ready to go. I have filed #932 for follow-up work.

jeromekelleher

Generally looks good. A couple of high-level things:

The mix of pathlib and os.path/os functions is weird. I would suggest embracing pathlib by doing some_path = pathlib.Path(some_path) at the top of a function that takes a pathlike as an argument, and then exclusively using pathlib functions. It leads to much more readable code. (This applies to new code; old code can stay as is.)
There's some stuff that's not tested, mostly error conditions. Good to cover these.
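As an illustration of the pathlib suggestion, here is a small sketch of a function that coerces its argument once and then uses only pathlib methods. The helper name `list_match_files` is hypothetical; it is not a function from tsinfer.

```python
import pathlib


def list_match_files(match_data_dir):
    """Return the regular files in match_data_dir, creating it if needed.

    Hypothetical helper: accepts a str or any os.PathLike.
    """
    # Coerce once at the top, then use only pathlib methods below.
    match_data_dir = pathlib.Path(match_data_dir)
    match_data_dir.mkdir(parents=True, exist_ok=True)
    # iterdir() yields full Path objects, so no os.path.join is needed.
    return sorted(p for p in match_data_dir.iterdir() if p.is_file())
```

Sorting makes the order deterministic, which `iterdir()` alone does not guarantee.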

jeromekelleher · 2024-06-20T08:46:49Z

tsinfer/inference.py

+        else:
+            total_work = sum(ancestor_lengths[ancestor] for ancestor in group_ancestors)
+            min_work_per_job_group = min_work_per_job
+            if total_work / 1000 > min_work_per_job:


Probably good to cover this somehow, it's the sort of thing that would catch us out in real applications. Could make the 1000 a parameter, and then test with something small?
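One way to make the 1000 testable is to lift it into a parameter, sketched below. The diff does not show the body of the `if`, so this is an assumption: the sketch supposes the branch raises the per-job threshold so that a group is split into at most `max_partitions` jobs. The function name `min_work_for_group` and the parameter name `max_partitions` are hypothetical.

```python
def min_work_for_group(total_work, min_work_per_job, max_partitions=1000):
    """Return the per-job work threshold to use for one ancestor group.

    Assumption (not shown in the diff): when splitting at min_work_per_job
    would give more than max_partitions jobs, the threshold is raised to
    total_work / max_partitions.
    """
    if max_partitions < 1:
        raise ValueError("max_partitions must be at least 1")
    if total_work / max_partitions > min_work_per_job:
        return total_work / max_partitions
    return min_work_per_job
```

A test can then pass a small value, for example `max_partitions=5` with `total_work=100` and `min_work_per_job=10`, to exercise the capped branch without building a huge input.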

jeromekelleher · 2024-06-20T08:49:10Z

tsinfer/inference.py

+            partitions.append(current_partition)
+        # Make directories for the path data
+        if len(partitions) > 1:
+            os.mkdir(os.path.join(working_dir, f"group_{group_index}"))


I don't see the attraction of mixing os.path/os.mkdir etc. with pathlib. I would make sure working_dir is a pathlib.Path at the top of the function, and then just use pathlib methods on it:

group_dir = working_dir / f"group_{group_index}"
group_dir.mkdir()

benjeffery · 2024-07-19T13:15:52Z

Addressed the comments here - believe this is ready to merge.

jeromekelleher

LGTM. Just spotted a few redundant comments (e.g. "x += 1  # Add one to x").

We should also cover those error cases if we can. Happy to merge, though.

benjeffery · 2024-07-23T00:11:05Z

Comments addressed. I've added a bit more testing - some of the missing coverage in error cases is in code paths that will be removed by the sample matching batch refactor.

jeromekelleher · 2024-07-23T08:33:38Z

Merge away

jeromekelleher reviewed May 15, 2024

benjeffery force-pushed the batch_match branch 3 times, most recently from 19207b3 to 8729437, on May 15, 2024 23:50

benjeffery force-pushed the batch_match branch from 8351c99 to e03cc88 on May 24, 2024 16:28

benjeffery force-pushed the batch_match branch 10 times, most recently from 8216f5b to dadb93c, on June 5, 2024 13:43

benjeffery mentioned this pull request Jun 6, 2024

Batching during match ancestors wastes worker CPU. #880

Closed

benjeffery marked this pull request as ready for review June 11, 2024 11:48

jeromekelleher mentioned this pull request Jun 20, 2024

Grep for "sgkit" in names and replace #934

Closed

jeromekelleher reviewed Jun 20, 2024

benjeffery added 3 commits July 19, 2024 13:17

Remove dask ancestor matching

378d81a

Switch to file based caching

9f03d63

Add batch ancestor matching - whole groups

982f7df

benjeffery force-pushed the batch_match branch from 2890da3 to 72668c8 on July 19, 2024 13:03

jeromekelleher approved these changes Jul 22, 2024

Add partition matching

0e617c2

benjeffery force-pushed the batch_match branch from 72668c8 to 0e617c2 on July 23, 2024 00:09

benjeffery added the AUTOMERGE-REQUESTED label Jul 23, 2024

mergify bot merged commit 6b2116d into tskit-dev:main Jul 23, 2024
14 checks passed

mergify bot removed the AUTOMERGE-REQUESTED label Jul 23, 2024
