Batch ancestor matching #917

Merged
merged 4 commits into from
Jul 23, 2024

Conversation

benjeffery
Member

@benjeffery benjeffery commented May 14, 2024

Stacked on #896


codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 96.08696% with 9 lines in your changes missing coverage. Please review.

Project coverage is 93.31%. Comparing base (9d8f934) to head (0e617c2).

Files                   Patch %   Lines
tsinfer/inference.py    96.08%    4 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #917      +/-   ##
==========================================
+ Coverage   87.04%   93.31%   +6.27%     
==========================================
  Files           5       18      +13     
  Lines        1767     6279    +4512     
  Branches      310     1131     +821     
==========================================
+ Hits         1538     5859    +4321     
- Misses        140      285     +145     
- Partials       89      135      +46     
Flag Coverage Δ
C 93.31% <96.08%> (+6.27%) ⬆️
python 95.76% <96.08%> (?)


Member

@jeromekelleher jeromekelleher left a comment

LGTM. I'm in favour of dropping LMDB for simpler file-system based methods 👍

tsinfer/inference.py
os.makedirs(match_data_dir, exist_ok=True)
for file in os.listdir(match_data_dir):
    with open(os.path.join(match_data_dir, file), "rb") as f:
        (group, num_sites), batch_results = pickle.load(f)
Member

It's worthwhile making a dataclass for this payload; we'll inevitably need to add more things to it.
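
A minimal sketch of what such a payload dataclass could look like. The class name `MatchBatchPayload` and the helpers `write_payload` and `read_payload` are hypothetical and not taken from the PR; the three fields simply mirror the tuple unpacked in the diff excerpt above. Storing the fields as a plain dict is one possible choice here, not necessarily what tsinfer does.

```python
import dataclasses
import pickle
import tempfile
from pathlib import Path


@dataclasses.dataclass
class MatchBatchPayload:
    # Hypothetical name; fields mirror "(group, num_sites), batch_results".
    group: int
    num_sites: int
    batch_results: list


def write_payload(path, payload):
    # Pickle the fields as a plain dict (an assumption of this sketch).
    with open(path, "wb") as f:
        pickle.dump(dataclasses.asdict(payload), f)


def read_payload(path):
    # Rebuild the dataclass from the stored dict of fields.
    with open(path, "rb") as f:
        return MatchBatchPayload(**pickle.load(f))


# Round-trip one payload through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "batch.pkl"
    original = MatchBatchPayload(group=3, num_sites=100, batch_results=[])
    write_payload(path, original)
    restored = read_payload(path)
```

With named fields, adding something to the payload later means adding a field (ideally with a default) rather than changing the shape of a nested tuple at every call site.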

@hyanwong
Member

I'm in favour of dropping LMDB for simpler file-system based methods 👍

If we do this, will it be possible to run a lightweight version of tsinfer on pyodide? Since pyodide already includes zarr, I'm thinking that maybe it's only LMDB that's the blocker on getting tsinfer to run in-browser for tutorials etc?

@jeromekelleher
Member

Yes, seems likely. We'd have to drop the original SampleData format though.

@benjeffery
Member Author

The latest commit implements part of the scheme suggested at #921. Remaining is the partition matching function.

@benjeffery benjeffery force-pushed the batch_match branch 10 times, most recently from 8216f5b to dadb93c Compare June 5, 2024 13:43
@benjeffery benjeffery marked this pull request as ready for review June 11, 2024 11:48
@benjeffery
Member Author

This is ready to go. I have filed #932 for follow-up work.

Member

@jeromekelleher jeromekelleher left a comment

Generally looks good. A couple of high-level things:

  • The mix of pathlib and os.path/os functions is weird. I would suggest embracing pathlib by doing some_path = pathlib.Path(some_path) at the top of any function that takes a path-like argument, and then exclusively using pathlib methods. It leads to much more readable code. (This applies to new code; old code can stay as is.)
  • There's some stuff that's not tested, mostly error conditions. Good to cover these.
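
The pathlib suggestion in the first point can be sketched as follows. The helper name `list_match_files` is hypothetical and not from the PR; it only illustrates converting the argument once at the top and then using pathlib methods in place of `os.makedirs`, `os.listdir` and `os.path.join`.

```python
import pathlib
import tempfile


def list_match_files(match_data_dir):
    # Hypothetical helper: accept a str or a Path and convert once at the top.
    match_data_dir = pathlib.Path(match_data_dir)
    # Equivalent of os.makedirs(match_data_dir, exist_ok=True).
    match_data_dir.mkdir(parents=True, exist_ok=True)
    # Equivalent of os.listdir plus os.path.join, restricted to regular files.
    return sorted(p for p in match_data_dir.iterdir() if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp) / "match_data"
    # A plain string works too, because of the conversion at the top.
    first_listing = list_match_files(str(data_dir))
    (data_dir / "batch_0.pkl").write_bytes(b"")
    second_listing = [p.name for p in list_match_files(data_dir)]
```

Each returned entry is already a full `Path`, so callers can pass it straight to `open` without joining directory and file name again.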

else:
    total_work = sum(ancestor_lengths[ancestor] for ancestor in group_ancestors)
    min_work_per_job_group = min_work_per_job
    if total_work / 1000 > min_work_per_job:
Member

Probably good to cover this somehow, it's the sort of thing that would catch us out in real applications. Could make the 1000 a parameter, and then test with something small?
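
One way to make the 1000 a parameter, as a rough sketch. The function name `effective_min_work` and the parameter `max_num_partitions` are hypothetical. The body of the `if` branch is not shown in the excerpt above, so the assumption here is that it raises the per-job minimum so that a group is split into at most that many jobs.

```python
def effective_min_work(total_work, min_work_per_job, max_num_partitions=1000):
    # Assumption: if splitting at min_work_per_job would create more than
    # max_num_partitions jobs, raise the per-job minimum to cap the count.
    if total_work / max_num_partitions > min_work_per_job:
        return total_work / max_num_partitions
    return min_work_per_job


# The default leaves small groups untouched.
unchanged = effective_min_work(100, 10)
# A small cap makes the branch reachable with tiny test inputs.
capped = effective_min_work(100, 10, max_num_partitions=5)
```

With the cap exposed as an argument, a test can hit the branch using a handful of ancestors instead of needing a group large enough to exceed 1000 jobs.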

partitions.append(current_partition)
# Make directories for the path data
if len(partitions) > 1:
    os.mkdir(os.path.join(working_dir, f"group_{group_index}"))
Member

I don't see the attraction of mixing os.path/os.mkdir etc with pathlib. I would make sure working_dir is a pathlib at the top of the function, and then just use pathlib methods on it:

     group_dir = working_dir / f"group_{group_index}"
     group_dir.mkdir()

@benjeffery
Member Author

Addressed the comments here. I believe this is ready to merge.

Member

@jeromekelleher jeromekelleher left a comment

LGTM. Just spotted a few redundant comments ("x += 1  # Add one to x").

We should also cover those error cases if we can. Happy to merge, though.

@benjeffery
Member Author

Comments addressed. I've added a bit more testing. Some of the missing coverage in error cases is in code paths that will be removed by the sample matching batch refactor.

@jeromekelleher
Member

Merge away

@mergify mergify bot merged commit 6b2116d into tskit-dev:main Jul 23, 2024
14 checks passed