Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fastalignment metric #456

Merged
merged 19 commits into from Jan 4, 2024
Merged

Add fastalignment metric #456

merged 19 commits into from Jan 4, 2024

Conversation

tobgan
Copy link
Contributor

@tobgan tobgan commented Oct 18, 2023

Closes #304

  • CHANGELOG.md updated
  • Tests added (For bug fixes or new features)
  • Tutorial updated (if necessary)

tobgan and others added 5 commits October 3, 2023 23:57
@codecov
Copy link

codecov bot commented Oct 18, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (7dd3377) 80.22% compared to head (21f06c1) 80.49%.

Files Patch % Lines
src/scirpy/ir_dist/metrics.py 98.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #456      +/-   ##
==========================================
+ Coverage   80.22%   80.49%   +0.27%     
==========================================
  Files          49       49              
  Lines        3939     3994      +55     
==========================================
+ Hits         3160     3215      +55     
  Misses        779      779              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tobgan,

thank you so much for adding this!
The code you added looks good, but there are a few point that still need some work to make this more easily accessible to users. I'm happy to guide you, but I can also work on those tasks myself if you prefer.

  • documentation: The new class should be added to the API docs
  • tutorial: the new method should be mentioned in the docs and probably even be suggested as the standard method.
  • The code needs unit tests (it's mostly applying the existing tests to the new method as well and adjusting the expected values)

Before we tackle those I have a more general question: Do I remember correctly, that if only the length filtering is applied there are no losses and the results are exactly the same as with the plain AlignmentDistanceCalculator but faster? (obviously not as fast as the full method). Because in that case I would consider getting rid of the old AlignmentDistanceCalculator altogether and just use the new class with different parameters for alignment and fastalignment. What do you think?

Best,
Gregor

src/scirpy/ir_dist/metrics.py Outdated Show resolved Hide resolved
Comment on lines 599 to 601
else penalty_dict[subst_mat]
if subst_mat in penalty_dict.keys()
else 0.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be simplified to

Suggested change
else penalty_dict[subst_mat]
if subst_mat in penalty_dict.keys()
else 0.0
else penalty_dict.get(subst_mat, 0.0)

I would even consider raising an error if the substitution matrix is unnown and no penalty is specified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think raising an error would be better

src/scirpy/ir_dist/metrics.py Show resolved Hide resolved
@grst
Copy link
Collaborator

grst commented Nov 16, 2023

Hi @tobgan -- would be great if I could get your feedback on the points raised above.

Co-authored-by: Gregor Sturm <mail@gregor-sturm.de>
@tobgan
Copy link
Contributor Author

tobgan commented Nov 16, 2023

Hi @grst,

I'm very sorry for the belated response, I have been dealing with some health issues lately.

Regarding the length filter - yes, the results are exactly the same, at least in practice. Theoretically, it could still result in some loss, e.g. with unusually high-scoring mismatches (as, for example, in PAM500, where the mismatch N-D scores higher than the match N-N), but this has not happened in any of my test runs and I would wager to say it is unlikely to happen with real data, and even less likely to do so to a relevant degree. So I think replacing the AlignmentDistanceCalculator altogether would make sense.

If you agree, I would then just make the FastAlignmentDistanceCalculator the new AlignmentDistanceCalculator. I will also add the unit tests, but I could use some pointers on how to best add to the documentation and tutorial.

Best,
Tobias

@grst
Copy link
Collaborator

grst commented Nov 22, 2023

Thanks, that sounds good. Honestly, I doubt anyone has ever changed the substitution matrix to anything other than blosum62, so that should be fine.

I would then suggest that in pp.ir_dist

  • specifying metric="alignment" sets the parameters of the FastAlignmentDistanceCalculator such that the result is identical to the previous AlignmentDistanceCalculator for backwards compatibility (ignoring the PAM500 caveat)
  • specifying metric="fastalignment" uses the same class, but sets the parameters such that the heuristic is used for additional speedup.

Regarding the API docs, the list of functions is here: https://github.com/scverse/scirpy/blob/main/docs/api.rst?plain=1#L298

For the tutorial, it's probably easiest if I take care of that myself as a final step.

@tobgan
Copy link
Contributor Author

tobgan commented Dec 3, 2023

Hi @grst,
I added the changes to pp.ir_dist and the documentation. I also reran the tests on the Wu dataset, for the default parameters I get a 5x speedup from the old to the new alignment metric on my laptop (with the expected zero loss), so this seems to work as expected. Let me know if I overlooked anything.

@grst
Copy link
Collaborator

grst commented Dec 6, 2023

Perfect, thanks! I'll go over the documentation one more time myself and update the tutorial and then merge this!

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@grst
Copy link
Collaborator

grst commented Dec 27, 2023

Adding the tables with the expected penalty here for future reference:

Wu2020 3k dataset
image

Wu2020 full dataset
"using the nighest expected penalty which kept loss <1% for the 3k dataset"
image

Copy link
Collaborator

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped up the open things... Looks good to me now!

Some CI job is failing, but it doesn't look related to this PR.

Thanks again @tobgan!

@grst grst merged commit e1c8028 into scverse:main Jan 4, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up ir_dist
2 participants