summary: further improvements to protein handling in sourmash #1525

Open
ctb opened this issue May 15, 2021 · 1 comment
ctb commented May 15, 2021

This is an update of and replacement for #999, which raised a lot of issues around how we were doing protein k-mer calculations.

This issue is being updated after the release of sourmash 4.1.


Over the past year, several of the issues in #999 were resolved by the release of sourmash v4, which introduced sourmash sketch (via #1159).

Taking from @bluegenes's excellent summary, here are the remaining unresolved issues from #999.

I do think differentiating sketches by hash functions #751 is its whole own thing and not specifically protein-esque.


Notes and thoughts:

It would be nice to figure out whether #1037, which checks the first 100bp of FASTA files, is a good approach. Thoughts from #999 (comment) that seem relevant:

  • I think we do need both command line and API level checking. The command line can make use of additional info (filename, statistics aggregated across sequences, etc.) while the API has to do the trickier job of working with only the sequence it's given.
  • I am leaning towards separate add_dna_sequence and add_protein_sequence methods at the API level.
  • It's not 100% clear how robust it will be to check whether any given k-mer is DNA vs protein.
  • One strategy might be to look at what fraction of k-mers use only a valid alphabet.
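
The "fraction of valid k-mers" idea from the last bullet could look roughly like the sketch below. This is only an illustration, not sourmash code: the function names, the alphabets, the default ksize, and the 0.9 threshold are all assumptions made up for the example. Note that A, C, G, and T are also valid amino acid letters, so the sketch tests the narrower DNA alphabet first.

```python
# Hypothetical helpers; none of these names exist in sourmash.
DNA_ALPHABET = frozenset("ACGTN")
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")


def fraction_valid_kmers(sequence, ksize, alphabet):
    """Return the fraction of k-mers whose characters all fall in `alphabet`."""
    seq = sequence.upper()
    n_kmers = len(seq) - ksize + 1
    if n_kmers <= 0:
        # Sequence is shorter than ksize, so there is nothing to check.
        return 0.0
    n_valid = sum(
        1
        for i in range(n_kmers)
        if all(ch in alphabet for ch in seq[i:i + ksize])
    )
    return n_valid / n_kmers


def guess_moltype(sequence, ksize=7, threshold=0.9):
    """Guess 'DNA', 'protein', or 'unknown' (threshold is an assumed cutoff)."""
    # Check DNA first: every DNA letter is also a valid protein letter.
    if fraction_valid_kmers(sequence, ksize, DNA_ALPHABET) >= threshold:
        return "DNA"
    if fraction_valid_kmers(sequence, ksize, PROTEIN_ALPHABET) >= threshold:
        return "protein"
    return "unknown"
```

A short, low-complexity protein made up only of A, C, G, and T residues would still be misclassified as DNA here, which is the robustness concern raised in the third bullet.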

#1277 changed the Python layer so that ksize for protein was "correct" (the actual length of the word, not k*3!). This still needs to be changed at the Rust layer, though, which would involve changing the JSON signature formats and version.
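
To make the mismatch concrete, here is a minimal sketch of the conversion between the two conventions described above: the protein word length exposed by the Python layer and the k*3 value that the Rust layer still uses. The helper names are hypothetical and are not part of the sourmash API.

```python
# Hypothetical helpers illustrating the two ksize conventions; not sourmash API.

def to_stored_ksize(protein_ksize):
    """Convert a protein word length (amino acids) to the k*3 convention."""
    if protein_ksize <= 0:
        raise ValueError("protein ksize must be positive")
    return protein_ksize * 3


def from_stored_ksize(stored_ksize):
    """Convert a k*3 value back to the protein word length."""
    if stored_ksize <= 0 or stored_ksize % 3 != 0:
        raise ValueError("stored ksize must be a positive multiple of 3")
    return stored_ksize // 3
```

Under this sketch a protein ksize of 10 corresponds to a stored value of 30, and a stored value that is not a multiple of 3 is rejected rather than silently rounded.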

Also see "Next steps for sourmash sketch" #1169.

ctb commented May 15, 2021

Here are some notes I took while working through #1159, copied over from #999 (comment).

Changes to signature computation

Suggested changes to signature computation:

Signature JSON format changes

What signature format changes should we do in tandem? see #268 for rollup issue

Changes to MinHash API

general issue here, #338
more here, #720
and more here, #885, although that is mostly about docs and tests now.

ctb changed the title from "further improvements to protein handling in sourmash" to "summary: further improvements to protein handling in sourmash" on May 20, 2021