"sourmash sketch translate -p k=7,dayhoff" does not respect k in sourmash v4 #1383

Open
phiweger opened this issue Mar 10, 2021 · 6 comments
Labels
doc (documentation content or issues), faq (things to add to an FAQ or docs)

Comments

@phiweger

I want to sketch a genome to get 7-mers of amino acids (i.e., peptides), so I gave the new CLI a spin:

sourmash sketch translate -p k=7,k=10,scaled=1000,dayhoff genome.fasta

However, looking into the resulting signature, I suspect that the params are not applied:

... "signatures":[{"num":0,"ksize":21,"seed":42,"max_hash" ...
@phiweger
Author

Ah! The ksize in the signature is three times the ksize I specified. What I still don't understand (having read the v4 migration guide and the sketch documentation): I thought the protein-hashing commands would hash, well, proteins, not nucleotide k-mers. I'm sure I'm missing something here, so thank you for your help!

@phiweger
Author

Hm. And sourmash index -k 7 ... does load the corresponding signature. I'm confused.

@bluegenes
Contributor

@phiweger If you run sourmash sig describe on this signature, do you see k=7?

If I recall correctly, the decision was made to use amino-acid k-mer sizes in all command-line and Python interfaces, but to keep storing the k-mer size as 3*k within the signature files themselves, in order to maintain compatibility with existing signatures.
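As a rough illustration of the relationship described above (not sourmash's own code), the conversion from the stored ksize to the user-facing one can be sketched in a few lines. The function name `displayed_ksize` and the rule "divide by 3 for anything that is not DNA" are assumptions based on this thread and on the title of #1277.

```python
def displayed_ksize(stored_ksize: int, moltype: str) -> int:
    """Convert the ksize stored in a signature file to the user-facing ksize.

    Hypothetical helper: assumes, as described in this thread, that non-DNA
    sketches store 3 * k while interfaces report k.
    """
    if moltype.upper() == "DNA":
        # DNA sketches: stored ksize and displayed ksize are identical.
        return stored_ksize
    if stored_ksize % 3 != 0:
        raise ValueError("stored non-DNA ksize should be a multiple of 3")
    # Protein-like sketches (e.g. protein, dayhoff): divide by 3.
    return stored_ksize // 3


print(displayed_ksize(21, "dayhoff"))
print(displayed_ksize(21, "DNA"))
```

With the values from this issue, a stored ksize of 21 on a dayhoff sketch corresponds to the k=7 that was requested on the command line.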

@ctb
Contributor

ctb commented Mar 11, 2021

Also, sketch translate turns 21-mers of DNA into 7-mers of aa and hashes those. It looks like this is confusing no matter how we present it, so we opted for describing the output rather than the input, and we will transition to having k=7 in the JSON file ...soon :).

will link to relevant issues in a bit, just for posterity!
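To make the 21-mer to 7-mer relationship concrete, here is a minimal sketch in plain Python. It is not sourmash's implementation: it translates a single forward reading frame only, and the helper names (`translate`, `to_dayhoff`) are hypothetical. The codon table is the standard genetic code, and the six-letter Dayhoff grouping used here is the commonly cited one, assumed rather than taken from this thread.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumed Dayhoff grouping: each amino acid maps to one of six classes.
DAYHOFF_GROUPS = {"a": "C", "b": "AGPST", "c": "DENQ",
                  "d": "HKR", "e": "ILMV", "f": "FWY"}
DAYHOFF = {aa: code for code, group in DAYHOFF_GROUPS.items() for aa in group}


def translate(dna: str) -> str:
    """Translate DNA in the forward frame; length must be a multiple of 3."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def to_dayhoff(peptide: str) -> str:
    """Re-encode a peptide in the Dayhoff alphabet (stop codons not handled)."""
    return "".join(DAYHOFF[aa] for aa in peptide)


dna_kmer = "ATGGCTAAATGCTTTGATCAT"   # 21 nucleotides
peptide = translate(dna_kmer)        # 7 amino acids
print(len(dna_kmer), peptide, to_dayhoff(peptide))
```

The point of the example is only the length bookkeeping: with k=7 requested, each hashed k-mer covers 7 residues, which in turn span 21 bases of the input DNA.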

@phiweger
Author

yes @bluegenes sourmash sig describe says k=7 :) thx all for the explanations!

@ctb
Contributor

ctb commented Mar 11, 2021

A few historical notes -

  • sourmash sketch dna does exactly what you expect, ksize=ksize
  • sourmash sketch protein does exactly what you want: the visible ksize is the ksize you asked for. It's just the internal ksize storage that's wonky for the moment, because we didn't want to update the signature format yet! See "[MRG] Divide non-DNA MinHash ksize by 3 for external consumption" (#1277) for the rationale and more links. Note that in sourmash < 4, we confusingly always divided protein ksizes by 3, so you'd get nonintuitive output from sig describe and in the JSON file and... This was the motivating concern for the change, because @bluegenes started working more with protein k-mers and was wondering why she had to set ksize=30 to get aa ksize=10 😆
  • sourmash sketch translate has no obviously correct behavior: either you specify the DNA ksize, and the output signature has that ksize but is actually working with ksize/3 amino acids, OR you specify the protein ksize, and the input ksize is effectively 3x that. Since we wanted sketch translate signatures to be compatible with sketch protein signatures, it seemed easiest to have translate produce signatures with the "correct" protein ksize.
  • compute worked the way it did before (protein ksize*3) because I implemented DNA first, then translate, and then protein, so it wasn't clear how borked the protein ksize decision was until too late!
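The notes above can be summarized as a small lookup, sketched here in Python. This is only a restatement of the thread under stated assumptions, not sourmash code: `sketch_ksizes` and `legacy_compute_ksize` are hypothetical names, and the "stored" values reflect the 3*k storage described for sourmash v4 at the time of this issue.

```python
from typing import Optional, TypedDict


class KsizeView(TypedDict):
    unit: str                  # what the requested k counts
    dna_window: Optional[int]  # bases of input DNA per k-mer, if input is DNA
    stored_ksize: int          # value written to the signature file


def sketch_ksizes(command: str, k: int) -> KsizeView:
    """Map `sourmash sketch <command> -p k=<k>` to the ksizes involved."""
    if command == "dna":
        return {"unit": "nucleotides", "dna_window": k, "stored_ksize": k}
    if command == "protein":
        # Input is already protein, so there is no DNA window.
        return {"unit": "amino acids", "dna_window": None, "stored_ksize": 3 * k}
    if command == "translate":
        # k is the protein ksize; the effective input ksize is 3x that.
        return {"unit": "amino acids", "dna_window": 3 * k, "stored_ksize": 3 * k}
    raise ValueError(f"unknown sketch command: {command}")


def legacy_compute_ksize(aa_k: int) -> int:
    """ksize one had to pass in sourmash < 4 to get aa_k amino acids."""
    return 3 * aa_k


print(sketch_ksizes("translate", 7))
print(legacy_compute_ksize(10))
```

Because translate and protein produce the same stored ksize for the same requested k, their signatures remain comparable, which is the compatibility goal mentioned above.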

These are some good FAQ entries so I'll put them there, and I'll keep this issue open 'til we update the docs appropriately!

@ctb added the doc (documentation content or issues) and faq (things to add to an FAQ or docs) labels on Mar 11, 2021