add DNA/protein check on first 100 bp of FASTA files #1037

Open: 1 commit to merge into base: master
ctb commented Jun 21, 2020

One of the most dangerous flaws in our protein hash calculations is that sourmash doesn't check the sequence type: you need to explicitly specify --input-is-protein for amino acid inputs, and nothing verifies that the flag matches the input, e.g. see #999 (comment).

En route to bigger changes in the way we do things per #999, this adds checks to compute that verify --input-is-protein matches the input, based on the first 100 bp of each FASTA file.

This does NOT deal with API-level issues like add_sequence and add_protein; it only covers command-line signature computation (sourmash compute).
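
For readers who want a feel for the kind of check described above, here is a minimal sketch of a DNA-vs-protein guess on the first 100 characters of a sequence. It is not the code in this PR: the function name looks_like_dna, the ACGTN alphabet, and the 90% threshold are all assumptions made for illustration.

```python
# Hypothetical helper, not taken from sourmash: guess the sequence type
# from a short prefix of the record.
DNA_LETTERS = frozenset("ACGTN")  # assumed nucleotide alphabet


def looks_like_dna(sequence, n_check=100, min_fraction=0.9):
    """Return True if the first n_check characters are mostly DNA letters.

    n_check and min_fraction are illustrative defaults, not values
    confirmed by the PR.
    """
    prefix = sequence[:n_check].upper()
    if not prefix:
        raise ValueError("cannot classify an empty sequence")
    # Count characters that belong to the assumed nucleotide alphabet.
    n_dna = sum(1 for ch in prefix if ch in DNA_LETTERS)
    return n_dna / len(prefix) >= min_fraction
```

A heuristic like this can be fooled by a protein whose first residues happen to be only A, C, G, T, or N, which is one reason to treat it as a sanity check rather than a guarantee.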

Example output

% sourmash compute -k 31 GCA_001593925.1_ASM159392v1_genomic.fna.gz -o xyz.sig --input-is-protein 

...

WARNING: input is protein, turning off nucleotide hashing
...
... reading sequences from GCA_001593925.1_ASM159392v1_genomic.fna.gz
** ERROR: for filename GCA_001593925.1_ASM159392v1_genomic.fna.gz,
** ERROR: this looks like DNA sequence, but you're using --input-is-protein

and

% sourmash compute -k 31 GCA_001593925.1_ASM159392v1_protein.faa.gz -o xyz.sig 
...

computing signatures for files: GCA_001593925.1_ASM159392v1_protein.faa.gz
...
... reading sequences from GCA_001593925.1_ASM159392v1_protein.faa.gz
** ERROR: for filename GCA_001593925.1_ASM159392v1_protein.faa.gz,
** ERROR: this looks like protein sequence; use --input-is-protein
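
The two failure cases shown above can be summarized as a small decision function. This is a sketch only: type_mismatch_errors and its parameters are hypothetical names, and the real compute command may structure this differently. Only the message text is copied from the example output.

```python
# Hypothetical sketch: map (detected type, --input-is-protein flag) to
# the error lines shown in the example output. Not sourmash's own code.
def type_mismatch_errors(filename, looks_like_dna, input_is_protein):
    """Return the error lines for a type mismatch, or [] if consistent."""
    if looks_like_dna and input_is_protein:
        problem = ("this looks like DNA sequence, "
                   "but you're using --input-is-protein")
    elif not looks_like_dna and not input_is_protein:
        problem = "this looks like protein sequence; use --input-is-protein"
    else:
        # Detected type and flag agree, so there is nothing to report.
        return []
    return [
        f"** ERROR: for filename {filename},",
        f"** ERROR: {problem}",
    ]
```

Returning the lines instead of printing them keeps the decision testable; a caller could print them and exit with a non-zero status.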

  • Is it mergeable?
  • make test: did it pass the tests?
  • make coverage: is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?


codecov bot commented Jun 21, 2020

Codecov Report

Merging #1037 into master will decrease coverage by 0.09%.
The diff coverage is 80.48%.


@@            Coverage Diff             @@
##           master    #1037      +/-   ##
==========================================
- Coverage   92.37%   92.27%   -0.10%     
==========================================
  Files          72       72              
  Lines        5454     5492      +38     
==========================================
+ Hits         5038     5068      +30     
- Misses        416      424       +8     

Impacted Files | Coverage Δ
sourmash/command_compute.py | 95.20% <80.48%> (-2.44%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 97d35e6...4ee4143.
