Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do bounds checking on --scaled and --num via command line #1243

Closed
ctb opened this issue Nov 24, 2020 · 17 comments · Fixed by #1688
Closed

do bounds checking on --scaled and --num via command line #1243

ctb opened this issue Nov 24, 2020 · 17 comments · Fixed by #1688

Comments

@ctb
Copy link
Contributor

ctb commented Nov 24, 2020

e.g. --scaled 0 works => produces a num sketch, which is technically ok but...

@ctb
Copy link
Contributor Author

ctb commented Feb 4, 2021

Update w/explanations after labeling with a good-first-issue tag -

  • there are lots of places in the command-line interface code in src/sourmash/commands.py where we take in -n or --scaled to set num/scaled values. see these docs for high level explanation, and API docs for some details.
  • a good first effort on this would be to start putting in place bounds checking on some of them, as well as adding tests to ensure that the bounds checking is working.
  • num values should always be > 0. if they are unreasonable, say > 50,000 or < 50, we should emit a warning.
  • scaled values should always be > 0. if they are unreasonable, say > 1e6 or < 100, we should emit a warning.

suggest starting with sourmash sketch commands, implemented in src/sourmash/command_sketch.py.

@keyabarve
Copy link
Contributor

I would like to work on this. @ctb

@ctb
Copy link
Contributor Author

ctb commented May 24, 2021

go for it!

@keyabarve
Copy link
Contributor

@ctb I can't seem to find any instances of -n or --scaled in src/sourmash/commands.py. Could you specify some lines where I can find these?

@ctb
Copy link
Contributor Author

ctb commented May 25, 2021

ahh, yes! look for args.scaled; the argparse machinery that exists under src/sourmash/cli/ converts --scaled and --num into args.num and args.scaled.

I think there is some information on which commands accept --scaled in the command line docs, and/or in tests/test_sourmash.py, too.

git grep -- --scaled will also help you find places in the codebase where that term exists.

@keyabarve
Copy link
Contributor

How can I test the changes that I make in src/sourmash/commands.py? What command should I run to get the values of the print statements, etc.?

@ctb
Copy link
Contributor Author

ctb commented May 25, 2021

essentially all of the commands in that file are top-level command-line commands. So, find the function name, go look up what the function does in the docs, and run it that way :).

If I forget exactly what a command does, sometimes I go find a test in tests/test_sourmash.py that exercises that command and run it the same way at the command line.

@keyabarve
Copy link
Contributor

Oh okay. I'll check it out. Thanks!

@ctb
Copy link
Contributor Author

ctb commented May 28, 2021

hi @keyabarve, you asked on slack about where to get started - here's a command that accepts --scaled values that are absurd.

% sourmash gather tests/test-data/63.fa.sig tests/test-data/47.fa.sig  --scaled=1e9

Instead of just proceeding, this command should do what sourmash sketch does and emit a warning:

% sourmash sketch dna -p scaled=1000000000 data/GCF_000005845.2_ASM584v2_genomic.fna

== This is sourmash version 4.1.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

WARNING: scaled value of 1000000000 is nonsensical!?

@keyabarve
Copy link
Contributor

Okay, I will try it out!

@keyabarve
Copy link
Contributor

@ctb I went through the --num code and was wondering where to write the function to check the bounds.
For --scaled, I wrote the function in utils.py and found the subparser.add_argument code in gather.py, but for --num I can’t seem to find the code.

@ctb
Copy link
Contributor Author

ctb commented Jul 20, 2021

what do you find when you use

git grep -- --num

in the git repo? If you list out all place where you find it, and annotate them as relevant/not relevant to this issue, I can provide feedback there.

In terms of how to implement it, I think you can use a similar approach as to what you did with --scaled, no? At any rate that's what I'd suggest starting with, and then modifying it where it doesn't work. I'd be happy to comment at that point.

@keyabarve
Copy link
Contributor

keyabarve commented Jul 21, 2021

what do you find when you use

git grep -- --num

in the git repo? If you list out all place where you find it, and annotate them as relevant/not relevant to this issue, I can provide feedback there.

This is the output:

doc/command-line.md:sourmash signature downsample --num 500 scaled_file.sig -o downsampled.sig
src/sourmash/cli/compute.py: * `-n/--num <int>` or `--scaled <int>`: set size or resolution of sketches
src/sourmash/cli/compute.py:        '-n', '--num-hashes', type=int, default=500,
src/sourmash/cli/gather.py:        '-n', '--num-results', default=None, type=int, metavar='N',
src/sourmash/cli/search.py:        '-n', '--num-results', default=3, type=int, metavar='N',
src/sourmash/cli/sig/downsample.py:        '--num', metavar='N', type=int, default=0,
src/sourmash/cli/watch.py:        '-n', '--num-hashes', type=int, default=500,
src/sourmash/commands.py:        print_results(f'(truncated gather because --num-results={args.num_results})')
src/sourmash/sig/__main__.py:        error('ERROR: must specify either --num or --scaled value')
src/sourmash/sig/__main__.py:        error('ERROR: cannot specify both --num and --scaled')
tests/test_cmd_signature.py:    c.run_sourmash('sig', 'downsample', '--num', '500', sig47)
tests/test_cmd_signature.py:        c.run_sourmash('sig', 'downsample', '--num', '50000', sig47)
tests/test_cmd_signature.py:    c.run_sourmash('sig', 'downsample', '--num', '500',
tests/test_cmd_signature.py:    # cannot specify both --num and --scaled
tests/test_cmd_signature.py:        c.run_sourmash('sig', 'downsample', '--scaled', '100', '--num', '50',
tests/test_sourmash.py:    cmd = 'gather {} gcf_all -k 21 --num-results 10'.format(query_sig)
tests/test_sourmash.py:    assert '(truncated gather because --num-results=10)' in out

I think the following are the relevant ones to this issue:

src/sourmash/cli/compute.py: * `-n/--num <int>` or `--scaled <int>`: set size or resolution of sketches
src/sourmash/cli/compute.py:        '-n', '--num-hashes', type=int, default=500,
src/sourmash/cli/gather.py:        '-n', '--num-results', default=None, type=int, metavar='N',
src/sourmash/cli/search.py:        '-n', '--num-results', default=3, type=int, metavar='N',
src/sourmash/cli/sig/downsample.py:        '--num', metavar='N', type=int, default=0,
src/sourmash/cli/watch.py:        '-n', '--num-hashes', type=int, default=500,
src/sourmash/commands.py:        print_results(f'(truncated gather because --num-results={args.num_results})')

In terms of how to implement it, I think you can use a similar approach as to what you did with --scaled, no? At any rate that's what I'd suggest starting with, and then modifying it where it doesn't work. I'd be happy to comment at that point.

Alright! The only issue is, I couldn't find any subparser.add_argument code in gather.py for --num like there was for --scaled, so I'm not sure what to use as reference.

@ctb
Copy link
Contributor Author

ctb commented Jul 21, 2021

The --num-results in gather and search are about display, not MinHash configuration. So it looks like it's just compute, downsample, and watch. It's also supplied in sourmash sketch via the -p num= parameter, but that can't be checked with argparse, I don't think; you'll have to add separate code there.

Alright! The only issue is, I couldn't find any subparser.add_argument code in gather.py for --num like there was for --scaled, so I'm not sure what to use as reference.

What about the places you just listed above? Aren't those subparser.add_argument calls?

@keyabarve
Copy link
Contributor

Yes, they are, but they're divided into --num-results, --num-hashes, --num, etc.

@ctb
Copy link
Contributor Author

ctb commented Jul 27, 2021

Please update the --num arguments in compute, downsample, and watch. Try using the downsample argparse code as a reference.

@keyabarve
Copy link
Contributor

keyabarve commented Jul 27, 2021

Okay, I'll do that. Should I also use the things that I did for --scaled as a reference and write the function in utils.py?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants