Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add long options to samtools view #1442

Merged
merged 4 commits into from
Jul 1, 2021
Merged

Conversation

jmarshall
Copy link
Member

[Another PR that's been gestating for rather too long…]

This adds long option equivalents for all of view's command-line options. The added expressiveness of the long names I think clarifies some of the usage explanations (while staying within 80 characters!).

It also adds ‑‑region-file and ‑‑regions-file as synonyms for the -M -L combination. Along with the ‑‑target-file/‑‑targets-file long forms for plain -L, this means samtools view has the same long options for jump-to-region and stream-filtering-by-region as bcftools. And hopefully highlighting ‑‑region-file as an option will help raise the visibility of -M.

Also -s subsampling is split into two long options, thus finally fixing #217.

Proposed usage display:

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Output options:
  -b, --bam                  Output BAM
  -C, --cram                 Output CRAM (requires -T)
  -1, --fast                 Use fast BAM compression (implies --bam)
  -u, --uncompressed         Uncompressed BAM output (implies --bam)
  -h, --with-header          Include header in SAM output
  -H, --header-only          Print SAM header only (no alignments)
      --no-header            Print SAM alignment records only [default]
  -c, --count                Print only the count of matching records
  -o, --output FILE          Write output to FILE [standard output]
  -U, --unoutput FILE, --output-unselected FILE
                             Output reads not selected by filters to FILE
Input options:
  -t, --fai-reference FILE   FILE listing reference names and lengths
  -M, --use-index            Use index and multi-region iterator for regions
      --region[s]-file FILE  Use index to include only reads overlapping FILE
  -X, --customized-index     Expect extra index file argument after <in.bam>

Filtering options (Only include in output reads that...):
  -L, --target[s]-file FILE  ...overlap (BED) regions in FILE
  -r, --read-group STR       ...are in read group STR
  -R, --read-group-file FILE ...are in a read group listed in FILE
  -N, --qname-file FILE      ...whose read name is listed in FILE
  -d, --tag STR1[:STR2]      ...have a tag STR1 (with associated value STR2)
  -D, --tag-file STR:FILE    ...have a tag STR whose value is listed in FILE
  -q, --min-MQ INT           ...have mapping quality >= INT
  -l, --library STR          ...are in library STR
  -m, --min-qlen INT         ...cover >= INT query bases (as measured via CIGAR)
  -e, --expr STR             ...match the filter expression STR
  -f, --all-flags INT        ...have all of the FLAGs in INT present
  -F, --excl[ude]-flags INT  ...have none of the FLAGs in INT present
  -G, --no-all-flags INT     EXCLUDE reads with all of the FLAGs in INT present
      --subsample FLOAT      Keep only FLOAT fraction of templates/read pairs
      --subsample-seed INT   Influence WHICH reads are kept in subsampling [0]
  -s INT.FRAC                Same as --subsample 0.FRAC --subsample-seed INT

Processing options:
  -x, --remove-tag STR       Strip tag STR from reads (option may be repeated)
  -B, --remove-B             Collapse the backward CIGAR operation

General options:
  -?, --help   Print long help, including note about region specification
  -S           Ignored (input format is auto-detected)
      --no-PG  Do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --write-index
               Automatically index the output files [off]
      --verbosity INT
               Set level of verbosity

Notes:

1. This command now auto-detects the input format (BAM/CRAM/SAM).
   Further control over the CRAM format can be specified by using the
   --output-fmt-option, e.g. to specify the number of sequences per slice
   and to use avoid reference based compression:

	samtools view -C --output-fmt-option seqs_per_slice=5000 \
	   --output-fmt-option no_ref -o out.cram in.bam

   Options can also be specified as a comma separated list within the
   --output-fmt value too.  For example this is equivalent to the above

	samtools view --output-fmt cram,seqs_per_slice=5000,no_ref \
	   -o out.cram in.bam

2. The file supplied with `-t' is SPACE/TAB delimited with the first
   two fields of each line consisting of the reference name and the
   corresponding sequence length. The `.fai' file generated by 
   `samtools faidx' is suitable for use as this file. This may be an
   empty file if reads are unaligned.

3. SAM->BAM conversion:  samtools view -bT ref.fa in.sam.gz

4. BAM->SAM conversion:  samtools view -h in.bam

5. A region should be presented in one of the following formats:
   `chr1', `chr2:1,000' and `chr3:1000-2,000'. When a region is
   specified, the input alignment file must be a sorted and indexed
   alignment (BAM/CRAM) file.

6. Option `-u' is preferred over `-b' when the output is piped to
   another samtools command.

7. Option `-M`/`--use-index` causes overlaps with `-L` BED file regions and
   command-line region arguments to be computed using the multi-region iterator
   and an index. This increases speed, omits duplicates, and outputs the reads
   as they are ordered in the input SAM/BAM/CRAM file.

8. Options `-L`/`--target[s]-file` and `--region[s]-file` may not be used
   together. `--region[s]-file FILE` is simply equivalent to `-M -L FILE`,
   so using both causes one of the specified BED files to be ignored.

@daviesrob
Copy link
Member

daviesrob commented Jun 8, 2021

Looks good in the main. I'm not sure if I like --all-flags and --no-all-flags though. Would --require-flags and --avoid-flags be better?

@jmarshall
Copy link
Member Author

I was pretty unimpressed that I couldn't come up with anything better than --no-all-flags!

--require-flags is probably an improvement, but while I like --avoid-flags better than --no-all-flags I'm not sure it conveys the “all“ aspect of the test — avoid sounds like “avoid any of these” which would make it a synonym of --exclude-flags (I think).

Let's all mull those ones over for a while…

@jmarshall
Copy link
Member Author

jmarshall commented Jun 16, 2021

I think --require-flags is better, and I've realised how to describe -G positively and clearly (obvious in retrospect):

Filtering options (Only include in output reads that...):
[…]
  -f, --require-flags INT    ...have all of the FLAGs in INT present
  -F, --excl[ude]-flags INT  ...have none of the FLAGs in INT present
  -G, --something-flags INT  ...have any of the FLAGs in INT not present

Hmmm… what for that last something…
--some-unset-flags? --absent-flags? --unset-flags? --any-unset-flags?
Frankly it's pretty hard to do better than --not-all-flags

There is something to be said for the symmetry of --all-flags vs --no[t]-all-flags

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 16, 2021

It's hard! The double negative in the language is bad enough: "Only include in output reads that have any of the FLAGs in INT not present". I had to think hard for a while. Firstly "not present" is better expressed as "absent", and secondly the double negative may be better expressed as a positive. I think it's the difference in language that is (flag & i) != i vs ~flag & i in code.

If we got rid of that fixed sentence start, we may write:

  • -f: include reads with all of the FLAGs in INT present
  • -F: exclude reads with any of the FLAG in INT present
  • -G: exclude reads with all of the FLAGs in INT present

If we wish to start everything with the same sentence, then maybe flip the initial sentence to filter-out instead of filter-in, as that feels more natural to "filter" to me anyway. Ie

"Filtering options (exclude from output reads that ...)"

  • -f: have any of the FLAGs in INT absent
  • -F: have any of the FLAGs in INT present
  • -G have all of the FLAGs in INT present

-f, --require-flags is still an appropriate long option, but maybe --require-all-flags, --exclude-any-flags and --exclude-all-flags is more explicit and covers the "something" case better.

@jmarshall
Copy link
Member Author

Firstly "not present" is better expressed as "absent", and secondly the double negative may be better expressed as a positive.

I had it as “absent” but felt that it was too easy for the eye to skip over the half-a-word difference from the other lines without noticing. I'll call this a vote (and I'm inclined to agree) for the status quo:

Filtering options (Only include in output reads that...):
[…10 other filtering options…]
  -f, --require-flags INT    ...have all of the FLAGs in INT present
  -F, --excl[ude]-flags INT  ...have none of the FLAGs in INT present
  -G, --something-flags INT  EXCLUDE reads with all of the FLAGs in INT present

--exclude-any/all-flags is not bad, but --excl[ude]-flags is so-named because it is the same as pileup's --excl-flags option. Hmmm…

@jkbonfield
Copy link
Contributor

Compatibility with mpileup is good, but I note it has also reversed the filtering language to describe what to skip, rather than what to keep. Also I wouldn't be averse to changing mpileup's description if we come up with something better, but changing long-opts is obviously not good unless we add aliases. That does mean of course --excl-flags, --exclude-flags and --exclude-any-flags could all be synonyms if we wished.

It just feels odd that we have a long-opt --exclude-flags with a description of "only include in output reads that have none of the FLAGs in INT present". But then again maybe phrasing it in both directions means people can consider it from either side of the fence and go with what works for their brain.

@jmarshall
Copy link
Member Author

That phrasing was introduced by you in 8b779df.

The samtools view usage has always phrased these filter options in terms of what to keep. This PR doesn't change the language much, it mostly just factors “only include reads” out of every line into the subheading. There are 13 filtering options here, and IMHO the keep phrasing is overall definitely the right one — e.g. consider this 🤮 for -d:

-d, --tag STR1[:STR2] Skip reads that do not have a tag STR1 (or that do but its value is not STR2)

That mpileup's usage describes similar options in terms of skipping is a separate minor niggle.

Anyway, that mostly unchanged usage phrasing is not the major problem here. The thing that's holding this PR up is what words to choose for --require-flags and --not-all-flags / --exclude-all-flags / something.

@jkbonfield

This comment has been minimized.

@jkbonfield
Copy link
Contributor

Anyway, that mostly unchanged usage phrasing is not the major problem here. The thing that's holding this PR up is what words to choose for --require-flags and --not-all-flags / --exclude-all-flags / something.

My choice would still be for --exclude-all-flags as it matches the description the best.

@jmarshall
Copy link
Member Author

Rebased now that #1441 has been merged. Still no final epiphany on the last couple of long options, and the new --add-flag/--remove-flag haven't been adjusted to match the other --foo-flags option names' plurality (but I did fix a couple of man page typos there).

@whitwham
Copy link
Contributor

Just to get things moving, how about --reject-all-flags?

@daviesrob
Copy link
Member

... or we could remove the long opt for -G for now, so we can merge the rest and go back to fix it when inspiration hits ...

@jmarshall
Copy link
Member Author

Yep, happy to do that just to get this moving. With -f/--require-flags and just -G for now? And should the new --add/remove-flag be pluralised?

@whitwham
Copy link
Contributor

And should the new --add/remove-flag be pluralised?

Yes, probably.

@daviesrob
Copy link
Member

Actually, it may be best to leave all of -f, -F and -G for now, as I see we currently have:

amplicon_stats.c:        {"flag-require", required_argument, NULL, 'f'},
amplicon_stats.c:        {"flag-filter", required_argument, NULL, 'F'},
bam_plcmd.c:        {"rf", required_argument, NULL, 1},   // require flag
bam_plcmd.c:        {"ff", required_argument, NULL, 2},   // filter flag
bam_plcmd.c:        {"incl-flags", required_argument, NULL, 1},
bam_plcmd.c:        {"excl-flags", required_argument, NULL, 2},
coverage.c:        {"rf", required_argument, NULL, 1}, // require flag
coverage.c:        {"ff", required_argument, NULL, 2}, // filter flag
coverage.c:        {"incl-flags", required_argument, NULL, 1}, // require flag
coverage.c:        {"excl-flags", required_argument, NULL, 2}, // filter flag
stats.c:        {"required-flag", required_argument, NULL, 'f'},
stats.c:        {"filtering-flag", required_argument, NULL, 'F'},

so it's probably best to avoid adding another variant for now. We can make a follow-up pull request to try to unify them, as much as we can.

@jmarshall
Copy link
Member Author

jmarshall commented Jun 30, 2021

I think the situation is not as irregular as all that:

Skip when: (FLAG & N) != N (FLAG & N) != 0 (FLAG & N) == 0 (FLAG & N) == N
ampliconstats -f/--flag-require -F/--flag-filter
coverage --ff/--excl-flags --rf/--incl-flags
depth -G
mpileup --ff/--excl-flags --rf/--incl-flags
stats -f/--required-flag -F/--filtering-flag
view -f/--require-flags1 -F/--excl-flags --rf/--incl[ude]-flags2 -G

I had previously categorised only coverage and mpileup; adding the other two into the table shows:

  • Using require as in your --require-flags suggestion is clearly the right choice
  • -G is a weird one-off so we should be comfortable in ignoring it for long-option purposes
  • filter[ing] is not a great choice (as it implies nothing about the test's meaning), so following --excl[ude] is better

So I think adding --require-flags and --excl[ude]-flags (but no long form for -G) is a step towards what a desired unification would look like.

Need to decide whether to add them as verbs (‑‑require-flags ‑‑excl[ude]-flags) or adjectives (‑‑required-flags ‑‑excl[uded]-flags) or just implement both. The mpileup options were originally abbreviations of the verbs (see e9ea1ba) but in a way using the adjectival form à la stats is a useful distinction from the modifying verbs of ‑‑add/remove-flag[s].

(I can never remember which option I want to use and have to look it up each time. What I would really like to have for this is to come up with some [+-=].* syntax something like chmod(1) symbolic permissions so that all these options could be unified into a single -f/--flags option — at which point these legacy options would become somewhat moot.)

[Aug 2022 due to #1702: Edited to add the hilariously irregular depth, for which -G is --excl-flags (starting from a non-empty default set) and -g is not a separate filter but effectively --no-excl-flags as it removes the specified flags from that set.]

Footnotes

  1. Proposed view additions shown in code.

  2. Since added in PR Add view --rf option #1508.

@jmarshall
Copy link
Member Author

jmarshall commented Jun 30, 2021

Reading over the existing filtering long options in other subcommands, I think actually the simpler verb form for these is generally the way to go. So for your consideration, I've pushed a version of this that adds ‑‑require-flags for -f, ‑‑excl[ude]-flags for -F, no long option for -G, and pluralises the ‑‑add-flags/‑‑remove-flags options.

Hopefully this may be suitable for merging and then we can try to approach cross-subcommand nirvana later 
😄

(Note that GNU getopt_long and probably other implementations accept an unambiguous prefix of long options, so all these options can in fact be spelt as either --add-flag or --add-flags as the user prefers… though they shouldn't really do that in scripts unless they like living on the edge!)

@daviesrob
Copy link
Member

I guess we can live with --require-flags and sort out the the other subcommands later. incidentally, did you forget to move it to its correct position in the alphabetically-ordered lopts[] array?

@jmarshall
Copy link
Member Author

incidentally, did you forget to move it to its correct position in the alphabetically-ordered lopts[] array?

D'oh. Thanks.

Categorise view options in usage, similarly to mpileup's usage.
Add long options to man page.

Reword -m description in usage based on man page and implementation.

Add new samtools view --no-header option to restore the default
behaviour after -h/-H and for symmetry with bcftools view.
Add region query long options with similar names to those used by
bcftools. --target[s]-file is an alias for -L, and for simplicity
implement --region[s]-file as equivalent to -M -L FILE.

Document accordingly, and move the existing -M description to a
new -? note.
Now that there is a way to set a fraction without simultaneously
setting the seed, mention the default seed ("[0]").

Fixes samtools#217.
Make these recently added option names follow the same convention
as the new --require/--excl[ude]-flags options.
@whitwham whitwham merged commit b2f23ed into samtools:develop Jul 1, 2021
@jmarshall jmarshall deleted the long-view branch July 1, 2021 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants