Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools mpileup #414

Merged
merged 22 commits into from
Aug 5, 2016
Merged

bcftools mpileup #414

merged 22 commits into from
Aug 5, 2016

Conversation

mcshane
Copy link
Contributor

@mcshane mcshane commented Apr 26, 2016

Bring mpileup over to bcftools. The motivation being that the vcf/bcf generation by mpileup is tightly coupled to the calling process in bcftools call. Having these in separate tools/repos made mistakes with mismatched versions too likely and hampered development as release of new features often needed to happen in both samtools and bcftools simultaneously.

See the individual commit messages for more information.

@mcshane
Copy link
Contributor Author

mcshane commented Apr 26, 2016

This is the beginning of bringing @pd3 bcftools mpileup changes over. Have started from scratch by bringing over the mpileup related files from current samtools develop and adding changes on top of that (current state is essentially a redo of pd3/bcftools@d7a93f5 and the non-gvcf part of pd3/bcftools@cf3219c).

Before bringing in the new things it would be good if people see if everything looks okay.

Candidates for removal:

  1. -g/--BCF, -v/--VCF, -u/--uncompressed could all be dropped. Argument for keeping for the time being would be that users could simply change samtools <-> bcftools and commands would still work although with a warning to start using the -O type option instead.
  2. -D, -V, -S have been deprecated since samtools 1.0, so probably time to remove them.

Candidates for updates to default:

  1. The default for --excl-flags is [UNMAP,SECONDARY,QCFAIL,DUP] at the moment. @pd3 has added SUPPLEMENTARY to that list in his fork. Arguments for and against that?
  2. --bam-list long option. Bother to rename, since we now take CRAM too? Silently allow --bam-list, but advertise something else, like --aln-list
  3. -C internally we always set this to -C50. No idea why.
  4. @jkbonfield spent some time looking at the -d depth option in samtools. Any reason to change?
  5. others?

There are basically 3 features on the @pd3 fork to be added in after this.

  1. --gvcf: mpileup gVCF creation. If we remove the old -g option, that could be used as the short option for this. This will essentially be pd3/bcftools@cf3219c and the gvcf part of pd3/bcftools@ee8210d
  2. --samples: include only selected samples. If we remove the old -s option, that could be used as the short option for this (pd3/bcftools@cf5c354). We talked about adding a feature to map readgroups to new sample names (Feature request: mpileup rename samples (@RG SM:) on fly samtools#324) on the fly. Keep in mind when bringing this over.
  3. --regions/--regions-file: To allow index jumping to multiple regions. e.g. all the non-chromosomal contigs. -r option can be repurposed for --region/--regions, but -R that is used in other bcftools commands for --regions-file is occupied by -R/--ignore-RG at the moment. This feature is tied up with regidx additions in htslib. pd3/bcftools@cfd7cf9

putc(c, fp);
} else putc(p->is_refskip? (bam_is_rev(p->b)? '<' : '>') : '*', fp);
if (p->indel > 0) {
putc('+', fp); printw(p->indel, fp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also remove printw(), which is only used here in pileup_seq().

@jkbonfield
Copy link
Contributor

My thoughts.

In general, unless it can be kept 100% identical (which for good reasons it cannot) then we should make a clean break and feel free to change whatever we wish. Our end goal should be a command within bcftools that is internally consistent with other bcftools commands. It's preferable for this to happen in one step rather than have two releases with changes in.

The caveat here is it obviously needs careful and explicit documentation outlning what has changed and what the mapping from old to new (if existing) is.

  • I have no issues with removal of -g and -v if there is a more canonical way to do this. Ideally with the move to bcftools it should follow any conventions put in place by the rest of bcftools. -u I'm less sure about as it's used in many places as a synonym for uncompressed data, but again I don't really know bcftools conventions so I'll leave that up to you/Petr.
  • Deprecated items; drop them to free up the much over-used single letter alphabet.
  • --excl-flags shouldn't include supplementary. Supplementary reads can be primary alignments (unless marked as secondary) and therefore shouldn't be filtered.
  • -s and -S for samples / sample-file seems common in other bcftools commands, so seems a good fit. Maybe -r/-R should be for regions and change the existing -R to something else (or long only?).

Other things I don't really have an opinion about.

@mcshane
Copy link
Contributor Author

mcshane commented May 25, 2016

Thanks James. I agree with all that. One other change in the name of consistency within bcftools should probably be to drop the -l, --positions FILE option in favour of the -t, --targets/-T, --targets-file options. This would require a new short option for the -t, --output-tags option though.

@jkbonfield
Copy link
Contributor

On Wed, May 25, 2016 at 03:53:00AM -0700, Shane McCarthy wrote:

Thanks James. I agree with all that. One other change in the name of consistency within bcftools should probably be to drop the -l, --positions FILE option in favour of the -t, --targets/-T, --targets-file options. This would require a new short option for the -t, --output-tags option though.

Rather confusingly this is -b BED-file in samtools depth and -l in
mpileup. The help text is somewhat obfuscated too as:

"skip unlisted positions (chr pos) or regions (BED)"

It took me some time to parse the double negative and lack of
bracketting. My initial mental reading implied skip ... regions
(BED).

Maybe reword it as "Only output specified positions (chr pos) or regions (BED)".

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@mcshane mcshane force-pushed the feature/mpileup branch 2 times, most recently from bf1c255 to cddc380 Compare May 27, 2016 11:54
@mcshane
Copy link
Contributor Author

mcshane commented May 27, 2016

OK. I've added three more commits. First one removes the easily deprecated options. The second introduces the --gvcf generation from @pd3's branch. The third introduces sample selection options.

The final sync-up for mpileup with the @pd3 branch is pd3/bcftools@cfd7cf9, which is waiting on the decision on the htslib regidx changes.

In terms of further possible option rearrangements:

  1. include the option to add the ^ prefix to exclude samples for the -s and -S options.
  2. replace -l by -t and -T for non-indexed region extraction, including the option to add the ^ prefix to exclude regions.
  3. the -t,--output-tags short option will need to be changed to avoid clashing with the --targets option. Perhaps -a is a good option - standing for annotation? Could also change the long option to --annotate TAG or similar.
  4. change -G/--exclude-RG FILE to something like -G/--readgroups FILE to select RGs with exclusion coming via the ^ prefix again. Also allow for a second column to set sample names as for Feature request: mpileup rename samples (@RG SM:) on fly samtools#324. Sample name mapping could also be defined by a second column in the --samples-file FILE. This reading should be somewhat flexible as I could imagine adding other RG or SM specific annotation to these files, such as contamination estimate that could be used to adjust the likelihood calculations.

@mcshane mcshane force-pushed the feature/mpileup branch 2 times, most recently from 7b48fd6 to 79a2317 Compare June 23, 2016 08:23
pd3 added a commit that referenced this pull request Jun 27, 2016
* prefix with ^ to negate the selection
* assign/rename samples by providing second field:
    RG_ID_1  SAMPLE_A
    RG_ID_2  SAMPLE_A
    RG_ID_3  SAMPLE_B
* on read group name conflict give the alignment file,
  asterisk for all reads in the file:
    RG_ID_1  FILE_1.bam  SAMPLE_A
    RG_ID_2  FILE_2.bam  SAMPLE_A
    *        FILE_3.bam  SAMPLE_C

Resolves 4th item in
#414 (comment)
and samtools/samtools#324.
mcshane pushed a commit that referenced this pull request Jul 2, 2016
* prefix with ^ to negate the selection
* assign/rename samples by providing second field:
    RG_ID_1  SAMPLE_A
    RG_ID_2  SAMPLE_A
    RG_ID_3  SAMPLE_B
* on read group name conflict give the alignment file,
  asterisk for all reads in the file:
    RG_ID_1  FILE_1.bam  SAMPLE_A
    RG_ID_2  FILE_2.bam  SAMPLE_A
    *        FILE_3.bam  SAMPLE_C

Resolves 4th item in
#414 (comment)
and samtools/samtools#324.
@mcshane
Copy link
Contributor Author

mcshane commented Jul 22, 2016

This branch is ready to be merged now. We should review docs and consider further changes once people have tried out the functionality on develop.

Not functional yet. Just copying over the files.

* bam_plcmd.c for the mpileup command
* bam2bcf.[ch] bam2bcf_indel.c for the VCF/BCF creation
* sample.[ch] for RG:SM handling
minimal changes to files copied from samtools in order to compile
in bcftools.

* use regidx from htslib rather than bedidx from samtools
* remove sam_opts calls as sam_opts.h not copied from samtools

Todo:

* copy over relevant functionality from sam_opts.h
* remove text based mpileup output
* update options and defaults
* bring over `---gvcf` and other changes from @pd3 fork
* deprecate `-g -v -u` options (still functional, but with warning)
* exit with message to use `samtools mpileup` if `-s/--output-MQ` used
* `-O` option was `--output-BP` and is not `--output-type` for consistency
  with other bcftools commands. If `[buzv]` not given as an option
  will warn

These catches for old text output options are probably not necessary
as users may not expect text output from `bcftools mpileup`.
* mpileup.1.out, mpileup.2.out, mpileup.3.out and mpileup.4.out
  are from samtools with mpileup.1.out and mpileup.3.out converted
  from the text output
* mpileup.5.out a new test with the newer AD, etc annotations
* sam/bam/cram test files all stored. perhaps there is some
  way to store one version and convert within the test ala
  the vcf-miniview in samtools?
adds to @4e7c8fb86349761fed1b290357dbc792222ecdcb
* remove deprecated `-g`, `-v`, `-u`, `-D`, `-V`, `-S`
* remove `-R` short option to make way for `--regions-file` option later
This commit brings over the `--gvcf` functionality from @pd3's branch,
consisting of relevant bits from pd3/bcftools@cf3219c
and pd3/bcftools@ee8210d

Reference only blocks will be merged into gVCF blocks when the
minimum per-sample depth falls in the intervals defined by the
argument to the `-gvcf` option. Documentaion added to explain
the merging and a test added.
pulling over of pd3/bcftools@cf5c354
adding in `-S,--samples-file` option and exiting if no samples are
read from the file or list

TODO: add exclude logic with `^` prefix as in other bcftools commands

removed `config.h` from `sample.c` as leftover this is in samtools,
but not bcftools at the moment
switch `-t/--output-tags` option to `-a/--annotate` to make room for
the `-t/--targets` option

available annotations are now listed on request with `-a ?` rather than
cluttering up the help output.
This is meant as a temporary change while we extend the
regidx api, but allow bcftools code to use these changes
before they appear in some form in htslib.

This commit does not add new features, just copies over
`regidx.[ch]` and rejiggers the linking to use these
local bcftools copies.

the `*_c` are removed due to relying on `hts_internal.h`
(see fc9aeb6f77668afed412119701c5c58b0fca8091)
* added functions to loop over all regions
* lazy index build in case random access is not required
* support for chromosome names only, beg-end coordinates not mandatory
* set cap at maximum coordinate at 2147483647, hts_itr does not support larger
* tab and reg parsers will throw on finding a `0` to catch user
  error of using 0-based rather than 1-based coords
* `-r/--region` replaced by `-r/--regions` which will accept a comma separated
  list of regions as in other bcftools commands. `--region` still accepted
* `-R/--regions-file` option added to read regions from a file

This commit lifts over work originally done in pd3/bcftools@cfd7cf9

Note: when more than one region is given, all indices are stored in memory,
which can be a problem when running on many bams. An alternative would be
to cache pre-filled `hts_itr`'s for each region.

Resolves #369
mcshane and others added 7 commits August 5, 2016 11:02
…ools commands

the point of `--no-version` is to remove invocation specify metadata in the header
lines for pipeline systems that are tracking this separately. we are outputting
the `##reference` line though in mpileup. could drop this as well when
`--no-version` used. seems silly to add a separate option.
* prefix with ^ to negate the selection
* assign/rename samples by providing second field:
    RG_ID_1  SAMPLE_A
    RG_ID_2  SAMPLE_A
    RG_ID_3  SAMPLE_B
* on read group name conflict give the alignment file,
  asterisk for all reads in the file:
    RG_ID_1  FILE_1.bam  SAMPLE_A
    RG_ID_2  FILE_2.bam  SAMPLE_A
    *        FILE_3.bam  SAMPLE_C

Resolves 4th item in
#414 (comment)
and samtools/samtools#324.
…us behaviour

Such reads can be matched explicitly using the question mark "?"
@mcshane
Copy link
Contributor Author

mcshane commented Aug 5, 2016

Rebased prior to merge.

@mcshane mcshane merged commit 924d4e2 into develop Aug 5, 2016
@mcshane mcshane deleted the feature/mpileup branch September 21, 2016 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants