Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools consensus with deletions starting outside region of interest #2091

Closed
nage0178 opened this issue Feb 1, 2024 · 3 comments
Closed

Comments

@nage0178
Copy link

nage0178 commented Feb 1, 2024

Is there a way for the consensus command to take into account deletions that start outside of the region of interest, analogous to the -r option for other commands? For example, if the sample from individual_ID has a deletion starting at 5219000 and ending at 5221000, the command below will not remove the deletion. I do not want the deletion in the consensus sequence.

samtools faidx reference.fa chr11:5220000-5227000 > reference_region.fa
cat reference_region.fa |bcftools consensus -H 1pIU -s individual_ID --mark-del - --mark-ins lc example.vcf.gz
@pd3 pd3 closed this as completed in d25e3f1 Feb 12, 2024
@pd3
Copy link
Member

pd3 commented Feb 12, 2024

There is now a new option --regions-overlap which works similarly to other commands: overlapping variants starting outside of the target region of the fasta file can be ignored with --regions-overlap 0 or taken into account when set to 1 or 2.

Note for the latter there will be ambiguous cases, see samtools/htslib#1746.

Please test this out and let me know if you encounter anything odd.

@nage0178
Copy link
Author

Thank you for your response. The --regions-overlap does not appear to work with the consensus command. The command

cat reference_region.fa|bcftools consensus -H 1pIU -s individual_ID --mark-del - --mark-ins lc --regions-overlap 1 example.vcf.gz

gives the following output

consensus: unrecognized option '--regions-overlap'

About: Create consensus sequence by applying VCF variants to a reference fasta
       file. By default, the program will apply all ALT variants. Using the
       --samples (and, optionally, --haplotype) option will apply genotype
       (or haplotype) calls from FORMAT/GT. The program ignores allelic depth
       information, such as INFO/AD or FORMAT/AD.
Usage: bcftools consensus [OPTIONS] <file.vcf.gz>
Options:
    -c, --chain FILE               Write a chain file for liftover
    -a, --absent CHAR              Replace positions absent from VCF with CHAR
    -e, --exclude EXPR             Exclude sites for which the expression is true (see man page for details)
    -f, --fasta-ref FILE           Reference sequence in fasta format
    -H, --haplotype WHICH          Choose which allele to use from the FORMAT/GT field, note
                                   the codes are case-insensitive:
                                       N: N={1,2,3,..} is the index of the allele from GT, regardless of phasing (e.g. "2")
                                       R: REF allele in het genotypes
                                       A: ALT allele
                                       I: IUPAC code for all genotypes
                                       LR,LA: longer allele and REF/ALT if equal length
                                       SR,SA: shorter allele and REF/ALT if equal length
                                       NpIu: index of the allele for phased and IUPAC code for unphased GTs (e.g. "2pIu")
    -i, --include EXPR             Select sites for which the expression is true (see man page for details)
    -I, --iupac-codes              Output IUPAC codes based on FORMAT/GT, use -s/-S to subset samples
        --mark-del CHAR            Instead of removing sequence, insert character CHAR for deletions
        --mark-ins uc|lc|CHAR      Highlight insertions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
        --mark-snv uc|lc|CHAR      Highlight substitutions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
    -m, --mask FILE                Replace regions according to the next --mask-with option. The default is --mask-with N
        --mask-with CHAR|uc|lc     Replace with CHAR (skips overlapping variants); change to uppercase (uc) or lowercase (lc)
    -M, --missing CHAR             Output CHAR instead of skipping a missing genotype "./."
    -o, --output FILE              Write output to a file [standard output]
    -p, --prefix STRING            Prefix to add to output sequence names
    -s, --samples LIST             Comma-separated list of samples to include, "-" to ignore samples and use REF,ALT
    -S, --samples-file FILE        File of samples to include
Examples:
   # Get the consensus for one region. The fasta header lines are then expected
   # in the form ">chr:from-to".
   samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

   # See also http://samtools.github.io/bcftools/howtos/consensus-sequence.html

I confirmed I have the latest version of bcftools and the --regions-overlap options works with the view command.

@pd3
Copy link
Member

pd3 commented Feb 12, 2024

This feature was added few hours ago by d25e3f1. You'll need to update to the latest github version
http://samtools.github.io/bcftools/howtos/install.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants