bcftools consensus: --mask/--mask-with overwrites --mask-del #1386
Comments
It is not exactly intended, but is a consequence of the implementation: after the sequence is masked with another character (which is the first step in the process), overlapping variants are skipped. The only reason for this behavior is to keep the code simpler; some of the functionality (e.g. applying IUPAC codes) would break otherwise. I can see that applying indels to the masked sequence would inform about the total length, but I am not sure how important that is. This behavior should at least be documented.
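To make the order of operations described above concrete, here is a toy Python model. It is only a sketch of the behaviour as explained in this comment (mask first, then skip any variant that overlaps a masked base), not the actual bcftools implementation; the function name `toy_consensus` and its arguments are hypothetical.

```python
def toy_consensus(ref, variants, mask_regions, mask_char="N", del_char="-"):
    """Toy model: mask first, then skip variants overlapping a masked base.

    ref          -- reference sequence (str)
    variants     -- list of (pos, ref_allele, alt_allele), pos is 1-based
    mask_regions -- list of (start, end), 1-based and inclusive
    """
    seq = list(ref)
    masked = [False] * len(seq)

    # Step 1: overwrite every masked position with the mask character.
    for start, end in mask_regions:
        for i in range(start - 1, end):
            seq[i] = mask_char
            masked[i] = True

    # Step 2: apply variants from right to left so that the coordinates
    # of the variants still to be processed remain valid.
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        begin, stop = pos - 1, pos - 1 + len(ref_allele)
        if any(masked[begin:stop]):
            continue  # the variant overlaps a masked base and is skipped
        new = list(alt_allele)
        if len(alt_allele) < len(ref_allele):
            # Mark deleted bases instead of dropping them.
            new += [del_char] * (len(ref_allele) - len(alt_allele))
        seq[begin:stop] = new
        masked[begin:stop] = [False] * len(new)
    return "".join(seq)


if __name__ == "__main__":
    deletion = [(3, "GTAC", "G")]  # removes positions 4-6
    print(toy_consensus("ACGTACGTAC", deletion, []))
    print(toy_consensus("ACGTACGTAC", deletion, [(5, 6)]))
```

In this model, a mask that covers only part of the deletion is enough for the whole deletion record to be skipped, so the unmasked part keeps its reference bases and the masked part shows the mask character.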
I agree that it is probably a good idea to document this more clearly. You make a good point that
The documentation and the usage page have been updated to reflect this.
Hello, Sorry to open an old issue, but I'm having the same problem and I'm trying to wrap my head around it. Basically, I would like to generate a consensus fasta sequence for our SARS-CoV-2 samples based on a vcf file. I want (1) the reference bases to be replaced with well supported variants, and (2) low-depth regions to be masked with 'N'. It works fine, except for the case of deletions. The deletions can be strongly supported by the reads, and as a consequence those sites in the genome are 'low coverage' (depth < 10). Consider the following:
When I try to mark the deletions with '-', the deletions are applied as expected. The low-coverage sites (e.g., 5' and 3' UTRs) remain in the consensus since I am not masking them. Command:
When I try to mark the deletions with '-' and mask low-coverage sites, the deletions are not applied as expected. The low-coverage sites are masked with N in the consensus. Command:
The $SITES variable refers to a simple text file with column 1 = chromosome (reference) name and column 2 = position in the genome (1-based). I first get the depth at every position using

Screenshot of deletion region:

Since positions 22029, 22030, and 22031 have >10x depth (just), they are not depth-masked and appear in the consensus with the reference bases. Positions 22032, 22033, and 22034 have <10x depth and are masked with N. The desired outcome would be for all six bases to have a '-' character.

I would hope that there is a way to make the two settings play together, i.e. to mask low-coverage sites unless they correspond to a variant. In the case of a deletion, the sites should not be depth-masked with N if they fall within the deletion. I tried following the above conversation in this issue, and it seems that the sites are masked with N first, and then if a variant falls within the masked region it is skipped (i.e., the opposite of my desired outcome).

Is there a way to have the desired outcome? Have there been any changes to

Version info:

Thanks,
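The desired outcome described above (mask low-coverage sites unless they fall within a deletion) can be approximated by filtering the sites file before it is handed to the masking option. The Python sketch below shows only that filtering step. It is one possible approach under stated assumptions, not necessarily what was suggested in this thread; the function name `low_depth_sites`, the chromosome label `ref`, and the allele bases in the example are hypothetical placeholders.

```python
def low_depth_sites(depths, variants, min_depth=10):
    """Return low-depth (chrom, pos) sites that are not inside a deletion.

    depths   -- iterable of (chrom, pos, depth), pos is 1-based
    variants -- iterable of (chrom, pos, ref_allele, alt_allele), VCF-style
    """
    deleted = set()
    for chrom, pos, ref_allele, alt_allele in variants:
        if len(alt_allele) < len(ref_allele):
            # A VCF deletion keeps its leading anchor base(s); the deleted
            # bases are the ones that follow them on the reference.
            for p in range(pos + len(alt_allele), pos + len(ref_allele)):
                deleted.add((chrom, p))
    return [
        (chrom, pos)
        for chrom, pos, depth in depths
        if depth < min_depth and (chrom, pos) not in deleted
    ]


if __name__ == "__main__":
    # Illustrative numbers modelled on the case above: positions
    # 22029-22034 are deleted and only some of them fall below 10x.
    depths = [
        ("ref", 5, 2), ("ref", 22028, 40), ("ref", 22029, 11),
        ("ref", 22030, 11), ("ref", 22031, 12), ("ref", 22032, 8),
        ("ref", 22033, 7), ("ref", 22034, 6), ("ref", 22035, 30),
    ]
    variants = [("ref", 22028, "GAGTTCA", "G")]  # placeholder alleles
    for chrom, pos in low_depth_sites(depths, variants):
        print(f"{chrom}\t{pos}")  # same two-column layout as $SITES
```

With the deleted positions removed from the sites file, the deletion no longer overlaps a masked base, so under the behaviour explained earlier in this thread it should be applied and marked rather than skipped.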
How about something like this, would this work?
Fantastic - that works a treat! Thanks for the fast reply. In case it helps anyone else, I had to change my upstream commands with
Charles
The recently added parameters --mark-del (#1381) and --mask-with (#1382) exhibit unexpected behaviour.

Consider the following command:

where the regions defined in temp.lowcoverage.bed coincide with a deletion.

As the --mark-del parameter comes after the --mask/--mask-with parameters, I would expect the consensus sequence to have a '-' in the position of the deletion. However, the position contains an 'N'.

Is this intended behaviour, or a bug?