Error parsing sample names following reheader #1451

Closed
etnite opened this issue Mar 24, 2021 · 4 comments · Fixed by samtools/htslib#1262

etnite commented Mar 24, 2021

I have encountered an error when using bcftools reheader to update sample names using a two-column text file. I have attached an example using the header of the file I'm working with. I have simply changed the extension from .vcf to .txt so that GitHub would allow upload. This file has many samples, but only a few thousand variants.

bcftools query -l in.vcf gives the expected output of sample names (in this case there are 12,833 of them).

Running bcftools reheader in.vcf -s samps_rename_key.txt -o out.vcf completes without any output to stderr.

However, attempting to use the output file, for instance bcftools view out.vcf gives the error:

[E::bcf_hdr_add_sample_len] Duplicated sample name '[Thousands of whitespace-delimited sample names here]
'
Failed to read from out.vcf: could not parse header

I thought this might be a duplicate of #1408, but I have double-checked that all files involved are ASCII encoded, and I have not found any whitespace, tabs, or other unusual characters in any sample names.
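For anyone who wants to repeat that kind of check, here is a minimal sketch in Python. It is not part of bcftools; the function name `suspicious_names` and the toy names below are hypothetical, and the list of names is assumed to come from something like `bcftools query -l in.vcf`.

```python
def suspicious_names(names):
    """Return (name, reason) pairs for names that contain non-ASCII,
    whitespace, or other non-printable characters."""
    flagged = []
    for name in names:
        if not name.isascii():
            flagged.append((name, "non-ASCII character"))
        elif any(ch.isspace() for ch in name):
            flagged.append((name, "whitespace"))
        elif not name.isprintable():
            flagged.append((name, "control character"))
    return flagged


# Toy input: two clean names plus two deliberately bad ones (made up).
names = ["02C01", "OH07-159-62", "bad name", "caf\u00e9"]
flagged = suspicious_names(names)
for name, reason in flagged:
    print(f"{name!r}: {reason}")
```

An empty result only rules out odd characters; it says nothing about whether the names are unique.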

in.txt
samps_rename_key.txt

etnite commented Mar 25, 2021

As an update, I tried the substitution in Julia using a dictionary lookup/replace and ended up with the same result.

It strikes me now that many of the sample names being replaced are valid hexadecimal numbers, e.g. "02C01". Not sure if that is significant in any way.

etnite commented Mar 26, 2021

A further update: the name substitution ends up creating some duplicate sample names. However, all sample names after the first duplicated one (OH07-159-62) appear to be printed in the error message as a single sample name, separated by tabs. It would be helpful if bcftools reheader could catch the newly created duplicates, throw an error, and avoid creating a malformed output file in the first place.

valeriuo self-assigned this Mar 29, 2021

valeriuo (Contributor) commented

The problem is your VCF file already contains OH07-159-62 as an original sample name, while your sample name translation file (samps_rename_key.txt) contains the mapping GSTP-69 -> OH07-159-62. Thus, after the sample name translation performed by reheader, your output VCF will contain both the original OH07-159-62 sample name, as well as the translation of GSTP-69.
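The collision described above can be caught before running reheader by simulating the rename on the list of sample names. The following is a rough sketch only, not how bcftools itself works: the helper names `parse_rename_key` and `find_rename_collisions` are hypothetical, the key file is assumed to hold whitespace-separated "old_name new_name" pairs, and the toy data reuses the two names from this thread plus one made-up mapping.

```python
from collections import Counter


def parse_rename_key(lines):
    """Parse 'old_name new_name' pairs, one per line, split on whitespace.
    Lines that do not have exactly two fields are skipped."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            old, new = fields
            mapping[old] = new
    return mapping


def find_rename_collisions(samples, mapping):
    """Return the names that would occur more than once after renaming."""
    renamed = [mapping.get(name, name) for name in samples]
    counts = Counter(renamed)
    return sorted(name for name, n in counts.items() if n > 1)


# Toy data: GSTP-69 is renamed to a name that already exists in the file.
# The target name SAMPLE-A is made up for illustration.
samples = ["GSTP-69", "OH07-159-62", "02C01"]
key_lines = ["GSTP-69 OH07-159-62", "02C01 SAMPLE-A"]

collisions = find_rename_collisions(samples, parse_rename_key(key_lines))
print(collisions)  # ['OH07-159-62']
```

With real data, `samples` would be the output of `bcftools query -l in.vcf` and `key_lines` the lines of samps_rename_key.txt; a non-empty result means the renamed header would contain duplicate sample names.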

etnite commented Apr 1, 2021

Thank you @valeriuo
