Warn if bgzf_getline() returned apparently UTF-16-encoded text #1487
As of VCF 4.3 encoding is explicit:
Newlines are also explicit:
The data file in the OP's case was a VCF file, and it is true that VCF 4.3 explicitly and VCF 4.1/4.2 de facto require UTF-8 encoding. However, tabix also works with arbitrary position-oriented files that don't have specifications about encodings. Even so, there is no suggestion (in this PR, at least) that HTSlib or tabix should go out of their way to accept such wackily encoded input files. This PR just improves the error messages so that the user is aided in identifying the underlying cause of the parsing failure.
This looks useful, but I'm not sure it quite works as intended yet. If I make myself a UTF-16 encoded vcf and try to index it, I get:
So it's reporting once for each line, because the indexer skips lines it can't parse. I guess that isn't much change from previous behaviour which also printed a message for each bad line, but should we treat this as more fatal? If the whole file is Somewhat more interesting is the last line. That appeared because
I'd be +1 for making it a fatal error instead. As for warnings, though, I'm quite OK with invalid input producing one warning per line of invalid data. It "encourages" people to fix their inputs and makes spotting the warning significantly more likely. :-) (Obviously warnings produced from valid data, e.g. something we don't fully support or another form of warning, shouldn't be quite so spammy.)
Yes. In general I prefer not to add diagnostic output to library routines, so in the
Well, as no files supported by Also, as
Text files badly transferred from Windows may occasionally be UTF-16-encoded, and this may not be easily noticed by the user. HTSlib should not accept such encoding (as other tools surely don't, hence doing so would cause interoperability problems), but it should ideally emit a warning or error message identifying the problem. Reading text from a htsFile/samFile/vcfFile will already have failed with EFTYPE/ENOEXEC if the text file is UTF-16-encoded, as the encoding will not have been recognised by hts_detect_format(). OTOH bgzf_getline() will return a UTF-16-encoded text line. Add a suitable context-dependent diagnostic to the BGZF-based bgzf_getline() calls in HTSlib: in hts_readlist()/hts_readlines(), emit a warning (once, on the first line); in tbx.c, emit a more specific error message if get_intv() parsing failure is due to UTF-16 encoding. [TODO] If utf16_text_format were added to htsFormatCategory, the new is_utf16_text() function is suitable for detecting it.
Text files badly transferred from Windows may occasionally be UTF-16-encoded, and this may not be easily noticed by the user; see, for example, this report of tabix woes (??# is as reproduced locally by me, and equally as confusing as the OP's output):

It turned out that the text VCF file was UTF-16-encoded, and the lines returned contained alternating NULs when interpreted as ASCII C-strings, hence all lines appearing truncated to 0 or 1 or BOM+1 characters.
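The truncation described above can be reproduced without HTSlib. The following is a minimal sketch, assuming the line is the ASCII-range text "#CHROM"; the array names and the apparent_length() helper are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* "#CHROM" in UTF-16, with and without a byte order mark (BOM).
   Each array ends with a 16-bit NUL so strlen() always terminates. */
static const unsigned char utf16le[] =
    { '#',0, 'C',0, 'H',0, 'R',0, 'O',0, 'M',0, 0,0 };
static const unsigned char utf16be[] =
    { 0,'#', 0,'C', 0,'H', 0,'R', 0,'O', 0,'M', 0,0 };
static const unsigned char utf16le_bom[] =
    { 0xFF,0xFE, '#',0, 'C',0, 'H',0, 'R',0, 'O',0, 'M',0, 0,0 };
static const unsigned char utf16be_bom[] =
    { 0xFE,0xFF, 0,'#', 0,'C', 0,'H', 0,'R', 0,'O', 0,'M', 0,0 };

/* Length as seen by code that treats the buffer as an ASCII C-string
   (hypothetical helper for this illustration). */
static size_t apparent_length(const unsigned char *line)
{
    return strlen((const char *) line);
}
```

Under these assumptions apparent_length() gives 1 for utf16le, 0 for utf16be, and 3 (the BOM plus one character) for utf16le_bom; the big-endian variant with a BOM stops at 2, immediately after the BOM.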
HTSlib should not accept such encoding (as other tools surely don't, hence doing so would cause interoperability problems), but it should ideally emit a warning or error message identifying the problem clearly.
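To illustrate how such an encoding might be recognised before a diagnostic is emitted, here is a rough heuristic. It is not the is_utf16_text() function added by this PR, whose implementation is not shown here; looks_like_utf16() is a hypothetical name, and the sketch assumes the line holds only ASCII-range characters.

```c
#include <stddef.h>

/* Hypothetical heuristic, NOT HTSlib's is_utf16_text(): returns 1 if the
   first len bytes look like UTF-16-encoded ASCII-range text, else 0. */
static int looks_like_utf16(const unsigned char *buf, size_t len)
{
    size_t i, pairs, be_shape = 0, le_shape = 0;

    if (len < 2) return 0;

    /* A leading byte order mark is taken as sufficient evidence.
       (A UTF-32LE BOM also begins FF FE; this sketch ignores that.) */
    if ((buf[0] == 0xFF && buf[1] == 0xFE) ||
        (buf[0] == 0xFE && buf[1] == 0xFF))
        return 1;

    /* Otherwise look at each complete byte pair: ASCII-range UTF-16 has
       a NUL in exactly one position of every pair. */
    pairs = len / 2;
    for (i = 0; i + 1 < len; i += 2) {
        if (buf[i] == 0 && buf[i + 1] != 0) be_shape++;  /* 00 xx */
        if (buf[i] != 0 && buf[i + 1] == 0) le_shape++;  /* xx 00 */
    }

    /* Deliberately strict: every pair must have the same shape. */
    return be_shape == pairs || le_shape == pairs;
}
```

Requiring every pair to match keeps false positives low on ordinary text, at the cost of missing UTF-16 lines that contain characters outside the ASCII range; a real implementation may well make a different trade-off.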
Reading text from a htsFile/samFile/vcfFile will already have failed with EFTYPE/ENOEXEC (and printed a suitable "Inappropriate file type or format" message) if the text file is UTF-16-encoded, as the encoding will not have been recognised by hts_detect_format(). So no changes are needed here, although this could be extended to add utf16_text_format or so to htsFormatCategory to aid in reporting this (but I don't think that is really warranted).

OTOH code that uses plain BGZF handles does not return a clear diagnostic, as was seen with tabix. bgzf_getline() will return a UTF-16-encoded text line successfully, but code not expecting it will misinterpret the resulting string. This PR adds a diagnostic suitable for each context to the BGZF-based bgzf_getline() calls in HTSlib: in hts_readlist()/hts_readlines(), emit a warning (once, on the first line); in tbx.c, emit a more specific error message if get_intv() parsing failure is due to UTF-16 encoding.

With this PR, the biostars poster's test case produces the following diagnostics and the root cause is apparent: