
Warn if bgzf_getline() returned apparently UTF-16-encoded text #1487

Open
wants to merge 1 commit into base: develop
Conversation

@jmarshall (Member) commented Aug 1, 2022

Text files badly transferred from Windows may occasionally be UTF-16-encoded, and this may not be easily noticed by the user — see, for example, this Biostars report of tabix woes (the ??# is as reproduced locally by me, and just as confusing as the OP's output):

$ tabix myVcf.vcf.gz
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "??#"
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: ""

It turned out that the text VCF file was UTF-16-encoded, so the lines returned contained alternating NULs when interpreted as NUL-terminated C strings; hence every line appeared truncated to 0, 1, or BOM+1 characters.
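To illustrate the truncation: a minimal C sketch (illustration only, not HTSlib code; apparent_len is just a name for strlen here) showing why UTF-16LE bytes look like 0-, 1-, or BOM+1-character lines when treated as C strings.

```c
/* Illustration only (not HTSlib code): why UTF-16LE text appears
 * truncated when handled as NUL-terminated C strings. */
#include <string.h>

/* Apparent length of a line when treated as a C string: strlen()
 * stops at the first NUL, which in UTF-16LE follows every ASCII
 * character. */
static size_t apparent_len(const char *bytes) {
    return strlen(bytes);
}
```

For example, "##" encoded as UTF-16LE is the byte sequence '#', NUL, '#', NUL, so it appears to be a one-character line; with a leading FF FE byte-order mark it appears as three bytes — the "??#" seen above.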

HTSlib should not accept such encoding (as other tools surely don't, hence doing so would cause interoperability problems), but it should ideally emit a warning or error message identifying the problem clearly.

Reading text from an htsFile/samFile/vcfFile will already have failed with EFTYPE/ENOEXEC (and printed a suitable "Inappropriate file type or format" message) if the text file is UTF-16-encoded, as the encoding will not have been recognised by hts_detect_format(). So no changes are needed there, although htsFormatCategory could be extended with a utf16_text_format entry or similar to aid in reporting this (but I don't think that is really warranted).

OTOH, code that uses plain BGZF handles does not produce a clear diagnostic, as was seen with tabix: bgzf_getline() will return a UTF-16-encoded text line successfully, and code not expecting it will misinterpret the resulting string. This PR adds a context-appropriate diagnostic to the bgzf_getline() call sites in HTSlib:

  • in hts_readlist()/hts_readlines(), emit a warning (once, on the first line);
  • in tbx.c, emit a more specific error message if get_intv() parsing failure is due to UTF-16 encoding.
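The first bullet might be sketched as follows. This is a hedged illustration, not the PR's actual code: looks_utf16 and read_lines_with_warning are hypothetical names, and HTSlib's real detection heuristic may differ.

```c
/* Hedged sketch of the hts_readlist()/hts_readlines() change: warn
 * once, on the first line only, if the text looks UTF-16-encoded. */
#include <stdio.h>

/* Hypothetical detector: a UTF-16 byte-order mark (FF FE or FE FF),
 * or a NUL alternating with a non-NUL byte at the start of the line. */
static int looks_utf16(const unsigned char *u, size_t len) {
    if (len >= 2 && ((u[0] == 0xff && u[1] == 0xfe) ||
                     (u[0] == 0xfe && u[1] == 0xff)))
        return 1;
    return len >= 4 && ((u[0] == '\0') != (u[1] == '\0'))
                    && ((u[2] == '\0') != (u[3] == '\0'));
}

/* Returns the number of warnings printed: at most one, and only
 * when the very first line triggers the detector. */
static int read_lines_with_warning(const char *lines[],
                                   const size_t lens[], int n) {
    int warned = 0;
    for (int i = 0; i < n; i++) {
        if (i == 0 && looks_utf16((const unsigned char *) lines[i],
                                  lens[i])) {
            fprintf(stderr,
                    "[W::hts_readlines] text appears to be UTF-16-encoded\n");
            warned = 1;
        }
        /* ... normal per-line handling ... */
    }
    return warned;
}
```

Warning only on the first line keeps the diagnostic from being emitted once per record in a large list file while still surfacing the problem.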

With this PR, the Biostars poster's test case produces the following diagnostics, making the root cause apparent:

$ tabix myVcf.vcf.gz
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::get_intv] Failed to parse TBX_VCF: was wrong -p [type] used?
The offending line was: ""

@d-cameron commented Aug 1, 2022

As of VCF 4.3 encoding is explicit:

The character encoding of VCF files is UTF-8.

Newlines are also explicit:

Line separators must be CR+LF or LF

@jmarshall (Member, Author)

The data file in the OP's case was a VCF file, and it is true that VCF 4.3 explicitly, and VCF 4.1/4.2 de facto, require UTF-8 encoding. However, tabix also works with arbitrary position-oriented files that have no specification governing their encoding.

That said, there is no suggestion (in this PR, at least) that HTSlib or tabix should go out of their way to accept such wackily encoded input files. This PR just improves the error messages so that the user is aided in identifying the underlying cause of the parsing failure.

@daviesrob (Member)

This looks useful, but I'm not sure it quite works as intended yet. If I make myself a UTF-16 encoded vcf and try to index it, I get:

$ iconv -t UTF-16 test/index.vcf | ./bgzip > /tmp/i16.vcf.gz
$ ./tabix /tmp/i16.vcf.gz
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
  [... lots of similar messages omitted ...]
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::hts_idx_push] Unsorted positions on sequence #1: 2 followed by 1
tbx_index_build failed: /tmp/i16.vcf.gz

So it's reporting once for each line, because the indexer skips lines it can't parse. I guess that isn't much change from the previous behaviour, which also printed a message for each bad line, but should we treat this as more fatal? If the whole file is UTF-16, there's no way you're going to get an index for it.

Somewhat more interesting is the last line. That appeared because tbx_parse1() gets to look at the line before any UTF-16 check happens. It will also happily split input lines on NUL bytes, so occasionally it interprets what it gets as VCF even though the returned data is rubbish. Possibly it should reject lines that appear to have premature NUL bytes so the rest of the detection can work more reliably?

@jkbonfield (Contributor)

I'd be +1 for making it a fatal error instead.

As for warnings, though, I'm quite OK with invalid input producing one warning per line. It "encourages" people to fix their inputs and makes spotting the warning significantly more likely. :-) (Obviously warnings produced from valid data, e.g. something we don't fully support, shouldn't be quite so spammy.)

@jmarshall (Member, Author) commented Aug 1, 2022

Yes. In general I prefer not to add diagnostic output to library routines, so in the get_intv() case I just changed (in some cases) the existing message being printed rather than altering when a diagnostic might be printed. So it's working as intended by the PR's author 😄 but I suppose I could be convinced to make more drastic changes to get_intv()'s / tbx_parse1()'s behaviour. (Is it the right time for contemplating further behaviour changes?)

@daviesrob (Member)

Well, as no files supported by tabix should have NUL bytes in them, I don't think it's too unreasonable to have tbx_parse1() reject lines that look obviously wrong. It would be pretty easy to make it look for something like: len > 6, a NUL within the first four bytes, and perhaps another NUL two bytes further on. That should make it better at catching the UTF-16 case without getting false positives.
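That suggested rejection test might look something like the following sketch (the helper name is hypothetical; the thresholds are the ones proposed above). The two-byte spacing between NULs matches UTF-16's 2-byte code-unit stride.

```c
/* Sketch of the suggested tbx_parse1() sanity check: reject a line
 * as obviously wrong if it is longer than 6 bytes, has a NUL within
 * the first four bytes, and another NUL two bytes after that one. */
#include <stddef.h>

static int looks_obviously_wrong(const unsigned char *s, size_t len) {
    if (len <= 6)
        return 0;
    for (size_t i = 0; i < 4; i++) {
        if (s[i] == '\0' && s[i + 2] == '\0')
            return 1;   /* NUL pattern with UTF-16's 2-byte stride */
    }
    return 0;
}
```

Requiring the second NUL two bytes on is what keeps the check from firing on a single stray NUL while still matching UTF-16-encoded ASCII text.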

Also, as hts_is_utf16_text() is not going to mistake something valid for UTF-16, I think it would be reasonable to make get_intv() return -2 if it triggers, turning the error into a fatal one. Otherwise you could end up with the program apparently working, albeit very noisily, but making an empty index - which actually gets more likely if tbx_parse1() is better at detecting bad data.

@daviesrob daviesrob self-assigned this Aug 30, 2022

The message of the PR's single commit reads:

Text files badly transferred from Windows may occasionally be
UTF-16-encoded, and this may not be easily noticed by the user.
HTSlib should not accept such encoding (as other tools surely don't,
hence doing so would cause interoperability problems), but it should
ideally emit a warning or error message identifying the problem.

Reading text from a htsFile/samFile/vcfFile will already have failed
with EFTYPE/ENOEXEC if the text file is UTF-16-encoded, as the encoding
will not have been recognised by hts_detect_format().

OTOH bgzf_getline() will return a UTF-16-encoded text line. Add a
suitable context-dependent diagnostic to the BGZF-based bgzf_getline()
calls in HTSlib: in hts_readlist()/hts_readlines(), emit a warning
(once, on the first line); in tbx.c, emit a more specific error message
if get_intv() parsing failure is due to UTF-16 encoding.

[TODO] If utf16_text_format were added to htsFormatCategory,
the new is_utf16_text() function is suitable for detecting it.