
Warn if bgzf_getline() returned apparently UTF-16-encoded text #1487

Open
wants to merge 1 commit into base: develop
Conversation

@jmarshall (Member) commented Aug 1, 2022

Text files badly transferred from Windows may occasionally be UTF-16-encoded, and this may not be easily noticed by the user — see, for example, this Biostars report of tabix woes (the ??# is as reproduced locally by me, and just as confusing as the OP's output):

$ tabix myVcf.vcf.gz
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "??#"
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: ""

It turned out that the text VCF file was UTF-16-encoded, so the lines returned contained alternating NULs when interpreted as NUL-terminated C strings; hence every line appeared truncated to 0, 1, or BOM+1 characters.
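To illustrate the truncation: a minimal C sketch (illustration only, not HTSlib code; apparent_len is just a name for strlen here) showing why UTF-16LE bytes look like 0-, 1-, or BOM+1-character lines when treated as C strings.

```c
/* Illustration only (not HTSlib code): why UTF-16LE text appears
 * truncated when handled as NUL-terminated C strings. */
#include <string.h>

/* Apparent length of a line when treated as a C string: strlen()
 * stops at the first NUL, which in UTF-16LE follows every ASCII
 * character. */
static size_t apparent_len(const char *bytes) {
    return strlen(bytes);
}
```

For example, "##" encoded as UTF-16LE is the byte sequence '#', NUL, '#', NUL, so it appears to be a one-character line; with a leading FF FE byte-order mark it appears as three bytes — the "??#" seen above.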

HTSlib should not accept such encoding (as other tools surely don't, hence doing so would cause interoperability problems), but it should ideally emit a warning or error message identifying the problem clearly.

Reading text from an htsFile/samFile/vcfFile will already have failed with EFTYPE/ENOEXEC (and printed a suitable "Inappropriate file type or format" message) if the text file is UTF-16-encoded, as the encoding will not have been recognised by hts_detect_format(). So no changes are needed there, although htsFormatCategory could be extended with a utf16_text_format entry or similar to aid in reporting this (but I don't think that is really warranted).

OTOH, code that uses plain BGZF handles does not produce a clear diagnostic, as was seen with tabix: bgzf_getline() will return a UTF-16-encoded text line successfully, and code not expecting it will misinterpret the resulting string. This PR adds a context-appropriate diagnostic to the bgzf_getline() call sites in HTSlib:

  • in hts_readlist()/hts_readlines(), emit a warning (once, on the first line);
  • in tbx.c, emit a more specific error message if get_intv() parsing failure is due to UTF-16 encoding.
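The first bullet might be sketched as follows. This is a hedged illustration, not the PR's actual code: looks_utf16 and read_lines_with_warning are hypothetical names, and HTSlib's real detection heuristic may differ.

```c
/* Hedged sketch of the hts_readlist()/hts_readlines() change: warn
 * once, on the first line only, if the text looks UTF-16-encoded. */
#include <stdio.h>

/* Hypothetical detector: a UTF-16 byte-order mark (FF FE or FE FF),
 * or a NUL alternating with a non-NUL byte at the start of the line. */
static int looks_utf16(const unsigned char *u, size_t len) {
    if (len >= 2 && ((u[0] == 0xff && u[1] == 0xfe) ||
                     (u[0] == 0xfe && u[1] == 0xff)))
        return 1;
    return len >= 4 && ((u[0] == '\0') != (u[1] == '\0'))
                    && ((u[2] == '\0') != (u[3] == '\0'));
}

/* Returns the number of warnings printed: at most one, and only
 * when the very first line triggers the detector. */
static int read_lines_with_warning(const char *lines[],
                                   const size_t lens[], int n) {
    int warned = 0;
    for (int i = 0; i < n; i++) {
        if (i == 0 && looks_utf16((const unsigned char *) lines[i],
                                  lens[i])) {
            fprintf(stderr,
                    "[W::hts_readlines] text appears to be UTF-16-encoded\n");
            warned = 1;
        }
        /* ... normal per-line handling ... */
    }
    return warned;
}
```

Warning only on the first line keeps the diagnostic from being emitted once per record in a large list file while still surfacing the problem.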

With this PR, the Biostars poster's test case produces the following diagnostics, making the root cause apparent:

$ tabix myVcf.vcf.gz
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::get_intv] Failed to parse TBX_VCF: was wrong -p [type] used?
The offending line was: ""

@d-cameron commented Aug 1, 2022

As of VCF 4.3 encoding is explicit:

The character encoding of VCF files is UTF-8.

Newlines are also explicit:

Line separators must be CR+LF or LF

@jmarshall (Member, Author)

The data file in the OP's case was a VCF file, and it is true that VCF 4.3 explicitly, and VCF 4.1/4.2 de facto, require UTF-8 encoding. However, tabix also works with arbitrary position-oriented files that have no specification governing their encoding.

That said, there is no suggestion (in this PR, at least) that HTSlib or tabix should go out of their way to accept such wackily encoded input files. This PR just improves the error messages so that the user is aided in identifying the underlying cause of the parsing failure.

@daviesrob (Member)

This looks useful, but I'm not sure it quite works as intended yet. If I make myself a UTF-16 encoded vcf and try to index it, I get:

$ iconv -t UTF-16 test/index.vcf | ./bgzip > /tmp/i16.vcf.gz
$ ./tabix /tmp/i16.vcf.gz
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
  [... lots of similar messages omitted ...]
[E::get_intv] Failed to parse TBX_VCF: offending line appears to be encoded as UTF-16
[E::hts_idx_push] Unsorted positions on sequence #1: 2 followed by 1
tbx_index_build failed: /tmp/i16.vcf.gz

So it's reporting once for each line, because the indexer skips lines it can't parse. I guess that isn't much change from the previous behaviour, which also printed a message for each bad line, but should we treat this as more fatal? If the whole file is UTF-16, there's no way you're going to get an index for it.

Somewhat more interesting is the last line. That appeared because tbx_parse1() gets to look at the line before any UTF-16 check happens. It will also happily split input lines on NUL bytes, so occasionally it interprets what it gets as VCF even though the returned data is rubbish. Possibly it should reject lines that appear to have premature NUL bytes so the rest of the detection can work more reliably?

@jkbonfield (Contributor)

I'd be +1 for making it a fatal error instead.

As for warnings, though, I'm quite OK with invalid input producing one warning per line. It "encourages" people to fix their inputs and makes spotting the warning significantly more likely. :-) (Obviously warnings produced from valid data, e.g. something we don't fully support, shouldn't be quite so spammy.)

@jmarshall (Member, Author) commented Aug 1, 2022

Yes. In general I prefer not to add diagnostic output to library routines, so in the get_intv() case I just changed (in some cases) the existing message being printed rather than altering when a diagnostic might be printed. So it's working as intended by the PR's author 😄 but I suppose I could be convinced to make more drastic changes to get_intv()'s / tbx_parse1()'s behaviour. (Is it the right time for contemplating further behaviour changes?)

@daviesrob (Member)

Well, as no files supported by tabix should have NUL bytes in them, I don't think it's too unreasonable to have tbx_parse1() reject lines that look obviously wrong. It would be pretty easy to make it look for something like: len > 6, a NUL within the first four bytes, and perhaps another NUL two bytes further on. That should make it better at catching the UTF-16 case without getting false positives.
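That suggested rejection test might look something like the following sketch (the helper name is hypothetical; the thresholds are the ones proposed above). The two-byte spacing between NULs matches UTF-16's 2-byte code-unit stride.

```c
/* Sketch of the suggested tbx_parse1() sanity check: reject a line
 * as obviously wrong if it is longer than 6 bytes, has a NUL within
 * the first four bytes, and another NUL two bytes after that one. */
#include <stddef.h>

static int looks_obviously_wrong(const unsigned char *s, size_t len) {
    if (len <= 6)
        return 0;
    for (size_t i = 0; i < 4; i++) {
        if (s[i] == '\0' && s[i + 2] == '\0')
            return 1;   /* NUL pattern with UTF-16's 2-byte stride */
    }
    return 0;
}
```

Requiring the second NUL two bytes on is what keeps the check from firing on a single stray NUL while still matching UTF-16-encoded ASCII text.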

Also, as hts_is_utf16_text() is not going to mistake something valid for UTF-16, I think it would be reasonable to make get_intv() return -2 if it triggers, turning the error into a fatal one. Otherwise you could end up with the program apparently working, albeit very noisily, but making an empty index - which actually gets more likely if tbx_parse1() is better at detecting bad data.

@daviesrob daviesrob self-assigned this Aug 30, 2022

The message of the PR's single commit reads:

Text files badly transferred from Windows may occasionally be
UTF-16-encoded, and this may not be easily noticed by the user.
HTSlib should not accept such encoding (as other tools surely don't,
hence doing so would cause interoperability problems), but it should
ideally emit a warning or error message identifying the problem.

Reading text from a htsFile/samFile/vcfFile will already have failed
with EFTYPE/ENOEXEC if the text file is UTF-16-encoded, as the encoding
will not have been recognised by hts_detect_format().

OTOH bgzf_getline() will return a UTF-16-encoded text line. Add a
suitable context-dependent diagnostic to the BGZF-based bgzf_getline()
calls in HTSlib: in hts_readlist()/hts_readlines(), emit a warning
(once, on the first line); in tbx.c, emit a more specific error message
if get_intv() parsing failure is due to UTF-16 encoding.

[TODO] If utf16_text_format were added to htsFormatCategory,
the new is_utf16_text() function is suitable for detecting it.