Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text file type detection updates #721

Closed
wants to merge 9 commits into from
Closed

Conversation

jmarshall
Copy link
Member

Dusting off some old branches, notably to fix #200, making identification of SAM files more accurate. At present, any text file is recognised as SAM, which is less than ideal.

This adds four new entries to htsExactFormat, so it may be a good time to review #587 with a view to adding these entries with a suitable prefix from the start.

Unfortunately some code, notably bcftools, has taken advantage of the bug that any text file is recognised as SAM to open arbitrary text files with hts_open() in contravention of its documentation:

/*! @abstract       Open a SAM/BAM/CRAM/VCF/BCF/etc file
…
*/
htsFile *hts_open(const char *fn, const char *mode);

Ideally such code would be rewritten to use bgzf_open/bgzf_getline/etc instead of misusing hts_open/hts_getline/etc.

Alternatively the use of hts_open to read text files could be blessed. The patch contains workarounds to enable such use (as currently tabix.c is guilty of the same thing), but bcftools may need similar alterations to that in tabix.c to work with this fixed file type detection PR.

@daviesrob
Copy link
Member

Thanks. It does cause issues for bcftools, so we'll have to sort those out before merging this.

@jmarshall
Copy link
Member Author

The bcftools problem is that it has file arguments that are now recognised as bed or fasta, that it tries to open using hts_open() — which fails because those formats are not in the case statements in hts_open() and hts_close().

So this is a policy question for the maintainers: should htsFile continue to be for SAM/BAM/CRAM and VCF/BCF files exclusively?

If so, the fix for bcftools is to do something different for bed files in synced_bcf_reader.c's bcf_sr_regions_init() (and maybe a couple of places in bcftools), and to apply a fix like samtools/bcftools#821.

If not, the fix for bcftools is to add fastaand bed and other text formats to those two case statements, and bless the use of htsFile for use with hts_getline and text input files. (And maybe apply that bcftools PR anyway.)

BTW I have a couple more commits to add to this branch while you folks contemplate that policy question…

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 27, 2018

What's the proposed alternative to hts_open? I guess bgzf_open is the closest thing in functionality. (Indeed I just realised your other PR uses that.)

I'm not sure what the original intention was for hts_open with regards to policy. It does claim to be "format-neutral I/O", but doesn't really clarify beyond that.

I've always found it a bit confusing that the htsExactFormat enum has both specific file types (bam, sam, bcf, bed, etc) as well as generic ones (binary, text). The category also has unknown in there, rather than it being an error. Both of those do rather imply it is intended for generic data, but it's a bit of guesswork. If it wasn't intended that way, then I'm not convinced the generic ones should exist at all in something claiming to be "exact" format, although internally they do sometimes get used. (Eg VCF is sometimes labelled as text and not vcf; I fixed this in the indexing PR.)

So my gut instinct is it should probably work. Breaking other code though implies this is perhaps best delayed until the ABI breaking stuff lands (even though this is only API breaking, it's easy to lump them together and the timing isn't great for a late change like this).

Edit: also the hts layer has been advertising itself as handling bed since 2014, even if it wasn't fully supported. It's been in the enum since abd1efb. I can understand why people use hts_open for bed files.

@jmarshall
Copy link
Member Author

The proposed alternative is

Ideally such code would be rewritten to use bgzf_open/bgzf_getline/etc instead

…or plain hopen/hgetln/etc if they don't need to take advantage of gzip/BGZF decompression being done for them.

text_format/binary_format exist because after hts_open("wb") it's impossible to know whether the exact format is going to be bam or bcf — see 529ca88. This is called compromising with reality 😄 But in general hts_detect_format() is intended to work on any file, which is not necessarily the case for hts_open(). The two are separate.

It turns out that that “hts_open: Open a SAM/BAM/CRAM/VCF/BCF/etc file” documentation was written by me in 2013 in d038d8e. The hts_getline function was already present at the time (though it was introduced for reading SAM files). So one point of view is that the 2013 documentation over-constrained the function and it would be in keeping with the function as it existed prior to 2013 to relax that to “Open a sequence data (SAM/BAM/CRAM) or variant data (VCF/BCF) or textual line-orientated file” (or something shorter!) and thus bless reading line-by-line with hts_open().

If people agree with this solution, I can rework this PR to avoid any breakage, in bcftools or otherwise.

… until the ABI breaking stuff lands … timing … a late change …

Would you like to give us a hint when ABI breakage is planned to land, and what timing “late” might relate to?

@jmarshall
Copy link
Member Author

Edit: also the hts layer has been advertising itself as handling bed since 2014, even if it wasn't fully supported. It's been in the enum since abd1efb. I can understand why people use hts_open for bed files.

That enum lists the formats the code is capable of detecting and that can be printed out by htsfile. This is a separate part of the “hts layer”. The enum also lists crai and csi (and at the time bcfv1) but I don't think anyone really believes that hts_open() will do anything useful with those.

@jkbonfield
Copy link
Contributor

I see, I hadn't realised the distinction between what hts_detect_format is meant to work on vs hts_open. None of this is particularly obvious!

As far as timing, the release candidate is meant to be today, so this is what I mean by late. :-)
The ABI breaking stuff got punted again sadly, but the plan is it'll arrive asap after the release goes out, so develop has the new stuff in for the longest time it can, to give us lots of time to spot potential gotchas.

@jmarshall
Copy link
Member Author

I'm planning to rework this as described in #721 (comment) so as to minimise other changes needed.

Fixes samtools#717 and facilitates upcoming hts_detect_format() changes
that will mean it no longer always returns SAM for unknown data.
jmarshall and others added 6 commits August 28, 2019 13:49
We currently don't do anything with such data, and unfortunately we
can't peek at the contents. But we can print "Unknown bzip2-compressed
data" rather than "Unknown data" for e.g. bzip2-ed FASTQ files, which
have been seen in the wild.

Also for gzip compression: set compression_level according to XFL
when CM is DEFLATE (BGZF does not set XFL, but plain gzip does).
The BSDs and macOS define EFTYPE ("Inappropriate file type or format"),
so use it to report invalid input file formats instead of the less ideal
but more portable ENOEXEC ("Exec format error").

(tabix.c also checks for ENOEXEC so as to print a better message than
"Exec format error", but strerror(EFTYPE) is already a good message
so there is no need to check for EFTYPE there as well.)

Samtools and bcftools contain no checks against ENOEXEC, so no changes
are needed in those repositories.
Fixes samtools#576 and interacts with upcoming changes making text_format
a likely input format to these functions.

(One abort() remains, indicating unimplemented functionality for CRAM.)
Trying to read an empty file as SAM is likely to be a case of
`somecmd | samtools view` with `somecmd` aborting and outputting
nothing. So having sam_hdr_read() and sam_read1() fail in this
case ensures the error is propagated. Fixes samtools#261. OTOH reading
text files line-by-line via hts_open/hts_getline/hts_close should
work even for an empty file.

Add tests for reading an empty file as SAM and line-by-line,
and update xx#blank.sam test case.
Enlarge the buffers in decompress_peek() and hts_detect_format() so
that an entire line or two can be peeked at. Parse a tab-separated line
and match the column types (but allowing for the possibility that we've
only peeked an incomplete prefix of the line) to try to differentiate
SAM, CRAI, and BED, with others potentially to follow.

Note that avoiding false negatives is vital: if hts_detect_format()
fails to recognise a SAM file as such, then hts_open() will refuse to
open it.

Fixes samtools#200.

Relax htslib/hts.h's hts_open() documentation to (re-)bless its use for
opening arbitrary text files to be read line-by-line via hts_getline().
Such files will now be detected as text_format or bed (rather than the
previous "everything textual was sam"); update tabix.c accordingly.
Add htsExactFormat entries for these four file types to htslib/hts.h.

Place FASTQ in the sequence_data category as it can be considered to
be a representation of unmapped SAM. FASTA on the other hand is not,
as we don't operate on it in the same way as SAM/BAM/CRAM sequence data
and in particular we wouldn't ever want hts_open() to accept it, as
then swapping filename arguments in `samtools mpileup -f foo bar` etc
would be less detectable.

Add tests verifying that FASTA and FASTQ (and SAM) can be read
line-by-line via hts_open/hts_getline/hts_close.

Fixes samtools#719.
@jmarshall
Copy link
Member Author

jmarshall commented Aug 30, 2019

This branch is now finally updated to a point where it is ready for merging. Rebased and conflicts resolved, and two significant changes from previously which mean it is now ready:

  • Disallowing empty SAM files (detecting a sam/bam/cram file that is of 0 size. #261) is now implemented: empty files can be opened with hts_open() and read with hts_getline, but sam_hdr_read now fails.

    Why it was necessary to implement this is now lost in the mists of time, but it has to do with the xx#blank.sam test cases involving empty files — without work one way or the other in this area, the test suite was failing with this PR. Re-figured out why some work on empty files was needed: with the more accurate text-file detector, empty files were coming out as unknown. So either length==0 needed to be explicitly special-cased as SAM to make empty files acceptable and xx#blank.sam happy; or length==0 needed to be not-SAM and xx#blank.sam adjusted.

  • Reworked to bless using hts_open to read arbitrary text files line-by-line, and tests added to ensure that this works. With this done, there are no issues caused for bcftools.

All of the htslib, samtools, and bcftools test suites pass when run against this branch.

@jkbonfield
Copy link
Contributor

The xx#blank.sam test was one of mine, lifted over from io_lib. It was deliberately empty in order to test that case, as it's a valid file, but I'm completely in agreement with taking a stance that valid-or-not, we shouldn't accept it. The fudge to put a comment in it is good as it still means it tests functionality in e.g. iterators.

Thanks.

{
samFile *in = sam_open(filename, "r");
if (in) {
bam1_t *aln = bam_init1();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A check for the empty_format would be a nice addition.
Something like:

if (in->format.format != empty_format)
    fail("file %s is not marked as empty\n", filename);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks.

Also use the accessor function in tabix.c, which is outwith the library.
Change this check from >=0 to >=-1 so that sam_read1() returning
EOF also fails the test -- we want to see an error indication.
Refactor now that sam_read1() invocation and check are separate.

Also test/sam.c's fail() function adds \n itself (oops!).
@valeriuo
Copy link
Contributor

valeriuo commented Sep 5, 2019

Merged into develop as acdab39 and preceding commits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

htsfile predicts "SAM version 1" for all text files
4 participants