New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text file type detection updates #721
Conversation
Thanks. It does cause issues for bcftools, so we'll have to sort those out before merging this. |
The bcftools problem is that it has file arguments that are now recognised as So this is a policy question for the maintainers: should If so, the fix for bcftools is to do something different for If not, the fix for bcftools is to add BTW I have a couple more commits to add to this branch while you folks contemplate that policy question… |
What's the proposed alternative to I'm not sure what the original intention was for hts_open with regards to policy. It does claim to be "format-neutral I/O", but doesn't really clarify beyond that. I've always found it a bit confusing that the htsExactFormat enum has both specific file types (bam, sam, bcf, bed, etc) as well as generic ones (binary, text). The category also has unknown in there, rather than it being an error. Both of those do rather imply it is intended for generic data, but it's a bit of guesswork. If it wasn't intended that way, then I'm not convinced the generic ones should exist at all in something claiming to be "exact" format, although internally they do sometimes get used. (Eg VCF is sometimes labelled as text and not vcf; I fixed this in the indexing PR.) So my gut instinct is it should probably work. Breaking other code though implies this is perhaps best delayed until the ABI breaking stuff lands (even though this is only API breaking, it's easy to lump them together and the timing isn't great for a late change like this). Edit: also the hts layer has been advertising itself as handling bed since 2014, even if it wasn't fully supported. It's been in the enum since abd1efb. I can understand why people use hts_open for bed files. |
The proposed alternative is
…or plain
It turns out that that “hts_open: Open a SAM/BAM/CRAM/VCF/BCF/etc file” documentation was written by me in 2013 in d038d8e. The If people agree with this solution, I can rework this PR to avoid any breakage, in bcftools or otherwise.
Would you like to give us a hint when ABI breakage is planned to land, and what timing “late” might relate to? |
That enum lists the formats the code is capable of detecting and that can be printed out by |
I see, I hadn't realised the distinction between what hts_detect_format is meant to work on vs hts_open. None of this is particularly obvious! As far as timing, the release candidate is meant to be today, so this is what I mean by late. :-) |
I'm planning to rework this as described in #721 (comment) so as to minimise other changes needed. |
Fixes samtools#717 and facilitates upcoming hts_detect_format() changes that will mean it no longer always returns SAM for unknown data.
We currently don't do anything with such data, and unfortunately we can't peek at the contents. But we can print "Unknown bzip2-compressed data" rather than "Unknown data" for e.g. bzip2-ed FASTQ files, which have been seen in the wild. Also for gzip compression: set compression_level according to XFL when CM is DEFLATE (BGZF does not set XFL, but plain gzip does).
The BSDs and macOS define EFTYPE ("Inappropriate file type or format"), so use it to report invalid input file formats instead of the less ideal but more portable ENOEXEC ("Exec format error"). (tabix.c also checks for ENOEXEC so as to print a better message than "Exec format error", but strerror(EFTYPE) is already a good message so there is no need to check for EFTYPE there as well.) Samtools and bcftools contain no checks against ENOEXEC, so no changes are needed in those repositories.
Fixes samtools#576 and interacts with upcoming changes making text_format a likely input format to these functions. (One abort() remains, indicating unimplemented functionality for CRAM.)
Trying to read an empty file as SAM is likely to be a case of `somecmd | samtools view` with `somecmd` aborting and outputting nothing. So having sam_hdr_read() and sam_read1() fail in this case ensures the error is propagated. Fixes samtools#261. OTOH reading text files line-by-line via hts_open/hts_getline/hts_close should work even for an empty file. Add tests for reading an empty file as SAM and line-by-line, and update xx#blank.sam test case.
Enlarge the buffers in decompress_peek() and hts_detect_format() so that an entire line or two can be peeked at. Parse a tab-separated line and match the column types (but allowing for the possibility that we've only peeked an incomplete prefix of the line) to try to differentiate SAM, CRAI, and BED, with others potentially to follow. Note that avoiding false negatives is vital: if hts_detect_format() fails to recognise a SAM file as such, then hts_open() will refuse to open it. Fixes samtools#200. Relax htslib/hts.h's hts_open() documentation to (re-)bless its use for opening arbitrary text files to be read line-by-line via hts_getline(). Such files will now be detected as text_format or bed (rather than the previous "everything textual was sam"); update tabix.c accordingly.
Add htsExactFormat entries for these four file types to htslib/hts.h. Place FASTQ in the sequence_data category as it can be considered to be a representation of unmapped SAM. FASTA on the other hand is not, as we don't operate on it in the same way as SAM/BAM/CRAM sequence data and in particular we wouldn't ever want hts_open() to accept it, as then swapping filename arguments in `samtools mpileup -f foo bar` etc would be less detectable. Add tests verifying that FASTA and FASTQ (and SAM) can be read line-by-line via hts_open/hts_getline/hts_close. Fixes samtools#719.
This branch is now finally updated to a point where it is ready for merging. Rebased and conflicts resolved, and two significant changes from previously which mean it is now ready:
All of the htslib, samtools, and bcftools test suites pass when run against this branch. |
The xx#blank.sam test was one of mine, lifted over from io_lib. It was deliberately empty in order to test that case, as it's a valid file, but I'm completely in agreement with taking a stance that valid-or-not, we shouldn't accept it. The fudge to put a comment in it is good as it still means it tests functionality in e.g. iterators. Thanks. |
{ | ||
samFile *in = sam_open(filename, "r"); | ||
if (in) { | ||
bam1_t *aln = bam_init1(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A check for the empty_format would be a nice addition.
Something like:
if (in->format.format != empty_format)
fail("file %s is not marked as empty\n", filename);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, thanks.
Also use the accessor function in tabix.c, which is outwith the library.
Change this check from >=0 to >=-1 so that sam_read1() returning EOF also fails the test -- we want to see an error indication. Refactor now that sam_read1() invocation and check are separate. Also test/sam.c's fail() function adds \n itself (oops!).
Merged into develop as acdab39 and preceding commits |
Dusting off some old branches, notably to fix #200, making identification of SAM files more accurate. At present, any text file is recognised as SAM, which is less than ideal.
This adds four new entries to
htsExactFormat
, so it may be a good time to review #587 with a view to adding these entries with a suitable prefix from the start.Unfortunately some code, notably bcftools, has taken advantage of the bug that any text file is recognised as SAM to open arbitrary text files with
hts_open()
in contravention of its documentation:Ideally such code would be rewritten to use
bgzf_open
/bgzf_getline
/etc instead of misusinghts_open
/hts_getline
/etc.Alternatively the use of
hts_open
to read text files could be blessed. The patch contains workarounds to enable such use (as currently tabix.c is guilty of the same thing), but bcftools may need similar alterations to that in tabix.c to work with this fixed file type detection PR.