Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve support for very long input lines (> 2Gbyte) #1542

Merged
merged 3 commits into from Jan 16, 2023

Conversation

daviesrob
Copy link
Member

Some changes to make reading and writing very long lines work better. It is a partial fix for #1539 - the overflow mentioned in dealt with, but more work will be needed to make the very long VCF records readable.

  • Cap the return value of bgzf_getline() to avoid integer overflow.
  • Change data types in tbx_parse1() and vcf_write() so they can handle long lines (albeit only on 64-bit platforms).
  • Put an extra loop around read() and write() in fd_read(), fd_write() to ensure they really have processed the expected amount of data. This is especially important on Linux, which has a limit on the amount of data that will be read or written in a single call.

This is mostly useful for tabix, which doesn't do much interpretation of its input. With these changes it will happily index and return long lines if it has enough memory.

This does not change the maximum size of a SAM record, as that is limited by bam1_t::l_data which is an int.
The situation for VCF is a bit more complicated, and it may be possible to get very slightly over 2Gbytes as various limits apply to different parts of the record. The size of the sample data, which is likely to be the biggest part, is currently restricted to 2Gbytes by this check, and by the size of bcf_fmt_t::p_off which is 31 bits. It might be possible to work around the p_off limit by abusing the ability of bcf_fmt_t to store edited data in a separate buffer - but doing so would be a bit hairy and would need a lot of thought and testing.

Prevents overflow on long lines, which could result in a negative
number being returned even if the read was otherwise successful.
Change the type of len, i and b so they can index positions
over 2Gbytes into the input string.  Naturally this is only useful
on 64 bit platforms as the entire input needs to be in memory.

Luckily this function isn't exported, so changing the type of
len does not affect the ABI.
@jmarshall
Copy link
Member

Re the fd_read()/fd_write() change: as documented in struct hFILE_backend in hfile_internal.h, it is expected that these functions might not read or write their entire buffers.

For fd_write(), it is always called via hFILE_backend::write(). There are two invocations of that in the hfile.c front-end code, and both of them implement such loops to ensure all data is written.

For fd_read(), it is always called via hFILE_backend::read(). There are three invocations of that: that in hread2() implements a loop to ensure the bulk of a large request is read; that in refill_buffer() does not have a loop but records the number of bytes read, and its callers loop as necessary; the final one is in multipart_read(), which is itself an hFILE_backend::read() method that will be recalled in a loop as necessary by the front-end functions.

So it should not be necessary to have this sort of loop inside the back-end functions. Did you encounter a bug without this commit? If so, that would seem to signal a bug in the front-end code instead.

Moreover, as mentioned in the refill_buffer() doc string, it only calls the back-end reader once and may not completely refill the buffer. It's implied 😄 that the back-end function is only going to do a system call once too. This was intentional, written this way with interactively typing a SAM file into a terminal in mind. In that scenario, IIRC each read(2) system call would return one line after the user pressed Enter. Doing only one read(2) system call for each sam_read1() call means that sam_read1() does not block when reading from an interactive terminal, which seemed desirable at the time. (Similar considerations might apply for reading from the network.)

(I don't recall whether this was described in the commit messages of the time.)

It needs to be ssize_t to prevent possible overflow when writing
very long records on 64 bit platforms.
@daviesrob
Copy link
Member Author

Yes, you're right. It does work without the fd_read()/fd_write() commit. I possibly thought they were a problem because I hadn't fixed the vcf_write() type at that point, so the whole process still wasn't working and I maybe didn't go far enough in the debugger.

I'll remove that commit.

@whitwham whitwham merged commit 9b6f7e1 into samtools:develop Jan 16, 2023
@daviesrob daviesrob deleted the long_lines branch January 16, 2023 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants