Improve support for very long input lines (> 2Gbyte) #1542

daviesrob · 2023-01-04T17:50:13Z

Some changes to make reading and writing very long lines work better. It is a partial fix for #1539 - the overflow mentioned in dealt with, but more work will be needed to make the very long VCF records readable.

Cap the return value of bgzf_getline() to avoid integer overflow.
Change data types in tbx_parse1() and vcf_write() so they can handle long lines (albeit only on 64-bit platforms).
Put an extra loop around read() and write() in fd_read(), fd_write() to ensure they really have processed the expected amount of data. This is especially important on Linux, which has a limit on the amount of data that will be read or written in a single call.

This is mostly useful for tabix, which doesn't do much interpretation of its input. With these changes it will happily index and return long lines if it has enough memory.

This does not change the maximum size of a SAM record, as that is limited by bam1_t::l_data which is an int.
The situation for VCF is a bit more complicated, and it may be possible to get very slightly over 2Gbytes as various limits apply to different parts of the record. The size of the sample data, which is likely to be the biggest part, is currently restricted to 2Gbytes by this check, and by the size of bcf_fmt_t::p_off which is 31 bits. It might be possible to work around the p_off limit by abusing the ability of bcf_fmt_t to store edited data in a separate buffer - but doing so would be a bit hairy and would need a lot of thought and testing.

Prevents overflow on long lines, which could result in a negative number being returned even if the read was otherwise successful.

Change the type of len, i and b so they can index positions over 2Gbytes into the input string. Naturally this is only useful on 64 bit platforms as the entire input needs to be in memory. Luckily this function isn't exported, so changing the type of len does not affect the ABI.

jmarshall · 2023-01-06T10:44:01Z

Re the fd_read()/fd_write() change: as documented in struct hFILE_backend in hfile_internal.h, it is expected that these functions might not read or write their entire buffers.

For fd_write(), it is always called via hFILE_backend::write(). There are two invocations of that in the hfile.c front-end code, and both of them implement such loops to ensure all data is written.

For fd_read(), it is always called via hFILE_backend::read(). There are three invocations of that: that in hread2() implements a loop to ensure the bulk of a large request is read; that in refill_buffer() does not have a loop but records the number of bytes read, and its callers loop as necessary; the final one is in multipart_read(), which is itself an hFILE_backend::read() method that will be recalled in a loop as necessary by the front-end functions.

So it should not be necessary to have this sort of loop inside the back-end functions. Did you encounter a bug without this commit? If so, that would seem to signal a bug in the front-end code instead.

Moreover, as mentioned in the refill_buffer() doc string, it only calls the back-end reader once and may not completely refill the buffer. It's implied 😄 that the back-end function is only going to do a system call once too. This was intentional, written this way with interactively typing a SAM file into a terminal in mind. In that scenario, IIRC each read(2) system call would return one line after the user pressed Enter. Doing only one read(2) system call for each sam_read1() call means that sam_read1() does not block when reading from an interactive terminal, which seemed desirable at the time. (Similar considerations might apply for reading from the network.)

(I don't recall whether this was described in the commit messages of the time.)

It needs to be ssize_t to prevent possible overflow when writing very long records on 64 bit platforms.

daviesrob · 2023-01-06T11:55:08Z

Yes, you're right. It does work without the fd_read()/fd_write() commit. I possibly thought they were a problem because I hadn't fixed the vcf_write() type at that point, so the whole process still wasn't working and I maybe didn't go far enough in the debugger.

I'll remove that commit.

daviesrob added 2 commits January 4, 2023 12:32

Cap bgzf_getline return value to INT_MAX

dcced53

Prevents overflow on long lines, which could result in a negative number being returned even if the read was otherwise successful.

Use correct type for ret in vcf_write()

f0ddc57

It needs to be ssize_t to prevent possible overflow when writing very long records on 64 bit platforms.

daviesrob force-pushed the long_lines branch from b36a913 to f0ddc57 Compare January 6, 2023 11:55

daviesrob assigned whitwham Jan 10, 2023

whitwham merged commit 9b6f7e1 into samtools:develop Jan 16, 2023

daviesrob deleted the long_lines branch January 16, 2023 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve support for very long input lines (> 2Gbyte) #1542

Improve support for very long input lines (> 2Gbyte) #1542

daviesrob commented Jan 4, 2023

jmarshall commented Jan 6, 2023

daviesrob commented Jan 6, 2023

Improve support for very long input lines (> 2Gbyte) #1542

Improve support for very long input lines (> 2Gbyte) #1542

Conversation

daviesrob commented Jan 4, 2023

jmarshall commented Jan 6, 2023

daviesrob commented Jan 6, 2023