The GPU version should be easy to improve once we update to a newer libcudf (25.08+): from that release on the chunked reader supports row offsets, so we can use those directly.
(We may also be able to mix row groups and row offsets, but I am not sure. The only reason to do that is that the CPU version currently needs the row groups.)
The CPU version is worse, because it wasn't clear to us whether Arrow offers a good way to limit reads efficiently. It probably still decompresses some data that we don't need.
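
To make the row-group vs. row-offset trade-off concrete, here is a rough sketch of how a requested row range maps onto whole row groups. It is not the code in this PR: `row_groups_for_range` and its parameters are hypothetical names, and the row-group sizes are assumed to come from the Parquet footer metadata.

```python
def row_groups_for_range(row_group_sizes, skip_rows, num_rows):
    """Map a row range onto the row groups that overlap it.

    Hypothetical helper for illustration only.
    Returns (row_group_indices, offset_in_first_group), where the offset is
    the number of rows to drop from the start of the first selected group.
    """
    if skip_rows < 0 or num_rows < 0:
        raise ValueError("skip_rows and num_rows must be non-negative")
    if num_rows == 0:
        return [], 0

    start, stop = skip_rows, skip_rows + num_rows
    selected = []
    offset = 0
    group_start = 0
    for idx, size in enumerate(row_group_sizes):
        group_stop = group_start + size
        # Half-open interval overlap: [group_start, group_stop) vs [start, stop)
        if size > 0 and group_start < stop and group_stop > start:
            if not selected:
                offset = start - group_start
            selected.append(idx)
        group_start = group_stop
    return selected, offset


# Rows 150..209 of a file with row groups of 100, 100 and 50 rows
groups, offset = row_groups_for_range([100, 100, 50], skip_rows=150, num_rows=60)
print(groups, offset)
```

With pyarrow, the selected groups could then be read with `ParquetFile.read_row_groups(groups)` and trimmed with `Table.slice(offset, num_rows)`. That still decodes every selected row group in full, which is the unnecessary decompression mentioned above; a reader with native row offsets can skip that work.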