Prerequisites
- I have written a descriptive issue title
- I have verified that I am running the latest version of ImageSharp
- I have verified that the problem exists in both `DEBUG` and `RELEASE` modes
- I have searched open and closed issues to ensure it has not already been reported
Description
To pair with #1597, I have identified some opportunities for improvement in baseline scan decoding performance (haven't looked at progressive yet, but the same ideas might apply).
I had a look through the asm from `HuffmanScanDecoder.DecodeBlockBaseline()`, and while there are some small codegen issues that are easy to fix without refactoring, I was only able to knock about 70 bytes off the code -- out of a total method size of 2400 bytes including inlined callees -- which was barely measurable in the full decode benchmarks.
The bulk of that full method is made up of calls to `BufferedReadStream.ReadByte()`, which is inlined a total of 12 times into the outer method. `BufferedReadStream` itself is quite clever in that it allows `ReadByte()` to be devirtualized and inlined, but each byte read still comes with significant overhead.
`ImageSharp/src/ImageSharp/IO/BufferedReadStream.cs`, lines 129 to 148 in 82bca55
When inlined into the caller, it compiles to this x64 asm:
```asm
mov r8,qword ptr [r14]      ; read 8-byte BufferedReadStream base pointer to r8
mov rcx,qword ptr [r8+28h]  ; read 8-byte `readerPosition` field
cmp rcx,qword ptr [r8+30h]  ; read 8-byte `Length` property
jl LENGTH_OK                ; if `readerPosition` < `Length`, continue to read
mov r9d,0FFFFFFFFh          ; else load -1
jmp END                     ; and return it
LENGTH_OK:
mov ecx,dword ptr [r8+3Ch]  ; read 4-byte `readBufferIndex` field
cmp ecx,dword ptr [r8+38h]  ; read 4-byte `maxBufferIndex` field
jle BUFFER_FILLED           ; if `readBufferIndex` <= `maxBufferIndex`, continue to buffer read
mov qword ptr [rsp+28h],r8  ; else spill r8 to stack (it may be overwritten by method call)
mov rcx,r8                  ; set up `this` arg for method
call FillReadBuffer()       ; <-- that
mov r8,qword ptr [rsp+28h]  ; restore r8 from stack
BUFFER_FILLED:
inc qword ptr [r8+28h]      ; increment `readerPosition` field
mov rcx,qword ptr [r8+20h]  ; read 8-byte `pinnedReadBuffer` field
mov r9d,dword ptr [r8+3Ch]  ; read 4-byte `readBufferIndex` field
lea r10d,[r9+1]             ; increment it
mov dword ptr [r8+3Ch],r10d ; write it back
movsxd r8,r9d               ; sign-extend `readBufferIndex` to native int length
movzx r9d,byte ptr [rcx+r8] ; read 1 byte @ `pinnedReadBuffer`+`readBufferIndex` and return it
END:
```
The `movsxd` could be eliminated by using a native-sized `readBufferIndex`, but even then this would still be a huge amount of overhead for fetching a single byte of data.
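For illustration, the sign-extension goes away if the index field is declared as `nint` rather than `int`, since the index is then already pointer-sized when it is used to address the buffer. A minimal sketch of the idea -- the class and field names here are hypothetical, not ImageSharp's actual layout:

```csharp
using System;

class PinnedBufferSketch
{
    private readonly byte[] buffer;

    // Hypothetical: a native-sized index means no `movsxd` widening
    // instruction is needed before the indexed byte load.
    private nint readBufferIndex;

    public PinnedBufferSketch(byte[] data) => this.buffer = data;

    public int ReadByteUnchecked()
    {
        // Compiles to a single indexed load; the increment writes the
        // pointer-sized index back without a widening step.
        return this.buffer[this.readBufferIndex++];
    }
}

class Program
{
    static void Main()
    {
        var reader = new PinnedBufferSketch(new byte[] { 0xAB, 0xCD });
        Console.WriteLine(reader.ReadByteUnchecked()); // 171
        Console.WriteLine(reader.ReadByteUnchecked()); // 205
    }
}
```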
Additionally, the Huffman decoder has checks and recovery built in for handling bad data, which add overhead -- not necessarily to execution time in this case, since they're unlikely to be hit, but they do add to code size.
I would think (and I'm about to show my ignorance of JPEG Huffman coding, so correct me if I'm wrong) that it should be possible to build a happy path through the decoder that depends on eagerly peeking a number of bytes (one that should suffice for 99+% of all blocks) from the stream into a stack buffer and working through those without the need for individual byte loads. If that's something like 64 bytes, since that would be 1:1 compression, it would be easy to satisfy. If it needs to handle a few degenerate cases where the compressed data is larger than the original, maybe double that?
The current resilient implementation could serve as a backstop: if the happy path doesn't work for some reason, the decoder could bail and restart that block with the slow path. This shouldn't happen often, and only in cases where perf becomes secondary to security and correctness.
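Roughly, the structure I have in mind looks like the sketch below. Everything here is hypothetical -- `PrefetchSize`, `TryDecodeBlockFast`, and `DecodeBlockSlow` are illustrative stand-ins, not ImageSharp APIs, and the "decoder" just sums bytes up to a zero terminator so the control flow is testable:

```csharp
using System;
using System.IO;

static class HappyPathSketch
{
    // Hypothetical prefetch size: larger than one 64-coefficient block
    // to leave headroom for poorly-compressing data.
    private const int PrefetchSize = 128;

    public static int DecodeBlock(Stream stream)
    {
        long blockStart = stream.Position;
        Span<byte> prefetch = stackalloc byte[PrefetchSize];
        int available = stream.Read(prefetch);

        if (TryDecodeBlockFast(prefetch.Slice(0, available), out int consumed, out int result))
        {
            // Happy path: rewind past the bytes we over-read.
            stream.Position = blockStart + consumed;
            return result;
        }

        // Backstop: restart the block with the resilient per-byte decoder.
        stream.Position = blockStart;
        return DecodeBlockSlow(stream);
    }

    // Stand-in for the fast decoder: sums bytes until a zero terminator.
    // A real implementation would walk the Huffman tables instead.
    private static bool TryDecodeBlockFast(ReadOnlySpan<byte> data, out int consumed, out int result)
    {
        result = 0;
        for (consumed = 0; consumed < data.Length; consumed++)
        {
            byte b = data[consumed];
            if (b == 0)
            {
                consumed++; // account for the terminator
                return true;
            }
            result += b;
        }
        return false; // ran off the end of the prefetched data; caller restarts
    }

    // Stand-in for the existing resilient per-byte path.
    private static int DecodeBlockSlow(Stream stream)
    {
        int result = 0;
        int b;
        while ((b = stream.ReadByte()) > 0)
        {
            result += b;
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        using var ms = new MemoryStream(new byte[] { 1, 2, 3, 0, 5, 6 });
        Console.WriteLine(HappyPathSketch.DecodeBlock(ms)); // 6
        Console.WriteLine(ms.Position); // 4, rewound past the over-read
    }
}
```

The key property is that the fast path only touches the stream twice per block (one bulk read, one position fixup), instead of once per byte, and failure is always recoverable because the stream position is restored before falling back.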
I don't have the time to work on restructuring the code just yet, but does that sound reasonable, @JimBobSquarePants?