
Multithreading PNG decoding, "parallel" PNGs #54

Open
richgel999 opened this issue Dec 14, 2021 · 15 comments

Comments

@richgel999

richgel999 commented Dec 14, 2021

It looks possible to write completely standard PNG files that can be optionally decompressed across multiple threads:
https://twitter.com/richgel999/status/1470846966276018187

To do this, you would use multiple IDATs, force a flush to a byte boundary at the end of each IDAT using a zero- or one-byte uncompressed Deflate block (when necessary - about 1/8th of the time the stream will already be byte-aligned), and make sure the LZ matches never cross IDAT block boundaries. A special standard ancillary extension chunk (which would be ignored by existing readers) would hint to new decoders that the file can be decoded in parallel.
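For illustration, here is a minimal sketch of that encoding strategy in Python using zlib. Note that `Z_FULL_FLUSH` both byte-aligns the output (it always emits the empty stored block, rather than only when needed) and resets the LZ window, so no match can cross a piece boundary:

```python
import zlib

def compress_pieces(scanline_pieces):
    """Compress each piece of filtered scanline data into its own IDAT
    payload; the concatenated payloads form one valid zlib stream."""
    comp = zlib.compressobj(level=9)
    payloads = []
    for i, piece in enumerate(scanline_pieces):
        last = i == len(scanline_pieces) - 1
        data = comp.compress(piece)
        # Z_FULL_FLUSH byte-aligns the stream and resets the LZ window;
        # Z_FINISH terminates the zlib stream on the final piece.
        data += comp.flush(zlib.Z_FINISH if last else zlib.Z_FULL_FLUSH)
        payloads.append(data)
    return payloads
```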

The filtering can also be done in parallel with decompression, using a simple pipeline.

This would significantly reduce the wall-clock time involved in reading large (8k-32k) PNG files on modern machines.

@DavidBuchanan314

DavidBuchanan314 commented Dec 14, 2021

I've been thinking about this also: brion/mtpng#20

My original idea was to have some metadata that points to offsets within the file, for each zlib sub-block. However, your idea of just using IDAT block boundaries could make more sense, and only requires one bit of additional metadata to signal.

An additional metadata flag could be used to signal that the first row of each block has filter type None (or another filter type that does not reference the previous row) - if this is known by the decoder, then it can perform defiltering in parallel too.

@randy408

randy408 commented Dec 14, 2021

I've had this idea for a while and have made my png-restart-marker repo public; it's kind of like JPEG restart markers for PNG. I'll push what I have so far tomorrow.

> An additional metadata flag could be used to signal that the first row of each block has filter type None (or another filter type that does not reference the previous row) - if this is known by the decoder, then it can perform defiltering in parallel too.

If you start on a scanline, IDAT, and DEFLATE boundary, the filter byte will be the first thing. I would just exclude filter types 2, 3, and 4 at the start of each segment; then each worker is truly independent.

@richgel999
Author

> I've been thinking about this also: brion/mtpng#20
>
> My original idea was to have some metadata that points to offsets within the file, for each zlib sub-block. However, your idea of just using IDAT block boundaries could make more sense, and only requires one bit of additional metadata to signal.
>
> An additional metadata flag could be used to signal that the first row of each block has filter type None (or another filter type that does not reference the previous row) - if this is known by the decoder, then it can perform defiltering in parallel too.

Yes. The only spec change would be the definition of a new standard ancillary chunk type (which would be ignored by existing readers) that indicates which levels of parallelization are supported in the file: parallel IDAT decompression only, or parallel IDAT decompression plus parallel de-filtering. It could also indicate the number of scanlines present in each IDAT chunk.
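As a rough illustration only (the chunk name `phNT` and its layout here are hypothetical, not taken from any existing proposal), such a chunk could be as small as five bytes of payload:

```python
import struct
import zlib

def make_parallel_hint_chunk(scanlines_per_idat, parallel_defilter):
    """Hypothetical ancillary chunk: a 4-byte scanline count per IDAT plus a
    1-byte parallelization level (0 = decompression only, 1 = also de-filtering)."""
    chunk_type = b"phNT"  # hypothetical: ancillary, private, not safe-to-copy
    payload = struct.pack(">IB", scanlines_per_idat, 1 if parallel_defilter else 0)
    return (struct.pack(">I", len(payload)) + chunk_type + payload
            + struct.pack(">I", zlib.crc32(chunk_type + payload)))
```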

This is so useful that my plan is to put together a quick prototype for 24bpp and 32bpp, specifically in order to benchmark it.

@richgel999
Author

I think if we could agree on a standard chunk type that contains this metadata, multiple independent implementations could then be created with faster writing/reading. This would be extremely valuable (large PNGs, like 8k-16k, are now becoming common).

@DavidBuchanan314

One question: should the safe-to-copy bit be set in the chunk name? I suppose it shouldn't.

An unaware PNG processing tool may do things like combine or re-split IDAT chunks, which would break parallel decoders if they still tried to process the chunks in parallel.

@DavidBuchanan314

I have created a proof-of-concept implementation of parallel encoding and decoding here: https://github.com/DavidBuchanan314/parallel-png-proposal, along with a proposal for how the metadata could be encoded.

@richgel999
Author

Wow that was fast! Thank you. My implementation can use this chunk data.

> One question: should the safe-to-copy bit be set in the chunk name? I suppose it shouldn't.
>
> An unaware PNG processing tool may do things like combine or re-split IDAT chunks, which would break parallel decoders if they still tried to process the chunks in parallel.

Yea, I'm thinking it shouldn't be set. According to the PNG spec, the IDAT boundaries have "no semantic significance", so a processing tool may recompress the data. It's the safe thing to do because we can't be sure what the processing tool is going to do with the IDAT data and/or filters.
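For reference, the safe-to-copy property is carried by the case of the fourth letter of the chunk name (bit 5 of that byte), so a tool can check it like this:

```python
def is_safe_to_copy(chunk_type: bytes) -> bool:
    # Bit 5 set (lowercase 4th letter) means safe-to-copy.
    return bool(chunk_type[3] & 0x20)

assert is_safe_to_copy(b"tEXt") and not is_safe_to_copy(b"IDAT")
```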

@richgel999
Author

richgel999 commented Dec 15, 2021

I'm wondering if the chunk should have an array with offsets/lengths that point to each IDAT chunk? (Someone suggested this, not sure who/where.) It's not necessary because a PNG reader can just find them by scanning the file, but perhaps there is some value in this on very large images. Having to find all the chunks and dispatch the decode jobs is serial processing.
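For a sense of what that serial scan involves, here is a minimal sketch of walking the chunk list to collect IDAT offsets (assuming the whole file is in memory):

```python
import struct

def find_idat_offsets(png: bytes):
    """Walk the chunk list (length, type, data, CRC) after the 8-byte
    signature, collecting the data offset and length of each IDAT."""
    offsets, pos = [], 8
    while pos + 8 <= len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        if ctype == b"IDAT":
            offsets.append((pos + 8, length))
        pos += 12 + length  # 4 (length) + 4 (type) + data + 4 (CRC)
        if ctype == b"IEND":
            break
    return offsets
```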

On the flip side, an IDAT offset table could be brittle in some way I can't think of, isn't necessary, and introduces a concept into PNG that seems to go against the flow.

@richgel999
Author

richgel999 commented Dec 15, 2021

> A decoder should be able to determine the number of such pieces by calculating floor(image_height / piece_height). Each piece is stored in its own IDAT chunk.

In your proposal: shouldn't this be ceil(image_height / piece_height)? If the image height is 17 and the piece_height is 8, I would expect this to be 3 (not 2). (I filed a bug.)
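For reference, the ceiling can be computed with the usual integer idiom:

```python
piece_count = (image_height + piece_height - 1) // piece_height
assert (17 + 8 - 1) // 8 == 3  # ceil(17 / 8) == 3, not floor's 2
```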

@DavidBuchanan314

Hah, I spotted the bug around the same time as you did; I pushed a fix already.

> On the flip side, an IDAT offset table could be brittle in some way I can't think of, isn't necessary, and introduces a concept into PNG that seems to go against the flow.

Yes, I've been having similar thoughts. In practice, I don't think it'd be any faster to have an offset table, although it would be marginally faster if you wanted to implement random-access (but I'm not sure anyone would really want that?)

@DavidBuchanan314

One argument for using an offset table is that you can have your IDAT chunks be arbitrary sizes - I know some encoders like to use chunks of a certain size, although I've never understood why exactly.

@randy408

I cleaned up and published the specification I had lying around, and also added a method to encode without an offset table, for maximum bikeshedding potential: https://github.com/libspng/png-restart-marker

@DavidBuchanan314

DavidBuchanan314 commented Dec 17, 2021

So far we have three potential approaches: iDOT, pLLD, and mARK.

As a very brief summary:

(Note: here I use "piece" to refer to the subsections of the zlib stream and/or slices of the image.)

  • iDOT (as best we can tell) records the pixel height and file offset of each piece.

  • pLLD records a single piece-height value, from which the number of pieces and height of the last piece can be deterministically derived. The file offset for each piece must be determined by parsing the IDAT chunks, with one piece per chunk.

  • mARK supports two different "segmentation types":

    • Type 0 records the number of pieces, and the relative file offset for each piece. The pixel height of each piece can be derived deterministically from this information.
    • Type 1 works largely the same as pLLD, except that it records the number of pieces rather than their height; in cases where the image height is not evenly divisible, the last piece is larger (whereas in pLLD, the last piece is shorter). See the sketch below.
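As I read the two schemes, the piece geometry would be derived along these lines (a sketch of my understanding, not code from either spec):

```python
def piece_heights_plld(image_height, piece_height):
    # pLLD-style: fixed piece height; the last piece is shorter
    # when the image height is not evenly divisible.
    n = (image_height + piece_height - 1) // piece_height
    return [piece_height] * (n - 1) + [image_height - piece_height * (n - 1)]

def piece_heights_mark_type1(image_height, piece_count):
    # mARK type 1-style: fixed piece count; the last piece is larger
    # when the image height is not evenly divisible.
    base = image_height // piece_count
    return [base] * (piece_count - 1) + [image_height - base * (piece_count - 1)]

assert piece_heights_plld(17, 8) == [8, 8, 1]
assert piece_heights_mark_type1(17, 3) == [5, 5, 7]
```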

mARK has fields that encode the "segmentation method" and "segmentation type", which makes it more future-proof with respect to implementing new parallel-decoding strategies.

It is my opinion that recording the file offsets of pieces (as in iDOT and mARK Type 0) is a bad idea. Knowing the piece offsets gives you a slight advantage, in that you can immediately dispatch work to all your decoder threads. However, for a decoder to be 100% sure that an offset is valid, you need to perform a lot of checks, otherwise you risk a maliciously crafted file being ambiguous (such as I described here). For example, you could have data that looks like a valid IDAT chunk contained entirely within another IDAT chunk. The only way to detect this case is to parse all the IDAT chunks sequentially from the beginning, which somewhat defeats the purpose of knowing the offsets beforehand (although this could be done in parallel with the actual decompression).

pLLD and mARK type 1 are slightly less flexible, because each piece must occupy exactly one IDAT chunk. I believe there are sometimes reasons to keep IDAT chunks small (although I'm not entirely sure what these reasons are), so there may be a desire to have one piece occupy multiple IDAT chunks. If this is a desirable feature, then I propose an alternative approach:

Record a table of IDAT chunk indices (either in absolute or relative terms).

The pLLD and/or mARK specifications could easily be amended to support this. I believe this approach would be the best way to maximise flexibility, and minimise parsing edge-cases.

I believe @brion is looking into implementing parallel decoding in mtpng, and I would be interested to see the performance numbers, especially in terms of quantifying how much time it costs to initially iterate over the IDAT chunks if their offsets are not recorded.

@lrosenthol

As we are currently working on an update to PNG, the timing for this is quite good...

My personal preference would be to avoid anything that includes offset tables, for (a) the reasons that @DavidBuchanan314 mentioned concerning attacks, but also (b) scenarios where assets are modified w/o changing their content (e.g., updated metadata), which could well change the byte offsets.

So while I haven't looked at the proposals in depth, I would tend to favor pLLD.

@randy408

I believe error handling will be the hardest part of this. Note that mARK type 1 and pLLD still have issues with pieces possibly being too long or too short, for which the only conformant behavior is to fall back to linear decoding.

I went into the details of this and have made the argument that any error during parallel decoding must be recoverable and should be continued with linear decoding: #60 (comment).

One issue I have with pLLD is the "parallel defilterable" flag; it shouldn't exist.

If defiltering requires data from the previous piece (flag = 0), then a simple parallel decoder would have to wait for the previous piece to finish, defeating the purpose of parallel decoding. Not only would decompression have to finish on the previous piece, but defiltering too; defiltering is a linear process.

An optimal decoder uses width * 2 bytes for defiltering plus some fixed overhead; you would have to allocate more memory to buffer the uncompressed piece just to handle this corner case. It would mean additional complexity for most implementations, which is undesirable for decoders, and worker threads would still have to wait for defiltering to finish on the previous piece.

mARK only allows filter types 0 (None) and 1 (Sub) for the first scanline, which do not reference the previous piece; this makes decoding a lot simpler.
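A minimal sketch of per-piece defiltering under that restriction, illustrating the two-row rolling buffer (Average and Paeth are omitted for brevity, but they too only need the previous row):

```python
def defilter_piece(raw, width, bpp, height):
    """Defilter one decompressed piece: height rows of
    (1 filter byte + width * bpp image bytes) each."""
    stride = width * bpp
    prev = bytearray(stride)  # row above the piece; never read, since the
                              # first row must be filter type None or Sub
    out = bytearray()
    for y in range(height):
        base = y * (stride + 1)
        ftype = raw[base]
        row = bytearray(raw[base + 1 : base + 1 + stride])
        if y == 0 and ftype not in (0, 1):
            raise ValueError("fall back to linear decoding")
        if ftype == 1:    # Sub: add the byte bpp positions to the left
            for i in range(bpp, stride):
                row[i] = (row[i] + row[i - bpp]) & 0xFF
        elif ftype == 2:  # Up: add the byte directly above
            for i in range(stride):
                row[i] = (row[i] + prev[i]) & 0xFF
        elif ftype not in (0, 1, 2):
            raise ValueError("Average/Paeth omitted in this sketch")
        out += row
        prev = row        # rolling buffer: only width * bpp * 2 bytes live
    return bytes(out)
```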

To implement a decoder for it, you only have to create n-1 workers and join them at the end; memory usage scales linearly, and no complicated synchronization is required. It should be straightforward to adapt an existing implementation to run multiple instances of decoders on preassigned regions of the PNG.
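For example, assuming each piece's compressed bytes have already been located, and reusing the hypothetical `defilter_piece` sketch above, the dispatch could look like this:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def decode_piece(compressed, width, bpp, height):
    # Each piece is a byte-aligned raw-deflate segment, so wbits=-15 decodes
    # it independently (strip the 2-byte zlib header from the first piece
    # and the Adler-32 trailer from the last).
    raw = zlib.decompressobj(wbits=-15).decompress(compressed)
    return defilter_piece(raw, width, bpp, height)

def decode_parallel(pieces, width, bpp, heights):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(decode_piece, data, width, bpp, h)
                   for data, h in zip(pieces, heights)]
        return b"".join(f.result() for f in futures)
```

(In CPython, zlib releases the GIL during decompression, so the pieces genuinely decompress in parallel; the pure-Python defiltering above would not, but this only sketches the structure.)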

If you encounter an unexpected filter value, you switch to error recovery; it would mostly be error recovery code you have to add anyway.

The "IDAT inside another IDAT" scenario doesn't seem possible. If all workers have to consume all data in their regions (which is a reasonable requirement) then you are in fact implicitly guarding against that. I wouldn't drop offsets for this reason.

I have an entire page on the rationale behind the mARK chunk and the extension: https://github.com/libspng/png-restart-marker/blob/master/RATIONALE.md
