Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug with untar where sometimes pax headers are not parsed properly #37

Closed
bitc opened this issue Jan 4, 2024 · 5 comments
Closed

Bug with untar where sometimes pax headers are not parsed properly #37

bitc opened this issue Jan 4, 2024 · 5 comments

Comments

@bitc
Copy link

bitc commented Jan 4, 2024

Hello, I am trying out the new pax support added in #34

Thank you, it is a great feature and very much appreciated.

I am using the untar function to extract a tar file that contains long pathnames (path=my-long-path pax records).

It works 99% well but occasionally fails to parse the pax header for some files and ends up ignoring the long pathname.

I haven't fully diagnosed this, but the problem seems to be in the parsePax function.

parsePax :: Monad m => ConduitM TarChunk TarChunk (StateT PaxState m) PaxHeader
parsePax = await >>= \case
Just (ChunkPayload _ b) -> pure $ paxParser b
_ -> pure mempty

This function reads chunks containing a ByteString payload, and then calls paxParser with this ByteString value.

The problem is that sometimes this ByteString value is truncated, for example:

"70 path=very/very/very/very/long/file/path/that/got/tru"

The "70" indicates the length and you can see it got truncated and there is also no trailing "\n" char.

Immediately afterwards my traces show that there is a ChunkPayload containing the rest of the header ("ncated.txt\n").

When paxParser is called with the truncated header, the parse fails and the header is ignored, causing errors with the file.

So this seems to be an error related to the chunking that is done in conduit, and happens on the rare occasion when a pax header happens to cross a ByteString chunk boundary. When I change the conduit input to be a network stream instead of a file, then I think the input chunking is a bit different, and I get errors in a different pax header entry.

@bitc
Copy link
Author

bitc commented Jan 4, 2024

I analysed this further and it was as I suspected above. The bug occurs when the incoming flow of ByteStrings through the conduit are short, and we get lots of small ChunkPayloads instead of one contiguous one.

Here is the "untar, pax interchange format" test that I played around with:

tar-conduit/tests/Spec.hs

Lines 313 to 318 in 96b18aa

it "untar, pax interchange format" $ do
res <- runConduitRes $
paxExample
.| untar process
.| sinkList
pure res `shouldReturn` [("filepath", "payload")]

I can make this test fail by "chopping" the conduit input bytes such that they flow in a single byte at a time:

     it "untar, pax interchange format" $ do
         res <- runConduitRes $
                paxExample
+            .| chop
             .| untar process
             .| sinkList
         pure res `shouldReturn` [("filepath", "payload")]

Here is the "chop" function:

chop :: (MonadIO m) => ConduitT ByteString ByteString m ()
chop = do
  next <- await
  case next of
    Nothing -> pure ()
    Just val -> do
      forM_ (splitEvery 1 (S.unpack val)) $ \chunk -> do
        yield (S.pack chunk)
      chop

splitEvery :: Int -> [a] -> [[a]]
splitEvery n = P.takeWhile (not . P.null) . P.map (P.take n) . P.iterate (P.drop n)

With this modification, the test fails with:

pax
  untar, pax interchange format [✘]

Failures:

  tests/Spec.hs:320:5: 
  1) pax untar, pax interchange format
       uncaught exception: TarException
       UnexpectedPayload 513

I have a fix, I will post it here shortly.

@bitc
Copy link
Author

bitc commented Jan 4, 2024

Here is the fix I came up with. parsePax should read all of the ChunkPayloads that are available in the input (instead of just the first). It should concatenate all of their payloads together and then call paxParser on the result.

-parsePax :: Monad m => ConduitM TarChunk TarChunk (StateT PaxState m) PaxHeader
-parsePax = await >>= \case
-    Just (ChunkPayload _ b) -> pure $ paxParser b
-    _ -> pure mempty
+parsePax :: Monad m => ConduitM TarChunk TarChunk (StateT PaxState m) PaxHeader
+parsePax = do
+    payloads <- readAllChunkPayloads []
+    let b = S.concat (reverse payloads)
+    pure $ paxParser b
+
+readAllChunkPayloads :: Monad m => [ByteString] -> ConduitT TarChunk o m [ByteString]
+readAllChunkPayloads accum = do
+    next <- await
+    case next of
+        Nothing -> pure accum
+        Just (ChunkPayload _ b) -> readAllChunkPayloads (b:accum)
+        Just other -> do
+            leftover other
+            pure accum

Someone who is better at conduit than me can probably write this in a more elegant way that is more efficient.

This fixes the original issue I had.

It also fixes the test crash with the "chop" modification I described above in #37 (comment) (note that the test will still fail, but this time for a different reason -- a flaw in the test assertion can that be easily fixed).

@mpilgrem
Copy link
Collaborator

mpilgrem commented Jan 4, 2024

@bitc, I am sorry that #34 was not bug-free. I'll work through your analysis but I'm likely not in a position to do that until the coming weekend.

@mpilgrem
Copy link
Collaborator

mpilgrem commented Jan 5, 2024

@bitc, I agree fully with your analysis and I have a pull request that reflects it and your solution: #39. My original mistake was not realising that pax extended header data could be provided in more than one sequential chunk.

The only thing I changed was that it seemed to me that the accumulator could be of type ByteString, rather than going via a list of ByteString and S.concat. I am assuming that in the great majority of cases, in practice, the pax extended header data would be provided in only a small number of sequential chunks.

@bitc
Copy link
Author

bitc commented Jan 6, 2024

Thank you! The fix in #39 looks perfect and I have tested it in my code and the problem is gone 👍

Thank you again for this feature, pax support in tar-conduit is very useful and valuable 💯

mpilgrem added a commit that referenced this issue Jan 6, 2024
Fix #37 Handle multi-chunk pax extended header data
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants