
Streaming parser #205

Open · wants to merge 6 commits into master
Conversation

Daniel-Diaz (Contributor)

I added the ability to consume YAML files incrementally using conduits. This is exposed as the parseStream function in the module Data.Yaml.Internal.

The motivation was simply wanting to consume large files without loading them fully into memory. I don't know what the demand for something like this is, but I wanted it.

I was thinking of maybe defining a wrapper in Data.Yaml that would be easier to use and discover. However, I wasn't sure what that would look like. In any case, exporting the conduit works well enough for me.

I also had to export ParseState; otherwise I wasn't able to run the parser.

To try it out, I wrote the following code:

{-# LANGUAGE ImportQualifiedPost, OverloadedStrings #-}

module Main (main) where

import Data.Void (Void)
import Data.Aeson (FromJSON, parseJSON, withObject, (.:), fromJSON)
import Data.Aeson qualified as JSON
import Conduit (ConduitM, runConduit, (.|), runResourceT)
import Data.Conduit.List qualified as CL
import Data.Yaml.Internal (Parse, ParseState (..), parseStream)
import Control.Monad.Reader (runReaderT)
import Control.Monad.State.Strict (evalStateT)
import Text.Libyaml (decodeFile)

data ID = ID { id_field :: Int }

instance FromJSON ID where
  parseJSON = withObject "ID" $ \o -> fmap ID $ o .: "id"

main :: IO ()
main = do
  let f :: Int -> JSON.Value -> Int
      f i v =
        case fromJSON v of
          JSON.Error err -> error err
          JSON.Success x -> i + id_field x
  let conduit :: ConduitM () Void Parse Int
      conduit = decodeFile "big.yaml"
             .| runReaderT parseStream []
             .| CL.fold f 0
  r <- runResourceT $ evalStateT (runConduit conduit) $ ParseState mempty []
  print r

And then I created a 1.5 GB file (big.yaml). The memory usage still grows as the program runs, but it does so very slowly, and the program finishes before it becomes a problem (it ends up taking close to 3 GB, though). Parsing the entire file non-incrementally quickly used up all of my available memory (31.2 GB). I was expecting the conduit version to take a roughly constant amount of memory. I wonder what causes the slow build-up. Could it be related to the parser state?
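One base-only way to see the kind of behavior described above is the classic lazy-accumulator problem: an unforced accumulator builds a chain of thunks proportional to the input size. This is only a hypothetical analogy for the slow growth seen here (the actual culprit could just as well be the anchor map inside ParseState); the sketch below does not use the yaml library at all.

```haskell
-- A minimal illustration of lazy vs strict accumulation, not the yaml
-- library itself. foldl leaves a chain of unevaluated (+) thunks, so
-- memory grows with the list; foldl' forces the accumulator at each
-- step and runs in constant space. If the accumulator threaded through
-- the conduit fold (or state kept in ParseState) is similarly unforced,
-- it could produce the same gradual memory growth.
import Data.List (foldl')

lazySum, strictSum :: [Int] -> Int
lazySum   = foldl  (+) 0  -- builds thunks proportional to list length
strictSum = foldl' (+) 0  -- forces each intermediate sum

main :: IO ()
main = print (strictSum [1 .. 1000000])
```

Running the lazy version over a large enough list with a small heap (e.g. +RTS -M) would abort with a heap overflow where the strict version succeeds; both compute the same result.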

Thanks in advance for your review.

me <- lift CL.head
case me of
  Just (EventSequenceStart _ _ a) -> parseArrayStreaming 0 a
  _ -> liftIO $ throwIO $ UnexpectedEvent me $ Just $ EventSequenceStart NoTag AnySequence Nothing
Daniel-Diaz (Contributor, Author)

I added some values here for the "expected" event, hoping that would be more helpful than using Nothing.

-- In combination with 'Y.decodeFile', it can be used to consume
-- a YAML file that consists of a very large list of values.
parseStream :: ReaderT JSONPath (ConduitM Event Value Parse) ()
parseStream = do
Daniel-Diaz (Contributor, Author)

I can change the name of this if there's a better name for it.

@@ -5,7 +5,7 @@ cabal-version: 1.12
 -- see: https://github.com/sol/hpack

 name: yaml
-version: 0.11.7.0
+version: 0.11.8.0
Daniel-Diaz (Contributor, Author)

I didn't change this myself; I don't know why it kept happening.

snoyberg (Owner)

I don't think this would make a good addition here at this time. Such an approach would make more sense as a separate library to allow the API design to evolve instead of trying to promote it to a stable package immediately.

One thought is that it seems to me like the ReaderT is misplaced. It would logically make more sense to me to have the ReaderT wrapping around the Parse monad inside the ConduitT instead. There may be some limitations I didn't think of that make that difficult/impossible.
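For reference, the restructuring suggested here would change the type roughly as follows. This is only a sketch: Parse, Event, Value, and JSONPath are the types already used in the PR, parseStream' is a hypothetical name, and whether Parse's internals actually permit this is exactly the open question raised above.

```haskell
-- Current shape in the PR: ReaderT wraps the whole conduit.
parseStream :: ReaderT JSONPath (ConduitM Event Value Parse) ()

-- Suggested alternative: keep the conduit outermost and push the
-- ReaderT into the underlying monad, next to Parse.
parseStream' :: ConduitM Event Value (ReaderT JSONPath Parse) ()
```

With the second shape, callers would no longer need a runReaderT between conduit stages; the JSONPath environment would be supplied once for the whole pipeline, e.g. by hoisting the underlying monad with transPipe. I haven't verified this against the definition of Parse, so treat it as a direction rather than a working signature.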

Daniel-Diaz (Contributor, Author) commented May 15, 2022

> I don't think this would make a good addition here at this time. Such an approach would make more sense as a separate library to allow the API design to evolve instead of trying to promote it to a stable package immediately.
>
> One thought is that it seems to me like the ReaderT is misplaced. It would logically make more sense to me to have the ReaderT wrapping around the Parse monad inside the ConduitT instead. There may be some limitations I didn't think of that make that difficult/impossible.

OK. It might be a good idea, but I don't have the time right now to create a separate project. Perhaps another time. For now, I will just use my fork and merge any updates from your repository.

The reason I added it here instead of in my own code is that it uses internal functions that are not exported, and I also thought it might be useful to other people. I simply copied parseAll and modified it to yield the values downstream instead of gathering them in a list. So it's just a modification of one of the functions that's already there. This is also why ReaderT is in the place it is.

Thanks for the prompt reply!


2 participants