Incremental or streaming decoding #10

Open
gwils opened this Issue Mar 7, 2018 · 5 comments

Collaborator

gwils commented Mar 7, 2018

Currently sv will parse and load an entire document into memory before starting any decoding. On a 5GB CSV file, this would likely end in disaster.

It would be worth looking into whether we could add a "row-at-a-time" approach and what the trade-offs would be.
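For illustration only (this is not sv's API; `decodeRow` is a hypothetical stand-in for sv's field decoding), a row-at-a-time pass over a lazily read file keeps memory roughly constant, since each line can be decoded and discarded before the next is forced:

```haskell
-- Hypothetical sketch, not sv's API: decode a two-column CSV
-- row-at-a-time from a lazy ByteString, so only one row needs
-- to be resident in memory at a time.
import qualified Data.ByteString.Lazy.Char8 as BL

-- A toy per-row decoder standing in for sv's field decoding.
decodeRow :: BL.ByteString -> Either String (Int, BL.ByteString)
decodeRow row =
  case BL.split ',' row of
    [a, b] -> case BL.readInt a of
                 Just (n, rest) | BL.null rest -> Right (n, b)
                 _                             -> Left ("bad int: " ++ BL.unpack a)
    _      -> Left ("bad row: " ++ BL.unpack row)

main :: IO ()
main =
  -- In practice this would come from BL.readFile on the 5GB file.
  let input = BL.pack "1,one\n2,two\n3,three\n"
  in mapM_ (print . decodeRow) (BL.lines input)
```

This doesn't handle quoting or embedded newlines, which is exactly why a proper streaming decoder in sv itself would be valuable.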

axman6 commented Mar 7, 2018

Row-at-a-time or a chunk of rows at a time would be good. Streaming individual rows is going to be inefficient in many cases (such as the time–double value examples I've shown before), so having something like:

```haskell
stream :: Monad m
       => Int
       -> Parser a
       -> ByteString m r
       -> Stream (Of (Vector a)) m (Either (Message, ByteString m r) r)
```

(see https://hackage.haskell.org/package/streaming-utils-0.1.4.7/docs/Data-Attoparsec-ByteString-Streaming.html#v:parsed)

would be quite useful.

Also important is streaming serialisation.
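The chunking half of this suggestion can be sketched with plain lists, independently of the streaming package (`chunksOf` here is a local definition for illustration, not an sv or streaming export); the point is that batching rows into `Vector`s amortises per-element streaming overhead:

```haskell
-- Sketch of the "chunk of rows at a time" idea using plain lists;
-- the real proposal would yield Vector chunks in a Stream.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0    = []
  | null xs   = []
  | otherwise = take n xs : chunksOf n (drop n xs)

main :: IO ()
main = print (chunksOf 2 [1 .. 5 :: Int])
-- prints [[1,2],[3,4],[5]]
```

A consumer then pays the streaming machinery's cost once per chunk rather than once per row.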

LeanderK commented Mar 7, 2018

Hello, I am currently working on something comparable to a data-frame library and just stumbled upon this package. Looks great! 🙂

I would love to use this package for parsing CSVs etc., but my library is fundamentally streaming-based, so this feature is important to me. I would also like a more low-level hook, since I am not sure yet which streaming package I want to integrate with.

tonyday567 commented Mar 10, 2018

A quick experiment: https://github.com/tonyday567/streaming-sv/

I got a fair way towards streaming with the existing library. The main blocker seemed to be the list in Records.


gwils commented Mar 10, 2018

Hi Tony. That's quite interesting. Thanks for linking it.

> The main blocker seemed to be the list in Records.

Do you mean the vector?

```haskell
data Records s =
    EmptyRecords
  | Records (Record s) (Vector (Newline, Record s))
```

Perhaps we could change that structure to better support streaming, or create a separate, more stream-oriented structure as an alternative?

tonyday567 commented Mar 10, 2018

Yes, I meant the Vector in Records. A streaming version would be something like:

```haskell
newtype RecordsS m s = RecordsS (Stream (Of (Newline, Record s)) m ())
```

Not sure what to do about the m. You might be able to swallow it with an existential.

I had to hardcode an Identity, as in `SvParser (B.ByteString Identity ())`, in the example, but the input is going to come out of a file as a `B.ByteString IO ()` (say), so there may need to be another type parameter anyway, and that would propagate up.

But it's impressive that streaming works out of the box without any prior engineering. It shows you're on the right track with these types.
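A toy version of that shape (the names `StreamOf`, `fromList`, and `toList` are illustrative stand-ins for streaming's `Stream (Of a) m r` machinery, not sv's types) shows how the base monad `m` in a `RecordsS`-like type can be hardcoded to `Identity` for pure input or instantiated to `IO` for file input:

```haskell
-- Toy stand-in for streaming's Stream (Of a) m r: a monadic list,
-- polymorphic in its base monad m.
import Data.Functor.Identity (Identity, runIdentity)

newtype StreamOf m a = StreamOf (m (Maybe (a, StreamOf m a)))

fromList :: Monad m => [a] -> StreamOf m a
fromList []       = StreamOf (pure Nothing)
fromList (x : xs) = StreamOf (pure (Just (x, fromList xs)))

toList :: Monad m => StreamOf m a -> m [a]
toList (StreamOf m) = do
  step <- m
  case step of
    Nothing        -> pure []
    Just (x, rest) -> (x :) <$> toList rest

main :: IO ()
main = do
  -- Pure input: m hardcoded to Identity, as in the experiment above.
  print (runIdentity (toList (fromList "abc")))
  -- Effectful input: the same stream type with m ~ IO,
  -- as it would be when reading from a file.
  toList (fromList [1 :: Int, 2, 3]) >>= print
```

Since the same definitions serve both instantiations, keeping `m` as a type parameter (rather than hiding it existentially) seems like the least surprising option, at the cost of it propagating up through `SvParser`.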
