ndjson for training data persistence #200

Closed
tomdavidson opened this issue Apr 13, 2022 · 1 comment

@tomdavidson

Is your feature request related to a problem? Please describe.
Marked records are stored as individual Parquet files. Parquet is an "immutable" binary format that is difficult to view and edit without special tools.

Exporting and importing labels has lots of extra motion: https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/exportlabeleddata

I recently had to remove some records marked as unsure when I learned that they should have been marked as matches. With about 250 marks, it was quite a pain to go through and find the "offending files".

Describe the solution you'd like
The labels are small data and do not need a columnar binary format. Storing all the records in a single plaintext file such as NDJSON would make them self-describing, appendable, universal, and accessible. This probably applies to other files Zingg persists too.
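
To make the idea concrete, here is a minimal sketch of what working with labels in one NDJSON file could look like, using only the Python standard library. The field names `pair_id` and `label` and the helper functions are hypothetical and chosen for illustration; they are not Zingg's actual schema or API.

```python
import json
import os
import tempfile


def append_label(path, record):
    """Append one labeled record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_labels(path):
    """Load every non-empty line of the file as one record."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def remove_labels(path, should_remove):
    """Rewrite the file without the records matching should_remove.

    Returns the number of records dropped.
    """
    records = read_labels(path)
    kept = [r for r in records if not should_remove(r)]
    with open(path, "w", encoding="utf-8") as f:
        for r in kept:
            f.write(json.dumps(r) + "\n")
    return len(records) - len(kept)


if __name__ == "__main__":
    # Hypothetical records; a real file would carry the compared fields too.
    demo_path = os.path.join(tempfile.mkdtemp(), "labels.ndjson")
    append_label(demo_path, {"pair_id": 1, "label": "match"})
    append_label(demo_path, {"pair_id": 2, "label": "unsure"})
    dropped = remove_labels(demo_path, lambda r: r["pair_id"] == 2)
    print(dropped, read_labels(demo_path))
```

With this layout, finding and dropping a wrongly marked pair is one filter over one file, and the same fix could also be made by hand in any text editor.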

Describe alternatives you've considered
CSV is problematic because its minimal spec has neither types nor lists. Another alternative could be a database for Zingg training data, labels, stop words, synonyms, models, and a future API for CLIs and web apps, but I think the JSON file would deliver immediate value with a lot less effort.
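
The CSV limitation can be seen in a small round trip with Python's standard `csv` and `json` modules. The record and its field names (`pair_id`, `label`, `aliases`) are made up for illustration and do not reflect Zingg's real schema.

```python
import csv
import io
import json

# Hypothetical labeled record with an integer and a list field.
record = {"pair_id": 7, "label": "match", "aliases": ["Tom", "Thomas"]}

# CSV round trip: every value comes back as a plain string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
from_csv = next(csv.DictReader(buf))

# JSON round trip: the integer and the list survive unchanged.
from_json = json.loads(json.dumps(record))

print(from_csv)
print(from_json)
```

After the CSV round trip the integer is the string `"7"` and the list is the string form of a Python list, so a reader has to guess the types back. The JSON round trip returns a record equal to the original.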

@sonalgoyal
Member

The updateLabels phase addresses this. We also have ways to convert training data to other formats using standard Spark functionality.

@sonalgoyal added this to the 0.4 milestone Nov 28, 2023