Binary File Inputs #4637

Closed
order-flow-labs opened this issue Sep 1, 2023 · 3 comments
Labels
Feature Request New feature requested in system

Comments

@order-flow-labs

Add support for allowing binary input files

Short description

As far as I can tell from the documentation, there is no support for binary files as the input source for VW. I have VW integrated into a fairly extensive research system that generates massive files (TB+ sizes). The data set generation process is geared towards regressions where the data is very well defined and consistently formatted. The bottleneck is that the feature generation process has to write floats to text files, which is notoriously slow. VW then reads the text back in and parses each float's string representation into a float again. Support for reading inputs directly from a binary file, with a schema flag, would allow a substantial speedup for many use cases.

How this suggestion will help you/others

This would lead to a much faster research process and a speedup on the VW side too, since parsing inputs would not be required. Raw casting of byte buffers to floats usually has no measurable overhead in C++. Not all VW users would need this feature, but there is definitely a subset that would gain a lot from it.
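
To make the two paths concrete, here is a minimal Python sketch (not VW code) that round-trips one record through a VW-style text line and through raw bytes. The function names, the `f0`, `f1`, ... feature names, and the little-endian float32 layout are all assumptions chosen for illustration.

```python
import struct

def encode_text(label, features):
    # VW-style text line: "label | f0:v0 f1:v1 ..." (feature names are made up)
    feats = " ".join(f"f{i}:{v!r}" for i, v in enumerate(features))
    return f"{label!r} | {feats}\n".encode("ascii")

def decode_text(line):
    # Every value has to be parsed back from its string representation
    head, _, tail = line.decode("ascii").partition("|")
    label = float(head)
    features = [float(tok.split(":", 1)[1]) for tok in tail.split()]
    return label, features

def encode_binary(label, features):
    # Assumed layout: little-endian float32 values, label first, then features
    return struct.pack(f"<{1 + len(features)}f", label, *features)

def decode_binary(buf):
    # No string parsing: the bytes are reinterpreted as float32 values
    values = struct.unpack(f"<{len(buf) // 4}f", buf)
    return values[0], list(values[1:])
```

The binary record is a fixed 4 bytes per value, while the text record's length depends on how many digits each float needs when printed.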

Possible solution/implementation details

New flags:

--binary_file=true/false
--binary_file_schema=JSON schema stating whether a feature weight is included, the name of each column, etc.

This could be implemented so that lines are still delimited by '\n' but the contents leading up to the delimiter are a raw byte vector. It would only make sense for logistic/linear regression type tasks where the data is highly consistent from one record to the next.
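
One caveat with a '\n' delimiter is that the byte 0x0A can occur inside a raw float, so a reader cannot safely split records on it. The sketch below assumes fixed-width records instead, with the record size derived from the schema. The schema keys (`has_weight`, `columns`), the column names, and the float32 layout are hypothetical and are not existing VW options.

```python
import json
import struct

# Hypothetical schema, in the spirit of the proposed --binary_file_schema flag
SCHEMA = json.loads('{"has_weight": true, "columns": ["bid", "ask", "volume"]}')

def record_format(schema):
    # Label, optional weight, then one little-endian float32 per column
    n = 1 + (1 if schema["has_weight"] else 0) + len(schema["columns"])
    return struct.Struct(f"<{n}f")

def write_records(rows, schema):
    fmt = record_format(schema)
    return b"".join(fmt.pack(*row) for row in rows)

def read_records(buf, schema):
    # Records are located by size alone; no delimiter is scanned for
    fmt = record_format(schema)
    names = ["label"] + (["weight"] if schema["has_weight"] else []) + schema["columns"]
    return [dict(zip(names, vals)) for vals in fmt.iter_unpack(buf)]

# A float32 (roughly 2.0000024) whose little-endian bytes contain 0x0A ('\n')
NEWLINE_FLOAT = struct.unpack("<f", b"\x0a\x00\x00\x40")[0]
```

With fixed-width records the reader can also seek directly to record `i` at byte offset `i * record_format(schema).size`, which a newline-delimited layout does not allow.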

@order-flow-labs order-flow-labs added the Feature Request New feature requested in system label Sep 1, 2023
@order-flow-labs order-flow-labs changed the title Binary File Inpts Binary File Inputs Sep 3, 2023
@rajan-chari
Member

There is a nice, CPU-efficient input format implemented in VW (Flatbuffers).

Command line example:
vw --cb_force_legacy --cb 2 -d train-sets/rcv1_raw_cb_small.fb --flatbuffer

Unfortunately, this feature is off by default. You can turn it on when you build VW.

if(VW_FEAT_FLATBUFFERS)

Here is the schema:
https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/fb_parser/schema/example.fbs

Helpful reference for converting to flatbuffers:
https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/test/runtests_flatbuffer_converter.py

There are extensive tests for this feature:

"vw_command": "--cb_force_legacy --cb 2 -d train-sets/rcv1_raw_cb_small.fb --flatbuffer",

If you feel you can contribute some time to the project, we welcome your involvement. The following link is a PR that's almost ready to go. It defines a more compact Flatbuffer format with similar performance. I am happy to guide you through it if you would like to push forward with it. C++ experience would be very helpful.
https://github.com/VowpalWabbit/vowpal_wabbit/pulls?q=is%3Apr+is%3Aclosed+flatbuffer

Please note that the input format in the PR above is where we want to eventually end up.

@rajan-chari
Member

Feel free to reach out to me if you would like to push this item forward for the benefit of the community.

@olgavrou
Collaborator

Closing for now; please feel free to re-open.
