Binary File Inputs #4637

order-flow-labs · 2023-09-01T13:50:44Z

Add support for allowing binary input files

Short description

As far as I can tell from the documentation, there is no support for binary files as the input source for VW. I have VW integrated into a fairly extensive research system that generates massive files (TB+ sizes). The data set generation process is geared towards regressions where the data is very well defined and consistently formatted. The bottleneck is the feature generation process having to write floats to text files, which is notoriously slow. VW then reads the text back in and parses the string representation of each float back into a float. Support for reading inputs directly from a binary file, with a flag describing the schema, would allow a substantial speedup for many use cases.

How this suggestion will help you/others

This would lead to a much faster research process, and a speedup on the VW side too, since inputs would no longer need to be parsed. Casting raw byte buffers to floats usually has no measurable overhead in C++. Not every VW user would need this feature, but there is definitely a subset that would gain a lot from it.
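To make the text-versus-binary difference concrete, here is a minimal sketch of both round trips. It is written in Python only for brevity and is not VW code. The feature values are made up, and the little-endian 32-bit float layout is an assumption for illustration.

```python
import struct

# Made-up feature values for illustration only.
features = [0.1, 2.5, -3.75, 1e-7]

# Text path: format every float as a string, then parse each token back.
text_line = " ".join(repr(x) for x in features)
parsed_from_text = [float(tok) for tok in text_line.split()]

# Binary path: pack the values as little-endian float32 (assumed layout),
# then reinterpret the bytes directly, with no string parsing involved.
record_format = f"<{len(features)}f"
payload = struct.pack(record_format, *features)
parsed_from_binary = list(struct.unpack(record_format, payload))

print(len(text_line.encode()), "bytes as text;", len(payload), "bytes as binary")
```

Note that the binary path here stores 32-bit floats, so values are rounded to float32 precision, while the text path round-trips Python's 64-bit floats exactly.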

Possible solution/implementation details

New flags:

--binary_file=true/false
--binary_file_schema=JSON schema describing whether a feature weight is included, the names of each column, etc.

This could be implemented so that lines are still delimited by '\n', but the contents leading up to the delimiter are a raw byte vector. It would only make sense for logistic/linear regression type tasks, where the data is highly consistent from one record to the next.
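A rough sketch of how a schema-driven binary reader might look, again in Python for brevity. Everything here is hypothetical: the schema keys (`label`, `weight`, `features`), the column names, and the float32 record layout are assumptions and do not correspond to any existing VW flag or format. One deliberate difference from the description above: raw float bytes can themselves contain the byte 0x0A ('\n'), so this sketch uses fixed-width records whose size is derived from the schema instead of a newline delimiter.

```python
import io
import json
import struct

# Hypothetical schema; these keys are illustrative, not a VW format.
schema_json = '{"label": true, "weight": false, "features": ["bid", "ask", "volume"]}'


def column_names(schema):
    """Return the ordered column names implied by the schema."""
    names = []
    if schema["label"]:
        names.append("label")
    if schema["weight"]:
        names.append("weight")
    return names + list(schema["features"])


def record_struct(schema):
    """Build a fixed-width little-endian float32 layout, one float per column."""
    return struct.Struct(f"<{len(column_names(schema))}f")


def write_records(stream, schema, rows):
    """Append each row to the stream as one fixed-width binary record."""
    rec = record_struct(schema)
    for row in rows:
        stream.write(rec.pack(*row))


def read_records(stream, schema):
    """Yield one dict per record by reading fixed-width chunks."""
    rec = record_struct(schema)
    names = column_names(schema)
    while True:
        chunk = stream.read(rec.size)
        if not chunk:
            break
        if len(chunk) != rec.size:
            raise ValueError("truncated record")
        yield dict(zip(names, rec.unpack(chunk)))


schema = json.loads(schema_json)
buf = io.BytesIO()
write_records(buf, schema, [(1.0, 100.5, 100.75, 250.0), (0.0, 99.25, 99.5, 10.0)])
buf.seek(0)
records = list(read_records(buf, schema))
print(records)
```

Because every record has the same width, a reader can also seek straight to record `i` at byte offset `i * rec.size`, which is convenient for TB-scale files.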


rajan-chari · 2023-09-21T13:25:11Z

There is a nice, CPU-efficient input format already implemented in VW (FlatBuffers).

Command line example:
vw --cb_force_legacy --cb 2 -d train-sets/rcv1_raw_cb_small.fb --flatbuffer

Unfortunately, this feature is off by default. You can turn it on when you build VW.

vowpal_wabbit/CMakeLists.txt

Line 53 in de52303

if(VW_FEAT_FLATBUFFERS)

Here is the schema:
https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/fb_parser/schema/example.fbs

Helpful reference for converting to FlatBuffers:
https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/test/runtests_flatbuffer_converter.py

There are extensive tests for this feature:

vowpal_wabbit/test/core.vwtest.json

Line 3100 in de52303

    
"vw_command": "--cb_force_legacy --cb 2 -d train-sets/rcv1_raw_cb_small.fb --flatbuffer",

If you feel you can contribute some time to the project, we welcome your involvement. The following link is a PR that's almost ready to go. It defines a more compact FlatBuffers format with similar performance. I am happy to guide you through it if you would like to push forward with it. C++ experience would be very helpful.
https://github.com/VowpalWabbit/vowpal_wabbit/pulls?q=is%3Apr+is%3Aclosed+flatbuffer

Please note that the input format in the PR above is where we want to eventually end up.

rajan-chari · 2023-09-21T13:26:35Z

Feel free to reach out to me if you would like to push this item forward for the benefit of the community.

olgavrou · 2023-10-12T15:35:10Z

Closing for now. Please feel free to re-open.

order-flow-labs added the Feature Request label (New feature requested in system) Sep 1, 2023

order-flow-labs changed the title ~~Binary File Inpts~~ Binary File Inputs Sep 3, 2023

olgavrou closed this as completed Oct 12, 2023
