Binary File Inputs #4637
Comments
There is a nice CPU-efficient input format implemented in VW (Flatbuffers).
Command line example:
Unfortunately, this feature is off by default. You can turn it on when you build VW (line 53 in de52303).
Here is the schema:
Helpful reference: Converting to flatbuffers
There are extensive tests for this feature: vowpal_wabbit/test/core.vwtest.json, line 3100 in de52303.
If you feel you can contribute some time to the project, we welcome your involvement. The following link is a PR that's almost ready to go. It defines a more compact Flatbuffer format with similar performance. I am happy to guide you through it if you would like to push forward with it. C++ experience would be very helpful. Please note that the input format in the PR above is where we want to eventually end up.
Feel free to reach out to me if you would like to push this item forward for the benefit of the community.
Closing for now, please feel free to re-open.
Add support for allowing binary input files
Short description
As far as I can tell from the documentation, there is no support for binary files as the input source for VW. I have VW integrated into a pretty extensive research system that generates massive files (TB+ sizes). The data set generation process is geared towards regressions where the data is very well defined and consistently formatted. The bottleneck comes from the feature generation process having to write floats to text files (which is a notorious slowdown). Furthermore, VW just reads this in and parses the string representation of each float back into a float. Support for reading inputs directly from a binary file, with a schema flag, would allow for a substantial speedup in a lot of use cases.
How this suggestion will help you/others
This would lead to a much faster research process, and a speedup on the VW side too, as parsing text inputs would not be required. Raw casting of byte buffers to floats usually has no measurable overhead in C++. Not all users of VW would need this feature, but there is definitely a subset that would gain a lot from it.
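To illustrate the raw-cast idea, here is a minimal sketch of turning a byte buffer back into floats. The helper names `encode_floats` and `decode_floats` are hypothetical and not part of VW; the sketch assumes the writer and reader share the same float size and byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not a VW API): serialize floats as raw native-endian bytes.
std::vector<unsigned char> encode_floats(const std::vector<float>& values) {
  std::vector<unsigned char> bytes(values.size() * sizeof(float));
  if (!values.empty()) std::memcpy(bytes.data(), values.data(), bytes.size());
  return bytes;
}

// Hypothetical helper (not a VW API): reinterpret a raw byte buffer as floats.
// memcpy is used instead of a pointer cast to avoid alignment and
// strict-aliasing problems; compilers typically lower it to a plain copy.
std::vector<float> decode_floats(const std::vector<unsigned char>& bytes) {
  assert(bytes.size() % sizeof(float) == 0 && "buffer must hold whole floats");
  std::vector<float> values(bytes.size() / sizeof(float));
  if (!bytes.empty()) std::memcpy(values.data(), bytes.data(), bytes.size());
  return values;
}
```

No text formatting or parsing happens on either side, which is the saving this request is after.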
Possible solution/implementation details
new flags:
This can be implemented so that lines are still delimited by '\n', but the contents leading up to the delimiter are a raw byte vector. It would only make sense for logistic/linear regression type tasks, where the data is highly consistent from one record to the next.
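A rough sketch of what such a record could look like, under assumptions of my own: each record is a float label followed by a fixed number of float features and a trailing '\n', all in native byte order. The `Record` struct and the `write_record` / `read_records` functions are hypothetical, not an existing VW format. One caveat: the bytes of a float can themselves equal 0x0A, so the reader has to take the record width from the schema and can only use '\n' as a sanity check, not as a way to find record boundaries.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record layout (an assumption, not a VW format):
// [label: float][n_features floats]['\n'], all native-endian.
struct Record {
  float label;
  std::vector<float> features;
};

// Serialize one record as raw bytes followed by a '\n' delimiter.
std::string write_record(const Record& r) {
  std::string out(sizeof(float) * (1 + r.features.size()) + 1, '\0');
  std::memcpy(&out[0], &r.label, sizeof(float));
  if (!r.features.empty())
    std::memcpy(&out[sizeof(float)], r.features.data(),
                r.features.size() * sizeof(float));
  out.back() = '\n';
  return out;
}

// Read every record in buf. The record width comes from the schema
// (n_features), not from scanning for '\n', because float bytes may
// contain 0x0A. The delimiter is only verified at its expected offset.
std::vector<Record> read_records(const std::string& buf, std::size_t n_features) {
  const std::size_t width = sizeof(float) * (1 + n_features) + 1;
  if (buf.size() % width != 0) throw std::runtime_error("truncated record");
  std::vector<Record> records;
  for (std::size_t pos = 0; pos < buf.size(); pos += width) {
    if (buf[pos + width - 1] != '\n')
      throw std::runtime_error("missing record delimiter");
    Record r;
    r.features.resize(n_features);
    std::memcpy(&r.label, &buf[pos], sizeof(float));
    if (n_features > 0)
      std::memcpy(r.features.data(), &buf[pos + sizeof(float)],
                  n_features * sizeof(float));
    records.push_back(std::move(r));
  }
  return records;
}
```

With a fixed width per record the delimiter is redundant, but keeping it gives a cheap check that the file and the schema agree.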