TSV parser for Python using only NumPy ops.
This is not production-ready code, just a primer on a branchless parsing technique using vectorized code.
- support strings
- support negative integers and floats
- Read the whole file into a byte array in memory
- Find positions of tabs and decimal points
- Compute digit count for every field
- For the integer case, given the maximum number of digits in the file, precompute for every byte position and every possible digit count the integer formed by the digits ending at that position
- For every field, use the computed digit count to index into the precomputed parsed integers array
- Assemble values for the real-valued columns: the integer and fractional parts of each value are neighboring parsed integers
- Supports integer, real (only decimal point notation) and utf-8 string columns (quotes not supported)
- Uses only NumPy methods, and can be extended to GPU using JAX or PyTorch. It can also run on Pyodide.
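
The steps above can be sketched roughly as follows. This is a minimal illustration written for this README, not the actual implementation: the function names (`parse_tokens`, `parse_uint_tsv`, `parse_ufloat_tsv`) are hypothetical, and it assumes a rectangular file with no empty fields, a trailing newline, non-negative numbers only, and NumPy 1.20+ for `sliding_window_view`. The float variant additionally assumes every field contains exactly one decimal point.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

TAB, NEWLINE, DOT, ZERO = ord("\t"), ord("\n"), ord("."), ord("0")


def parse_tokens(buf, is_sep):
    """Return (value, digit count) for every digit run closed by a separator."""
    ends = np.flatnonzero(is_sep)                  # index of the separator closing each token
    starts = np.concatenate(([0], ends[:-1] + 1))  # index of each token's first digit
    ndigits = ends - starts                        # digit count of every token
    max_digits = int(ndigits.max())

    # Separator bytes turn into junk "digits", but the lookup below never reads them.
    digits = buf.astype(np.int64) - ZERO
    padded = np.concatenate((np.zeros(max_digits - 1, dtype=np.int64), digits))

    # windows[i, j] == digits[i - j]: the digit j places to the left of position i
    windows = sliding_window_view(padded, max_digits)[:, ::-1]

    # table[i, k - 1] == integer formed by the k digits ending at position i
    powers = 10 ** np.arange(max_digits, dtype=np.int64)
    table = np.cumsum(windows * powers, axis=1)

    # One gather per token: last digit position and digit count pick the value.
    return table[ends - 1, ndigits - 1], ndigits


def count_columns(buf):
    """Number of columns, taken from the tabs on the first line."""
    first_newline = np.argmax(buf == NEWLINE)
    return np.count_nonzero(buf[:first_newline] == TAB) + 1


def parse_uint_tsv(data: bytes) -> np.ndarray:
    buf = np.frombuffer(data, dtype=np.uint8)
    values, _ = parse_tokens(buf, (buf == TAB) | (buf == NEWLINE))
    return values.reshape(-1, count_columns(buf))


def parse_ufloat_tsv(data: bytes) -> np.ndarray:
    buf = np.frombuffer(data, dtype=np.uint8)
    # Treat the decimal point as one more separator, so each field yields two tokens.
    is_sep = (buf == TAB) | (buf == NEWLINE) | (buf == DOT)
    values, ndigits = parse_tokens(buf, is_sep)
    whole, frac = values[0::2], values[1::2]
    reals = whole + frac / 10.0 ** ndigits[1::2]
    return reals.reshape(-1, count_columns(buf))


ints = parse_uint_tsv(b"1\t22\t333\n4000\t5\t66\n")
reals = parse_ufloat_tsv(b"1.5\t20.25\n3.125\t0.5\n")
```

The lookup table costs one `int64` per byte per possible digit count, so this trades memory for the absence of per-field branching. The only data-dependent quantity is the maximum digit count, and no Python-level loop runs over fields.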
For some truly fascinating vectorized parsing, check out simdjson and csvmonkey.