TSV parser for Python in pure vectorized NumPy code

vadimkantorov/fasttsv

fasttsv

A TSV parser for Python that uses only NumPy ops.

This is not production-ready code, just a primer on a branchless parsing technique using vectorized code.

TODO

  • support strings
  • support negative integers and floats

Approach

  1. Read the whole file into a byte array in memory
  2. Find positions of tabs and decimal points
  3. Compute digit count for every field
  4. For the integer case, given the maximum number of digits in any field of the file, precompute, for every possible digit count, the integer formed by that many digits ending at each position
  5. For every field, use the computed digit count to index into the precomputed parsed integers array
  6. Assemble values for the real-valued columns: the integral and fractional parts are neighboring parsed integers
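
The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's actual code: the function names `parse_uint_tokens`, `parse_uints` and `parse_reals` are hypothetical, and the sketch assumes non-negative numbers only, a buffer that ends with a newline, at most 18 digits per token, and (for `parse_reals`) exactly one decimal point in every field. The only Python loop runs over digit counts, never over bytes or fields.

```python
import numpy as np

def parse_uint_tokens(buf, is_sep):
    """Parse every run of ASCII digits that is terminated by a separator byte.

    Returns (values, ndigits) with one entry per separator in `buf`.
    Assumes `buf` ends with a separator and no token exceeds 18 digits.
    """
    sep_pos = np.flatnonzero(is_sep)                  # step 2: separator positions
    starts = np.concatenate(([0], sep_pos[:-1] + 1))  # first byte of each token
    ndigits = sep_pos - starts                        # step 3: digit count per token
    max_digits = int(ndigits.max())
    digits = buf.astype(np.int64) - ord("0")          # garbage at separators, never read
    # Step 4: table[k, i] = integer formed by the k digits ending at position i.
    table = np.zeros((max_digits + 1, buf.size), dtype=np.int64)
    for k in range(1, max_digits + 1):
        shifted = np.zeros_like(digits)
        shifted[k - 1:] = digits[:buf.size - (k - 1)]  # digit k-1 places to the left
        table[k] = table[k - 1] + shifted * 10 ** (k - 1)
    # Step 5: pick, for each token, the entry matching its own digit count.
    return table[ndigits, sep_pos - 1], ndigits

def parse_uints(data):
    buf = np.frombuffer(data, dtype=np.uint8)         # step 1: bytes in memory
    is_sep = (buf == ord("\t")) | (buf == ord("\n"))
    values, _ = parse_uint_tokens(buf, is_sep)
    return values

def parse_reals(data):
    buf = np.frombuffer(data, dtype=np.uint8)
    # Treating '.' as a separator makes integral and fractional parts neighbors.
    is_sep = (buf == ord("\t")) | (buf == ord("\n")) | (buf == ord("."))
    values, ndigits = parse_uint_tokens(buf, is_sep)
    integral, frac = values[0::2], values[1::2]
    # Step 6: scale the fractional part by its own digit count.
    return integral + frac / 10.0 ** ndigits[1::2]
```

For a file with a known number of columns, the flat result can be reshaped into rows, for example `parse_uints(b"1\t23\n456\t7\n").reshape(-1, 2)`. The lookup table costs `(max_digits + 1) * len(buf)` 64-bit integers, which is the price this sketch pays for avoiding per-byte branching.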

Features, scope and limitations

  1. Supports integer, real (decimal-point notation only) and UTF-8 string columns (quoting is not supported)
  2. Uses only NumPy methods, so it can be extended to the GPU using JAX or PyTorch. It can also run in Pyodide.

Further reading

For some truly fascinating vectorized parsing check out simdjson and csvmonkey.
