These changes are already on master, this PR is meant to inform future iterations.
It might be worth starting with the very bottom section, "In an ideal world".
## What is `hck`?

`hck` is a near clone of `cut`, but faster and with more features, such as rearranging output columns, regex delimiters, and selection by headers. A few more features are planned as well, such as regex filters on columns and possibly some kind of nice hookup to `bat` or `fzf`.

As it stands, in my own benchmarking (in other words, probably flawed and biased benchmarks), `hck` is on par with `tsv-utils` from eBay, which seems to be about the fastest toolkit out there. I would like to keep or exceed this performance.

## How does it work?
### Setup

`hck` takes a list of files or stdin, plus some subset of fields, as options. Fields are parsed by the `FieldRange` module/struct; the code for that was adapted from the Rust coreutils version of `cut`.

A writer is opened on stdout (using `grep_cli`, to avoid the rustc unbuffered-stdout issue).

For each input, the `run` function in `main.rs` is called. It takes a reader, a writer, and the parsed options. The field ranges are re-parsed for each input source, since the headers of each input could be in a different order. This is not ideal and could be changed in the future to do it all right at the start, or at least to parse the index-based fields only once.
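The `FieldRange` internals aren't reproduced here, but as a rough, hypothetical illustration of what `cut`-style field parsing involves (names and error handling are my own, not the repo's):

```rust
// Hypothetical sketch of cut-style field parsing; the real `FieldRange`
// struct (adapted from Rust coreutils) is more involved.
fn parse_fields(spec: &str) -> Result<Vec<(usize, usize)>, String> {
    spec.split(',')
        .map(|part| match part.split_once('-') {
            // "N-M": a 1-based inclusive range, stored 0-based.
            Some((lo, hi)) => {
                let lo: usize = lo.parse().map_err(|_| format!("bad field: {part}"))?;
                let hi: usize = hi.parse().map_err(|_| format!("bad field: {part}"))?;
                Ok((lo - 1, hi - 1))
            }
            // "N": a single 1-based field.
            None => {
                let n: usize = part.parse().map_err(|_| format!("bad field: {part}"))?;
                Ok((n - 1, n - 1))
            }
        })
        .collect()
}

fn main() {
    // Select field 1 and fields 3 through 5.
    println!("{:?}", parse_fields("1,3-5").unwrap()); // [(0, 0), (2, 4)]
}
```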
In the `run` function the reader and writer are wrapped in `BufReader`/`BufWriter`, and a `core::Core` object is created to manage the actual parsing and writing of data.

At this point we hit the first limitation imposed by the current design: if we are splitting the line based on a regex we call `core.process_reader_regex`, and if we are using a substring delimiter we call `core.process_reader_substr`. These methods differ only in how the string is split.
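To make that duplication concrete, here is a hypothetical, std-only sketch of two splitters that are identical except for the split call (a char predicate stands in for the regex case; the real code uses `regex::bytes::Regex` there):

```rust
// Hypothetical sketch of the duplication: two functions identical except for
// how the line is split. A char predicate stands in for the regex case so the
// example stays std-only.
fn fields_substr<'a>(line: &'a str, delim: &str) -> Vec<&'a str> {
    line.split(delim).collect() // literal-delimiter split
}

fn fields_pattern(line: &str) -> Vec<&str> {
    line.split(|c: char| c == ' ' || c == '\t') // stand-in for `[ \t]+`
        .filter(|f| !f.is_empty())
        .collect()
}

fn main() {
    println!("{:?}", fields_substr("a,b,c", ",")); // ["a", "b", "c"]
    println!("{:?}", fields_pattern("a \t b")); // ["a", "b"]
}
```

Everything downstream of the split call is the same in both, which is exactly the code that ends up duplicated.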
### Parsing lines

Looking at `core::Core::process_reader_*`, the main loop there was taken from `bstr`'s `for_line_with*` code. It is a nearly zero-copy line reader and was the fastest of the many ways I tried parsing lines. The function itself in `bstr` required a closure; more on that in a second.

### The cached `Vec<Vec<&str>>` issue

The "process a line" code is repeated twice in the loop. This is bad and I'd love to change it. For each line, because we are reordering the outputs, we need to store references to the parsed fields in the "new" order, since we must (or at least it seems easiest/fastest to) iterate over the fields in their input order. This is done by storing groups of fields on the `shuffler` `Vec` as `Vec<Vec<&str>>`. We want to avoid allocating and deallocating that vec over and over for each line, which turns out to be really hard and annoying.

See my questions here and here to see how I ended up with what you see now.
The short explanation is that `empty_shuffler` is created outside the loop with a `&'static str` inner type, then coerced inside the loop to give the `&str` a shorter lifetime, so that the compiler will allow references to the current buffer to be stored in a container that lives longer than that data. When the fields have been written out in their new order, the inner `Vec<&str>`s are drained, so the vec is effectively recycled; but rustc doesn't know that, so we transmute the `shuffler` back into the `empty_shuffler` (again, see the questions above for more details on how and why this works).

**HELPPPPP:** In `src/lib/line_parser.rs` you will see an attempt at moving this logic elsewhere to compartmentalize things. Every attempt I've made at this falls apart with lifetime errors around the `empty_shuffler` and its inners living longer than the data that was just read. The lifetime coercion seems to fall apart across a function boundary, and it especially does not work with a closure.

### The split issue
This leads directly into the next issue. We can parse on two delimiter types: `&[u8]` or `regex::bytes::Regex`. I've gone down the road of a splitter function that returns `Box<dyn Iterator<Item = &[u8]>>`, and it works, but there is an appreciable performance hit, since that is literally the hottest part of the code.

**HELPPPP:** Again, `src/lib/line_parser.rs` makes an attempt at splitting this out into a trait with different implementations. Sadly, things fell apart due to the cached vec issue above before I really got to play with this design. I would love feedback on it, though. Specifically: should the return type of `split` be elevated to an associated type on the trait? If so, that moves a lifetime onto the trait; how does that work out?
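One hedged sketch of how that associated-type question could play out: in pre-GAT style, the lifetime goes on the trait itself, and each impl names a concrete iterator type instead of boxing. All names below are hypothetical, not `hck`'s actual API:

```rust
// Sketch of the associated-type design asked about above, in pre-GAT style:
// the lifetime lives on the trait, and each impl names a concrete iterator
// type. Hypothetical names, not hck's actual code.
trait LineSplitter<'a> {
    type Fields: Iterator<Item = &'a str>;
    fn split(&self, line: &'a str) -> Self::Fields;
}

struct CharSplitter(char);

impl<'a> LineSplitter<'a> for CharSplitter {
    // `str::split(char)` has a nameable return type, so no Box is needed.
    type Fields = std::str::Split<'a, char>;
    fn split(&self, line: &'a str) -> Self::Fields {
        line.split(self.0)
    }
}

// Generic callers bind the trait with a higher-ranked lifetime, so the same
// splitter works for whatever lifetime each line buffer happens to have.
fn first_field<S: for<'a> LineSplitter<'a>>(splitter: &S, line: &str) -> String {
    splitter.split(line).next().unwrap_or("").to_string()
}

fn main() {
    println!("{}", first_field(&CharSplitter('\t'), "a\tb\tc")); // a
}
```

A regex-based impl would similarly name the `regex` crate's split iterator type. The main cost of this design is that every generic bound becomes the noisier `for<'a> LineSplitter<'a>` form, and it still collides with the cached-vec lifetime issue above.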
## In an ideal world

In an ideal world I would be able to use:
To the best of my knowledge there is no way to reuse a vec like that, which is very sad.
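The snippet that belonged here isn't in this copy, but the reuse being discussed, recycling the `shuffler`'s allocations across lines, can be sketched as follows. This is an illustrative sketch of the clear-then-transmute trick described earlier, not the actual `hck` source; the safety argument (no `&str` with the short lifetime remains inside) is the one the text itself makes:

```rust
// Illustrative sketch of the recycling trick described above, not the actual
// hck source: one Vec<Vec<&'static str>> outlives the loop, is coerced to the
// buffer's shorter lifetime each iteration, drained, and transmuted back.
fn demo() -> Vec<String> {
    let mut out = Vec::new();
    let mut buf = String::new(); // reused read buffer, as with a BufReader
    let mut empty_shuffler: Vec<Vec<&'static str>> = vec![Vec::new(); 2];
    for n in 0..3 {
        buf.clear();
        buf.push_str(&format!("a{n}\tb{n}"));
        // Covariance lets the 'static inners shrink to the borrow of `buf`.
        let mut shuffler: Vec<Vec<&str>> = empty_shuffler;
        // Output order [1, 0]: print the second field first.
        for (i, field) in buf.split('\t').enumerate().take(2) {
            shuffler[1 - i].push(field);
        }
        out.push(shuffler.concat().join("\t"));
        // Drain the borrows, then transmute the now-empty vec back to
        // 'static so it may outlive `buf`'s current contents.
        for inner in &mut shuffler {
            inner.clear();
        }
        // SAFETY: every inner Vec is empty; only the lifetime changes.
        empty_shuffler = unsafe { std::mem::transmute(shuffler) };
    }
    out
}

fn main() {
    for line in demo() {
        println!("{line}");
    }
}
```

The `unsafe` block is the whole problem: a safe API for "this container is empty now, so forget the old lifetime" is exactly what's missing.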