There must be a better way! #6

Closed · wants to merge 7 commits

Conversation

sstadick (Owner) commented Jul 1, 2021

These changes are already on master, this PR is meant to inform future iterations.

It might be worth starting at the very bottom, with the "In an ideal world" section.

What is hck?

hck is a near clone of cut, but faster and with more features, such as rearranging output columns, regex delimiters, and selection by headers. A few more features are planned as well, such as regex filters on columns and possibly some kind of nice hookup to bat or fzf.

As it stands, in my own benchmarking (in other words, probably flawed and biased benchmarks), hck is on par with tsv-utils from eBay, which seems to be about the fastest toolkit out there. I would like to keep or exceed that level of performance.

How does it work?

Setup

hck takes a list of files (or stdin) and some subset of fields as options. Fields are parsed by the FieldRange module/struct; the code for that was adapted from the Rust coreutils version of cut.
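To make the setup concrete, here is a minimal sketch of cut-style field-list parsing (e.g. "1,3-5,7-"). The struct layout, names, and error handling are assumptions for illustration only, not the actual FieldRange code, and header-name selection is skipped entirely.

// Sketch only: parse a cut-style field list like "1,3-5,7-" into ranges.
// The struct layout and error type are assumptions, not hck's FieldRange.
#[derive(Debug)]
pub struct FieldSpec {
    pub low: usize,          // 1-based, inclusive
    pub high: Option<usize>, // None means "through the last field", as in "7-"
}

pub fn parse_field_list(spec: &str) -> Result<Vec<FieldSpec>, String> {
    spec.split(',')
        .map(|part| {
            let mut bounds = part.splitn(2, '-');
            let low = bounds
                .next()
                .filter(|s| !s.is_empty())
                .ok_or_else(|| format!("bad field spec: {:?}", part))?
                .parse::<usize>()
                .map_err(|e| e.to_string())?;
            let high = match bounds.next() {
                None => Some(low),  // "3" selects a single field
                Some("") => None,   // "7-" runs to the end of the line
                Some(h) => Some(h.parse::<usize>().map_err(|e| e.to_string())?),
            };
            Ok(FieldSpec { low, high })
        })
        .collect()
}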

A writer opened on stdout (from grep_cli, to avoid Rust's slow line-buffered stdout) is created.
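For reference, a std-only sketch of the same idea (the actual code uses grep_cli's stdout helper): wrapping the locked handle in a BufWriter trades the default per-newline flush for block buffering.

use std::io::{self, BufWriter};

// Sketch only: a block-buffered writer over locked stdout. std's Stdout is
// line-buffered, so without this every newline forces a flush.
fn stdout_writer() -> BufWriter<io::StdoutLock<'static>> {
    BufWriter::new(io::stdout().lock())
}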

For each input the run function in main.rs is called. It takes a reader, writer, and the parsed options.

The field ranges are re-parsed for each input source, since the headers of each input could be in a different order. This is not ideal and could be changed in the future to do it all right at the start, or at least to parse the index-based fields only once.

In the run function the reader and writer are wrapped in a BufReader / BufWriter. A core::Core object is created to manage the actual parsing and writing of data.

At this point we hit the first limitation imposed by the current design: if we are splitting the line based on a regex we call core.process_reader_regex, and if we are using a substring delimiter we call core.process_reader_substr. These methods differ only in how the line is split.

Parsing lines

Looking at core::Core::process_reader_*, the main loop there was taken from bstr's for_byte_line* code. It is a nearly zero-copy line reader and was the fastest of the many ways I tried parsing lines. The bstr function itself requires a closure; more on that in a second.
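For context, this is roughly what bstr's closure-based API looks like (a trivial example, not hck's loop): each line arrives as a &[u8] with the terminator stripped, and the closure returns Ok(true) to keep reading.

use bstr::io::BufReadExt;
use std::io;

// Trivial example of bstr's closure-based line reading (not hck's actual loop).
fn count_lines<R: io::BufRead>(reader: R) -> io::Result<u64> {
    let mut count = 0u64;
    reader.for_byte_line(|_line: &[u8]| {
        count += 1;
        Ok(true) // return Ok(false) to stop early
    })?;
    Ok(count)
}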

The cached Vec<Vec<&str>> issue

The "process a line" code is repeated twice in the loop. This is bad and I'd love to change it. For each line, because we are reordering the outputs, we need to store references to parsed fields in the "new" order since we must (or at least it seems easiest / fastest) iterate over the fields their input order. This is done by storing groups of fields on the shuffler Vec as Vec<Vec<&str>>. We want to avoid allocating and deallocating that vec over and over for each line. This turns out to be really really hard and annoying.

See my questions here and here for how I ended up with what you see now.

The short explanation is that empty_shuffler is created outside the loop with a &'static str inner type, then coerced inside the loop to give the &str a shorter lifetime, so that the compiler will allow references into the current buffer to be stored in a container that lives longer than that data. When the fields are written out in their new order the inner Vec<&str>s are drained, so the vec is effectively recycled; rustc doesn't know that yet, though, so we transmute the shuffler back into the empty_shuffler (again, see the questions above for more details on how / why this works).

let mut empty_shuffler: Vec<Vec<&'static [u8]>> = vec![vec![]; self.fields.len()];
loop {
    // coerce the lifetime to match the loop
    let mut shuffler = empty_shuffler;
    // ... fill the shuffler with &[u8] slices from the current line
    // write the fields in the new order, draining each inner vec
    self.join_appender(shuffler.iter_mut().flat_map(|values| values.drain(..)))?;
    // "reset" the empty_shuffler for the next loop
    empty_shuffler = unsafe { core::mem::transmute(shuffler) };
}

HELPPPPP: In src/lib/line_parser.rs you will see an attempt at moving this logic elsewhere to compartmentalize things. Every attempt I've made at this falls apart with lifetime errors around the empty_shuffler and its inners living longer than the data that was just read. The lifetime coercion seems to fall apart across a function boundary, and it especially does not work with a closure. The sketch below shows the shape of the helper in question.
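For concreteness, a hypothetical helper signature (the names are mine, not the line_parser.rs code): on its own this compiles, because the shuffler's inner lifetime is tied to the line. The trouble described above shows up when the same allocation has to be reused across loop iterations, each with a different, shorter line lifetime.

// Hypothetical helper, for illustration only (not the actual line_parser.rs code).
fn fill_shuffler<'a>(
    line: &'a [u8],
    fields: &[std::ops::Range<usize>],
    shuffler: &mut Vec<Vec<&'a [u8]>>,
) {
    for (out_idx, range) in fields.iter().enumerate() {
        // Pretend each field is a byte range into the line; the real code
        // splits on a delimiter and maps input fields to output positions.
        shuffler[out_idx].push(&line[range.clone()]);
    }
}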

The split issue

This leads directly into the next issue. We can split on two delimiter types: &[u8] or regex::bytes::Regex. I've gone down the road of a splitter function that returns Box<dyn Iterator<Item = &[u8]>>, and it works, but there is an appreciable performance hit since that is literally the hottest part of the code.
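The boxed approach looks something like the following condensed sketch (not the actual hck code): both delimiter types can return the same trait object, at the price of a heap allocation per line and dynamic dispatch on every next() call.

use bstr::ByteSlice;
use regex::bytes::Regex;

// Condensed sketch of the boxed-splitter approach (not the actual hck code).
enum Delim {
    Substr(Vec<u8>),
    Regex(Regex),
}

impl Delim {
    // One return type for both cases, but every line pays for a Box and
    // every field for a virtual call to `next()`.
    fn split<'a>(&'a self, line: &'a [u8]) -> Box<dyn Iterator<Item = &'a [u8]> + 'a> {
        match self {
            Delim::Substr(d) => Box::new(line.split_str(d)),
            Delim::Regex(re) => Box::new(re.split(line)),
        }
    }
}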

HELPPPP: Again, src/lib/line_parser.rs makes an attempt at splitting this out into a trait with different implementations. Sadly things fell apart due to the cached vec issue before I really got to play with this design. I would love feedback on it, though. Specifically: should the return type of split be elevated to an associated type on the trait? If so, that moves a lifetime onto the trait; how does that work out?
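One way that lifetime-on-the-trait design can be written is sketched below; the names are hypothetical and this is not the actual line_parser.rs code. The trait carries the lifetime, the associated type names the concrete iterator, and a single-byte splitter serves as the example impl; a regex impl would name regex::bytes::Split the same way.

// Sketch of the "lifetime on the trait" design (hypothetical names, not hck's code).
trait LineSplit<'a> {
    type Iter: Iterator<Item = &'a [u8]>;
    fn split(&self, line: &'a [u8]) -> Self::Iter;
}

// Example impl: split on tab bytes. A plain fn pointer keeps the iterator
// type nameable without boxing.
struct TabSplitter;

fn is_tab(b: &u8) -> bool {
    *b == b'\t'
}

impl<'a> LineSplit<'a> for TabSplitter {
    type Iter = std::slice::Split<'a, u8, fn(&u8) -> bool>;

    fn split(&self, line: &'a [u8]) -> Self::Iter {
        line.split(is_tab as fn(&u8) -> bool)
    }
}

A consumer can then be written as generic over S: for<'a> LineSplit<'a>, which keeps splitting monomorphized instead of boxed; the remaining catch is that this lifetime still has to line up with the shuffler reuse described above.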

In an ideal world

In an ideal world I would be able to use:

// method on `core::Core`
pub fn process_reader<R: Read>(&mut self, reader: &mut BufReader<R>) -> Result<(), io::Error> {
    let mut shuffler: Vec<Vec<&'static [u8]>> = vec![vec![]; self.fields.len()];
    reader.for_byte_line(|line| {
        self.line_parser.parse_line(line, &mut shuffler)?;
        self.join_appender(shuffler.iter_mut().flat_map(|values| values.drain(..)))?;
        Ok(true)
    })?;
    Ok(())
}

To the best of my knowledge there is no way to reuse a vec like that, which is very sad.

sstadick mentioned this pull request Jul 7, 2021

sstadick (Owner, Author) commented Jul 7, 2021

See #7, which partly fixes the issue by adding the ripline crate, which does zero-copy reading without a closure and is nearly as fast as bstr. I was also able to partially solve the split issue.

I added more run types (fast mode, bytes-only mode, etc.), but reuse of the vec is still an issue: I can't DRY out that code by passing the vec around, because doing so still complicates the lifetimes inside the loop.

sstadick closed this Jul 7, 2021