Using `+` with `choose` field separator considered harmful #21

theryangeary · 2021-07-13T02:21:45Z

Hey there, I just discovered this project from following the breadcrumbs from your issue over on choose! I noticed you had some great benchmark reports (really impressive performance btw!). It looks like you are testing a couple choice inputs to choose using regexes with a + at the end. By default, field separators in choose are greedy, and so using + is not necessary, and actually hurts performance (my understanding is that it is typical for regex engine's performance to be hurt by any repetition).

So where you have choose -f '[[:space:]]+' -i ./hyper_data.txt 0 7 18 > /dev/null, I recommend you instead use choose -f '[[:space:]]' -i ./hyper_data.txt 0 7 18 > /dev/null (note the lack of +). It looks like there are a few other spots as well unnecessarily using +.

Admittedly, this is something of a foot gun, so if you felt that it would be good to demonstrate both, that would also be fair. I am thinking of adding some docs explaining that this should be avoided, and possibly stripping repeat operators in code as well.

For reference, with a quick test using time, it looks like adding the + makes choose take very roughly twice as long:

➜  choose git:(master) ✗ time choose -i input.txt -f '[[:space:]]' 3:5 > /dev/null
choose -i test/long_long_long_long_long.txt -f '[[:space:]]' 3:5 > /dev/null  4.26s user 0.13s system 99% cpu 4.415 total
➜  choose git:(master) ✗ time choose -i input.txt -f '[[:space:]]+' 3:5 > /dev/null
choose -i test/long_long_long_long_long.txt -f '[[:space:]]+' 3:5 > /dev/null  8.31s user 0.17s system 99% cpu 8.507 total

The text was updated successfully, but these errors were encountered:

sstadick · 2021-07-13T02:38:59Z

That's a great point! And an interesting optimization unto itself... I'll update the benchmarks here shortly running choose without the space.

I like your tool btw. I posted hck on reddit and of course the first comment was a link pointing out choose existed. I'm not sure how I missed it when looking for existing tooling at the start, by no means did I want to step on toes!

theryangeary · 2021-07-13T02:55:26Z

No worries! No toes were harmed in the making of this software. When I first posted about choose on reddit, the first comment was "what about fex" - although now I can't seem to find the repo for it, so maybe that's what's up!

sstadick · 2021-07-13T03:40:18Z

I figure out why choose was so slow in comparison! It should be choose -f '[[:space:]]' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null not choose -f '[[:space:]]' -i ./hyper_data.txt 0 7 18 > /dev/null. So there were no fields for it to split on and it was doing way more work than all the other tools.

The benchmarks have been updated and I apologize for the double issue running against the wrong input file and not taking the greedy splitting into account!

theryangeary · 2021-07-13T11:55:36Z

No worries, thanks for the update! I'll be back when I am done improving performance 😈

theryangeary closed this as completed Jul 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using `+` with `choose` field separator considered harmful #21

Using `+` with `choose` field separator considered harmful #21

theryangeary commented Jul 13, 2021

sstadick commented Jul 13, 2021

theryangeary commented Jul 13, 2021

sstadick commented Jul 13, 2021

theryangeary commented Jul 13, 2021

Using + with choose field separator considered harmful #21

Using + with choose field separator considered harmful #21

Comments

theryangeary commented Jul 13, 2021

sstadick commented Jul 13, 2021

theryangeary commented Jul 13, 2021

sstadick commented Jul 13, 2021

theryangeary commented Jul 13, 2021

Using `+` with `choose` field separator considered harmful #21

Using `+` with `choose` field separator considered harmful #21