Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using + with choose field separator considered harmful #21

Closed
theryangeary opened this issue Jul 13, 2021 · 4 comments
Closed

Using + with choose field separator considered harmful #21

theryangeary opened this issue Jul 13, 2021 · 4 comments

Comments

@theryangeary
Copy link

Hey there, I just discovered this project from following the breadcrumbs from your issue over on choose! I noticed you had some great benchmark reports (really impressive performance btw!). It looks like you are testing a couple choice inputs to choose using regexes with a + at the end. By default, field separators in choose are greedy, and so using + is not necessary, and actually hurts performance (my understanding is that it is typical for regex engine's performance to be hurt by any repetition).

So where you have choose -f '[[:space:]]+' -i ./hyper_data.txt 0 7 18 > /dev/null, I recommend you instead use choose -f '[[:space:]]' -i ./hyper_data.txt 0 7 18 > /dev/null (note the lack of +). It looks like there are a few other spots as well unnecessarily using +.

Admittedly, this is something of a foot gun, so if you felt that it would be good to demonstrate both, that would also be fair. I am thinking of adding some docs explaining that this should be avoided, and possibly stripping repeat operators in code as well.

For reference, with a quick test using time, it looks like adding the + makes choose take very roughly twice as long:

➜  choose git:(master) ✗ time choose -i input.txt -f '[[:space:]]' 3:5 > /dev/null
choose -i test/long_long_long_long_long.txt -f '[[:space:]]' 3:5 > /dev/null  4.26s user 0.13s system 99% cpu 4.415 total
➜  choose git:(master) ✗ time choose -i input.txt -f '[[:space:]]+' 3:5 > /dev/null
choose -i test/long_long_long_long_long.txt -f '[[:space:]]+' 3:5 > /dev/null  8.31s user 0.17s system 99% cpu 8.507 total
@sstadick
Copy link
Owner

That's a great point! And an interesting optimization unto itself... I'll update the benchmarks here shortly running choose without the space.

I like your tool btw. I posted hck on reddit and of course the first comment was a link pointing out choose existed. I'm not sure how I missed it when looking for existing tooling at the start, by no means did I want to step on toes!

@theryangeary
Copy link
Author

No worries! No toes were harmed in the making of this software. When I first posted about choose on reddit, the first comment was "what about fex" - although now I can't seem to find the repo for it, so maybe that's what's up!

@sstadick
Copy link
Owner

I figure out why choose was so slow in comparison! It should be choose -f '[[:space:]]' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null not choose -f '[[:space:]]' -i ./hyper_data.txt 0 7 18 > /dev/null. So there were no fields for it to split on and it was doing way more work than all the other tools.

The benchmarks have been updated and I apologize for the double issue running against the wrong input file and not taking the greedy splitting into account!

@theryangeary
Copy link
Author

No worries, thanks for the update! I'll be back when I am done improving performance 😈

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants