Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support string alternative syntax (curly braces pattern) #2

Open
michaelsproul opened this issue Sep 1, 2014 · 1 comment
Open

Comments

@michaelsproul
Copy link
Contributor

Is there a plan to support string alternatives with curly braces, like {foo,bar}program matching fooprogram and barprogram?

To avoid code duplication in the matcher, I was thinking it might be a good idea to use the regex crate to implement this (as well as the rest of glob).

Support in other languages:

  • C/C++: yes using the GLOB_BRACE option.
  • Python: no
  • Ruby: yes
  • Java: yes
@blaenk
Copy link
Contributor

blaenk commented Feb 12, 2015

/cc @alexcrichton (nothing important, no need to read/reply; just if you're interested)

TL;DR: A while back I attempted a complete rewrite of glob. I called it glob-prime, as in glob'. It parsed glob patterns into the AST we have now and translated the AST to regular expressions. It also had a different 'selector-based' selection algorithm, so that each pattern type (e.g. recursive, exact, wildcard) would define its own globbing behavior. The goal was extensibility and maintainability, but it was much slower, so I gave up on it and proposed the remaining things I thought were applicable in #25/#26/#27/#28.

I'll probably take a shot at implementing this alternation syntax whenever I'm bored if no one else gets around to it.

I actually rewrote this glob package from scratch with that goal, to translate globs to regular expressions in order to make the code thinner and easier to expand/maintain. This is what Python does, and it wouldn't surprise me if others did the same. Adding this alternative syntax would've been very straightforward, as it would more or less map to a regex alternation, foo|bar. You can see the translation code here, so that creating a Pattern was essentially a two-step process of parsing the text into the glob AST we already have and then translating that to regex.

I also re-implemented the actual globbing algorithm to be one which, as this one currently does, splits the patterns on the path separator and then translates each one into a given token which has 'globbing' behavior defined for it. The goal was to make things much easier to expand: adding the alternative syntax would've basically consisted of creating an AST token for it, defining how it translates to a regex, and then defining a selector variant for it with its selection behavior, which would have been, for example, something like Alternative(Vec<Box<Selector>>), since AFAIK glob alternations can consist of sub-glob patterns, which means that something similar would have to be done for the translation into regex, which would've perhaps simply consisted of a recursive call to the translation function.

When all was said and done, however, I found it to be much slower.

First, the new selection algorithm I used, which I actually got inspiration from python's glob for, wasn't exactly optimal because on each iteration I would have to re-trace through each of the selectors. Python has the yield keyword which makes all of this very natural, but the iterator as implemented essentially would return a result, then on the next iteration would have to go through the top-most selector, then the next one, and so forth until it got back to its original position.

Perhaps I could've optimized this by storing a reference to the old position or something, but I didn't get around to it because I also discovered that the regular expression approach in general was much slower. I already kind of expected this because regular expressions are much more general and have more machinery to handle other use cases, which can hardly compete against a hand-tuned specific parser like the one we have now. But I didn't expect it to be that much slower.

I don't remember the numbers, but it was on the order of something like, glob would take maybe 50ns max with the benchmarking test set, and mine would take like 200,000ns. I figured I'd cut my losses and stick with the current globbing mechanism while keeping my regular expression back-end, but it was still something like 80,000ns to the current one's 50ns max. With a difference like that I figured my redesigns would never be accepted, even if it was still orders of magnitude faster than for example python's, and even though globbing isn't something that's typically done in a hot loop.

So I figured I'd cut my losses once more and proposed what remained, which was relative path support (#25), error reporting (#26), recursive wildcard simplification (#27), and I/O error propagation (#28).

I didn't really bother too much with attempting to optimize it further because it was mostly a temporary diversion for me, since I need this package for one of mine. But more importantly what we have now works fine enough, and globbing isn't necessarily very extensible, so it's not like we really required that extensibility; the gains would've mostly been in the area of maintainability, but this package shouldn't need to be touched too much for the aforementioned reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants