Named captures in regex #12
Comments
Hmm, that's an interesting idea. Do you have any use cases where named subpatterns would be more convenient? I'm a little hesitant because it seems like named parameters might add backward incompatibilities and/or API inconsistencies. For example, the third parameter passed to a handler (after req and res) for regex matches is an array:
"r`^/actors/([\\w]+)/([\\w]+)$`": function(req, res, matches) {
// if req.url === "/actors/smith/will"
// then matches === [ "smith", "will" ]
}
While for token matches the third parameter is a hash map and the fourth parameter is an array:
"/names/`last-name`/`first-name`": function(req, res, tokens, values) {
// if req.url === "/names/smith/will"
// then tokens === { "first-name": "will", "last-name": "smith" }
// and values === [ "will", "smith" ]
},
To ensure backwards compatibility while adding support for named subpatterns, we'd have to add a fourth parameter of type hash map to regex matches, which would make regex matches inconsistent with token matches.
Another option is to extend the token syntax to support regex. Something like this:
"/user/`user-id: [a-z]{2}-\d{5}`": function(req, res, tokens, values) {
// if req.url === "/user/tk-12345"
// then tokens === { "user-id": "tk-12345" }
// and values === [ "tk-12345" ]
//
// if req.url === "/user/tk-123456"
// then this handler is not called
},
But I feel the syntax is a little clunky and it significantly complicates the library. In any case, please let me know about any use cases you had in mind. Thanks,
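To make the mixed token/regex idea concrete, here is a minimal sketch of how such a route string could be compiled. It is not beeline's implementation: `compileRoute` and `matchRoute` are hypothetical helpers, and the sketch assumes that a token's regex contains no backticks and no capture groups of its own.

```javascript
// Hypothetical helpers, not part of beeline's API.
// Compiles a route such as "/user/`user-id: [a-z]{2}-\d{5}`" into an
// anchored RegExp plus the list of token names, in order of appearance.
function compileRoute(route) {
  const names = [];
  // Splitting on backticks leaves literal text at even indexes and
  // token bodies at odd indexes.
  const source = route.split("`").map((part, i) => {
    if (i % 2 === 0) {
      // Literal text: escape regex metacharacters.
      return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }
    const colon = part.indexOf(":");
    const name = (colon === -1 ? part : part.slice(0, colon)).trim();
    // A bare token matches one path segment; otherwise use the given regex.
    const pattern = colon === -1 ? "[^/]+" : part.slice(colon + 1).trim();
    names.push(name);
    return "(" + pattern + ")";
  }).join("");
  return { regex: new RegExp("^" + source + "$"), names };
}

// Returns { tokens, values } on a match, or null when the URL does not match.
function matchRoute(route, url) {
  const { regex, names } = compileRoute(route);
  const m = regex.exec(url);
  if (m === null) return null;
  const values = m.slice(1);
  const tokens = {};
  names.forEach((name, i) => { tokens[name] = values[i]; });
  return { tokens, values };
}

const route = "/user/`user-id: [a-z]{2}-\\d{5}`";
console.log(matchRoute(route, "/user/tk-12345"));  // tokens["user-id"] is "tk-12345"
console.log(matchRoute(route, "/user/tk-123456")); // null: six digits do not match
```

Because each token becomes exactly one capture group, the handler could keep receiving `tokens` and `values` in the same shape as plain token routes, which is the compatibility argument made above.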
The use case is pretty simple: why would one want to reference items in an array (such as the result from captured groups) by index instead of by name? Any time I write a regex, and I write quite a lot of them, I default to naming the captured groups, because then it doesn't matter if I change their order later on; I can still reference them the same way (instead of having to change the indexes in every reference to them). So in summary, the use case is cleaner and more flexible code, with less chance of bugs. People often think that the order of capturing groups/subpatterns only changes when you move the "parts" you're trying to match around, but you can also end up with different groups just by changing the way the regex matches, i.e. its syntax. Either way, there's little point in using integer indexes when we can use names. I guess for the same reasons that you introduced the I agree that the current API, due to the lack of the Maybe one could add this in the next major version or something. Another option is to make the current third parameter to the callback contain both integer and string indexes (they could be tested), but you might feel that is ugly.
Regarding putting the regex in the token definition, yeah, that's an option. I think you're right that it complicates things a bit though: you'd have to deal with delimiting the regex even when one might want to put the route definition's delimiters inside it. Here's how it's done in the Yii PHP framework, but it doesn't support all of the regex syntax:
That totally makes sense. Named subpatterns are definitely better than just raw regexes, but do you know of any cases where named subpatterns are better than the token syntax? The biggest advantage I can think of is that it helps reduce false positives:
"/user/`user-id`" // matches "/user/12345" and "/user/not-an-id"
"r`^/user/(?P<user-id>\d{5})$`" // matches "/user/12345" but not "/user/not-an-id"
Which is fairly compelling, because it removes the need for some types of error handling. If that's the main use case, I'm inclined to go with the token syntax: One issue that comes to mind is how should Let me know of any other use cases and/or corner cases that come to mind.
I think they are two different things. The token syntax is simple and basic, and for people that don't need more complex matching than e.g. "text parts between /" it's nice to have. Personally, if I had to choose just one to keep, it'd obviously be the one that supports regex, because I think one needs the ability to do more complex/detailed matching.
Deviating from the regular regex rules is of course doable, but you'd need to write your own regex parser I guess, and it also becomes a bit weird. Perhaps better to just keep it the way it is, i.e. let .* match / as well. After all, the regex syntax for routes does it that way; if you don't want .* to match the /, you have to turn it into [^/]*. Overall I'm not too fond of complicating it by merging the two syntaxes, in part because it requires some intervening: you can't just apply the regex to the URL. Better to just keep them separate.
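The difference between `.*` and `[^/]*` mentioned above can be checked directly. This is a small illustration using plain JavaScript regexes; the `/files/...` URLs are made up for the example.

```javascript
// .* is greedy and also consumes "/" characters.
const greedy = /^\/files\/(.*)$/;
// [^\/]* stops at the next "/", so it only ever spans one path segment.
const segment = /^\/files\/([^\/]*)$/;

const greedyMatch = greedy.exec("/files/a/b");
const segmentMiss = segment.exec("/files/a/b");
const segmentHit = segment.exec("/files/a");

console.log(greedyMatch[1]); // "a/b"
console.log(segmentMiss);    // null
console.log(segmentHit[1]);  // "a"
```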
Would the purpose of making the token syntax accept regex simply be to avoid having to change the API? I don't have an opinion on that really, it depends on whether you want to keep the API or can break it now or in a new major version. I imagine the way I'd personally like the regex way to work is like this:
This example doesn't exemplify anything that the regular tokens syntax doesn't support, but you get the idea.
You bring up good points, I agree that Avoiding API changes is one reason I'd prefer the mixed syntax, but the primary reason is that the purpose of this library is to remove the need for raw regexes. A good amount of boilerplate is required to match regexes against URLs -- the starting Another reason why I'd prefer the mixed syntax is that it limits the scope of regexes. By limiting the scope, we, IMO, make the regexes more readable and remove some of the boilerplate, while still keeping much of the power and flexibility. For example:
vs
Along the same lines, URL definitions should ideally be as concise as possible and contain as little visual noise as possible. IMO, named groups in raw regex are quite verbose and contain a good amount of visual noise, which is why I'd again prefer the token/regex style.
I can't really agree with this because I love regex, and I don't think things like \w and \W are more complicated than knowing the difference between == and != or == and ===. But that's just my opinion; if one doesn't know and/or doesn't have the interest in knowing the basics of regex, then I guess it can be annoying to see regex syntax. Either one embraces it or one doesn't.
I don't think that's right. The above regex would IMO be:
The ^ and $ are something you can add while parsing the line; there's no point in users adding them all the time (assuming you always match at the beginning and end of the URL, which I think makes sense).
I unescaped a - since I don't think it needs escaping, and the second `.
If you still prefer to introduce a mixed syntax you have the choice to parse a more restricted version of regex. Here is an example of how the above rule would look in the syntax that Yii (PHP framework) uses:
It's quite similar to your suggested mixed syntax, actually. Just slightly easier to parse, since it has separate start and end delimiters for the parameters (instead of one and the same character being used for both). If Python has recursive regex support, then it could deal with using the same character for both start and end delimiters though. Anyway, you should do what you think is best. At least until someone else speaks their mind.
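The suggestion to add ^ and $ while parsing, rather than asking users to write them, might look roughly like this. `anchored` is a hypothetical helper and not something the library is known to expose; the non-capturing wrapper is one possible design choice so that alternations stay inside the anchors.

```javascript
// Hypothetical helper: wrap a user-supplied pattern so that it must match
// the whole URL, without the user writing ^ and $ themselves.
function anchored(source) {
  // (?: ... ) keeps an alternation such as "/a|/b" inside both anchors
  // and does not add a capture group of its own.
  return new RegExp("^(?:" + source + ")$");
}

const userRoute = anchored("/user/(\\d{5})");
console.log(userRoute.test("/user/12345"));       // true
console.log(userRoute.test("/user/12345/extra")); // false
console.log(userRoute.test("/api/user/12345"));   // false

// Without the wrapper, "^/a|/b$" means "starts with /a" OR "ends with /b".
const naive = new RegExp("^" + "/a|/b" + "$");
console.log(naive.test("/x/b"));             // true, probably not intended
console.log(anchored("/a|/b").test("/x/b")); // false
```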
I'd love to see this added, specifically this syntax:
Cool, will do. I'll try implementing it this weekend.
Beeline (v0.2.1) should now support regex tokens. This feature was much trickier to implement than I expected, so ideas for any additional unit tests will be more than welcome. In order to keep the code sane, regexes with backreferences are not supported. For example, a url like this will generate a warning: Technically you could use a backreference, but the results won't be intuitive. The reason is that ultimately the regex associated with a token will be embedded into a larger regex and as a result, the capture group That said, all other regex features should be supported, including look-aheads. Though keep in mind an implicit Let me know if you think of any other features or use cases I should test out.
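The backreference problem described above can be reproduced with plain JavaScript regexes. The patterns below are invented for illustration, and `hasNumericBackreference` is a rough, hypothetical check, not the warning logic beeline actually uses.

```javascript
// On its own, \1 refers to the pattern's first (and only) capture group.
const standalone = new RegExp("^(a|b)-\\1$");
console.log(standalone.test("a-a")); // true
console.log(standalone.test("a-b")); // false

// Embedded in a larger regex after another capture group, the same \1 now
// points at that earlier group, so the token silently changes meaning.
const embedded = new RegExp("^/(\\w+)/((a|b)-\\1)$");
console.log(embedded.test("/x/a-a")); // false: \1 is now "x", not "a"
console.log(embedded.test("/x/a-x")); // true

// Rough heuristic (an assumption, not beeline's code): report a backslash
// followed by a non-zero digit, skipping over other escaped characters.
function hasNumericBackreference(source) {
  for (let i = 0; i < source.length; i++) {
    if (source[i] === "\\") {
      if (/[1-9]/.test(source[i + 1] || "")) return true;
      i++; // skip the escaped character
    }
  }
  return false;
}

console.log(hasNumericBackreference("(a|b)-\\1")); // true
console.log(hasNumericBackreference("\\d{5}"));    // false
```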
Hi,
Any plans or thoughts on making the regex matching support named subpatterns, so that we get not just the values but also the names of the "parameters", like in tokens, in the callback?
I realize this might require depending on another regex library than the core one. It would just be useful, I use named subpatterns/capturing groups all the time.
Thanks!
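For reference, a minimal sketch of what named subpatterns give you, assuming a JavaScript engine with ES2018 named capture groups. Note that JavaScript spells them `(?<name>...)` rather than the `(?P<name>...)` form used elsewhere in this thread, and group names must be valid identifiers, so a name like `user-id` would have to become something like `userId`. The `/actors/...` URLs are illustrative.

```javascript
// Two routes that capture the same parts in a different order.
const lastFirst = /^\/actors\/(?<last>\w+)\/(?<first>\w+)$/;
const firstLast = /^\/actors\/(?<first>\w+)\/(?<last>\w+)$/;

const a = lastFirst.exec("/actors/smith/will");
const b = firstLast.exec("/actors/will/smith");

// Access by name is unaffected by the reordering...
console.log(a.groups.first, a.groups.last); // will smith
console.log(b.groups.first, b.groups.last); // will smith

// ...while access by index is not.
console.log(a[1]); // "smith"
console.log(b[1]); // "will"

// The same match result exposes both forms, so a handler could be given
// the positional values and the name-to-value map from a single exec call.
const values = a.slice(1);
const named = { ...a.groups };
console.log(values, named);
```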