
Named captures in regex #12

Closed
rawtaz opened this issue Nov 11, 2013 · 9 comments

Comments

rawtaz commented Nov 11, 2013

Hi,

Any plans or thoughts on making the regex matching support named subpatterns, so that we get not just the values but also the names of the "parameters", like in tokens, in the callback?

I realize this might require depending on another regex library than the core one. It would just be useful, I use named subpatterns/capturing groups all the time.

Thanks!

xavi- (Owner) commented Nov 11, 2013

Hmm, that's an interesting idea. Do you have any use cases where named subpatterns would be more convenient?

I'm a little hesitant because it seems like named parameters might add backward incompatibilities and/or API inconsistencies. For example, the third parameter passed to a handler (after req and res) for regex matches is an array:

"r`^/actors/([\\w]+)/([\\w]+)$`": function(req, res, matches) {
    // if req.url === "/actors/smith/will"
    //   then matches === [ "smith", "will" ]
}

While for token matches the third parameter is a hash map and the fourth parameter is an array:

"/names/`last-name`/`first-name`": function(req, res, tokens, values) {
    // if req.url === "/names/smith/will"
    //   then tokens ===  { "first-name": "will", "last-name": "smith" }
    //   and values === [ "will", "smith" ]
},

To ensure backwards compatibility while adding support for named subpatterns, we'd have to add a fourth parameter of type hash map to regex matches, which would make regex matches inconsistent with token matches -- fn(req, res, array, hash) vs fn(req, res, hash, array).

Another option is to extend the token syntax to support regex. Something like this:

"/user/`user-id: [a-z]{2}-\d{5}`": function(req, res, tokens, values) {
    // if req.url === "/user/12345"
    //   then tokens ===  { "user-id": "tk-12345" }
    //   and values === [ "tk-12345" ]
    //
    // if req.url === "/user/tk-123456"
    //   then this handler is not called
},

But I feel the syntax is a little clunky and it significantly complicates the library.

In any case, please let me know about any use cases you had in mind.

Thanks,
Xavi

rawtaz (Author) commented Nov 11, 2013

The use case is pretty simple: why would one want to reference items in an array (such as the captured groups) by index instead of by name? Any time I write a regex, and I write quite a lot of them, I default to naming the captured groups, because then it doesn't matter if I change their order later on; I can still reference them the same way, instead of having to update indexes in all the references to them. In summary, the use case is cleaner and more flexible code, and less chance of bugs.

People often assume that the order of capturing groups/subpatterns only changes when you move the "parts" you're trying to match around, but you can also end up with different groups just by changing the way the regex works/matches, i.e. its syntax.

In either case, there's little point in using integer indexes when we can use names. I guess for the same reasons that you introduced the tokens parameter with the routes in the first place.

I agree that the current API is complicated by the lack of a tokens parameter in the regex route callback. It should have been there from the start, but that's too late now :) Perhaps before discussing that we should consider whether there is enough interest in this at all, considering that it probably needs a dependency on a non-core regex library? I don't do much regex in Node so I don't know if there is support for named subpatterns/groups; do you know?

Maybe one could add this in the next major version or something. Another option is to make the current third parameter to the callback contain both integer and string indexes (they could be tested), but you might feel that is ugly.
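The "both integer and string indexes" idea can be sketched in plain JavaScript, since arrays are objects and can carry named properties. This is only an illustration; `namedMatches` and the group names are made up, not part of Beeline's API:

```javascript
// Hypothetical sketch: a matches array that also exposes each captured
// value under a name, so existing index-based handlers keep working.
function namedMatches(re, url, names) {
    var m = re.exec(url);
    if (!m) return null;
    var matches = m.slice(1);            // positional: matches[0], matches[1], ...
    names.forEach(function (name, i) {   // named: matches["lastName"], ...
        matches[name] = matches[i];
    });
    return matches;
}

var matches = namedMatches(/^\/actors\/(\w+)\/(\w+)$/, "/actors/smith/will",
                           ["lastName", "firstName"]);
// matches[0] === "smith" and matches["lastName"] === "smith"
// matches[1] === "will"  and matches["firstName"] === "will"
```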

Regarding putting the regex in the token definition: yeah, that's an option. I think you're right that it complicates things a bit, though; you have to deal with containing the regex even when it uses the same characters as the route definition's delimiters. Here's how it's done in the Yii PHP framework, though it doesn't support all of the regex syntax: post/<year:\d{4}>/<title>
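As an aside: core JavaScript did eventually gain named capture groups with ES2018's `(?<name>...)` syntax, so in modern Node no non-core regex library is needed for this:

```javascript
// ES2018 named capture groups in core JavaScript (Node 10+): the match
// object exposes the named values under .groups.
var m = /^\/actors\/(?<lastName>\w+)\/(?<firstName>\w+)$/.exec("/actors/smith/will");
// m.groups.lastName === "smith"
// m.groups.firstName === "will"
```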

xavi- (Owner) commented Nov 11, 2013

That totally makes sense. Named subpatterns are definitely better than just raw regexes, but do you know of any cases where named subpatterns are better than the token syntax?

The biggest advantage I can think of is that it helps reduce false-positives:

"/user/`user-id`" // matches "/user/12345" and "/user/not-an-id"
"r`^user/(?P<user-id>\d{5})/$`" // matches "/user/12345" but not "/user/not-an-id"

Which is fairly compelling, because it removes the need for some types of error handling. If that's the main use case, I'm inclined to go with the token syntax: "/user/`user-id: [a-z]{2}-\d{5}`"
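The false-positive point can be checked with plain JavaScript RegExp objects (no routing library involved; the two patterns below are rough illustrations of the rules above):

```javascript
// "loose" approximates what a bare token matches (anything between
// slashes); "strict" adds the digit constraint from the example above.
var loose  = /^\/user\/([^\/]+)$/;
var strict = /^\/user\/(\d{5})$/;

loose.test("/user/12345");      // true
loose.test("/user/not-an-id");  // true  -- the false positive
strict.test("/user/12345");     // true
strict.test("/user/not-an-id"); // false
```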

One issue that comes to mind is how should /'s be handled? For example, should "/user/`user-id: .*`" match "/user/foo/bar/"?

Let me know if any other use cases and/or corner cases come to mind.

rawtaz (Author) commented Nov 29, 2013

That totally makes sense. Named subpatterns are definitely better than just raw regexes, but do you know of any cases where named subpatterns are better than the token syntax?

I think they are two different things. The token syntax is simple and basic, and for people who don't need more complex matching than e.g. "text parts between /" it's nice to have. Personally, if I had to choose just one to keep, it'd obviously be the one that supports regex, because I think one needs that ability for more complex/detailed matching.

The biggest advantage I can think of is that it helps reduce false-positives:

"/user/user-id" // matches "/user/12345" and "/user/not-an-id"
"r^user/(?P<user-id>\d{5})/$" // matches "/user/12345" but not "/user/not-an-id"
Which is fairly compelling, because it removes the need for some types of error handling. If that's the main use case, I'm inclined to go with the token syntax: "/user/user-id: [a-z]{2}-\d{5}\"

One issue that comes to mind is how should /'s be handled? For example, should "/user/`user-id: .*`" match "/user/foo/bar/"?

Deviating from the regular regex rules is of course doable, but you'd need to write your own regex parser, I guess, and it also becomes a bit weird. Perhaps better to just keep it the way it is, i.e. let .* match / as well. After all, the raw regex syntax for routes does it that way; if you don't want .* to match the /, you have to turn it into [^/]*.

Overall I'm not too fond of complicating things by merging the two syntaxes. In part because it requires some intervention; you can't just apply the pattern to the URL directly. Better to just keep them separate.

Let me know if any other use cases and/or corner cases come to mind.

Would the purpose of making the token syntax accept regex simply be to avoid having to change the API? I don't have an opinion on that really, it depends on whether you want to keep the API or can break it now or in a new major version. I imagine the way I'd personally like the regex way to work is like this:

"r`^/actors/(?P<lastName>[\\w]+)/(?P<firstName>[\\w]+)$`": function(req, res, tokens, values) {
    // The parameter tokens is an object that maps token name to a value.
    // The parameter values is a list of the values only.
    // For example if req.url === "/actors/smith/will"
    //   then tokens === { "lastName": "smith", "firstName": "will" }
    //   and values === [ "smith", "will" ]
},

This example doesn't exemplify anything that the regular tokens syntax doesn't support, but you get the idea.

xavi- (Owner) commented Dec 1, 2013

You bring up good points, I agree that .* should act as expected and match /.

Avoiding API changes is one reason I'd prefer the mixed syntax, but the primary reason is that the purpose of this library is to remove the need for raw regexes. A good amount of boilerplate is required to match regexes against urls -- the starting ^, the ending $, parens for subgroups, escaping common url characters (dots and dashes), etc. Also, I feel that raw regexes are error prone and difficult to read. For example, \w and \W look very similar but mean completely different things.

Another reason why I'd prefer the mixed syntax is that it limits the scope of regexes. By limiting the scope, we, IMO, make the regexes more readable as well as remove some of the boilerplate, while still keeping much of the power and flexibility. For example:

^/user\-blogs/(?P<user\-id>[a-z]{2}\-\d{5})/(?P<post\-id>\d+)$

vs

/user-blogs/`user-id: [a-z]{2}\-\d{5}\`/`post-id: \d+`

Along the same lines, url definitions should ideally be as concise as possible and contain as little visual noise as possible. IMO, named groups in raw regex are quite verbose and contain a good amount of visual noise, which is why I'd again prefer the token/regex style.

rawtaz (Author) commented Dec 1, 2013

Avoiding API changes is one reason I'd prefer the mixed syntax, but the primary reason is that the purpose of this library is to remove the need for raw regexes. A good amount of boilerplate is required to match regexes against urls -- the starting ^, the ending $, parens for subgroups, escaping common url characters (dots and dashes), etc. Also, I feel that raw regexes are error prone and difficult to read. For example, \w and \W look very similar but mean completely different things.

I can't really agree with this, because I love regex and I don't think things like \w and \W are more complicated than knowing the difference between == and != or == and ===. But that's just my opinion; if one doesn't know, and/or isn't interested in learning, the basics of regex, then I guess it can be annoying to see regex syntax. Either one embraces it or one doesn't.

Another reason why I'd prefer the mixed syntax is that it limits the scope of regexs. By limiting the scope, we, IMO, make the regexs more readable as well as remove some of the boiler plate, while still keeping much of the power and flexibility. For example:

^/user\-blogs/(?P<user\-id>[a-z]{2}\-\d{5})/(?P<post\-id>\d+)$

I don't think that's right. The above regex would IMO be:

/user-blogs/(?P<user-id>[a-z]{2}-\d{5})/(?P<post-id>\d+)

The ^ and $ are something you can add while parsing the rule; there's no point in users adding them all the time (assuming we always match at the beginning and end of the URL, which I think makes sense).
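The add-the-anchors-while-parsing idea is essentially a one-liner in JavaScript. `compileRule` below is a hypothetical helper, not part of Beeline:

```javascript
// Wrap the user's pattern with ^ and $ when compiling the rule, so
// route authors never have to write the anchors themselves.
function compileRule(pattern) {
    return new RegExp("^" + pattern + "$");
}

var rule = compileRule("/user-blogs/([a-z]{2}-\\d{5})/(\\d+)");
rule.test("/user-blogs/ab-12345/42");   // true
rule.test("/x/user-blogs/ab-12345/42"); // false -- anchored at the start
```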

vs

/user-blogs/`user-id: [a-z]{2}-\d{5}`/`post-id: \d+`

I unescaped a -, since I don't think it needs escaping, and removed the stray escaping before the second `.

Along the same lines, url definitions should ideally be as concise as possible and contain as little visual noise as possible. IMO, named groups in raw regex are quite verbose and contain a good amount of visual noise, which is why I'd again prefer the token/regex style.

If you still prefer to introduce a mixed syntax you have the choice to parse a more restricted version of regex. Here is an example of how the above rule would look in the syntax that Yii (PHP framework) uses:

user-blogs/<user-id:[a-z]{2}-\d{5}>/<post-id:\d+>
user-blogs/<user-id:\w+>/<post-id:\d+>

It's quite similar to your suggested mixed syntax, actually. Just slightly easier to parse, since it has separate start and end delimiters for the parameters (instead of one and the same character being used for both). If Python-style recursive regex support were available, though, the parser could deal with using the same character for both start and end delimiters.
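To illustrate why the distinct < and > delimiters make parsing easy, a single replace pass can compile a Yii-style rule. This is a hedged sketch; `parseYiiRule` and the names below are illustrative, not any framework's actual API:

```javascript
// Compile a Yii-style rule like "user-blogs/<user-id:[a-z]{2}-\d{5}>"
// into an anchored RegExp plus an ordered list of parameter names.
function parseYiiRule(rule) {
    var params = [];
    var pattern = rule.replace(/<([\w-]+):([^>]+)>/g, function (_, name, re) {
        params.push(name);       // remember the parameter name
        return "(" + re + ")";   // keep its regex as a capture group
    });
    return { pattern: new RegExp("^" + pattern + "$"), params: params };
}

var r = parseYiiRule("user-blogs/<user-id:[a-z]{2}-\\d{5}>/<post-id:\\d+>");
// r.params is ["user-id", "post-id"]
var m = r.pattern.exec("user-blogs/ab-12345/99");
// m[1] === "ab-12345", m[2] === "99"
```

Note that this naive version breaks if the embedded regex itself contains a > character, which is one reason real route parsers end up more involved.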

Anyway, you should do what you think is best. At least until someone else speaks their mind.

KoryNunn commented

I'd love to see this added, specifically this syntax:

/user-blogs/`user-id: [a-z]{2}\-\d{5}\`/`post-id: \d+`

xavi- (Owner) commented Feb 13, 2014

Cool, will do. I'll try implementing it this weekend.

xavi- (Owner) commented Feb 17, 2014

Beeline (v0.2.1) should now support regex tokens. This feature was much trickier to implement than I expected, so ideas for any additional unit tests are more than welcome.

In order to keep the code sane, regexes with backreferences are not supported. For example, a url rule like this will generate a warning: "/`palindrome: (.)(.)(.)\3\2\1`"

Technically you could use backreferences, but the results won't be intuitive. The reason is that the regex associated with a token is ultimately embedded into a larger regex, and as a result, the group that \1 refers to is unpredictable.

That said, all other regex features should be supported, including look-aheads. Keep in mind, though, that an implicit ^ and $ are added to the beginning and end of each url rule, so "/`lookahead: foo(?!bar)`" will not match /foo-king, but "/`lookahead: foo(?!bar).*`" will.
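The lookahead behavior described above can be reproduced with plain RegExp objects by adding the implicit anchors by hand (patterns simplified for illustration):

```javascript
// Without a trailing .*, the $ lands right after the lookahead, so
// only the bare path matches:
var withoutTrail = /^\/foo(?!bar)$/;
withoutTrail.test("/foo");       // true
withoutTrail.test("/foo-king");  // false

// With .*, anything after "foo" that doesn't start with "bar" matches:
var withTrail = /^\/foo(?!bar).*$/;
withTrail.test("/foo-king");     // true
withTrail.test("/foobar-x");     // false -- the lookahead rejects "bar"
```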

Let me know if you think of any other features or use cases I should test out.

xavi- closed this as completed Feb 17, 2014
3 participants