-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Needs canonical examples / reference implementation (including RegExp) #59
Comments
The regular expression needs to be updated to detect the presence of leading zeros. Unfortunately, I haven't found a way to do that, otherwise I would have provided an updated example. |
Can you give an example of a valid semver string that has leading zeros? and also provide a reference to the documentation that suggests this is allowed (i.e. a phrase or example in the docs)? If so I'll add your string to the tests in semver-utils and fix it. |
Sorry, I meant to disallow. The regular expression as used now does not seem to reject version numbers with leading zeros.. |
We decided only recently that leading zeroes are no longer allowed: semver/semver#112 |
Hi, I have been playing with the regex for version 2.0.0 and came up with this on regex101. Expanded: /^
(?'MAJOR'
0|(?:[1-9]\d*)
)
\.
(?'MINOR'
0|(?:[1-9]\d*)
)
\.
(?'PATCH'
0|(?:[1-9]\d*)
)
(?:-(?'prerelease'
(?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
(?:\.
(?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
)*
))?
(?:\+(?'build'
(?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
(?:\.
(?:0|(?:[1-9A-Za-z-][0-9A-Za-z-]*))
)*
))?
$/ The pre-release and build patterns are very complex because they require the 'no leading zeros' rule. I can't figure any benefits of that over a (very) relaxed pattern as in Notice that according to the railroad diagram and the BNF (boy, isn't that hard to read! 😕) the identifier "0000.0000.0000.0000.------" is valid (leading zeros allowed). If you can, please supply more edge cases 😄 on the regex101 page (in the first block is all cases are valid, in the second, invalid). Happy hacking! |
I'm in the camp to veto the use of leading zeros. In JavaScript (and some other languages) parsers will default to octal and then your sorting could get all out of wack because suddenly '011' is less than '10' both lexicographically and numerically. And what about 007 vs 07 vs 7? Numerically they're all the same in base 10 and base 8 so how would you know which version is the "newer" one? For the love of all that is good on this earth: no leading zeros!!! |
Oh, sorry I missed the part about that being a build number. I'll have to look at the spec, but I don't think the parser I mentioned prohibits this either way. |
Hi, The requirement is on 2.0.0 for pre-release but not on build version:
But on the BNF:
The railroad diagram is less clear. So maybe the text requires some correction. So, version 2.0.1? (patterns allowed on 2.0.0 will still work here). |
New regex101 pattern:
|
Never mind, it appears that numeric identifiers with leading zeros are not accepted at all or at least have no precedence defined which is pretty much the same. |
I'm cool with adding a new page that has a regex example. We just need to figure out how we'll change the layout of the site to accommodate such links. #57 has the same design issue. |
The posted regex seems to have a little trouble with the spec: In §9 it states
So I assume that a version like Also, am I correct that according to spec pre-release identifiers and build metadata behave differently with respect to the leading zero policy, since §10 misses the appropriate statement? Before I mess with the provided regex, it would be nice if someone could confirm/ falsify my suggestion. Edit: Noticed the discrepancy between pre-release and build metadata spec. |
Good catch. However, your change would also make Taking a step back, an identifier is either a numeric or an alphanumeric. That's a bit tricky to capture in Regex. numeric : Hence it'd combine to be something like So complicated. 😦 Does that look correct? |
Looks good, except that the asterisk in the numeric pattern has to go out of the set. You could also speed it up if the second set in the alpha was matched exactly once so the engine doesn't have to backtrack. |
@haacked would you like to update the regex then (Including the suggestion by @FichteFoll )? This is probably a regurgitation of previously discussed topics, but can a reference regex be put into the spec? |
Another problem: Just numeric : |
Why should the term "0" be valid? It is purely numeric, starts with a '0' so it should be invalid.
|
https://en.wikipedia.org/wiki/Leading_zero
Leading zeroes are excluded for simplification and prevention of ambiguity. Is For negative numbers, I can only speculate. Considering that, in numbering, you want to "start" at a certain point and increase from that onward, it makes sense to have a generally specified starting point (i.e. lowest member) as zero instead of operating on the entire set of whole numbers, which is infinite in both directions. It should be obvious to everyone that the spec speaks of decimal numbers in all places, which are the standard numbering system pretty much everywhere in the world, afaik. |
The thing is that this will often be used by programming people who tend to think in octal sometimes. :)
This is not about the version numbers which are clearly decimal natural numbers w/o leading 0s. I am in favour of stating explicitly that natural decimal numbers (including 0 per se) without leading 0s are the only accepted form of purely numeric notation in the build information. (Alternatively one could go the C-way and assume that pure numeric expressions with leading 0s are interpreted as an octal number.) Hexadecimal expressions should be no problem, they are usually prefixed by I still am not aware why exactly the contents of the build information is restricted in such a way. |
Do we have an oracle of valid and invalid version strings to test against? We should not publish any regex's without such an oracle and test scripts to verify correctness. We should probably also do the same for version 1.0.0, which is still in use. |
The current home page has a menu of version links, why not add another one with links to a FAQ and any other pages need to be added? |
Yeah, I think we could change the menu to have the following links:
That'd probably cover what we need now. Down the road we might add "Implementations" or "Related Specs". |
Did get added to the site? Is there a final recommended regex? |
Hi, I don't think so, although no one did advance with other options. The current version (with the comment from @haacked above) is now this one:
on regex101 [online regex tester] It would be nice if someone could contribute with more tests here. |
See #59 (comment). |
1.0.0+0.build.1-rc.10000aaa-kk-0.1 is a legal version string that is among the list of invalid strings in the test data. There are no special rules for numeric fields in the build meta tag. Anything in the set [0-9A-Za-z-] is allowed in any meta tag sequence. |
@FichteFoll, yes, technically |
Hi @jwdonahue , your expression works on regex101.com with python if you remove the group naming (only a few regex engines understand that, and comments too). It will almost works with golang if you remove the lookaheads Tell me, you added some redundant matches and quantifiers |
Ok, I asked Stackoverflow for help last night and got some very useful feedback. It turns out the
@gvlx, named capture groups are required in my book. Relying on numbered capture groups is unstable. It's too easy for a random numbered capture group to accidentally show up in a revision of the regex that breaks all the code that uses the results of the previous version. I am not just looking for match/no-match Boolean result, I want to be able to access each of the fields by name. Though it would be neat if the results could somehow have PrereleaseTags[] and MetaTags[] arrays (minus the dots), it's simple enough to split the Prerelease and Meta tags on the dots, in a post processing step. I think we're very close now, just need to eliminate some redundant quantifiers and investigate whether there are any useful optimizations. Regex 101 counts a total of 591 steps to process the 72 lines (1293 bytes) of test data. Not sure exactly what they are measuring there, but if there's any way to speed this thing up without losing functionality, that would be awesome. Also, the test data is a bit random and should be rationalized and probably requires expansion. |
I have only one major worry with regard to posting a regex as a canonical example, and that is, I know of no way to prove its correctness relative to the spec. In the absence of such a proof, we should label these as plausible examples that are explicitly not canonical, if we publish them at all. The spec should always be the golden standard. |
I think this should be published, with some of the gory details, if for no other reason than otherwise people will re-create worse regex for the same purpose. |
Yup, I just did another search and looked a bit deeper than the the first page, and there's over hundreds of them here on GitHub and most are useless and some may even be worse than useless. |
@gvlx I think you meant me. I removed my comment after realising I was wrong about that. |
Following the discussion in EBNF grammar and this clarification, I updated this description in W3C simple EBNF grammar, which is much clearer than the official SemVer specification's BNF grammar to help clear the regex ambiguities:
Update 2018/07/17: finally I understood (call me slooow) the construction of the pre-release identifier! The element Following this, I think the actual Regex for the two main extra tags are (using @DavidFichtmueller's simplifications here):
which have to be tested as a compound identifier, i.e., Maybe these patterns can be further optimized? I suppose that, for performance reasons, pattern processing could be done on two passes?
|
If anyone is interested, here is a javascript regex version (regex101) I made, which passes all the same tests as those in the above PCRE version.
It's a little more condensed, as it's not concerned with group names, and just validation. (I'm using it for a JSON Schema) |
Question, the spec for pre-release and metadata says the identifiers may consist of ASCII alphanumerics and hyphens only. It doesn't mention anything about dots, however the previously mentioned regex allows for it. Is this implied somewhere other than by the attached examples? |
Quoting from the spec. I would add that your regex has an unbounded or at least extremely long run-time when faced with this tortuous, nearly legal example: 99999999999999999999999.999999999999999999.99999999999999999----RC-SNAPSHOT.12.09.1--------------------------------..12 Notice the empty prerelease field just before that final '.12'? That triggers back-tracking, which can result in infinite and pointless machinations. Now it's possible that your javascript implementation simply times out and correctly fails this on your system, but it's not something you should count on in a production environment. To the best of my knowledge, .NET implementations seem to be the most susceptible (they don't give up easily). Since it can only happen with near-matches, and you simply want a pass/fail, your regex will be fine, provided that you ensure that a reasonable timeout is in place to cut-off this degenerate behavior. |
@Relequestual, FYI: I got around the back-tracking by using look-ahead to chose which regex snippet to apply. It's those sneak peaks ahead of the current anchor point that makes my regex so long. Adding named capture groups to your regex would not overly complicate it, but they won't fix the bugs in your code. Your regex fails to reject 1.2.3-0123 and 1.2.3-0123.0123. Both are invalid due to leading zero in the purely numeric fields of the prerelease tag. |
@Relequestual, here's two more examples it failed to reject: 9.8.7+meta+meta |
@gvlx, I think I missed something in your last post, regarding what you are calling AlphanumIdentifierNLZ. That terminology is incorrect. It should be something like AllNumericNoLeadingZerosIdentifier. Alphanumeric identifiers do not have the leading zero prohibition, it only applies to all numeric identifiers, so -0zyz is a legal prerelease tag according to the spec. |
Depending on your regex engine, you could also look at atomic groups or possessive quantifiers to prevent backtracking. |
@jwdonahue, the name comes from the BNF, as pointed out in semver/semver#181 (comment). Yes, not the best choice but I kept it for consistency with the BNF. As you can read, it is a modified alphanumeric identifier which cannot have a leading zero. This only applies to the This is also the reason @Relequestual's #59 (comment) is not correct. I still wonder if making the two step matching approach I suggested would improve performance as each regex would much simpler. |
@gvlx, are we talking past each other? Is there something I've missed? Nothing in the spec prohibits alphanumeric identifiers from having leading zeros. Only the pure numeric identifiers have that prohibition on them. I've been trying to get this point across to you and others for some time now. -0123zyx is a legal prerelease tag, even with the leading zero (maybe that wasn't obvious with your display font?). -0123 is not legal, due to the leading zero, -0 is legal. So -0.0cba is also legal. It's an important nuance that adds unavoidable complexity to the regex. So again, I argue that the NLZ postfix belongs on the numeric identifier tag, not the alphanumeric identifier. In fact, the no-leading-zero numeric-identifier is a perfect description of the Major, Minor and Patch fields as well. NumericIdentifierNLZ applies to those majore, minor, patch and any numeric identifiers in the prerelease tag. |
@gvlx, I looked at atomic groups, but they aren't widely supported and weren't required to avoid the back-tracking problem and I am pretty sure that possessive quantifiers won't work due to all the cases where a zero is allowed to follow a dot. |
I am going to say this one more time, for clarity: There are three classes of prerelease identifiers, just a zero, all numeric with no leading zero, and alphanumeric (leading zeros allowed). |
I just ran @DavidFichtmueller's suggested regex against our current data set and it passed with flying colors. See semver/semver#232. All it lacks are named capture groups. |
Thanks @jwdonahue, I guess I'll need to revise my regex! I'm mostly expecting this to be used in cases where there are no pre-release or meta information provided, but good to note the potential issues! |
Hello all, Back in June I played around with the regex from both @gvlx and @jwdonahue and came up with some solution for named grouping, that I further condensed down to a regex without grouping, and then a regex with POSIX /compatibility/ (I say this loosely, cause it 'worked for me') I put the expressions in the following Github repository along with some notes: Here's the named capture group expression: Edit: This solution hasn't been checked for every edge case, but it does cover all the ones provided by @jwdonahue and it also processes Edit: |
Okay, I think we've converged a good set of regex's over on semver/semver#232. @coolaj86 please close this thread at your earliest possible convenience. |
@jwdonahue Thanks for your reply to my gist and for pointing me to this issue. I've updated my semver parsing and comparing library for XQuery, which I've rewritten to use the regex and test version strings shown here. Comments welcome - though gist won't notify me of comments, so I'd appreciate a ping here or via Twitter (same username there). https://gist.github.com/joewiz/b349e2853a17bf817e5d0013d01fa9f9 |
It seems that PR #460 has been completed. We now have a pair of well vetted pair of regex's included in the spec. Can we please close this discussion now? |
See semver/semver#32
JavaScript RegExp:
/^(([\d+)\.(\d|)\.(\d+))(?:-([\dA-Za-z\-]+(?:\.[\dA-Za-z\-]+)*))?(?:\+([\dA-Za-z\-]+(?:\.[\dA-Za-z\-]+)*))?$/
JavaScript Reference Implementation:
https://github.com/coolaj86/semver-utils
I would suggest having two sections - one for tested Regular Expressions that match and another for modules / reference implementations.
There should be at least canonical reference implementation of a parser / validator with an api that would make sense to copy in any language.
The text was updated successfully, but these errors were encountered: