Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C-99 comment rule flawed #609

Closed
ogdenpm opened this issue Nov 17, 2023 · 13 comments
Closed

C-99 comment rule flawed #609

ogdenpm opened this issue Nov 17, 2023 · 13 comments

Comments

@ogdenpm
Copy link

ogdenpm commented Nov 17, 2023

In section A.4.3 of the documentation, the C99 comment pattern is given as
("/*"([^*]|"*"[^/])*"*/")|("/"(\\\n)*"/"[^\n]*)

The first part of this for recognising /* ... */ patterns appears to be incorrect, specifically

/* some text **/ would fail as the ** would match "*"[^/] , preventing the */ matching.
Note I tried inserting a / between the "*" and the [^/] i.e. to check for a single * not followed by a /
i.e. ("/*"([^*]|"*"/[^/])*"*/")
but flex generates an unrecognized rule error

@zmajeed
Copy link
Contributor

zmajeed commented Dec 13, 2023

You're right - I think gobbling any trailing stars fixes the regex

"/*"([^*]|"*"[^/])*"*"+"/"

or to make it easier to read with whitespace ignored

(?x: "/*" ( [^*] | "*"[^/] )* "*"+"/" )

I'll make the change if you think this is right

@ogdenpm
Copy link
Author

ogdenpm commented Dec 13, 2023 via email

@zmajeed
Copy link
Contributor

zmajeed commented Dec 13, 2023

Sorry - that had a typo - I corrected it above - it should have been

(?x: "/*" ( [^*] | "*"[^/] )* "*"+"/" )

This accepts

/* text ** more text */

But it has another problem - stemming from the original regex - where it accepts invalid comments like

/* comment 1 **/ invalid missing comment start */

This is because for the intermediate **/ input, "*"[^/] consumes ** then [^*] consumes / and the lexer merrily continues to a successful match at the very last '*/`

@zmajeed
Copy link
Contributor

zmajeed commented Dec 14, 2023

I feel this cannot be done with Flex regex - it seems to require some lookahead to avoid prematurely consuming the star from a comment end delimiter */

I'm actually surprised that every single basic regex solution I found online is wrong! Some of these posts are decades old

The Flex doc also has a FAQ on matching C comments - there it only has couple example patterns that are clearly labelled wrong and doesn't purport to offer a working regex - that's probably what needs to be done for this section too

The trailing context "*"/[^/] you tried can't be used inside group parentheses - that's why you got the error - I've always avoided using it for this and other limitations

@ogdenpm
Copy link
Author

ogdenpm commented Dec 14, 2023 via email

@ogdenpm
Copy link
Author

ogdenpm commented Dec 14, 2023 via email

@ogdenpm
Copy link
Author

ogdenpm commented Dec 14, 2023 via email

@ogdenpm
Copy link
Author

ogdenpm commented Dec 15, 2023 via email

@zmajeed
Copy link
Contributor

zmajeed commented Dec 15, 2023

Yep - this works - now the middle part of the regex “*”+[^*/] only works for runs of stars that don't end at an end delimiter - this means an end delimiter can only be matched by the last part of the regex - and matching won't continue past the first end delimiter

Can test with grep

grep -Po '(?x: \/\* ([^*] | \*+[^*/])* \*+\/)' <<EOF
/* text ** more text */
/* some text **/
/* comment 1 *//* comment 2 */
/* comment 3 */    /* comment 4 */
/* comment 5 */
/* comment 6 */ invalid comment missing start after comment 6 */
/* comment 7 **/ invalid comment missing start after comment 7 */
/* comment 8 // **/ invalid comment missing start after comment 8 */
EOF
/* text ** more text */
/* some text **/
/* comment 1 */
/* comment 2 */
/* comment 3 */
/* comment 4 */
/* comment 5 */
/* comment 6 */
/* comment 7 **/
/* comment 8 // **/

and multiline comments

grep -z -Po '(?x: \/\* ([^*] | \*+[^*/])* \*+\/)' <<EOF
/* multiline
comment 9 */
EOF
/* multiline
comment 9 */

Incidentally the dotall flag (?s:) is not needed since there's no dot in the regex

@zmajeed
Copy link
Contributor

zmajeed commented Dec 16, 2023

The C++ comment regex doesn't account for newline escapes in the comment body either - it fails for

grep -z -Po '(?x: \/ (\\ \n)* \/ [^\n]* )' <<EOF
not comment before /\\
/ multiline split delimiter \\
> comment 1
> not comment after
> EOF
/\
/ multiline split delimiter \

Adding another match for escaped newlines after comment start fixes it

grep -z -Po '(?x: \/ (\\ \n)* \/ ( (\\ \n) | [^\n] )* )' <<EOF
not comment before /\\
/ multiline split delimiter \\
comment 1
not comment after
EOF
/\
/ multiline split delimiter \
comment 1

@zmajeed
Copy link
Contributor

zmajeed commented Dec 16, 2023

So the full correct regex for C and C++ comments is

("/*"([^*]|"*"[^*/])*"*"+"/")|("/"(\\\n)*"/"((\\\n)|[^\n])*)

or

(?x: ( "/*" ( [^*] | "*"+ [^*/] )* "*"+ "/" ) | ( "/" (\\ \n)* "/" ( (\\ \n) | [^\n] )* ) )

@zmajeed
Copy link
Contributor

zmajeed commented Dec 16, 2023

Fix in PR 614, #614

@westes
Copy link
Owner

westes commented Jan 9, 2024

fixed by #614

@westes westes closed this as completed Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants