Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in CSS tokenizer #3906

Merged
merged 2 commits into from
Oct 28, 2023
Merged

Conversation

fredden
Copy link
Contributor

@fredden fredden commented Oct 25, 2023

Yes, I know that the non-PHP tokenizers are going away in version 4.0. Meanwhile, the 3.x release line has a bug which can be fixed.

Description

While working with phpcbf, I noticed that some LESS files (which are parsed as CSS due to the specific ruleset being used) had unexpected content injected into them.

/* file: test.less */
// /**
//  * this is a comment
//  * this is also a comment
//  */

(Note the extra trailing newline at the end of the file.)

phpcs --basepath=. -s --standard=Squiz --sniffs=Squiz.WhiteSpace.SuperfluousWhitespace --extensions=less/CSS test.less

FILE: test.less
-------------------------------------------------------------------------------------------------------------
FOUND 1 ERROR AFFECTING 1 LINE
-------------------------------------------------------------------------------------------------------------
 7 | ERROR | [x] Additional whitespace found at end of file
   |       |     (Squiz.WhiteSpace.SuperfluousWhitespace.EndFile)
-------------------------------------------------------------------------------------------------------------
PHPCBF CAN FIX THE 1 MARKED SNIFF VIOLATIONS AUTOMATICALLY
-------------------------------------------------------------------------------------------------------------

Time: 45ms; Memory: 10MB

After running phpcbf (with the same flags as above), I get a file with this content:

/* file: test.less */
// /**
?>^PHPCS_CSS_T_CLOSE_TAG^^PHPCS_CSS_T_CLOSE_TAG^//  * this is a comment
//  * this is also a comment
//  */

(Note the extra newline at the end of the file has been fixed.)

Suggested changelog entry

Fixed bug in CSS tokenizer

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

PR checklist

  • I have checked there is no other PR open for the same change.
  • I have read the Contribution Guidelines.
  • I grant the project the right to include and distribute the code under the BSD-3-Clause license (and I have the right to grant these rights).
  • I have added tests to cover my changes.
  • I have verified that the code complies with the projects coding standards.

@jrfnl
Copy link
Contributor

jrfnl commented Oct 25, 2023

@fredden Thanks for this PR. While I'm not keen on putting any more effort into the CSS tokenizer, the outcome of this bug is pretty undesirable, so I'm willing to accept this fix.

✔️ I have confirmed and been able to reproduce the issue.
✔️ I have confirmed that the added test demonstrates the issue and can safeguard the fix without needing to add a dedicated test file for this issue for the CSS tokenizer.
✔️ I have confirmed that the issue is gone with the proposed fix.

However, looking at the token output, the problem is actually bigger and not really fixed.

Originally, the code sample in the description above was tokenized like so:

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  0]:
  1 | L1 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  2 | L1 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  3 | L1 | C  4 | CC 0 | ( 0) | T_DOC_COMMENT_OPEN_TAG     | [  3]: /**
  4 | L1 | C  7 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE   | [  0]:

  5 | L2 | C  1 | CC 0 | ( 0) | T_DOC_COMMENT_STRING       | [  2]: ?>
  6 | L2 | C  3 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  7 | L2 | C  5 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
  8 | L2 | C  7 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
  9 | L2 | C  8 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 10 | L2 | C  9 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 11 | L2 | C 13 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 12 | L2 | C 14 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 13 | L2 | C 16 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 14 | L2 | C 17 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 15 | L2 | C 18 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 16 | L2 | C 19 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 17 | L2 | C 26 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 18 | L3 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 19 | L3 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 20 | L3 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 21 | L3 | C  6 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 22 | L3 | C  7 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 23 | L3 | C 11 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 24 | L3 | C 12 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 25 | L3 | C 14 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 26 | L3 | C 15 | CC 0 | ( 0) | T_STRING                   | [  4]: also
 27 | L3 | C 19 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 28 | L3 | C 20 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 29 | L3 | C 21 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 30 | L3 | C 22 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 31 | L3 | C 29 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 32 | L4 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 33 | L4 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 34 | L4 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 35 | L4 | C  6 | CC 0 | ( 0) | T_DIVIDE                   | [  1]: /
 36 | L4 | C  7 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 37 | L5 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 38 | L6 | C  1 | CC 0 | ( 0) | T_CLOSE_TAG                | [  0]:

Take note of token 5, the inserted ?>, which looks to be the cause of the problem for the Squiz sniff.

With this fix applied, the code sample is tokenized as

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  0]:
  1 | L1 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  2 | L1 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  3 | L1 | C  4 | CC 0 | ( 0) | T_DOC_COMMENT_OPEN_TAG     | [  3]: /**
  4 | L1 | C  7 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE   | [  0]:

  5 | L2 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
  6 | L2 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
  7 | L2 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
  8 | L2 | C  6 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  9 | L2 | C  7 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 10 | L2 | C 11 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 11 | L2 | C 12 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 12 | L2 | C 14 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 13 | L2 | C 15 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 14 | L2 | C 16 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 15 | L2 | C 17 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 16 | L2 | C 24 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 17 | L3 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 18 | L3 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 19 | L3 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 20 | L3 | C  6 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 21 | L3 | C  7 | CC 0 | ( 0) | T_STRING                   | [  4]: this
 22 | L3 | C 11 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 23 | L3 | C 12 | CC 0 | ( 0) | T_STRING                   | [  2]: is
 24 | L3 | C 14 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 25 | L3 | C 15 | CC 0 | ( 0) | T_STRING                   | [  4]: also
 26 | L3 | C 19 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 27 | L3 | C 20 | CC 0 | ( 0) | T_STRING                   | [  1]: a
 28 | L3 | C 21 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 29 | L3 | C 22 | CC 0 | ( 0) | T_STRING                   | [  7]: comment
 30 | L3 | C 29 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 31 | L4 | C  1 | CC 0 | ( 0) | T_STRING                   | [  2]: //
 32 | L4 | C  3 | CC 0 | ( 0) | T_WHITESPACE               | [  2]: ⸱⸱
 33 | L4 | C  5 | CC 0 | ( 0) | T_MULTIPLY                 | [  1]: *
 34 | L4 | C  6 | CC 0 | ( 0) | T_DIVIDE                   | [  1]: /
 35 | L4 | C  7 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 36 | L5 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 37 | L6 | C  1 | CC 0 | ( 0) | T_CLOSE_TAG                | [  0]:

What I mean to point out by posting these tokenizations is that the comment line tokenization is completely borked for this code sample, what with all lines after the first line getting confused.

@fredden Did you notice this ? What is your opinion on this ?

As I said above, I'm okay with just accepting this fix as-is and not spending more time on this as CSS tokenizer support will be dropped, but I did want to document this finding.

@jrfnl
Copy link
Contributor

jrfnl commented Oct 25, 2023

@fredden I'm also trying to remember if // is even a valid comment format in CSS and IIRC it is not, so the fact that a file which would cause a "parse error" in CSS would lead to tokenization problems is not really surprising.

@fredden
Copy link
Contributor Author

fredden commented Oct 25, 2023

Thanks for looking at this @jrfnl

the comment line tokenization is completely borked

Yes, I did notice this. I don't yet understand why there's any special-case handling for comments in the CSS tokenizer. I decided to not spend additional brain-power working this out though, as the plan is to delete non-PHP tokenizers for version 4.0. This pull request doesn't make it any worse; I intentionally took a minimal-change approach.

if // is even a valid comment format in CSS

No, I don't think it is; I'll check... Some basic local testing with the three browsers I have installed currently all show that // is not treated as a comment. https://lesscss.org/#comments says that LESS supports that comment syntax; the file I was working with when the problem came to light was a LESS file. https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Organizing#comment_your_css shows only /* ... */ style comments.

@jrfnl
Copy link
Contributor

jrfnl commented Oct 26, 2023

if // is even a valid comment format in CSS

No, I don't think it is; I'll check... Some basic local testing with the three browsers I have installed currently all show that // is not treated as a comment. https://lesscss.org/#comments says that LESS supports that comment syntax; the file I was working with when the problem came to light was a LESS file. https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Organizing#comment_your_css shows only /* ... */ style comments.

@fredden Thanks for checking that.

IIRC the CSS tokenizer never had any support for LESS. If it worked, that should be considered coincidental, not by design.

As CSS doesn't support the // comment syntax, I'm now inclined to close this PR as accepting it would expand the scope of the CSS Tokenizer to a syntax which is not even valid in CSS.

@fredden
Copy link
Contributor Author

fredden commented Oct 26, 2023

As CSS doesn't support the // comment syntax, I'm now inclined to close this PR as accepting it would expand the scope of the CSS Tokenizer to a syntax which is not even valid in CSS.

I don't think it's valid for the CSS tokenizer to inject a rogue ?> token where it doesn't exist in the source file. The PHP tokenizer does not do this.

Example of PHP tokenizer handling a parse error without adding an extra ?> token

cat test.php

<?php

$var = 'white-space at end of this line';     

/** This is a parse error: unterminated comment.

phpcs --standard=PHPCSDebug test.php

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  5]: <?php

  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]

  2 | L3 | C  1 | CC 0 | ( 0) | T_VARIABLE                 | [  4]: $var
  3 | L3 | C  5 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  4 | L3 | C  6 | CC 0 | ( 0) | T_EQUAL                    | [  1]: =
  5 | L3 | C  7 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  6 | L3 | C  8 | CC 0 | ( 0) | T_CONSTANT_ENCAPSED_STRING | [ 33]: 'white-space at end of this line'
  7 | L3 | C 41 | CC 0 | ( 0) | T_SEMICOLON                | [  1]: ;
  8 | L3 | C 42 | CC 0 | ( 0) | T_WHITESPACE               | [  5]: ⸱⸱⸱⸱⸱

  9 | L4 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

 10 | L5 | C  1 | CC 0 | ( 0) | T_DOC_COMMENT_OPEN_TAG     | [  3]: /**
 11 | L5 | C  4 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE   | [  1]: ⸱
 12 | L5 | C  5 | CC 0 | ( 0) | T_DOC_COMMENT_STRING       | [ 44]: This is a parse error: unterminated comment.
 13 | L5 | C 49 | CC 0 | ( 0) | T_DOC_COMMENT_WHITESPACE   | [  0]:

 14 | L6 | C  1 | CC 0 | ( 0) | T_DOC_COMMENT_CLOSE_TAG    | [  0]:

Note there are no rogue ?> tokens in this output.


I haven't seen any cases of phpcs causing issues with this (invalid CSS) syntax, but I have not gone looking for any cases either. The browsers that I have installed here all cope gracefully with a "selector" which has // in it, and parse the rest of the CSS in the file/element as expected. I don't know to what extent people are using PHP_CodeSniffer on CSS files.

In my experience, this bug is mostly masked. It shows up when PHP_CodeSniffer tries to re-write the file as it will output all the tokens into the file after phpcbf has done its magic. With this extra / rogue token*, the resulting file is different in more ways than just what the selected sniffs have intentionally modified.

* I have not looked into why ^PHPCS_CSS_T_CLOSE_TAG^ gets added twice during auto-fixing, as eliminating the rogue ?> makes that symptom go away.

@jrfnl jrfnl added this to the 3.8.0 milestone Oct 28, 2023
@jrfnl
Copy link
Contributor

jrfnl commented Oct 28, 2023

@fredden Okay, I'm going to merge this PR as the rogue token output is just weird. Having said that, I would caution that further PRs adding support for LESS to the CSS tokenizer are strongly discouraged.

@jrfnl jrfnl merged commit 01167c3 into squizlabs:master Oct 28, 2023
26 of 27 checks passed
@fredden fredden deleted the css/close-tag-noise branch October 28, 2023 16:16
@fredden
Copy link
Contributor Author

fredden commented Oct 28, 2023

Agreed. LESS and CSS are not the same and one shouldn't be parsed as the other. I look forward to version 4.0 where these other-language tokenizers will be gone.

jrfnl added a commit that referenced this pull request Oct 28, 2023
@jrfnl
Copy link
Contributor

jrfnl commented Dec 8, 2023

FYI: this fix is included in today's PHP_CodeSniffer 3.8.0 release.

As per #3932, development on PHP_CodeSniffer will continue in the PHPCSStandards/PHP_CodeSniffer repository. If you want to stay informed, you may want to start "watching" that repo (or watching releases from that repo).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants