Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer/PHP: bug fix for double quoted strings using ${ #3604

Conversation

jrfnl
Copy link
Contributor

@jrfnl jrfnl commented Jun 2, 2022

While creating a sniff for PHPCompatibility to detect the PHP 8.2 deprecation of two of the four syntaxes to embed variables/expressions within text strings, I realized that for select examples of the "type 4" syntax - "Variable variables (“${expr}”, equivalent to (string) ${expr})" -, PHPCS did not, and probably never did, tokenize those correctly.

Double quoted strings in PHPCS are normally tokenized as one token (per line), including any and all embedded variables and expressions to prevent problems in the scope map caused by braces within the string/expression.

However, for some double quoted strings using ${ syntax, this was not the case and these were tokenized as outlined below, which would still lead to the problems in the scope map.

Luckily, these syntax variants aren't used that frequently. Though I suspect if they were, this bug would have been discovered a lot sooner.

Either way, fixed now.

Including adding a dedicated test file based on examples related to the PHP 8.2 RFC.

Note: this test file tests that these embedding syntaxes are tokenized correctly, it doesn't test the other parts of the related Tokenizers\PHP code block (re-tokenizing to one token per line, handling of binary casts).

Refs:


Example of some code for which the tokenizer was buggy:

echo "${foo["${bar}"]}";

Was originally tokenized like so:

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  2 | L3 | C  1 | CC 0 | ( 0) | T_ECHO                     | [  4]: echo
  3 | L3 | C  5 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  4 | L3 | C  6 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING     | [  8]: "${foo["
  5 | L3 | C 14 | CC 0 | ( 0) | T_DOLLAR_OPEN_CURLY_BRACES | [  2]: ${
  6 | L3 | C 16 | CC 0 | ( 0) | T_STRING_VARNAME           | [  3]: bar
  7 | L3 | C 19 | CC 0 | ( 0) | T_CLOSE_CURLY_BRACKET      | [  1]: }
  8 | L3 | C 20 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING     | [  4]: "]}"
  9 | L3 | C 24 | CC 0 | ( 0) | T_SEMICOLON                | [  1]: ;
 10 | L3 | C 25 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

Will be tokenized (correctly) like so with the fix included in this PR:

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  2 | L3 | C  1 | CC 0 | ( 0) | T_ECHO                     | [  4]: echo
  3 | L3 | C  5 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  4 | L3 | C  6 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING     | [ 18]: "${foo["${bar}"]}"
  5 | L3 | C 24 | CC 0 | ( 0) | T_SEMICOLON                | [  1]: ;
  6 | L3 | C 25 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

While creating a sniff for PHPCompatibility to detect the PHP 8.2 deprecation of two of the four syntaxes to embed variables/expressions within text strings, I realized that for select examples of the "type 4" syntax - "Variable variables (“${expr}”, equivalent to (string) ${expr})" -, PHPCS did not, and probably never did, tokenize those correctly.

Double quoted strings in PHPCS are normally tokenized as one token (per line), including any and all embedded variables and expressions to prevent problems in the scope map caused by braces within the string/expression.

However, for some double quoted strings using `${` syntax, this was not the case and these were tokenized as outlined below, which would still lead to the problems in the scope map.

Luckily, these syntax variants aren't used that frequently. Though I suspect if they were, this bug would have been discovered a lot sooner.

Either way, fixed now.

Including adding a dedicated test file based on examples related to the PHP 8.2 RFC.

Note: this test file tests that these embedding syntaxes are tokenized correctly, it doesn't test the other parts of the related `Tokenizers\PHP` code block (re-tokenizing to one token per line, handling of binary casts).

Example of the code for which the tokenizer was buggy:
```php
"${foo["${bar}"]}";
```

Was originally tokenized like so:
```
Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  2 | L3 | C  1 | CC 0 | ( 0) | T_ECHO                     | [  4]: echo
  3 | L3 | C  5 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  4 | L3 | C  6 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING     | [  8]: "${foo["
  5 | L3 | C 14 | CC 0 | ( 0) | T_DOLLAR_OPEN_CURLY_BRACES | [  2]: ${
  6 | L3 | C 16 | CC 0 | ( 0) | T_STRING_VARNAME           | [  3]: bar
  7 | L3 | C 19 | CC 0 | ( 0) | T_CLOSE_CURLY_BRACKET      | [  1]: }
  8 | L3 | C 20 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING     | [  4]: "]}"
  9 | L3 | C 24 | CC 0 | ( 0) | T_SEMICOLON                | [  1]: ;
 10 | L3 | C 25 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:
```

Will be tokenized (correctly) like so with the fix included in this PR:
```
Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  2 | L3 | C  1 | CC 0 | ( 0) | T_ECHO                     | [  4]: echo
  3 | L3 | C  5 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  4 | L3 | C  6 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING     | [ 18]: "${foo["${bar}"]}"
  5 | L3 | C 24 | CC 0 | ( 0) | T_SEMICOLON                | [  1]: ;
  6 | L3 | C 25 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:
```

Refs:
* https://www.php.net/manual/en/language.types.string.php#language.types.string.parsing
* https://wiki.php.net/rfc/deprecate_dollar_brace_string_interpolation
* https://gist.github.com/iluuu1994/72e2154fc4150f2258316b0255b698f2
@jrfnl jrfnl force-pushed the feature/tokenizer-php-bugfix-double-quoted-strings branch from b9751e8 to 150c8c4 Compare June 3, 2022 15:07
@gsherwood gsherwood added this to the 3.7.0 milestone Jun 13, 2022
gsherwood added a commit that referenced this pull request Jun 13, 2022
@gsherwood gsherwood merged commit fe4dde5 into squizlabs:master Jun 13, 2022
@jrfnl jrfnl deleted the feature/tokenizer-php-bugfix-double-quoted-strings branch June 13, 2022 06:26
@gsherwood
Copy link
Member

I had an old version of 1.3.2 lying around so I checked it on that - didn't tokenize correctly. So I think you're right; it's never worked.

But it does now. Thanks a lot for fixing this.

@jrfnl
Copy link
Contributor Author

jrfnl commented Jun 13, 2022

Thanks for getting this merged. This was an unexpected bonus bug find from the PHP 8.2 deprecation 😆

Also interesting to have it confirmed that my suspicion was correct (and even more interesting that apparently nobody had noticed in all that time). Thanks for checking!

jrfnl added a commit to jrfnl/PHP_CodeSniffer that referenced this pull request Oct 26, 2022
…terpolated strings

Similar to previous PR squizlabs#3604 which fixed the tokenization for complex double quoted strings with interpolated variables/expressions and added tests for it.
jrfnl added a commit to jrfnl/PHP_CodeSniffer that referenced this pull request Oct 26, 2022
…terpolated strings

Similar to previous PR squizlabs#3604 which fixed the tokenization for complex double quoted strings with interpolated variables/expressions and added tests for it.
gsherwood pushed a commit that referenced this pull request Dec 22, 2022
…terpolated strings

Similar to previous PR #3604 which fixed the tokenization for complex double quoted strings with interpolated variables/expressions and added tests for it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants