Improved tokenizing of context sensitive keywords #3484

kukulich · 2021-11-21T11:59:30Z

Other tokens should be added after these PRs are merged:

kukulich · 2021-12-17T09:07:43Z

@gsherwood Please, this PR should be merged before #3478 and #3483.

It should really help in the future. I've already simplified the support for T_READONLY. I also cannot write some tests for T_ENUM_CASE without this PR.

src/Tokenizers/PHP.php

gsherwood · 2021-12-21T02:33:46Z

Left a couple of comments for minor stuff, but looks good to merge otherwise.

jrfnl

Nice one! This would be a great step forward, but there's still quite a lot more to do in this regard (though not necessarily all in this PR). Also see: #3336

Note: I locally made a start on a PR to handle this when I opened that issue, but got side-tracked before I could finish it as it is hard to get that one right, especially when trying to account for all situations, including multi-use, group use etc.
Happy to share the code I have though.

Regarding this PR:

I wonder if there aren't more chunks of code in the tokenizer after this new block which could now be removed or simplified?

In particular, I'm looking at the "string-like token after a function keyword" block around line 1740 and the "special case for PHP 5.6 use function and use const" block around line 2168.

More extensive unit tests would also help catch edge cases.

jrfnl · 2021-11-21T14:44:54Z

src/Tokenizers/PHP.php

+                    $preserveKeyword = true;
+                }
+
+                // `namespace\` should be preserved


What about namespace ... ;, i.e. a namespace declaration? Every part of the name should be tokenized as T_STRING.

There's a test: 6f991b5#diff-884ebc6b1d508f7200a3ef0edc6dd662e97072d9fec37dd6a4e374a2d835f6d4R3-R4

That's not what I meant (please unresolve this).

Think:

namespace foreach;
namespace my\class\extensions;

Added test case.

Noted. That test case still only safeguards the first of the above examples though, the current tokenizer would still fail on the second.

Added one more test :)

Thanks. Would be wonderful though if the other parts of the namespace name would also be checked. The test currently only checks the first part, i.e. my.

What I'm trying to point out, is that on PHP < 8.0, this snippet:

<?php

namespace my\class\extensions;
namespace my\foreach\yield;

... would be tokenized as - take note of class, yield and foreach:

Ptr | Ln | Col  | Cond | ( #) | Token Type     | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG     | [5]: <?php
  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE   | [0]:
  2 | L3 | C  1 | CC 0 | ( 0) | T_NAMESPACE    | [9]: namespace
  3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE   | [1]:
  4 | L3 | C 11 | CC 0 | ( 0) | T_STRING       | [2]: my
  5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
  6 | L3 | C 14 | CC 0 | ( 0) | T_CLASS        | [5]: class
  7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
  8 | L3 | C 20 | CC 0 | ( 0) | T_STRING       | [10]: extensions
  9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON    | [1]: ;
 10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE   | [0]:
 11 | L4 | C  1 | CC 0 | ( 0) | T_NAMESPACE    | [9]: namespace
 12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE   | [1]:
 13 | L4 | C 11 | CC 0 | ( 0) | T_STRING       | [2]: my
 14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
 15 | L4 | C 14 | CC 0 | ( 0) | T_FOREACH      | [7]: foreach
 16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
 17 | L4 | C 22 | CC 0 | ( 0) | T_YIELD        | [5]: yield
 18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON    | [1]: ;
 19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE   | [0]:

... while with your change, it will be tokenized (better), like so:

Ptr | Ln | Col  | Cond | ( #) | Token Type     | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG     | [5]: <?php
  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE   | [0]:
  2 | L3 | C  1 | CC 0 | ( 0) | T_NAMESPACE    | [9]: namespace
  3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE   | [1]:
  4 | L3 | C 11 | CC 0 | ( 0) | T_STRING       | [2]: my
  5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
  6 | L3 | C 14 | CC 0 | ( 0) | T_STRING       | [5]: class
  7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
  8 | L3 | C 20 | CC 0 | ( 0) | T_STRING       | [10]: extensions
  9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON    | [1]: ;
 10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE   | [0]:
 11 | L4 | C  1 | CC 0 | ( 0) | T_NAMESPACE    | [9]: namespace
 12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE   | [1]:
 13 | L4 | C 11 | CC 0 | ( 0) | T_STRING       | [2]: my
 14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
 15 | L4 | C 14 | CC 0 | ( 0) | T_STRING       | [7]: foreach
 16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
 17 | L4 | C 22 | CC 0 | ( 0) | T_STRING       | [5]: yield
 18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON    | [1]: ;
 19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE   | [0]:

There is no difference on PHP 8.0+ as the "retokenization of namespaced names" already takes care of it in that case.
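For anyone wanting to reproduce the native side of this comparison, the following is a minimal sketch that dumps what PHP's own token_get_all() returns for the snippet above. It does not run the PHPCS tokenizer, so it only shows the input that PHPCS starts from; dumpNativeTokens() is a hypothetical helper name, not part of PHP_CodeSniffer. On PHP 8.0+ each namespace name should come back as a single T_NAME_QUALIFIED token, while older versions return the individual keyword tokens.

```php
<?php
// Illustrative sketch only: lists PHP's native tokens for a code string.
// dumpNativeTokens() is a hypothetical helper, not a PHP_CodeSniffer API.
function dumpNativeTokens(string $code): array
{
    $rows = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token) === true) {
            // Array tokens have the shape [id, text, line].
            $rows[] = [token_name($token[0]), $token[1]];
        } else {
            // Single-character tokens such as ";" are plain strings.
            $rows[] = ['(char)', $token];
        }
    }

    return $rows;
}

$code = "<?php\n\nnamespace my\\class\\extensions;\nnamespace my\\foreach\\yield;\n";

foreach (dumpNativeTokens($code) as [$name, $text]) {
    // json_encode() makes whitespace and backslashes visible in the output.
    printf("%-20s %s\n", $name, json_encode($text));
}
```

Running this under both PHP 7.4 and PHP 8.0+ should make the difference between the two native token streams visible.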

Ok, now I finally get it :) Added.

jrfnl · 2021-12-21T08:07:00Z

src/Tokenizers/PHP.php

@@ -1113,6 +1165,7 @@ protected function tokenize($string)
                && $tokenIsArray === true
                && $token[0] === T_STRING
                && strtolower($token[1]) === 'yield'
+                && isset($this->tstringContexts[$finalTokens[$lastNotEmptyToken]['code']]) === false


Should a similar condition be added for the match retokenization around line 1439? And in that case, could the "final check" in PHP::processAdditional() possibly be removed or simplified?

Similar question for the fn condition around line 1721.

T_MATCH can probably be simplified, but I'm not sure about this test case:

/* testLiveCoding */
// Intentional parse error. This has to be the last test in the file.
echo match

The token is currently tokenized as T_STRING but I'm not sure if it's right.
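The idea behind the new condition, as far as this thread describes it, is a lookup on the previous non-empty token: if that token is one after which a name is expected, the keyword-like token stays a T_STRING. Below is a simplified sketch of that check on PHP's native token stream. It is not the PHPCS implementation; exampleContexts(), findTokenIndex() and isKeywordUsedAsName() are hypothetical names, and of the map entries only T_PAAMAYIM_NEKUDOTAYIM is confirmed by this thread, the rest are assumptions for illustration.

```php
<?php
// Hypothetical, simplified context map. Only T_PAAMAYIM_NEKUDOTAYIM is
// mentioned in this thread; the other entries are illustrative assumptions.
function exampleContexts(): array
{
    return [
        T_OBJECT_OPERATOR      => true,
        T_PAAMAYIM_NEKUDOTAYIM => true,
        T_FUNCTION             => true,
        T_CONST                => true,
    ];
}

// Returns the index of the first array token whose text matches, or -1.
function findTokenIndex(array $tokens, string $text): int
{
    foreach ($tokens as $i => $token) {
        if (is_array($token) === true && strtolower($token[1]) === $text) {
            return $i;
        }
    }

    return -1;
}

// True when the previous non-empty token is listed in the context map.
function isKeywordUsedAsName(array $tokens, int $index, array $contexts): bool
{
    for ($i = ($index - 1); $i >= 0; $i--) {
        $token = $tokens[$i];
        if (is_array($token) === false) {
            // Plain character tokens such as "{" or ";" are never contexts here.
            return false;
        }

        if (in_array($token[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true) === true) {
            continue;
        }

        return isset($contexts[$token[0]]);
    }

    return false;
}

$tokens = token_get_all('<?php $obj->yield;');
var_dump(isKeywordUsedAsName($tokens, findTokenIndex($tokens, 'yield'), exampleContexts()));
```

The live-coding case above (echo match at the end of the file) is exactly where a look-behind check like this cannot help, because the deciding information would come after the token rather than before it.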

jrfnl · 2021-12-21T08:33:43Z

src/Util/Tokens.php

+        T_MATCH        => T_MATCH,
+        T_NAMESPACE    => T_NAMESPACE,
+        T_NEW          => T_NEW,
+        T_PARENT       => T_PARENT,


T_PARENT is a PHPCS-native token and doesn't exist in PHP itself.

The test is failing when I remove the T_PARENT token.

1) PHP_CodeSniffer\Tests\Core\Tokenizer\ContextSensitiveKeywordsTest::testKeywords with data set #19 ('/* testParentIsKeyword */', 'T_PARENT')
Failed to find test target token for comment string: /* testParentIsKeyword */
Failed asserting that true is false.

@kukulich That's because it can't find the token after the comment (as the custom token is no longer in the list), not because the tokenization is wrong.

@jrfnl Yes, you're right. However I'm not sure if the tokens should be removed here or not...

I suppose that depends on where else this token list will be used, but for the current use case, those tokens aren't needed in the list.
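To illustrate the point about T_PARENT being PHPCS-native, here is a small sketch that checks how plain PHP tokenizes the parent keyword. It assumes it is run without PHP_CodeSniffer loaded (PHPCS defines its own T_PARENT constant at load time); nativeTokenNameFor() is a hypothetical helper, not a PHPCS function.

```php
<?php
// Illustrative sketch: returns the native token name for the first token
// whose text matches $text, or null when no such token exists.
// nativeTokenNameFor() is a hypothetical helper, not a PHP_CodeSniffer API.
function nativeTokenNameFor(string $code, string $text): ?string
{
    foreach (token_get_all($code) as $token) {
        if (is_array($token) === true && strtolower($token[1]) === $text) {
            return token_name($token[0]);
        }
    }

    return null;
}

// PHP's own lexer has no dedicated token for "parent".
echo nativeTokenNameFor('<?php parent::__construct();', 'parent'), PHP_EOL;

// Whether T_PARENT is defined depends on PHP_CodeSniffer having been loaded.
var_dump(defined('T_PARENT'));
```

Since PHP itself reports parent as a T_STRING, any T_PARENT token can only come from retokenization inside PHPCS.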

Removed for now.

src/Util/Tokens.php

tests/Core/Tokenizer/ContextSensitiveKeywordsTest.php

kukulich · 2021-12-21T09:42:52Z

@jrfnl

In particular, I'm looking at the "string-like token after a function keyword" block around line 1740 and the "special case for PHP 5.6 use function and use const" block around line 2168.

"string-like token after a function keyword" is gone.

"special case for PHP 5.6 use function and use const" cannot be removed (without other work) but "This is a special case for the PHP 5.5 classname::class syntax" is gone.

jrfnl · 2021-12-21T09:54:38Z

but "This is a special case for the PHP 5.5 classname::class syntax" is gone.

Nice find! Yeah, of course, with the T_PAAMAYIM_NEKUDOTAYIM in the $tstringContexts that would be redundant.

gsherwood · 2022-01-11T23:24:41Z

Just an update that I'm trying to merge this at the moment. I need to resolve a lot of conflicts in the 4.0 branch first, and adjust the namespace changes over there as well.

gsherwood · 2022-01-12T03:41:59Z

Finally got this merged and tested after GitHub had issues today, and it's all working.

Appreciate all the work getting this done. Keen to have anyone else run their eyes over things as well to make sure I haven't missed something obvious.

jrfnl · 2022-01-12T09:22:19Z

Keen to have anyone else run their eyes over things as well to make sure I haven't missed something obvious.

If you like, I'll try to have a look at the 4.x branch later in the week? (On that note: there are still a few PRs open for that branch.)

kukulich force-pushed the context-sensitive branch 3 times, most recently from 2a9833f to 1f2580d Compare November 21, 2021 12:29

kukulich marked this pull request as ready for review November 21, 2021 12:31

kukulich mentioned this pull request Nov 21, 2021

PHP 8.1: Added T_ENUM_CASE #3483

Merged

kukulich force-pushed the context-sensitive branch from 1f2580d to a51355f Compare December 17, 2021 08:41

gsherwood added this to the 3.7.0 milestone Dec 17, 2021

gsherwood reviewed Dec 21, 2021

src/Tokenizers/PHP.php (outdated)

gsherwood reviewed Dec 21, 2021

src/Tokenizers/PHP.php (outdated)

kukulich force-pushed the context-sensitive branch from a51355f to 5383484 Compare December 21, 2021 07:10

jrfnl reviewed Dec 21, 2021

kukulich force-pushed the context-sensitive branch 6 times, most recently from 2be5165 to 6ce46dc Compare December 21, 2021 09:37

kukulich force-pushed the context-sensitive branch 4 times, most recently from e4b1a34 to 8a71b96 Compare December 21, 2021 11:26

kukulich added 2 commits December 22, 2021 07:28

Improved tokenizing of context sensitive keywords

2b7bdb3

Removed dead code

c73f456

kukulich force-pushed the context-sensitive branch from 8a71b96 to c73f456 Compare December 22, 2021 06:28

gsherwood merged commit cfdc6c9 into squizlabs:master Jan 12, 2022

gsherwood added this to Idea Bank in PHPCS v3 Development via automation Jan 12, 2022

gsherwood moved this from Idea Bank to Ready for Release in PHPCS v3 Development Jan 12, 2022

kukulich deleted the context-sensitive branch January 12, 2022 07:30

jrfnl mentioned this pull request Feb 10, 2022

Tokenizer/PHP: bug fix - parent/static keywords in class instantiations #3546

Merged

jrfnl mentioned this pull request Mar 31, 2022

PHP 8.1 | Generic/LowerCaseKeyword: simplify registered tokens + add enum support #3574

Merged

jrfnl mentioned this pull request Jun 12, 2022

Method with name eval is always reported as error #3607

Closed

This was referenced Sep 20, 2022

PSR12/ClassInstantiation: fix regression for new parent #3669

Merged

Generic/FunctionCallArgumentSpacing: fix regression for new parent #3670

Merged

Squiz/OperatorBracket: fix regression for new parent #3671

Merged

jrfnl mentioned this pull request Oct 15, 2022

BCTokens: add support for new contextSensitiveKeywords token array PHPCSStandards/PHPCSUtils#360

Merged

jrfnl mentioned this pull request Jun 7, 2023

Fix mis-identification of 'readonly' keyword #3773

Closed

jrfnl mentioned this pull request Aug 22, 2023

PHP 8.0 | Tokenizer: Handle reserved keywords being used in namespaced names #3336

Closed

jrfnl mentioned this pull request Nov 9, 2023

Tokenizer/PHP: fix mis-identification of 'readonly' keyword icw PHP 8.2 DNF types PHPCSStandards/PHP_CodeSniffer#34

Merged
