Rewrite PHP lexer to support use statements, function declarations and type declarations #1489

pyrmont · 2020-04-08T03:52:27Z

PHP's use keyword can be used to (1) import namespaces into the current scope and (2) include traits in a class. While the PHP lexer currently supports use in some contexts, it doesn't allow for the importing of multiple namespaces at once, nor does it tokenise a given input correctly when use is used inside a class.

This PR improves the support for use. It uses a simple conditional to determine whether identifiers that are captured by the rules in the :uselist state are keywords, namespaces or function names.

It fixes #1353 and it fixes #1361.

pyrmont · 2020-04-08T03:54:07Z

@julp I'm sorry to ask but could I ask you to have a look at this? You mentioned in 826b516 that the commit fixed #1353 and #1361 but I don't think that's the case.

To better tokenise the identifiers in the :uselist state, I have assumed that (by convention at least) trait names are capitalised whereas function names always begin with lowercase. Please let me know if you see anything amiss.

pyrmont · 2020-04-08T03:58:17Z

@HypeMC Sorry it's taken so long to get to this. Here's what it looks like now:

Let me know if you have any thoughts.

julp · 2020-04-08T15:42:31Z

This is the question I was about to ask: for PHP, should we highlight names based on conventions?

class/trait names are pascal case (MyClass)
function names are snake case (mb_convert_case)
method names are camel case (isEnabled)
constant names are upper snake case (FOO_BAR)

(with a potential namespace in front of all of these - \My\Namespace\my_function, Foo\Bar\MyClass)

Also note that use can also be used to import variables from the parent scope into an anonymous function (function sort_on($a, $b) use($key) { ... })

I think we shouldn't bother to handle use statements (namespace related) with those states. It does not worth it, it's too complex and unmaintainable. There is so many edge cases to deal with.

use /*Foo, */ Bar;

<?php use Foo ?>

use /*not*/some\name\{
    function some_fn,
    #const Foo\BAR,
    SomeClass,
};

Also, what do you think about this approach?

pyrmont · 2020-04-09T03:24:01Z

@julp I got really excited by your idea and used it to further simplify the lexer. I made a couple of changes to how your code worked to make the tokens more specific (especially regarding functions) but it's largely the same concept.

I did have to fiddle with one of the tests. We previously had it so that the class keyword in code beginning an anonymous class (Class {) was tokenised as Keyword rather than Keyword::Declaration (which is how it's tokenised everywhere else). That doesn't make any sense to me. class is still functioning to declare a class regardless of whether the class is anonymous or not so I think the token should be the same.

What do you think now?

pyrmont · 2020-04-09T06:14:32Z

@julp I've probably become a bit carried away. I ended up replacing nsid with a new pattern (id_with_ns_and_paren) that does kind of what you expect from the name. Because it matches an opening parenthesis, the rule that matches that pattern is able to apply the Name::Function token to identifiers throughout the code that it can guess are functions. It seems to be working well on the sample.

Eager to hear your thoughts.

julp · 2020-04-09T17:57:44Z

Awesome, good job, it's way better!

Could we not lookup for a builtin if we are lexing one of those statements? See the log method (public function log(string $msg) { in spec/visual/samples/php), it is highlighted as a builtin instead of a function (elsif self.builtins.include? name => elsif !@next_token and self.builtins.include? name?)

Some more points I've missed:

fn is kind of another keyword for function
for typehinting, builtin types excluded (the rule above), a ? at the beginning of a name means it's a class (eg ?mysqli, ?\Some\Class, ?An\Other\Class) but only on a function/fn declaration and class properties because ?FOO can also be found on a ternary $condition?FOO:BAR or a coalesce $exist??DEFAULT but people usually put spaces around operators

pyrmont · 2020-04-10T00:38:39Z

@julp I kind of feel like using a stack is the wrong way to go about it but I couldn't think of a better way so I brought the concept into this PR. Rather than refer to it as a 'stack', I called it a 'magic token' and changed the API a little bit. I also kept the name-guessing heuristic (and kept the broader identifier pattern that I had introduced last time to make that possible).

What do you think? One difference is that your code has this rule about closing parentheses. I didn't see the point of it so have left it out but let me know if there's something that I'm missing.

julp · 2020-04-10T16:03:15Z

A state machine would be a better fit (independant from lexer states to handle only relevant tokens).

The more you want to handle, the more complex it becomes. At start, it seemed easy but you add one case then another then... (it's PHP, not Ruby or Erlang/Elixir sadly) Based only on conventions, the result is not so bad, it seems pretty accurate to me and probably is the best compromise accuracy/simplicity.

One difference is that your code has this rule about closing parentheses. I didn't see the point of it so have left it out but let me know if there's something that I'm missing.

For something like function connect(string $password): mysqli;, it's to switch on Name::Class for mysqli (instead to get a Name::Constant)

For fn, => should act like { or ; for function: fn and function are quite equivalent but the syntax differs :/ One more case to manage specifically.

To avoid a conflict on typehinting with nullable types (?classname or ?builtintype) and the ternary and coalesce operators, it would require a stack on scopes (class, function, ... - push on the corresponding { and pop on }) to choose the good one but it's too difficult to implement.

class Foo extends AbstractFoo implements Bar {
    private ?PDO $dbh = NULL;
}

$a = fn (int $a, int $b): int => FOO + $a <=> $b;
echo $a(1, 2);

echo self::class;
echo FOO;

echo $condition?FOO:BAR;
echo $condition??DEFAULT;

Someone else tried the code and/or have any thoughts?

pyrmont · 2020-04-12T07:48:03Z

@julp Thanks again for all the help so far. I'm still having a think about the best way to do things. I agree that a state machine is largely inevitable but some of my hesitancy comes from the fact that Rouge already provides a state machine with the states in the lexer. Having an additional state machine feels duplicative and inconsistent with the way the other lexers are implemented.

julp · 2020-04-13T15:25:51Z

Sure, I understand.

What ever we could do, we will never get something 100% accurate. I think we have to live with that and keep things as easy and simple as possible in order to facilitate maintenance and future evolvements inherent to the language.

Thanks for your time and support as well.

pyrmont · 2020-07-07T08:01:25Z

@julp I know it's been a few months since the last update but I hadn't completely forgotten about this PR (although you could have been forgiven for thinking otherwise!).

I've made some rather large changes to the structure but I think the result is easier to understand and easier to extend. The main idea was to move from using a :template state and a :php state to using a :root state and a :php state. This mean that when ?> is hit, the state of the lexer can be reset by simply calling reset_stack. Once I had that in place, I put a series of general rules in :php that I then vary by going into various :in_ states when hitting particular keywords (e.g. use goes into :in_use, trait goes into :in_declaration, etc).

I also revised the PHP visual sample; ripping out the copypasta and trying to group examples together so that they are more logically arranged.

II don't know if you've got time at the moment but if you have, could you have a look and let me know what you think?

pyrmont · 2020-07-07T09:35:59Z

Here's what it looks like at the moment:

julp · 2020-07-10T21:07:24Z

Hi,

you did a great job however PHP is not a mature language with some kind of final/established syntax like many others. PHP 8.0.0 is already in alpha and some major changes are coming: union-types, attributes (a replacement for annotations as comments), pattern matching (an alternative to switch/case - seems not yet present in alpha releases), a mixed type, ... (sorry, I feel your pain)

I'm a little worried, with such a technique, on how difficult it might become to maintain the lexer (it begins to do what a parser should in my opinion). For now, you don't have so many intermediate states so it seems manageable but with PHP, expect regular changes (= always more cases to handle).

pyrmont · 2020-07-10T23:45:23Z

@julp Well, we also have a JavaScript lexer so this isn't a completely unknown problem :P

I did see some of these proposals and structured this lexer partly with extensibility in mind. The reason that I mixed in the :return state to the end of all the :in_ states is that it means there won't really be any error tokens if unknown syntax is encountered. At worst, a state will be left prematurely and that'll screw up the highlighting after that point but we have issues like that all the time and we just fix it as and when the need arises.

If the lexer looks like it's highlighting PHP correctly (for now), I think we can merge it in and then handle changes to the syntax in due course.

julp · 2020-07-12T20:45:20Z

Hi,

I haven't tested it on invalid scripts but the result seems really good.

I found one issue: the leading \ for FQDN names (all of them - constants, classes, functions, ...) appears as error

echo collection_radio_buttons('date', new \DatePeriod(new \DateTime, \DateInterval::createFromDateString('1 day'), 7), format_datetime_as_Ymd(date_create('tomorrow')), 'format_datetime_as_Ymd', 'l');

On the above example DatePeriod is also higlighted as a function (instead of class) but I think this is due to this leading \ issue because with this error token we consequently fallback in recover/reset mode (the :return state).

I also noted that on the rule %r/\??#{id}/, Keyword::Type, we could find a namespaced classname (?Foo\Bar) so I think id_with_ns would be more accurate than just id.

Anyone else gave it a shot?

pyrmont · 2020-08-11T18:19:32Z

@julp At this point, I think it's easiest to generate feedback by putting it out into the world and seeing the response. I'm including it in the next version of Rouge I'm prepping right now. Thanks for all your help! I couldn't have done what I did without your advice and support :)

…d type declarations (rouge-ruby#1489) This commit involves a large-scale rewrite of the PHP lexer with the aim of improving the level of semantic detail. Partly this is achieved by the use of various heuristics that, while not always correct, do hold for the majority of cases. In particular, the lexer has improved support for `use` statements, function declarations and type declarations. Co-authored-by: julp <julp@users.noreply.github.com>

pyrmont added the needs-review The PR needs to be reviewed label Apr 8, 2020

pyrmont self-assigned this Apr 8, 2020

pyrmont changed the title ~~Support traits in PHP lexer~~ Improve support for use keyword in PHP lexer Apr 8, 2020

pyrmont added 17 commits July 7, 2020 16:00

Remove tailing whitespace from PHP visual sample

9e4a27e

Add traits usage to PHP visual sample

10c644e

Improve support for use keyword with traits

b74d3f7

Add example of use to import multiple namespaces

b1e3509

Collate trait and use examples

321ef99

Simplify rules and improve coverage

86f9f17

Improve logical separation of rules

8fc769a

Fix regression in comment rule

fe2639c

Fix tokens relating to class keyword

04334e0

Reset expected token after assertions

95c91f4

Change variable name for next token

32f67b9

Treat function keywords separately

9062084

Add anonymous function examples

7242aa9

Update tokens in PHP lexer spec

2260ec4

Improve coverage of tokens further

3ab60d5

Add abstract function example

367179c

Implement MagicToken to 'predict' next token

a564e88

pyrmont added 4 commits July 7, 2020 16:00

Rationalise visual sample

854d84d

Rewrite PHP lexer

610b108

Update Hack lexer to match new structure

ae35ded

Update specs

d33ef44

pyrmont force-pushed the bugfix.php-traits branch from d0b0b5f to d33ef44 Compare July 7, 2020 07:01

pyrmont added 3 commits July 7, 2020 16:25

Update spec

4491110

Use :names state consistently

8a85cca

Better support use in anonymous functions

04ce3cb

Support functions and arrays in assignments

8a6324c

pyrmont changed the title ~~Improve support for use keyword in PHP lexer~~ Rewrite PHP lexer to support use statements, function declarations and type declarations Jul 7, 2020

pyrmont added 2 commits July 7, 2020 18:00

Support arrow functions

4a5be71

Support typed properties

60422a2

pyrmont added 5 commits July 7, 2020 18:41

Rationalise function definition rules

55905e7

Tidy up comments

c42decd

Remove namespace and declaration states

11b41aa

Remove broken assertion

7a88718

Rationalise rules

79cb3d6

pyrmont merged commit ba2d158 into rouge-ruby:master Aug 11, 2020

pyrmont removed the needs-review The PR needs to be reviewed label Aug 11, 2020

pyrmont mentioned this pull request Oct 9, 2020

Rewritten PHP lexer makes kramdown and jekyll tests fail #1603

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite PHP lexer to support use statements, function declarations and type declarations #1489

Rewrite PHP lexer to support use statements, function declarations and type declarations #1489

pyrmont commented Apr 8, 2020 •

edited

Loading

pyrmont commented Apr 8, 2020 •

edited

Loading

pyrmont commented Apr 8, 2020

julp commented Apr 8, 2020 •

edited

Loading

pyrmont commented Apr 9, 2020 •

edited

Loading

pyrmont commented Apr 9, 2020

julp commented Apr 9, 2020 •

edited

Loading

pyrmont commented Apr 10, 2020 •

edited

Loading

julp commented Apr 10, 2020

pyrmont commented Apr 12, 2020

julp commented Apr 13, 2020

pyrmont commented Jul 7, 2020 •

edited

Loading

pyrmont commented Jul 7, 2020

julp commented Jul 10, 2020 •

edited

Loading

pyrmont commented Jul 10, 2020

julp commented Jul 12, 2020 •

edited

Loading

pyrmont commented Aug 11, 2020

Rewrite PHP lexer to support use statements, function declarations and type declarations #1489

Rewrite PHP lexer to support use statements, function declarations and type declarations #1489

Conversation

pyrmont commented Apr 8, 2020 • edited Loading

pyrmont commented Apr 8, 2020 • edited Loading

pyrmont commented Apr 8, 2020

julp commented Apr 8, 2020 • edited Loading

pyrmont commented Apr 9, 2020 • edited Loading

pyrmont commented Apr 9, 2020

julp commented Apr 9, 2020 • edited Loading

pyrmont commented Apr 10, 2020 • edited Loading

julp commented Apr 10, 2020

pyrmont commented Apr 12, 2020

julp commented Apr 13, 2020

pyrmont commented Jul 7, 2020 • edited Loading

pyrmont commented Jul 7, 2020

julp commented Jul 10, 2020 • edited Loading

pyrmont commented Jul 10, 2020

julp commented Jul 12, 2020 • edited Loading

pyrmont commented Aug 11, 2020

pyrmont commented Apr 8, 2020 •

edited

Loading

pyrmont commented Apr 8, 2020 •

edited

Loading

julp commented Apr 8, 2020 •

edited

Loading

pyrmont commented Apr 9, 2020 •

edited

Loading

julp commented Apr 9, 2020 •

edited

Loading

pyrmont commented Apr 10, 2020 •

edited

Loading

pyrmont commented Jul 7, 2020 •

edited

Loading

julp commented Jul 10, 2020 •

edited

Loading

julp commented Jul 12, 2020 •

edited

Loading