alex 3.2.7 fails to parse definitions which can pass alex 3.2.6 #197

Commelina · 2022-01-21T06:18:37Z

I tried the latest alex 3.2.7 on a commonly used library, language-c-0.9.0.1. However, it failed to parse Lexer.x:

src/Language/C/Parser/Lexer.x:123:22: parse error

It seems that the problem happened here:

@iec60559suffix = (32|64|128)[x]?

I then rolled alex back to version 3.2.6 and everything worked. I am not sure whether this is a bug.


hasufell · 2022-01-21T16:22:04Z

Same here: https://github.com/haskell/ghcup-metadata/runs/4898784174?check_suite_focus=true#step:8:35

Ericson2314 · 2022-01-21T18:45:16Z

OK I deprecated 3.2.7 and made it unbuildable. We'll need some new test cases for this.

Ericson2314 · 2022-01-21T19:31:52Z

@andreasabel Assuming this isn't a crazy bootstrapping issue, it looks like you made all the parser changes between 3.2.6 and 3.2.7. Would you have any time to help debug this?

andreasabel · 2022-01-22T08:35:04Z

@Ericson2314 : I started to have a look. Then I hit #195.
(I have some old repo of alex which still has a copy of data/AlexTemplate-ghc, so this does not stop me now.)

andreasabel · 2022-01-22T09:42:49Z

Shrunk example: Issue197.x

{}
@iec60559suffix = (32|64|128)[x]?

tokens :-

$white+         ;
@iec60559suffix ;
{}

Phew, finally found the magic incantations to make bisection work:

$ cat > issue197.sh
cp data/AlexTemplate-ghc data/AlexTemplate.hs
cabal run alex -- Issue197.x
^D
$ chmod +x issue197.sh
$ git bisect start 3.2.7 3.2.6
$ git bisect run ./issue197.sh

Result: d6653f3 is the first bad commit
Author: Andreas Abel
Date: Fri Jan 31 19:09:42 2020 +0100

[ fixed #141 ] regex: allow arbitary repetitions

Previously, the r{n,m} and related forms were restricted to single
digit numbers n and m.

We fix this by recognizing numbers of > 1 digits as NUM token and adding
rules to the regex parsing that also allows NUMs for n and m.
Previously, only CHAR was allowed, which subsumes single digits.

Ericson2314 · 2022-01-23T07:10:22Z

Looking at the info output for the happy grammar:

...
  58     rep -> '{' CHAR '}'                                (49)
  59     rep -> '{' CHAR ',' '}'                            (50)
  60     rep -> '{' CHAR ',' CHAR '}'                       (51)
  61     rep -> '{' NUM '}'                                 (52)
  62     rep -> '{' NUM ',' '}'                             (53)
  63     rep -> '{' NUM ',' NUM '}'                         (54)
...
  71     set0 -> CHAR                                       (62)
  72     set0 -> CHAR '-' CHAR                              (63)
...

The problem is that the lower rules (set0) don't have NUM equivalents.

I think remembering to double up NUM and CHAR is hard, all the more so since in this case we would need to "unparse" the number, yet also preserve distinctions like 001 vs 1.

This makes me think we should not have a NumT, and instead just have a number nonterminal, parsed from char terminals. Does that sound good?
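To make the failure mode concrete, here is a minimal Python sketch of the situation described above. It is a toy model, not Alex's actual lexer or grammar: the function names are hypothetical, and the only behaviour taken from the thread is that 3.2.7 lexes runs of more than one digit as NUM while rules such as `set0 -> CHAR` accept CHAR only.

```python
import re

# Toy model (an assumption, not Alex's real lexer): a run of two or more
# digits becomes one NUM token; every other character is a CHAR token.
TOKEN_RE = re.compile(r"[0-9]{2,}|.", re.DOTALL)

def lex_327_style(text):
    """Tokenize as the thread describes 3.2.7: multi-digit runs are NUM."""
    return [("NUM" if len(m) > 1 else "CHAR", m)
            for m in TOKEN_RE.findall(text)]

def lex_326_style(text):
    """Tokenize as the thread describes 3.2.6: every character is CHAR."""
    return [("CHAR", c) for c in text]

def only_chars(tokens):
    """Stand-in for rules like `set0 -> CHAR`, which have no NUM alternative."""
    return all(kind == "CHAR" for kind, _ in tokens)

old_tokens = lex_326_style("32")
new_tokens = lex_327_style("32")
print(old_tokens, only_chars(old_tokens))  # two CHAR tokens, accepted
print(new_tokens, only_chars(new_tokens))  # one NUM token, rejected

# "Unparsing" a numeric value loses the original spelling (001 vs 1).
print(str(int("001")))
```

In this model a single digit still lexes as CHAR, which matches the commit message's remark that CHAR subsumes single digits; only multi-digit runs trip the CHAR-only rules.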

andreasabel · 2022-01-23T08:31:22Z

Er, @Ericson2314, I am already working on this. I just lack the repo rights to assign this to me so that you would know about it.


In different contexts within Alex's surface syntax, something like "2340898" might be a string of characters or a number. The contexts are only distinguished at the grammar level, not the token level, so this more or less (we could do very layer-violation-y tricks) precludes lexing entire number literals. Instead of a number token, we have a digit token. This we treat as a "sub-token", making a `DIGIT | CHAR` non-terminal we use everywhere we want to parse a character. For number literals, we just parse a non-empty string of digits, and the left recursion makes the `* 10` elegant. Fixes #197
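The left-recursive number rule mentioned in this commit message can be sketched as a left fold over digits. This is an illustrative Python model only; `parse_number` and `parse_chars` are hypothetical names, and the real change lives in Alex's happy grammar, which is not shown in this thread.

```python
DIGITS = "0123456789"

def parse_number(chars):
    """Model of `number -> DIGIT | number DIGIT`.

    Each step multiplies the value parsed so far by 10 and adds the
    next digit, which is what the left recursion computes.
    """
    if not chars or any(c not in DIGITS for c in chars):
        raise ValueError("expected a non-empty string of digits")
    value = 0
    for c in chars:
        value = value * 10 + int(c)
    return value

def parse_chars(chars):
    """Model of `char -> DIGIT | CHAR`: in character context the
    digits are kept verbatim, so leading zeros survive."""
    return list(chars)

print(parse_number("128"))
print(parse_chars("001"))
```

The same digit tokens feed both functions, so the grammar context, not the lexer, decides whether "001" is the number 1 or three characters.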

andreasabel · 2022-01-23T17:45:56Z

> This makes me think we should not have a NumT, and instead just have a number nonterminal, parsed from char terminals. Does that sound good?

Yes this sounds good.
I just wish you would not ask me to help and then race for a fix. Wasted my time, I could have looked into something else.


Ericson2314 · 2022-01-23T18:27:45Z

> Er, @Ericson2314, I am already working on this. I just lack the repo rights to assign this to me so that you would know about it.

Oh sorry! That's awkward. I should have watched the time zones more carefully, I just assumed you stopped at the bisect.

Yes, let's get you those perms.

…ssions.

In issue haskell#141, multiplicity annotations in regexes were extended to the general, multi-digit case {nnn,mmm}. However, lexing numeric literals broke parsing of regexes like:

32|64
[01-89]

The solution here is to only lex numeric literals in a special lexer state called `multiplicity`, which is entered by the parser when parsing multiplicity braces {nnn,mmm}. This restores alex's handling of digits as characters in the non-multiplicity situations.
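The idea of a dedicated lexer state can be sketched in Python as follows. This is a simplified model under stated assumptions: here the toy lexer toggles the state itself on `{` and `}`, whereas the commit message says the parser enters the `multiplicity` state, and the token kinds SYM, NUM and CHAR are hypothetical names for illustration.

```python
DIGITS = "0123456789"

def lex(text):
    """Toy lexer: NUM tokens are produced only between '{' and '}'."""
    tokens = []
    in_multiplicity = False  # models the `multiplicity` lexer state
    i = 0
    while i < len(text):
        c = text[i]
        if c == "{":
            in_multiplicity = True
            tokens.append(("SYM", c))
        elif c == "}":
            in_multiplicity = False
            tokens.append(("SYM", c))
        elif in_multiplicity and c in DIGITS:
            # Inside braces, consume the whole digit run as one number.
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            tokens.append(("NUM", int(text[i:j])))
            i = j
            continue
        else:
            # Outside braces, digits are ordinary characters again.
            tokens.append(("CHAR", c))
        i += 1
    return tokens

print(lex("32|64"))
print(lex("a{10,12}"))
```

Outside the braces every digit comes out as a CHAR token, so inputs like 32|64 and [01-89] are tokenized the way the thread says 3.2.6 handled them, while {10,12} still yields multi-digit numbers.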

andreasabel · 2022-01-23T20:24:54Z

I should have stated in text that I am looking into a fix...

PR #202 seems to solve the problem in a satisfactory way, by restricting the parsing of numerals to inside multiplicity expressions {n,m}.

Ericson2314 · 2022-01-23T20:50:22Z

@andreasabel BTW I earlier sent you an email, using the address from your git commits. Is that a good address to reach you? I want to reach out to @simonmar more directly to sort out the permissions.

andreasabel · 2022-01-23T21:13:13Z

> @andreasabel BTW I earlier sent you an email, using the address from your git commits. Is that a good address to reach you? I want to reach out to @simonmar more directly to sort out the permissions.

I found the message. I am often slow with email, looking into my github notifications more often than my email.

Thanks for your efforts!



Ericson2314 · 2022-01-23T23:21:46Z

@Commelina and others, please try https://hackage.haskell.org/package/alex-3.2.7.1

hasufell mentioned this issue Jan 21, 2022

Fix build because alex is broken haskell/ghcup-metadata#17

Merged

Ericson2314 self-assigned this Jan 23, 2022

andreasabel added a commit to andreasabel/alex that referenced this issue Jan 23, 2022

WIP haskell#197: add rules to interpret number literals as character sequences (dc52c60)

andreasabel mentioned this issue Jan 23, 2022

WIP #197: add rules to interpret number literals as character sequences #199

Closed

Ericson2314 mentioned this issue Jan 23, 2022

Parse numbers in Alex's parser, not tokenizer #200

Closed

andreasabel mentioned this issue Jan 23, 2022

Parse numbers in Alex's parser, not tokenizer #201

Closed

andreasabel mentioned this issue Jan 23, 2022

Fix #197 by only lexing numeric literals in multiplicity expressions. #202

Merged

Ericson2314 mentioned this issue Jan 23, 2022

Agda-style Lexing #203

Open

Ericson2314 closed this as completed in #202 Jan 23, 2022

andreasabel added the labels "parser" (Concerning the parsing of .x files) and "regression in 3.2.7" Jan 25, 2022
