Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alex 3.2.7 fails to parse definitions which can pass alex 3.2.6 #197

Closed
Commelina opened this issue Jan 21, 2022 · 13 comments · Fixed by #202
Closed

alex 3.2.7 fails to parse definitions which can pass alex 3.2.6 #197

Commelina opened this issue Jan 21, 2022 · 13 comments · Fixed by #202
Assignees
Labels
parser Concerning the parsing of .x files regression in 3.2.7

Comments

@Commelina
Copy link

I tried the latest alex 3.2.7 on a common-used library language-c-0.9.0.1. However, it failed to parse Lex.x:

src/Language/C/Parser/Lexer.x:123:22: parse error

It seems that the problem happened here:

@iec60559suffix = (32|64|128)[x]?

Then I rolled my alex back to version 3.2.6 then everything worked. I am not sure if it is a bug.

@hasufell
Copy link
Member

@Ericson2314
Copy link
Collaborator

OK I deprecated 3.2.7 and made it unbuildable. We'll need some new test cases for this.

@Ericson2314
Copy link
Collaborator

@andreasabel Assuming this isn't a crazy bootstrapping issue, it looks like you made all the parser changes between 3.2.6 and 3.2.7. Would have any time to help debug this?

@andreasabel
Copy link
Member

andreasabel commented Jan 22, 2022

@Ericson2314 : I started to have a look. Then I hit #195.
(I have some old repo of alex which still has a copy of data/AlexTemplate-ghc, so this does not stop me now.)

@andreasabel
Copy link
Member

Shrunk example: Issue197.x

{}
@iec60559suffix = (32|64|128)[x]?

tokens :-

$white+         ;
@iec60559suffix ;
{}

Phew, finally found the magic incantations to make bisection work:

$ cat > issue197.sh
cp data/AlexTemplate-ghc data/AlexTemplate.hs
cabal run alex -- Issue197.x
^D
$ chmod +x issue197.sh
$ git bisect start 3.2.7 3.2.6
$ git bisect run ./issue197.sh

Result: d6653f3 is the first bad commit
Author: Andreas Abel
Date: Fri Jan 31 19:09:42 2020 +0100

[ fixed #141 ] regex: allow arbitary repetitions

Previously, the r{n,m} and related forms were restricted to single
digit numbers n and m.

We fix this by recognizing numbers of > 1 digits as NUM token and adding
rules to the regex parsing that also allows NUMs for n and m.
Previously, only CHAR was allowed, which subsumes single digits.

@Ericson2314
Copy link
Collaborator

Running info on the happy grammar,

...
  58     rep -> '{' CHAR '}'                                (49)
  59     rep -> '{' CHAR ',' '}'                            (50)
  60     rep -> '{' CHAR ',' CHAR '}'                       (51)
  61     rep -> '{' NUM '}'                                 (52)
  62     rep -> '{' NUM ',' '}'                             (53)
  63     rep -> '{' NUM ',' NUM '}'                         (54)
...
  71     set0 -> CHAR                                       (62)
  72     set0 -> CHAR '-' CHAR                              (63)
...

The problem is these lower rules don't have NUM equivalents.

I think remembering to double up NUM and CHAR is hard, and all the more so when this case we would need to "unparse" the number, yet also preserve things like 001 vs 1.

This makes makes we think we should not have a NumT, and instead just have a number nonterminal, parsed from char terminals. Does that sound good?

@Ericson2314 Ericson2314 self-assigned this Jan 23, 2022
@andreasabel
Copy link
Member

Er, @Ericson2314, I am already working on this. I just lack the repo rights to assign this to me so that you would know about it.

andreasabel added a commit to andreasabel/alex that referenced this issue Jan 23, 2022
Ericson2314 added a commit that referenced this issue Jan 23, 2022
In different contexts within Alex's surface syntax, something like
"2340898" might be a string of characters or a number. The contexts are
are only distinguished at the grammar level, not the token level, so
this more or less (we could very layer-violation-y tricks) precludes
lexing entire number literals.

Instead of a number token, we have a digit token. This we treat as
"sub-token", making a `DIGIT | CHAR` non-terminal we use everywhere we
want to parse a character.

For number literals, we just parse a non-empty string of numbers, and
the left recursion makes the `* 10` elegant.

Fixes #197
Ericson2314 added a commit that referenced this issue Jan 23, 2022
In different contexts within Alex's surface syntax, something like
"2340898" might be a string of characters or a number. The contexts are
are only distinguished at the grammar level, not the token level, so
this more or less (we could very layer-violation-y tricks) precludes
lexing entire number literals.

Instead of a number token, we have a digit token. This we treat as
"sub-token", making a `DIGIT | CHAR` non-terminal we use everywhere we
want to parse a character.

For number literals, we just parse a non-empty string of numbers, and
the left recursion makes the `* 10` elegant.

Fixes #197
@andreasabel
Copy link
Member

This makes makes we think we should not have a NumT, and instead just have a number nonterminal, parsed from char terminals. Does that sound good?

Yes this sounds good.
I just wish you would not ask me to help and then race for a fix. Wasted my time, I could have looked into something else.

andreasabel pushed a commit to andreasabel/alex that referenced this issue Jan 23, 2022
In different contexts within Alex's surface syntax, something like
"2340898" might be a string of characters or a number. The contexts are
are only distinguished at the grammar level, not the token level, so
this more or less (we could very layer-violation-y tricks) precludes
lexing entire number literals.

Instead of a number token, we have a digit token. This we treat as
"sub-token", making a `DIGIT | CHAR` non-terminal we use everywhere we
want to parse a character.

For number literals, we just parse a non-empty string of numbers, and
the left recursion makes the `* 10` elegant.

Fixes haskell#197
@Ericson2314
Copy link
Collaborator

Er, @Ericson2314, I am already working on this. I just lack the repo rights to assign this to me so that you would know about it.

Oh sorry! That's awkward. I should have watched the time zones more carefully, I just assumed you stopped at the bisect.

Yes, let's get you those perms.

andreasabel added a commit to andreasabel/alex that referenced this issue Jan 23, 2022
…ssions.

In issue haskell#141, multiplicity annotations in regexes where extended to
the general, multi-digit case {nnn,mmm}.  However, lexing numeric
literals broke parsing of regexes like:

   32|64
   [01-89]

The solution here is to only lex numeric literals in a special lexer
state called `multiplicity` which is entered by the parser when
parsing multiplicity braces {nnn,mmm}.

This restores alex' handling of digits as characters in the
non-multiplicity situations.
@andreasabel
Copy link
Member

I should have stated in text that I am looking into a fix...

PR #202 seems to solve the problem in a satisfactory way, by restricting the parsing of numerals to inside multiplicity expressions {n,m}.

@Ericson2314
Copy link
Collaborator

@andreasabel BTW I earlier sent you an email taking the git email address. Is that a fine email address? I want to reach out to @simonmar more directly to get to permissions.

@andreasabel
Copy link
Member

@andreasabel BTW I earlier sent you an email taking the git email address. Is that a fine email address? I want to reach out to @simonmar more directly to get to permissions.

I found the message. I am often slow with email, looking into my github notifications more often than my email.

Thanks for your efforts!

Ericson2314 pushed a commit to andreasabel/alex that referenced this issue Jan 23, 2022
…ssions.

In issue haskell#141, multiplicity annotations in regexes where extended to
the general, multi-digit case {nnn,mmm}.  However, lexing numeric
literals broke parsing of regexes like:

   32|64
   [01-89]

The solution here is to only lex numeric literals in a special lexer
state called `multiplicity` which is entered by the parser when
parsing multiplicity braces {nnn,mmm}.

This restores alex' handling of digits as characters in the
non-multiplicity situations.
Ericson2314 pushed a commit that referenced this issue Jan 23, 2022
…#202)

In issue #141, multiplicity annotations in regexes where extended to
the general, multi-digit case {nnn,mmm}.  However, lexing numeric
literals broke parsing of regexes like:

   32|64
   [01-89]

The solution here is to only lex numeric literals in a special lexer
state called `multiplicity` which is entered by the parser when
parsing multiplicity braces {nnn,mmm}.

This restores alex' handling of digits as characters in the
non-multiplicity situations.
@Ericson2314
Copy link
Collaborator

@Commelina and others, please try https://hackage.haskell.org/package/alex-3.2.7.1

@andreasabel andreasabel added parser Concerning the parsing of .x files regression in 3.2.7 labels Jan 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment