Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a lexer for untokenised BBC BASIC files #1280

Merged
merged 21 commits into from
Aug 2, 2019

Conversation

bavison
Copy link
Contributor

@bavison bavison commented Jul 26, 2019

Includes full support for ARM BBC BASIC. Some features of BBC BASIC for Windows
and BBC BASIC for SDL2.0 are not included.

Includes full support for ARM BBC BASIC. Some features of BBC BASIC for Windows
and BBC BASIC for SDL2.0 are not included.
@pyrmont pyrmont added the needs-review The PR needs to be reviewed label Jul 26, 2019
Copy link
Contributor

@pyrmont pyrmont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And now I'm caught up with your whirlwind of submissions! :P

lib/rouge/lexers/bbcbasic.rb Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/bbcbasic.rb Show resolved Hide resolved
spec/visual/samples/bbcbasic Show resolved Hide resolved
@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels Jul 29, 2019
@pyrmont pyrmont added needs-review The PR needs to be reviewed and removed author-action The PR has been reviewed but action by the author is needed labels Jul 31, 2019
@pyrmont
Copy link
Contributor

pyrmont commented Jul 31, 2019

@bavison Looking at this code, I'm a little uncomfortable at the performance cost of the interpolation. There's a lot of it and there's potentially a lot of backtracking as the regex engine goes through trying to find matches.

I thought again about that alternative strategy of matching an identifier and then checking whether the match is inside any of the arrays of keywords (be they self.function, self.statement, etc) but I don't think it's possible given your observation that matches can occur 'within' a word.

That would be your conclusion, too, right?

@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels Jul 31, 2019
…ommand introducer

This is done by introducing an extra state where CLI commands are not valid,
and classifying the control flow keywords according to whether they
transition in or out of this state. Not 100% safe yet, since detecting any
expression in :root state should mean that * can only be an operator.
@bavison
Copy link
Contributor Author

bavison commented Jul 31, 2019

There's certainly a lot of (|) clauses in the regexs, if that's what you mean, and there isn't really an easy way to detect the ends of the tokens other than brute-force testing all the alternatives. Of course, this also means that colouring it makes it much easier for humans to read too! The general aversion to whitespace reflects the age of the dialect (it's backward-compatible to an early 1980s dialect, when every byte mattered, and in fact the interpreter tokenises the code as it is loaded to save space further).

Presumably it would be possible to form a decision tree to speed things up, so that for example (ABS|ACS|ADVAL|ASC|ATN|BEATS|BEAT|BGET#) becomes (A(BS|CS|DVAL|SC|TN)|B(EATS|EAT|GET#)). Ideally, that would be done by Ruby as it compiles the regex - if it does. The alternative of writing it out longhand as above risks making the lexer source code far less readable.

@pyrmont
Copy link
Contributor

pyrmont commented Jul 31, 2019

@bavison Yeah, the number of | is what I was thinking about. And, unfortunately, I don't believe Ruby does the kind of compilation you had in mind.

I read through the documentation for Ruby's Regexp class and the best we have is the o modifier. There's an answer at StackOverflow that explains what that does (it sounds like something we should definitely enable here).

This is probably another example where we should see how it performs in practice and look at future optimisations as and when required. As you said, trying to do that by hand is possible but fraught with bug potential.

By using a negative lookahead for when DIM and POINT are statements, rather
that a positive lookahead when DIM() and POINT() are used as built-in
functions, we can merge state :builtin_function into :expression. As a bonus,
since built-in functions are now tested at lower priority than control flow
statements, `ERROR` no longer needs its own rule.
This is achieved to inserting `goto :no_further_imperatives` every time we
identify a component of an expression, or reach a state where we expect an
expression next. Because these will not be balanced by the number of times we
return to :root, we can no longer use a state stack. This means that instead
of being able to do `mixin :expression` from state `:assembly2`, we need to
write out all the same rules again, but without the `goto`s.
Some of those keywords previously listed are not valid as an imperative
statement.
I don't think it realle helps to distinguish control flow statements from
other statements, especially since they are tokenised the same. Some keywords
(like `TINT`) have ended up in control3 even though they have nothing to do
with conrol flow, and others (like `OF`) can appear within either sort of
statement (`COLOUR OF x` and `CASE x OF`).
* `PROC` is permitted in an `ON` statement
* An imperative keyword is permitted immediately after `DEFPROCthing` without a
  separating colon

Added a few more examples to the visual spec
It isn't really  a built-in function (since it needs a function name to qualify
it, which is lexed as a variable), but it can appear within an expression.
Using a pair of states just wan't working - there are just too many exceptions
to the rules to handle cleanly. So instead, here we take the approach of only
permitting `*` to be lexed as a command introducer if it is at the start of a
line, or follows `:` or one of a restrictive set of other keywords.

Unfortunately, it appears that the lookbehind regexp syntax `(?<=)` doesn't
work here, so we have to lex the string including the lead up to the `*` as a
multi-token rule. This is fine in all cases except when the first line of the
file is a `*` command, but hopefully this will be rare enough that nobody
cares.
@bavison
Copy link
Contributor Author

bavison commented Aug 1, 2019

Sorry about the drip-drip of commits. Mostly it's putting the worms back in the can regarding the multiplication operator * getting misinterpreted as a command line command introducer. I think this solution motly works now, and seems to do a good job on real-world code too.

@pyrmont pyrmont added needs-review The PR needs to be reviewed and removed author-action The PR has been reviewed but action by the author is needed labels Aug 2, 2019
@pyrmont pyrmont merged commit b99ab3a into rouge-ruby:master Aug 2, 2019
@pyrmont pyrmont removed the needs-review The PR needs to be reviewed label Aug 2, 2019
@pyrmont
Copy link
Contributor

pyrmont commented Aug 2, 2019

@bavison Thanks for all your work on this one—quite a journey, huh?

Oh, and I'm sorry we don't have anything in the documentation explaining that negative lookbehinds don't work in rules. I hope you didn't spend too long before you worked that out! This is a limitation of (really a bug in) the StringScanner class from Ruby's standard library. A patch I submitted was just accepted but the reality is that even when that patch rolls out, it'll only be in the latest version of Ruby and so it'll be a while before it's supported in Rouge :(

@bavison
Copy link
Contributor Author

bavison commented Aug 2, 2019

No, it wasn't that big of a proportion of the time. The biggest problem was dealing with a syntax that evolved in a not-entirely-consistent way in an era before common use of other language analysis tools, I think.

That's the end of my backlog of lexers that I'd developed - there probably won't be many more to follow. Bugfixes maybe. Thanks very much for all your help!

@pyrmont
Copy link
Contributor

pyrmont commented Aug 2, 2019

I realised I hadn't said this in the closing comment of any of the other PRs but Rouge is on a two-week release cadence at the moment as we work our way through the backlog of submissions. The lexers you've submitted will come out this coming Tuesday as part of v3.8.0. Thanks again!

bavison added a commit to bavison/rouge that referenced this pull request Sep 9, 2019
This commit adds a lexer for ARM BBC Basic. Some features of BBC BASIC
for Windows and BBC BASIC for SDL2.0 are not included.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants