-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add context-aware state machine #27
Conversation
Give me some time to review, I have much more free time soon. Perhaps we can also work on incorporating variable/function renaming soon, this reduces filesize extremely. I really appreciate your efforts thus far! |
Renaming is much more complicated topic than this, especially with ES6 patterns - too easy to break things. For now, just trying to fix the parsing itself, this fix seems to cover any real-world use-cases, including auto-generated ones. As for the review - sure, take your time. |
@tdewolff Have you had a chance to look at this? Given that it passes both old and new tests, and also CDNjs suite, I'm pretty confident that it at least didn't make things worse but would prefer second pair of eyes :) UPD: Nevermind, we've put it in production from my fork, works fine, so no rush. |
- Bails out on first index exceeding expected array length (if N >= M, there is no need to check and print messages for N+1 >= M, N+2 >= M, etc.) - Bails out on first token mismatching expected type (if it's incorrectly determined, it's very unlikely that following tokens were determined correctly, so can just reduce noise). - Prints expected and actual token types in addition to index and expected array. - Output number of failed token tests in verbose mode. - Output as many tokens as required to show the problem. - Print original input on failure.
Based on Sweet.js design doc https://github.com/sweet-js/sweet.js/wiki/design. It doesn't cover literal objects or function division (at least for now) as it's extremely rare case that requires much more code to be handled correctly.
Benefits: strongly typed, no external dependency and no noticeable difference in speed (we don't have much to copy on growth as program depth is usually quite low).
Avoids to simplify context handling a bit and remove curContext().
44b0f93
to
68e2aa9
Compare
l := NewLexer(bytes.NewBufferString(tt.js)) | ||
i := 0 | ||
j := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't j
just i + 1
within the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, for example, one counts whitespace tokens, while the other doesn't.
@@ -33,6 +33,30 @@ const ( | |||
TemplateToken | |||
) | |||
|
|||
// TokenState determines a state in which next token should be read | |||
type TokenState uint32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just uint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any difference? Probably Rust spoiled me, but now trying to use explicitly sized types everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not when you run a 32-bit machine, but uint
takes the regular int size AFAIK. So on a 64-bit machine it is a uint64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah makes sense. But then TokenType
across the package uses uint32
everywhere, so tried to be consistent with types for enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, yeah true
StmtParensState | ||
SubscriptState | ||
PropNameState | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do these names come from? I don't think all of them very clearly specify in what state/context they are...are these terms taken from the specs? Same for context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just made up. What do you find unclear about them? It says "expression state" (when expression should be read), "statement parentheses state" (when statement parentheses should be read, such as after while
), "subscript state" (when any subscript should be read, that's how we call it in e.g. Acorn), "property name state" (when property name after .
is expected) to be as explicit about where it is as possible :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, though I still don't understand subscript
as in subscript. Can we think of a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shrug I don't care much if you can think of a better name, just chose what was used to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what do you mean with subscript
? Anything that is a number, regexp or a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, no, it's a subscript to an already parsed expression (where you expect [
, (
or .
or some binary operator).
regexpState bool | ||
templateState bool | ||
r *buffer.Lexer | ||
stack []ParsingContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would call this context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it might be confusing given that this is tokenizer otherwise. "Parsing context" means it actually stores context as proper parser would on top of tokenizer, except does that in a much simpler way without full parser available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant instead of stack
as that doesn't convey the meaning of the variable, only the way it is operated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Well it's "parsing context stack", just "context" would be incorrect because context is just item of it (as in current context). Maybe contextStack
but that's slightly long...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think context
would be fine, as indexed as context[i]
it would read as get context at index i
. Reading stack[i]
leaves one wondering what kind of stack it is. I would also call it Context
instead of ParsingContext
because it is pretty obvious that it is for parsing purposes as that is what the package is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to conflate names with different meanings into one.
enterContext
pushing a context
onto l.context
would be rather confusing for readability. stack
(or contextStack
) is not an implementation detail, but conveys what it actually is (that it's a FILO list of actual contexts, so you know in which order to expect them etc.).
ParsingContext
because it is pretty obvious that it is for parsing purposes as that is what the package is called.
Yeah but the class itself is Lexer
so some disambiguation is useful.
@tdewolff How is the online page updated? Is it deployed manually or only when the main |
I deploy it manually at the moment, I will update it this afternoon |
Based on Sweet.js design doc https://github.com/sweet-js/sweet.js/wiki/design.
It doesn't cover literal objects or function division (at least for now) as it's extremely rare case that requires much more code to be handled correctly.
Other than that, this should fix various cases that were previously broken by minifier, such as nested ES6 templates and regular expression recognition in contexts where a division was expected otherwise (see tests for more).
As part of this, improved test output to output actual differences in actual/expected tokens, less noise when tokenizer already failed in the middle, and tokenizer test stats when running with
go test -v
- this helped a lot during development.Mostly fixes #16 (apart from two rare cases described above).
Also fixes two out of four remaining cases in tdewolff/minify#132 (other two are unrelated to this particular issue, will submit bug report and/or PR separately).