Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add context-aware state machine #27

Merged
merged 7 commits into from
Jun 10, 2017
Merged

Conversation

RReverser
Copy link
Contributor

Based on Sweet.js design doc https://github.com/sweet-js/sweet.js/wiki/design.

It doesn't cover literal objects or function division (at least for now) as it's extremely rare case that requires much more code to be handled correctly.

Other than that, this should fix various cases that were previously broken by minifier, such as nested ES6 templates and regular expression recognition in contexts where a division was expected otherwise (see tests for more).

As part of this, improved test output to output actual differences in actual/expected tokens, less noise when tokenizer already failed in the middle, and tokenizer test stats when running with go test -v - this helped a lot during development.

Mostly fixes #16 (apart from two rare cases described above).

Also fixes two out of four remaining cases in tdewolff/minify#132 (other two are unrelated to this particular issue, will submit bug report and/or PR separately).

@tdewolff
Copy link
Owner

tdewolff commented May 7, 2017

Give me some time to review, I have much more free time soon. Perhaps we can also work on incorporating variable/function renaming soon, this reduces filesize extremely. I really appreciate your efforts thus far!

@RReverser
Copy link
Contributor Author

RReverser commented May 7, 2017

Renaming is much more complicated topic than this, especially with ES6 patterns - too easy to break things.

For now, just trying to fix the parsing itself, this fix seems to cover any real-world use-cases, including auto-generated ones.

As for the review - sure, take your time.

@RReverser
Copy link
Contributor Author

RReverser commented May 15, 2017

@tdewolff Have you had a chance to look at this? Given that it passes both old and new tests, and also CDNjs suite, I'm pretty confident that it at least didn't make things worse but would prefer second pair of eyes :)

UPD: Nevermind, we've put it in production from my fork, works fine, so no rush.

- Bails out on first index exceeding expected array length (if N >= M, there is no need to check and print messages for N+1 >= M, N+2 >= M, etc.)
- Bails out on first token mismatching expected type (if it's incorrectly determined, it's very unlikely that following tokens were determined correctly, so can just reduce noise).
- Prints expected and actual token types in addition to index and expected array.
- Output number of failed token tests in verbose mode.
- Output as many tokens as required to show the problem.
- Print original input on failure.
Based on Sweet.js design doc https://github.com/sweet-js/sweet.js/wiki/design.

It doesn't cover literal objects or function division (at least for now) as it's extremely rare case that requires much more code to be handled correctly.
Benefits: strongly typed, no external dependency and no noticeable difference in speed (we don't have much to copy on growth as program depth is usually quite low).
Avoids to simplify context handling a bit and remove curContext().
l := NewLexer(bytes.NewBufferString(tt.js))
i := 0
j := 0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't j just i + 1 within the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, for example, one counts whitespace tokens, while the other doesn't.

@@ -33,6 +33,30 @@ const (
TemplateToken
)

// TokenState determines a state in which next token should be read
type TokenState uint32
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just uint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference? Probably Rust spoiled me, but now trying to use explicitly sized types everywhere.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not when you run a 32-bit machine, but uint takes the regular int size AFAIK. So on a 64-bit machine it is a uint64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah makes sense. But then TokenType across the package uses uint32 everywhere, so tried to be consistent with types for enum.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, yeah true

StmtParensState
SubscriptState
PropNameState
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these names come from? I don't think all of them very clearly specify in what state/context they are...are these terms taken from the specs? Same for context.

Copy link
Contributor Author

@RReverser RReverser Jun 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just made up. What do you find unclear about them? It says "expression state" (when expression should be read), "statement parentheses state" (when statement parentheses should be read, such as after while), "subscript state" (when any subscript should be read, that's how we call it in e.g. Acorn), "property name state" (when property name after . is expected) to be as explicit about where it is as possible :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, though I still don't understand subscript as in subscript. Can we think of a better name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shrug I don't care much if you can think of a better name, just chose what was used to.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what do you mean with subscript? Anything that is a number, regexp or a string?

Copy link
Contributor Author

@RReverser RReverser Jun 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no, it's a subscript to an already parsed expression (where you expect [, ( or . or some binary operator).

regexpState bool
templateState bool
r *buffer.Lexer
stack []ParsingContext
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it might be confusing given that this is tokenizer otherwise. "Parsing context" means it actually stores context as proper parser would on top of tokenizer, except does that in a much simpler way without full parser available.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant instead of stack as that doesn't convey the meaning of the variable, only the way it is operated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Well it's "parsing context stack", just "context" would be incorrect because context is just item of it (as in current context). Maybe contextStack but that's slightly long...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think context would be fine, as indexed as context[i] it would read as get context at index i. Reading stack[i] leaves one wondering what kind of stack it is. I would also call it Context instead of ParsingContext because it is pretty obvious that it is for parsing purposes as that is what the package is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to conflate names with different meanings into one.

enterContext pushing a context onto l.context would be rather confusing for readability. stack (or contextStack) is not an implementation detail, but conveys what it actually is (that it's a FILO list of actual contexts, so you know in which order to expect them etc.).

ParsingContext because it is pretty obvious that it is for parsing purposes as that is what the package is called.

Yeah but the class itself is Lexer so some disambiguation is useful.

@tdewolff tdewolff merged commit e3d09bb into tdewolff:master Jun 10, 2017
@RReverser
Copy link
Contributor Author

RReverser commented Jun 12, 2017

@tdewolff How is the online page updated? Is it deployed manually or only when the main minify package is updated? http://go.tacodewolff.nl/minify seems to still minify JS using the old version of the parser.

@tdewolff
Copy link
Owner

I deploy it manually at the moment, I will update it this afternoon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JS: distinguish regexps from division correctly
2 participants