Add context-aware state machine #27

RReverser · 2017-05-03T13:42:29Z

Based on Sweet.js design doc https://github.com/sweet-js/sweet.js/wiki/design.

It doesn't cover literal objects or function division (at least for now) as it's extremely rare case that requires much more code to be handled correctly.

Other than that, this should fix various cases that were previously broken by minifier, such as nested ES6 templates and regular expression recognition in contexts where a division was expected otherwise (see tests for more).

As part of this, improved test output to output actual differences in actual/expected tokens, less noise when tokenizer already failed in the middle, and tokenizer test stats when running with go test -v - this helped a lot during development.

Mostly fixes #16 (apart from two rare cases described above).

Also fixes two out of four remaining cases in tdewolff/minify#132 (other two are unrelated to this particular issue, will submit bug report and/or PR separately).

tdewolff · 2017-05-07T11:51:54Z

Give me some time to review, I have much more free time soon. Perhaps we can also work on incorporating variable/function renaming soon, this reduces filesize extremely. I really appreciate your efforts thus far!

RReverser · 2017-05-07T14:38:52Z

Renaming is much more complicated topic than this, especially with ES6 patterns - too easy to break things.

For now, just trying to fix the parsing itself, this fix seems to cover any real-world use-cases, including auto-generated ones.

As for the review - sure, take your time.

RReverser · 2017-05-15T08:45:58Z

@tdewolff Have you had a chance to look at this? Given that it passes both old and new tests, and also CDNjs suite, I'm pretty confident that it at least didn't make things worse but would prefer second pair of eyes :)

UPD: Nevermind, we've put it in production from my fork, works fine, so no rush.

- Bails out on first index exceeding expected array length (if N >= M, there is no need to check and print messages for N+1 >= M, N+2 >= M, etc.) - Bails out on first token mismatching expected type (if it's incorrectly determined, it's very unlikely that following tokens were determined correctly, so can just reduce noise). - Prints expected and actual token types in addition to index and expected array. - Output number of failed token tests in verbose mode. - Output as many tokens as required to show the problem. - Print original input on failure.

Based on Sweet.js design doc https://github.com/sweet-js/sweet.js/wiki/design. It doesn't cover literal objects or function division (at least for now) as it's extremely rare case that requires much more code to be handled correctly.

Benefits: strongly typed, no external dependency and no noticeable difference in speed (we don't have much to copy on growth as program depth is usually quite low).

Avoids to simplify context handling a bit and remove curContext().

tdewolff · 2017-05-20T17:41:31Z

js/lex_test.go

 		l := NewLexer(bytes.NewBufferString(tt.js))
 		i := 0
+		j := 0


Isn't j just i + 1 within the loop?

No, for example, one counts whitespace tokens, while the other doesn't.

tdewolff · 2017-05-20T17:42:39Z

js/lex.go

@@ -33,6 +33,30 @@ const (
 	TemplateToken
 )

+// TokenState determines a state in which next token should be read
+type TokenState uint32


Why not just uint?

Is there any difference? Probably Rust spoiled me, but now trying to use explicitly sized types everywhere.

Not when you run a 32-bit machine, but uint takes the regular int size AFAIK. So on a 64-bit machine it is a uint64.

Ah yeah makes sense. But then TokenType across the package uses uint32 everywhere, so tried to be consistent with types for enum.

Uh, yeah true

tdewolff · 2017-06-10T14:17:59Z

js/lex.go

+	StmtParensState
+	SubscriptState
+	PropNameState
+)


Where do these names come from? I don't think all of them very clearly specify in what state/context they are...are these terms taken from the specs? Same for context.

Just made up. What do you find unclear about them? It says "expression state" (when expression should be read), "statement parentheses state" (when statement parentheses should be read, such as after while), "subscript state" (when any subscript should be read, that's how we call it in e.g. Acorn), "property name state" (when property name after . is expected) to be as explicit about where it is as possible :)

Makes sense, though I still don't understand subscript as in _subscript. Can we think of a better name?

shrug I don't care much if you can think of a better name, just chose what was used to.

So what do you mean with subscript? Anything that is a number, regexp or a string?

Oh, no, it's a subscript to an already parsed expression (where you expect [, ( or . or some binary operator).

tdewolff · 2017-06-10T14:18:58Z

js/lex.go

-	regexpState   bool
-	templateState bool
+	r     *buffer.Lexer
+	stack []ParsingContext


I would call this context

Then it might be confusing given that this is tokenizer otherwise. "Parsing context" means it actually stores context as proper parser would on top of tokenizer, except does that in a much simpler way without full parser available.

I meant instead of stack as that doesn't convey the meaning of the variable, only the way it is operated

Ah. Well it's "parsing context stack", just "context" would be incorrect because context is just item of it (as in current context). Maybe contextStack but that's slightly long...

I think context would be fine, as indexed as context[i] it would read as get context at index i. Reading stack[i] leaves one wondering what kind of stack it is. I would also call it Context instead of ParsingContext because it is pretty obvious that it is for parsing purposes as that is what the package is called.

I'd prefer not to conflate names with different meanings into one.

enterContext pushing a context onto l.context would be rather confusing for readability. stack (or contextStack) is not an implementation detail, but conveys what it actually is (that it's a FILO list of actual contexts, so you know in which order to expect them etc.).

ParsingContext because it is pretty obvious that it is for parsing purposes as that is what the package is called.

Yeah but the class itself is Lexer so some disambiguation is useful.

RReverser · 2017-06-12T23:27:09Z

@tdewolff How is the online page updated? Is it deployed manually or only when the main minify package is updated? http://go.tacodewolff.nl/minify seems to still minify JS using the old version of the parser.

tdewolff · 2017-06-13T07:08:57Z

I deploy it manually at the moment, I will update it this afternoon

RReverser added 7 commits May 16, 2017 17:25

Add test for complex templates

c9f903e

Add context-sensitive regexp tests

3f96848

Add a context-aware state machine

1b59fa5

Based on Sweet.js design doc https://github.com/sweet-js/sweet.js/wiki/design. It doesn't cover literal objects or function division (at least for now) as it's extremely rare case that requires much more code to be handled correctly.

Remove now-unused regex state table

f01b334

Switch to regular slices for stack

fb138d5

Benefits: strongly typed, no external dependency and no noticeable difference in speed (we don't have much to copy on growth as program depth is usually quite low).

Enter template context only on ${

68e2aa9

Avoids to simplify context handling a bit and remove curContext().

RReverser force-pushed the sweet-js-rework branch from 44b0f93 to 68e2aa9 Compare May 16, 2017 16:32

RReverser mentioned this pull request Jun 10, 2017

JS: nested templates bug tdewolff/minify#144

Closed

tdewolff reviewed Jun 10, 2017

View reviewed changes

tdewolff merged commit e3d09bb into tdewolff:master Jun 10, 2017

RReverser mentioned this pull request Jul 11, 2017

JS: Updated the Quirks section #28

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add context-aware state machine #27

Add context-aware state machine #27

RReverser commented May 3, 2017

tdewolff commented May 7, 2017

RReverser commented May 7, 2017 •

edited

RReverser commented May 15, 2017 •

edited

tdewolff May 20, 2017

RReverser Jun 10, 2017

tdewolff May 20, 2017

RReverser Jun 10, 2017

tdewolff Jun 10, 2017

RReverser Jun 10, 2017

tdewolff Jun 10, 2017

tdewolff Jun 10, 2017

RReverser Jun 10, 2017 •

edited

tdewolff Jun 10, 2017

RReverser Jun 10, 2017

tdewolff Jun 10, 2017

RReverser Jun 10, 2017 •

edited

tdewolff Jun 10, 2017

RReverser Jun 10, 2017

tdewolff Jun 10, 2017

RReverser Jun 10, 2017

tdewolff Jun 10, 2017

RReverser Jun 10, 2017

RReverser commented Jun 12, 2017 •

edited

tdewolff commented Jun 13, 2017

Add context-aware state machine #27

Add context-aware state machine #27

Conversation

RReverser commented May 3, 2017

tdewolff commented May 7, 2017

RReverser commented May 7, 2017 • edited

RReverser commented May 15, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RReverser Jun 10, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RReverser Jun 10, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RReverser commented Jun 12, 2017 • edited

tdewolff commented Jun 13, 2017

RReverser commented May 7, 2017 •

edited

RReverser commented May 15, 2017 •

edited

RReverser Jun 10, 2017 •

edited

RReverser Jun 10, 2017 •

edited

RReverser commented Jun 12, 2017 •

edited