Hand-written parser #22

bterlson · 2015-05-01T02:09:58Z

Haven't really vetted this hard except to confirm it's significantly faster for me. I expect there to be some minor issues and perhaps major feedback on the implementation so I'm starting a PR :)

Semantic delta vs. the old parser:

Comments in lists are handled differently.
Some minor bug fixes (see test output)
Nesting is supported for some formats (perhaps not necessary now but seemed like a capability we might want in the future)
Support for documents rather than paragraphs. Documents are 0 or more paragraphs, where paragraphs can be either text or lists. The API has a breaking change that reflects this change (list replaced with document). Fragment remains.
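
To make the document/paragraph/list shape concrete, here is a rough sketch that splits input on blank lines and classifies each chunk. The function name `splitDocument`, the node shapes, and the rule that a list chunk starts with `1. ` are assumptions for illustration only; they are not taken from this PR's implementation.

```javascript
// Hypothetical sketch of "a document is 0 or more paragraphs,
// where each paragraph is either text or a list".
function splitDocument(source) {
  return source
    .split(/\n{2,}/) // two or more line breaks end a paragraph
    .map(function (chunk) { return chunk.replace(/^\n+|\n+$/g, ''); })
    .filter(function (chunk) { return chunk.length > 0; })
    .map(function (chunk) {
      // Assumption: a chunk whose first line begins with "1. " is a list.
      var type = /^\s*1\. /.test(chunk) ? 'list' : 'text';
      return { type: type, source: chunk };
    });
}
```

An empty string yields an empty array under this sketch, which matches the "0 or more paragraphs" wording.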

LMK any thoughts. I'm not an expert at parsers so I may be making obvious mistakes. Obviously I have decided to eschew comments for now to preserve this implementation's mysterious yet oddly compelling appeal ;)

domenic · 2015-05-01T16:38:24Z

lib/ecmarkdown.js

+var lexemes = {
+  star: /\*/,
+  underscore: /_/,
+  tick: /`[^`\n]*`/,


Add a comment explaining why star, underscore, and tilde get a simpler treatment than tick, pipe, and string? Or we could get rid of the difference, if that doesn't break things horribly?

I will add a general comment about the grammar of this implementation. But to answer your question, star, underscore, and tilde can nest, whereas tick, string, and pipe cannot have other things inside of them. I alluded to this above - we could disallow nesting in all cases, but it seemed like something we might want in the future...

Seems easier to disallow nesting in the first iteration then loosen as people ask for it?

I've unified the parsing of all formats and disallowed nesting (if we want to allow nesting later it's a trivial change).

All formats are now handled uniformly with respect to how they can be started and ended (e.g. foo| bar is not a valid start and foo |bar is not a valid end) and whether spaces are allowed.
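
A minimal sketch of the start/end rule described above, assuming a delimiter may open a format only when the next character is not whitespace and may close one only when the previous character is not whitespace. The helper names `isWhitespace`, `canOpen`, and `canClose` are hypothetical and not taken from the PR, and the real rule may look at more context than this.

```javascript
// Hypothetical helpers illustrating the boundary rule; not the PR's actual code.
function isWhitespace(ch) {
  // Treat "no character" (start or end of input) like whitespace.
  return ch === undefined || /\s/.test(ch);
}

// A delimiter at index `i` can start a format only if non-whitespace follows it.
function canOpen(text, i) {
  return !isWhitespace(text[i + 1]);
}

// A delimiter at index `i` can end a format only if non-whitespace precedes it.
function canClose(text, i) {
  return !isWhitespace(text[i - 1]);
}
```

With these helpers, the pipe in `foo| bar` fails `canOpen` and the pipe in `foo |bar` fails `canClose`, which mirrors the two examples given.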

domenic · 2015-05-01T16:47:10Z

Overall looking good. Want to get some docs in the suggested places before merging since otherwise I'll never be able to maintain it myself :).

Also needs doc updates.

domenic · 2015-05-01T16:47:44Z

Oh and I guess we no longer need cheerio?

bterlson · 2015-05-02T18:31:48Z

Yeah, no cheerio.

I also need to test this with all existing emu documents before we merge this to make sure we're not going to break anyone.

I may also add entry points for list and paragraph - I realized I can do this in a pretty straightforward way, and you'll get friendly parse errors if you are trying to parse a list and have a paragraph break (two line breaks).

domenic · 2015-05-03T11:34:33Z

I also need to test this with all existing emu documents before we merge this to make sure we're not going to break anyone.

Yeah, I'm curious how this will work. Will it just use the document entry point and leave any other EMU tags untouched?

bterlson · 2015-06-13T20:00:45Z

Addressed the feedback (with some exceptions). Verified that the Object.observe, SIMD, and Arrow Functions specs are not broken by this change. Many little issues are fixed in the SIMD spec due to its prevalent use of underscores (for example, Foo(%ArrayPrototype_join%, _element_) was treating the first two underscores as var decls whereas it should be the last two). Some new issues are introduced, however: e.g. _SIMD_Constructor or _foo_ considers the outer underscores as a var, which is per markdown. Probably the only way out of this is escaping, which I can implement in a later PR. Also, <ins>_SIMD_Constructor</ins> foo(_x_) --> <ins><var>SIMD_Constructor</ins> foo(_x</var>) --> beautifies to <ins><var>SIMD_Constructor</var></ins> foo(_x) which is sad... not sure how to fix at the moment.

Anyway, cleaning things up and commenting, will push after I get back home :)

bterlson · 2015-06-16T00:10:59Z

Updated. Should be in much better shape now.

domenic · 2015-06-16T22:10:21Z

lib/ecmarkdown.js

 var escapeHtml = require('escape-html');
+var beautify = require('./beautify.js');
+var Tokenizer = require('./tokenizer.js');
+var Emitter = require('./emitter.js')


Not a class so not capitalized

domenic · 2015-06-16T22:15:50Z

_SIMD_Constructor or _foo_ considers the outer underscores as a var, which is per markdown.

Hmm, I'm not sure I like this. Given that variables rarely have spaces in them, I think we should interpret this as two variables if at all possible. This is a difference between variables and emphasis.

domenic · 2015-06-16T22:16:48Z

lib/beautify.js

+  var originalOutput = beautifyWithBugs(html, {
+    indent_size: 2,
+    wrap_line_length: 0,
+    unformatted: ['emu-const', 'emu-val', 'emu-nt'].concat(inlineElements)


I guess just add 'ins' and 'del' here.

domenic · 2015-06-16T22:20:22Z

I can fix everything in my review comments except #22 (comment) about variables. (Actually, maybe I should do that too so I get re-acquainted with the innards.) If you're going to be busy just let me know and I'll take care of them.

bterlson · 2015-06-16T22:33:21Z

I can do. I also want to do some more robust perf testing. It's possible that I regressed performance a lot in the last version (or that my original benchmarks were off). Want to know which before we proceed.

domenic · 2015-07-14T23:43:23Z

@bterlson poke!

bterlson · 2015-07-15T00:02:50Z

Yes, I promised @rwaldron I would make progress on this this weekend. At the very least I'll decide if I think this is the right path forward (and I'm feeling it is as I suspect such things as MD links and HTML integration will be tough using the PEG grammar).

bterlson · 2015-07-19T22:43:39Z

Wrote the following benchmark:

var Bench = require('benchmark');
var emd = require('../lib/ecmarkdown');

var fragmentCase = 'Pop-up wolf *literally* ~Blue~ Bottle, kitsch pork belly |actually| locavore 3 _wolf_ moon ennui butcher raw denim taxidermy you probably haven\'t heard of them.';

var listCase = "1. Pop-up wolf *literally*\n2. ~Blue~ Bottle, kitsch\n  1. prk belly |actually|\n  2. locavore 3 _wolf_ moon ennui\n    1. butcher raw\n    2. denim taxidermy\n";

for(var i = 0; i < 4; i++) {
  fragmentCase += fragmentCase;
  listCase += listCase;
}

var suite = new Bench.Suite();
suite.add('Fragment (' + fragmentCase.length + ' chars)', function() {
  emd.fragment(fragmentCase);
});

suite.add('List (' + listCase.length + ' chars)', function() {
  emd.list(listCase);
});

suite.on('cycle', function(event) {
  console.log(String(event.target));
});

suite.run();

Results with latest master:
Fragment (2560 chars) x 23.22 ops/sec ±6.42% (45 runs sampled)
List (2464 chars) x 20.11 ops/sec ±7.01% (40 runs sampled)

Results with this PR:
Fragment (2560 chars) x 234 ops/sec ±5.91% (75 runs sampled)
List (2464 chars) x 102 ops/sec ±8.43% (71 runs sampled)

So at the very least we seem faster (roughly 10x on fragments and 5x on lists). Will proceed with this PR. Have some additional cleanups I want to work on, I can improve tokenizer perf, and will implement your feedback.

BTW, since this supports paragraphs, I think we can substantially improve Emu perf by passing the entire innerHTML of a clause to emd.document. Would mean we wouldn't need p tags anymore at least.

bterlson · 2015-07-20T22:13:35Z

Got another 25% or so by moving to a hand-written scanner (as opposed to constructing a giant regexp). As a bonus, I can now handle tags in a more rigorous per-html-spec way.

Fragment (2560 chars) x 290 ops/sec ±6.49% (77 runs sampled)
List (2464 chars) x 123 ops/sec ±6.34% (75 runs sampled)

Will now move on to implementing the feedback above (as well as implementing escaping which should now be trivial with the hand-written scanner) and cleaning things up. Also need to write tests for the scanner itself.

bterlson · 2015-07-22T23:23:38Z

Pushed the latest progress if you want to check it out.

I think the only thing left that was raised in this PR is the handling of _. I can fix that as part of this PR or I can move it to a separate PR. Or you can fix it. Up to you :)

If you were going to fix it, you'll want to handle it in the parser and as soon as you come across a whitespace token while parsing the _ format, convert the entire parse node back to chars. See parseFormat's handling of unclosed formats for an example.
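
A rough sketch of that idea, assuming a flat token array where each token looks like `{ name, contents }` with `name` being `'underscore'`, `'whitespace'`, or `'chars'`. The function `parseVar`, the token shape, and the return value are all hypothetical; the real parseFormat may be structured quite differently.

```javascript
// Hypothetical sketch: parse a _var_ and fall back to plain chars on whitespace.
// Assumes tokens[start] is the opening underscore.
function parseVar(tokens, start) {
  var contents = '_';
  for (var i = start + 1; i < tokens.length; i++) {
    var tok = tokens[i];
    if (tok.name === 'whitespace') {
      // Variables may not contain spaces: convert what we have so far
      // back to chars and let the caller resume at the whitespace token.
      return { node: { name: 'chars', contents: contents }, next: i };
    }
    if (tok.name === 'underscore') {
      // Closing underscore: drop the leading "_" and emit a var node.
      return { node: { name: 'var', contents: contents.slice(1) }, next: i + 1 };
    }
    contents += tok.contents;
  }
  // Unclosed format: everything consumed so far is plain text.
  return { node: { name: 'chars', contents: contents }, next: tokens.length };
}
```

Returning the index of the next unconsumed token lets the caller keep scanning, so a later `_foo_` in the same line can still become a var.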

Escape chars should be handled in the lexer. Parser never needs to know about them.
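
As an illustration of keeping escapes out of the parser, here is a small character-level scanner sketch. The name `scan`, the `FORMAT_CHARS` set, and the token names are assumptions made for this example; they are not the tokenizer in this PR.

```javascript
// Hypothetical scanner: a backslash followed by a format character becomes
// plain chars, so the parser never sees the escape.
// Assumed set of escapable characters: * _ ~ | ` and the backslash itself.
var FORMAT_CHARS = '*_~|`\\';

function scan(text) {
  var tokens = [];
  var i = 0;
  while (i < text.length) {
    var ch = text[i];
    if (ch === '\\' && i + 1 < text.length && FORMAT_CHARS.indexOf(text[i + 1]) !== -1) {
      // Escaped format character: emit it as ordinary text, skip the backslash.
      tokens.push({ name: 'chars', contents: text[i + 1] });
      i += 2;
    } else if (ch !== '\\' && FORMAT_CHARS.indexOf(ch) !== -1) {
      // Unescaped format character: emit a format token.
      tokens.push({ name: 'format', contents: ch });
      i += 1;
    } else {
      // Anything else, including a lone backslash, is plain text.
      tokens.push({ name: 'chars', contents: ch });
      i += 1;
    }
  }
  return tokens;
}
```

A real scanner would presumably coalesce adjacent chars tokens; this sketch emits one token per character to keep the escape logic easy to see.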

Markdown links can also be handled in the lexer by adding a new link token. Probably want to implement and call tryScanLink() when you come across a [ similar to handling of '<' for tags.

domenic · 2015-07-24T13:14:18Z

Merged as 29e219a! I'm going to spend a couple hours tweaking things (using class syntax, s/var/const, maybe trying to make the parser into a class instead of a bunch of nested functions that are re-instantiated every time...) But I'll let you know when I'm done so you can rebase and continue your work.

domenic · 2015-07-24T14:41:30Z

OK, done for the day! We should fix #25 certainly, but other than that I'm ready to do a new major release. I understand you might have some more ambitious ideas though, e.g. eliminating the fragment vs. document distinction in favor of passing through HTML/EMU tags.

bterlson · 2015-07-24T20:00:31Z

Todo list for this weekend:

Backslash escaping
MD Link syntax
MD bulleted list syntax
Handling of space in vars
Stretch: Add tags to the format stack and handle improper nesting of tags/formats appropriately (unblocks emu-clause --> document model in Emu).

domenic · 2015-07-24T20:02:21Z

I think MD Link syntax might deserve more discussion. External links are rare; internal links are emu-xrefs, which might need their own syntax. We should open a new issue to discuss.

domenic reviewed May 1, 2015

bterlson force-pushed the master branch from fdb9f60 to df4163c Compare June 16, 2015 00:09

Hand-written parser.

6708a37

bterlson force-pushed the master branch from df4163c to 6708a37 Compare June 16, 2015 00:54

domenic reviewed Jun 16, 2015

bterlson mentioned this pull request Jul 22, 2015

Allow skipping of beautification #12

Closed

bterlson added 2 commits July 22, 2015 15:18

Hand-written tokenizer

d203b61

Remove list export (breaking change)

0e9f408

Move parser to separate file and fix lint errors

cb34eba

bterlson force-pushed the master branch from 8243e8c to cb34eba Compare July 22, 2015 23:24

domenic closed this Jul 24, 2015

This was referenced Jul 24, 2015

Add an AST and AST-parsing mode #15

Closed

First call to list or paragraph slow #19

Closed
