Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hand-written parser #22

Closed
wants to merge 4 commits into from
Closed

Hand-written parser #22

wants to merge 4 commits into from

Conversation

bterlson
Copy link
Member

@bterlson bterlson commented May 1, 2015

Haven't really vetted this hard except to confirm it's significantly faster for me. I expect there to be some minor issues and perhaps major feedback on the implementation so I'm starting a PR :)

Semantic delta vs. the old parser:

  1. Comments in lists are handled differently.
  2. Some minor bug fix (see test output)
  3. Nesting is supported for some formats (perhaps not necessary now but seemed like a capability we might want in the future)
  4. Support for documents rather than paragraphs. Documents are 0 or more paragraphs, where paragraphs can be either text or lists. The API has a breaking change that reflects this change (list replaced with document). Fragment remains.

LMK any thoughts. I'm not an expert at parsers so I may be making obvious mistakes. Obviously I have decided to eschew comments for now to preserve this implementation's mysterious yet oddly compelling appeal ;)

var lexemes = {
star: /\*/,
underscore: /_/,
tick: /`[^`\n]*`/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining why star, underscore, and tilde get a simpler treatment than tick, pipe, and string? Or we could get rid of the difference, if that doesn't break things horribly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a general comment about the grammar of this implementation. But to answer your question, star, underscore, and tilde can nest, whereas tick, string, and pipe cannot have other things inside of it. I alluded to this above - we could disallow nesting in all cases, but it seemed like something we might want in the future...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems easier to disallow nesting in the first iteration then losen as people ask for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've unified the parsing of all formats and disallowed nesting (if we want to allow nesting later it's a trivial change).

All formats are now handled uniformly with respect to how they can be started and ended (eg. foo| bar is not a valid start and foo |bar is not a valid end) and whether spaces are allowed.

@domenic
Copy link
Member

domenic commented May 1, 2015

Overall looking good. Want to get some docs in the suggested places before merging since otherwise I'll never be able to maintain it myself :).

Also needs doc updates.

@domenic
Copy link
Member

domenic commented May 1, 2015

Oh and I guess we no longer need cheerio?

@bterlson
Copy link
Member Author

bterlson commented May 2, 2015

Yeah, no cheerio.

I also need to test this with all existing emu documents before we merge this to make sure we're not going to break anyone.

I may also add entry points for list and paragraph - I realized I can do this in a pretty straight forward way, and you'll get friendly parse errors if you are trying to parse a list and have a paragraph break (two line breaks).

@domenic
Copy link
Member

domenic commented May 3, 2015

I also need to test this with all existing emu documents before we merge this to make sure we're not going to break anyone.

Yeah, I'm curious how this will work. Will it just use the document entry point and leave any other EMU tags untouched?

@bterlson
Copy link
Member Author

Addressed the feedback (with some exceptions). Verified that Object.observe, SIMD, and Arrow Functions specs are not broken by this change. Many little issues are fixed in the SIMD spec due to its prevalent use of underscores (for example, Foo(%ArrayPrototype_join%, _element_) was considering the first two underscores as var decls whereas it should be the last two). Although some new issues are introduced, eg. _SIMD_Constructor or _foo_ considers the outer underscores as a var, which is per markdown. Probably the only way out of this is escaping which I can implement in a later PR. Also, <ins>_SIMD_Constructor</ins> foo(_x_) --> <ins><var>SIMD_Constructor</ins> foo(_x</var>) --> beautifies to <ins><var>SIMD_Constructor</var></ins> foo(_x) which is sad... not sure how to fix at the moment.

Anyway, cleaning things up and commenting, will push after I get back home :)

@bterlson
Copy link
Member Author

Updated. Should be in much better shape now.

var escapeHtml = require('escape-html');
var beautify = require('./beautify.js');
var Tokenizer = require('./tokenizer.js');
var Emitter = require('./emitter.js')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a class so not capitalized

@domenic
Copy link
Member

domenic commented Jun 16, 2015

_SIMD_Constructor or _foo_ considers the outer underscores as a var, which is per markdown.

Hmm, I'm not sure I like this. Given that variables rarely have spaces in them, I think we should interpret this as two variables if at all possible. This is a difference between variables and emphasis.

var originalOutput = beautifyWithBugs(html, {
indent_size: 2,
wrap_line_length: 0,
unformatted: ['emu-const', 'emu-val', 'emu-nt'].concat(inlineElements)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess just add 'ins' and 'del' here.

@domenic
Copy link
Member

domenic commented Jun 16, 2015

I can fix everything in my review comments except #22 (comment) about variables. (Actually, maybe I should do that too so I get re-acquianted with the innards.) If you're going to be busy just let me know and I'll take care of them.

@bterlson
Copy link
Member Author

I can do. I also want to do some more robust perf testing. It's possible that I regressed performance a lot in the last version (or that my original benchmarks were off). Want to know which before we proceed.

@domenic
Copy link
Member

domenic commented Jul 14, 2015

@bterlson poke!

@bterlson
Copy link
Member Author

Yes, I promised @rwaldron I would make progress on this this weekend. At the very least I'll decide if I think this is the right path forward (and I'm feeling it is as I suspect such things as MD links and HTML integration will be tough using the PEG grammar).

@bterlson
Copy link
Member Author

Wrote the following benchmark:

var Bench = require('benchmark');
var emd = require('../lib/ecmarkdown');

var fragmentCase = 'Pop-up wolf *literally* ~Blue~ Bottle, kitsch pork belly |actually| locavore 3 _wolf_ moon ennui butcher raw denim taxidermy you probably haven\'t heard of them.';

var listCase = "1. Pop-up wolf *literally*\n2. ~Blue~ Bottle, kitsch\n  1. prk belly |actually|\n  2. locavore 3 _wolf_ moon ennui\n    1. butcher raw\n    2. denim taxidermy\n";

for(var i = 0; i < 4; i++) {
  fragmentCase += fragmentCase;
  listCase += listCase;
}

var suite = new Bench.Suite();
suite.add('Fragment (' + fragmentCase.length + ' chars)', function() {
  emd.fragment(fragmentCase);
})

suite.add('List (' + listCase.length + ' chars)', function() {
  emd.list(listCase);
});

suite.on('cycle', function(event) {
  console.log(String(event.target));
});

suite.run();

Results with latest master:
Fragment (2560 chars) x 23.22 ops/sec ±6.42% (45 runs sampled)
List (2464 chars) x 20.11 ops/sec ±7.01% (40 runs sampled)

Results with this PR:
Fragment (2560 chars) x 234 ops/sec ±5.91% (75 runs sampled)
List (2464 chars) x 102 ops/sec ±8.43% (71 runs sampled)

So at the very least we seem faster. Will proceed with this PR. Have some additional cleanups I want to work on, I can improve tokenizer perf, and will implement your feedback.

BTW, since this supports paragraphs, I think we can substantially improve Emu perf by passing the entire innerHTML of a clause to emd.document. Would mean we wouldn't need p tags anymore at least.

@bterlson
Copy link
Member Author

Got another 25% or so by moving to a hand-written scanner (as opposed to constructing a giant regexp). As a bonus, I can now handle tags in a more rigorous per-html-spec way.

Fragment (2560 chars) x 290 ops/sec ±6.49% (77 runs sampled)
List (2464 chars) x 123 ops/sec ±6.34% (75 runs sampled)

Will now move on to implementing the feedback above (as well as implementing escaping which should now be trivial with the hand-written scanner) and cleaning things up. Also need to write tests for the scanner itself.

@bterlson
Copy link
Member Author

Pushed the latest progress if you want to check it out.

I think the only thing left that was raised in this PR is the handling of _. I can fix that as part of this PR or I can move it to a separate PR. Or you can fix it. Up to you :)

If you were going to fix it, you'll want to handle it in the parser and as soon as you come across a whitespace token while parsing the _ format, convert the entire parse node back to chars. See parseFormat's handling of unclosed formats for an example.

Escape chars should be handled in the lexer. Parser never needs to know about them.

Markdown links can also be handled in the lexer by adding a new link token. Probably want to implement and call tryScanLink() when you come across a [ similar to handling of '<' for tags.

@domenic
Copy link
Member

domenic commented Jul 24, 2015

Merged as 29e219a! I'm going to spend a couple hours tweaking things (using class syntax, s/var/const, maybe trying to make the parser into a class instead of a bunch of nested functions that are re-instantiated every time...) But I'll let you know when I'm done so you can rebase and continue your work.

@domenic
Copy link
Member

domenic commented Jul 24, 2015

OK, done for the day! We should fix #25 certainly, but other than that I'm ready to do a new major release. I understand you might have some more ambitious ideas though, e.g. eliminating the fragment vs. document distinction in favor of passing through HTML/EMU tags.

@bterlson
Copy link
Member Author

Todo list for this weekend:

  • Backslash escaping
  • MD Link syntax
  • MD bulleted list syntax
  • Handling of space in vars
  • Stretch: Add tags to the format stack and handle improper nesting of tags/formats appropriately (unblocks emu-clause --> document model in Emu).

@domenic
Copy link
Member

domenic commented Jul 24, 2015

I think MD Link syntax might deserve more discussion. External links are rare; internal links are emu-xrefs, which might need their own syntax. We should open a new issue to discuss.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants