Hand-written parser #22
Conversation
var lexemes = {
  star: /\*/,
  underscore: /_/,
  tick: /`[^`\n]*`/,
Add a comment explaining why star, underscore, and tilde get a simpler treatment than tick, pipe, and string? Or we could get rid of the difference, if that doesn't break things horribly?
I will add a general comment about the grammar of this implementation. But to answer your question, star, underscore, and tilde can nest, whereas tick, string, and pipe cannot have other things inside of it. I alluded to this above - we could disallow nesting in all cases, but it seemed like something we might want in the future...
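Concretely, the difference shows up in the lexemes in this diff. This is a sketch of the idea, not the exact grammar: `tick` matches the whole span in one regex, so its contents can never contain nested formats, while `star` and friends match only the delimiter and leave the contents to the parser.

```javascript
// Sketch: single-character delimiters vs. whole-span lexemes.
var lexemes = {
  star: /\*/,          // delimiter only; contents are parsed and may nest
  underscore: /_/,
  tilde: /~/,
  tick: /`[^`\n]*`/    // whole span in one token; nothing inside can nest
};

// The tick regex swallows its contents, so the inner stars are never
// candidates for emphasis:
var m = '`code with *stars*`'.match(lexemes.tick);
```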
Seems easier to disallow nesting in the first iteration, then loosen it as people ask for it?
I've unified the parsing of all formats and disallowed nesting (if we want to allow nesting later it's a trivial change).
All formats are now handled uniformly with respect to how they can be started and ended (e.g. `foo| bar` is not a valid start and `foo |bar` is not a valid end) and whether spaces are allowed.
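A minimal sketch of those start/end rules (the function names are mine, not from this PR):

```javascript
// A delimiter can only open a format when followed by a non-space character,
// and only close one when preceded by a non-space character.
function isValidStart(nextChar) {
  return nextChar !== undefined && !/\s/.test(nextChar);
}

function isValidEnd(prevChar) {
  return prevChar !== undefined && !/\s/.test(prevChar);
}

// In 'foo| bar' the '|' is followed by a space, so it cannot open a format;
// in 'foo |bar' the second '|' is preceded by a space, so it cannot close one.
```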
Overall looking good. Want to get some docs in the suggested places before merging, since otherwise I'll never be able to maintain it myself :). Also needs doc updates.
Oh, and I guess we no longer need cheerio?
Yeah, no cheerio. I also need to test this with all existing emu documents before we merge this, to make sure we're not going to break anyone. I may also add entry points for list and paragraph - I realized I can do this in a pretty straightforward way, and you'll get friendly parse errors if you are trying to parse a list and have a paragraph break (two line breaks).
Yeah, I'm curious how this will work. Will it just use the document entry point and leave any other EMU tags untouched?
Addressed the feedback (with some exceptions). Verified that the Object.observe, SIMD, and Arrow Functions specs are not broken by this change. Many little issues are fixed in the SIMD spec due to its prevalent use of underscores (for example, …). Anyway, cleaning things up and commenting; will push after I get back home :)
Updated. Should be in much better shape now.
var escapeHtml = require('escape-html');
var beautify = require('./beautify.js');
var Tokenizer = require('./tokenizer.js');
var Emitter = require('./emitter.js');
Not a class so not capitalized
Hmm, I'm not sure I like this. Given that variables rarely have spaces in them, I think we should interpret this as two variables if at all possible. This is a difference between variables and emphasis.
var originalOutput = beautifyWithBugs(html, {
  indent_size: 2,
  wrap_line_length: 0,
  unformatted: ['emu-const', 'emu-val', 'emu-nt'].concat(inlineElements)
I guess just add `'ins'` and `'del'` here.
I can fix everything in my review comments except #22 (comment) about variables. (Actually, maybe I should do that too so I get re-acquainted with the innards.) If you're going to be busy, just let me know and I'll take care of them.
I can do. I also want to do some more robust perf testing. It's possible that I regressed performance a lot in the last version (or that my original benchmarks were off). Want to know which before we proceed.

@bterlson poke!

Yes, I promised @rwaldron I would make progress on this this weekend. At the very least I'll decide if I think this is the right path forward (and I'm feeling it is, as I suspect such things as MD links and HTML integration will be tough using the PEG grammar).
Wrote the following benchmark:

```javascript
var Bench = require('benchmark');
var emd = require('../lib/ecmarkdown');

var fragmentCase = 'Pop-up wolf *literally* ~Blue~ Bottle, kitsch pork belly |actually| locavore 3 _wolf_ moon ennui butcher raw denim taxidermy you probably haven\'t heard of them.';
var listCase = "1. Pop-up wolf *literally*\n2. ~Blue~ Bottle, kitsch\n 1. prk belly |actually|\n 2. locavore 3 _wolf_ moon ennui\n 1. butcher raw\n 2. denim taxidermy\n";

for (var i = 0; i < 4; i++) {
  fragmentCase += fragmentCase;
  listCase += listCase;
}

var suite = new Bench.Suite();

suite.add('Fragment (' + fragmentCase.length + ' chars)', function() {
  emd.fragment(fragmentCase);
});

suite.add('List (' + listCase.length + ' chars)', function() {
  emd.list(listCase);
});

suite.on('cycle', function(event) {
  console.log(String(event.target));
});

suite.run();
```

Results with latest master:

Results with this PR:

So at the very least we seem faster. Will proceed with this PR. Have some additional cleanups I want to work on, I can improve tokenizer perf, and will implement your feedback.

BTW, since this supports paragraphs, I think we can substantially improve Emu perf by passing the entire innerHTML of a clause to emd.document. Would mean we wouldn't need p tags anymore, at least.
Got another 25% or so by moving to a hand-written scanner (as opposed to constructing a giant regexp). As a bonus, I can now handle tags in a more rigorous, per-HTML-spec way.

Fragment (2560 chars) x 290 ops/sec ±6.49% (77 runs sampled)

Will now move on to implementing the feedback above (as well as implementing escaping, which should now be trivial with the hand-written scanner) and cleaning things up. Also need to write tests for the scanner itself.
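The character-dispatch idea can be sketched like this (a toy model, not the actual tokenizer): branch on the current character directly instead of running one large alternation regexp over the input for every token.

```javascript
// Single-character format delimiters recognized by this toy scanner.
var DELIMS = '*_~|';

function scan(str) {
  var tokens = [];
  var i = 0;
  while (i < str.length) {
    if (DELIMS.indexOf(str[i]) !== -1) {
      // A format delimiter becomes its own token.
      tokens.push({ name: str[i], contents: str[i] });
      i++;
    } else {
      // Accumulate a run of plain characters into one 'chars' token.
      var start = i;
      while (i < str.length && DELIMS.indexOf(str[i]) === -1) i++;
      tokens.push({ name: 'chars', contents: str.slice(start, i) });
    }
  }
  return tokens;
}
```

The win comes from never re-running a compiled mega-regexp per token; each character is inspected once.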
Pushed the latest progress if you want to check it out. I think the only thing left that was raised in this PR is the handling of `_`. I can fix that as part of this PR, or I can move it to a separate PR. Or you can fix it. Up to you :) If you were going to fix it, you'll want to handle it in the parser: as soon as you come across a whitespace token while parsing the `_` format, convert the entire parse node back to chars. See parseFormat's handling of unclosed formats for an example. Escape chars should be handled in the lexer; the parser never needs to know about them. Markdown links can also be handled in the lexer by adding a new link token. Probably want to implement and call tryScanLink() when you come across a `[`, similar to the handling of `<` for tags.
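The whitespace bail-out might look roughly like this (the function and node shapes here are hypothetical, not the PR's actual parseFormat):

```javascript
// Parse the contents of a `_`-delimited variable, starting just after the
// opening underscore. Bail out to plain chars on whitespace, since variables
// rarely contain spaces.
function parseVariable(rest) {
  var contents = '';
  for (var i = 0; i < rest.length; i++) {
    var c = rest[i];
    if (c === '_') {
      return { name: 'variable', contents: contents, consumed: i + 1 };
    }
    if (/\s/.test(c)) {
      // Whitespace: this isn't a variable after all. Re-emit the opening
      // '_' and everything seen so far as ordinary characters.
      return { name: 'chars', contents: '_' + contents + c, consumed: i + 1 };
    }
    contents += c;
  }
  // Unclosed format: also fall back to chars.
  return { name: 'chars', contents: '_' + contents, consumed: rest.length };
}
```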
Merged as 29e219a! I'm going to spend a couple of hours tweaking things (using class syntax, s/var/const, maybe trying to make the parser into a class instead of a bunch of nested functions that are re-instantiated every time...). But I'll let you know when I'm done so you can rebase and continue your work.

OK, done for the day! We should certainly fix #25, but other than that I'm ready to do a new major release. I understand you might have some more ambitious ideas though, e.g. eliminating the fragment vs. document distinction in favor of passing through HTML/EMU tags.
Todo list for this weekend:
I think MD Link syntax might deserve more discussion. External links are rare; internal links are emu-xrefs, which might need their own syntax. We should open a new issue to discuss. |
Haven't really vetted this hard except to confirm it's significantly faster for me. I expect there to be some minor issues and perhaps major feedback on the implementation so I'm starting a PR :)
Semantic delta vs. the old parser:

- `list` replaced with `document`. Fragment remains.

LMK any thoughts. I'm not an expert at parsers so I may be making obvious mistakes. Obviously I have decided to eschew comments for now to preserve this implementation's mysterious yet oddly compelling appeal ;)