Title: peg-megamarkdown User's Guide
Author: Fletcher T. Penney
Author: Tobias Weingartner
Base Header Level: 2
Markdown is a simple markup language used to convert plain text into HTML.
MultiMarkdown is a derivative of Markdown that adds new syntax features, such as footnotes, tables, and metadata. Additionally, it offers mechanisms to convert plain text into LaTeX in addition to HTML.
peg-multimarkdown is an implementation of MultiMarkdown derived from John MacFarlane's peg-markdown. It makes use of a parsing expression grammar (PEG) and is written in C. It should compile on almost any major operating system.
Thanks to work by Daniel Jalkut, MMD no longer requires GLib2 as a dependency. This should make it easier to compile on various operating systems.
MegaMarkdown is my gutted, bastardized version of all of the above. While I was looking at adding a number of extensions to something akin to a markdown parser/converter, it quickly became obvious that most of them are written in a very nasty manner. The Perl, Python, Ruby, or whatever other scripting language parsers were largely overglorified sed(1)/awk(1) scripts. The "peg" versions were at least written in a somewhat useful language. Unfortunately, they all seemed to suffer from bloat and baggage that I neither needed nor wanted. So the first thing I did was nuke a bunch of functionality. If you want LaTeX, PDF, or whatever other output, use a different tool. If you want to run things in batch, use a script or some other method that drives a simple tool. This also gives you the option to do things in parallel, should you choose to.
Larger Issues with the code
Some of the larger issues with the code are that all the files are read into one large memory buffer and then processed multiple times by one PEG/LEG specification, using different terminal symbols to implement the parsing of the document(s). Naturally, this means that the memory consumption of the PEG/LEG versions is much higher. Some of the issues that I've specifically found:
- Tabs are expanded multiple times through the full string buffer. It seems the % operator was not considered. The Unix command expand(1) does the same job, if you want it. Note that because expansion is done before parsing, it changes what the parser sees.
- All parsing is basically done with a huge contiguous buffer. No real parse tree is built and used. This limits the amount of work that can be done by generic passes over the parse tree (think references, etc).
- A number of symbols have names that conflict with system functions, such as link(2). These should simply be renamed in a manner that does not conflict on most POSIX-style systems.
- One of the big issues I've had with the markdown style syntax is that it knows more about HTML/XML than it really should. Adding a "RAW" or "VERBATIM" option, so that you can embed whatever you wish within the document, seemed like a better alternative than trying to figure out all the different HTML tags someone is going to come up with in the future.
- Another issue is the absence of arbitrary attributes for simple elements, such as the ability to attach a class or id to help with styling images and other content.
- No include files. This one could be largely dealt with by using multiple files, and cat(1)'ing them together, or parsing them one after another for getting complete output. In this manner, headers, footers, etc are able to be defined in separate files, and re-used.
- No variables. The multimarkdown variants have a header to define document metadata. This is not a bad idea, but it could easily be extended by making those header elements "variables" that can later be used within the document (think 'Title:'), and that can also modify the behavior of the parser (internal variables, i.e. 'Base Header Level:').
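To illustrate the first item in the list above, here is a minimal C sketch of why expanding tabs before parsing changes the parse. It is not taken from the peg-multimarkdown source: the names `expand_tabs` and `leading_spaces` are hypothetical, and the tab width of four columns used in the note below is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Expand tabs in `in` to spaces, with tab stops every `width` columns
 * (the same idea as expand(1)). Returns a malloc'd string that the
 * caller must free, or NULL if allocation fails. */
char *expand_tabs(const char *in, size_t width)
{
    assert(width > 0);
    /* Worst case: every byte is a tab that expands to `width` spaces. */
    char *out = malloc(strlen(in) * width + 1);
    if (out == NULL)
        return NULL;

    size_t col = 0, n = 0;
    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '\t') {
            /* Pad up to the next tab stop. */
            size_t pad = width - (col % width);
            memset(out + n, ' ', pad);
            n += pad;
            col += pad;
        } else {
            out[n++] = *p;
            col = (*p == '\n') ? 0 : col + 1;
        }
    }
    out[n] = '\0';
    return out;
}

/* Count leading spaces: the property an indentation-sensitive rule
 * would look at when deciding, for example, whether a line is code. */
size_t leading_spaces(const char *line)
{
    return strspn(line, " ");
}
```

With a width of four, the line `"\tx"` has zero leading spaces before expansion and four afterwards, so any grammar rule written in terms of leading spaces sees a different line depending on whether expansion ran first.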
In order to fix and extend the parser to do some of these things, four fundamental changes would have to be made.
- The tab expansion would have to go, which will necessitate some changes in the PEG/LEG parser specification to deal with quoting, etc. This will likely also result in changes to the testing documents. This pretty much makes the parser not fully backwards compatible anymore. At this point we might as well come up with a new testing harness, along with new test cases.
- Explicit parsing of HTML/XML document structure will need to be removed. Unless the markdown document specifies "RAW" or "VERBATIM" (maybe "COPY") mode, all text will be up for parsing/modification of any markdown formatting contained within.
- The multi-parse of the result buffer string will have to go, replaced with a single parsing pass to create a parse tree (or possibly forest for multiple files), followed by walking this parse tree (possibly multiple times) to resolve references, expand variables, and other such things.
- Walk the final parse tree to output the requested representation, which at this point would be either the printed document and/or the header variables requested.
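The last two changes can be sketched in C as a tree that is built once and then walked more than once. This is only an illustration under stated assumptions: the node layout, the `NODE_*` kinds, and the functions `resolve_vars` and `render` are hypothetical names, not part of any existing parser, and the tree is assumed to have been built already by a single parsing pass.

```c
#include <assert.h>
#include <string.h>

enum node_kind { NODE_GROUP, NODE_TEXT, NODE_VARREF };

struct node {
    enum node_kind kind;
    const char *text;    /* literal text, or the variable name for NODE_VARREF */
    struct node *child;  /* first child, or NULL */
    struct node *next;   /* next sibling, or NULL */
};

struct var {
    const char *name;
    const char *value;
};

/* Pass 1: walk the tree and turn every variable reference that has a
 * definition into a plain text node. Returns the number of references
 * that could not be resolved. */
size_t resolve_vars(struct node *n, const struct var *vars, size_t nvars)
{
    size_t unresolved = 0;
    for (; n != NULL; n = n->next) {
        if (n->kind == NODE_VARREF) {
            size_t i;
            for (i = 0; i < nvars; i++) {
                if (strcmp(n->text, vars[i].name) == 0) {
                    n->kind = NODE_TEXT;
                    n->text = vars[i].value;
                    break;
                }
            }
            if (i == nvars)
                unresolved++;
        }
        unresolved += resolve_vars(n->child, vars, nvars);
    }
    return unresolved;
}

/* Helper for render(): append text nodes depth-first, truncating to fit. */
static void render_into(const struct node *n, char *buf, size_t size,
                        size_t *used)
{
    for (; n != NULL; n = n->next) {
        if (n->kind == NODE_TEXT) {
            size_t len = strlen(n->text);
            if (len > size - 1 - *used)
                len = size - 1 - *used;
            memcpy(buf + *used, n->text, len);
            *used += len;
        }
        render_into(n->child, buf, size, used);
    }
}

/* Final pass: walk the tree and write the output into `buf`.
 * Unresolved references are skipped. */
void render(const struct node *root, char *buf, size_t size)
{
    size_t used = 0;
    assert(size > 0);
    render_into(root, buf, size, &used);
    buf[used] = '\0';
}
```

Because both passes only see nodes, further passes (reference resolution, header renumbering, and so on) can be added without rescanning the input buffer.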
As such, a new megamarkdown format document will consist of a number of sections, each of them potentially parsed in a different manner. Each of these sections can be written in an explicit or an implicit manner. The explicit manner is an unambiguous method (with extra markup) to write a document, where explicit markup on a section helps megamarkdown determine what type of section it is parsing. The implicit method is similar in nature to the current markdown formatting. The current section types amount to the following:
- Header/Variables section. In its implicit form, this is the (optional) first section in the document, formatted in the 'variable: content' style. The explicit section type (which can occur anywhere in a document) is "vars".
- RAW/VERBATIM section. This section is only explicitly instantiated. These sections are basically copied character for character to the output. This is the "approved" method of embedding HTML/XML/etc markup within a document.
- PARSED section. These sections are parsed according to the markdown syntax, either by explicitly stating the type of section you are instantiating, or by using the less precise implicit section formatting from the markdown format.
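A rough C sketch of how the three section types might be dispatched once a document has been split into sections. Everything here is an assumption made for illustration: the `SECTION_*` names, the `struct section` layout, and `render_sections` are hypothetical, and `parse_stub` merely escapes `<`, `>`, and `&` as a stand-in for real markdown parsing. How sections are marked up and recognized in the input is deliberately left out.

```c
#include <assert.h>
#include <string.h>

enum section_type { SECTION_VARS, SECTION_RAW, SECTION_PARSED };

struct section {
    enum section_type type;
    const char *text;
};

/* Append `s` to `buf`, truncating to fit; returns the new length. */
static size_t append(char *buf, size_t size, size_t used, const char *s)
{
    size_t len = strlen(s);
    if (len > size - 1 - used)
        len = size - 1 - used;
    memcpy(buf + used, s, len);
    buf[used + len] = '\0';
    return used + len;
}

/* Placeholder for the markdown parser: escapes HTML-special characters
 * only, to show that PARSED text is modified on its way to the output. */
static size_t parse_stub(char *buf, size_t size, size_t used, const char *s)
{
    for (; *s != '\0'; s++) {
        switch (*s) {
        case '<': used = append(buf, size, used, "&lt;");  break;
        case '>': used = append(buf, size, used, "&gt;");  break;
        case '&': used = append(buf, size, used, "&amp;"); break;
        default: {
            char one[2] = { *s, '\0' };
            used = append(buf, size, used, one);
            break;
        }
        }
    }
    return used;
}

/* Render each section according to its type. */
void render_sections(const struct section *secs, size_t n,
                     char *buf, size_t size)
{
    size_t used = 0;
    assert(size > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        switch (secs[i].type) {
        case SECTION_VARS:    /* definitions only, nothing is printed */
            break;
        case SECTION_RAW:     /* copied character for character */
            used = append(buf, size, used, secs[i].text);
            break;
        case SECTION_PARSED:  /* handed to the (stand-in) parser */
            used = parse_stub(buf, size, used, secs[i].text);
            break;
        }
    }
}
```

In this sketch an HTML tag placed in a RAW section reaches the output untouched, while the same characters in a PARSED section are rewritten, which is the distinction the section types are meant to draw.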