Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: metadata blocks. Fixes #580 #641

Merged
merged 3 commits into from May 14, 2023

Conversation

Martin1887
Copy link
Collaborator

YAML-style and pluses-delimited metadata blocks are supported, each one in a different Options flag. This is a breaking change because a new Tag enum variant is added.

Fixes: #580

YAML-style and pluses-delimited metadata blocks are supported, each one
in a different `Options` flag. This is a breaking change because a new
`Tag` enum variant is added.

Fixes: pulldown-cmark#580
@raphlinus
Copy link
Collaborator

raphlinus commented May 3, 2023

I am inclined against this pull request, as it feels like an expansion of scope of the crate. The biggest question I have is why the functionality shouldn't be provided in another crate; there's already the existing cmark-frontmatter and gray_matter projects. A secondary concern is that front matter is not standardized to anywhere the same degree as commonmark itself. In particular, I'm not aware of any standardized approach to metadata blocks inline to the document, nor do I yet understand the benefits of doing so.

Another approach to interleaving metadata and commonmark is to consider the outer document a "container" with metadata, commonmark text, and perhaps other content. That requires the ability to determine structure boundaries without necessarily doing detailed parsing of the text, but in my opinion that might be a good thing, avoiding what would otherwise be something of a layering violation.

I could perhaps be persuaded otherwise if there were other users arguing in favor of this approach.

@Martin1887
Copy link
Collaborator Author

The main obstacle to use other crates is the support for multiple metadata blocks, as specified in the comment #580 (comment).

Existing crates only support front matter at the start of the document, but although they support multiple metadata blocks, it would be hard integrating the metadata blocks into the events gotten by pulldown-cmark. Note that the problem here is not getting the information contained in the metadata blocks, but having metadata blocks produces different parsing like h2 Setext headings, horizontal rules and h1 ATX headings (the last one in commented lines of metadata blocks).

Multiple metadata blocks are supported by Pandoc (https://pandoc.org/MANUAL.html#metadata-blocks) and not supporting them is the only error I have found to parse a valid Pandoc document. That is, other Pandoc features are obviously not tracked by this crate, but they are simply ignored meanwhile metadata blocks produce incorrect output.

Also note that, as I commented in the issue, pulldown-cmark may not be always used for final output, but as a Markdown parser it can be used for some preprocessing before other engines like Pandoc, mdBook, etc. I think that this pull request does not hurt pulldown-cmark increasing complexity nor breaking anything meanwhile it would add support for correctly parsing documents with metadata blocks, and always only if flags are enabled.

@wooorm, @jaythomas, and @joelparkerhenderson put a thumb up in my issue comment, so I guess they may be interested in this feature.

@joelparkerhenderson
Copy link
Contributor

joelparkerhenderson commented May 3, 2023

Yes I'm highly in favor of a solution because I experience the same Pandoc incompatibility decribed by @Martin1887.

At the same time, I've seen front matter done in such different ways, many of which are incompatible, so I definitely like the points of @raphlinus about layering, boundaries, architecture, and crate scope.

In my own similar kind of parser and processing, I do the boundary detection approach, and toss each chunk into a list of ad-hoc recognizers, such as for front matter that uses YAML, or TOML, or HTML, or JSON. It turns out that JSON is surprisingly useful for controlling other tooling, much as @Martin1887 discusses.

My opinion (for whatever it's worth) is to set a timebox for us all to think about layers/boundaries/helpers/etc., such as a timebox of one week. If there's a good practical implementable architecture idea by the end of the timebox, then do it, otherwise ship this PR because it's good and helpful for real-world work.

@Martin1887
Copy link
Collaborator Author

I cannot found a fast and clean manner of implementing a layering approach and at the same time providing a clean interface to retrieve boundaries. Have you any ideas, @joelparkerhenderson?

In the future it would interesting to provide an interface to plug new rules with different priorities, multiple parsing stages and/or pre-processors and post-processors interfaces.

On the other hand, do you need a new release soon? My idea was creating a big release with all the breaking changes to reduce the number of breaking releases, but probably there are too many urgent issues to do that, and breaking releases in 0.x stage are not so rare.

Thanks.

@joelparkerhenderson
Copy link
Contributor

Here's how I did it in my own code... it's brute force done via a list of syntax-matcher-parser structs, that try to recognize front-matter versus normal markdown.

E.g. try a syntax-matcher-parser to see if there's a match of front-matter, and if so, parse it. Otherwise, try the next syntax-matcher-parser. Etc.

The purpose of this is to be able to handle a variety of front-matter syntaxes (e.g. HTML-like, JSON-like, YAML-like), then output a btree that is the configuration state of the document.

https://github.com/SixArm/sita-rust-crate/tree/main/src/matter

https://github.com/SixArm/sita-rust-crate/tree/main/src/state

Files:

* `matter/` - Markdown front matter and back matter files.
    * `matter_parser_enum.rs` - Matter parser enum (among BTMS, JSON, TOML, YAML).
    * `matter_parser_mutex.rs` - Matter parser mutex (among BTMS, JSON, TOML, YAML).
    * `matter_parser_trait.rs` - Matter parser trait (among `matter_parser_with_*`).
    * `matter_parser_with_btms.rs` - Matter parser implementation with BTMS (BTreeMap Struct).
    * `matter_parser_with_json.rs` - Matter parser implementation with JSON (JavaScript Object Notation).
    * `matter_parser_with_toml.rs` - Matter parser implementation with TOML (Tom's Obvious Minimal Language).
    * `matter_parser_with_yaml.rs` - Matter parser implementation with YAML (Yet Another Markup Language).
* `state/` - State that holds variables, such as front matter.
    * `state_enum.rs` - State enum (among BTMS, JSON, TOML, YAML).
    * `state_trait.rs` - State trait (among `state_with_*.rs`).
    * `state_with_btms.rs` - State implementation with BTMS (BTreeMap Struct).
    * `state_with_json.rs` - State implementation with JSON (JavaScript Object Notation).
    * `state_with_toml.rs` - State implementation with TOML (Tom's Obvious Minimal Language).
    * `state_with_yaml.rs` - State implementation with YAML (Yet Another Markup Language).

My guess is there's a better strategy.

I don't need a release soon.

@Martin1887
Copy link
Collaborator Author

OK, I see you are using regexes to parse blocks, not the best nor the most correct strategy. A high level and user-friendly approach will need too much time, so I merge this pull request now and in a future version a best approach to handle layers will be designed.

@Martin1887 Martin1887 merged commit c2a65da into pulldown-cmark:master May 14, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Front Matter Support
3 participants