Merge 25db6cb into 0cfe6d7
mattj23 committed Mar 16, 2022
2 parents 0cfe6d7 + 25db6cb commit 6cf174b
Showing 3 changed files with 519 additions and 0 deletions.
134 changes: 134 additions & 0 deletions doc/parsing-ast.md
@@ -0,0 +1,134 @@
# The Abstract Syntax Tree

If successful, the `Markdown.Parse(...)` method returns the abstract syntax tree (AST) of the source text.

This will be an object of the `MarkdownDocument` type, which derives from a more general block container and belongs to a larger taxonomy of classes that represent the different semantic constructs of a markdown syntax tree.

This document will discuss the different types of elements within the Markdig representation of the AST.

## Structure of the AST

Within Markdig, there are two general types of node in the markdown syntax tree: `Block` and `Inline`. Block nodes may contain inline nodes, but inline nodes may not contain block nodes. Blocks may contain other blocks, and inlines may contain other inlines.

The root of the AST is the `MarkdownDocument` which is itself derived from a container block but also contains information on the line count and starting positions within the document. Nodes in the AST have links both to parent and children, allowing the edges in the tree to be traversed efficiently in either direction.

Different semantic constructs are represented by types derived from the `Block` and `Inline` types, which are both `abstract` themselves. These elements are produced by `BlockParser` and `InlineParser` derived types, respectively, and so new constructs can be added with the implementation of a new block or inline parser and a new block or inline type, as well as an extension to register it in the pipeline. For more information on extending Markdig this way refer to the [Extensions/Parsers](parsing-extensions.md) document.

The AST is assembled by the static method `Markdown.Parse(...)` using the collections of block and inline parsers contained in the `MarkdownPipeline`. For more detailed information refer to the [Markdig Parsing Overview](parsing-overview.md) document.
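
As a minimal sketch of the entry point described above, the following parses a short string and lists the direct children of the root node. The sample text and variable names are illustrative only, and the pipeline is the default one with no extensions.

```csharp
using System;
using Markdig;
using Markdig.Syntax;

// Illustrative input; any markdown text works here.
string sourceText = "# Title\n\nSome *emphasized* text.\n";

// A pipeline containing only the default block and inline parsers.
MarkdownPipeline pipeline = new MarkdownPipelineBuilder().Build();

// Parse returns the root of the AST.
MarkdownDocument document = Markdown.Parse(sourceText, pipeline);

// MarkdownDocument is a container block, so its direct children can be enumerated.
foreach (Block block in document)
{
    Console.WriteLine($"{block.GetType().Name} starting on line {block.Line}");
}
```

Only the top-level blocks are visited here; nested blocks and inlines need one of the traversal approaches described below.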

### Quick Examples: Descendants API

The easiest way to traverse the abstract syntax tree is with a group of extension methods that have the name `Descendants`. Several different overloads exist to allow it to search for both `Block` and `Inline` elements, starting from any node in the tree.

The `Descendants` methods return `IEnumerable<MarkdownObject>` or `IEnumerable<T>` as their results. Internally they use `yield return` to perform edge traversals lazily.

#### Depth-First Like Traversal of All Elements

```csharp
MarkdownDocument result = Markdown.Parse(sourceText, pipeline);

// Iterate through all MarkdownObjects in a depth-first order
foreach (var item in result.Descendants())
{
    Console.WriteLine(item.GetType());

    // You can use pattern matching to isolate elements of certain type,
    // otherwise you can use the filtering mechanism demonstrated in the
    // next section
    if (item is ListItemBlock listItem)
    {
        // ...
    }
}
```

#### Filtering of Specific Child Types

Filtering can be performed using the `Descendants<T>()` method, in which `T` must derive from `MarkdownObject`.

```csharp
MarkdownDocument result = Markdown.Parse(sourceText, pipeline);

// Iterate through all ListItem blocks
foreach (var item in result.Descendants<ListItemBlock>())
{
    // ...
}

// Iterate through all image links (Where requires `using System.Linq;`)
foreach (var item in result.Descendants<LinkInline>().Where(x => x.IsImage))
{
    // ...
}
```

#### Combined Hierarchies

The `Descendants` method can be used on any `MarkdownObject`, not just the root node, so complex hierarchies can be queried.

```csharp
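MarkdownDocument document = Markdown.Parse(sourceText, pipeline);

// Find all Emphasis inlines which descend from a ListItem block.
// SelectMany flattens the per-item results into a single sequence.
var items = document.Descendants<ListItemBlock>()
    .SelectMany(block => block.Descendants<EmphasisInline>());

// Find all Emphasis inlines in paragraphs whose direct parent is a ListItem.
// ParentBlock is typed LeafBlock?, so it can never be a ListItemBlock (a
// container); filter on the leaf block's Parent instead.
var other = document.Descendants<ParagraphBlock>()
    .Where(paragraph => paragraph.Parent is ListItemBlock)
    .SelectMany(paragraph => paragraph.Descendants<EmphasisInline>());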
```

## Block Elements

Block elements all derive from `Block` and may be one of two types:

1. `ContainerBlock`, which is a block which holds other blocks (`MarkdownDocument` is itself derived from this)
2. `LeafBlock`, which is a block that has no child blocks, but may contain inlines

Block elements in markdown refer to things like paragraphs, headings, lists, code, etc. Most blocks may contain inlines, with the exception of things like code blocks.
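
The distinction between container and leaf blocks can be sketched with a small recursive walk. `PrintBlocks` is a hypothetical helper written for this illustration and is not part of Markdig. It assumes only that a `ContainerBlock` can be enumerated for its child blocks and that a `LeafBlock` exposes its inlines through the `Inline` property.

```csharp
using System;
using Markdig;
using Markdig.Syntax;

var document = Markdown.Parse("> quoted *text*\n\n- item one\n- item two\n");
PrintBlocks(document, 0);

// Hypothetical helper: prints each block indented by its depth in the tree.
static void PrintBlocks(Block block, int depth)
{
    string kind = block switch
    {
        ContainerBlock => "container",
        LeafBlock leaf when leaf.Inline is not null => "leaf with inlines",
        LeafBlock => "leaf without inlines",
        _ => "other",
    };
    Console.WriteLine($"{new string(' ', depth * 2)}{block.GetType().Name} ({kind})");

    // Only container blocks have child blocks to recurse into.
    if (block is ContainerBlock container)
    {
        foreach (Block child in container)
        {
            PrintBlocks(child, depth + 1);
        }
    }
}
```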

### Properties of Blocks

The following are properties of `Block` objects which warrant elaboration. For a full list of properties see the generated API documentation (coming soon).

#### Block Parent

All blocks have a reference to a parent (`Parent`) of type `ContainerBlock?`, which allows for efficient traversal up the abstract syntax tree. The parent will be `null` in the case of the root node (the `MarkdownDocument`).
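
A rough sketch of walking up the tree through `Parent` is shown below. `Ancestors` is a hypothetical helper, not a Markdig API. It follows the parent links until it reaches the root, whose parent is `null`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Markdig;
using Markdig.Syntax;

var document = Markdown.Parse("- an item\n");
ParagraphBlock paragraph = document.Descendants<ParagraphBlock>().First();

// Prints the chain of containers from the innermost one up to the document.
foreach (ContainerBlock ancestor in Ancestors(paragraph))
{
    Console.WriteLine(ancestor.GetType().Name);
}

// Hypothetical helper: yields each parent in turn, ending at the root.
static IEnumerable<ContainerBlock> Ancestors(Block block)
{
    for (var current = block.Parent; current is not null; current = current.Parent)
    {
        yield return current;
    }
}
```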

#### Parser

All blocks have a reference to a parser (`Parser`) of type `BlockParser?` which refers to the instance of the parser which created this block.

#### IsOpen Flag

Blocks have an `IsOpen` boolean flag which is set to `true` while they are being parsed and set to `false` once parsing of the block is complete.

Blocks are created by `BlockParser` objects which are managed by an instance of a `BlockProcessor` object. During the parsing algorithm the `BlockProcessor` maintains a list of all currently open `Block` objects as it steps through the source line by line. The `IsOpen` flag indicates to the `BlockProcessor` that the block should remain open as the next line begins. If the `IsOpen` flag is not directly set by the `BlockParser` on each line, the `BlockProcessor` will consider the `Block` fully parsed and will no longer call its `BlockParser` on it.

#### IsBreakable Flag

Blocks are either breakable or not, as specified by the `IsBreakable` flag. If a block is non-breakable, it indicates to the parser that the close conditions of any parent container do not apply so long as the non-breakable child block is still open.

The only built-in example of this is the `FencedCodeBlock`, which, if existing as the child of a container block of some sort, will prevent that container from being closed before the `FencedCodeBlock` is closed, since any characters inside the `FencedCodeBlock` are considered to be valid code and not the container's close condition.

#### RemoveAfterProcessInlines



## Inline Elements

Inlines in markdown refer to things like embellishments (italics, bold, underline, etc.), links, URLs, inline code, images, etc.

Inline elements may be one of two types:

1. `Inline`, whose parent is always a `ContainerInline`
2. `ContainerInline`, derived from `Inline`, which contains other inlines. `ContainerInline` also has a `ParentBlock` property of type `LeafBlock?`
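
As an illustrative sketch, the direct inline children of a paragraph can be visited through the `FirstChild` and `NextSibling` links of the paragraph's root `ContainerInline`. The sample text is arbitrary, and the output is not shown because it depends on how the parsers split the literal text.

```csharp
using System;
using System.Linq;
using Markdig;
using Markdig.Syntax;
using Markdig.Syntax.Inlines;

var document = Markdown.Parse("Some *emphasized* text and `code`.\n");
var paragraph = document.Descendants<ParagraphBlock>().First();

// LeafBlock.Inline is the root ContainerInline holding the paragraph's inlines.
var root = paragraph.Inline;
if (root is not null)
{
    Console.WriteLine($"Root belongs to: {root.ParentBlock?.GetType().Name}");

    // Visit only the direct children; nested inlines hang off container inlines.
    for (var child = root.FirstChild; child is not null; child = child.NextSibling)
    {
        string kind = child is ContainerInline ? "container inline" : "inline";
        Console.WriteLine($"{child.GetType().Name} ({kind})");
    }
}
```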


**(Is there anything special worth documenting about inlines or types of inlines?)**

## The SourceSpan Struct

If the pipeline was configured with `.UsePreciseSourceLocation()`, all elements in the abstract syntax tree will contain a reference to the location in the original source where they occurred. This is done with the `SourceSpan` type, a custom Markdig `struct` which provides a start and end location.

All objects derived from `MarkdownObject` contain the `Span` property, which is of type `SourceSpan`.
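
A short sketch of reading spans back against the source text follows. It assumes that `Start` and `End` are character offsets into the original string, with `End` inclusive, so that `Length` can be passed straight to `Substring`. The sample text is illustrative.

```csharp
using System;
using Markdig;
using Markdig.Syntax;

string sourceText = "# Title\n\nA paragraph.\n";

var pipeline = new MarkdownPipelineBuilder()
    .UsePreciseSourceLocation()
    .Build();
var document = Markdown.Parse(sourceText, pipeline);

foreach (Block block in document)
{
    SourceSpan span = block.Span;

    // Recover the original text covered by this block from its span.
    string text = sourceText.Substring(span.Start, span.Length);
    Console.WriteLine($"{block.GetType().Name} [{span.Start}..{span.End}]: {text}");
}
```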

158 changes: 158 additions & 0 deletions doc/parsing-extensions.md
@@ -0,0 +1,158 @@
# Extensions and Parsers

Markdig was [implemented in such a way](http://xoofx.com/blog/2016/06/13/implementing-a-markdown-processor-for-dotnet/) as to be extremely pluggable, with even basic behaviors being mutable and extendable.

The basic mechanism for extension of Markdig is the `IMarkdownExtension` interface, which allows any implementing class to be registered with the pipeline builder and thus to directly modify the collections of `BlockParser` and `InlineParser` objects which end up in the pipeline.

This document discusses the `IMarkdownExtension` interface, the `BlockParser` abstract base class, and the `InlineParser` abstract base class, which together are the foundation of extending Markdig's parsing machinery.

## Creating Extensions

Extensions can vary from very simple to very complicated.

A simple extension, for example, might simply find a parser already in the pipeline and modify a setting on it. An example of this is the `SoftlineBreakAsHardlineExtension`, which locates the `LineBreakInlineParser` and modifies a single boolean flag on it.
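
To illustrate the effect of such a setting-only extension, the sketch below renders the same text with and without it. It assumes the builder extension method `UseSoftlineBreakAsHardlineBreak()` is the way this extension is registered. The input string is arbitrary.

```csharp
using System;
using Markdig;

var plain = new MarkdownPipelineBuilder().Build();
var hardBreaks = new MarkdownPipelineBuilder()
    .UseSoftlineBreakAsHardlineBreak()
    .Build();

// A single newline inside a paragraph is a soft line break.
string source = "first line\nsecond line";

// The second rendering should contain a <br /> where the first has a plain newline.
Console.WriteLine(Markdown.ToHtml(source, plain));
Console.WriteLine(Markdown.ToHtml(source, hardBreaks));
```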

A complex extension, on the other hand, might add an entire taxonomy of new `Block` and `Inline` types, as well as several related parsers and renderers, and require being added to the pipeline in a specific order in relation to other extensions which are already configured. The `FootnoteExtension` and `PipeTableExtension` are examples of more complex extensions.

For extensions that don't require order considerations, the implementation of the extension itself is adequate, and the extension can be added to the pipeline with the generic `Use<TExtension>()` method on the pipeline builder. For extensions which do require order considerations, it is best to create an extension method on the `MarkdownPipelineBuilder` to perform the registration. See the following two sections for further information.

### Implementation of an Extension

The [IMarkdownExtension.cs](https://github.com/xoofx/markdig/blob/master/src/Markdig/IMarkdownExtension.cs) interface specifies two methods which must be implemented.

The first, which takes only the pipeline builder as an argument, is called when the `Build()` method on the pipeline builder is invoked, and should set up any modifications to the parsers or parser collections. These parsers will then be used by the main parsing algorithm to process the source text.

```csharp
void Setup(MarkdownPipelineBuilder pipeline);
```

The second, which takes the pipeline itself and a renderer, is used to set up a rendering component in order to convert any special `MarkdownObject` types associated with the extension into an output. This is not relevant for parsing, but is necessary for rendering.

```csharp
void Setup(MarkdownPipeline pipeline, IMarkdownRenderer renderer);
```

The extension can then be registered to the pipeline builder using the `Use<TExtension>()` method. A skeleton example is given below:

```csharp
public class MySpecialBlockParser : BlockParser
{
    // ...
}

public class MyExtension : IMarkdownExtension
{
    // Interface members must be public when implemented implicitly
    public void Setup(MarkdownPipelineBuilder pipeline)
    {
        pipeline.BlockParsers.AddIfNotAlready<MySpecialBlockParser>();
    }

    public void Setup(MarkdownPipeline pipeline, IMarkdownRenderer renderer) { }
}
```

```csharp
var builder = new MarkdownPipelineBuilder()
    .Use<MyExtension>();
```

### Pipeline Builder Extension Methods

For extensions which require specific ordering and/or need to perform multiple operations to register with the builder, it's recommended to create an extension method.

```csharp
public static class MyExtensionMethods
{
    public static MarkdownPipelineBuilder UseMyExtension(this MarkdownPipelineBuilder pipeline)
    {
        // Directly access or modify pipeline.Extensions here, with the ability to
        // search for other extensions, insert before or after, remove other extensions,
        // or modify their settings.
        // ...
        return pipeline;
    }
}
```
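
One possible shape for an order-sensitive registration is sketched below. `MyOrderedExtension` and `UseMyOrderedExtension` are hypothetical names, and the choice of `PipeTableExtension` as the extension to precede is arbitrary. The sketch assumes the `Contains<T>()` and `InsertBefore<T>(item)` helpers on the builder's `Extensions` list behave as their names suggest, with `InsertBefore` returning `false` when no extension of the requested type is present.

```csharp
using Markdig;
using Markdig.Extensions.Tables;
using Markdig.Renderers;

var builder = new MarkdownPipelineBuilder()
    .UsePipeTables()
    .UseMyOrderedExtension();
var pipeline = builder.Build();

// Hypothetical extension with no behavior of its own.
public class MyOrderedExtension : IMarkdownExtension
{
    public void Setup(MarkdownPipelineBuilder pipeline) { }

    public void Setup(MarkdownPipeline pipeline, IMarkdownRenderer renderer) { }
}

public static class MyOrderedExtensionMethods
{
    public static MarkdownPipelineBuilder UseMyOrderedExtension(this MarkdownPipelineBuilder pipeline)
    {
        // Register at most once.
        if (!pipeline.Extensions.Contains<MyOrderedExtension>())
        {
            var extension = new MyOrderedExtension();

            // Place ahead of the pipe table extension if it is registered,
            // otherwise append to the end of the list.
            if (!pipeline.Extensions.InsertBefore<PipeTableExtension>(extension))
            {
                pipeline.Extensions.Add(extension);
            }
        }
        return pipeline;
    }
}
```

Returning the builder keeps the method chainable alongside the built-in `Use...` methods.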

### Simple Extension Example

The following is an example of a simple extension which does not add any new parsers, but instead creates a new, horrific emphasis tag, marked by triple percentage signs. This example is based on [CitationExtension.cs](https://github.com/xoofx/markdig/blob/master/src/Markdig/Extensions/Citations/CitationExtension.cs).

```csharp
/// <summary>
/// An extension which applies to text of the form %%%text%%%
/// </summary>
public class BlinkExtension : IMarkdownExtension
{
    // This setup method will be run when the pipeline builder's `Build()` method is invoked. As this
    // is a simple, self-contained extension we won't be adding anything new, but rather finding an
    // existing parser already in the pipeline and adding some settings to it.
    public void Setup(MarkdownPipelineBuilder pipeline)
    {
        // We check the pipeline builder's inline parser collection and see if we can find a parser
        // registered of the type EmphasisInlineParser. This is the parser which nominally handles
        // bold and italic emphasis, but we know from its documentation that it is a general parser
        // that can have new characters added to it.
        var parser = pipeline.InlineParsers.FindExact<EmphasisInlineParser>();

        // If we find the parser and it doesn't already have the % character registered, we add
        // a descriptor for 3 consecutive % signs. This is specific to the EmphasisInlineParser and
        // is just used here as an example.
        if (parser is not null && !parser.HasEmphasisChar('%'))
        {
            parser.EmphasisDescriptors.Add(new EmphasisDescriptor('%', 3, 3, false));
        }
    }

    // This method is called by the pipeline before rendering, which is a separate operation from
    // parsing. This implementation is just here for the purpose of the example, in which we
    // daisy-chain a delegate specific to the EmphasisInlineRenderer to cause an unconscionable tag
    // to be inserted into the HTML output wherever a %%% annotated span was placed in the source.
    public void Setup(MarkdownPipeline pipeline, IMarkdownRenderer renderer)
    {
        if (renderer is not HtmlRenderer) return;

        var emphasisRenderer = renderer.ObjectRenderers.FindExact<EmphasisInlineRenderer>();
        if (emphasisRenderer is null) return;

        var previousTag = emphasisRenderer.GetTag;
        emphasisRenderer.GetTag = inline =>
            (inline.DelimiterCount == 3 && inline.DelimiterChar == '%' ? "blink" : null)
            ?? previousTag(inline);
    }
}
```

## Parsers

Markdig has two types of parsers, both of which derive from `ParserBase<TProcessor>`.

Block parsers, derived from `BlockParser`, identify block elements from lines in the source text and push them onto the abstract syntax tree. Inline parsers, derived from `InlineParser`, identify inline elements from `LeafBlock` elements and push them into an attached container.

Both inline and block parsers are regex-free, and instead work on finding opening characters and then making fast read-only views into the source text.

### Block Parser

**(The contents of this section I am very unsure of, this is from my reading of the code but I could use some guidance here)**

**(Does `CanInterrupt` specifically refer to interrupting a paragraph block?)**

In order to be added to the parsing pipeline, all block parsers must be derived from `BlockParser`.

Internally, the main parsing algorithm will be stepping through the source text, using the `HasOpeningCharacter(char c)` method of the block parser collection to pre-identify parsers which *could* be opening a block at a given position in the text based on the active character. Thus any derived implementation needs to set the value of the `char[]? OpeningCharacters` property with the initial characters that might begin the block.

If a parser can potentially open a block at a place in the source text it should expect to have the `TryOpen(BlockProcessor processor)` method called. This is an abstract method that must be implemented by any derived class. The `BlockProcessor` argument is a reference to an object which stores the current state of parsing and the position in the source.
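
The bare shape of a block parser can be sketched as follows. This is deliberately a parser that never opens a block, so it shows only where the opening characters and `TryOpen` fit and says nothing about how a real block should be created or closed. `DecliningBlockParser` is a hypothetical name, `'@'` is an arbitrary choice, and the property holding the opening characters is assumed to be named `OpeningCharacters`.

```csharp
using Markdig;
using Markdig.Parsers;
using Markdig.Syntax;

var builder = new MarkdownPipelineBuilder();
builder.BlockParsers.AddIfNotAlready<DecliningBlockParser>();

// The line starts with '@', so the sketch parser is offered the line and declines it.
var document = Markdown.Parse("@ not handled by the sketch\n", builder.Build());

// Hypothetical parser: registers an opening character but never opens a block.
public sealed class DecliningBlockParser : BlockParser
{
    public DecliningBlockParser()
    {
        // Characters that may begin a block handled by this parser.
        OpeningCharacters = new[] { '@' };
    }

    public override BlockState TryOpen(BlockProcessor processor)
    {
        // Returning None tells the processor that this parser did not open a block,
        // leaving the line to be handled by the other parsers in the pipeline.
        return BlockState.None;
    }
}
```

Because the parser declines every line, the sample text is expected to end up as an ordinary paragraph.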

**(What are the rules concerning how the `BlockState` return type should work for `TryOpen`? I see examples returning `None`, `Continue`, `BreakDiscard`, `ContinueDiscard`. How does the return value change the algorithm behavior?)**

**(Should a new block always be pushed into `processor.NewBlocks` in the `TryOpen` method?)**

As the main parsing algorithm moves forward, it will then call `TryContinue(...)` on blocks that were opened in `TryOpen(...)`.

**(Is this where/how you close a block? Is there anything that needs to be done to perform that beyond `block.UpdateSpanEnd` and returning `BlockState.Break`?)**


### Inline Parser
