Merge pull request #549 from thephpleague/revise-inline-parsing
Revise inline parsing
colinodell committed Sep 26, 2020
2 parents 76d1699 + 00649fb commit 995567a
Showing 40 changed files with 721 additions and 628 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -22,17 +22,20 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- `BlockStartParserInterface`
- `ChildNodeRendererInterface`
- `CursorState`
- `DelimiterParser`
- `DocumentBlockParser`
- `DocumentRenderedEvent`
- `HtmlRendererInterface`
- `InlineParserEngineInterface`
- `InlineParserMatch`
- `MarkdownParserState`
- `MarkdownParserStateInterface`
- `ReferenceableInterface`
- `RenderedContent`
- `RenderedContentInterface`
- Added several new methods:
- `Environment::setEventDispatcher()`
- `EnvironmentInterface::getInlineParsers()`
- `FencedCode::setInfo()`
- `Heading::setLevel()`
- `HtmlRenderer::renderDocument()`
@@ -58,10 +61,18 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- `ConfigurableEnvironmentInterface::addBlockParser()` is now `ConfigurableEnvironmentInterface::addBlockParserFactory()`
- `ReferenceParser` was re-implemented and works completely differently than before
- The paragraph parser no longer needs to be added manually to the environment
- Implemented a new approach to inline parsing where parsers can now specify longer strings or regular expressions they want to parse (instead of just single characters):
    - `InlineParserInterface::getCharacters()` is now `getMatchDefinition()` and returns an instance of `InlineParserMatch`
    - `InlineParserInterface::parse()` has a new parameter containing the pre-matched text
    - `InlineParserContext::__construct()` now requires the contents to be provided as a `Cursor` instead of a `string`
- Implemented delimiter parsing as a special type of inline parser (via the new `DelimiterParser` class)
- Changed block and inline rendering to use common methods and interfaces
    - `BlockRendererInterface` and `InlineRendererInterface` were replaced by `NodeRendererInterface` with slightly different parameters. All core renderers now implement this interface.
    - `ConfigurableEnvironmentInterface::addBlockRenderer()` and `addInlineRenderer()` are now just `addRenderer()`
    - `EnvironmentInterface::getBlockRenderersForClass()` and `getInlineRenderersForClass()` are now just `getRenderersForClass()`
- Re-implemented the GFM Autolink extension using the new inline parser approach instead of document processors
    - `EmailAutolinkProcessor` is now `EmailAutolinkParser`
    - `UrlAutolinkProcessor` is now `UrlAutolinkParser`
- Combined separate classes/interfaces into one:
    - `DisallowedRawHtmlRenderer` replaces `DisallowedRawHtmlBlockRenderer` and `DisallowedRawHtmlInlineRenderer`
    - `NodeRendererInterface` replaces `BlockRendererInterface` and `InlineRendererInterface`
@@ -106,11 +117,14 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- Footnote event listeners now have numbered priorities (but still execute in the same order)
- Footnotes must now be separated from previous content by a blank line
- The line numbers (keys) returned via `MarkdownInput::getLines()` now start at 1 instead of 0
- `DelimiterProcessorCollectionInterface` now extends `Countable`
- `RegexHelper::PARTIAL_` constants must always be used in case-insensitive contexts

### Fixed

- Fixed parsing of footnotes without content
- Fixed rendering of orphaned footnotes and footnote refs
- Fixed some URL autolinks breaking too early (#492)

### Removed

@@ -159,6 +173,8 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- `AbstractBlock::finalize()`
- `ConfigurableEnvironmentInterface::addBlockParser()`
- `Delimiter::setCanClose()`
- `EnvironmentInterface::getInlineParsersForCharacter()`
- `EnvironmentInterface::getInlineParserCharacterRegex()`
- `HtmlRenderer::renderBlock()`
- `HtmlRenderer::renderBlocks()`
- `HtmlRenderer::renderInline()`
59 changes: 36 additions & 23 deletions docs/2.0/customization/inline-parsing.md
@@ -29,28 +29,43 @@ If your syntax looks like that, consider using a [delimiter processor](/2.0/cust

Inline parsers should implement `InlineParserInterface` and the following two methods:

### getCharacters()
### getMatchDefinition()

This method should return an array of single characters which the inline parser engine should stop on. When it does find a match in the current line the `parse()` method below may be called.
This method should return an instance of `InlineParserMatch` which defines the text the parser is looking for. Examples of this might be something like:

```php
use League\CommonMark\Parser\Inline\InlineParserMatch;

InlineParserMatch::string('@'); // Match any '@' characters found in the text
InlineParserMatch::string('foo'); // Match the text 'foo' (case insensitive)

InlineParserMatch::oneOf('@', '!'); // Match either character
InlineParserMatch::oneOf('http://', 'https://'); // Match either string

InlineParserMatch::regex('\d+'); // Match the regular expression (omit the regex delimiters and any flags)
```

Once a match is found, the `parse()` method below may be called.
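
To make "omit the regex delimiters and any flags" concrete, here is a minimal standalone sketch in plain PHP. It does not use the library: `sketchString()`, `sketchOneOf()` and `sketchFirstMatch()` are hypothetical helpers, and the `/` delimiter and `i` flag added by the wrapper are assumptions chosen for illustration, not a statement of what `InlineParserMatch` does internally.

```php
<?php

declare(strict_types=1);

// Hypothetical helpers: they mimic, in plain PHP, what a match definition
// expresses. They are NOT the library's InlineParserMatch implementation.

/** Build a bare pattern fragment that matches one literal string. */
function sketchString(string $text): string
{
    return \preg_quote($text, '/');
}

/** Build a bare pattern fragment that matches any one of several literal strings. */
function sketchOneOf(string ...$texts): string
{
    $quoted = \array_map(static function (string $text): string {
        return \preg_quote($text, '/');
    }, $texts);

    return \implode('|', $quoted);
}

/**
 * Wrap a bare fragment in delimiters and flags (assumed here to be "/" and "i"),
 * then return the first match and its byte offset, or null if nothing matches.
 * A raw fragment must not contain an unescaped "/" in this sketch.
 *
 * @return array{0: string, 1: int}|null
 */
function sketchFirstMatch(string $fragment, string $line): ?array
{
    if (\preg_match('/' . $fragment . '/i', $line, $m, \PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    return [$m[0][0], $m[0][1]];
}

// The caller supplies only the fragment, never the delimiters or flags:
$digits  = sketchFirstMatch('\d+', 'abc 123 def');
$literal = sketchFirstMatch(sketchString('foo'), 'a FOO b');
$scheme  = sketchFirstMatch(sketchOneOf('http://', 'https://'), 'see https://x');
```

The point of the design is that the caller never sees delimiters or flags, which is what would let an engine combine fragments from several parsers and choose the flags itself.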

### parse()

This method will be called if both conditions are met:

1. The engine has stopped at a matching character; and,
2. No other inline parsers have successfully parsed the character
1. The engine has found a matching string in the current line; and,
2. No other inline parsers with a [higher priority](/2.0/customization/environment/#addinlineparser) have successfully parsed the text at this point in the line

#### Parameters

* `InlineParserContext $inlineContext` - Encapsulates the current state of the inline parser, including the [`Cursor`](/2.0/customization/cursor/) used to parse the current line.
* `string $match` - Contains the text that matches the start pattern from `getMatchDefinition()`
* `InlineParserContext $inlineContext` - Encapsulates the current state of the inline parser, including the [`Cursor`](/2.0/customization/cursor/) used to parse the current line. (Note that the cursor will be positioned **before** the matching text, so you must advance it yourself if you determine it's a valid match)

#### Return value

`parse()` should return `false` if it's unable to handle the current line/character for any reason. (The [`Cursor`](/2.0/customization/cursor/) state should be restored before returning false if modified). Other parsers will then have a chance to try parsing the line. If all registered parsers return false, the character will be added as plain text.
`parse()` should return `false` if it's unable to handle the text at the current position for any reason. Other parsers will then have a chance to try parsing that text. If all registered parsers return false, the text will be added as plain text.

Returning `true` tells the engine that you've successfully parsed the character (and related ones after it). It is your responsibility to:

1. Advance the cursor to the end of the parsed text
1. Advance the cursor to the end of the parsed/matched text
2. Add the parsed inline to the container (`$inlineContext->getContainer()->appendChild(...)`)
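
The return-value contract above can be sketched without the library. Everything here is hypothetical and exists only for illustration: `SketchCursor` is a tiny stand-in for the real `Cursor`, the `$container` array stands in for the AST container, and the hashtag syntax is an invented example. The sketch also assumes single-byte (ASCII) input.

```php
<?php

declare(strict_types=1);

/** Hypothetical stand-in for the library's Cursor; only what this sketch needs. */
final class SketchCursor
{
    /** @var string */
    private $line;

    /** @var int */
    private $position;

    public function __construct(string $line, int $position = 0)
    {
        $this->line     = $line;
        $this->position = $position;
    }

    public function getPosition(): int
    {
        return $this->position;
    }

    public function getRemainder(): string
    {
        return (string) \substr($this->line, $this->position);
    }

    public function advanceBy(int $characters): void
    {
        $this->position = \min(\strlen($this->line), $this->position + $characters);
    }
}

/**
 * Mimics the parse() contract: the cursor starts BEFORE the matched text.
 * On failure it is left untouched; on success it is advanced past everything
 * that was consumed and a node is appended to the container.
 */
function sketchParseHashtag(string $match, SketchCursor $cursor, array &$container): bool
{
    // Look at what follows the matched "#" without moving the cursor
    $rest = (string) \substr($cursor->getRemainder(), \strlen($match));
    if (\preg_match('/^[A-Za-z0-9_]+/', $rest, $m) !== 1) {
        return false; // nothing was modified, so there is nothing to restore
    }

    // 1. Advance the cursor to the end of the parsed/matched text
    $cursor->advanceBy(\strlen($match) + \strlen($m[0]));

    // 2. Add the parsed inline to the container
    $container[] = ['type' => 'hashtag', 'tag' => $m[0]];

    return true;
}
```

Validating before advancing means the failure path never has to restore any cursor state.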

## Inline Parser Examples
@@ -65,15 +80,17 @@ Let's say you wanted to autolink Twitter handles without using the link syntax.
use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\Node\Inline\Link;
use League\CommonMark\Parser\Inline\InlineParserInterface;
use League\CommonMark\Parser\Inline\InlineParserMatch;
use League\CommonMark\Parser\InlineParserContext;

class TwitterHandleParser implements InlineParserInterface
{
public function getCharacters(): array
public function getMatchDefinition(): InlineParserMatch
{
return ['@'];
// Note that you could match the entire regex here instead of in parse() if you wish
return InlineParserMatch::string('@');
}
public function parse(InlineParserContext $inlineContext): bool
public function parse(string $match, InlineParserContext $inlineContext): bool
{
$cursor = $inlineContext->getCursor();
// The @ symbol must not have any other characters immediately prior
@@ -113,33 +130,27 @@ Let's say you want to automatically convert smilies (or "frownies") to emoticon
use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\Node\Inline\Image;
use League\CommonMark\Parser\Inline\InlineParserInterface;
use League\CommonMark\Parser\Inline\InlineParserMatch;
use League\CommonMark\Parser\InlineParserContext;

class SmilieParser implements InlineParserInterface
{
public function getCharacters(): array
public function getMatchDefinition(): InlineParserMatch
{
return [':'];
return InlineParserMatch::oneOf(':)', ':(');
}

public function parse(InlineParserContext $inlineContext): bool
public function parse(string $match, InlineParserContext $inlineContext): bool
{
$cursor = $inlineContext->getCursor();

// The next character must be a paren; if not, then bail
// We use peek() to quickly check without affecting the cursor
$nextChar = $cursor->peek();
if ($nextChar !== '(' && $nextChar !== ')') {
return false;
}

// Advance the cursor past the 2 matched chars since we're able to parse them successfully
$cursor->advanceBy(2);

// Add the corresponding image
if ($nextChar === ')') {
if ($match === ':)') {
$inlineContext->getContainer()->appendChild(new Image('/img/happy.png'));
} elseif ($nextChar === '(') {
} elseif ($match === ':(') {
$inlineContext->getContainer()->appendChild(new Image('/img/sad.png'));
}

@@ -153,6 +164,8 @@ $environment->addInlineParser(new SmilieParser());

## Tips

* For best performance, `return false` **as soon as possible**.
* For best performance:
  * Avoid using overly-complex regular expressions in `getMatchDefinition()` - use the simplest regex you can and have `parse()` do the heavier validation
  * Have your `parse()` method `return false` **as soon as possible**.
* You can `peek()` without modifying the cursor state. This makes it useful for validating nearby characters, as it's quick and you can bail without needing to restore state.
* You can look at (and modify) any part of the AST if needed (via `$inlineContext->getContainer()`).
105 changes: 105 additions & 0 deletions src/Delimiter/DelimiterParser.php
@@ -0,0 +1,105 @@
<?php

declare(strict_types=1);

namespace League\CommonMark\Delimiter;

use League\CommonMark\Delimiter\Processor\DelimiterProcessorCollection;
use League\CommonMark\Delimiter\Processor\DelimiterProcessorInterface;
use League\CommonMark\Node\Inline\Text;
use League\CommonMark\Parser\Inline\InlineParserInterface;
use League\CommonMark\Parser\Inline\InlineParserMatch;
use League\CommonMark\Parser\InlineParserContext;
use League\CommonMark\Util\RegexHelper;

/**
 * Delimiter parsing is implemented as an Inline Parser with the lowest-possible priority
 *
 * @internal
 */
final class DelimiterParser implements InlineParserInterface
{
    /** @var DelimiterProcessorCollection */
    private $collection;

    public function __construct(DelimiterProcessorCollection $collection)
    {
        $this->collection = $collection;
    }

    public function getMatchDefinition(): InlineParserMatch
    {
        return InlineParserMatch::oneOf(...$this->collection->getDelimiterCharacters());
    }

    public function parse(string $match, InlineParserContext $inlineContext): bool
    {
        $character = $match;
        $numDelims = 0;
        $cursor = $inlineContext->getCursor();
        $processor = $this->collection->getDelimiterProcessor($character);

        if ($processor === null) {
            throw new \LogicException('Delimiter processor should never be null here');
        }

        $charBefore = $cursor->peek(-1);
        if ($charBefore === null) {
            $charBefore = "\n";
        }

        while ($cursor->peek($numDelims) === $character) {
            ++$numDelims;
        }

        if ($numDelims < $processor->getMinLength()) {
            return false;
        }

        $cursor->advanceBy($numDelims);

        $charAfter = $cursor->getCharacter();
        if ($charAfter === null) {
            $charAfter = "\n";
        }

        [$canOpen, $canClose] = self::determineCanOpenOrClose($charBefore, $charAfter, $character, $processor);

        $node = new Text(\str_repeat($character, $numDelims), [
            'delim' => true,
        ]);
        $inlineContext->getContainer()->appendChild($node);

        // Add an entry to the stack for this opener
        if ($canOpen || $canClose) {
            $delimiter = new Delimiter($character, $numDelims, $node, $canOpen, $canClose);
            $inlineContext->getDelimiterStack()->push($delimiter);
        }

        return true;
    }

    /**
     * @return bool[]
     */
    private static function determineCanOpenOrClose(string $charBefore, string $charAfter, string $character, DelimiterProcessorInterface $delimiterProcessor): array
    {
        $afterIsWhitespace = \preg_match(RegexHelper::REGEX_UNICODE_WHITESPACE_CHAR, $charAfter);
        $afterIsPunctuation = \preg_match(RegexHelper::REGEX_PUNCTUATION, $charAfter);
        $beforeIsWhitespace = \preg_match(RegexHelper::REGEX_UNICODE_WHITESPACE_CHAR, $charBefore);
        $beforeIsPunctuation = \preg_match(RegexHelper::REGEX_PUNCTUATION, $charBefore);

        $leftFlanking = ! $afterIsWhitespace && (! $afterIsPunctuation || $beforeIsWhitespace || $beforeIsPunctuation);
        $rightFlanking = ! $beforeIsWhitespace && (! $beforeIsPunctuation || $afterIsWhitespace || $afterIsPunctuation);

        if ($character === '_') {
            $canOpen = $leftFlanking && (! $rightFlanking || $beforeIsPunctuation);
            $canClose = $rightFlanking && (! $leftFlanking || $afterIsPunctuation);
        } else {
            $canOpen = $leftFlanking && $character === $delimiterProcessor->getOpeningCharacter();
            $canClose = $rightFlanking && $character === $delimiterProcessor->getClosingCharacter();
        }

        return [$canOpen, $canClose];
    }
}
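
The flanking logic in `determineCanOpenOrClose()` can be tried in isolation with the standalone sketch below. It is an approximation under stated assumptions: the two character-class regexes are simplified stand-ins rather than the `RegexHelper` constants, the function names are hypothetical, and the delimiter processor is left out by assuming the opening and closing characters are both `$character` (as for `*`).

```php
<?php

declare(strict_types=1);

/** Simplified stand-in (assumption) for the library's Unicode whitespace check. */
function sketchIsWhitespace(string $char): bool
{
    return \preg_match('/^(?:\p{Z}|\s)/u', $char) === 1;
}

/** Simplified stand-in (assumption) for the library's punctuation check. */
function sketchIsPunctuation(string $char): bool
{
    return \preg_match('/^[\p{P}\p{S}]/u', $char) === 1;
}

/**
 * Same boolean structure as determineCanOpenOrClose(), minus the processor:
 * the opening and closing characters are assumed to equal $character.
 *
 * @return bool[] [canOpen, canClose]
 */
function sketchCanOpenOrClose(string $charBefore, string $charAfter, string $character): array
{
    $afterIsWhitespace   = sketchIsWhitespace($charAfter);
    $afterIsPunctuation  = sketchIsPunctuation($charAfter);
    $beforeIsWhitespace  = sketchIsWhitespace($charBefore);
    $beforeIsPunctuation = sketchIsPunctuation($charBefore);

    $leftFlanking  = ! $afterIsWhitespace && (! $afterIsPunctuation || $beforeIsWhitespace || $beforeIsPunctuation);
    $rightFlanking = ! $beforeIsWhitespace && (! $beforeIsPunctuation || $afterIsWhitespace || $afterIsPunctuation);

    if ($character === '_') {
        // Underscores are stricter, which rules out intraword emphasis
        return [
            $leftFlanking && (! $rightFlanking || $beforeIsPunctuation),
            $rightFlanking && (! $leftFlanking || $afterIsPunctuation),
        ];
    }

    return [$leftFlanking, $rightFlanking];
}

// A run between two letters: "*" may open or close, "_" may do neither
$star       = sketchCanOpenOrClose('a', 'b', '*');
$underscore = sketchCanOpenOrClose('a', 'b', '_');
```

As in `parse()`, a missing neighbour at the start or end of the line can be passed as `"\n"`, so the line boundary is treated as whitespace.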
5 changes: 5 additions & 0 deletions src/Delimiter/Processor/DelimiterProcessorCollection.php
@@ -79,4 +79,9 @@ private function addStaggeredDelimiterProcessorForChar(string $opening, Delimite
        $s->add($new);
        $this->processorsByChar[$opening] = $s;
    }

    public function count(): int
    {
        return \count($this->processorsByChar);
    }
}
@@ -19,7 +19,7 @@

namespace League\CommonMark\Delimiter\Processor;

interface DelimiterProcessorCollectionInterface
interface DelimiterProcessorCollectionInterface extends \Countable
{
/**
* Add the given delim processor to the collection