Merge pull request #549 from thephpleague/revise-inline-parsing

Revise inline parsing
thephpleague · Sep 26, 2020 · 995567a
2 parents 76d1699 + 00649fb
commit 995567a

Showing 40 changed files with 721 additions and 628 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -22,17 +22,20 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
    - `BlockStartParserInterface`
    - `ChildNodeRendererInterface`
    - `CursorState`
+   - `DelimiterParser`
    - `DocumentBlockParser`
    - `DocumentRenderedEvent`
    - `HtmlRendererInterface`
    - `InlineParserEngineInterface`
+   - `InlineParserMatch`
    - `MarkdownParserState`
    - `MarkdownParserStateInterface`
    - `ReferenceableInterface`
    - `RenderedContent`
    - `RenderedContentInterface`
  - Added several new methods:
    - `Environment::setEventDispatcher()`
+   - `EnvironmentInterface::getInlineParsers()`
    - `FencedCode::setInfo()`
    - `Heading::setLevel()`
    - `HtmlRenderer::renderDocument()`
@@ -58,10 +61,18 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
    - `ConfigurableEnvironmentInterface::addBlockParser()` is now `ConfigurableEnvironmentInterface::addBlockParserFactory()`
    - `ReferenceParser` was re-implemented and works completely differently than before
    - The paragraph parser no longer needs to be added manually to the environment
+ - Implemented a new approach to inline parsing where parsers can now specify longer strings or regular expressions they want to parse (instead of just single characters):
+   - `InlineParserInterface::getCharacters()` is now `getMatchDefinition()` and returns an instance of `InlineParserMatch`
+   - `InlineParserInterface::parse()` has a new parameter containing the pre-matched text
+   - `InlineParserContext::__construct()` now requires the contents to be provided as a `Cursor` instead of a `string`
+ - Implemented delimiter parsing as a special type of inline parser (via the new `DelimiterParser` class)
  - Changed block and inline rendering to use common methods and interfaces
    - `BlockRendererInterface` and `InlineRendererInterface` were replaced by `NodeRendererInterface` with slightly different parameters. All core renderers now implement this interface.
    - `ConfigurableEnvironmentInterface::addBlockRenderer()` and `addInlineRenderer()` are now just `addRenderer()`
    - `EnvironmentInterface::getBlockRenderersForClass()` and `getInlineRenderersForClass()` are now just `getRenderersForClass()`
+ - Re-implemented the GFM Autolink extension using the new inline parser approach instead of document processors
+   - `EmailAutolinkProcessor` is now `EmailAutolinkParser`
+   - `UrlAutolinkProcessor` is now `UrlAutolinkParser`
  - Combined separate classes/interfaces into one:
    - `DisallowedRawHtmlRenderer` replaces `DisallowedRawHtmlBlockRenderer` and `DisallowedRawHtmlInlineRenderer`
    - `NodeRendererInterface` replaces `BlockRendererInterface` and `InlineRendererInterface`
@@ -106,11 +117,14 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
    - Footnote event listeners now have numbered priorities (but still execute in the same order)
    - Footnotes must now be separated from previous content by a blank line
  - The line numbers (keys) returned via `MarkdownInput::getLines()` now start at 1 instead of 0
+ - `DelimiterProcessorCollectionInterface` now extends `Countable`
+ - `RegexHelper::PARTIAL_` constants must always be used in case-insensitive contexts
 
 ### Fixed
 
  - Fixed parsing of footnotes without content
  - Fixed rendering of orphaned footnotes and footnote refs
+ - Fixed some URL autolinks breaking too early (#492)
 
 ### Removed
 
@@ -159,6 +173,8 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
    - `AbstractBlock::finalize()`
    - `ConfigurableEnvironmentInterface::addBlockParser()`
    - `Delimiter::setCanClose()`
+   - `EnvironmentInterface::getInlineParsersForCharacter()`
+   - `EnvironmentInterface::getInlineParserCharacterRegex()`
    - `HtmlRenderer::renderBlock()`
    - `HtmlRenderer::renderBlocks()`
    - `HtmlRenderer::renderInline()`

diff --git a/docs/2.0/customization/inline-parsing.md b/docs/2.0/customization/inline-parsing.md
@@ -29,28 +29,43 @@ If your syntax looks like that, consider using a [delimiter processor](/2.0/cust
 
 Inline parsers should implement `InlineParserInterface` and the following two methods:
 
-### getCharacters()
+### getMatchDefinition()
 
-This method should return an array of single characters which the inline parser engine should stop on.  When it does find a match in the current line the `parse()` method below may be called.
+This method should return an instance of `InlineParserMatch` which defines the text the parser is looking for.  For example:
+
+```php
+use League\CommonMark\Parser\Inline\InlineParserMatch;
+
+InlineParserMatch::string('@');                  // Match any '@' characters found in the text
+InlineParserMatch::string('foo');                // Match the text 'foo' (case insensitive)
+
+InlineParserMatch::oneOf('@', '!');              // Match either character
+InlineParserMatch::oneOf('http://', 'https://'); // Match either string
+
+InlineParserMatch::regex('\d+');                 // Match the regular expression (omit the regex delimiters and any flags)
+```
+
+Once a match is found, the `parse()` method below may be called.
 
 ### parse()
 
 This method will be called if both conditions are met:
 
-1. The engine has stopped at a matching character; and,
-2. No other inline parsers have successfully parsed the character
+1. The engine has found a matching string in the current line; and,
+2. No other inline parsers with a [higher priority](/2.0/customization/environment/#addinlineparser) have successfully parsed the text at this point in the line
 
 #### Parameters
 
-* `InlineParserContext $inlineContext` - Encapsulates the current state of the inline parser, including the [`Cursor`](/2.0/customization/cursor/) used to parse the current line.
+* `string $match` - Contains the text that matches the start pattern from `getMatchDefinition()`
+* `InlineParserContext $inlineContext` - Encapsulates the current state of the inline parser, including the [`Cursor`](/2.0/customization/cursor/) used to parse the current line.  (Note that the cursor will be positioned **before** the matching text, so you must advance it yourself if you determine it's a valid match)
 
 #### Return value
 
-`parse()` should return `false` if it's unable to handle the current line/character for any reason.  (The [`Cursor`](/2.0/customization/cursor/) state should be restored before returning false if modified). Other parsers will then have a chance to try parsing the line.  If all registered parsers return false, the character will be added as plain text.
+`parse()` should return `false` if it's unable to handle the text at the current position for any reason.  Other parsers will then have a chance to try parsing that text.  If all registered parsers return false, the text will be added as plain text.
 
 Returning `true` tells the engine that you've successfully parsed the character (and related ones after it).  It is your responsibility to:
 
-1. Advance the cursor to the end of the parsed text
+1. Advance the cursor to the end of the parsed/matched text
 2. Add the parsed inline to the container (`$inlineContext->getContainer()->appendChild(...)`)
 
 ## Inline Parser Examples
@@ -65,15 +80,17 @@ Let's say you wanted to autolink Twitter handles without using the link syntax.
 use League\CommonMark\Environment\Environment;
 use League\CommonMark\Extension\CommonMark\Node\Inline\Link;
 use League\CommonMark\Parser\Inline\InlineParserInterface;
+use League\CommonMark\Parser\Inline\InlineParserMatch;
 use League\CommonMark\Parser\InlineParserContext;
 
 class TwitterHandleParser implements InlineParserInterface
 {
-    public function getCharacters(): array
+    public function getMatchDefinition(): InlineParserMatch
     {
-        return ['@'];
+        // Note that you could match the entire regex here instead of in parse() if you wish
+        return InlineParserMatch::string('@');
     }
-    public function parse(InlineParserContext $inlineContext): bool
+    public function parse(string $match, InlineParserContext $inlineContext): bool
     {
         $cursor = $inlineContext->getCursor();
         // The @ symbol must not have any other characters immediately prior
@@ -113,33 +130,27 @@ Let's say you want to automatically convert smilies (or "frownies") to emoticon
 use League\CommonMark\Environment\Environment;
 use League\CommonMark\Extension\CommonMark\Node\Inline\Image;
 use League\CommonMark\Parser\Inline\InlineParserInterface;
+use League\CommonMark\Parser\Inline\InlineParserMatch;
 use League\CommonMark\Parser\InlineParserContext;
 
 class SmilieParser implements InlineParserInterface
 {
-    public function getCharacters(): array
+    public function getMatchDefinition(): InlineParserMatch
     {
-        return [':'];
+        return InlineParserMatch::oneOf(':)', ':(');
     }
 
-    public function parse(InlineParserContext $inlineContext): bool
+    public function parse(string $match, InlineParserContext $inlineContext): bool
     {
         $cursor = $inlineContext->getCursor();
 
-        // The next character must be a paren; if not, then bail
-        // We use peek() to quickly check without affecting the cursor
-        $nextChar = $cursor->peek();
-        if ($nextChar !== '(' && $nextChar !== ')') {
-            return false;
-        }
-
         // Advance the cursor past the 2 matched chars since we're able to parse them successfully
         $cursor->advanceBy(2);
 
         // Add the corresponding image
-        if ($nextChar === ')') {
+        if ($match === ':)') {
             $inlineContext->getContainer()->appendChild(new Image('/img/happy.png'));
-        } elseif ($nextChar === '(') {
+        } elseif ($match === ':(') {
             $inlineContext->getContainer()->appendChild(new Image('/img/sad.png'));
         }
 
@@ -153,6 +164,8 @@ $environment->addInlineParser(new SmilieParserParser());
 
 ## Tips
 
-* For best performance, `return false` **as soon as possible**.
+* For best performance:
+  * Avoid using overly-complex regular expressions in `getMatchDefinition()` - use the simplest regex you can and have `parse()` do the heavier validation
+  * Have your `parse()` method `return false` **as soon as possible**.
 * You can `peek()` without modifying the cursor state. This makes it useful for validating nearby characters, as it's quick and you can bail without needing to restore state.
 * You can look at (and modify) any part of the AST if needed (via `$inlineContext->getContainer()`).
diff --git a/src/Delimiter/DelimiterParser.php b/src/Delimiter/DelimiterParser.php
@@ -0,0 +1,105 @@
+<?php
+
+declare(strict_types=1);
+
+namespace League\CommonMark\Delimiter;
+
+use League\CommonMark\Delimiter\Processor\DelimiterProcessorCollection;
+use League\CommonMark\Delimiter\Processor\DelimiterProcessorInterface;
+use League\CommonMark\Node\Inline\Text;
+use League\CommonMark\Parser\Inline\InlineParserInterface;
+use League\CommonMark\Parser\Inline\InlineParserMatch;
+use League\CommonMark\Parser\InlineParserContext;
+use League\CommonMark\Util\RegexHelper;
+
+/**
+ * Delimiter parsing is implemented as an Inline Parser with the lowest-possible priority
+ *
+ * @internal
+ */
+final class DelimiterParser implements InlineParserInterface
+{
+    /** @var DelimiterProcessorCollection */
+    private $collection;
+
+    public function __construct(DelimiterProcessorCollection $collection)
+    {
+        $this->collection = $collection;
+    }
+
+    public function getMatchDefinition(): InlineParserMatch
+    {
+        return InlineParserMatch::oneOf(...$this->collection->getDelimiterCharacters());
+    }
+
+    public function parse(string $match, InlineParserContext $inlineContext): bool
+    {
+        $character = $match;
+        $numDelims = 0;
+        $cursor    = $inlineContext->getCursor();
+        $processor = $this->collection->getDelimiterProcessor($character);
+
+        if ($processor === null) {
+            throw new \LogicException('Delimiter processor should never be null here');
+        }
+
+        $charBefore = $cursor->peek(-1);
+        if ($charBefore === null) {
+            $charBefore = "\n";
+        }
+
+        while ($cursor->peek($numDelims) === $character) {
+            ++$numDelims;
+        }
+
+        if ($numDelims < $processor->getMinLength()) {
+            return false;
+        }
+
+        $cursor->advanceBy($numDelims);
+
+        $charAfter = $cursor->getCharacter();
+        if ($charAfter === null) {
+            $charAfter = "\n";
+        }
+
+        [$canOpen, $canClose] = self::determineCanOpenOrClose($charBefore, $charAfter, $character, $processor);
+
+        $node = new Text(\str_repeat($character, $numDelims), [
+            'delim' => true,
+        ]);
+        $inlineContext->getContainer()->appendChild($node);
+
+        // Add an entry to the delimiter stack if this run can act as an opener or closer
+        if ($canOpen || $canClose) {
+            $delimiter = new Delimiter($character, $numDelims, $node, $canOpen, $canClose);
+            $inlineContext->getDelimiterStack()->push($delimiter);
+        }
+
+        return true;
+    }
+
+    /**
+     * @return bool[]
+     */
+    private static function determineCanOpenOrClose(string $charBefore, string $charAfter, string $character, DelimiterProcessorInterface $delimiterProcessor): array
+    {
+        $afterIsWhitespace   = \preg_match(RegexHelper::REGEX_UNICODE_WHITESPACE_CHAR, $charAfter);
+        $afterIsPunctuation  = \preg_match(RegexHelper::REGEX_PUNCTUATION, $charAfter);
+        $beforeIsWhitespace  = \preg_match(RegexHelper::REGEX_UNICODE_WHITESPACE_CHAR, $charBefore);
+        $beforeIsPunctuation = \preg_match(RegexHelper::REGEX_PUNCTUATION, $charBefore);
+
+        $leftFlanking  = ! $afterIsWhitespace && (! $afterIsPunctuation || $beforeIsWhitespace || $beforeIsPunctuation);
+        $rightFlanking = ! $beforeIsWhitespace && (! $beforeIsPunctuation || $afterIsWhitespace || $afterIsPunctuation);
+
+        if ($character === '_') {
+            $canOpen  = $leftFlanking && (! $rightFlanking || $beforeIsPunctuation);
+            $canClose = $rightFlanking && (! $leftFlanking || $afterIsPunctuation);
+        } else {
+            $canOpen  = $leftFlanking && $character === $delimiterProcessor->getOpeningCharacter();
+            $canClose = $rightFlanking && $character === $delimiterProcessor->getClosingCharacter();
+        }
+
+        return [$canOpen, $canClose];
+    }
+}
diff --git a/src/Delimiter/Processor/DelimiterProcessorCollection.php b/src/Delimiter/Processor/DelimiterProcessorCollection.php
@@ -79,4 +79,9 @@ private function addStaggeredDelimiterProcessorForChar(string $opening, Delimite
         $s->add($new);
         $this->processorsByChar[$opening] = $s;
     }
+
+    public function count(): int
+    {
+        return \count($this->processorsByChar);
+    }
 }
diff --git a/src/Delimiter/Processor/DelimiterProcessorCollectionInterface.php b/src/Delimiter/Processor/DelimiterProcessorCollectionInterface.php
@@ -19,7 +19,7 @@
 
 namespace League\CommonMark\Delimiter\Processor;
 
-interface DelimiterProcessorCollectionInterface
+interface DelimiterProcessorCollectionInterface extends \Countable
 {
     /**
      * Add the given delim processor to the collection