NEW Make shortcode parser more clever about placement

Shortcodes have traditionally had a problem that they are inside <p> tags,
but generate block level elements. This breaks HTML compliance.

The shortcode parser now mutates the DOM, based on the "class" attribute on the shortcode,
to insert the generated block level element at the right place in the DOM:

 - for "left" and "right" elements it puts them just before the block level
   element they are inside

 - for "leftAlone" and "center" elements it splits the DOM around the shortcode.

The trade off is that shortcodes are no longer "text level" features. They need
knowledge of the HTML they are in to perform this transformation, so they can
only be used in (valid) HTML
commit 2335c074b33cbe69d35ed61aee725e2d437f169d 1 parent 54237d5
@hafriedlander hafriedlander authored
94 docs/en/reference/shortcodes.md
@@ -1,17 +1,87 @@
# Shortcodes
-The Shortcode API (new in 2.4) is a simple regex based parser that allows you to replace simple bbcode-like tags within
-a HTMLText or HTMLVarchar field when rendered into a template. It is inspired by and very similar to the [Wordpress
-implementation](http://codex.wordpress.org/Shortcode_API) of shortcodes.
+The Shortcode API is a way to replace simple bbcode-like tags within HTML. It is inspired by and very similar to
+the [Wordpress implementation](http://codex.wordpress.org/Shortcode_API) of shortcodes.
-Here are all variants of the acceptable shortcode tags:
+A guide to syntax
- [shortcode]
- [shortcode/]
- [shortcode,parameter="value"]
- [shortcode,parameter="value"]Enclosed Content[/shortcode]
+ Unclosed - [shortcode]
+ Explicitly closed - [shortcode/]
+ With parameters, mixed quoting - [shortcode parameter=value parameter2='value2' parameter3="value3"]
+ Old style parameter separation - [shortcode,parameter=value,parameter2='value2',parameter3="value3"]
+ With contained content & closing tag - [shortcode]Enclosed Content[/shortcode]
+ Escaped (will output [just] [text] in response) - [[just] [[text]]
+
+Shortcode parsing is already hooked into HTMLText and HTMLVarchar fields when they are rendered into a template.
+
+## Attribute and element scope
+
+HTML with unprocessed shortcodes in it is still valid HTML. As a result, shortcodes can be in two places in HTML:
+
+ - In an attribute value, like so:
+
+ <a title="[title]">link</a>
+
+ - In an element's text, like so:
+
+ <p>
+ Some text [shortcode] more text
+ </p>
+
+The first is called "attribute scope" use, the second "element scope".
+
+You may not use shortcodes in any other location. Specifically, you can not use shortcodes to generate attributes or
+change the name of a tag. These usages are forbidden:
+
+ <[paragraph]>Some test</[paragraph]>
+
+ <a [titleattribute]>link</a>
+
+Also note:
+
+ - you may need to escape text inside attributes: `>` becomes `&gt;`, etc.
+
+ - you can include HTML tags inside a shortcode tag, but you need to be careful of nesting to ensure you don't
+ break the output
+
+Good:
+
+ <div>
+ [shortcode]
+ <p>Caption</p>
+ [/shortcode]
+ </div>
-Note the usage of `,` to delimit the parameters.
+Bad:
+
+ <div>
+ [shortcode]
+ </div>
+ <p>
+ [/shortcode]
+ </p>
+
+## Location
+
+Element scoped shortcodes have a special ability to move the location they are inserted at to comply with
+HTML lexical rules. Take for example this basic paragraph tag:
+
+ <p><a href="#">Head [figure src="assets/a.jpg" caption="caption"] Tail</a></p>
+
+Converted naively, this would become:
+
+ <p><a href="#">Head <figure><img src="assets/a.jpg" /><figcaption>caption</figcaption></figure> Tail</a></p>
+
+However this is not valid HTML - P elements can not contain other block level elements.
+
+To fix this you can specify a "location" attribute on a shortcode. When the location attribute is "left" or "right"
+the inserted content will be moved to immediately before the block tag. The result is this:
+
+ <figure><img src="assets/a.jpg" /><figcaption>caption</figcaption></figure><p><a href="#">Head Tail</a></p>
+
+When the location attribute is "leftAlone" or "center" then the DOM is split around the element. The result is this:
+
+ <p><a href="#">Head </a></p><figure><img src="assets/a.jpg" /><figcaption>caption</figcaption></figure><p><a href="#"> Tail</a></p>
## Defining Custom Shortcodes
@@ -90,8 +160,4 @@ example the below code will not work as expected:
[shortcode][/shortcode]
[/shortcode]
-The parser will recognise this as:
-
- [shortcode]
- [shortcode]
- [/shortcode]
+The parser will raise an error if it can not find a matching opening tag for any particular closing tag
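The parameter syntax listed in the guide above (bare, single-quoted and double-quoted values, separated by spaces or commas) can be tried in isolation. What follows is a minimal sketch rather than the parser itself: the pattern is modelled on the attribute regex this commit adds to `ShortcodeParser`, and the function name `parseShortcodeAttributes` is hypothetical.

```php
<?php
// Sketch only: parseShortcodeAttributes() is a hypothetical helper, not part of the framework.
// The pattern is modelled on the attribute regex introduced in this commit.
function parseShortcodeAttributes(string $attrString): array {
    $attrrx = '/
        ([^\s\/\'"=,]+)         # Name
        \s* = \s*
        (?:
            (?:\'([^\']+)\') |  # Value in single quotes
            (?:"([^"]+)") |     # Value in double quotes
            (\w+)               # Bare value
        )
    /x';

    $attrs = array();
    preg_match_all($attrrx, $attrString, $matches, PREG_SET_ORDER);

    foreach ($matches as $match) {
        // Exactly one of groups 2-4 is non-empty; take whichever matched.
        // Comparing against '' (rather than relying on truthiness) keeps a value such as "0".
        $values = array_filter(array_slice($match, 2), function ($v) { return $v !== ''; });
        $attrs[$match[1]] = reset($values);
    }

    return $attrs;
}

// Mixed quoting and mixed separators, as in the syntax guide
$attrs = parseShortcodeAttributes(' parameter=value,parameter2=\'value2\' parameter3="value3"');
print_r($attrs);
```

The explicit `!== ''` comparison is a deliberate choice in this sketch: a bare `array_filter` call would also discard a value of `0`, so `[shortcode width=0]` would lose its value.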
419 parsers/ShortcodeParser.php
@@ -9,6 +9,10 @@
*/
class ShortcodeParser {
+ public function img_shortcode($attrs) {
+ return "<img src='".$attrs['src']."'>";
+ }
+
private static $instances = array();
private static $active_instance = 'default';
@@ -96,65 +100,406 @@ public function clear() {
$this->shortcodes = array();
}
+ public function callShortcode($tag, $attributes, $content) {
+ if (!isset($this->shortcodes[$tag])) return false;
+ return call_user_func($this->shortcodes[$tag], $attributes, $content, $this, $tag);
+ }
+
// --------------------------------------------------------------------------------------------------------------
+ protected function removeNode($node) {
+ $node->parentNode->removeChild($node);
+ }
+
+ protected function insertAfter($new, $after) {
+ $parent = $after->parentNode; $next = $after->nextSibling;
+
+ if ($next) {
+ $parent->insertBefore($new, $next);
+ }
+ else {
+ $parent->appendChild($new);
+ }
+ }
+
+ protected function insertListAfter($new, $after) {
+ $doc = $after->ownerDocument; $parent = $after->parentNode; $next = $after->nextSibling;
+
+ for ($i = 0; $i < $new->length; $i++) {
+ $imported = $doc->importNode($new->item($i), true);
+
+ if ($next) {
+ $parent->insertBefore($imported, $next);
+ }
+ else {
+ $parent->appendChild($imported);
+ }
+ }
+ }
+
+ private static $marker_class = '--ss-shortcode-marker';
+
+ private static $block_level_elements = array(
+ 'address', 'article', 'aside', 'audio', 'blockquote', 'canvas', 'dd', 'div', 'dl', 'fieldset', 'figcaption',
+ 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hgroup', 'ol', 'output', 'p',
+ 'pre', 'section', 'table', 'ul'
+ );
+
+ private static $tagrx = '/
+ <(?<element>(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+)> | # HTML Tag - skip attribute scoped tags
+ (?<!\[) \[ (?<open>\w+) (?<attrs>.*?) (?<selfclosed>\/?) \] (?!\]) | # Opening tag
+ (?<!\[) \[\/ (?<close>\w+) \] (?!\]) # Closing tag
+/x';
+
+ private static $attrrx = '/
+ ([^\s\/\'"=,]+) # Name
+ \s* = \s*
+ (?:
+ (?:\'([^\']+)\') | # Value surrounded by \'
+ (?:"([^"]+)") | # Value surrounded by "
+ (\w+) # Bare value
+ )
+/x';
+
+
+ const WARN = 'warn';
+ const STRIP = 'strip';
+ const LEAVE = 'leave';
+ const ERROR = 'error';
+
+ public static $error_behavior = self::LEAVE;
+
+
/**
- * Parse a string, and replace any registered shortcodes within it with the result of the mapped callback.
- *
+ * Look through a string that contains shortcode tags and pull out the locations and details
+ * of those tags
+ *
+ * Doesn't support nested shortcode tags
+ *
* @param string $content
- * @return string
+ * @return array - The list of tags found. When using an open/close pair, only one item will be in the array,
+ * with "content" set to the text between the tags
*/
- public function parse($content) {
- if(!$this->shortcodes) return $content;
-
- $shortcodes = implode('|', array_map('preg_quote', array_keys($this->shortcodes)));
- $pattern = "/\[($shortcodes)(.*?)(\/\]|\](?(4)|(?:(.+?)\[\/\s*\\1\s*\]))|\])/s";
+ protected function extractTags($content) {
+ $tags = array();
- if(preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER)) {
- $replacements = array();
+ if(preg_match_all(self::$tagrx, $content, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE)) {
foreach($matches as $match) {
- $prefix = $match[0][1] ? $content[$match[0][1]-1] : '';
- if(strlen($match[0][0]) + $match[0][1] < strlen($content)) {
- $suffix = $content[strlen($match[0][0]) + $match[0][1]];
- } else {
- $suffix = '';
+ // Ignore any elements
+ if (empty($match['open'][0]) && empty($match['close'][0])) continue;
+
+ // Pull the attributes out into a key/value hash
+ $attrs = array();
+
+ if (!empty($match['attrs'][0])) {
+ preg_match_all(self::$attrrx, $match['attrs'][0], $attrmatches, PREG_SET_ORDER);
+
+ foreach ($attrmatches as $attr) {
+ list($whole, $name, $value) = array_values(array_filter($attr));
+ $attrs[$name] = $value;
+ }
}
- if($prefix == '[' && $suffix == ']') {
- $replacements[] = array($match[0][0], $match[0][1]-1, strlen($match[0][0]) + 2);
- } else {
- $replacements[] = array($this->handleShortcode($match), $match[0][1], strlen($match[0][0]));
+
+ // And store the indexes, tag details, etc
+ $tags[] = array(
+ 'text' => $match[0][0],
+ 's' => $match[0][1],
+ 'e' => $match[0][1] + strlen($match[0][0]),
+ 'open' => @$match['open'][0],
+ 'close' => @$match['close'][0],
+ 'attrs' => $attrs,
+ 'content' => ''
+ );
+ }
+ }
+
+ $i = count($tags);
+ while($i--) {
+ if($tags[$i]['close']) {
+ // If the tag just before this one isn't the related opening tag, throw an error
+ $err = null;
+
+ if ($i == 0) {
+ $err = 'Close tag "'.$tags[$i]['close'].'" is the first found tag, so has no related open tag';
+ }
+ else if (!$tags[$i-1]['open']) {
+ $err = 'Close tag "'.$tags[$i]['close'].'" preceded by another close tag "'.$tags[$i-1]['close'].'"';
+ }
+ else if ($tags[$i]['close'] != $tags[$i-1]['open']) {
+ $err = 'Close tag "'.$tags[$i]['close'].'" doesn\'t match preceding open tag "'.$tags[$i-1]['open'].'"';
+ }
+
+ if($err) {
+ if(self::$error_behavior == self::ERROR) user_error($err, E_USER_ERROR);
+ }
+ else {
+ // Otherwise, grab content between tags, save in opening tag & delete the closing one
+ $tags[$i-1]['text'] = substr($content, $tags[$i-1]['s'], $tags[$i]['e'] - $tags[$i-1]['s']);
+ $tags[$i-1]['content'] = substr($content, $tags[$i-1]['e'], $tags[$i]['s'] - $tags[$i-1]['e']);
+ $tags[$i-1]['e'] = $tags[$i]['e'];
+ unset($tags[$i]);
}
}
- // We reverse this so that replacements don't break offsets
- foreach(array_reverse($replacements) as $replace) {
- $content = substr_replace($content, $replace[0], $replace[1], $replace[2]);
+ }
+
+ return array_values($tags);
+ }
+
+ /**
+ * Replaces the shortcode tags extracted by extractTags with HTML element "markers", so that
+ * we can parse the resulting string as HTML and easily mutate the shortcodes in the DOM
+ *
+ * @param string $content - The HTML string with [tag] style shortcodes embedded
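+ * @param array $tags - The tags extracted by extractTags
+ * @param callable $generator - Called with (index, tag) for each tag; returns the replacement text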
+ * @return string - The HTML string with [tag] style shortcodes replaced by markers
+ */
+ protected function replaceTagsWithText($content, $tags, $generator) {
+ // The string with tags replaced with markers
+ $str = '';
+ // The start index of the next tag, remembered as we step backwards through the list
+ $li = null;
+
+ $i = count($tags);
+ while($i--) {
+ if ($li === null) $tail = substr($content, $tags[$i]['e']);
+ else $tail = substr($content, $tags[$i]['e'], $li - $tags[$i]['e']);
+
+ $str = $generator($i, $tags[$i]). $tail . $str;
+ $li = $tags[$i]['s'];
+ }
+
+ return substr($content, 0, $tags[0]['s']) . $str;
+ }
+
+ /**
+ * Replace the shortcodes in attribute values with the calculated content
+ *
+ * We don't use markers with attributes because there's no point, it's easier to do all the matching
+ * in-DOM after the XML parse
+ *
+ * @param DOMDocument $doc
+ */
+ protected function replaceAttributeTagsWithContent($doc) {
+ $xp = new DOMXPath($doc);
+ $attributes = $xp->query('//@*[contains(.,"[")][contains(.,"]")]');
+ $parser = $this;
+
+ for($i = 0; $i < $attributes->length; $i++) {
+ $node = $attributes->item($i);
+ $tags = $this->extractTags($node->nodeValue);
+
+ if($tags) {
+ $node->nodeValue = $this->replaceTagsWithText($node->nodeValue, $tags, function($idx, $tag) use ($parser){
+ return $parser->callShortcode($tag['open'], $tag['attrs'], $tag['content']);
+ });
}
}
+ }
+
+ /**
+ * Replace the element-scoped tags with markers
+ *
+ * @param string $content
+ */
+ protected function replaceElementTagsWithMarkers($content) {
+ $tags = $this->extractTags($content);
+
+ if($tags) {
+ $markerClass = self::$marker_class;
+
+ $content = $this->replaceTagsWithText($content, $tags, function($idx, $tag) use ($markerClass) {
+ return '<img class="'.$markerClass.'" data-tagid="'.$idx.'" />';
+ });
+ }
+
+ return array($content, $tags);
+ }
+
+ protected function findParentsForMarkers($nodes) {
+ $parents = array();
+
+ foreach($nodes as $node) {
+ $parent = $node;
+
+ do {
+ $parent = $parent->parentNode;
+ }
+ while($parent instanceof DOMElement && !in_array(strtolower($parent->tagName), self::$block_level_elements));
+
+ $node->setAttribute('data-parentid', count($parents));
+ $parents[] = $parent;
+ }
- return $content;
+ return $parents;
}
+ const BEFORE = 'before';
+ const AFTER = 'after';
+ const SPLIT = 'split';
+ const INLINE = 'inline';
+
+ /**
+ * Given a node which represents a shortcode marker and a location string, mutates the DOM to put the
+ * marker in the compliant location
+ *
+ * For shortcodes inserted BEFORE, that location is just before the block container that
+ * the marker is in
+ *
+ * For shortcodes inserted AFTER, that location is just after the block container that
+ * the marker is in
+ *
+ * For shortcodes inserted SPLIT, that location is where the marker is, but the DOM
+ * is split around it up to the block container the marker is in - for instance,
+ *
+ * <p>A<span>B<marker />C</span>D</p>
+ *
+ * becomes
+ *
+ * <p>A<span>B</span></p><marker /><p><span>C</span>D</p>
+ *
+ * For shortcodes inserted INLINE, no modification is needed (but in that case the shortcode handler needs to
+ * generate only inline blocks)
+ *
+ * @param DOMElement $node
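+ * @param DOMNode $parent - The block level container found for the marker
+ * @param string $location - ShortcodeParser::BEFORE, ShortcodeParser::AFTER, ShortcodeParser::SPLIT or ShortcodeParser::INLINE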
+ */
+ protected function moveMarkerToCompliantHome($node, $parent, $location) {
+ // Move before block parent
+ if($location == self::BEFORE) {
+ $parent->parentNode->insertBefore($node, $parent);
+ }
+ // Move after block parent
+ else if($location == self::AFTER) {
+ $this->insertAfter($node, $parent);
+ }
+ // Split parent at node
+ else if($location == self::SPLIT) {
+ $at = $node; $splitee = $node->parentNode;
+
+ while($splitee !== $parent->parentNode) {
+ $spliter = $splitee->cloneNode(false);
+
+ $this->insertAfter($spliter, $splitee);
+
+ while($at->nextSibling) {
+ $spliter->appendChild($at->nextSibling);
+ }
+
+ $at = $splitee; $splitee = $splitee->parentNode;
+ }
+
+ $this->insertAfter($node, $parent);
+ }
+ // Do nothing
+ else if($location == self::INLINE) {
+ if(in_array(strtolower($node->tagName), self::$block_level_elements)) {
+ user_error(
+ 'Requested to insert block tag '.$node->tagName.' inline - probably this will break HTML compliance',
+ E_USER_WARNING
+ );
+ }
+ // NOP
+ }
+ else {
+ user_error('Unknown value for $location argument '.$location, E_USER_ERROR);
+ }
+ }
+
+ /**
+ * Given a node which represents a shortcode marker and some information about the shortcode, call the
+ * shortcode handler & replace the marker with the actual content
+ *
+ * @param DOMElement $node
+ * @param array $tag
+ */
+ protected function replaceMarkerWithContent($node, $tag) {
+ $content = false;
+ if($tag['open']) $content = $this->callShortcode($tag['open'], $tag['attrs'], $tag['content']);
+
+ if ($content === false) {
+ if(self::$error_behavior == self::ERROR) {
+ user_error('Unknown shortcode tag '.$tag['open'], E_USER_ERROR);
+ }
+ if (self::$error_behavior == self::WARN) {
+ $content = '<strong class="warning">'.$tag['text'].'</strong>';
+ }
+ else if (self::$error_behavior == self::LEAVE) {
+ $content = $tag['text'];
+ }
+ else {
+ // self::$error_behavior == self::STRIP - NOP
+ }
+ }
+
+ if ($content) {
+ $parsed = HTML5_Parser::parseFragment($content, 'div');
+ $this->insertListAfter($parsed, $node);
+ }
+
+ $this->removeNode($node);
+ }
+
/**
- * @ignore
+ * Parse a string, and replace any registered shortcodes within it with the result of the mapped callback.
+ *
+ * @param string $content
+ * @return string
*/
- protected function handleShortcode($matches) {
- $shortcode = $matches[1][0];
+ public function parse($content) {
+ if(!$this->shortcodes) return $content;
+
+ // First we operate in text mode, replacing any shortcodes with marker elements so that later we can
+ // use a proper DOM
+ list($content, $tags) = $this->replaceElementTagsWithMarkers($content);
+
+ // Now parse the result into a DOM
+ require_once(THIRDPARTY_PATH.'/html5lib/HTML5/Parser.php');
+
+ $res = '';
- $attributes = array(); // Parse attributes into into this array.
+ $bases = HTML5_Parser::parseFragment(trim($content), 'div');
+ $html = $bases->item(0)->parentNode;
+ $doc = $html->ownerDocument;
+
+ $xp = new DOMXPath($doc);
+
+ // First, replace any shortcodes that are in attributes
+ $this->replaceAttributeTagsWithContent($doc);
+
+ // Find all the element scoped shortcode markers
+ $shortcodes = $xp->query('//img[@class="'.self::$marker_class.'"]');
+
+ // Find the parents. Do this before DOM modification, since SPLIT might cause parents to move otherwise
+ $parents = $this->findParentsForMarkers($shortcodes);
- if(preg_match_all('/(\w+) *= *(?:([\'"])(.*?)\\2|([^ ,"\'>]+))/', $matches[2][0], $match, PREG_SET_ORDER)) {
- foreach($match as $attribute) {
- if(!empty($attribute[4])) {
- $attributes[strtolower($attribute[1])] = $attribute[4];
- } elseif(!empty($attribute[3])) {
- $attributes[strtolower($attribute[1])] = $attribute[3];
+ foreach($shortcodes as $shortcode) {
+ $tag = $tags[$shortcode->getAttribute('data-tagid')];
+ $parent = $parents[$shortcode->getAttribute('data-parentid')];
+
+ $class = null;
+ if(!empty($tag['attrs']['location'])) $class = $tag['attrs']['location'];
+ else if(!empty($tag['attrs']['class'])) $class = $tag['attrs']['class'];
+
+ $location = self::INLINE;
+ if($class == 'left' || $class == 'right') $location = self::BEFORE;
+ if($class == 'center' || $class == 'leftAlone') $location = self::SPLIT;
+
+ if(!$parent) {
+ if($location !== self::INLINE) {
+ user_error("Parent block for shortcode couldn't be found, but location wasn't INLINE", E_USER_ERROR);
}
}
- }
+ else {
+ $this->moveMarkerToCompliantHome($shortcode, $parent, $location);
+ }
- return call_user_func(
- $this->shortcodes[$shortcode],
- $attributes, isset($matches[4][0]) ? $matches[4][0] : '', $this, $shortcode);
+ $this->replaceMarkerWithContent($shortcode, $tag);
+ }
+
+ foreach($html->childNodes as $child) $res .= $doc->saveHTML($child);
+
+ return preg_replace('/\[\[/', '[', preg_replace('/\]\]/', ']', $res));
}
+
}
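The SPLIT behaviour described in the `moveMarkerToCompliantHome` docblock can be reproduced with PHP's built-in DOM extension. This is a sketch under stated assumptions, not the parser's own code path: `splitAroundMarker` is a hypothetical free function, the input is loaded with `DOMDocument::loadXML` instead of html5lib, and an `<img/>` element stands in for the marker.

```php
<?php
// Sketch only: splitAroundMarker() is a hypothetical standalone helper that mirrors
// the SPLIT branch of moveMarkerToCompliantHome() using plain DOMDocument nodes.
function splitAroundMarker(DOMNode $marker, DOMNode $block): void {
    $at = $marker;
    $splitee = $marker->parentNode;

    // Walk up from the marker until we pass the block level container
    while (!$splitee->isSameNode($block->parentNode)) {
        // Shallow clone: same tag and attributes, no children
        $clone = $splitee->cloneNode(false);
        $splitee->parentNode->insertBefore($clone, $splitee->nextSibling);

        // Everything to the right of the split point moves into the clone
        while ($at->nextSibling) {
            $clone->appendChild($at->nextSibling);
        }

        $at = $splitee;
        $splitee = $splitee->parentNode;
    }

    // Finally place the marker between the two halves of the block
    $block->parentNode->insertBefore($marker, $block->nextSibling);
}

$doc = new DOMDocument();
$doc->loadXML('<div><p>A<span>B<img/>C</span>D</p></div>');

$marker = $doc->getElementsByTagName('img')->item(0);
$block = $doc->getElementsByTagName('p')->item(0);

splitAroundMarker($marker, $block);

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
// prints <div><p>A<span>B</span></p><img/><p><span>C</span>D</p></div>
```

The outer `<div>` exists only to give the paragraph a parent; the inner result matches the `<p>A<span>B<marker />C</span>D</p>` example in the docblock.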
68 tests/parsers/ShortcodeParserTest.php
@@ -18,6 +18,19 @@ public function setUp() {
* Tests that valid short codes that have not been registered are not replaced.
*/
public function testNotRegisteredShortcode() {
+ ShortcodeParser::$error_behavior = ShortcodeParser::STRIP;
+ $this->assertEquals(
+ '',
+ $this->parser->parse('[not_shortcode]')
+ );
+
+ ShortcodeParser::$error_behavior = ShortcodeParser::WARN;
+ $this->assertEquals(
+ '<strong class="warning">[not_shortcode]</strong>',
+ $this->parser->parse('[not_shortcode]')
+ );
+
+ ShortcodeParser::$error_behavior = ShortcodeParser::LEAVE;
$this->assertEquals('[not_shortcode]',
$this->parser->parse('[not_shortcode]'));
$this->assertEquals('[not_shortcode /]',
@@ -26,11 +39,16 @@ public function testNotRegisteredShortcode() {
$this->parser->parse('[not_shortcode,foo="bar"]'));
$this->assertEquals('[not_shortcode]a[/not_shortcode]',
$this->parser->parse('[not_shortcode]a[/not_shortcode]'));
+ $this->assertEquals('[/not_shortcode]',
+ $this->parser->parse('[/not_shortcode]'));
}
public function testSimpleTag() {
- $tests = array('[test_shortcode]', '[test_shortcode ]', '[test_shortcode,]', '[test_shortcode/]',
- '[test_shortcode /]');
+ $tests = array(
+ '[test_shortcode]',
+ '[test_shortcode ]', '[test_shortcode,]', '[test_shortcode, ]',
+ '[test_shortcode/]', '[test_shortcode /]', '[test_shortcode,/]', '[test_shortcode, /]'
+ );
foreach($tests as $test) {
$this->parser->parse($test);
@@ -43,9 +61,9 @@ public function testSimpleTag() {
public function testOneArgument() {
$tests = array (
- '[test_shortcode,foo="bar"]',
- "[test_shortcode,foo='bar']",
- '[test_shortcode,foo = "bar" /]'
+ '[test_shortcode foo="bar"]', '[test_shortcode,foo="bar"]',
+ "[test_shortcode foo='bar']", "[test_shortcode,foo='bar']",
+ '[test_shortcode foo = "bar" /]', '[test_shortcode, foo = "bar" /]'
);
foreach($tests as $test) {
@@ -58,7 +76,7 @@ public function testOneArgument() {
}
public function testMultipleArguments() {
- $this->parser->parse('[test_shortcode,foo = "bar",bar=\'foo\',baz="buz"]');
+ $this->parser->parse('[test_shortcode foo = "bar",bar=\'foo\', baz="buz"]');
$this->assertEquals(array('foo' => 'bar', 'bar' => 'foo', 'baz' => 'buz'), $this->arguments);
$this->assertEquals('', $this->contents);
@@ -86,7 +104,7 @@ public function testShortcodeEscaping() {
$this->assertEquals('[test_shortcode]content[/test_shortcode]',
$this->parser->parse('[[test_shortcode]content[/test_shortcode]]'));
}
-
+
public function testUnquotedArguments() {
$this->assertEquals('', $this->parser->parse('[test_shortcode,foo=bar,baz = buz]'));
$this->assertEquals(array('foo' => 'bar', 'baz' => 'buz'), $this->arguments);
@@ -111,6 +129,42 @@ public function testConsecutiveTags() {
$this->assertEquals('', $this->parser->parse('[test_shortcode][test_shortcode]'));
}
+ protected function assertEqualsIgnoringWhitespace($a, $b, $message = null) {
+ $this->assertEquals(preg_replace('/\s+/', '', $a), preg_replace('/\s+/', '', $b), $message);
+ }
+
+ public function testExtract() {
+ // Left extracts to before the current block
+ $this->assertEqualsIgnoringWhitespace(
+ 'Code<div>FooBar</div>',
+ $this->parser->parse('<div>Foo[test_shortcode class=left]Code[/test_shortcode]Bar</div>')
+ );
+
+ // Even if the immediate parent isn't the current block
+ $this->assertEqualsIgnoringWhitespace(
+ 'Code<div>Foo<b>BarBaz</b>Qux</div>',
+ $this->parser->parse('<div>Foo<b>Bar[test_shortcode class=left]Code[/test_shortcode]Baz</b>Qux</div>')
+ );
+
+ // Center splits the current block
+ $this->assertEqualsIgnoringWhitespace(
+ '<div>Foo</div>Code<div>Bar</div>',
+ $this->parser->parse('<div>Foo[test_shortcode class=center]Code[/test_shortcode]Bar</div>')
+ );
+
+ // Even if the immediate parent isn't the current block
+ $this->assertEqualsIgnoringWhitespace(
+ '<div>Foo<b>Bar</b></div>Code<div><b>Baz</b>Qux</div>',
+ $this->parser->parse('<div>Foo<b>Bar[test_shortcode class=center]Code[/test_shortcode]Baz</b>Qux</div>')
+ );
+
+ // No class means don't extract
+ $this->assertEqualsIgnoringWhitespace(
+ '<div>FooCodeBar</div>',
+ $this->parser->parse('<div>Foo[test_shortcode]Code[/test_shortcode]Bar</div>')
+ );
+ }
+
// -----------------------------------------------------------------------------------------------------------------
/**
114 thirdparty/html5lib/HTML5/Data.php
@@ -0,0 +1,114 @@
+<?php
+
+// warning: this file is encoded in UTF-8!
+
+class HTML5_Data
+{
+
+ // at some point this should be moved to a .ser file. Another
+ // possible optimization is to give UTF-8 bytes, not Unicode
+ // codepoints
+ // XXX: Not quite sure why it's named this; this is
+ // actually the numeric entity dereference table.
+ protected static $realCodepointTable = array(
+ 0x00 => 0xFFFD, // REPLACEMENT CHARACTER
+ 0x0D => 0x000A, // LINE FEED (LF)
+ 0x80 => 0x20AC, // EURO SIGN ('€')
+ 0x81 => 0x0081, // <control>
+ 0x82 => 0x201A, // SINGLE LOW-9 QUOTATION MARK ('‚')
+ 0x83 => 0x0192, // LATIN SMALL LETTER F WITH HOOK ('ƒ')
+ 0x84 => 0x201E, // DOUBLE LOW-9 QUOTATION MARK ('„')
+ 0x85 => 0x2026, // HORIZONTAL ELLIPSIS ('…')
+ 0x86 => 0x2020, // DAGGER ('†')
+ 0x87 => 0x2021, // DOUBLE DAGGER ('‡')
+ 0x88 => 0x02C6, // MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
+ 0x89 => 0x2030, // PER MILLE SIGN ('‰')
+ 0x8A => 0x0160, // LATIN CAPITAL LETTER S WITH CARON ('Š')
+ 0x8B => 0x2039, // SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
+ 0x8C => 0x0152, // LATIN CAPITAL LIGATURE OE ('Œ')
+ 0x8D => 0x008D, // <control>
+ 0x8E => 0x017D, // LATIN CAPITAL LETTER Z WITH CARON ('Ž')
+ 0x8F => 0x008F, // <control>
+ 0x90 => 0x0090, // <control>
+ 0x91 => 0x2018, // LEFT SINGLE QUOTATION MARK ('‘')
+ 0x92 => 0x2019, // RIGHT SINGLE QUOTATION MARK ('’')
+ 0x93 => 0x201C, // LEFT DOUBLE QUOTATION MARK ('“')
+ 0x94 => 0x201D, // RIGHT DOUBLE QUOTATION MARK ('”')
+ 0x95 => 0x2022, // BULLET ('•')
+ 0x96 => 0x2013, // EN DASH ('–')
+ 0x97 => 0x2014, // EM DASH ('—')
+ 0x98 => 0x02DC, // SMALL TILDE ('˜')
+ 0x99 => 0x2122, // TRADE MARK SIGN ('™')
+ 0x9A => 0x0161, // LATIN SMALL LETTER S WITH CARON ('š')
+ 0x9B => 0x203A, // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
+ 0x9C => 0x0153, // LATIN SMALL LIGATURE OE ('œ')
+ 0x9D => 0x009D, // <control>
+ 0x9E => 0x017E, // LATIN SMALL LETTER Z WITH CARON ('ž')
+ 0x9F => 0x0178, // LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
+ );
+
+ protected static $namedCharacterReferences;
+
+ protected static $namedCharacterReferenceMaxLength;
+
+ /**
+ * Returns the "real" Unicode codepoint of a malformed character
+ * reference.
+ */
+ public static function getRealCodepoint($ref) {
+ if (!isset(self::$realCodepointTable[$ref])) return false;
+ else return self::$realCodepointTable[$ref];
+ }
+
+ public static function getNamedCharacterReferences() {
+ if (!self::$namedCharacterReferences) {
+ self::$namedCharacterReferences = unserialize(
+ file_get_contents(dirname(__FILE__) . '/named-character-references.ser'));
+ }
+ return self::$namedCharacterReferences;
+ }
+
+ /**
+ * Converts a Unicode codepoint to sequence of UTF-8 bytes.
+ * @note Shamelessly stolen from HTML Purifier, which is also
+ * shamelessly stolen from Feyd (which is in public domain).
+ */
+ public static function utf8chr($code) {
+ /* We don't care: we live dangerously
+ * if($code > 0x10FFFF or $code < 0x0 or
+ ($code >= 0xD800 and $code <= 0xDFFF) ) {
+ // bits are set outside the "valid" range as defined
+ // by UNICODE 4.1.0
+ return "\xEF\xBF\xBD";
+ }*/
+
+ $x = $y = $z = $w = 0;
+ if ($code < 0x80) {
+ // regular ASCII character
+ $x = $code;
+ } else {
+ // set up bits for UTF-8
+ $x = ($code & 0x3F) | 0x80;
+ if ($code < 0x800) {
+ $y = (($code & 0x7FF) >> 6) | 0xC0;
+ } else {
+ $y = (($code & 0xFC0) >> 6) | 0x80;
+ if($code < 0x10000) {
+ $z = (($code >> 12) & 0x0F) | 0xE0;
+ } else {
+ $z = (($code >> 12) & 0x3F) | 0x80;
+ $w = (($code >> 18) & 0x07) | 0xF0;
+ }
+ }
+ }
+ // set up the actual character
+ $ret = '';
+ if($w) $ret .= chr($w);
+ if($z) $ret .= chr($z);
+ if($y) $ret .= chr($y);
+ $ret .= chr($x);
+
+ return $ret;
+ }
+
+}
284 thirdparty/html5lib/HTML5/InputStream.php
@@ -0,0 +1,284 @@
+<?php
+
+/*
+
+Copyright 2009 Geoffrey Sneddon <http://gsnedders.com/>
+
+Permission is hereby granted, free of charge, to any person obtaining a
+copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+*/
+
+// Some conventions:
+// /* */ indicates verbatim text from the HTML 5 specification
+// // indicates regular comments
+
+class HTML5_InputStream {
+ /**
+ * The string data we're parsing.
+ */
+ private $data;
+
+ /**
+ * The current integer byte position we are in $data
+ */
+ private $char;
+
+ /**
+ * Length of $data; when $char === $EOF, we are at the end-of-file.
+ */
+ private $EOF;
+
+ /**
+ * Parse errors.
+ */
+ public $errors = array();
+
+ /**
+ * @param $data Data to parse
+ */
+ public function __construct($data) {
+
+ /* Given an encoding, the bytes in the input stream must be
+ converted to Unicode characters for the tokeniser, as
+ described by the rules for that encoding, except that the
+ leading U+FEFF BYTE ORDER MARK character, if any, must not
+ be stripped by the encoding layer (it is stripped by the rule below).
+
+ Bytes or sequences of bytes in the original byte stream that
+ could not be converted to Unicode characters must be converted
+ to U+FFFD REPLACEMENT CHARACTER code points. */
+
+ // XXX currently assuming input data is UTF-8; once we
+ // build encoding detection this will no longer be the case
+ //
+ // We previously had an mbstring implementation here, but that
+ // implementation is heavily non-conforming, so it's been
+ // omitted.
+ if (extension_loaded('iconv')) {
+ // non-conforming
+ $data = @iconv('UTF-8', 'UTF-8//IGNORE', $data);
+ } else {
+ // we can make a conforming native implementation
+ throw new Exception('Not implemented, please install mbstring or iconv');
+ }
+
+ /* One leading U+FEFF BYTE ORDER MARK character must be
+ ignored if any are present. */
+ if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
+ $data = substr($data, 3);
+ }
+
+ /* All U+0000 NULL characters in the input must be replaced
+ by U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such
+ characters is a parse error. */
+ for ($i = 0, $count = substr_count($data, "\0"); $i < $count; $i++) {
+ $this->errors[] = array(
+ 'type' => HTML5_Tokenizer::PARSEERROR,
+ 'data' => 'null-character'
+ );
+ }
+ /* U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED
+ (LF) characters are treated specially. Any CR characters
+ that are followed by LF characters must be removed, and any
+ CR characters not followed by LF characters must be converted
+ to LF characters. Thus, newlines in HTML DOMs are represented
+ by LF characters, and there are never any CR characters in the
+ input to the tokenization stage. */
+ $data = str_replace(
+ array(
+ "\0",
+ "\r\n",
+ "\r"
+ ),
+ array(
+ "\xEF\xBF\xBD",
+ "\n",
+ "\n"
+ ),
+ $data
+ );
+
+ /* Any occurrences of any characters in the ranges U+0001 to
+ U+0008, U+000B, U+000E to U+001F, U+007F to U+009F,
+ U+D800 to U+DFFF , U+FDD0 to U+FDEF, and
+ characters U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF,
+ U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE,
+ U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF,
+ U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE,
+ U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
+ U+10FFFF are parse errors. (These are all control characters
+ or permanently undefined Unicode characters.) */
+ // Check PCRE is loaded.
+ if (extension_loaded('pcre')) {
+ $count = preg_match_all(
+ '/(?:
+ [\x01-\x08\x0B\x0E-\x1F\x7F] # U+0001 to U+0008, U+000B, U+000E to U+001F and U+007F
+ |
+ \xC2[\x80-\x9F] # U+0080 to U+009F
+ |
+ \xED(?:\xA0[\x80-\xFF]|[\xA1-\xBE][\x00-\xFF]|\xBF[\x00-\xBF]) # U+D800 to U+DFFF
+ |
+ \xEF\xB7[\x90-\xAF] # U+FDD0 to U+FDEF
+ |
+ \xEF\xBF[\xBE\xBF] # U+FFFE and U+FFFF
+ |
+ [\xF0-\xF4][\x8F-\xBF]\xBF[\xBE\xBF] # U+nFFFE and U+nFFFF (1 <= n <= 10_{16})
+ )/x',
+ $data,
+ $matches
+ );
+ for ($i = 0; $i < $count; $i++) {
+ $this->errors[] = array(
+ 'type' => HTML5_Tokenizer::PARSEERROR,
+ 'data' => 'invalid-codepoint'
+ );
+ }
+ } else {
+ // XXX: Need non-PCRE impl, probably using substr_count
+ }
+
+ $this->data = $data;
+ $this->char = 0;
+ $this->EOF = strlen($data);
+ }
+
+ /**
+ * Returns the current line that the tokenizer is at.
+ */
+ public function getCurrentLine() {
+ // Check the string isn't empty
+ if($this->EOF) {
+ // Add one to $this->char because we want the number for the next
+ // byte to be processed.
+ return substr_count($this->data, "\n", 0, min($this->char, $this->EOF)) + 1;
+ } else {
+ // If the string is empty, we are on the first line (sorta).
+ return 1;
+ }
+ }
+
+ /**
+ * Returns the current column of the current line that the tokenizer is at.
+ */
+ public function getColumnOffset() {
+ // strrpos is weird, and the offset needs to be negative for what we
+ // want (i.e., the last \n before $this->char). This needs to not have
+ // one (to make it point to the next character, the one we want the
+ // position of) added to it because strrpos's behaviour includes the
+ // final offset byte.
+ $lastLine = strrpos($this->data, "\n", $this->char - 1 - strlen($this->data));
+
+ // However, for here we want the length up until the next byte to be
+ // processed, so add one to the current byte ($this->char).
+ if($lastLine !== false) {
+ $findLengthOf = substr($this->data, $lastLine + 1, $this->char - 1 - $lastLine);
+ } else {
+ $findLengthOf = substr($this->data, 0, $this->char);
+ }
+
+ // Get the length for the string we need.
+ if(extension_loaded('iconv')) {
+ return iconv_strlen($findLengthOf, 'utf-8');
+ } elseif(extension_loaded('mbstring')) {
+ return mb_strlen($findLengthOf, 'utf-8');
+ } elseif(extension_loaded('xml')) {
+ return strlen(utf8_decode($findLengthOf));
+ } else {
+ $count = count_chars($findLengthOf);
+ // 0x80 = 0x7F - 0 + 1 (one added to get inclusive range)
+ // 0x33 = 0xF4 - 0xC2 + 1 (one added to get inclusive range)
+ return array_sum(array_slice($count, 0, 0x80)) +
+ array_sum(array_slice($count, 0xC2, 0x33));
+ }
+ }
+
+ /**
+ * Retrieve the currently consumed character.
+ * @note This performs bounds checking
+ */
+ public function char() {
+ return ($this->char++ < $this->EOF)
+ ? $this->data[$this->char - 1]
+ : false;
+ }
+
+ /**
+ * Get all characters until EOF.
+ * @note This performs bounds checking
+ */
+ public function remainingChars() {
+ if($this->char < $this->EOF) {
+ $data = substr($this->data, $this->char);
+ $this->char = $this->EOF;
+ return $data;
+ } else {
+ return false;
+ }
+ }
+
+ /**
+ * Matches as far as possible until we reach a certain set of bytes
+ * and returns the matched substring.
+ * @param $bytes Bytes to match.
+ * @param $max Maximum number of bytes to examine (optional).
+ */
+ public function charsUntil($bytes, $max = null) {
+ if ($this->char < $this->EOF) {
+ if ($max === 0 || $max) {
+ $len = strcspn($this->data, $bytes, $this->char, $max);
+ } else {
+ $len = strcspn($this->data, $bytes, $this->char);
+ }
+ $string = (string) substr($this->data, $this->char, $len);
+ $this->char += $len;
+ return $string;
+ } else {
+ return false;
+ }
+ }
+
+ /**
+ * Matches as far as possible with a certain set of bytes
+ * and returns the matched substring.
+ * @param $bytes Bytes to match.
+ * @param $max Maximum number of bytes to examine (optional).
+ */
+ public function charsWhile($bytes, $max = null) {
+ if ($this->char < $this->EOF) {
+ if ($max === 0 || $max) {
+ $len = strspn($this->data, $bytes, $this->char, $max);
+ } else {
+ $len = strspn($this->data, $bytes, $this->char);
+ }
+ $string = (string) substr($this->data, $this->char, $len);
+ $this->char += $len;
+ return $string;
+ } else {
+ return false;
+ }
+ }
+
+ /**
+ * Unconsume one character.
+ */
+ public function unget() {
+ if ($this->char <= $this->EOF) {
+ $this->char--;
+ }
+ }
+}
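The two core ideas in InputStream.php, newline/NUL normalisation and `strcspn`-based scanning, can be tried in isolation. The following is only a minimal sketch: the function names `sketch_normalize_input` and `sketch_chars_until` are hypothetical and are not part of html5lib, and the real class also records parse errors and tracks its position internally.

```php
<?php
// Hypothetical helper (not part of html5lib): mirrors the preprocessing
// step that replaces NUL with U+FFFD and folds CRLF / CR into LF.
function sketch_normalize_input($data) {
    // Order matters: "\r\n" must be replaced before the lone "\r",
    // otherwise a CRLF pair would turn into two LF characters.
    return str_replace(
        array("\0", "\r\n", "\r"),
        array("\xEF\xBF\xBD", "\n", "\n"),
        $data
    );
}

// Hypothetical helper: consume bytes starting at $pos until one of the
// bytes in $bytes is seen. Returns array(matched substring, new position).
function sketch_chars_until($data, $bytes, $pos) {
    $len = strcspn($data, $bytes, $pos);
    return array((string) substr($data, $pos, $len), $pos + $len);
}

$clean = sketch_normalize_input("<p>a\r\nb\rc</p>");
// Start scanning just after "<p>" and stop at the next "<".
list($text, $pos) = sketch_chars_until($clean, '<', 3);
```

Returning the new position instead of mutating shared state keeps the sketch free of class scaffolding; the committed `charsUntil()` advances `$this->char` instead.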
36 thirdparty/html5lib/HTML5/Parser.php
@@ -0,0 +1,36 @@
+<?php
+
+require_once dirname(__FILE__) . '/Data.php';
+require_once dirname(__FILE__) . '/InputStream.php';
+require_once dirname(__FILE__) . '/TreeBuilder.php';
+require_once dirname(__FILE__) . '/Tokenizer.php';
+
+/**
+ * Outwards facing interface for HTML5.
+ */
+class HTML5_Parser
+{
+ /**
+ * Parses a full HTML document.
+ * @param $text HTML text to parse
+ * @param $builder Custom builder implementation
+ * @return Parsed HTML as DOMDocument
+ */
+ static public function parse($text, $builder = null) {
+ $tokenizer = new HTML5_Tokenizer($text, $builder);
+ $tokenizer->parse();
+ return $tokenizer->save();
+ }
+ /**
+ * Parses an HTML fragment.
+ * @param $text HTML text to parse
+ * @param $context String name of context element to pretend parsing is in.
+ * @param $builder Custom builder implementation
+ * @return Parsed HTML as DOMDocument
+ */
+ static public function parseFragment($text, $context = null, $builder = null) {
+ $tokenizer = new HTML5_Tokenizer($text, $builder);
+ $tokenizer->parseFragment($context);
+ return $tokenizer->save();
+ }
+}
2,422 thirdparty/html5lib/HTML5/Tokenizer.php
2,422 additions, 0 deletions not shown
3,840 thirdparty/html5lib/HTML5/TreeBuilder.php
3,840 additions, 0 deletions not shown
1  thirdparty/html5lib/HTML5/named-character-references.ser
1 addition, 0 deletions not shown
22 thirdparty/html5lib/LICENSE
@@ -0,0 +1,22 @@
+Copyright (c) 2006-2011 The Authors
+
+Contributors:
+James Graham - jg307@cam.ac.uk
+Anne van Kesteren - annevankesteren@gmail.com
+Lachlan Hunt - lachlan.hunt@lachy.id.au
+Matt McDonald - kanashii@kanashii.ca
+Sam Ruby - rubys@intertwingly.net
+Ian Hickson (Google) - ian@hixie.ch
+Thomas Broyer - t.broyer@ltgt.net
+Jacques Distler - distler@golem.ph.utexas.edu
+Henri Sivonen - hsivonen@iki.fi
+Adam Barth - abarth@webkit.org
+Eric Seidel - eric@webkit.org
+The Mozilla Foundation (contributions from Henri Sivonen since 2008)
+David Flanagan (Mozilla) - dflanagan@mozilla.com
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
47 thirdparty/html5lib/README
@@ -0,0 +1,47 @@
+html5lib - php flavour
+
+This is an implementation of the tokenization and tree-building parts
+of the HTML5 specification in PHP. Potential uses of this library
+can be found in web-scrapers and HTML filters.
+
+Warning: This is a pre-alpha release, and as such, certain parts of
+this code are not up-to-snuff (e.g. error reporting and performance).
+However, the code is very close to spec and passes 100% of tests
+not related to parse errors. Nevertheless, expect to have to update
+your code on the next upgrade.
+
+
+Usage notes:
+
+ <?php
+ require_once '/path/to/HTML5/Parser.php';
+ $dom = HTML5_Parser::parse('<html><body>...');
+ $nodelist = HTML5_Parser::parseFragment('<b>Boo</b><br>');
+ $nodelist = HTML5_Parser::parseFragment('<td>Bar</td>', 'table');
+
+
+Documentation:
+
+HTML5_Parser::parse($text)
+ $text : HTML to parse
+ return : DOMDocument of parsed document
+
+HTML5_Parser::parseFragment($text, $context)
+ $text : HTML to parse
+ $context : String name of context element
+ return : DOMDocument of parsed document
+
+
+Developer notes:
+
+* To set up unit tests, you need to add a small stub file test-settings.php
+ that contains $simpletest_location = 'path/to/simpletest/'; This needs to
+ be version 1.1 (or, until that is released, SVN trunk) of SimpleTest.
+
+* We don't want to ultimately use PHP's DOM because it is not tolerant
+ of certain types of errors that HTML 5 allows (for example, an element
+ "foo@bar"). But the current implementation uses it, since it's easy.
+ Eventually, this html5lib implementation will get a version of SimpleTree;
+ and may possibly start using that by default.
+
+ vim: et sw=4 sts=4
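The README's developer note about PHP's DOM not tolerating names such as "foo@bar" can be checked without html5lib. This is a small sketch that assumes only that the standard `dom` extension is loaded; the variable names are illustrative.

```php
<?php
// Sketch: PHP's DOM extension rejects element names that are not valid
// XML names, which is the limitation the README's developer note describes.
$doc = new DOMDocument();
$rejected = false;
try {
    // "@" is not allowed in an XML element name, so this call throws.
    $doc->createElement('foo@bar');
} catch (DOMException $e) {
    $rejected = true;
}

// A name made only of valid XML name characters is accepted.
$accepted = $doc->createElement('foo-bar');
```

How the bundled TreeBuilder copes with such names is not shown in this diff, so the sketch only demonstrates the underlying DOM behaviour.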