Skip to content

Replace Misc::get_element() with DOMDocument for autodiscovery #37

Closed
rmccue opened this Issue Sep 29, 2009 · 10 comments

3 participants

@rmccue
SimplePie member
rmccue commented Sep 29, 2009

Originally reported by Philippe L. as issue 143

I've tried the following PHP script to run on 2 different environments. And to my surprise, I obtain different results depending on the environment it's run from.

$feed = new SimplePie();

$feed->set_feed_url('http://www.lemonde.fr');
$feed->enable_cache(false);

$feed->init();

if($feed->data) echo 'SimplePie: parsing successful';
else echo $feed->error;

The environments are as follow:

Environment 1 (PHP4): SimplePie 1.2 / PHP4.4.9 / libxml2 2.6.27 / mbstring / No iconv / PCRE 7.7 2008-05-07 / libcurl/7.15.5 OpenSSL/0.9.8c zlib/1.2.3 libidn/0.6.5
Environment 2 (PHP5): SimplePie 1.2 / PHP5.2.9 / libxml2 2.7.3 / mbstring 4.4.4 / iconv 2.5 / PCRE 7.8 2008-09-05 / libcurl/7.19.4 OpenSSL/0.9.8i zlib/1.2.3 libidn/0.6.5
When run on Environment 1 (PHP4), the script is able to find the feed and renders:

SimplePie: parsing successful

But, if I execute the exact same PHP script on Environment 2 (PHP5), it renders this simplepie error:

A feed could not be found at http://www.lemonde.fr

I have also tried other experiments:
If I force_feed(), I get the following SimplePie error on both environments:

XML error: syntax error at line 3, column 45

If I replace the feed URL by the actual feed location http://www.lemonde.fr/rss/une.xml, the feed is correctly parsed by SimplePie on both environments.

@rmccue
SimplePie member
rmccue commented Oct 1, 2009

This is an error in the SimplePie_Misc::get_element() regex. Trying to sort it out now.

@gsnedders
SimplePie member

…and then you'll just introduce another. You'll never fix all issues while using regexp to parse HTML.

@rmccue
SimplePie member
rmccue commented Feb 18, 2010

I didn't write the original code. Feel free to replace it with a proper HTML5 parser. :)

@gsnedders
SimplePie member

That's done in my ComplexPie branch. ;)

@rmccue
SimplePie member
rmccue commented Oct 28, 2010

gsnedders, the error is the backtracking limit in PCRE, due to the file size. Thoughts on how we can get around it without replacing it completely? (Chunking the HTML into several strings somehow?)

I'd also accept a patch to change to a HTML5 parser (don't commit though, need to merge in the refactor branch first).

@dev101
dev101 commented Oct 29, 2010

i use a quick'n'dirty workaround by hard-limiting $string size to the first 50K if exhausting limits. note that preg_match_all() does not return FALSE when hitting the limit (at least in my environment), that's why i check preg_last_error() explicitly:

if (preg_match_all())
{
...
}
elseif (preg_last_error() != PREG_NO_ERROR) {
return SimplePie_Misc::get_element($realname, mb_substr($string, 0, 50000));
}

@gsnedders
SimplePie member

Oh, if it's just due to the backtracking limit, then it's wontfix under the current parser, IMO. Known limitation.

@rmccue
SimplePie member
rmccue commented Nov 4, 2010

I disagree. I say we break it down into chunks and parse those individually if we hit a backtracking limit error. It's not fantastic (since it could break in the middle of tag), but IMHO, it beats the current one.

@gsnedders
SimplePie member

There may well be security considerations if we break within the attribute value of something like:

  <a href="http://example.com" title="<link rel='alternate' type='text/xml' href='http://example.com/rss'>">foobar</a>
@rmccue
SimplePie member
rmccue commented Apr 13, 2011

So, the solution to this is to replace it with using DOMDocument instead, although SimplePie_Misc::get_element() is also used to parse XML, so we can't replace that function. Instead, I propose each place that uses it instead takes care of its own DOMDocument. @gsnedders, your thoughts?

@rmccue rmccue added a commit that closed this issue Jan 16, 2012
@rmccue rmccue Finally switch from regex to DOM for parsing HTML.
This has the extra benefit of allowing autodiscovery to work with large pages, and hence fixes #37
878c83c
@rmccue rmccue closed this in 878c83c Jan 16, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.