XML parsing with CDATA not working #817

oxivanisher · 2024-04-25T08:22:11Z

I am trying to monitor new Factorio releases, but fields that contain CDATA sections are always returned empty.

Factorio publishes new releases in its phpBB forum, which has an Atom feed. The entry that should work, IMHO, is:

name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath: '//entry[1]/title/text()'

One of the entries looks like this:

<entry>
	<author><name><![CDATA[FactorioBot]]></name></author>
	<updated>2024-04-11T15:29:30</updated>
	<published>2024-04-11T15:29:30</published>
	<id>https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190</id>
	<link href="https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190"/>
	<title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
	<category term="Releases" scheme="https://forums.factorio.com/viewforum.php?f=3" label="Releases"/>
	<content type="html" xml:base="https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190"><![CDATA[
	<strong class="text-strong">Modding</strong>  <ul>    <li>Added an optional "mods" to simulation definitions.</li>  </ul><strong class="text-strong">Scripting</strong>  <ul>    <li>Disabled the majority of the lua "debug" library due to security issues.</li>  </ul><strong class="text-strong">Bugfixes</strong>  <ul>    <li>Fixed LuaEntity::set_request_slot would not accept count of 0. (<a href="https://forums.factorio.com/110676" class="postlink">110676</a>)</li>    <li>Fixed first tutorial level advancing to a wrong story step after drill is set in quickbar. (<a href="https://forums.factorio.com/109315" class="postlink">109315</a>)</li>    <li>Fixed mods sorting order by last highlighted and by last updated. (<a href="https://forums.factorio.com/106420" class="postlink">106420</a>)</li>  </ul>Use the automatic updater if you can (check experimental updates in other settings) or download full installation at <a href="https://www.factorio.com/download/experimental" class="postlink">https://www.factorio.com/download/experimental</a>.<p>Statistics: Posted by <a href="https://forums.factorio.com/memberlist.php?mode=viewprofile&amp;u=7177">FactorioBot</a> — Thu Apr 11, 2024 3:29 pm</p><hr />
	]]></content>
</entry>

I can get every field that does not contain a CDATA section, but none that does. For example, '//entry[1]/id/text()' works without a problem.

Jamstah · 2024-05-29T23:29:58Z

This seems to work using the XML parser instead of the HTML one, but you do need to specify the namespace correctly:

name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath:
      path: '//atom:entry[1]/atom:title/text()'
      method: xml
      namespaces:
        atom: 'http://www.w3.org/2005/Atom'
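As a rough illustration of why the namespace prefix matters once the feed is parsed as XML, here is a minimal sketch. It uses Python's standard-library `xml.etree.ElementTree` rather than lxml, so it is not what urlwatch runs internally, and the trimmed-down `FEED` string is an assumption modelled on the entry quoted above.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping, mirroring the "namespaces" key in the job above.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, shortened stand-in for the real feed.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
  </entry>
</feed>"""

root = ET.fromstring(FEED)

# An unprefixed name means "element in no namespace", so nothing matches
# in a document whose elements all live in the Atom namespace.
unprefixed = root.find("entry/title")

# With the prefix bound to the Atom namespace URI, the title is found and
# the CDATA section comes back as ordinary text.
prefixed = root.find("atom:entry/atom:title", ATOM)

print(unprefixed)
print(prefixed.text)
```

The prefix name itself (`atom`) is arbitrary; only the URI it is bound to has to match the feed.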

oxivanisher · 2024-05-30T21:29:50Z

This works great! So my problem is solved, but I don't know whether the issue should be left open, since it should probably also work with the plain xpath filter?

Jamstah · 2024-05-30T21:39:19Z

I'm not sure. You're trying to parse XML with an HTML parser. From what I could see, it should work but doesn't.

I expect a simple test case using lxml etree on its own would be a good start. Open an issue on the lxml bug tracker with sample code and see what happens.

I don't see anything wrong with how urlwatch is using the library, but I'm not an expert.
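A standalone test case of the kind suggested here might look roughly like the sketch below. It assumes lxml is installed; the shortened `FEED` and the helper names `first_title_html` and `first_title_xml` are made up for illustration and are not urlwatch code. What the HTML parser returns for the CDATA title may depend on the lxml/libxml2 version, so no result is claimed for it.

```python
from lxml import etree

# Prefix-to-URI mapping for the namespace-aware XPath query.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, shortened stand-in for the real feed, as UTF-8 bytes.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190</id>
    <title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
  </entry>
</feed>""".encode("utf-8")


def first_title_html(data):
    """Parse with lxml's HTML parser and run the original, unprefixed XPath."""
    root = etree.fromstring(data, etree.HTMLParser())
    return root.xpath("//entry[1]/title/text()")


def first_title_xml(data):
    """Parse with lxml's XML parser and run the namespace-aware XPath."""
    root = etree.fromstring(data, etree.XMLParser())
    return root.xpath("//atom:entry[1]/atom:title/text()", namespaces=ATOM_NS)


html_result = first_title_html(FEED)
xml_result = first_title_xml(FEED)

print("HTMLParser:", html_result)
print("XMLParser: ", xml_result)
```

Printing both results side by side gives a short script that can be attached to an lxml bug report as-is.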

oxivanisher · 2024-05-30T21:44:00Z

I don't know either. But according to Wikipedia, XPath stands for "XML Path Language" ... I also found lots of XML examples without even searching for them... Maybe the library in use is not meant for XML? But that doesn't really make sense either. Let's keep this open for the moment and see what the dev(s) have to say about it.

Jamstah · 2024-05-30T22:19:37Z

By default urlwatch uses the HTMLParser class from lxml etree. My example switches it to the XML parser.

Jamstah · 2024-05-31T09:08:46Z

FYI: https://bugs.launchpad.net/lxml/+bug/2067707
