XML parsing with CDATA not working #817

Open
oxivanisher opened this issue Apr 25, 2024 · 6 comments

Comments

@oxivanisher

oxivanisher commented Apr 25, 2024

I am trying to monitor new Factorio releases, but fields that contain CDATA sections always seem to come back empty.

Factorio publishes new releases in its phpBB forum, which provides an Atom feed. The entry that should work, IMHO, is:

name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath: '//entry[1]/title/text()'

One of the entries looks like this:

<entry>
	<author><name><![CDATA[FactorioBot]]></name></author>
	<updated>2024-04-11T15:29:30</updated>
	<published>2024-04-11T15:29:30</published>
	<id>https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190</id>
	<link href="https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190"/>
	<title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
	<category term="Releases" scheme="https://forums.factorio.com/viewforum.php?f=3" label="Releases"/>
	<content type="html" xml:base="https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190"><![CDATA[
	<strong class="text-strong">Modding</strong>  <ul>    <li>Added an optional "mods" to simulation definitions.</li>  </ul><strong class="text-strong">Scripting</strong>  <ul>    <li>Disabled the majority of the lua "debug" library due to security issues.</li>  </ul><strong class="text-strong">Bugfixes</strong>  <ul>    <li>Fixed LuaEntity::set_request_slot would not accept count of 0. (<a href="https://forums.factorio.com/110676" class="postlink">110676</a>)</li>    <li>Fixed first tutorial level advancing to a wrong story step after drill is set in quickbar. (<a href="https://forums.factorio.com/109315" class="postlink">109315</a>)</li>    <li>Fixed mods sorting order by last highlighted and by last updated. (<a href="https://forums.factorio.com/106420" class="postlink">106420</a>)</li>  </ul>Use the automatic updater if you can (check experimental updates in other settings) or download full installation at <a href="https://www.factorio.com/download/experimental" class="postlink">https://www.factorio.com/download/experimental</a>.<p>Statistics: Posted by <a href="https://forums.factorio.com/memberlist.php?mode=viewprofile&amp;u=7177">FactorioBot</a> — Thu Apr 11, 2024 3:29 pm</p><hr />
	]]></content>
</entry>

I can retrieve every field that does not contain a CDATA section, but none that does. For example, '//entry[1]/id/text()' works without a problem.

@Jamstah
Contributor

Jamstah commented May 29, 2024

This seems to work using the XML parser instead of the HTML one, but you do need to specify the namespace correctly:

name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath:
      path: '//atom:entry[1]/atom:title/text()'
      method: xml
      namespaces:
        atom: 'http://www.w3.org/2005/Atom'
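
To illustrate why the namespace mapping matters once an XML parser is used, here is a minimal sketch. It uses Python's standard-library ElementTree rather than lxml (which urlwatch uses), purely to keep the example dependency-free, and the SAMPLE string is a cut-down, hypothetical stand-in for the real feed. In an Atom document the elements live in the 'http://www.w3.org/2005/Atom' namespace, so an unqualified path such as 'entry/title' matches nothing, while the prefixed path finds the title and returns the CDATA content as ordinary text.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the Atom feed (assumption: the real feed declares
# the Atom namespace as its default namespace, as the config above implies).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
  </entry>
</feed>"""

# Prefix-to-URI mapping, mirroring the "namespaces" key in the filter config.
NS = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(SAMPLE)

# Without a namespace prefix the path does not match any element.
unqualified = root.find("entry/title")

# With the prefix, the first entry's title is found; the XML parser exposes
# the CDATA section as the element's text.
qualified = root.find("atom:entry[1]/atom:title", NS)

print(unqualified)
print(qualified.text)
```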

@oxivanisher
Author

This seems to work using the XML parser instead of the HTML one, but you do need to specify the namespace correctly:

name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath:
      path: '//atom:entry[1]/atom:title/text()'
      method: xml
      namespaces:
        atom: 'http://www.w3.org/2005/Atom'

This works great! So my problem is solved, but I don't know whether the issue should be left open, since it should probably work with xpath as well?

@Jamstah
Contributor

Jamstah commented May 30, 2024

I'm not sure. You're trying to parse XML with an HTML parser. From what I could see, it should work, but it doesn't.

I expect a simple test case using lxml etree on its own would be a good start: open an issue on the lxml bug tracker with sample code and see what happens.

I don't see anything wrong with how urlwatch is using the library, but I'm not an expert.
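
A standalone test case along the lines suggested above might look roughly like the following sketch. It assumes lxml is installed; FEED is a hypothetical, cut-down version of the entry from the original report, and the two helper function names are made up for illustration. It runs the same document through lxml's HTMLParser and XMLParser so the two results can be compared side by side. What the HTML parser returns for the CDATA title may depend on the lxml/libxml2 version, so no expected output is claimed for that branch.

```python
from lxml import etree

# Cut-down stand-in for the feed entry from the report (assumption: the real
# feed uses the Atom namespace as its default namespace).
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>https://forums.factorio.com/viewtopic.php?t=112937</id>
    <title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
  </entry>
</feed>"""

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def title_with_html_parser(data):
    """Parse with the HTML parser and use the unqualified path from the report."""
    root = etree.fromstring(data, etree.HTMLParser())
    return root.xpath("//entry[1]/title/text()")


def title_with_xml_parser(data):
    """Parse with the XML parser and use a namespace-qualified path."""
    root = etree.fromstring(data, etree.XMLParser())
    return root.xpath("//atom:entry[1]/atom:title/text()", namespaces=ATOM)


if __name__ == "__main__":
    print("HTML parser:", title_with_html_parser(FEED))
    print("XML parser: ", title_with_xml_parser(FEED))
```

If the HTML parser line prints an empty list while the XML parser line prints the title, that would reproduce the behaviour described in this issue without urlwatch being involved.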

@oxivanisher
Author

I don't know either. But according to Wikipedia, XPath stands for "XML Path Language" ... I also found lots of XML examples without even searching for them... Maybe the library in use is not designed for XML? But that doesn't really make sense either. Let's keep this open for the moment and see what the dev(s) have to say about it.

@Jamstah
Contributor

Jamstah commented May 30, 2024

By default urlwatch uses the HTMLParser class from lxml etree. My example switches it to the XML parser.

@Jamstah
Contributor

Jamstah commented May 31, 2024
