Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove invalid XML characters from feed. #409

Merged
merged 11 commits into from
Mar 24, 2022
Merged

Conversation

budzanowski
Copy link
Collaborator

Changes proposed in this Pull Request:

Remove invalid XML characters from the XML sections.

Closes #401 .

XML defines a correct ranges for allowed characters https://www.w3.org/TR/REC-xml/#charsets we need to remove characters that are not in the ranges.

Screenshots:

Detailed test instructions:

  1. Configure the plugin
  2. Create a product.
  3. Update the description to contain an invalid XML char. For example copy a char from here https://www.fileformat.info/info/unicode/char/0003/browsertest.htm Mark and copy inside the UTF-8 section - you can't see it but it is there.
  4. Generate the feed.
  5. Open the feed in the browser using the link from Marketing > Pinterest > Catalog - it should open correctly. Browsers validate the XML. For double-checking check that the copied char is not there.
  6. Do the same on develop - the XML should be corrupted.

Changelog entry

Fix - Remove invalid XML characters from the feed.

@budzanowski budzanowski requested a review from a team March 22, 2022 14:34
@budzanowski budzanowski self-assigned this Mar 22, 2022
@budzanowski
Copy link
Collaborator Author

Please ignore UT for now. WooCommerce core is doing something funky.

Copy link
Collaborator

@mikkamp mikkamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The character ranges you linked to also mentions #x10000-#x10FFFF is a valid range. I know they are mostly symbols but shouldn't we allow those as well?

I was able to reproduce the error with the develop branch.
image

However when I repeat it with this branch I get odd results:

<description><![CDATA[Bad XML character in quotes   8220;   8221;   038; some   8216;single  8217; quotes ]]></description>

The original description I added in the product was:

Bad XML character in quotes "�" & some 'single' quotes.

I'm not sure what's causing the quotes to get messed up, but having a look at esc_xml it seems to handle escaping within a CDATA block differently, should we also pass the CDATA block to the esc function?

Copy link
Collaborator

@mikkamp mikkamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting better results after those last changes, but I still noticed two other issues I wanted to mention:

  1. I can still break it by adding the invalid character to the SKU field (WC doesn't let me add it to a shipping name or category name, although I still wonder if we need to escape that in case it has corrupted data in the database).
  2. The output for attributes is using esc_html, that should probably be esc_xml for consistency (or self::sanitize).

@budzanowski
Copy link
Collaborator Author

budzanowski commented Mar 24, 2022

@mikkamp

However when I repeat it with this branch I get odd results:
<description><![CDATA[Bad XML character in quotes 8220; 8221; 038; some 8216;single 8217; quotes ]]><

That is because the original regex was just wrong.

I'm not sure what's causing the quotes to get messed up, but having a look at esc_xml it seems to handle escaping within a CDATA block differently, should we also pass the CDATA block to the esc function?

If CDATA is going to be used then it needs to go through the esc_xml. If esc_xml is used but the CDATA has added afterward we get messed up results. So yep. If there is CDATA we use it.

The character ranges you linked to also mentions #x10000-#x10FFFF

Yes, I had it initially but removed it when I was doing testing. The compilation of regex was failing initially because of it.
It turns out that I need to change the format and add /u to allow large Unicode values.

I can still break it by adding the invalid character to the SKU field (WC doesn't let me add it to a shipping name or category name, although I still wonder if we need to escape that in case it has corrupted data in the database).

Should be OK now.

Do you have any experience with the S regex modifier? It looks like this is something that we could benefit from. Especially on larger catalogs.

image

Copy link
Collaborator

@mikkamp mikkamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional changes and clarifications. I'm getting the expected results in the feed now:

<g:id>72</g:id>
<title><![CDATA[Bad XML]]></title>
<description><![CDATA[Bad XML character in quotes &#8220; &#8221; &#038; some &#8216;single&#8217; quotes. 😀️]]></description>
<g:product_type>East &amp; West</g:product_type>
<link><![CDATA[https://woo-pinterest.test/product/bad-xml/]]></link>
<g:image_link><![CDATA[https://woo-pinterest.test/wp-content/uploads/2021/12/img-autem.png]]></g:image_link>
<g:availability>in stock</g:availability>
<g:price>10.00EUR</g:price>
<g:mpn>bad xml &quot; &quot; &amp; &gt; &lt;</g:mpn>
<g:shipping>
  <g:country>AT</g:country>
  <g:service>Flat &amp; rate</g:service>
  <g:price>10.00 EUR</g:price>
</g:shipping>

I've never used the S modifier, but I had a quick look on the php page and from the comments it explains a bit more what it does, but it seems it's no longer needed for PHP 7.3+ (so we can probably ignore it)
image

@budzanowski budzanowski merged commit 1315c33 into develop Mar 24, 2022
@budzanowski budzanowski deleted the fix/bad-xml-characters branch March 24, 2022 13:37
@jconroy
Copy link
Member

jconroy commented Mar 25, 2022

Thanks @mikkamp and @budzanowski for working through this one 🙇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error with the generated catalog XML
3 participants