-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove invalid XML characters from feed. #409
Conversation
Please ignore UT for now. WooCommerce core is doing something funky. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The character ranges you linked to also mentions #x10000-#x10FFFF
is a valid range. I know they are mostly symbols but shouldn't we allow those as well?
I was able to reproduce the error with the develop branch.
However when I repeat it with this branch I get odd results:
<description><![CDATA[Bad XML character in quotes 8220; 8221; 038; some 8216;single 8217; quotes ]]></description>
The original description I added in the product was:
Bad XML character in quotes "�" & some 'single' quotes.
I'm not sure what's causing the quotes to get messed up, but having a look at esc_xml it seems to handle escaping within a CDATA block differently, should we also pass the CDATA block to the esc function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting better results after those last changes, but I still noticed two other issues I wanted to mention:
- I can still break it by adding the invalid character to the SKU field (WC doesn't let me add it to a shipping name or category name, although I still wonder if we need to escape that in case it has corrupted data in the database).
- The output for attributes is using
esc_html
, that should probably beesc_xml
for consistency (orself::sanitize
).
That is because the original regex was just wrong.
If CDATA is going to be used then it needs to go through the esc_xml. If esc_xml is used but the CDATA has added afterward we get messed up results. So yep. If there is CDATA we use it.
Yes, I had it initially but removed it when I was doing testing. The compilation of regex was failing initially because of it.
Should be OK now. Do you have any experience with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional changes and clarifications. I'm getting the expected results in the feed now:
<g:id>72</g:id>
<title><![CDATA[Bad XML]]></title>
<description><![CDATA[Bad XML character in quotes “ ” & some ‘single’ quotes. 😀️]]></description>
<g:product_type>East & West</g:product_type>
<link><![CDATA[https://woo-pinterest.test/product/bad-xml/]]></link>
<g:image_link><![CDATA[https://woo-pinterest.test/wp-content/uploads/2021/12/img-autem.png]]></g:image_link>
<g:availability>in stock</g:availability>
<g:price>10.00EUR</g:price>
<g:mpn>bad xml " " & > <</g:mpn>
<g:shipping>
<g:country>AT</g:country>
<g:service>Flat & rate</g:service>
<g:price>10.00 EUR</g:price>
</g:shipping>
I've never used the S modifier, but I had a quick look on the php page and from the comments it explains a bit more what it does, but it seems it's no longer needed for PHP 7.3+ (so we can probably ignore it)
Thanks @mikkamp and @budzanowski for working through this one 🙇 |
Changes proposed in this Pull Request:
Remove invalid XML characters from the XML sections.
Closes #401 .
XML defines a correct ranges for allowed characters https://www.w3.org/TR/REC-xml/#charsets we need to remove characters that are not in the ranges.
Screenshots:
Detailed test instructions:
Changelog entry