-
Notifications
You must be signed in to change notification settings - Fork 119
Avoid including itemprop from child itemscopes when using itemref #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #105 +/- ##
==========================================
- Coverage 87.3% 87.15% -0.16%
==========================================
Files 11 11
Lines 457 467 +10
Branches 97 101 +4
==========================================
+ Hits 399 407 +8
- Misses 52 53 +1
- Partials 6 7 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #105 +/- ##
==========================================
+ Coverage 87.3% 87.58% +0.27%
==========================================
Files 11 11
Lines 457 467 +10
Branches 97 99 +2
==========================================
+ Hits 399 409 +10
Misses 52 52
Partials 6 6
Continue to review full report at Codecov.
|
|
I detected some cases not working... Work in progress. |
Codecov Report
@@ Coverage Diff @@
## master #105 +/- ##
==========================================
+ Coverage 87.3% 87.63% +0.32%
==========================================
Files 11 11
Lines 457 469 +12
Branches 97 101 +4
==========================================
+ Hits 399 411 +12
Misses 52 52
Partials 6 6
Continue to review full report at Codecov.
|
|
@lopuhin I'm finished with that. Feel free to change the reviewer if you think other person should be the reviewer, I just added you as it was the github suggestion. |
| "prop3": "REFERENCED TO INCLUDE PROPERTIES AND ALSO INDIVIDUAL PRODUCT", | ||
| "image": [ | ||
| "img-2.jpg", | ||
| "anvil_executive.jpg" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of the images seems reversed to the depth-first traversal order, I wonder if we should care about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure. Looking at that: https://www.w3.org/TR/microdata/#associating-names-with-items I would say that referenced items should be processed after the current item. Maybe is just as easy as changing the order on which
itemref is processed. I did a quick test and only one test fails... I'll check deeper later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me to change the order in which itemref is processed. I have committed a fix.
| { | ||
| "type": "http://schema.org/Product", | ||
| "properties": { | ||
| "prop3": "REFERENCED TO INCLUDE PROPERTIES AND ALSO INDIVIDUAL PRODUCT", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry maybe I'm reading the schema incorrectly, but why do we have a separate product with its properties duplicated also in the product above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The product with id more-properties does not have attribute itemprop. Because of that, when referenced only inner attributes are included. As the product has not been extracted when referencing, it is finally extracted as an individual Product.
Regarding if that should be behaviour or not, I'm not completelly sure. This is the standard definition of itemrefhttps://www.w3.org/TR/microdata/#names:-the-itemprop-attribute but is not very complete.
There is an algorithm definition here: https://www.w3.org/TR/microdata/#associating-names-with-items but I didn't go so deep for this patch as it seems it is not the same approach followed by extruct.
I have just tested this html in Google Dev Tools and it also extracts the product isolated, but it is not including prop3 into the first product.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for checking this @ivanprado . I think the current approach makes sense for extruct.
lopuhin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ivanprado , the changes make sense to me and look great 👍 Left a minor comment but I think everything is fine as-is.
| for prop in node.xpath("id($refid)/descendant-or-self::*[@itemprop]", refid=refid): | ||
| for p, v in self._extract_property( | ||
| prop, items_seen=items_seen, base_url=base_url): | ||
| ref_node = node.xpath("id($refid)[1]", refid=refid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, if there were several elements with the same id, they were all extracted, but now just the first one is extracted? I realise that there supposed to be only one element with each id so this seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see... I went to standard: https://www.w3.org/TR/html401/struct/global.html#h-7.5.2
id = name [CS]
This attribute assigns a name to an element. This name must be unique in a document.
So I think is safe to expect just one.
|
Looks good to me! Thank you for fixing this @ivanprado :) |
|
Thanks @ivanprado and @Kebniss |
Extruct deals wrongly with elements referenced by
itemrefif the referenced element also include other items defined inside withitemscope. For example:Is extracted as (see the duplicate name and the extra image):
[{'@context': 'http://schema.org', '@type': 'Product', 'image': ['img2.jpg', 'rel-prod-1.jpg', 'img1.jpg'], 'name': ['REL PROD 1', 'Executive Anvil'], 'related_products': {'@type': 'Product', 'image': 'rel-prod-1.jpg', 'name': 'REL PROD 1'}}]This PR patchs the code to solve this problem, so that the extracted result is:
[{'@context': 'http://schema.org', '@type': 'Product', 'image': ['img2.jpg', 'img1.jpg'], 'name': 'Executive Anvil', 'related_products': {'@type': 'Product', 'image': 'rel-prod-1.jpg', 'name': 'REL PROD 1'}}]I patched function
_extract_property_refs()so that properties in a different scope than the referenced element are skipped. I have passed the tests and they pass.