Skip to content

Conversation

@ivanprado
Copy link
Contributor

Extruct deals wrongly with elements referenced by itemref if the referenced element also include other items defined inside with itemscope. For example:

<html>
<body>
<div id="product" itemscope itemtype="http://schema.org/Product" itemref="other-product-properties">
    <span itemprop="name">Executive Anvil</span>
    <img itemprop="image" src="img1.jpg"/>
</div>
<div id="other-product-properties">
    <img itemprop="image" src="img2.jpg"/>
    <div itemscope itemtype="http://schema.org/Product" itemprop="related_products">
        <span itemprop="name">REL PROD 1</span>
        <img itemprop="image" src="rel-prod-1.jpg">
    </div>
</div>
</body>
</html>

Is extracted as (see the duplicate name and the extra image):

[{'@context': 'http://schema.org',
  '@type': 'Product',
  'image': ['img2.jpg', 'rel-prod-1.jpg', 'img1.jpg'],
  'name': ['REL PROD 1', 'Executive Anvil'],
  'related_products': {'@type': 'Product',
                       'image': 'rel-prod-1.jpg',
                       'name': 'REL PROD 1'}}]

This PR patchs the code to solve this problem, so that the extracted result is:

[{'@context': 'http://schema.org',
  '@type': 'Product',
  'image': ['img2.jpg', 'img1.jpg'],
  'name': 'Executive Anvil',
  'related_products': {'@type': 'Product',
                       'image': 'rel-prod-1.jpg',
                       'name': 'REL PROD 1'}}]

I patched function _extract_property_refs() so that properties in a different scope than the referenced element are skipped. I have passed the tests and they pass.

@ivanprado ivanprado requested a review from lopuhin February 6, 2019 14:49
@codecov
Copy link

codecov bot commented Feb 6, 2019

Codecov Report

Merging #105 into master will decrease coverage by 0.15%.
The diff coverage is 81.81%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #105      +/-   ##
==========================================
- Coverage    87.3%   87.15%   -0.16%     
==========================================
  Files          11       11              
  Lines         457      467      +10     
  Branches       97      101       +4     
==========================================
+ Hits          399      407       +8     
- Misses         52       53       +1     
- Partials        6        7       +1
Impacted Files Coverage Δ
extruct/w3cmicrodata.py 97.29% <81.81%> (-1.72%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab5592...258664f. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 6, 2019

Codecov Report

Merging #105 into master will increase coverage by 0.27%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #105      +/-   ##
==========================================
+ Coverage    87.3%   87.58%   +0.27%     
==========================================
  Files          11       11              
  Lines         457      467      +10     
  Branches       97       99       +2     
==========================================
+ Hits          399      409      +10     
  Misses         52       52              
  Partials        6        6
Impacted Files Coverage Δ
extruct/w3cmicrodata.py 99.09% <100%> (+0.08%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab5592...b0d55ac. Read the comment docs.

@ivanprado
Copy link
Contributor Author

I detected some cases not working... Work in progress.

@ivanprado ivanprado changed the title Avoid including itemprop from child itemscopes when using itemref [WIP] Avoid including itemprop from child itemscopes when using itemref Feb 6, 2019
@ivanprado ivanprado changed the title [WIP] Avoid including itemprop from child itemscopes when using itemref Avoid including itemprop from child itemscopes when using itemref Feb 6, 2019
@codecov-io
Copy link

codecov-io commented Feb 6, 2019

Codecov Report

Merging #105 into master will increase coverage by 0.32%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #105      +/-   ##
==========================================
+ Coverage    87.3%   87.63%   +0.32%     
==========================================
  Files          11       11              
  Lines         457      469      +12     
  Branches       97      101       +4     
==========================================
+ Hits          399      411      +12     
  Misses         52       52              
  Partials        6        6
Impacted Files Coverage Δ
extruct/w3cmicrodata.py 99.11% <100%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab5592...e655b0a. Read the comment docs.

@ivanprado
Copy link
Contributor Author

@lopuhin I'm finished with that. Feel free to change the reviewer if you think other person should be the reviewer, I just added you as it was the github suggestion.

"prop3": "REFERENCED TO INCLUDE PROPERTIES AND ALSO INDIVIDUAL PRODUCT",
"image": [
"img-2.jpg",
"anvil_executive.jpg"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of the images seems reversed to the depth-first traversal order, I wonder if we should care about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure. Looking at that: https://www.w3.org/TR/microdata/#associating-names-with-items I would say that referenced items should be processed after the current item. Maybe is just as easy as changing the order on which
itemref is processed. I did a quick test and only one test fails... I'll check deeper later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me to change the order in which itemref is processed. I have committed a fix.

{
"type": "http://schema.org/Product",
"properties": {
"prop3": "REFERENCED TO INCLUDE PROPERTIES AND ALSO INDIVIDUAL PRODUCT",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry maybe I'm reading the schema incorrectly, but why do we have a separate product with its properties duplicated also in the product above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The product with id more-properties does not have attribute itemprop. Because of that, when referenced only inner attributes are included. As the product has not been extracted when referencing, it is finally extracted as an individual Product.

Regarding if that should be behaviour or not, I'm not completelly sure. This is the standard definition of itemrefhttps://www.w3.org/TR/microdata/#names:-the-itemprop-attribute but is not very complete.

There is an algorithm definition here: https://www.w3.org/TR/microdata/#associating-names-with-items but I didn't go so deep for this patch as it seems it is not the same approach followed by extruct.

I have just tested this html in Google Dev Tools and it also extracts the product isolated, but it is not including prop3 into the first product.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for checking this @ivanprado . I think the current approach makes sense for extruct.

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ivanprado , the changes make sense to me and look great 👍 Left a minor comment but I think everything is fine as-is.

for prop in node.xpath("id($refid)/descendant-or-self::*[@itemprop]", refid=refid):
for p, v in self._extract_property(
prop, items_seen=items_seen, base_url=base_url):
ref_node = node.xpath("id($refid)[1]", refid=refid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, if there were several elements with the same id, they were all extracted, but now just the first one is extracted? I realise that there supposed to be only one element with each id so this seems fine.

Copy link
Contributor Author

@ivanprado ivanprado Feb 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see... I went to standard: https://www.w3.org/TR/html401/struct/global.html#h-7.5.2

id = name [CS]
This attribute assigns a name to an element. This name must be unique in a document.

So I think is safe to expect just one.

@lopuhin
Copy link
Member

lopuhin commented Feb 12, 2019

@Kebniss @kmike I think this is ready to merge, would you like to have a look?

@Kebniss
Copy link
Contributor

Kebniss commented Feb 13, 2019

Looks good to me! Thank you for fixing this @ivanprado :)

@lopuhin lopuhin merged commit b99b14d into scrapinghub:master Feb 14, 2019
@lopuhin
Copy link
Member

lopuhin commented Feb 14, 2019

Thanks @ivanprado and @Kebniss

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants