Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reverse priorities for repeated properties in uniform format for opengraph #115

Merged
merged 4 commits into from
Jun 7, 2019

Conversation

ivanprado
Copy link
Contributor

Some pages contain a duplicated definition of some properties like "og:image". See the following pages:
https://nerdist.com/article/star-wars-cast-reylo-episode-ix/
https://cleantechnica.com/2019/04/16/fukushimas-final-costs-will-approach-one-trillion-dollars-just-for-nuclear-disaster/

Extruct default behaviour seems to be keep the last one, meanwhile Facebook default behaviour seems to be keep the first one according to results at the developer tool (see https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fnerdist.com%2Farticle%2Fstar-wars-cast-reylo-episode-ix%2F for an example).

Extruct should mimic Facebook behaviour so this PR is reverting the priorities when flattening OpenGraph properties.

@ivanprado ivanprado requested a review from lopuhin June 7, 2019 10:12
@ivanprado
Copy link
Contributor Author

@lopuhin I see two Travis runs, one is succeeding and one is failing. I don't understand, what is the difference between both? (save for codecov)

@lopuhin
Copy link
Member

lopuhin commented Jun 7, 2019

@ivanprado this looks like a genuine failure on python 3.4 (likely due to dictionary iteration order or something similar). Not sure what does this "push" build represent, looking...

@lopuhin
Copy link
Member

lopuhin commented Jun 7, 2019

@lopuhin
Copy link
Member

lopuhin commented Jun 7, 2019

So it could be that the failure is random, and one build was less lucky

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ivanprado that's a great catch! If I read the tests correctly, we don't test whether properties are reversed or not, right? If this is correct, could you please add such a test?

@codecov
Copy link

codecov bot commented Jun 7, 2019

Codecov Report

Merging #115 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #115   +/-   ##
=======================================
  Coverage   87.63%   87.63%           
=======================================
  Files          11       11           
  Lines         469      469           
  Branches      101      101           
=======================================
  Hits          411      411           
  Misses         52       52           
  Partials        6        6
Impacted Files Coverage Δ
extruct/uniform.py 93.22% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de219cb...f987d9a. Read the comment docs.

@ivanprado
Copy link
Contributor Author

I see now the problem. It seems that PyRDFA is returning repeated properties in an inconsistent ordering.

@ivanprado
Copy link
Contributor Author

@lopuhin thank you for your insights, very useful. It seems that the failure was caused because RDFa returns inconsistent ordering with duplicated properties. I have a filled an issue, created an xfail test and patched test so that is is not failing.

@lopuhin
Copy link
Member

lopuhin commented Jun 7, 2019

Great, thanks for taking care of this issue @ivanprado 👍

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the fix and for taking care of the test and extra issues that arose @ivanprado

@lopuhin lopuhin merged commit 713f70b into master Jun 7, 2019
@lopuhin lopuhin deleted the opengraph_uniform_reversed_precedence branch June 7, 2019 12:54
@kmike
Copy link
Member

kmike commented Jun 10, 2019

I think this is not the only thing we need to do, as it breaks extraction on some websites. It seems we need to take first non-empty result. E.g. on https://www.triganostore.com/tente-de-camping-raclet-bora-4.html there are two og:description values, the first one is empty. https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fwww.triganostore.com%2Ftente-de-camping-raclet-bora-4.html shows that a non-empty one is extracted. Opened #117 for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants