Reverse priorities for repeated properties in uniform format for opengraph #115
@lopuhin I see two Travis runs: one succeeds and one fails. I don't understand what the difference between them is (apart from Codecov).
@ivanprado this looks like a genuine failure on Python 3.4 (likely due to dictionary iteration order or something similar). I'm not sure what this "push" build represents, looking...
Aha, this is the answer: https://stackoverflow.com/questions/41524861/travis-failed-push-but-passed-pr/41534988
So it could be that the failure is random, and one build was less lucky.
Thanks @ivanprado that's a great catch! If I read the tests correctly, we don't test whether properties are reversed or not, right? If this is correct, could you please add such a test?
Codecov Report
@@           Coverage Diff           @@
##           master    #115    +/-   ##
=======================================
  Coverage   87.63%   87.63%
=======================================
  Files          11       11
  Lines         469      469
  Branches      101      101
=======================================
  Hits          411      411
  Misses         52       52
  Partials        6        6
I see the problem now. It seems that PyRDFA returns repeated properties in an inconsistent order.
@lopuhin thank you for your insights, very useful. It seems the failure happened because RDFa returns duplicated properties in an inconsistent order. I have filed an issue, created an xfail test, and patched the test so that it no longer fails.
Great, thanks for taking care of this issue @ivanprado 👍
Great, thanks for the fix and for taking care of the test and extra issues that arose @ivanprado
I think this is not the only thing we need to do, as it breaks extraction on some websites. It seems we need to take the first non-empty result. E.g. on https://www.triganostore.com/tente-de-camping-raclet-bora-4.html there are two
Some pages define certain properties, such as "og:image", more than once. See the following pages:
https://nerdist.com/article/star-wars-cast-reylo-episode-ix/
https://cleantechnica.com/2019/04/16/fukushimas-final-costs-will-approach-one-trillion-dollars-just-for-nuclear-disaster/
Extruct's default behaviour seems to be to keep the last one, whereas Facebook's default behaviour seems to be to keep the first one, according to the results in its developer tool (see https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fnerdist.com%2Farticle%2Fstar-wars-cast-reylo-episode-ix%2F for an example).
Extruct should mimic Facebook's behaviour, so this PR reverses the priorities when flattening OpenGraph properties.
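To make the difference between the strategies discussed in this thread concrete, here is a minimal sketch. It is not extruct's actual code: the function names, the `(name, value)` pair representation, and the example URLs are all assumptions made for illustration. It contrasts "last one wins", "first one wins" (the behaviour this PR aims for), and the "first non-empty result" variant suggested in the review.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical representation: OpenGraph properties as (name, value) pairs
# in document order. This is an assumption, not extruct's internal format.
Property = Tuple[str, Optional[str]]


def flatten_last_wins(properties: Iterable[Property]) -> Dict[str, Optional[str]]:
    """Later duplicates overwrite earlier ones (the old behaviour described above)."""
    return dict(properties)


def flatten_first_wins(properties: Iterable[Property]) -> Dict[str, Optional[str]]:
    """Keep the first value seen for each property name."""
    flat: Dict[str, Optional[str]] = {}
    for name, value in properties:
        # setdefault only writes when the name is not present yet,
        # so later duplicates never replace the first occurrence.
        flat.setdefault(name, value)
    return flat


def flatten_first_non_empty(properties: Iterable[Property]) -> Dict[str, Optional[str]]:
    """Keep the first value that is not empty or whitespace-only."""
    flat: Dict[str, Optional[str]] = {}
    for name, value in properties:
        # Overwrite only while the stored value is still missing or blank.
        if not (flat.get(name) or "").strip():
            flat[name] = value
    return flat


# Made-up example data: an empty og:image followed by two real ones.
EXAMPLE_PAIRS = [
    ("og:image", ""),
    ("og:image", "https://example.com/a.png"),
    ("og:image", "https://example.com/b.png"),
    ("og:title", "Hello"),
]
```

With `EXAMPLE_PAIRS`, `flatten_last_wins` picks `b.png`, `flatten_first_wins` picks the empty string, and `flatten_first_non_empty` picks `a.png`. That is the kind of case the review comment points at: taking the first occurrence blindly can return an empty value when a page emits a blank tag before the real one.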