[MRG+2] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters #85

Merged
merged 10 commits into from
Aug 8, 2018


shiquanwang
Contributor

Please see #84 for details.

Member

@lopuhin lopuhin left a comment

Thanks @shiquanwang, the fix looks good to me (I only noticed that "shoul" is likely a typo, missing a "d"). However, the build failed on Travis. Could you please check the failures? At first sight it looks like they could be caused by this change.

@shiquanwang
Contributor Author

shiquanwang commented Aug 1, 2018

@lopuhin Yes it's a typo, thanks. I've fixed it.

I looked into how the tests work. I switched to the master branch, ran the tests, and got the same errors as on Travis. Could you please check:

(extructdev_py36) Shiquans-MBP:extruct shiquan$ git branch
* master
  scrapinghub/extruct#84
(extructdev_py36) Shiquans-MBP:extruct shiquan$ export TOXENV=py36
(extructdev_py36) Shiquans-MBP:extruct shiquan$ tox
GLOB sdist-make: /Users/shiquan/workspace/projects/github.com/shiquanwang/extruct/setup.py
py36 inst-nodeps: /Users/shiquan/workspace/projects/github.com/shiquanwang/extruct/.tox/dist/extruct-0.5.0.zip
py36 installed: atomicwrites==1.1.5,attrs==18.1.0,beautifulsoup4==4.6.1,bottle==0.12.13,certifi==2018.4.16,chardet==3.0.4,coverage==4.5.1,extruct==0.5.0,gevent==1.3.5,greenlet==0.4.14,html5lib==1.0.1,idna==2.7,isodate==0.6.0,lxml==4.2.3,mf2py==1.1.1,mock==2.0.0,more-itertools==4.3.0,pbr==4.2.0,pluggy==0.7.1,py==1.5.4,pyparsing==2.2.0,pytest==3.7.0,pytest-cov==2.5.1,rdflib==4.2.2,rdflib-jsonld==0.4.0,requests==2.19.1,six==1.11.0,urllib3==1.23,w3lib==1.19.0,webencodings==0.5.1
py36 runtests: PYTHONHASHSEED='2226345225'
py36 runtests: commands[0] | py.test --cov-report=term --cov-report= --cov=extruct extruct tests
============================================================================ test session starts ============================================================================
platform darwin -- Python 3.6.6, pytest-3.7.0, py-1.5.4, pluggy-0.7.1
rootdir: /Users/shiquan/workspace/projects/github.com/shiquanwang/extruct, inifile:
plugins: cov-2.5.1
collected 53 items

tests/test_extruct.py .....                                                                                                                                           [  9%]
tests/test_extruct_uniform.py ..F.                                                                                                                                    [ 16%]
tests/test_jsonld.py ...                                                                                                                                              [ 22%]
tests/test_microdata.py ..................                                                                                                                            [ 56%]
tests/test_microformat.py F                                                                                                                                           [ 58%]
tests/test_opengraph.py .                                                                                                                                             [ 60%]
tests/test_rdfa.py .....                                                                                                                                              [ 69%]
tests/test_tool.py ..........                                                                                                                                         [ 88%]
tests/test_uniform.py ....F.                                                                                                                                          [100%]

================================================================================= FAILURES ==================================================================================
_______________________________________________________________________ TestFlatten.test_microformat ________________________________________________________________________

self = <tests.test_extruct_uniform.TestFlatten testMethod=test_microformat>

    def test_microformat(self):
        body = get_testdata('misc', 'microformat_test.html')
        expected = json.loads(get_testdata('misc', 'microformat_flat_test.json').decode('UTF-8'))
        data = extruct.extract(body, uniform=True)
>       self.assertEqual(jsonize_dict(data['microformat']), expected)
E       AssertionError: Lists differ: [{'@type': ['h-hidden-phone', 'h-hidden-tablet'], '@context': '[798 chars]'}]}] != [{'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': [''],[798 chars]0']}]
E
E       First differing element 0:
E       {'@type': ['h-hidden-phone', 'h-hidden-tablet'], '@context': '[40 chars]['']}
E       {'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': [''],[40 chars]ki/'}
E
E         [{'@context': 'http://microformats.org/wiki/',
E       -   '@type': ['h-hidden-phone', 'h-hidden-tablet'],
E       +   '@type': ['h-hidden-tablet', 'h-hidden-phone'],
E           'name': ['']},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-hidden-phone'],
E       -   'children': [{'@type': ['h-hidden-phone', 'h-hidden-tablet'], 'name': ['']},
E       ?                           ------------------
E
E       +   'children': [{'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': ['']},
E       ?                                            ++++++++++++++++++
E
E                        {'@type': ['h-hidden-phone'],
E                         'name': ['aJ Styles FastLane 2018 15 x 17 Framed Plaque w/ '
E                                  'Ring Canvas'],
E                         'photo': ['/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']}]},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-entry'],
E           'author': [{'@type': ['h-card'],
E                       'name': ['W. Developer'],
E                       'url': ['http://example.com'],
E                       'value': 'W. Developer'}],
E           'content': [{'html': '<p>Blah blah blah</p>', 'value': 'Blah blah blah'}],
E           'name': ['Microformats are amazing'],
E           'published': ['2013-06-13 12:00:00'],
E           'summary': ['In which I extoll the virtues of using microformats.']}]

tests/test_extruct_uniform.py:29: AssertionError
_____________________________________________________________________ TestMicroformat.test_microformat ______________________________________________________________________

self = <tests.test_microformat.TestMicroformat testMethod=test_microformat>

    def test_microformat(self):
        body = get_testdata('misc', 'microformat_test.html')
        expected = json.loads(get_testdata('misc', 'microformat_test.json').decode('UTF-8'))

        opengraphe = MicroformatExtractor()
        data = opengraphe.extract(body)
>       self.assertEqual(jsonize_dict(data), expected)
E       AssertionError: Lists differ: [{'type': ['h-hidden-phone', 'h-hidden-table[774 chars]}]}}] != [{'properties': {'name': ['']}, 'type': ['h-[774 chars]y']}]
E
E       First differing element 0:
E       {'type': ['h-hidden-phone', 'h-hidden-tablet'], 'properties': {'name': ['']}}
E       {'properties': {'name': ['']}, 'type': ['h-hidden-tablet', 'h-hidden-phone']}
E
E       - [{'properties': {'name': ['']}, 'type': ['h-hidden-phone', 'h-hidden-tablet']},
E       ?                                          ------------------
E
E       + [{'properties': {'name': ['']}, 'type': ['h-hidden-tablet', 'h-hidden-phone']},
E       ?                                                           ++++++++++++++++++
E
E          {'children': [{'properties': {'name': ['']},
E       -                 'type': ['h-hidden-phone', 'h-hidden-tablet']},
E       +                 'type': ['h-hidden-tablet', 'h-hidden-phone']},
E                        {'properties': {'name': ['aJ Styles FastLane 2018 15 x 17 '
E                                                 'Framed Plaque w/ Ring Canvas'],
E                                        'photo': ['/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']},
E                         'type': ['h-hidden-phone']}],
E           'properties': {},
E           'type': ['h-hidden-phone']},
E          {'properties': {'author': [{'properties': {'name': ['W. Developer'],
E                                                     'url': ['http://example.com']},
E                                      'type': ['h-card'],
E                                      'value': 'W. Developer'}],
E                          'content': [{'html': '<p>Blah blah blah</p>',
E                                       'value': 'Blah blah blah'}],
E                          'name': ['Microformats are amazing'],
E                          'published': ['2013-06-13 12:00:00'],
E                          'summary': ['In which I extoll the virtues of using '
E                                      'microformats.']},
E           'type': ['h-entry']}]

tests/test_microformat.py:19: AssertionError
_______________________________________________________________________ TestUniform.test_umicroformat _______________________________________________________________________

self = <tests.test_uniform.TestUniform testMethod=test_umicroformat>

    def test_umicroformat(self):
        expected = [ { '@context': 'http://microformats.org/wiki/',
                     '@type': ['h-hidden-tablet', 'h-hidden-phone'],
                     'name': ['']},
                   { '@context': 'http://microformats.org/wiki/',
                     '@type': ['h-hidden-phone'],
                     'children': [ { '@type': [ 'h-hidden-tablet',
                                                'h-hidden-phone'],
                                     'name': ['']},
                                   { '@type': ['h-hidden-phone'],
                                     'name': [ 'aJ Styles FastLane 2018 15 x '
                                               '17 Framed Plaque w/ Ring '
                                               'Canvas'],
                                     'photo': [ '/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']}],
                   },
                   { '@context': 'http://microformats.org/wiki/',
                     '@type': ['h-entry'],
                     'author': [ { '@type': ['h-card'],
                                   'name': ['W. Developer'],
                                   'url': ['http://example.com'],
                                   'value': 'W. Developer'}],
                     'content': [ { 'html': '<p>Blah blah blah</p>',
                                    'value': 'Blah blah blah'}],
                     'name': ['Microformats are amazing'],
                     'published': ['2013-06-13 12:00:00'],
                     'summary': [ 'In which I extoll the virtues of using '
                                  'microformats.']}]
        body = get_testdata('misc', 'microformat_test.html')
        data = extruct.extract(body, syntaxes=['microformat'], uniform=True)
>       self.assertEqual(data['microformat'], expected)
E       AssertionError: Lists differ: [{'@type': ['h-hidden-phone', 'h-hidden-table[816 chars]'}]}] != [{'@context': 'http://microformats.org/wiki/'[816 chars].']}]
E
E       First differing element 0:
E       {'@type': ['h-hidden-phone', 'h-hidden-table[58 chars]['']}
E       {'@context': 'http://microformats.org/wiki/'[58 chars]['']}
E
E         [{'@context': 'http://microformats.org/wiki/',
E       -   '@type': ['h-hidden-phone', 'h-hidden-tablet'],
E       +   '@type': ['h-hidden-tablet', 'h-hidden-phone'],
E           'name': ['']},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-hidden-phone'],
E       -   'children': [{'@type': ['h-hidden-phone', 'h-hidden-tablet'], 'name': ['']},
E       ?                           ------------------
E
E       +   'children': [{'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': ['']},
E       ?                                            ++++++++++++++++++
E
E                        {'@type': ['h-hidden-phone'],
E                         'name': ['aJ Styles FastLane 2018 15 x 17 Framed Plaque w/ '
E                                  'Ring Canvas'],
E                         'photo': ['/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']}]},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-entry'],
E           'author': [{'@type': ['h-card'],
E                       'name': ['W. Developer'],
E                       'url': ['http://example.com'],
E                       'value': 'W. Developer'}],
E           'content': [{'html': '<p>Blah blah blah</p>', 'value': 'Blah blah blah'}],
E           'name': ['Microformats are amazing'],
E           'published': ['2013-06-13 12:00:00'],
E           'summary': ['In which I extoll the virtues of using microformats.']}]

tests/test_uniform.py:59: AssertionError

---------- coverage: platform darwin, python 3.6.6-final-0 -----------
Name                      Stmts   Miss Branch BrPart  Cover
-----------------------------------------------------------
extruct/__init__.py           7      0      0      0   100%
extruct/__main__.py           4      4      2      0     0%
extruct/_extruct.py          53      8     33      2    84%
extruct/jsonld.py            24      0      6      1    97%
extruct/microformat.py        7      0      2      0   100%
extruct/opengraph.py         20      0      8      0   100%
extruct/rdfa.py              23      0      0      0   100%
extruct/tool.py              25      0      0      0   100%
extruct/uniform.py           59      2     34      2    94%
extruct/w3cmicrodata.py     102      0     63      1    99%
extruct/xmldom.py           108     29     28      0    65%
-----------------------------------------------------------
TOTAL                       432     43    176      6    88%

==================================================================== 3 failed, 50 passed in 2.90 seconds ====================================================================
ERROR: InvocationError for command '/Users/shiquan/workspace/projects/github.com/shiquanwang/extruct/.tox/py36/bin/py.test --cov-report=term --cov-report= --cov=extruct extruct tests' (exited with code 1)
__________________________________________________________________________________ summary __________________________________________________________________________________
ERROR:   py36: commands failed

@shiquanwang
Contributor Author

shiquanwang commented Aug 1, 2018

@lopuhin

All the errors show that the expected and returned lists differ only in order:

  • expected: ['h-hidden-tablet', 'h-hidden-phone']
  • returned: ['h-hidden-phone', 'h-hidden-tablet']

The cause is that mf2py returns a sorted list: check HERE
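To make the failure mode concrete, here is a minimal sketch using the two lists quoted above. It assumes only what the comment states, namely that the returned list is sorted; the variable names are illustrative.

```python
# Order stored in the test fixtures vs. order reported in the failing run.
expected = ['h-hidden-tablet', 'h-hidden-phone']
returned = ['h-hidden-phone', 'h-hidden-tablet']

# Python lists compare element by element, so a different order means "not equal".
lists_equal = expected == returned

# Sorting the expected list reproduces the returned order exactly.
same_after_sort = sorted(expected) == returned
```

This is why assertEqual reports "Lists differ" even though both lists hold the same two types.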

@codecov

codecov bot commented Aug 1, 2018

Codecov Report

Merging #85 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master      #85   +/-   ##
=======================================
  Coverage   88.26%   88.26%           
=======================================
  Files          10       10           
  Lines         426      426           
  Branches       88       88           
=======================================
  Hits          376      376           
  Misses         44       44           
  Partials        6        6

Impacted Files       Coverage Δ
extruct/jsonld.py    95.83% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79dff34...d839faa.

@shiquanwang
Contributor Author

shiquanwang commented Aug 1, 2018

@lopuhin
I tried to improve coverage by adding some tests.
However, json.JSONDecodeError first appeared in Python 3.5, which causes the tests to fail on py2.7 and py3.4.
I wasn't sure how to fix this properly.

In the end I used a second-level try/except with ValueError, since json.JSONDecodeError is a subclass of ValueError. That keeps both the unit tests and the coverage checks happy.
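A minimal sketch of why catching ValueError is enough. The helper name loads_or_none is hypothetical and is not taken from the PR; it only illustrates the exception relationship described above.

```python
import json

def loads_or_none(text):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        # On Python 3.5+ this also catches json.JSONDecodeError, which subclasses
        # ValueError; on 2.7 and 3.4 json.loads raises a plain ValueError.
        return None

# getattr keeps this line importable on interpreters without JSONDecodeError.
is_subclass = issubclass(getattr(json, 'JSONDecodeError', ValueError), ValueError)
```

Because the except clause names only ValueError, the same code runs unchanged on every interpreter in the test matrix.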

Shiquan Wang added 2 commits August 1, 2018 12:36
Member

@lopuhin lopuhin left a comment

Thanks @shiquanwang , looks great!

mf2py errors are indeed caused by the new version, master build failed with the same error https://travis-ci.org/scrapinghub/extruct/jobs/389836540 - thanks for fixing them too! 👍

Using ValueError instead of json.JSONDecodeError looks good to me (in general you could try building a tuple of caught exceptions outside the function, trying to use json.JSONDecodeError and falling back to ValueError if it's not defined, but in this case ValueError works great IMO).
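The alternative mentioned in parentheses could look roughly like this. It is a sketch only: JSON_DECODE_ERRORS and try_loads are hypothetical names, not code from extruct.

```python
import json

# Build the tuple of exceptions once, at import time, outside any function.
try:
    JSON_DECODE_ERRORS = (json.JSONDecodeError,)   # Python 3.5+
except AttributeError:
    JSON_DECODE_ERRORS = (ValueError,)             # older interpreters

def try_loads(text):
    """Return (data, None) on success or (None, exception) on a decode error."""
    try:
        return json.loads(text), None
    except JSON_DECODE_ERRORS as exc:
        return None, exc
```

The trade-off is a few extra lines in exchange for a narrower except clause on newer Python versions.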

@lopuhin lopuhin changed the title from "Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters" to "[MRG+1] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters" on Aug 1, 2018
@lopuhin
Member

lopuhin commented Aug 1, 2018

Hey @kmike @Kebniss this PR looks good to me, do you mind if we merge it, or would you like to have another look at it?

@kmike
Member

kmike commented Aug 1, 2018

There are two other PRs handling the same issue: #57, #58. Questions:

  1. Should we parse data multiple times? In this PR data can be parsed up to 3 times; in #57 (Allow control characters inside JSON-LD strings) and #58 (Update jsonld.py, fallback to strict=False if err) it is two times.
  2. Does strict=False work together with HTML_OR_JS_COMMENTLINE replacement? Here: no; #57: yes; #58: yes.
  3. Can the user force strict=True/False? Here: no; #57: yes; #58: no.

If we provide an option to use strict=False, and have it ON by default (like in #57), we may try parsing with strict=False from the beginning (currently none of PRs does that) - why should it be a fallback, is there any downside of doing it by default? Also, I think it makes sense for this feature to work with HTML comment stripping - in the current PR it is either stripping the comments or using strict=False, but not both.

I don't have a strong opinion on whether it should be configurable.
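For context on what strict=False changes: in Python's standard json module, the default strict mode rejects control characters (such as a raw newline or tab) inside strings, while strict=False accepts them. A minimal sketch with made-up data:

```python
import json

# The \n here is a real newline character inside the JSON string value,
# which is a control character as far as the JSON decoder is concerned.
raw = '{"name": "line one\nline two"}'

try:
    json.loads(raw)            # strict=True is the default
    strict_ok = True
except ValueError:
    strict_ok = False

# The same input parses once strict checking is turned off.
lenient = json.loads(raw, strict=False)
```

Valid JSON parses the same way under both settings, so strict=False only widens what is accepted.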

@lopuhin
Member

lopuhin commented Aug 1, 2018

> If we provide an option to use strict=False, and have it ON by default (like in #57), we may try parsing with strict=False from the beginning (currently none of PRs does that) - why should it be a fallback, is there any downside of doing it by default? Also, I think it makes sense for this feature to work with HTML comment stripping - in the current PR it is either stripping the comments or using strict=False, but not both.

Great points, agree to both! I think we don't need for it to be configurable, but I'm not against it.

@shiquanwang
Contributor Author

shiquanwang commented Aug 2, 2018

Thanks @kmike for your great points.

Could you first merge any of the PRs (this one, #57, or #58) that solves the control character issue, since that issue causes parsing errors?

We can then add some TODOs in the function to note that better approaches can be considered later.

I'm now in favor of calling data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script), strict=False) directly:

  • it tries to remove leading comments, if there are any
  • strict=False allows control characters; I don't see any downside to making this the default

I'll make that change.

@kmike
Member

kmike commented Aug 2, 2018

@shiquanwang I think removing comments is a much more destructive operation, since it can remove some content from the JSON data (e.g. when a text looking like a comment is inside a value). I'd still prefer comment removal to be only a fallback.
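The concern can be seen with a small sketch. The COMMENT pattern below is a hypothetical stand-in, not extruct's actual HTML_OR_JS_COMMENTLINE, and the sample JSON is made up.

```python
import json
import re

# Hypothetical comment-stripping pattern (an assumption for illustration;
# extruct's HTML_OR_JS_COMMENTLINE may be defined differently).
COMMENT = re.compile(r'<!--.*?-->')

# Valid JSON whose value happens to contain text that looks like an HTML comment.
script = '{"description": "Use <!-- more --> to split a post"}'

parsed_as_is = json.loads(script)
parsed_after_strip = json.loads(COMMENT.sub('', script))

# Stripping first silently changes the extracted value.
value_changed = parsed_as_is != parsed_after_strip
```

Both parses succeed, so nothing signals that the second one lost data, which is the argument for stripping comments only after a plain parse has failed.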

@shiquanwang
Contributor Author

@kmike
Good point. I didn't notice that. I've made changes accordingly.

@kmike
Member

kmike commented Aug 3, 2018

Thanks @shiquanwang! The implementation looks good to me now. Would you mind adding a couple more test cases, to check the behavior you've covered (html comment inside json value, html comment + control character)?
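The two requested cases could be checked along these lines. This is a sketch, not the tests added in the PR: parse_json_ld and COMMENT_LINE are hypothetical names, and the pattern is an assumed stand-in for HTML_OR_JS_COMMENTLINE.

```python
import json
import re

# Assumed stand-in: matches one leading HTML or JS comment at the start of the text.
COMMENT_LINE = re.compile(r'^\s*(//.*|<!--.*?-->)')

def parse_json_ld(script):
    """Try lenient parsing first; strip a leading comment only as a fallback."""
    try:
        return json.loads(script, strict=False)
    except ValueError:
        return json.loads(COMMENT_LINE.sub('', script), strict=False)

# Case 1: an HTML comment inside a JSON value must survive untouched.
inside_value = parse_json_ld('{"text": "a <!-- b --> c"}')

# Case 2: a leading HTML comment plus a control character (raw tab) in a value.
comment_and_control = parse_json_ld('<!-- lead -->\n{"text": "a\tb"}')
```

In case 1 the first json.loads call succeeds, so the fallback never runs; in case 2 the first call fails on the leading comment and the fallback handles both the comment and the tab.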

@shiquanwang
Contributor Author

@kmike Tests are added.
I hope this can be merged soon.

@kmike
Member

kmike commented Aug 6, 2018

Looks good @shiquanwang! However, would you mind bringing old, simpler tests back as well? Having tests for basic behavior helps with debugging / isolating an issue, even if there are complex tests which cover more edge cases.

@shiquanwang
Contributor Author

Hi @kmike .
I didn't remove old tests. The test for control characters is newly added.
Sorry, I don't quite understand what you're asking for.

@kmike
Member

kmike commented Aug 6, 2018

@shiquanwang by "old tests" I meant tests you've added initially in this PR.

@shiquanwang
Contributor Author

@kmike Got it. I've pushed a new commit that adds tests for the simple, single-issue cases.

@kmike
Member

kmike commented Aug 8, 2018

Looks good to me, thanks @shiquanwang!

@kmike kmike changed the title from "[MRG+1] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters" to "[MRG+2] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters" on Aug 8, 2018
@lopuhin lopuhin merged commit cd56da2 into scrapinghub:master Aug 8, 2018
@lopuhin
Member

lopuhin commented Aug 8, 2018

Looks good to me too, thanks @shiquanwang and @kmike !

3 participants