[MRG+2] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters #85

shiquanwang · 2018-07-31T09:01:46Z

Please see #84 for details.

lopuhin

Thanks @shiquanwang, the fix looks good to me (I only noticed that "shoul" is likely a typo, missing a "d"). However, the build failed on Travis. Could you please check the failures? At first sight it looks like they could be caused by this change.

shiquanwang · 2018-08-01T01:49:01Z

@lopuhin Yes, it's a typo, thanks. I've fixed it.

I tried to see how the tests work. I switched to the master branch, ran the tests, and got the same errors as on Travis. Please help check:

(extructdev_py36) Shiquans-MBP:extruct shiquan$ git branch
* master
  scrapinghub/extruct#84
(extructdev_py36) Shiquans-MBP:extruct shiquan$ export TOXENV=py36
(extructdev_py36) Shiquans-MBP:extruct shiquan$ tox
GLOB sdist-make: /Users/shiquan/workspace/projects/github.com/shiquanwang/extruct/setup.py
py36 inst-nodeps: /Users/shiquan/workspace/projects/github.com/shiquanwang/extruct/.tox/dist/extruct-0.5.0.zip
py36 installed: atomicwrites==1.1.5,attrs==18.1.0,beautifulsoup4==4.6.1,bottle==0.12.13,certifi==2018.4.16,chardet==3.0.4,coverage==4.5.1,extruct==0.5.0,gevent==1.3.5,greenlet==0.4.14,html5lib==1.0.1,idna==2.7,isodate==0.6.0,lxml==4.2.3,mf2py==1.1.1,mock==2.0.0,more-itertools==4.3.0,pbr==4.2.0,pluggy==0.7.1,py==1.5.4,pyparsing==2.2.0,pytest==3.7.0,pytest-cov==2.5.1,rdflib==4.2.2,rdflib-jsonld==0.4.0,requests==2.19.1,six==1.11.0,urllib3==1.23,w3lib==1.19.0,webencodings==0.5.1
py36 runtests: PYTHONHASHSEED='2226345225'
py36 runtests: commands[0] | py.test --cov-report=term --cov-report= --cov=extruct extruct tests
============================================================================ test session starts ============================================================================
platform darwin -- Python 3.6.6, pytest-3.7.0, py-1.5.4, pluggy-0.7.1
rootdir: /Users/shiquan/workspace/projects/github.com/shiquanwang/extruct, inifile:
plugins: cov-2.5.1
collected 53 items

tests/test_extruct.py .....                                                                                                                                           [  9%]
tests/test_extruct_uniform.py ..F.                                                                                                                                    [ 16%]
tests/test_jsonld.py ...                                                                                                                                              [ 22%]
tests/test_microdata.py ..................                                                                                                                            [ 56%]
tests/test_microformat.py F                                                                                                                                           [ 58%]
tests/test_opengraph.py .                                                                                                                                             [ 60%]
tests/test_rdfa.py .....                                                                                                                                              [ 69%]
tests/test_tool.py ..........                                                                                                                                         [ 88%]
tests/test_uniform.py ....F.                                                                                                                                          [100%]

================================================================================= FAILURES ==================================================================================
_______________________________________________________________________ TestFlatten.test_microformat ________________________________________________________________________

self = <tests.test_extruct_uniform.TestFlatten testMethod=test_microformat>

    def test_microformat(self):
        body = get_testdata('misc', 'microformat_test.html')
        expected = json.loads(get_testdata('misc', 'microformat_flat_test.json').decode('UTF-8'))
        data = extruct.extract(body, uniform=True)
>       self.assertEqual(jsonize_dict(data['microformat']), expected)
E       AssertionError: Lists differ: [{'@type': ['h-hidden-phone', 'h-hidden-tablet'], '@context': '[798 chars]'}]}] != [{'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': [''],[798 chars]0']}]
E
E       First differing element 0:
E       {'@type': ['h-hidden-phone', 'h-hidden-tablet'], '@context': '[40 chars]['']}
E       {'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': [''],[40 chars]ki/'}
E
E         [{'@context': 'http://microformats.org/wiki/',
E       -   '@type': ['h-hidden-phone', 'h-hidden-tablet'],
E       +   '@type': ['h-hidden-tablet', 'h-hidden-phone'],
E           'name': ['']},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-hidden-phone'],
E       -   'children': [{'@type': ['h-hidden-phone', 'h-hidden-tablet'], 'name': ['']},
E       ?                           ------------------
E
E       +   'children': [{'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': ['']},
E       ?                                            ++++++++++++++++++
E
E                        {'@type': ['h-hidden-phone'],
E                         'name': ['aJ Styles FastLane 2018 15 x 17 Framed Plaque w/ '
E                                  'Ring Canvas'],
E                         'photo': ['/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']}]},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-entry'],
E           'author': [{'@type': ['h-card'],
E                       'name': ['W. Developer'],
E                       'url': ['http://example.com'],
E                       'value': 'W. Developer'}],
E           'content': [{'html': '<p>Blah blah blah</p>', 'value': 'Blah blah blah'}],
E           'name': ['Microformats are amazing'],
E           'published': ['2013-06-13 12:00:00'],
E           'summary': ['In which I extoll the virtues of using microformats.']}]

tests/test_extruct_uniform.py:29: AssertionError
_____________________________________________________________________ TestMicroformat.test_microformat ______________________________________________________________________

self = <tests.test_microformat.TestMicroformat testMethod=test_microformat>

    def test_microformat(self):
        body = get_testdata('misc', 'microformat_test.html')
        expected = json.loads(get_testdata('misc', 'microformat_test.json').decode('UTF-8'))

        opengraphe = MicroformatExtractor()
        data = opengraphe.extract(body)
>       self.assertEqual(jsonize_dict(data), expected)
E       AssertionError: Lists differ: [{'type': ['h-hidden-phone', 'h-hidden-table[774 chars]}]}}] != [{'properties': {'name': ['']}, 'type': ['h-[774 chars]y']}]
E
E       First differing element 0:
E       {'type': ['h-hidden-phone', 'h-hidden-tablet'], 'properties': {'name': ['']}}
E       {'properties': {'name': ['']}, 'type': ['h-hidden-tablet', 'h-hidden-phone']}
E
E       - [{'properties': {'name': ['']}, 'type': ['h-hidden-phone', 'h-hidden-tablet']},
E       ?                                          ------------------
E
E       + [{'properties': {'name': ['']}, 'type': ['h-hidden-tablet', 'h-hidden-phone']},
E       ?                                                           ++++++++++++++++++
E
E          {'children': [{'properties': {'name': ['']},
E       -                 'type': ['h-hidden-phone', 'h-hidden-tablet']},
E       +                 'type': ['h-hidden-tablet', 'h-hidden-phone']},
E                        {'properties': {'name': ['aJ Styles FastLane 2018 15 x 17 '
E                                                 'Framed Plaque w/ Ring Canvas'],
E                                        'photo': ['/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']},
E                         'type': ['h-hidden-phone']}],
E           'properties': {},
E           'type': ['h-hidden-phone']},
E          {'properties': {'author': [{'properties': {'name': ['W. Developer'],
E                                                     'url': ['http://example.com']},
E                                      'type': ['h-card'],
E                                      'value': 'W. Developer'}],
E                          'content': [{'html': '<p>Blah blah blah</p>',
E                                       'value': 'Blah blah blah'}],
E                          'name': ['Microformats are amazing'],
E                          'published': ['2013-06-13 12:00:00'],
E                          'summary': ['In which I extoll the virtues of using '
E                                      'microformats.']},
E           'type': ['h-entry']}]

tests/test_microformat.py:19: AssertionError
_______________________________________________________________________ TestUniform.test_umicroformat _______________________________________________________________________

self = <tests.test_uniform.TestUniform testMethod=test_umicroformat>

    def test_umicroformat(self):
        expected = [ { '@context': 'http://microformats.org/wiki/',
                     '@type': ['h-hidden-tablet', 'h-hidden-phone'],
                     'name': ['']},
                   { '@context': 'http://microformats.org/wiki/',
                     '@type': ['h-hidden-phone'],
                     'children': [ { '@type': [ 'h-hidden-tablet',
                                                'h-hidden-phone'],
                                     'name': ['']},
                                   { '@type': ['h-hidden-phone'],
                                     'name': [ 'aJ Styles FastLane 2018 15 x '
                                               '17 Framed Plaque w/ Ring '
                                               'Canvas'],
                                     'photo': [ '/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']}],
                   },
                   { '@context': 'http://microformats.org/wiki/',
                     '@type': ['h-entry'],
                     'author': [ { '@type': ['h-card'],
                                   'name': ['W. Developer'],
                                   'url': ['http://example.com'],
                                   'value': 'W. Developer'}],
                     'content': [ { 'html': '<p>Blah blah blah</p>',
                                    'value': 'Blah blah blah'}],
                     'name': ['Microformats are amazing'],
                     'published': ['2013-06-13 12:00:00'],
                     'summary': [ 'In which I extoll the virtues of using '
                                  'microformats.']}]
        body = get_testdata('misc', 'microformat_test.html')
        data = extruct.extract(body, syntaxes=['microformat'], uniform=True)
>       self.assertEqual(data['microformat'], expected)
E       AssertionError: Lists differ: [{'@type': ['h-hidden-phone', 'h-hidden-table[816 chars]'}]}] != [{'@context': 'http://microformats.org/wiki/'[816 chars].']}]
E
E       First differing element 0:
E       {'@type': ['h-hidden-phone', 'h-hidden-table[58 chars]['']}
E       {'@context': 'http://microformats.org/wiki/'[58 chars]['']}
E
E         [{'@context': 'http://microformats.org/wiki/',
E       -   '@type': ['h-hidden-phone', 'h-hidden-tablet'],
E       +   '@type': ['h-hidden-tablet', 'h-hidden-phone'],
E           'name': ['']},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-hidden-phone'],
E       -   'children': [{'@type': ['h-hidden-phone', 'h-hidden-tablet'], 'name': ['']},
E       ?                           ------------------
E
E       +   'children': [{'@type': ['h-hidden-tablet', 'h-hidden-phone'], 'name': ['']},
E       ?                                            ++++++++++++++++++
E
E                        {'@type': ['h-hidden-phone'],
E                         'name': ['aJ Styles FastLane 2018 15 x 17 Framed Plaque w/ '
E                                  'Ring Canvas'],
E                         'photo': ['/on/demandware.static/-/Sites-main/default/dwa3227ee6/images/small/CN1148.jpg']}]},
E          {'@context': 'http://microformats.org/wiki/',
E           '@type': ['h-entry'],
E           'author': [{'@type': ['h-card'],
E                       'name': ['W. Developer'],
E                       'url': ['http://example.com'],
E                       'value': 'W. Developer'}],
E           'content': [{'html': '<p>Blah blah blah</p>', 'value': 'Blah blah blah'}],
E           'name': ['Microformats are amazing'],
E           'published': ['2013-06-13 12:00:00'],
E           'summary': ['In which I extoll the virtues of using microformats.']}]

tests/test_uniform.py:59: AssertionError

---------- coverage: platform darwin, python 3.6.6-final-0 -----------
Name                      Stmts   Miss Branch BrPart  Cover
-----------------------------------------------------------
extruct/__init__.py           7      0      0      0   100%
extruct/__main__.py           4      4      2      0     0%
extruct/_extruct.py          53      8     33      2    84%
extruct/jsonld.py            24      0      6      1    97%
extruct/microformat.py        7      0      2      0   100%
extruct/opengraph.py         20      0      8      0   100%
extruct/rdfa.py              23      0      0      0   100%
extruct/tool.py              25      0      0      0   100%
extruct/uniform.py           59      2     34      2    94%
extruct/w3cmicrodata.py     102      0     63      1    99%
extruct/xmldom.py           108     29     28      0    65%
-----------------------------------------------------------
TOTAL                       432     43    176      6    88%

==================================================================== 3 failed, 50 passed in 2.90 seconds ====================================================================
ERROR: InvocationError for command '/Users/shiquan/workspace/projects/github.com/shiquanwang/extruct/.tox/py36/bin/py.test --cov-report=term --cov-report= --cov=extruct extruct tests' (exited with code 1)
__________________________________________________________________________________ summary __________________________________________________________________________________
ERROR:   py36: commands failed

shiquanwang · 2018-08-01T03:09:12Z

@lopuhin

All the errors show that the expected and returned lists differ only in order:

expected: ['h-hidden-tablet', 'h-hidden-phone']
returned: ['h-hidden-phone', 'h-hidden-tablet']

The cause is that mf2py returns a sorted list: check HERE
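
To make the mismatch concrete, here is a minimal sketch using the two lists quoted above. The variable names are illustrative only. Plain list equality is order-sensitive, which is why the assertions failed; comparing sorted copies would ignore order, although this PR instead updated the expected lists in the tests to match mf2py.

```python
# The two lists reported in the failing assertions.
expected = ['h-hidden-tablet', 'h-hidden-phone']
returned = ['h-hidden-phone', 'h-hidden-tablet']

# List equality is order-sensitive, so the same items in a
# different order do not compare equal.
lists_equal = expected == returned

# Comparing sorted copies ignores ordering (one possible alternative
# to changing the expected data).
same_items = sorted(expected) == sorted(returned)

print(lists_equal, same_items)
```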

codecov · 2018-08-01T03:24:12Z

Codecov Report

Merging #85 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master      #85   +/-   ##
=======================================
  Coverage   88.26%   88.26%           
=======================================
  Files          10       10           
  Lines         426      426           
  Branches       88       88           
=======================================
  Hits          376      376           
  Misses         44       44           
  Partials        6        6

Impacted Files	Coverage Δ
extruct/jsonld.py	`95.83% <100%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 79dff34...d839faa.

shiquanwang · 2018-08-01T04:33:32Z

@lopuhin
I tried to improve coverage by adding some tests.
However, json.JSONDecodeError first appeared in Python 3.5, which makes the tests fail on py2.7 and py3.4.
I wasn't sure how to fix this properly.

In the end I used a second layer of try/except with ValueError, since json.JSONDecodeError is a subclass of ValueError. That keeps both the unit tests and the coverage checks happy.
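
A minimal sketch of why catching ValueError works across versions; `try_parse` is a hypothetical helper for illustration, not a function from this PR.

```python
import json


def try_parse(text):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        # On Python 3.5+ json.loads raises json.JSONDecodeError, which is a
        # ValueError subclass; older versions raise ValueError directly.
        # Catching ValueError therefore covers both cases.
        return None


print(try_parse('{"a": 1}'))
print(try_parse('{bad'))
```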


lopuhin

Thanks @shiquanwang , looks great!

mf2py errors are indeed caused by the new version, master build failed with the same error https://travis-ci.org/scrapinghub/extruct/jobs/389836540 - thanks for fixing them too! 👍

Using ValueError instead of json.JSONDecodeError looks good to me (in general you could try building a tuple of caught exceptions outside the function, trying to use json.JSONDecodeError and falling back to ValueError if it's not defined, but in this case ValueError works great IMO).
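
The alternative mentioned above could look roughly like this. It is only a sketch: `JSON_DECODE_ERRORS` and `loads_or_none` are made-up names, not part of extruct.

```python
import json

# Build the tuple of exceptions once, at import time, outside the function.
try:
    JSON_DECODE_ERRORS = (json.JSONDecodeError,)
except AttributeError:
    # json.JSONDecodeError is not defined before Python 3.5.
    JSON_DECODE_ERRORS = (ValueError,)


def loads_or_none(text):
    """Parse JSON text, returning None when decoding fails."""
    try:
        return json.loads(text)
    except JSON_DECODE_ERRORS:
        return None


print(loads_or_none('[1, 2]'))
print(loads_or_none('nope'))
```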

lopuhin · 2018-08-01T10:13:24Z

Hey @kmike @Kebniss this PR looks good to me, do you mind if we merge it, or would you like to have another look at it?

kmike · 2018-08-01T13:21:32Z

There are two other PRs handling the same issue: #57, #58. Questions:

1. Should we parse data multiple times? In this PR data can be parsed up to 3 times; in #57 (Allow control characters inside JSON-LD strings) and #58 (Update jsonld.py, fallback to strict=False if err) it is two times.
2. Does strict=False work together with HTML_OR_JS_COMMENTLINE replacement? Here: no; #57: yes; #58: yes.
3. Can the user force strict=True/False? Here: no; #57: yes; #58: no.

If we provide an option to use strict=False, and have it ON by default (like in #57), we may try parsing with strict=False from the beginning (currently none of PRs does that) - why should it be a fallback, is there any downside of doing it by default? Also, I think it makes sense for this feature to work with HTML comment stripping - in the current PR it is either stripping the comments or using strict=False, but not both.

I don't have a strong opinion on whether it should be configurable.
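
For readers unfamiliar with the flag, here is a small sketch of what strict=False changes in the standard library's json.loads. The sample string is invented for illustration.

```python
import json

# A JSON document whose string value contains a raw newline
# (a control character), which strict JSON parsing rejects.
raw = '{"description": "first line\nsecond line"}'

try:
    json.loads(raw)  # strict=True is the default
    strict_ok = True
except ValueError:
    strict_ok = False

# With strict=False, control characters are allowed inside strings.
data = json.loads(raw, strict=False)

print(strict_ok)
print(data)
```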

lopuhin · 2018-08-01T14:28:48Z

If we provide an option to use strict=False, and have it ON by default (like in #57), we may try parsing with strict=False from the beginning (currently none of PRs does that) - why should it be a fallback, is there any downside of doing it by default? Also, I think it makes sense for this feature to work with HTML comment stripping - in the current PR it is either stripping the comments or using strict=False, but not both.

Great points, agree to both! I think we don't need for it to be configurable, but I'm not against it.

shiquanwang · 2018-08-02T01:50:44Z

Thanks @kmike for your great points.

Could you first merge any PR (this one, #57, or #58) that solves the control character issue, since it causes errors when parsing?

Then we can add some TODOs in the function noting that better approaches can be considered later.

I'm now in favor of calling data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script), strict=False) directly:

- it tries to remove leading comments, if there are any;
- strict=False allows control characters, and I don't see any downside to making this the default.

Let me do that.

kmike · 2018-08-02T18:25:38Z

@shiquanwang I think removing comments is a much more destructive operation, since it can remove some content from the JSON data (e.g. when a text looking like a comment is inside a value). I'd still prefer comment removal to be only a fallback.
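
A sketch of the concern, and of the fallback ordering being asked for. `COMMENT_RE` is a simplified, hypothetical stand-in for the project's HTML_OR_JS_COMMENTLINE pattern (the real regex may differ), and `parse_with_fallback` is an invented name.

```python
import json
import re

# Hypothetical comment pattern for illustration only; the real
# HTML_OR_JS_COMMENTLINE regex in the project may differ.
COMMENT_RE = re.compile(r'<!--.*?-->')

# Valid JSON whose value happens to contain comment-like text.
script = '{"text": "keep <!-- this --> please"}'

# Stripping comments up front silently rewrites the value.
stripped_first = json.loads(COMMENT_RE.sub('', script), strict=False)


def parse_with_fallback(text):
    """Parse as-is first; strip comment-like text only if that fails."""
    try:
        return json.loads(text, strict=False)
    except ValueError:
        return json.loads(COMMENT_RE.sub('', text), strict=False)


fallback_result = parse_with_fallback(script)

print(stripped_first)
print(fallback_result)
```

With this ordering, input that is already parseable is never modified, and comment stripping only runs on input that would otherwise fail.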

shiquanwang · 2018-08-03T02:03:52Z

@kmike
Good point. I didn't notice that. I've made changes accordingly.

kmike · 2018-08-03T09:13:26Z

Thanks @shiquanwang! The implementation looks good to me now. Would you mind adding a couple more test cases, to check the behavior you've covered (html comment inside json value, html comment + control character)?
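
The requested cases might be sketched as follows. This is not the test code added in the PR: `parse_jsonld` and `COMMENT_RE` are hypothetical stand-ins defined here so the example is self-contained.

```python
import json
import re
import unittest

# Hypothetical stand-ins, not the project's actual implementation.
COMMENT_RE = re.compile(r'<!--.*?-->')


def parse_jsonld(text):
    """Parse with strict=False; strip comments only as a fallback."""
    try:
        return json.loads(text, strict=False)
    except ValueError:
        return json.loads(COMMENT_RE.sub('', text), strict=False)


class TestJsonLdParsing(unittest.TestCase):
    def test_control_character_only(self):
        # A raw tab inside a string value.
        data = parse_jsonld('{"text": "a\tb"}')
        self.assertEqual(data, {"text": "a\tb"})

    def test_comment_inside_value_is_kept(self):
        # Parses on the first attempt, so nothing is stripped.
        data = parse_jsonld('{"text": "a <!-- b --> c"}')
        self.assertEqual(data, {"text": "a <!-- b --> c"})

    def test_leading_comment_with_control_character(self):
        # First attempt fails on the comment; the fallback strips it.
        data = parse_jsonld('<!-- generated --> {"text": "a\tb"}')
        self.assertEqual(data, {"text": "a\tb"})
```

Saved to a file, these can be run with `python -m unittest <file>`.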

shiquanwang · 2018-08-06T02:27:45Z

@kmike Tests are added.
Hope this can be merged soon.

kmike · 2018-08-06T10:34:29Z

Looks good @shiquanwang! However, would you mind bringing old, simpler tests back as well? Having tests for basic behavior helps with debugging / isolating an issue, even if there are complex tests which cover more edge cases.

shiquanwang · 2018-08-06T13:37:19Z

Hi @kmike.
I didn't remove any old tests. The test for control characters is newly added.
Sorry, I don't quite get what exactly you want.

kmike · 2018-08-06T13:38:30Z

@shiquanwang by "old tests" I meant tests you've added initially in this PR.

shiquanwang · 2018-08-06T14:00:33Z

@kmike Got it. I've pushed a new commit so that each test covers a single, simple case.

kmike · 2018-08-08T15:42:18Z

Looks good to me, thanks @shiquanwang!

lopuhin · 2018-08-08T15:46:55Z

Looks good to me too, thanks @shiquanwang and @kmike !

Fix: scrapinghub#84 try to parse JSON-LD with control characters.

ddb25ae

lopuhin reviewed Jul 31, 2018


Fix: A typo.

3b36247

Fix: Make expected list in tests in order to be aligned with mf2py.

00d1796

Add: Add test for JSON-LD data with control characters.

af34117

Shiquan Wang added 2 commits August 1, 2018 12:36

Fix: Make tests pass, json.JSONDecodeError is only available for Python 3.5 and on.

2869a02

Mod: Use ValueError again since json.JSONDecodeError exists from py35 Seems can pass unit tests and coverage tests finally.

b95cdc8

lopuhin approved these changes Aug 1, 2018


lopuhin changed the title ~~Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters~~ [MRG+1] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters Aug 1, 2018

Mod: Remove leading comments and allow control characters directly.

d1e64b8

Mod: Make comment removal a fallback when failed.

1874473

Add: Add test for comment, control characters individually and together.

e3b0c4a

Mod: Make each test for a single and simple case.

d839faa

kmike changed the title ~~[MRG+1] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters~~ [MRG+2] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters Aug 8, 2018

lopuhin merged commit cd56da2 into scrapinghub:master Aug 8, 2018

This was referenced Aug 8, 2018

How to correct "nasty" jsonl+ld #53

Open

Allow control characters inside JSON-LD strings #57

Closed

Update jsonld.py, fallback to strict=False if err #58

Closed

shiquanwang deleted the scrapinghub/extruct#84 branch August 9, 2018 00:48
