Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow control characters inside JSON-LD strings #57

Closed
wants to merge 3 commits into from
Closed

Allow control characters inside JSON-LD strings #57

wants to merge 3 commits into from

Conversation

vu3jej
Copy link

@vu3jej vu3jej commented Nov 11, 2017

As of now, control characters are not allowed inside strings, causing errors such as this one:

$ extruct "https://www.cosaordino.it/locale/887/monza-e-brianza/hambu"
Traceback (most recent call last):
  File "/home/pinky/.virtualenvs/richey-crawler/bin/extruct", line 11, in <module>
    sys.exit(main())
  File "/home/pinky/.virtualenvs/richey-crawler/local/lib/python2.7/site-packages/extruct/tool.py", line 64, in main
    metadata = metadata_from_url(args.url)
  File "/home/pinky/.virtualenvs/richey-crawler/local/lib/python2.7/site-packages/extruct/tool.py", line 29, in metadata_from_url
    result['json-ld'] = jsonlde.extract_items(tree, resp.url)
  File "/home/pinky/.virtualenvs/richey-crawler/local/lib/python2.7/site-packages/extruct/jsonld.py", line 25, in extract_items
    self._xp_jsonld(document))
  File "/home/pinky/.virtualenvs/richey-crawler/local/lib/python2.7/site-packages/extruct/jsonld.py", line 35, in _extract_items
    data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 18 column 52 (char 699)

This commit adds an option allow_control_characters to JsonLdExtractor to
set the strict parameter inside json.loads. default is true(strict=false),
based on @kmike's suggestion.

Thanks!

@codecov
Copy link

codecov bot commented Nov 11, 2017

Codecov Report

Merging #57 into master will increase coverage by 0.09%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   85.14%   85.24%   +0.09%     
==========================================
  Files           7        7              
  Lines         303      305       +2     
  Branches       53       53              
==========================================
+ Hits          258      260       +2     
  Misses         43       43              
  Partials        2        2
Impacted Files Coverage Δ
extruct/jsonld.py 96.15% <100%> (+0.32%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a524fed...6f65c78. Read the comment docs.

@@ -15,6 +15,12 @@
class JsonLdExtractor(object):
_xp_jsonld = lxml.etree.XPath('descendant-or-self::script[@type="application/ld+json"]')

def __init__(self, allow_control_characters=True):
if allow_control_characters:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could even set self.strict = not allow_control_characters to simplify this.

@@ -15,6 +15,12 @@
class JsonLdExtractor(object):
_xp_jsonld = lxml.etree.XPath('descendant-or-self::script[@type="application/ld+json"]')

def __init__(self, allow_control_characters=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, makes sense to me to have it not-strict by default.


jsonlde = JsonLdExtractor()
data = jsonlde.extract(body)
self.assertEqual(data, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably also want to test with strict=True

Copy link
Contributor

@redapple redapple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vu3jej

@vu3jej
Copy link
Author

vu3jej commented Nov 23, 2017

Thanks for the review, @redapple 🙇‍♂️

I've added the changes as per your suggestions. Please let me know if anything is amiss.

@lopuhin
Copy link
Member

lopuhin commented Aug 8, 2018

Sorry, I'll have to close this PR as the issue was ultimately fixed in #85, sorry for letting this PR slip through.

@lopuhin lopuhin closed this Aug 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants