-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+2] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters #85
[MRG+2] Fix: scrapinghub/extruct#84 try to parse JSON-LD with control characters #85
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shiquanwang , the fix looks good to me (I only noticed that shoul
is likely a typo - missing d
), but the build failed on travis, could you please check the failures - at first sight it looks like they could be caused by this change?
@lopuhin Yes it's a typo, thanks. I've fixed it. I tried to see how the tests work. I switch to master branch and run the tests and get the same errors in travis, please help check:
|
Codecov Report
@@ Coverage Diff @@
## master #85 +/- ##
=======================================
Coverage 88.26% 88.26%
=======================================
Files 10 10
Lines 426 426
Branches 88 88
=======================================
Hits 376 376
Misses 44 44
Partials 6 6
Continue to review full report at Codecov.
|
@lopuhin Finally I use a second layer |
…thon 3.5 and on.
…py35 Seems can pass unit tests and coverage tests finally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shiquanwang , looks great!
mf2py errors are indeed caused by the new version, master build failed with the same error https://travis-ci.org/scrapinghub/extruct/jobs/389836540 - thanks for fixing them too! 👍
Using ValueError
instead of json.JSONDecodeError
looks good to me (in general you could try building a tuple of caught exceptions outside the function, trying to use json.JSONDecodeError
and falling back to ValueError
if it's not defined, but in this case ValueError
works great IMO).
There are two other PRs handling the same issue: #57, #58. Questions:
If we provide an option to use strict=False, and have it ON by default (like in #57), we may try parsing with strict=False from the beginning (currently none of PRs does that) - why should it be a fallback, is there any downside of doing it by default? Also, I think it makes sense for this feature to work with HTML comment stripping - in the current PR it is either stripping the comments or using strict=False, but not both. I don't have a strong opinion on whether it should be configurable. |
Great points, agree to both! I think we don't need for it to be configurable, but I'm not against it. |
Thanks @kmike for your great points. Can you merge any PR(this one, 57, 58) that solves the control character issue first, which causes errors when parsing. And then we can add some TODOs in the function to note it can be further considered with some better ways of doing. Now I'm for to do
Let me do. |
@shiquanwang I think removing comments is a much more destructive operation, since it can remove some content from the JSON data (e.g. when a text looking like a comment is inside a value). I'd still prefer comment removal to be only a fallback. |
@kmike |
Thanks @shiquanwang! The implementation looks good to me now. Would you mind adding a couple more test cases, to check the behavior you've covered (html comment inside json value, html comment + control character)? |
Tests are added. @kmike |
Looks good @shiquanwang! However, would you mind bringing old, simpler tests back as well? Having tests for basic behavior helps with debugging / isolating an issue, even if there are complex tests which cover more edge cases. |
Hi @kmike . |
@shiquanwang by "old tests" I meant tests you've added initially in this PR. |
@kmike get it. New push is made to make tests for single/simple case. |
Looks good to me, thanks @shiquanwang! |
Looks good to me too, thanks @shiquanwang and @kmike ! |
please see #84 for detail