Accept JSON parsing errors in JSON-LD extractor #45

giordand · 2017-06-11T18:15:31Z

When JsonLdExtractor tries to parse the JSON-LD in some web pages, it raises ValueError: No JSON object could be decoded.
My solution was to catch the error in JsonLdExtractor._extract_items(self, node) and return an empty list by default. I catch it there rather than in extruct.extract because the extractor may already have detected microdata or RDFa in the page; the error only occurs with JSON-LD, and catching it in extruct.extract would lose that other data:

def _extract_items(self, node):
    try:
        data = json.loads(node.xpath('string()'))
    except ValueError as e:
        # Malformed JSON-LD: report it and return no items
        print(e)
        return []
    if isinstance(data, list):
        return data
    elif isinstance(data, dict):
        return [data]
    return []

redapple · 2017-06-11T19:41:47Z

Hi @giordand, thanks for the report.
Would you happen to have an example URL where this happens?
I have a locally stashed change that I had to apply for the examples you get from schema.org;
the root cause for me was an HTML comment at the start of the script text node.

$ git diff HEAD 
diff --git a/extruct/jsonld.py b/extruct/jsonld.py
index fe555db..84dfe74 100644
--- a/extruct/jsonld.py
+++ b/extruct/jsonld.py
@@ -4,6 +4,7 @@ JSON-LD extractor
 """
 
 import json
+import re
 
 import lxml.etree
 import lxml.html
@@ -24,7 +25,12 @@ class JsonLdExtractor(object):
                          if item]
 
     def _extract_items(self, node):
-        data = json.loads(node.xpath('string()'))
+        script = node.xpath('string()')
+        try:
+            data = json.loads(script)
+        except ValueError:
+            # sometimes this is due to some JavaScript comment
+            data = json.loads(re.sub(r'^(\s*//.*)|(\s*<!--.*-->\s*)', '', script))
         if isinstance(data, list):
             return data
         elif isinstance(data, dict):
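
To see what the fallback in this patch does, here is a small standalone sketch that applies the same regular expression outside of extruct. The helper name loads_lenient and the sample strings are made up for illustration; only json.loads and the re module calls are real.

```python
import json
import re

# Same pattern as in the patch above, written as a raw string.
LEADING_COMMENT_RE = re.compile(r'^(\s*//.*)|(\s*<!--.*-->\s*)')


def loads_lenient(script):
    """Parse JSON, retrying once with comments stripped (hypothetical helper)."""
    try:
        return json.loads(script)
    except ValueError:
        # Second attempt; this still raises ValueError if the JSON itself is bad.
        return json.loads(LEADING_COMMENT_RE.sub('', script))


html_comment = '<!-- generated by CMS -->\n{"@type": "Organization"}'
js_comment = '// generated by CMS\n{"@type": "Organization"}'

print(loads_lenient(html_comment))
print(loads_lenient(js_comment))
```

Note that only the // branch is anchored to the start of the text. The <!-- ... --> branch can match anywhere, including inside a string value, so this is a best-effort cleanup that only runs after the first parse has already failed.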

giordand · 2017-06-11T22:40:12Z

Yes, I remember that in my case there were HTML comments too, so it should be fixed once you commit and push those changes. One question: once you commit those changes, will they be available by upgrading the extruct library with pip?

redapple · 2017-06-12T10:52:06Z

I'll need to release a new version of extruct for the change to be available directly from PyPI via pip.
Note that pip also allows installing from specific git commits.

redapple · 2017-06-12T10:53:00Z

@giordand, it would be most helpful if you could provide a real example of a URL (or its HTML) where extruct failed, just to check whether my patch really does solve your issue.

giordand · 2017-06-12T17:26:30Z

@redapple here is the JSON-LD script which json.loads cannot load:

{
      "@context": "http://schema.org",
      "@type": "Organization",
      "name": "Action Car and Truck Accessories",
      "url": "http://www.actiontrucks.com",
      "sameAs" : [ "https://twitter.com/actioncar_truck",
        "https://www.youtube.com/user/actioncarandtruck",
        https://www.facebook.com/actioncarandtruck],
       "logo": " http://actiontrucks.com/files/images/logo.png",
      "contactPoint" : [
        { "@type" : "ContactPoint",
        "telephone" : "+1-855-560-2233",
        "contactType" : "sales"} ]
    }

Look at the line highlighted in red, the last element of the sameAs array (the Facebook URL): its double quotes are missing. I repeated the test after adding the double quotes and no error was raised. So here we have an example that apparently has no solution, because the original JSON object is malformed and surely does not load correctly in the web page either. I think the only solution that does not alter the original data is to catch the error and return an empty list.

redapple · 2017-06-13T09:18:20Z

Thanks for the feedback @giordand .
I'd go for catching the parse error, logging a warning or error, and returning an empty list like you suggest.
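
A minimal sketch of that approach, written as a standalone function rather than as the actual JsonLdExtractor method. The name extract_items and the log message are assumptions for illustration, not extruct's code.

```python
import json
import logging

logger = logging.getLogger(__name__)


def extract_items(script_text):
    """Return the JSON-LD items in script_text, or [] if it cannot be parsed."""
    try:
        data = json.loads(script_text)
    except ValueError as exc:
        # json.JSONDecodeError is a subclass of ValueError on Python 3.
        logger.warning("Skipping malformed JSON-LD block: %s", exc)
        return []
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        return [data]
    return []


# The unquoted URL from the example above makes the whole block unparseable.
print(extract_items('{"sameAs": [https://www.facebook.com/actioncarandtruck]}'))
```

Because the failure is contained per script block, items from other JSON-LD blocks or from other syntaxes on the same page are still returned.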

vu3jej · 2017-11-11T20:44:19Z

Observed something similar while working on the same website as in #57; in here

{
"@context":"http://schema.org",
"@type":"Restaurant",
"@id":"https://www.cosaordino.it/locale/906/monza-e-brianza/sedici-piadina",
"name":"SeDICI Piadina",
"image":"https://www.cosaordino.it//pictures/locale/wxthumb/5430632eaa15ffff058debd370253417_thumb.png",
"sameAs":"https://www.cosaordino.it/locale/906/monza-e-brianza/sedici-piadina",
"servesCuisine":"piadine",
"address":{
"@type":"PostalAddress",
"streetAddress":"via Monza, 29",
"addressLocality":"Brugherio",
"postalCode":"20861",
"addressRegion":"Brugherio",
"addressCountry":"IT"
},
"telephone":"039 914 3386",
"geo":{
"@type":"GeoCoordinates",
"latitude":,
"longitude":
},
"aggregateRating":{
"@type":"AggregateRating",
"ratingValue":"0",
"bestRating":"0",
"worstRating":"0",
"ratingCount":"0"
},
"potentialAction":{
"@type":"OrderAction",
"target":{
"actionPlatform":[
"http://schema.org/DesktopWebPlatform",
"http://schema.org/MobileWebPlatform"
],
"inLanguage":"it-IT",
"url":"https://www.cosaordino.it/info/906/monza-e-brianza/sedici-piadina"
},
"deliveryMethod":[
"http://purl.org/goodrelations/v1#DeliveryModeOwnFleet"
]
}
}

Notice the missing latitude and longitude values (not even empty double quotes). In this case I fixed it with the following regex, though I doubt there's a catch-all solution.

import re

json_str = re.sub(
    pattern=r'(\"\:\s)([^"\{\[])',
    repl=r'":""\2',
    string=json_str
)
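
As one alternative sketch, assuming the only defect is a key whose value is missing entirely (as with latitude and longitude above), the gap can be filled with null instead of an empty string. The function name fill_missing_values is hypothetical, and the pattern does not understand JSON string boundaries, so it could also rewrite a string value that happens to contain the same character sequence.

```python
import json
import re

# A key's closing quote and colon, optional whitespace, then a delimiter
# (comma or closing brace) where a value should have been.
MISSING_VALUE_RE = re.compile(r'(":\s*)(?=[,}])')


def fill_missing_values(json_str):
    """Insert null where a key has no value (hypothetical helper)."""
    return MISSING_VALUE_RE.sub(r'\g<1>null', json_str)


broken = '{"geo":{"@type":"GeoCoordinates",\n"latitude":,\n"longitude":\n}}'
fixed = json.loads(fill_missing_values(broken))
print(fixed["geo"])
```

Using null keeps the missing coordinates distinguishable from a real empty string, at the cost of consumers having to handle None.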

Gallaecio · 2020-01-08T13:51:37Z

I’m looking at the code, and I see that extract already handles this as suggested depending on the errors parameter, which can be set to log for the suggested behavior, ignore to do nothing or strict (default) to let the exception raise.

When using a specific parser, I think it makes sense to keep the current behavior; users are free to catch the exception or let it propagate further.
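
The three modes described above can be pictured with a small standalone sketch. This is not extruct's code: run_with_errors and its extract callback argument are hypothetical names, and only the mode names log, ignore, and strict come from the comment above.

```python
import json
import logging

logger = logging.getLogger(__name__)


def run_with_errors(extract, text, errors="strict"):
    """Call extract(text), handling failures according to errors."""
    if errors not in ("log", "ignore", "strict"):
        raise ValueError('errors must be "log", "ignore" or "strict"')
    try:
        return extract(text)
    except Exception:
        if errors == "strict":
            raise  # let the caller see the original exception
        if errors == "log":
            logger.exception("Extraction failed")
        return []  # "log" and "ignore" both fall back to no items


bad = '{"latitude":,}'
print(run_with_errors(json.loads, bad, errors="ignore"))
```

With errors="strict" the same call raises the ValueError from json.loads, which is the behavior a caller of a specific parser would catch themselves.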

redapple changed the title ~~Error parsing Json-Ld in JsonLdExtractor~~ Accept JSON parsing errors in JSON-LD extractor Jun 13, 2017

redapple added the enhancement label Jun 13, 2017

Gallaecio mentioned this issue May 28, 2019

Parsing of JSON-LD breaks when the JSON is followed by a semicolon #109

Closed

Gallaecio added the discuss label Jan 8, 2020

bhavya17037 added a commit to bhavya17037/extruct that referenced this issue May 29, 2020

Solves issue scrapinghub#45

11287d7

Add jsonStringFixer.py, which has a function to add quotes around any required text in a json string. Used this in jsonld.py to handle invalid jsonld string.
