Unicode/string parsing error #93

andrewsu · 2018-09-19T21:51:52Z

I'm trying to parse structured metadata from this url. I first executed this code on the example URL https://www.optimizesmart.com/how-to-use-open-graph-protocol/:

import extruct
import requests
from w3lib.html import get_base_url

def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url=base_url)
    return(data)

url = 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'
data = extract_metadata(url)
print(data)

This works just fine. However, this block of code:

url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
data = extract_metadata(url)
print(data)

raises this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-f0db0dd65eaf> in <module>()
      1 url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
----> 2 data = extract_metadata(url)
      3 print(data)

<ipython-input-3-25c85aeebf1a> in extract_metadata(url)
      2     r = requests.get(url)
      3     base_url = get_base_url(r.text, r.url)
----> 4     data = extruct.extract(r.text, base_url=base_url)
      5     return(data)

/usr/local/lib/python3.5/dist-packages/extruct/_extruct.py in extract(htmlstring, base_url, encoding, syntaxes, errors, uniform, return_html_node, schema_context, **kwargs)
     50         raise ValueError('Invalid error command, valid values are either "log"'
     51                          ', "ignore" or "strict"')
---> 52     tree = parse_xmldom_html(htmlstring, encoding=encoding)
     53     processors = []
     54     if 'microdata' in syntaxes:

/usr/local/lib/python3.5/dist-packages/extruct/utils.py in parse_xmldom_html(html, encoding)
     14     """ Parse HTML using XmlDomHTMLParser, return a tree """
     15     parser = XmlDomHTMLParser(encoding=encoding)
---> 16     return lxml.html.fromstring(html, parser=parser)

/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in fromstring(html, base_url, parser, **kw)
    874     else:
    875         is_full_html = _looks_like_full_html_unicode(html)
--> 876     doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
    877     if is_full_html:
    878         return doc

/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in document_fromstring(html, parser, ensure_head_body, **kw)
    760     if parser is None:
    761         parser = html_parser
--> 762     value = etree.fromstring(html, parser, **kw)
    763     if value is None:
    764         raise etree.ParserError(

src/lxml/etree.pyx in lxml.etree.fromstring()

src/lxml/parser.pxi in lxml.etree._parseMemoryDocument()

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Any idea what is going on here? Seems like an lxml.etree parsing error? Can I somehow modify r.text to fix this error? Any help is appreciated...

Kebniss · 2018-09-19T23:07:12Z

You cannot parse a Unicode string that contains an encoding declaration, see here. In Harvard's HTML the encoding is declared on the first line, and r.text is an already-decoded str, so lxml rejects it. You can just encode the text before passing it to extruct: data = extruct.extract(r.text.encode('utf8'), base_url=base_url)

String encodings are very confusing; this article helped me understand the basics :)

andrewsu · 2018-09-19T23:40:02Z

Thank you @Kebniss, worked perfectly! Your help (and your addition to my must-read list) is much appreciated!

lopuhin · 2019-10-30T14:28:16Z

I think it's possible to make extruct work on such cases, and it should be a responsibility of the library.

lopuhin · 2019-10-30T14:30:43Z

Example document:

extruct.extract('<?xml version="1.0" encoding="utf-8"?><html><body>foo</body></html>')
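
A library-side fix for documents like this one could be sketched roughly as follows. This is only an illustration, not extruct's actual code: the helper name to_parser_input and its regex are hypothetical, and it assumes that re-encoding as UTF-8 is acceptable (the same assumption as the r.text.encode('utf8') workaround earlier in this thread). If the declaration names a different encoding, the parser may need to be told about it separately.

```python
import re

# Hypothetical pattern: an XML declaration at the start of the document
# that carries an encoding attribute.
_XML_DECL_WITH_ENCODING = re.compile(r'\s*<\?xml[^>]*\sencoding\s*=')


def to_parser_input(html, encoding='utf-8'):
    """Return bytes if `html` is a str with an encoding declaration.

    lxml refuses str input that declares an encoding, so such input is
    encoded to bytes first. Anything else is returned unchanged.
    """
    if isinstance(html, str) and _XML_DECL_WITH_ENCODING.match(html):
        return html.encode(encoding)
    return html


doc = '<?xml version="1.0" encoding="utf-8"?><html><body>foo</body></html>'
safe = to_parser_input(doc)
print(type(safe).__name__)
```

The returned value could then be passed to extruct.extract in place of the original string.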

jimmytuc · 2019-12-03T15:33:43Z

Not about Unicode, but I ran into an issue when parsing a JSON-LD structure that contains a hex string in this url.
The root cause is the description, which contains a hex string; it is fixed by removing \x, according to this article.
I think this case should be handled as well. Does anyone have any ideas?
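
For illustration, here is a minimal sketch of one way to deal with such input. JSON has no \xNN escape, so json.loads rejects it. Rather than removing \x outright, this sketch swaps in a different technique and rewrites each \xNN as the equivalent \u00NN before parsing. The function name fix_hex_escapes and the sample string are hypothetical and not part of extruct.

```python
import json
import re

# A backslash followed by either xNN (two hex digits) or any single character.
# Consuming escapes pairwise keeps an escaped backslash (\\) from being
# mistaken for the start of a \xNN escape.
_ESCAPE = re.compile(r'\\(x[0-9a-fA-F]{2}|.)', re.DOTALL)


def fix_hex_escapes(raw_json):
    """Rewrite \\xNN escapes, which JSON does not allow, as \\u00NN."""
    def repl(match):
        esc = match.group(1)
        if len(esc) == 3:  # matched the xNN alternative
            return '\\u00' + esc[1:]
        return match.group(0)  # leave every other escape untouched
    return _ESCAPE.sub(repl, raw_json)


raw = r'{"description": "Tom \x26 Jerry"}'
data = json.loads(fix_hex_escapes(raw))
print(data["description"])
```

Whether this belongs in the library or in calling code is a separate question; the sketch only shows that the input can be repaired before it reaches the JSON parser.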

andrewsu closed this as completed Sep 19, 2018

lopuhin reopened this Oct 30, 2019
