Enable override default type of selector to use on xpath method #4627

Urahara · 2020-06-12T19:09:01Z

Summary

When response.xpath is called, Scrapy automatically chooses the "best parsing rules", but judging from the parsel code it always tries to parse with the HTML parser, even for XML documents. I would like users to be able to pass the type in the method call, e.g. response.xpath("xpath", type='xml')

Motivation

I have a response whose Content-Type is application/xhtml+xml;charset=iso-8859-1, but whose body is actually an XML document. Scrapy still considers html the best parsing rule type, so when I try to parse CDATA I get empty values.

Describe alternatives you've considered

I am using selector = Selector(text=response.text, type='xml') to parse my data.

Additional context

Here is a proof of concept that I made:

from parsel import Selector

xml_doc = """
<?xml version='1.0' encoding='iso-8859-1'?>
<rows>
    <page>1</page>
    <total>1</total>
    <records>1</records>
    <row id="1">
        <cell><![CDATA[Scrapy]]></cell>
        <cell><![CDATA[https://scrapy.org/]]></cell>
        <cell><![CDATA[Python]]></cell>
    </row>
</rows>
"""

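# Default type ('html'): the CDATA sections are not exposed as text nodes.
selector = Selector(text=xml_doc)

print(selector.xpath("//cell[1]/text()").get())  # None

# Explicit type='xml': the CDATA content is returned as text.
selector = Selector(text=xml_doc, type='xml')

print(selector.xpath("//cell[1]/text()").get())  # Scrapy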

Gallaecio · 2020-06-14T21:03:31Z

Related to #4240

Gallaecio · 2021-10-04T07:30:47Z

The issue reported here is getting an HtmlResponse class because the headers say XHTML even though the actual content is pure XML.

I don’t think it makes sense to allow passing a different selector type in calls to response.xpath and similar.

I think we should wait for #5204 to be merged, and then see if the problem persists. If it does, we can probably modify the changes in that pull request slightly to accommodate this scenario.

As a workaround in the meantime, on top of the one provided in the issue report, you can easily convert HtmlResponse to XmlResponse:

>>> from scrapy.http import HtmlResponse, XmlResponse
>>> body = b"""<?xml version='1.0' encoding='iso-8859-1'?>
... <root><![CDATA[cdata]]></root>"""
>>> html_response = HtmlResponse("https://example.com", body=body)
>>> html_response.css("root::text").get()
>>> xml_response = XmlResponse(html_response.url, body=html_response.body)
>>> xml_response.css("root::text").get()
'cdata'
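
To apply either workaround only when it is needed, the body itself can be inspected instead of trusting the Content-Type header. The following is a minimal sketch using only the standard library; guess_selector_type is a hypothetical helper (not part of Scrapy or parsel), and the heuristic is an assumption: a body that starts with an XML declaration and whose root element is not html is treated as plain XML.

```python
import re

# Hypothetical helper for illustration; not part of Scrapy or parsel.
# Matches one prolog item: XML declaration / processing instruction,
# comment, or a simple DOCTYPE (no internal subset).
_PROLOG = re.compile(
    rb"\s*(?:<\?.*?\?>|<!--.*?-->|<!DOCTYPE[^>]*>)",
    re.DOTALL | re.IGNORECASE,
)
# Matches the opening of the first element and captures its tag name.
_ROOT = re.compile(rb"\s*<([A-Za-z_][\w.:-]*)")


def guess_selector_type(body: bytes) -> str:
    """Return 'xml' for an XML document whose root is not <html>, else 'html'."""
    text = body.lstrip(b"\xef\xbb\xbf")  # drop a UTF-8 byte order mark, if any
    if not text.lstrip().startswith(b"<?xml"):
        return "html"
    # Skip the declaration, comments and DOCTYPE that precede the root element.
    pos = 0
    while True:
        match = _PROLOG.match(text, pos)
        if match is None:
            break
        pos = match.end()
    root = _ROOT.match(text, pos)
    if root is None:
        return "html"
    # Ignore a namespace prefix such as "xhtml:html".
    name = root.group(1).rsplit(b":", 1)[-1].lower()
    return "html" if name == b"html" else "xml"
```

The returned string could then be passed as the type argument in the Selector(text=response.text, type=...) workaround from the issue report, or used to decide whether to build an XmlResponse as shown above. Real XHTML pages keep the 'html' type because their root element is html.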

Urahara changed the title ~~Enable override default Selector type to use on xpath method~~ Enable override default type of selector to use on xpath method Jun 12, 2020

Gallaecio added the enhancement label Jun 14, 2020

Laerte mentioned this issue Oct 4, 2021

Override default type of selector #5257

Closed
