Enable override default type of selector to use on xpath method #4627

Open
Urahara opened this issue Jun 12, 2020 · 2 comments

Urahara commented Jun 12, 2020

Summary

When response.xpath is called, Scrapy automatically chooses the "best parsing rules", but judging from the parsel code it always tries to parse with the HTML parser, even for XML documents. I would like users to be able to pass the type in the method call, for example: response.xpath("xpath", type='xml')

Motivation

I have a response whose Content-Type is application/xhtml+xml;charset=iso-8859-1 but whose body is actually an XML document. Scrapy considers HTML the best parsing rule type, so when I try to parse CDATA I get empty values.

Describe alternatives you've considered

I am using selector = Selector(text=response.text, type='xml') to parse my data.

Additional context

Here is a proof of concept that I made:

from parsel import Selector

xml_doc = """
<?xml version='1.0' encoding='iso-8859-1'?>
<rows>
    <page>1</page>
    <total>1</total>
    <records>1</records>
    <row id="1">
        <cell><![CDATA[Scrapy]]></cell>
        <cell><![CDATA[https://scrapy.org/]]></cell>
        <cell><![CDATA[Python]]></cell>
    </row>
</rows>
"""

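# Default type: the document is parsed as HTML and the CDATA content is lost.
selector = Selector(text=xml_doc)

print(selector.xpath("//cell[1]/text()").get())  # None

# Explicit XML type: the CDATA content is preserved.
selector = Selector(text=xml_doc, type='xml')

print(selector.xpath("//cell[1]/text()").get())  # Scrapy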
Urahara changed the title from "Enable override default Selector type to use on xpath method" to "Enable override default type of selector to use on xpath method" on Jun 12, 2020
@Gallaecio
Member

Related to #4240

Member

Gallaecio commented Oct 4, 2021

The issue reported here is getting an HtmlResponse class because the headers say XHTML even though the actual content is pure XML.

I don’t think it makes sense to allow passing a different selector type in calls to response.xpath and similar.

I think we should wait for #5204 to be merged, and then see if the problem persists. If it does, we can probably modify the changes in that pull request slightly to accommodate this scenario.

As a workaround in the meantime, on top of the one provided in the issue report, you can easily convert HtmlResponse to XmlResponse:

>>> from scrapy.http import HtmlResponse, XmlResponse
>>> body = b"""<?xml version='1.0' encoding='iso-8859-1'?>
... <root><![CDATA[cdata]]></root>"""
>>> html_response = HtmlResponse("https://example.com", body=body)
>>> html_response.css("root::text").get()
>>> xml_response = XmlResponse(html_response.url, body=html_response.body)
>>> xml_response.css("root::text").get()
'cdata'
