Summary
When response.xpath is called, Scrapy automatically chooses the "best parsing rules", but judging from the parsel code it always tries to parse with the html parser, even for XML documents. I would like users to be able to pass the type in the method call, like: response.xpath("xpath", type='xml')
Motivation
I have a response whose Content-Type is application/xhtml+xml;charset=iso-8859-1 but whose body is an XML document. Scrapy nevertheless treats html as the best parsing rule type: when I try to parse CDATA, I get empty values.
Describe alternatives you've considered
I am using selector = Selector(text=response.text, type='xml') to parse my data.
Urahara changed the title from "Enable override default Selector type to use on xpath method" to "Enable override default type of selector to use on xpath method" on Jun 12, 2020
The issue reported here is getting an HtmlResponse class because the headers say XHTML even though the actual content is pure XML.
I don’t think it makes sense to allow passing a different selector type in calls to response.xpath and similar.
I think we should wait for #5204 to be merged, and then see if the problem persists. If it does, we can probably modify the changes in that pull request slightly to accommodate this scenario.
As a workaround in the meantime, on top of the one provided in the issue report, you can easily convert HtmlResponse to XmlResponse:
Additional context
Here is a proof-of-concept that I made: