ValueError: Cannot use xpath on a Selector of type 'json' #5923

Closed
damodharheadrun opened this issue May 5, 2023 · 7 comments

@damodharheadrun

Description

.local/lib/python3.8/site-packages/parsel/selector.py", line 621, in xpath
Scrapy version 2.8.0

[Description of the issue]

Steps to Reproduce

  1. The URL's response is JSON.
  2. Sometimes we get an HTML response instead.
  3. We wrote XPath expressions for the HTML response, but now I am getting ValueError: Cannot use xpath on a Selector of type 'json'.
  4. Earlier we were able to parse the response; now it throws an exception.
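
For an endpoint that returns JSON on some requests and HTML on others, as in the steps above, one option is to inspect the body before choosing a query language. The following is a minimal sketch using only the standard library; the helper name classify_body is hypothetical and is not part of Scrapy or Parsel.

```python
import json


def classify_body(text):
    """Return ("json", parsed_data) if text is valid JSON, else ("html", text)."""
    try:
        return "json", json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "html", text


# A JSON body: read values from the parsed data instead of calling xpath().
kind, payload = classify_body('{"html": "<html><title>foo</title></html>"}')
print(kind, payload["html"])

# An HTML body: this is the only case where XPath on the whole body applies.
kind, payload = classify_body("<html><title>foo</title></html>")
print(kind)
```

In a spider callback, only the "html" branch would go on to call response.xpath() on the whole response.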

Expected behavior: The error should not occur.

Actual behavior: "json" has been removed from the self.type check in the selector.
Example:
    if self.type not in ("html", "xml", "text", "json"):
        raise ValueError(
            f"Cannot use xpath on a Selector of type {self.type!r}"
        )
    if self.type in ("html", "xml"):
        try:
            xpathev = self.root.xpath
        except AttributeError:
            return typing.cast(
                SelectorList[_SelectorType], self.selectorlist_cls([])
            )
    else:
        try:
            xpathev = self._get_root(self._text or "", type="html").xpath
        except AttributeError:
            return typing.cast(
                SelectorList[_SelectorType], self.selectorlist_cls([])
            )

Reproduces how often: 100%

Versions

Please paste here the output of executing scrapy version --verbose in the command line.

Additional context

Any additional information, configuration, data or output from commands that might be necessary to reproduce or understand the issue. Please try not to include screenshots of code or the command line, paste the contents as text instead. You can use GitHub Flavored Markdown to make the text look better.

@Gallaecio
Member

I find it interesting that you were able to use XPath to parse HTML within a JSON structure before.

If you are looking for a workaround, downgrade parsel to <1.8.

As for a long-term solution, I am inclined to say that this is how things should work.

If you have {"html": "<html><title>foo</title></html>"}, you do not use response.xpath("//title"); you use response.selector.jmespath("html").xpath("//title") (or, starting with the upcoming Scrapy 2.9, response.jmespath("html").xpath("//title")).

@damodharheadrun
Author

Thanks for the quick update @Gallaecio

Is this the same for 2.5.0 as well?

@Gallaecio
Member

It affects any version of Scrapy if your version of Parsel is 1.8.0 or later.
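
Since the behavior depends on the Parsel version rather than the Scrapy version, a quick check of the installed package can tell which side of the change an environment is on. A minimal sketch using only the standard library; the helper name is_strict_parsel is hypothetical, and the 1.8 threshold is taken from the comment above.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def is_strict_parsel(version_string):
    """Return True if the version is 1.8 or later (threshold taken from this thread)."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 8)


try:
    installed = version("parsel")
    print(installed, is_strict_parsel(installed))
except PackageNotFoundError:
    print("parsel is not installed")
```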

@damodharheadrun
Author

Oh, OK @Gallaecio, I will try the method mentioned above and update here.

@GeorgeA92
Contributor

I find it interesting that you were able to use XPath to parse HTML within a JSON structure before.

On parsel 1.6.0 and older, the library wrapped the input in extra <html><body><p> tags, which made it possible to use css/xpath selectors (an attempt to produce valid HTML, as a browser does):

from parsel import Selector
selector_json = Selector(text='{"a":"1"}')

print(selector_json.getall())
# parsel 1.6: ['<html><body><p>{"a":"1"}</p></body></html>']
# parsel 1.8: [{'a': '1'}]  # converted to a dict, so we see ' instead of the original "; a str was most likely expected here

#print(selector_json.css('*::text').getall())
print(selector_json.css('p::text').getall())

# parsel 1.6: ['{"a":"1"}']

# parsel 1.8:
'''
    print(selector_json.css('p::text').getall())
  File "<redacted>\parsel\parsel\selector.py", line 680, in css
    raise ValueError(
ValueError: Cannot use css on a Selector of type 'json'

Process finished with exit code 1

'''

If you have {"html": "<title>foo</title>"}, you do not use response.xpath("//title"); you use response.selector.jmespath("html").xpath("//title") (or, starting with the upcoming Scrapy 2.9, response.jmespath("html").xpath("//title")).

If the server returned a JSON response with HTML inside its values, Selector was previously able to parse that HTML content with xpath/css selectors without any additional data transformation:

text2 = '''{
"prod_1": "<div class=product><div class=price>1$</div></div>",
"prod_2": "<div class=product><div class=price>2$</div></div>",
"prod_3": "<div class=product><div class=price>3$</div></div>"}
'''

selector_html = Selector(text=text2)
print(selector_html.css('div.price::text').getall())
# parsel 1.6: ['1$', '2$', '3$']
# parsel 1.8:
'''
    print(selector_html.css('div.price::text').getall())
  File "<redacted>\parsel\parsel\selector.py", line 680, in css
    raise ValueError(
ValueError: Cannot use css on a Selector of type 'json'
'''

@damodharheadrun
Author

Thanks for the detailed information @GeorgeA92

@wRAR closed this as not planned on Jun 21, 2023
@damodharheadrun
Author

damodharheadrun commented Jul 4, 2023

from scrapy.selector import Selector
response = Selector(text=str(data))
response.xpath("//title")

The above solution also works to avoid the ValueError:

data = '{"test": "verify_xpath", "data": "test"}'
from scrapy.selector import Selector
sel = Selector(text=data)
sel.xpath('.//text()').extract()
['{"test": "verify_xpath", "data": "test"}']

Criamos added a commit to openeduhub/oeh-search-etl that referenced this issue Feb 9, 2024
- fix: fixed hidden ValueError when trying to use 'LrmiBase.getLRMI()' on Response objects of type 'json'
-- Scrapy's older version used the 'parsel'-package <1.8 which (somehow) was less strict when erroneously trying to navigate a 'json'-object with XPath-selectors
-- as of Scrapy v2.9+ trying to use 'response.xpath()' on a response object other than of type "html" will throw an Error which needs to be handled
-- a bare except previously hid this problem from us, causing digitallearninglab_spider.py to throw warnings which obfuscated the real problem
-- see: scrapy/scrapy#5923
- fix: fixed weak warnings (ambiguous variable names)
- fix: fixed weak warning regarding comparison with None (PEP8:E711)
- optimized imports

Signed-off-by: Andreas Schnäpp <981166+Criamos@users.noreply.github.com>