Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scrapy/Parsel type of response issue. #275

Closed
bulatbulat48 opened this issue May 26, 2023 · 13 comments · Fixed by #278
Closed

Scrapy/Parsel type of response issue. #275

bulatbulat48 opened this issue May 26, 2023 · 13 comments · Fixed by #278

Comments

@bulatbulat48
Copy link

Hello guys. I am using Scrapy and code in the MW has broken because of new parsel version. Could you suggest best practice how to get type of the response in the MW, please?

parsel == 1.6.0

from parsel import Selector
text = '{"isValidAddress": "1"}'
selector = Selector(text=text)
selector.css('h1')
[] 

parsel == 1.8.1

text = '{"isValidAddress": "1"}'
selector = Selector(text=text)
selector.css('h1')
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.10/3.10.1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/lib/python3.10/site-packages/parsel/selector.py", line 680, in css
    raise ValueError(
ValueError: Cannot use css on a Selector of type 'json'
@Laerte
Copy link
Member

Laerte commented May 26, 2023

You could pin the parsel version?

@bulatbulat48
Copy link
Author

bulatbulat48 commented May 29, 2023

@Laerte sure, we can pin it. But the question is still there in parsel == 1.8.1. How to get a type of a response?

@Gallaecio
Copy link
Member

Gallaecio commented May 29, 2023

Could you suggest best practice how to get type of the response in the MW, please?

>>> from parsel import Selector
>>> text = '{"isValidAddress": "1"}'
>>> selector = Selector(text=text)
>>> selector.type
'json'

@bulatbulat48
Copy link
Author

bulatbulat48 commented May 29, 2023

@Gallaecio thanks, response.selector.type works for me.

Also, I am just pointing out that the new version break compatibility and makes it harder to maintain edge cases in Scrapy MW.

@Gallaecio
Copy link
Member

Gallaecio commented May 29, 2023

I agree that it is not a backward-compatible change.

However, I think the new behavior makes sense, and is somewhat coherent with how Scrapy handles e.g. calling response.css() on a binary response, even if the exception you get is different.

The previous behavior of silently failing could waste developer time during debugging.

For cases where there was HTML embedded in the JSON and before selector.css() would successfully crawl that, I can see how this change could be seen as a step backwards, but I still think it is better this way, more reliable.

@bulatbulat48
Copy link
Author

bulatbulat48 commented May 29, 2023

@Gallaecio I think it is worth adding about the non backward-compatible change to the https://github.com/scrapy/parsel/releases/tag/v1.8.0

@Laerte
Copy link
Member

Laerte commented Jun 7, 2023

@Gallaecio

This note should be on History under 1.8.0 version release notes right?

What you think about this text:

Due to the implementation of JMESPath support, the usage of .css and .xpath methods on JSON data became backwards incompatible. If you attempt to use these methods, a ValueError exception will be raised. Instead, you should use the .jmespath method for querying JSON data.

@bulatbulat48
Copy link
Author

bulatbulat48 commented Jun 15, 2023

@Gallaecio
Discovered another one case. Is this expected behaviour?

>>> s = Selector(text='1')
>>> s.type
'json'

>>> s
Traceback (most recent call last):
  File "parsel/utils.py", line 89, in shorten
    if len(text) <= width:
TypeError: object of type 'int' has no len()

@Gallaecio
Copy link
Member

Gallaecio commented Jun 15, 2023

I would consider str(s) raising TypeError a bug. But interpreting 1 as a JSON int is expected behavior.

To force a text interpretation (which feels pointless for Parsel) you could do:

>>> s = Selector(text='1', type="text")
>>> s.type
'text'
>>> s
<Selector query=None data='1'>

Which has its own bug, str(s) should probably give "1", not the same as repr(s).

Alternatively you could force its interpretation as a (JSON) string using root instead of text, although it is also affected by the str(s) bug from the type="text" example above:

>>> s = Selector(root='1')
>>> s.type
'json'
>>> s
<Selector query=None data='1'>

Note though that in both cases, type="text" and root="1", you can use get() to get the actual string, which is probably a better approach than str(s) anyway:

>>> s.get()
'1'

@Laerte
Copy link
Member

Laerte commented Jun 15, 2023

@wRAR Can i raise a PR to always cast the text on shorten function to string? Also this also happen for boolean type as well:

from parsel import Selector
s = Selector(text='true')
print(s)

@wRAR
Copy link
Member

wRAR commented Jun 15, 2023

@Laerte I think it makes sense to cast self.get() instead of modifying shorten()

@Laerte
Copy link
Member

Laerte commented Jun 15, 2023

@wRAR Make sense, since shorten already expect a string right?

diff --git a/parsel/selector.py b/parsel/selector.py
index 8d884a4..4285ee3 100644
--- a/parsel/selector.py
+++ b/parsel/selector.py
@@ -749,7 +749,7 @@ class Selector:
         Percent encoded content is unquoted.
         """
         if self.type in ("text", "json"):
-            return self.root
+            return str(self.root)
         try:
             return typing.cast(
                 str

@wRAR
Copy link
Member

wRAR commented Jun 15, 2023

I was thinking about shorten(str(self.get()), width=40) in Selector.__str__(). As for get(), its docstring currently says that it returns a string, I think this is wrong and it should still return a non-string JSON if the type is JSON.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants