Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add possibility to use Selector (bytes) added in parsel 1.8.0. #5906

Open
GeorgeA92 opened this issue Apr 21, 2023 · 2 comments
Open

Add possibility to use Selector (bytes) added in parsel 1.8.0. #5906

GeorgeA92 opened this issue Apr 21, 2023 · 2 comments

Comments

@GeorgeA92
Copy link
Contributor

Summary

Please review possibility to apply RAM memory efficient Selector that can accept bytes as input
added as result of scrapy/parsel#210 (added in parsel 1.8.0) in scrapy (as default)

Currently UnifiedSelector (subclass of parsel's Selector used in scrapy) configured to use Response.text as input
Response.text (property) -> creates new object (Response.body -> converted to str) which is memory intensive and not needed

def __init__(self, response=None, text=None, type=None, root=None, **kwargs):
if response is not None and text is not None:
raise ValueError(
f"{self.__class__.__name__}.__init__() received "
"both response and text"
)
st = _st(response, type)

Motivation

As mentioned on scrapy/parsel#210 usage of str input for creating Selector object is more RAM memory intensive

Describe alternatives you've considered

At this stage it will require to.. separately create Selector object with bytes input inside spiders callback method (and not use Response.Selector)

Additional context

Removing other usages of Response.text will significantly reduce RAM required to processing response

@wRAR
Copy link
Member

wRAR commented Apr 21, 2023

We've discussed this briefly yesterday and found that this may not be useful if we detect the response encoding at the same time as converting body to str (I think the relevant code is at

def _body_inferred_encoding(self):
?). We didn't do any extensive analysis of this though.

@GeorgeA92
Copy link
Contributor Author

As we can see from

@property
def encoding(self):
return self._declared_encoding() or self._body_inferred_encoding()
def _declared_encoding(self):
return (
self._encoding
or self._bom_encoding()
or self._headers_encoding()
or self._body_declared_encoding()
)

Response object with valid encoding received without call of _body_inferred_encoding (with conversion to unicode str ) has _cached_ubody as None and encoding (not None).

It means that on unified Selector __init__ we can safely create bytes based selector for Response objects that match values mentioned above (and parsel v.1.8.0) and probably maintain current behaviour for other (w. encoding received from _body_inferred_encoding call) Response objects

def __init__(self, response=None, text=None, type=None, root=None, **kwargs):
if response is not None and text is not None:
raise ValueError(
f"{self.__class__.__name__}.__init__() received "
"both response and text"
)
st = _st(response, type)
if text is not None:
response = _response_from_text(text, st)
if response is not None:
text = response.text
kwargs.setdefault("base_url", response.url)
self.response = response
super().__init__(text=text, type=st, root=root, **kwargs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants