Add possibility to use Selector (bytes) added in parsel 1.8.0. #5906

GeorgeA92 · 2023-04-21T10:59:37Z

Summary

Please consider using, by default in Scrapy, the more memory-efficient Selector that accepts bytes as input.
It was added as a result of scrapy/parsel#210 and released in parsel 1.8.0.

Currently, UnifiedSelector (the subclass of parsel's Selector used in Scrapy) is configured to use Response.text as input.
The Response.text property creates a new object (Response.body converted to str), which is memory intensive and not needed.

scrapy/scrapy/selector/unified.py

Lines 66 to 73 in 5a37af1

    
           def __init__(self, response=None, text=None, type=None, root=None, **kwargs): 
        
               if response is not None and text is not None: 
        
                   raise ValueError( 
        
                       f"{self.__class__.__name__}.__init__() received " 
        
                       "both response and text" 
        
                   ) 
        
               st = _st(response, type)
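
To make the cost of the extra str object concrete, here is a rough, stdlib-only sketch. It does not use Scrapy or parsel; the sample bodies are made up, and `body.decode(...)` only approximates what Response.text does. Exact sizes depend on the Python version, so none are quoted here.

```python
import sys

# Hypothetical response body: about 1.2 MB of ASCII-only HTML.
body = b"<p>hello</p>" * 100_000

# Approximation of Response.text: decoding creates a second, separate object.
text = body.decode("utf-8")

print(f"bytes body:  {sys.getsizeof(body):>10,} bytes")
print(f"decoded str: {sys.getsizeof(text):>10,} bytes")

# With non-ASCII content the str copy can be larger than the encoded body,
# because CPython stores every character at the width of the widest one.
body_wide = "<p>héllo €</p>".encode("utf-8") * 100_000
text_wide = body_wide.decode("utf-8")

print(f"bytes body (non-ASCII):  {sys.getsizeof(body_wide):>10,} bytes")
print(f"decoded str (non-ASCII): {sys.getsizeof(text_wide):>10,} bytes")
```

While both objects are alive, the process holds the body roughly twice (or more), which is the overhead a bytes-based Selector would avoid.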

Motivation

As mentioned in scrapy/parsel#210, using str input to create a Selector object uses more memory.

Describe alternatives you've considered

For now, this requires creating a separate Selector object from bytes input inside the spider callback method (instead of using Response.selector).

Additional context

Removing other usages of Response.text would significantly reduce the RAM required to process a response.

wRAR · 2023-04-21T12:55:53Z

We discussed this briefly yesterday and found that this may not be useful if we detect the response encoding at the same time as converting the body to str. I think the relevant code is _body_inferred_encoding:

scrapy/scrapy/http/response/text.py

Line 105 in 5a37af1

    def _body_inferred_encoding(self):

We didn't do any extensive analysis of this, though.

GeorgeA92 · 2023-04-21T17:16:03Z

As we can see from

scrapy/scrapy/http/response/text.py

Lines 62 to 72 in 5a37af1

    
           @property 
        
           def encoding(self): 
        
               return self._declared_encoding() or self._body_inferred_encoding() 
        
           def _declared_encoding(self): 
        
               return ( 
        
                   self._encoding 
        
                   or self._bom_encoding() 
        
                   or self._headers_encoding() 
        
                   or self._body_declared_encoding() 
        
               )

A Response object whose encoding is resolved by _declared_encoding, without a call to _body_inferred_encoding (which also converts the body to a unicode str), has _cached_ubody set to None and a non-None encoding.
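
The short-circuit can be illustrated with a simplified stand-in. `SketchResponse` is a hypothetical class written for this comment, not Scrapy's TextResponse: it collapses all declared sources into one `declared` argument and replaces real detection with a fixed UTF-8 guess.

```python
from typing import Optional


class SketchResponse:
    """Simplified stand-in for TextResponse; not Scrapy's actual class."""

    def __init__(self, body: bytes, declared: Optional[str] = None) -> None:
        self.body = body
        self._declared = declared  # e.g. taken from a Content-Type header
        self._cached_ubody: Optional[str] = None

    def _declared_encoding(self) -> Optional[str]:
        return self._declared

    def _body_inferred_encoding(self) -> str:
        # Stand-in for detection: assume UTF-8 and keep the decoded text,
        # mirroring how inference and decoding happen in the same step.
        self._cached_ubody = self.body.decode("utf-8", errors="replace")
        return "utf-8"

    @property
    def encoding(self) -> str:
        # `or` short-circuits: inference runs only if nothing is declared.
        return self._declared_encoding() or self._body_inferred_encoding()


declared = SketchResponse(b"<p>hi</p>", declared="utf-8")
inferred = SketchResponse(b"<p>hi</p>")

print(declared.encoding, declared._cached_ubody)  # utf-8 None
print(inferred.encoding, inferred._cached_ubody)  # utf-8 <p>hi</p>
```

Only the second object ends up holding a decoded copy of the body.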

This means that in the unified Selector's __init__ we can safely create a bytes-based selector for Response objects that match the state described above (when parsel 1.8.0 or later is available), and probably keep the current behaviour for other Response objects (those whose encoding comes from a _body_inferred_encoding call).

scrapy/scrapy/selector/unified.py

Lines 66 to 83 in 5a37af1

    
           def __init__(self, response=None, text=None, type=None, root=None, **kwargs): 
        
               if response is not None and text is not None: 
        
                   raise ValueError( 
        
                       f"{self.__class__.__name__}.__init__() received " 
        
                       "both response and text" 
        
                   ) 
        
               st = _st(response, type) 
        
               if text is not None: 
        
                   response = _response_from_text(text, st) 
        
               if response is not None: 
        
                   text = response.text 
        
                   kwargs.setdefault("base_url", response.url) 
        
               self.response = response 
        
               super().__init__(text=text, type=st, root=root, **kwargs)
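
A rough sketch of how the `text = response.text` step could branch, under the assumptions above. `selector_kwargs` and `FakeResponse` are hypothetical names used only for illustration; this is not a patch against Scrapy, and it ignores the parsel version check that a real change would need.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing only what the sketch needs."""

    body: bytes
    declared: Optional[str] = None
    _cached_ubody: Optional[str] = None

    def _declared_encoding(self) -> Optional[str]:
        return self.declared

    @property
    def text(self) -> str:
        # Decode lazily and cache, loosely following Response.text.
        if self._cached_ubody is None:
            self._cached_ubody = self.body.decode(self.declared or "utf-8")
        return self._cached_ubody


def selector_kwargs(response: FakeResponse) -> Dict[str, Any]:
    """Choose bytes input when no decoded copy of the body exists yet."""
    declared = response._declared_encoding()
    if declared and response._cached_ubody is None:
        # Encoding is known without decoding: pass the raw bytes through.
        return {"body": response.body, "encoding": declared}
    # Otherwise keep the current behaviour and pass the str.
    return {"text": response.text}


print(selector_kwargs(FakeResponse(b"<p>x</p>", declared="utf-8")))
print(selector_kwargs(FakeResponse(b"<p>x</p>")))
```

The returned dict would replace the `text=text` argument in the `super().__init__()` call.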

Gallaecio added the enhancement and performance labels on May 3, 2023

wRAR mentioned this issue May 4, 2023

fix: Handle Parsel > 1.7.0 warning #5918

Merged
