
Support huge_tree=False? #5700

Open
wRAR opened this issue Nov 1, 2022 · 3 comments

Comments

@wRAR (Member) commented Nov 1, 2022

The upcoming parsel 1.7.0 exposes, and flips, the lxml flag that controls the protection described here: it now passes huge_tree=True to lxml by default. This makes it possible to scrape certain very large pages, but presumably also lets malicious pages DoS the parser. So it would make sense to be able to disable huge_tree and re-enable the protection. However, since huge_tree is an argument to Selector.__init__(), it is unclear how to do that in Scrapy: response.xpath() uses a hidden self._cached_selector = Selector(response=self), and there is nowhere to pass custom arguments.
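For context, huge_tree maps to the flag of the same name on lxml's parsers: with huge_tree=False, libxml2's built-in depth and text-size limits stay enabled and guard against decompression-bomb-style inputs, while huge_tree=True lifts them. A minimal sketch using lxml directly (not Scrapy's API):

```python
from lxml import etree

# huge_tree=False keeps libxml2's built-in security limits enabled, so
# maliciously deep or oversized documents are rejected instead of
# exhausting memory; huge_tree=True lifts those limits.
safe_parser = etree.XMLParser(huge_tree=False)
permissive_parser = etree.XMLParser(huge_tree=True)

doc = b"<root><item>ok</item></root>"
# Small, well-formed documents parse fine with either parser.
tree = etree.fromstring(doc, parser=safe_parser)
print(tree.findtext("item"))
```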

@GeorgeA92 (Contributor) commented

Here in scrapy.selector.unified.Selector (a subclass of the original parsel Selector), the Scrapy Response object is used to create the Selector object:

```python
def __init__(self, response=None, text=None, type=None, root=None, **kwargs):
    if response is not None and text is not None:
        raise ValueError(f'{self.__class__.__name__}.__init__() received '
                         'both response and text')
    st = _st(response, type)
    if text is not None:
        response = _response_from_text(text, st)
    if response is not None:
        text = response.text
        kwargs.setdefault('base_url', response.url)
    self.response = response
    super().__init__(text=text, type=st, root=root, **kwargs)
```
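To illustrate how an extra keyword argument such as huge_tree would flow through this __init__ to the parsel base class, here is a self-contained sketch; the _StubResponse, _BaseSelector, and _StubScrapySelector names are invented stand-ins for the real classes in scrapy and parsel:

```python
class _BaseSelector:
    # Stand-in for parsel.Selector: just records the kwargs it receives.
    def __init__(self, text=None, **kwargs):
        self.text = text
        self.extra_kwargs = kwargs

class _StubResponse:
    # Stand-in for scrapy.http.Response.
    def __init__(self, text, url):
        self.text = text
        self.url = url

class _StubScrapySelector(_BaseSelector):
    # Mirrors the kwargs-forwarding shape of Scrapy's Selector.__init__.
    def __init__(self, response=None, text=None, **kwargs):
        if response is not None and text is not None:
            raise ValueError("received both response and text")
        if response is not None:
            text = response.text
            kwargs.setdefault("base_url", response.url)
        self.response = response
        super().__init__(text=text, **kwargs)

resp = _StubResponse("<html/>", "https://example.com")
sel = _StubScrapySelector(response=resp, huge_tree=False)
# huge_tree reaches the base class untouched, alongside base_url:
print(sel.extra_kwargs)
```

So the plumbing already forwards arbitrary kwargs; the open question in this issue is only where response.xpath() would get the value from.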

  1. Inside its __init__ we can access response.request.callback, which refers to the spider's parse callback (a bound method object).
  2. As I recently discovered from the Stack Overflow question "How to find instance of a bound method in Python?", we can use response.request.callback.__self__ to get the spider instance. Yes, it looks unconventional, but it works.
  3. If we can access the spider, the settings are now accessible via response.request.callback.__self__.crawler.settings.getbool('HUGE_TREE', False), and then we can call the Selector's __init__ with the huge_tree argument "received" from the settings.
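The steps above can be sketched with stub objects; all the class names here are invented stand-ins for the real Scrapy objects (Settings, Crawler, Spider, Request, Response):

```python
class _Settings:
    def __init__(self, values):
        self._values = dict(values)
    def getbool(self, name, default=False):
        return bool(self._values.get(name, default))

class _Crawler:
    def __init__(self, settings):
        self.settings = settings

class _Spider:
    def __init__(self, crawler):
        self.crawler = crawler
    def parse(self, response):
        pass

class _Request:
    def __init__(self, callback):
        self.callback = callback

class _Response:
    def __init__(self, request):
        self.request = request

spider = _Spider(_Crawler(_Settings({"HUGE_TREE": False})))
response = _Response(_Request(callback=spider.parse))

# A bound method remembers its instance via __self__ ...
assert response.request.callback.__self__ is spider
# ... which gives access to the crawler settings from inside the Selector:
huge_tree = response.request.callback.__self__.crawler.settings.getbool(
    "HUGE_TREE", False)
print(huge_tree)
```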

@Gallaecio
Member

Gallaecio commented Jan 17, 2023

> Yes, it looks unconventional, but it works.

Only as long as callback is neither None nor an unbound method. The latter is allowed as long as you do not need serialization, and I think some middlewares do it, wrapping the original callback with their own function or method (bound to the middleware, not to the spider).

@Gallaecio
Member

As for alternative approaches, I think we may need to make it so that Response and subclasses accept an optional crawler or settings object, and change calling code to pass it when possible.

Maybe we can make Response a “Scrapy component” (i.e. instantiable with create_instance), so that it can define a from_crawler class method, and define it for lxml-based XmlResponse and HtmlResponse.
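A sketch of what such a hook might look like; note that from_crawler on a Response class is a proposal in this thread, not an existing Scrapy API, and the helper classes here are invented stand-ins:

```python
class _Settings:
    def __init__(self, values):
        self._values = dict(values)
    def getbool(self, name, default=False):
        return bool(self._values.get(name, default))

class _Crawler:
    def __init__(self, settings):
        self.settings = settings

class HtmlResponse:
    # Hypothetical "Scrapy component" Response: it remembers its crawler
    # so response.xpath() could later build Selector(huge_tree=...) from
    # the crawler settings instead of reaching through the callback.
    def __init__(self, text, crawler=None):
        self.text = text
        self._crawler = crawler

    @classmethod
    def from_crawler(cls, crawler, text):
        return cls(text, crawler=crawler)

    def selector_kwargs(self):
        # Fall back to parsel's defaults when no crawler was provided.
        if self._crawler is None:
            return {}
        return {"huge_tree": self._crawler.settings.getbool("HUGE_TREE", True)}

crawler = _Crawler(_Settings({"HUGE_TREE": False}))
response = HtmlResponse.from_crawler(crawler, "<html/>")
print(response.selector_kwargs())
```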
