
Support huge_tree=False? #5700

Open
wRAR opened this issue Nov 1, 2022 · 3 comments

Comments

@wRAR (Member) commented Nov 1, 2022

The upcoming parsel 1.7.0 exposes, and flips, the lxml flag that controls the protection described here: it now passes huge_tree=True to lxml by default. This makes it possible to scrape certain very large pages, but presumably also lets malicious pages DoS the parser. So it would make sense to be able to disable huge_tree and re-enable the protection. However, since huge_tree is an argument to Selector.__init__(), it is unclear how to do that in Scrapy: response.xpath() uses a hidden self._cached_selector = Selector(response=self), and there is nowhere to pass custom arguments.
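For context, huge_tree maps to the flag of the same name on lxml's parsers: with huge_tree=False, libxml2's built-in depth and text-size limits stay enabled and guard against decompression-bomb-style inputs, while huge_tree=True lifts them. A minimal sketch using lxml directly (not Scrapy's API):

```python
from lxml import etree

# huge_tree=False keeps libxml2's built-in security limits enabled, so
# maliciously deep or oversized documents are rejected instead of
# exhausting memory; huge_tree=True lifts those limits.
safe_parser = etree.XMLParser(huge_tree=False)
permissive_parser = etree.XMLParser(huge_tree=True)

doc = b"<root><item>ok</item></root>"
# Small, well-formed documents parse fine with either parser.
tree = etree.fromstring(doc, parser=safe_parser)
print(tree.findtext("item"))
```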

@GeorgeA92 (Contributor) commented

Here in scrapy.selector.unified.Selector (a subclass of the original parsel Selector), the Scrapy Response object is used to create the Selector object:

```python
def __init__(self, response=None, text=None, type=None, root=None, **kwargs):
    if response is not None and text is not None:
        raise ValueError(f'{self.__class__.__name__}.__init__() received '
                         'both response and text')
    st = _st(response, type)
    if text is not None:
        response = _response_from_text(text, st)
    if response is not None:
        text = response.text
        kwargs.setdefault('base_url', response.url)
    self.response = response
    super().__init__(text=text, type=st, root=root, **kwargs)
```
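To illustrate how an extra keyword argument such as huge_tree would flow through this __init__ to the parsel base class, here is a self-contained sketch; the _StubResponse, _BaseSelector, and _StubScrapySelector names are invented stand-ins for the real classes in scrapy and parsel:

```python
class _BaseSelector:
    # Stand-in for parsel.Selector: just records the kwargs it receives.
    def __init__(self, text=None, **kwargs):
        self.text = text
        self.extra_kwargs = kwargs

class _StubResponse:
    # Stand-in for scrapy.http.Response.
    def __init__(self, text, url):
        self.text = text
        self.url = url

class _StubScrapySelector(_BaseSelector):
    # Mirrors the kwargs-forwarding shape of Scrapy's Selector.__init__.
    def __init__(self, response=None, text=None, **kwargs):
        if response is not None and text is not None:
            raise ValueError("received both response and text")
        if response is not None:
            text = response.text
            kwargs.setdefault("base_url", response.url)
        self.response = response
        super().__init__(text=text, **kwargs)

resp = _StubResponse("<html/>", "https://example.com")
sel = _StubScrapySelector(response=resp, huge_tree=False)
# huge_tree reaches the base class untouched, alongside base_url:
print(sel.extra_kwargs)
```

So the plumbing already forwards arbitrary kwargs; the open question in this issue is only where response.xpath() would get the value from.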

  1. Inside its __init__ we can access response.request.callback, which refers to the spider's parse callback (a bound method object).
  2. As I recently discovered from the Stack Overflow question "How to find instance of a bound method in Python?", we can use response.request.callback.__self__ to get the spider instance. Yes, it looks unconventional, but it works.
  3. If we can access the spider, the settings are now accessible via response.request.callback.__self__.crawler.settings.getbool('HUGE_TREE', False), and then we can call the Selector's __init__ with the huge_tree argument "received" from the settings.
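The steps above can be sketched with stub objects; all the class names here are invented stand-ins for the real Scrapy objects (Settings, Crawler, Spider, Request, Response):

```python
class _Settings:
    def __init__(self, values):
        self._values = dict(values)
    def getbool(self, name, default=False):
        return bool(self._values.get(name, default))

class _Crawler:
    def __init__(self, settings):
        self.settings = settings

class _Spider:
    def __init__(self, crawler):
        self.crawler = crawler
    def parse(self, response):
        pass

class _Request:
    def __init__(self, callback):
        self.callback = callback

class _Response:
    def __init__(self, request):
        self.request = request

spider = _Spider(_Crawler(_Settings({"HUGE_TREE": False})))
response = _Response(_Request(callback=spider.parse))

# A bound method remembers its instance via __self__ ...
assert response.request.callback.__self__ is spider
# ... which gives access to the crawler settings from inside the Selector:
huge_tree = response.request.callback.__self__.crawler.settings.getbool(
    "HUGE_TREE", False)
print(huge_tree)
```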

@Gallaecio
Member

Gallaecio commented Jan 17, 2023

> Yes, it looks unconventional, but it works.

Only as long as callback is neither None nor an unbound method. The latter is allowed as long as you do not need serialization, and I think some middlewares do it, wrapping the original callback with their own function or method (bound to the middleware, not to the spider).

@Gallaecio
Member

As for alternative approaches, I think we may need to make it so that Response and subclasses accept an optional crawler or settings object, and change calling code to pass it when possible.

Maybe we can make Response a “Scrapy component” (i.e. instantiable with create_instance), so that it can define a from_crawler class method, and define it for lxml-based XmlResponse and HtmlResponse.
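A sketch of what such a hook might look like; note that from_crawler on a Response class is a proposal in this thread, not an existing Scrapy API, and the helper classes here are invented stand-ins:

```python
class _Settings:
    def __init__(self, values):
        self._values = dict(values)
    def getbool(self, name, default=False):
        return bool(self._values.get(name, default))

class _Crawler:
    def __init__(self, settings):
        self.settings = settings

class HtmlResponse:
    # Hypothetical "Scrapy component" Response: it remembers its crawler
    # so response.xpath() could later build Selector(huge_tree=...) from
    # the crawler settings instead of reaching through the callback.
    def __init__(self, text, crawler=None):
        self.text = text
        self._crawler = crawler

    @classmethod
    def from_crawler(cls, crawler, text):
        return cls(text, crawler=crawler)

    def selector_kwargs(self):
        # Fall back to parsel's defaults when no crawler was provided.
        if self._crawler is None:
            return {}
        return {"huge_tree": self._crawler.settings.getbool("HUGE_TREE", True)}

crawler = _Crawler(_Settings({"HUGE_TREE": False}))
response = HtmlResponse.from_crawler(crawler, "<html/>")
print(response.selector_kwargs())
```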
