Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add parse_with_rules method and log a warning when _parse_response is… #5856

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

DevInTheShell-BR
Copy link

@DevInTheShell-BR DevInTheShell-BR commented Mar 24, 2023

… called directly or when it is overriden.

This is a working in progress. Although the code is working when I tested it manually, it lacks automated tests. I don't know yet how to make the test work with async methods.

Fixes #4463

scrapy/spiders/crawl.py Outdated Show resolved Hide resolved
scrapy/spiders/crawl.py Outdated Show resolved Hide resolved
"will be removed in future Scrapy releases. Please use the "
"CrawlSpider.parse_with_rules method instead."
)
cb_kwargs.pop("direct_call", None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting strategy. I think it might be OK to go this route, as long as we use a more descriptive and less public name, e.g. _direct_parse_response_call.

However, there may be better approaches here.

For example, you could define a new private method, move the body of _parse_response to that method, and call that method from both _parse_method and parse_with_rules. In _parse_method, however, you also log a warning. Then you do not need to mess up with cb_kwargs, which is meant to end users, not for internal handlings.

It might also be good to log these warnings only once, and not once per callback call. So, you could run the method inheritance check on __init__, and define a private boolean attribute in __init__ as well to keep track of whether or not _parse_response has been called, to only log a warning on the first of such calls.

Finally, I think the parse_with_rules should be async as well. If you tried that already and had issues, mind that I think you need to await your internal call to _parse_response (or to the new private method) for things to work as expected. That is, rather than:

    async def parse_with_rules(…):
        return self._parse_response(…)

I think you need:

    async def parse_with_rules(…):
        return await self._parse_response(…)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the approach that you suggested but I was not able to make parse_with_rules and _parse_response async. It raises this error

return await self._parse_with_rules(response, callback, cb_kwargs, follow)
TypeError: object async_generator can't be used in 'await' expression

if I put async/await in the methods. I think it is because of yield on the body of the original function. It is working with only the _parse_with_rules being async but I don't know how much the others functions not being async would affect the operation of other parts of the code.

I also created 3 tests to check if the warnings would be raising properly. Should I add more to test other stuff?

Copy link
Member

@wRAR wRAR Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_with_rules is an async generator so you cannot await it, you need:

async def _parse_response(self, response, callback, cb_kwargs, follow=True):
    ...
    async for request_or_item in self._parse_with_rules(response, callback, cb_kwargs, follow):
        yield request_or_item

@codecov
Copy link

codecov bot commented Mar 24, 2023

Codecov Report

Merging #5856 (639f616) into master (96033ce) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 639f616 differs from pull request most recent head b3fbf88. Consider uploading reports for the commit b3fbf88 to get more accurate results

@@            Coverage Diff             @@
##           master    #5856      +/-   ##
==========================================
+ Coverage   88.85%   88.87%   +0.01%     
==========================================
  Files         162      162              
  Lines       11057    11072      +15     
  Branches     1801     1803       +2     
==========================================
+ Hits         9825     9840      +15     
  Misses        954      954              
  Partials      278      278              
Impacted Files Coverage Δ
scrapy/spiders/crawl.py 95.14% <100.00%> (+0.82%) ⬆️

Comment on lines +74 to +78
warnings.warn(
"The CrawlSpider._parse_response method is deprecated: it "
"will be removed in future Scrapy releases. Please use the "
"CrawlSpider.parse_with_rules method instead."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about mentioning where the problem is? Users may not realize they are subclassing CrawlSpider, or where, or they may be using a Scrapy plugin, or subclassing a CrawlSpider subclass from a third-party, etc.

Suggested change
warnings.warn(
"The CrawlSpider._parse_response method is deprecated: it "
"will be removed in future Scrapy releases. Please use the "
"CrawlSpider.parse_with_rules method instead."
)
cls_path = f"{cls.__module__}.{cls.__qualname__}"
warnings.warn(
f"The CrawlSpider._parse_response method, that the "
f"{cls_path} class reimplements, is deprecated: it will be "
f"removed in future Scrapy releases. Please use the "
f"CrawlSpider.parse_with_rules method instead."
)

cls = self.__class__
self._is_warned = False
is_overriden = method_is_overridden(cls, CrawlSpider, "_parse_response")
if is_overriden:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should maintain backward compatibility: if they indeed override _parse_response, we need to call it where needed

I think something like this may work:

Suggested change
if is_overriden:
if is_overriden:
self.parse_with_rules = self._parse_response

Comment on lines +147 to +148
if not self._is_warned:
self._is_warned = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to distinguish warning from overriding and warning from use.

If you override, you get 1 warning. If you use, you get 1 warning. If you do both, you get 2 warnings.

Comment on lines +149 to +153
warnings.warn(
"The CrawlSpider._parse_response method is deprecated: it "
"will be removed in future Scrapy releases. Please use the "
"CrawlSpider.parse_with_rules method instead."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should make the warning appear on the called line, the one that actually calls the method, rather than on the method itself.

Suggested change
warnings.warn(
"The CrawlSpider._parse_response method is deprecated: it "
"will be removed in future Scrapy releases. Please use the "
"CrawlSpider.parse_with_rules method instead."
)
warnings.warn(
"The CrawlSpider._parse_response method is deprecated: it "
"will be removed in future Scrapy releases. Please use the "
"CrawlSpider.parse_with_rules method instead.",
stacklevel=2,
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

start_requests bypassing rules while working with CrawlSpider
3 participants