-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow crawling of items outside of domain #14
Conversation
docs/setup.rst
Outdated
@@ -70,6 +70,11 @@ The following additional settings are recommended: | |||
``"zyte_crawlers.middlewares.CrawlingLogsMiddleware": 1000``, to log crawl | |||
data in JSON format for debugging purposes. | |||
|
|||
- Update :setting:`SPIDER_MIDDLEWARES <scrapy:SPIDER_MIDDLEWARES>` to include | |||
``"zyte_crawlers.middlewares.ItemOffsiteMiddleware": 500`` and remove | |||
``"scrapy.spidermiddlewares.offsite.OffsiteMiddleware"``. This allows for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about moving this to Scrapy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a proposal for Scrapy in scrapy/scrapy#6151
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #14 +/- ##
==========================================
- Coverage 98.56% 98.39% -0.18%
==========================================
Files 9 9
Lines 488 498 +10
==========================================
+ Hits 481 490 +9
- Misses 7 8 +1
|
@@ -49,6 +49,10 @@ class BaseSpiderParams(BaseModel): | |||
"widget": "request-limit", | |||
}, | |||
) | |||
allow_items_outside_domains: Optional[bool] = Field( | |||
description="When set to True, items outside of the domains will be crawled.", | |||
default=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's feasible NOT to add this option, but make it a default instead? It looks safe enough, at least for "navigation" and "pagination_only" strategies. Maybe it's fine for "full" as well. We might need to do some experiment to ensure it's safe though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some analysis, it turns out we can enabled crawling products outside of the domain by default. I guess the ultimate switch by users here would be enabling the AllowOffsiteMiddleware
in settings.
6644253
to
5901775
Compare
5901775
to
2a19089
Compare
Co-authored-by: Adrián Chaves <adrian@chaves.io>
TODO: