Allow extraction of Item data without discovery process #35

Closed
VMRuiz opened this issue Feb 8, 2024 · 5 comments · Fixed by #39

Comments

@VMRuiz
Contributor

VMRuiz commented Feb 8, 2024

Description:

Currently, the ecommerce spider in the repository takes the URL of an ecommerce website as input and then performs a discovery process to find products. In some use cases, a product URL must be given directly to retrieve specific product information without crawling the website, for example to monitor stock or price changes for a particular product, or to decouple the discovery and extraction processes.

Proposed Solution:

  1. Update the ecommerce spider: Add a new EcommerceCrawlStrategy value named disabled that bypasses any further navigation from the provided page. The spider should return the expected output item type, Product in this case.

  2. Create a new spider: Instead of extending the existing spider, we can build a new spider focused only on extraction, without any crawling functionality. In this case, we could drop the crawl_strategy and max_requests input parameters. Additionally, we can make the spider work with item types other than Product by adding an output type selector with values like Product, ProductNavigation, Article, etc.

I prefer the second solution because it's more versatile and keeps the ecommerce spider aligned with its original concept.
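
As a rough illustration of the second option, here is a minimal, standard-library-only sketch of an extraction-only input model. The names OutputType, ExtractionParams and build_request are hypothetical, and the request dictionary shape (a url plus a boolean flag naming the item type) is an assumption made for illustration, not something agreed in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class OutputType(str, Enum):
    # Hypothetical selector values, taken from the examples in the proposal.
    PRODUCT = "product"
    PRODUCT_NAVIGATION = "productNavigation"
    ARTICLE = "article"


@dataclass(frozen=True)
class ExtractionParams:
    # Only the URL and the output type are needed; there is no
    # crawl_strategy or max_requests because nothing is crawled.
    url: str
    output_type: OutputType = OutputType.PRODUCT


def build_request(params: ExtractionParams) -> dict:
    """Describe the single request for the input URL (assumed shape).

    No links are followed: one input URL yields exactly one request,
    flagged with the item type to extract.
    """
    return {"url": params.url, params.output_type.value: True}
```

With this shape, supporting a new item type only means adding a value to the selector, rather than adding a new spider.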

@Gallaecio
Contributor

Option 2 sounds good to me.

@Gallaecio
Contributor

We could call it ProductSpider, and it may be an argument for calling the upcoming article spider ArticlesSpider, to leave room for an ArticleSpider.

@VMRuiz
Contributor Author

VMRuiz commented Feb 9, 2024

Does it make sense to have a separate spider per item type? We may end up with tens of different templates just for this.

@BurnzZ
Contributor

BurnzZ commented Feb 15, 2024

There are some use cases where it is required to directly input a Product URL to retrieve specific product information without the need for website crawling.

What would be the main advantage of using zyte-spider-templates compared to simply using Zyte API directly to perform product extraction? For example: https://docs.zyte.com/zyte-api/usage/extract.html

Or is this more about leveraging Scrapy Cloud features like periodic jobs, logging, storing items in Hubstorage, etc.?

@Gallaecio
Contributor

It can be helpful for a no-code path, especially in combination with #36.
