Allow extraction of Item data without discovery process #35

VMRuiz · 2024-02-08T10:36:45Z

Description:

Currently, the ecommerce spider in this repository takes the URL of an ecommerce website as input and then performs a discovery process to find products. Some use cases require passing a product URL directly to retrieve specific product information without crawling the website, for example to monitor stock or price changes for a particular product, or to decouple the discovery and extraction processes.

Proposed Solution:

1. Update the ecommerce spider: Add a new EcommerceCrawlStrategy value named disabled that bypasses any further navigation from the provided page. The spider should still return the expected output item type, Product in this case.
2. Create a new spider: Instead of extending the existing spider, we can build a new spider focused only on extraction, without any crawling functionality. In this case, we could drop the crawl_strategy and max_requests input parameters. Additionally, we could make the spider work with item types other than Product by adding an output type selector with values like Product, ProductNavigation, Article, etc.
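As a rough illustration of the first option, the sketch below shows how a disabled strategy could short-circuit discovery. It is not the repository's code: the `full` and `navigation` members are placeholders for the existing strategies, and `plan_requests` and the callback names are hypothetical.

```python
from enum import Enum


class EcommerceCrawlStrategy(str, Enum):
    # "full" and "navigation" are placeholders for the existing strategies;
    # "disabled" is the value proposed in the first option.
    full = "full"
    navigation = "navigation"
    disabled = "disabled"


def plan_requests(url, strategy, discovered_links):
    """Return the (callback_name, url) pairs a spider would schedule.

    Hypothetical helper: a real spider would yield Scrapy requests instead.
    """
    if strategy is EcommerceCrawlStrategy.disabled:
        # Treat the input URL itself as a product page and follow nothing else.
        return [("parse_product", url)]
    # Otherwise start from the input page and follow the discovered links.
    return [("parse_navigation", url)] + [
        ("parse_product", link) for link in discovered_links
    ]
```

With `disabled`, the links found on the page are ignored and only the input URL is extracted.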

I prefer the second solution because it's more versatile and keeps the ecommerce spider aligned with its original concept.
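To make the second option more concrete, here is a minimal sketch of what the parameters of an extraction-only spider might look like. All names (`OutputType`, `ExtractionSpiderParams`, `build_request_meta`) and the enum values are assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass
from enum import Enum


class OutputType(str, Enum):
    # Hypothetical output type selector values.
    product = "product"
    product_navigation = "productNavigation"
    article = "article"


@dataclass
class ExtractionSpiderParams:
    # No crawl_strategy or max_requests: exactly one page is requested.
    url: str
    output_type: OutputType = OutputType.product


def build_request_meta(params):
    """Describe the single request an extraction-only spider would send."""
    return {"url": params.url, "extract": params.output_type.value}
```

For example, `build_request_meta(ExtractionSpiderParams(url="https://example.com/p/1"))` describes one product extraction for that URL, with no follow-up requests.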


Gallaecio · 2024-02-08T10:55:45Z

Option 2 sounds good to me.

Gallaecio · 2024-02-08T16:42:50Z

We could call it ProductSpider, and it may be an argument for calling the upcoming article spider ArticlesSpider, to leave room for an ArticleSpider.

VMRuiz · 2024-02-09T08:02:54Z

Does it make sense to have one different spider per item type? We may end up with tens of different templates just for this.

BurnzZ · 2024-02-15T13:12:42Z

> There are some use cases where it is required to directly input a Product URL to retrieve specific product information without the need for website crawling.

What would be the main advantage of using zyte-spider-templates compared to simply using Zyte API directly to perform product extraction? For example: https://docs.zyte.com/zyte-api/usage/extract.html

Or is this more about leveraging Scrapy Cloud features like periodic jobs, logging, storing items in Hubstorage, etc.?
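For reference, the direct Zyte API call mentioned above could look roughly like the sketch below, which uses only the standard library. The helper names `build_product_request` and `extract_product` are hypothetical; the endpoint, the `product` request field and the Basic authentication scheme follow the Zyte API documentation linked above.

```python
import base64
import json
import urllib.request

ZYTE_API_URL = "https://api.zyte.com/v1/extract"


def build_product_request(url, api_key):
    """Build (without sending) a Zyte API request for product extraction."""
    payload = json.dumps({"url": url, "product": True}).encode("utf-8")
    # Zyte API uses HTTP Basic auth: API key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ZYTE_API_URL,
        data=payload,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_product(url, api_key):
    """Send the request and return the "product" part of the response."""
    request = build_product_request(url, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("product")
```

Calling `extract_product` needs a valid API key and network access. It covers only the extraction step, not scheduling, logging or item storage.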

Gallaecio · 2024-02-15T13:27:51Z

It can be helpful for a no-code path, especially in combination with #36.

Gallaecio mentioned this issue Feb 15, 2024

Implement crawl_strategy=direct_product #39

Merged

Gallaecio closed this as completed in #39 Aug 20, 2024
