
Allow multiple URLs as input to the spiders #36

Open
VMRuiz opened this issue Feb 8, 2024 · 7 comments

@VMRuiz
Contributor

VMRuiz commented Feb 8, 2024

Description:

The current ecommerce spider in the repository accepts a single input URL. This means that crawling different categories on a website requires creating multiple spiders. However, this approach is impractical for several reasons:

  1. Handling a large number of inputs becomes tedious and challenging to monitor.
  2. Merging datasets requires additional effort.
  3. Running numerous small spiders is an inefficient use of resources.

Proposed Solution:

This feature could be implemented in several ways:

  1. Allow multiple URLs in the url field using a separator like |.
  2. Enable the use of an HTTPS URL pointing to a CSV file with one input URL per line. This is extensible to other protocols, such as S3 or SFTP.
  3. Integrate with external services such as Google Spreadsheets or Airtable.

My preferred solution is number 2, as it is usually the simplest and most versatile of the three. It allows updating the set of URLs in the file without having to reconfigure the spider, and there are many options for hosting such files without getting locked into a specific provider.
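
For illustration, a minimal sketch of option 2 in plain Scrapy, assuming a hypothetical `urls_file` argument and an HTTPS-hosted text file (S3 or SFTP would need extra handling):

```python
import scrapy


class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"

    def __init__(self, urls_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls_file = urls_file

    def start_requests(self):
        # Fetch the file that lists the actual input URLs first.
        yield scrapy.Request(self.urls_file, callback=self.parse_urls_file)

    def parse_urls_file(self, response):
        # One input URL per line, blank lines ignored.
        for line in response.text.splitlines():
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ...  # regular category/product parsing
```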

@Gallaecio
Contributor

Gallaecio commented Feb 8, 2024

Option 1 is friendlier to a no-code approach.

If we are OK with only supporting HTTP/HTTPS start requests, we could automatically interpret input URLs with a different protocol as links to lists of input URLs, which would allow supporting both 1 and 2 with a single spider parameter. Or we could follow an approach similar to that of pip for VCS, e.g. if the protocol is prefixed with urls+, we treat it as a link to a list of input URLs, allowing any protocol, even HTTP or HTTPS, to point to a URL list.
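
A rough sketch of the prefix idea (the urls+ convention is only a proposal here, nothing implemented):

```python
def split_input_url(url: str) -> tuple[bool, str]:
    """Return (is_url_list, actual_url) for a spider input URL."""
    prefix = "urls+"
    if url.startswith(prefix):
        # e.g. urls+https://example.com/urls.txt -> list of input URLs
        return True, url[len(prefix):]
    # A regular input URL, crawled directly.
    return False, url
```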

@VMRuiz
Contributor Author

VMRuiz commented Feb 8, 2024

What if we add a checkbox to indicate that the URL is an input file instead? It seems easier to use than having to add the urls+ prefix.
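
For example, with a pydantic-style parameter model (the field names here are hypothetical), the checkbox would just be a boolean field:

```python
from pydantic import BaseModel


class EcommerceSpiderParams(BaseModel):
    url: str
    # Rendered as a checkbox in a form-based UI; when enabled, `url`
    # points to a file with one input URL per line instead of a page.
    url_is_input_file: bool = False
```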

@Gallaecio
Contributor

urls+ seems more flexible, but it’s a flexibility we probably will never need. A boolean option sounds good to me.

@Gallaecio
Contributor

What if we had a separate input_urls field, in addition to urls, to specify the URLs that point to a list of URLs? We get the flexibility of urls+ without a custom scheme, and having urls plus some boolean option does not seem that different from having two URL options; in the end it is two fields in both cases.

@VMRuiz
Contributor Author

VMRuiz commented Feb 15, 2024

The downside is that now you have to validate that people are not using url and input_urls at the same time.
Also, I think they could easily be confused, resulting in a bad user experience.
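
For instance, with pydantic v2 parameters (the field and validator names are hypothetical), the mutual-exclusion check could look like:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class EcommerceSpiderParams(BaseModel):
    url: Optional[str] = None
    input_urls: Optional[str] = None

    @model_validator(mode="after")
    def _check_mutually_exclusive(self):
        # Exactly one of the two fields must be set.
        if self.url and self.input_urls:
            raise ValueError("Set either url or input_urls, not both.")
        if not self.url and not self.input_urls:
            raise ValueError("One of url or input_urls is required.")
        return self
```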

@VMRuiz
Contributor Author

VMRuiz commented Feb 15, 2024

Alternatively, we could have a single field and decide what to do with it based on the response headers from a HEAD request:

  • If the response is a text file, treat it as input_urls
  • If the response is HTML, treat it as the current url

The downside is that this method requires one additional request, and network issues or bans could affect its outcome.
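
A sketch of that detection using the requests library for brevity (in a spider this would rather be a scrapy.Request with method="HEAD"); the Content-Type values checked are assumptions:

```python
import requests


def detect_input_kind(url: str) -> str:
    """Guess whether `url` is a URL list file or a regular page."""
    response = requests.head(url, allow_redirects=True, timeout=10)
    content_type = response.headers.get("Content-Type", "")
    if content_type.startswith("text/plain"):
        return "input_urls"  # one input URL per line
    if content_type.startswith("text/html"):
        return "url"  # regular input page
    return "unknown"
```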

@Gallaecio
Contributor

Yeah, then I think the boolean option is the way to go. Any naming suggestions?
