
Allow multiple URLs as input to the spiders #36

Open
VMRuiz opened this issue Feb 8, 2024 · 7 comments

@VMRuiz
Contributor

VMRuiz commented Feb 8, 2024

Description:

The current ecommerce spider in the repository accepts a single input URL. This means that crawling different categories on a website requires creating multiple spiders. However, this approach is impractical for several reasons:

  1. Handling a large number of inputs becomes tedious and challenging to monitor.
  2. Merging datasets requires additional effort.
  3. Running numerous small spiders is an inefficient use of resources.

Proposed Solution:

This feature could be implemented in several ways:

  1. Allow multiple URLs in the url field using a separator like |.
  2. Enable the use of an HTTPS URL pointing to a CSV file with one input URL per line. This is extensible to other protocols, such as S3 or SFTP.
  3. Integrate with external services such as Google Spreadsheets or Airtable.

My preferred solution is number 2, as it is usually the simplest and most versatile of the three. It allows updating the set of URLs in the file without having to reconfigure the spider, and there are many options for hosting such files without getting locked into a specific provider.
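
For illustration, a minimal sketch of option 2 in plain Scrapy, assuming a hypothetical `urls_file` argument and an HTTPS-hosted text file (S3 or SFTP would need extra handling):

```python
import scrapy


class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"

    def __init__(self, urls_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls_file = urls_file

    def start_requests(self):
        # Fetch the file that lists the actual input URLs first.
        yield scrapy.Request(self.urls_file, callback=self.parse_urls_file)

    def parse_urls_file(self, response):
        # One input URL per line, blank lines ignored.
        for line in response.text.splitlines():
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ...  # regular category/product parsing
```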

@Gallaecio
Contributor

Gallaecio commented Feb 8, 2024

Option 1 is friendlier to a no-code approach.

If we are OK with only supporting HTTP/HTTPS start requests, we could automatically interpret input URLs with a different protocol as links to lists of input URLs, which would allow supporting both 1 and 2 with a single spider parameter. Or we could follow an approach similar to that of pip for VCS, e.g. if the protocol is prefixed with urls+, we treat it as a link to a list of input URLs, allowing any protocol, even HTTP or HTTPS, to point to a URL list.
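
A rough sketch of the prefix idea (the urls+ convention is only a proposal here, nothing implemented):

```python
def split_input_url(url: str) -> tuple[bool, str]:
    """Return (is_url_list, actual_url) for a spider input URL."""
    prefix = "urls+"
    if url.startswith(prefix):
        # e.g. urls+https://example.com/urls.txt -> list of input URLs
        return True, url[len(prefix):]
    # A regular input URL, crawled directly.
    return False, url
```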

@VMRuiz
Contributor Author

VMRuiz commented Feb 8, 2024

What if we add a checkbox to indicate that the URL is an input file instead? It seems easier to use than having to add the urls+ prefix.
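
For example, with a pydantic-style parameter model (the field names here are hypothetical), the checkbox would just be a boolean field:

```python
from pydantic import BaseModel


class EcommerceSpiderParams(BaseModel):
    url: str
    # Rendered as a checkbox in a form-based UI; when enabled, `url`
    # points to a file with one input URL per line instead of a page.
    url_is_input_file: bool = False
```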

@Gallaecio
Contributor

urls+ seems more flexible, but it’s a flexibility we probably will never need. A boolean option sounds good to me.

@Gallaecio
Contributor

What if we had a separate input_urls field, in addition to urls, to specify the URLs that point to a list of URLs? We get the flexibility of urls+ without a custom scheme, and having urls plus some boolean option does not seem that different from having two URL options; in the end it is two fields in both cases.

@VMRuiz
Contributor Author

VMRuiz commented Feb 15, 2024

The downside is that now you have to validate that people are not using url and input_urls at the same time.
Also, I think they could easily be confused, resulting in a bad user experience.
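
For instance, with pydantic v2 parameters (the field and validator names are hypothetical), the mutual-exclusion check could look like:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class EcommerceSpiderParams(BaseModel):
    url: Optional[str] = None
    input_urls: Optional[str] = None

    @model_validator(mode="after")
    def _check_mutually_exclusive(self):
        # Exactly one of the two fields must be set.
        if self.url and self.input_urls:
            raise ValueError("Set either url or input_urls, not both.")
        if not self.url and not self.input_urls:
            raise ValueError("One of url or input_urls is required.")
        return self
```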

@VMRuiz
Contributor Author

VMRuiz commented Feb 15, 2024

Alternatively, we could have a single field and decide what to do with it based on the response headers from a HEAD request:

  • If the response is a text file, treat it as input_urls
  • If the response is HTML, treat it as the current url

The downside is that this method requires one additional request, and network issues or bans could affect its outcome.
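
A sketch of that detection using the requests library for brevity (in a spider this would rather be a scrapy.Request with method="HEAD"); the Content-Type values checked are assumptions:

```python
import requests


def detect_input_kind(url: str) -> str:
    """Guess whether `url` is a URL list file or a regular page."""
    response = requests.head(url, allow_redirects=True, timeout=10)
    content_type = response.headers.get("Content-Type", "")
    if content_type.startswith("text/plain"):
        return "input_urls"  # one input URL per line
    if content_type.startswith("text/html"):
        return "url"  # regular input page
    return "unknown"
```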

@Gallaecio
Contributor

Yeah, then I think the boolean option is the way to go. Any naming suggestions?
