Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genspider prepend http:// without checking it in the domain #3553

Closed
ThunderMind2019 opened this issue Dec 28, 2018 · 8 comments · Fixed by #5808
Closed

Genspider prepend http:// without checking it in the domain #3553

ThunderMind2019 opened this issue Dec 28, 2018 · 8 comments · Fixed by #5808

Comments

@ThunderMind2019
Copy link

genspider prepend http:// But when i enter address like https://example.com it becomes http://https://example.com that, when run scrapy crawl throws an error.
What it should do, it should first check the receiving domain than take decision according to the passing domain whether it needs a http:// or nothing.

@ambarmendez
Copy link

Hello everybody! I would like to contribute to Scrapy, specifically with this issue and I already fork the repo and made the changes but before issuing a pull request I would like to discuss it and reach to an agreement.

I made the changes over 5 files:

  • scrapy/commands/genspider.py. Where urllib.parse.urlparse is being used on the domain parameter and prepend http scheme in the case is not there. But, this is a very tricky situation because of the command syntax, which asks for a domain and not an URL, so we're just putting a very big bet that it should be HTTP but what about if we ask for a URL instead of a domain, then the contradiction is solved right away since we can extract the domain without making any assumption. Don't you think fellas?
  • Each template file. In order to receive domain and URL values.
    I checked the test just for tests/test_commands.py, I didn't add test because I don't know how to make a test for this particular scenario where two attributes are set based on user input on a class created through a template. If someone has an idea how it could be made please I'm all ears

Lastly and maybe more important, I don't know how to try the code it's my first time and I don't have any idea how to execute it. So any help or guide would be really appreciated.

Thank you in advance and happy coding!!!! =D

@Gallaecio
Copy link
Member

Gallaecio commented Sep 30, 2019

@ambarmendez I would suggest you open a pull request already, discussions over actual code are usually easier.

Also, make sure you check out previous pull requests (#3554, #3558). Feedback there, specially @kmike’s, could answer some of your points.

ambarmendez added a commit to ambarmendez/scrapy that referenced this issue Oct 1, 2019
@ambarmendez
Copy link

Thank you @Gallaecio! Let's see how it goes =D

@Gallaecio
Copy link
Member

Now URLs are supported as input: #4439

However, we still simply extract the domain from them. Which means that the input protocol is not respected for start_urls, and instead it is always forced to be HTTP (as opposed to, for example, HTTPS).

@Gallaecio
Copy link
Member

#3558 seems to aim to address the start_urls issue. The pull request needs some work, but if not resumed, maybe it can be taken as inspiration for alternative implementations.

@msenior85
Copy link

Hi @Gallaecio, can I be assigned to work on this issue?

@Gallaecio
Copy link
Member

@msenior85 No need. Feel free to open a pull request and include Fixes #3553 in its description so that it shows up in this thread.

alexpdev added a commit to alexpdev/scrapy that referenced this issue Jan 24, 2023
This was referenced Jan 24, 2023
@alexpdev
Copy link
Contributor

Hello, I saw the note about this issue in the documentation and decided to write a simple patch for it. PR #5808

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants