Obey Robots.txt #180
Comments
+1
same problem here...
Robots.txt is read at the start of crawling. You can disable that feature in the settings, or write a downloader middleware that handles robots.txt yourself:
https://docs.scrapy.org/en/latest/topics/settings.html
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots
https://stackoverflow.com/questions/37274835/getting-forbidden-by-robots-txt-scrapy
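For the settings route, the only switch needed is `ROBOTSTXT_OBEY`; Scrapy's stock robots.txt middleware deactivates itself when the flag is off. A minimal sketch of the relevant `settings.py` line:

```python
# settings.py
# Scrapy's built-in RobotsTxtMiddleware (registered at priority 100 in
# DOWNLOADER_MIDDLEWARES_BASE) raises NotConfigured and drops out of the
# chain when this flag is False.
ROBOTSTXT_OBEY = False
```

The custom-middleware route is sketched further down the thread.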
I disabled the robotstxt middleware, sub-classed it, and changed the line that loads the file in the first place, so it fetched the right URL and worked. In my case I wanted to obey the robots.txt file, so just turning it off was not a solution.
can you share this?
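ArthurJ's actual code never made it into the thread, but the approach can be sketched. Scrapy's `RobotsTxtMiddleware` derives the robots.txt URL (and the allow/deny check) from `request.url`, which under scrapy-splash points at the Splash HTTP API rather than the target site. A minimal sketch of a subclass, assuming scrapy-splash keeps the original page URL in `request.meta['splash']['args']['url']`; the module path `myproject.middlewares` is a placeholder:

```python
# middlewares.py -- a sketch, not ArthurJ's actual code.
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware


class SplashAwareRobotsTxtMiddleware(RobotsTxtMiddleware):
    """Check robots.txt of the target site instead of the Splash endpoint."""

    def process_request(self, request, spider):
        # Assumption: scrapy-splash keeps the page you actually want in
        # request.meta['splash']['args']['url'], while request.url points
        # at the Splash HTTP API (e.g. http://localhost:8050/render.html).
        real_url = request.meta.get('splash', {}).get('args', {}).get('url')
        if real_url:
            # Hand the parent class a copy aimed at the real URL so both the
            # robots.txt fetch and the allow/deny check run against the target
            # site; the original Splash request is what gets downloaded if the
            # URL is allowed.
            request = request.replace(url=real_url)
        return super().process_request(request, spider)
```

Wiring it in means disabling the stock middleware and putting the subclass in its slot:

```python
# settings.py
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
    'myproject.middlewares.SplashAwareRobotsTxtMiddleware': 100,  # placeholder path
}
```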
@ArthurJ where did you add this code, though? I'm quite new to web crawling and I have been having huge trouble with my crawler not returning what it should.
The same thing happens to me: the spider first downloads the correct robots.txt and then tries to download the localhost robots.txt. However, I still see in my logs that some links are
Is scrapy-splash not compatible with obeying robots.txt? Every time I make a query it attempts to download the robots.txt from the Docker instance of scrapy-splash. Below is my settings file. I'm thinking it may be a misordering of the middlewares, but I'm not sure what it should look like.
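The settings file itself did not come through in the thread, so as a baseline here is the reference configuration from the scrapy-splash README with robots.txt obedience switched on; whether it matches the poster's setup is an assumption. If the localhost robots.txt fetches persist with this ordering, the splash-aware subclass sketched earlier is one way to point the check at the real site:

```python
# settings.py -- reference scrapy-splash setup (per the scrapy-splash README),
# shown only as a baseline; the poster's actual file is not in the thread.
ROBOTSTXT_OBEY = True

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```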