
Robots.txt is not reloaded when URI scheme is changed (http/https) #17

Closed
askoutaris opened this issue Mar 21, 2018 · 1 comment

@askoutaris

Hello,

We found Abot a few days ago, and we are trying its free version to see whether it can meet our needs.

Everything worked fine until we noticed that it crawls URLs that are disallowed in robots.txt.

After some debugging, we found that it binds robots.txt to the URI scheme of the initial site URL. For example, for the site https://mysite.com, the disallow rules are applied only to https URLs; if there is a link to http://mysite.com/somepage, Abot will ignore robots.txt and crawl it.

Assume we have the following robots.txt:
User-agent: *
Disallow: /somepage

Could you help us deal with this issue?
Thank you
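The behaviour the report expects can be sketched as a robots.txt cache keyed by host only, so that http:// and https:// links to the same site share one set of Disallow rules. The class and method names below are hypothetical, not part of Abot's API (Abot is a C# library; this is a Python illustration using the standard-library `urllib.robotparser`):

```python
from urllib import robotparser
from urllib.parse import urlparse

class SchemeAgnosticRobots:
    """Hypothetical robots.txt store keyed by host, ignoring the URI scheme."""

    def __init__(self):
        self._parsers = {}  # host -> RobotFileParser

    def add_robots_txt(self, host, robots_txt):
        """Register robots.txt content for a host, whatever scheme it was fetched over."""
        rp = robotparser.RobotFileParser()
        rp.parse(robots_txt.splitlines())
        self._parsers[host.lower()] = rp

    def allowed(self, user_agent, url):
        """Check a URL against the rules for its host, regardless of scheme."""
        host = urlparse(url).netloc.lower()
        rp = self._parsers.get(host)
        if rp is None:
            return True  # no robots.txt known for this host
        return rp.can_fetch(user_agent, url)

robots = SchemeAgnosticRobots()
robots.add_robots_txt("mysite.com", "User-agent: *\nDisallow: /somepage")

# With a host-keyed cache, both scheme variants of the disallowed page are blocked:
print(robots.allowed("*", "https://mysite.com/somepage"))  # False
print(robots.allowed("*", "http://mysite.com/somepage"))   # False
print(robots.allowed("*", "http://mysite.com/other"))      # True
```

If the rules were instead cached under `scheme + host` (as the report describes), the http:// variant would miss the cache entry and be crawled as if no robots.txt existed.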

@sjdirect
Owner

Duplicate of the Abot thread sjdirect/abot#182
