Enable Web Archivist to override robots.txt setting from W3ACT #503

Closed
nicolabingham opened this issue Mar 18, 2016 · 6 comments

Comments

@nicolabingham

nicolabingham commented Mar 18, 2016

Could the override functionality be enabled for Archivists, please, so that we can change the robots.txt policy at Target level? The placeholder for this is already in the Target record: Target > Crawl policy and Schedule > Ignore Robots.txt. Please restrict it to the Archivist role and higher.

@ldbiz
Contributor

ldbiz commented Apr 28, 2016

w3act already specifically allows Archivists to change the policy setting. Apparently the issue is that the flag is being ignored at a later stage of the process, outside w3act. I've asked @anjackson to clarify requirements.

@ldbiz ldbiz added the Question label Apr 28, 2016
@anjackson
Contributor

anjackson commented May 4, 2016

Yes, currently neither the production crawl launcher (called w3start.py) nor the new one (launch.py) actually modifies the crawl to ignore robots.txt. We have the script fragments we need for this, but we still need to finish the work.

The basic processes are in https://github.com/ukwa/python-shepherd/blob/master/python-w3act/w3act/job.py, but I'd rather move to the h3cc script approach currently under development here: https://github.com/ukwa/python-shepherd/blob/master/agents/h3cc.py

(I've added an issue there, but we can leave this issue open until the end-to-end is proven)

@ldbiz ldbiz removed the Question label May 5, 2016
@anjackson anjackson modified the milestones: 2.1.0 DDHAPT Production Deployment, 2.0.0 DDHAPT Initial Production Deployment May 24, 2016
@anjackson anjackson modified the milestones: 2.1.0 DDHAPT Production Deployment, 2.0.0 DDHAPT Initial Production Deployment Jun 13, 2016
@GilHoggarth GilHoggarth modified the milestones: 2.0.0 W3ACT release, 2.1.0 DDHAPT Production Deployment, 2.0.1 W3ACT release Jun 28, 2016
@anjackson
Contributor

Please contact us to work around this issue manually for now. Sorry!

@anjackson anjackson removed this from the X.X.X Future Production Release milestone Feb 14, 2019
@anjackson
Contributor

Not actually a W3ACT change - this is down to deploying the new crawl engine: https://github.com/ukwa/ukwa-heritrix

@anjackson
Contributor

Just checking the code, and the current W3ACT does allow 'expert users' to change this flag. This will be changed to restrict it to Archivists and SysAdmins.

@anjackson
Contributor

Also, note that the crawl engine should be obeying the no-Robots.txt directive now (and has been for a while, in fact).
