Ensure block list gets updated from W3ACT to the FC #36

Open
anjackson opened this issue May 19, 2021 · 2 comments

@anjackson
Contributor

anjackson commented May 19, 2021

Archivists can already add a problematic URL to W3ACT, under the Black List field.

Then, we need to pick up the white_list and black_list URLs from targets.csv and include them in the crawl feeds. They should be combined with the in-scope and nevercrawl lists (respectively).
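As a rough sketch of that merge step (not the actual export code): this assumes targets.csv has a header row with `white_list` and `black_list` columns, and that a cell may hold several whitespace-separated URLs. The delimiter and the function names here are assumptions for illustration.

```python
import csv
import io


def split_urls(cell):
    """Split a CSV cell into URLs (assumes whitespace-separated values)."""
    return [u for u in (cell or "").split() if u]


def merge_lists(targets_csv, in_scope, never_crawl):
    """Return (scope, block): the existing lists plus white_list/black_list URLs.

    targets_csv is any iterable of CSV lines, e.g. an open file.
    Order is preserved and duplicates are skipped.
    """
    scope = list(in_scope)
    block = list(never_crawl)
    for row in csv.DictReader(targets_csv):
        for url in split_urls(row.get("white_list")):
            if url not in scope:
                scope.append(url)
        for url in split_urls(row.get("black_list")):
            if url not in block:
                block.append(url)
    return scope, block


# Example with an in-memory CSV standing in for targets.csv
sample = io.StringIO(
    "title,white_list,black_list\n"
    "BL,https://www.bl.uk/help,https://www.bl.uk/?mobile=on\n"
    "Other,,\n"
)
scope, block = merge_lists(sample, ["https://example.org/"], [])
```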

After that, we need to check that the crawler picks up changes to the scope and block files, and add a w3act_export service to the FC stack that pulls and updates them. This does mean the block list might lag behind the launches a little, so we probably want to update the files more often than daily.
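One way the export service could update those files, sketched under assumptions: only rewrite a file when its contents actually change, and swap it in atomically so the crawler never reads a half-written list. The function name and the one-URL-per-line format are hypothetical; whether the crawler re-reads the file is the separate check mentioned above.

```python
import os
import tempfile


def update_if_changed(path, new_lines):
    """Atomically rewrite `path` if its contents differ. Returns True if updated."""
    new_text = "".join(line + "\n" for line in new_lines)
    try:
        with open(path, encoding="utf-8") as f:
            if f.read() == new_text:
                return False  # nothing to do, leave the file's mtime alone
    except FileNotFoundError:
        pass  # first run: fall through and create the file

    # Write to a temp file in the same directory, then rename over the target.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(new_text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; assume the crawler needs read access
    os.replace(tmp, path)
    return True
```

Run on a schedule (hourly, say), this would keep the lag behind launches short without touching files that have not changed.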

Clearly, we should consider wildcard/regex support, but that's more difficult to use. Maybe use plain URLs for URL blocks, but allow #-delimited lines for regexes? For example:

https://www.bl.uk/?mobile=on
#twitter\.com/.*?lang=#

Hmm.
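To make the proposed format concrete, here is a minimal parsing sketch. It assumes a line that both starts and ends with `#` is a regex and anything else is a literal URL. It uses Python's `re` with search semantics, whereas the crawler uses Java regexes and may match differently, so treat this as an illustration of the file format only.

```python
import re


def parse_block_lines(lines):
    """Split block-list lines into (plain_urls, compiled_regexes)."""
    urls, patterns = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # '#...#' marks a regex line; strip the delimiters before compiling
        if len(line) >= 2 and line.startswith("#") and line.endswith("#"):
            patterns.append(re.compile(line[1:-1]))
        else:
            urls.append(line)
    return urls, patterns


def is_blocked(url, urls, patterns):
    """True if url is listed literally or matches any regex."""
    return url in urls or any(p.search(url) for p in patterns)


urls, patterns = parse_block_lines([
    "https://www.bl.uk/?mobile=on",
    r"#twitter\.com/.*?lang=#",
])
```

One catch with this scheme: a plain URL that happens to begin and end with `#` would be misread as a regex, though that seems unlikely in practice.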

Also, take ukwa/ukwa-heritrix#85 into account.

@anjackson
Contributor Author

The current crawl engine keeps the block regexes in a file that is read once on startup. Regex blocking is also quite dangerous, in that it's easy to make a mistake that breaks things or blocks too much, so deployment should be somewhat manual rather than direct from W3ACT. This means it's not clear how best to manage them at present.
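A manual deployment step could include a sanity check along these lines. This is a sketch, not existing tooling: the function name and the idea of keeping "must block" and "must allow" sample URLs are assumptions, and it uses Python's `re`, so a pattern should still be checked against the crawler's Java regex engine before going live.

```python
import re


def check_block_regex(pattern, must_block, must_allow):
    """Return a list of problems; an empty list means the pattern passed."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return [f"invalid regex: {exc}"]
    problems = []
    for url in must_block:
        if not compiled.search(url):
            problems.append(f"does not block {url}")
    for url in must_allow:
        if compiled.search(url):
            problems.append(f"would also block {url}")
    return problems


# An over-broad pattern is caught by the must_allow samples
problems = check_block_regex(r".*", [], ["https://www.bl.uk/"])
```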

One option would be to have code that generates the list from W3ACT, but keep a separate file that gets mapped into the crawler (via ukwa-services) and update that occasionally. We'd need to update it via an API script too.

@anjackson
Contributor Author

Okay, so this is two questions: managing blocks from W3ACT, and deploying this specific regex for www.bl.uk. The latter can go in quickly. The rest is part of https://trello.com/c/2rtXl07h/29-roll-out-w3act-on-prod-swarm
