Ensure block list gets updated from W3ACT to the FC #36

anjackson · 2021-05-19T10:18:37Z

The archivist role can add the problematic URL to W3ACT already, under a Black List field.

Then we need to pick up the white_list and black_list URLs from targets.csv and include them in the crawl feeds. They should be combined with the in-scope and nevercrawl lists respectively.
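A rough sketch of that merge step, in case it helps the discussion. It assumes targets.csv has `white_list` and `black_list` columns and that a cell may hold several URLs separated by commas or whitespace; neither the cell format nor the function name `collect_lists` comes from the real export, so treat both as hypothetical.

```python
import csv
import io
import re


def collect_lists(csv_text, in_scope, never_crawl):
    """Merge per-target white_list/black_list URLs into the two feed lists.

    Hypothetical helper: the column names follow the issue text, the
    multi-URL cell format is an assumption.
    """
    scope = set(in_scope)
    block = set(never_crawl)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: several URLs in one cell are split by commas or whitespace.
        scope.update(u for u in re.split(r"[,\s]+", row.get("white_list") or "") if u)
        block.update(u for u in re.split(r"[,\s]+", row.get("black_list") or "") if u)
    # Sorted output keeps the generated files stable between runs.
    return sorted(scope), sorted(block)
```

Using sets means a URL that is already in the in-scope or nevercrawl list is not duplicated when it also appears in W3ACT.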

After that, we need to check the crawler will pick up changes to the scope and block files, and add a w3act_export service to the FC stack that pulls and updates them. This does mean the block list might lag behind the launches a little, so we probably want to update them more often than daily.

(Clearly, we should consider wildcard/regex support, but that's more difficult to use. Maybe use plain URLs for URL blocks, but allow #-delimited lines for regexes?)

https://www.bl.uk/?mobile=on
#twitter\.com/.*?lang=#

Hmm.
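To make the #-delimited idea concrete, here is a minimal sketch of how such a file could be parsed. The exact-match rule for plain URLs and the names `parse_block_line` and `is_blocked` are assumptions for illustration, not how the crawler actually matches blocks.

```python
import re


def parse_block_line(line):
    """Classify one line of the proposed block file.

    Returns ("regex", compiled_pattern) for #-delimited lines,
    ("url", text) for anything else, and None for blank lines.
    """
    line = line.strip()
    if not line:
        return None
    if len(line) >= 3 and line.startswith("#") and line.endswith("#"):
        # re.compile raises re.error on a malformed pattern, so bad
        # entries fail at load time rather than during the crawl.
        return ("regex", re.compile(line[1:-1]))
    return ("url", line)


def is_blocked(url, rules):
    """Check a URL against parsed rules (assumed: exact match for plain URLs)."""
    for kind, value in rules:
        if kind == "url" and url == value:
            return True
        if kind == "regex" and value.search(url):
            return True
    return False
```

One thing this shows up: in the example pattern above, the `?` in `.*?lang=` acts as a lazy quantifier rather than a literal question mark, so it would also match something like `twitter.com/slang=1`. Escaping it as `\?` would restrict the match to the query string.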

Also, take ukwa/ukwa-heritrix#85 into account


anjackson · 2021-05-19T10:18:41Z

The current crawl engine has the block regexes in a file that is read once on startup. Regex blocking is also quite dangerous, in that it's easy to make a mistake that breaks things or blocks too much, so deployment should be somewhat manual rather than direct from W3ACT. This means it's not clear how best to manage them at present.

One option would be to have the code to generate the list from W3ACT, but have a separate file that gets mapped into the crawler (via ukwa-services) and update that occasionally. We'd need to update it via an API script too.
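For the "somewhat manual" part, one possibility is to diff the freshly generated list against the file currently mapped into the crawler and have someone review the changes before copying them over. A minimal sketch using the standard library; the function name `review_diff` and the "deployed"/"generated" labels are placeholders, not existing tooling.

```python
import difflib


def review_diff(deployed_lines, generated_lines):
    """Return a unified diff of the deployed block list against a new export.

    Both inputs are lists of lines without trailing newlines. An empty
    result means there is nothing to deploy.
    """
    return list(
        difflib.unified_diff(
            sorted(deployed_lines),
            sorted(generated_lines),
            fromfile="deployed",
            tofile="generated",
            lineterm="",
        )
    )
```

Printing the result with `"\n".join(...)` gives a patch-style summary in which new block entries appear as `+` lines, which is easy to eyeball for over-broad regexes before anything reaches the crawler.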

anjackson · 2021-05-19T10:18:49Z

Okay, so this is two questions: managing blocks from W3ACT, and deploying this specific regex for www.bl.uk. The latter can go in quickly. The rest is part of https://trello.com/c/2rtXl07h/29-roll-out-w3act-on-prod-swarm

anjackson mentioned this issue May 19, 2021

Stop crawler visiting BL mobile site. #37

Open

anjackson mentioned this issue Jun 8, 2023

Be more systematic about excluding web archives from crawl activity #117

Open
