The archivist role can already add the problematic URL to W3ACT, under the Black List field.
Then, we need to pick up the `white_list` and `black_list` URLs from `targets.csv` and include them in the crawl feeds. These should be combined with the in-scope and `nevercrawl` lists (respectively).
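As a rough sketch of that extraction step (the column names, the `|` multi-value separator, and the output file names are all assumptions here, not the actual `targets.csv` layout):

```python
import csv

def extract_lists(targets_csv="targets.csv"):
    """Pull white_list/black_list URLs out of targets.csv.

    Sketch only: assumes each field may hold several URLs separated by '|'.
    """
    in_scope, never_crawl = [], []
    with open(targets_csv, newline="") as f:
        for row in csv.DictReader(f):
            in_scope += [u for u in row.get("white_list", "").split("|") if u]
            never_crawl += [u for u in row.get("black_list", "").split("|") if u]
    return in_scope, never_crawl

if __name__ == "__main__":
    white, black = extract_lists()
    # Combine with the existing lists by appending to the crawl feed files:
    with open("in-scope.txt", "a") as f:
        f.writelines(u + "\n" for u in white)
    with open("nevercrawl.txt", "a") as f:
        f.writelines(u + "\n" for u in black)
```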
After that, we need to check that the crawler will pick up changes to the scope and block files, and add a `w3act_export` service to the FC stack that pulls and updates them. This does mean the block list might lag a little behind the launches, so we probably want to update it more often than daily.
(Clearly, we should consider wildcard/regex support, but that's more difficult to use. Maybe use plain URLs for URL blocks, but allow `#`-delimited lines for RegEx?)
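To make that concrete, a block file along those lines could be parsed as below. This is a minimal sketch, and the convention that a line wrapped in `#` characters is a RegEx while everything else is a plain URL is just one possible reading of the idea:

```python
import re

def parse_block_file(path):
    """Split a block file into exact-URL blocks and compiled regex blocks."""
    urls, regexes = set(), []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Assumed convention: '#...#' wraps a RegEx, anything else is a URL.
            if len(line) > 2 and line.startswith("#") and line.endswith("#"):
                regexes.append(re.compile(line[1:-1]))
            else:
                urls.add(line)
    return urls, regexes

def is_blocked(url, urls, regexes):
    return url in urls or any(r.search(url) for r in regexes)
```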
The current crawl engine has the block regexes in a file that is read once on startup. But regex blocking is also quite dangerous, in that it's easy to make a mistake that breaks things or blocks too much, i.e. deployment should be somewhat manual rather than driven directly from W3ACT. This means it's not clear how best to manage them at present.
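A manual deployment step could at least run a sanity check before the file goes live. A minimal sketch, assuming one regex per line and a small sample of known-good URLs that must never be blocked:

```python
import re
import sys

def validate_block_regexes(patterns, known_good_urls):
    """Return a list of problems: patterns that don't compile, or that
    would block URLs we know should remain in scope."""
    errors = []
    for i, pattern in enumerate(patterns, start=1):
        try:
            rx = re.compile(pattern)
        except re.error as e:
            errors.append(f"line {i}: does not compile: {e}")
            continue
        hits = [u for u in known_good_urls if rx.search(u)]
        if hits:
            errors.append(f"line {i}: blocks known-good URL, e.g. {hits[0]}")
    return errors

if __name__ == "__main__":
    patterns = [l.strip() for l in open(sys.argv[1]) if l.strip()]
    problems = validate_block_regexes(patterns, ["https://www.bl.uk/"])
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # refuse to deploy a broken or over-broad block list
```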
One option would be to have the code that generates the list from W3ACT, but keep a separate file that gets mapped into the crawler (via ukwa-services) and is updated occasionally. We'd need an API script to update it too.
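That update script might look something like the following. The export URL is a placeholder (not the real W3ACT API), and the atomic swap is there so the crawler never re-reads a half-written file:

```python
import os
import tempfile
import requests

W3ACT_EXPORT_URL = "https://w3act.example/api/blocklist"  # hypothetical endpoint

def update_block_file(dest="/shared/blocklist.txt"):
    resp = requests.get(W3ACT_EXPORT_URL, timeout=60)
    resp.raise_for_status()
    # Write to a temp file alongside the target, then swap it in atomically:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest))
    with os.fdopen(fd, "w") as f:
        f.write(resp.text)
    os.replace(tmp, dest)
```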
Also, take ukwa/ukwa-heritrix#85 into account