Auto-populate Instances for a given Target based on what's in Wayback #231
How's that, Peter? |
@anjackson I wonder whether dynamically checking each target record whenever a user opens it might slow things down. Perhaps check only on creation of a new record? So: (i) User creates a new target record -> ACT checks for instances of that target in Wayback -> creates a new instance record for each, AND (ii) nightly, ACT checks the Wayback XML endpoint for instances, of any target, that do not yet have instance records (i.e. checking for instances with timestamps since the last check) -> creates a new instance record for each. @kinmanli how much work do you think this might be?
@anjackson if i was to deal with a Target with a url "https://www.gov.uk/government/publications". Is the url to wayback As it gives me "Resource Not In Archive" or is it the domain? "http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=https://www.gov.uk/" Cheers |
Hello, yes, that was the right URL for the public instance, which you can use for testing. Unfortunately, there was a problem with it, but it's fixed now. However, note that in production, this should be pointing to our internal QA Wayback (which gets to see much more content much earlier than the public Wayback) so the location of this endpoint should be configurable. |
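A minimal sketch of what a configurable endpoint could look like, assuming the location is read from a properties-style configuration. The property key `wayback.query.endpoint` and the class name are hypothetical, not taken from w3act; the fallback is the public endpoint quoted in this thread, and the target URL is appended unencoded, as in the examples above.

```java
import java.util.Properties;

final class WaybackQueryUrl {
    // Hypothetical property key; the real w3act setting name is not given in this thread.
    static final String ENDPOINT_KEY = "wayback.query.endpoint";

    // Public endpoint quoted earlier in the thread, used only as a fallback for testing.
    static final String PUBLIC_ENDPOINT =
            "http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp";

    // Builds the XML query URL for one target, using the configured endpoint if present.
    public static String build(Properties config, String targetUrl) {
        String endpoint = config.getProperty(ENDPOINT_KEY, PUBLIC_ENDPOINT);
        return endpoint + "?url=" + targetUrl;
    }
}
```

With this shape, production would only need to set the property to the internal QA Wayback, while dev and test setups fall back to the public instance.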
Checks if instance is already imported, configurable wayback url, date formatting and conversion
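For illustration, the "already imported" check and the date conversion might look roughly like the sketch below. It assumes Wayback capture timestamps in the usual 14-digit `yyyyMMddHHmmss` form and a `dd/MM/yyyy` display format; the class and method names are hypothetical and this is not the code from the commit.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class InstanceTimestamps {
    // Assumed Wayback timestamp layout: 14 digits, yyyyMMddHHmmss.
    private static final DateTimeFormatter WAYBACK =
            DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    // Assumed display format for the UI.
    private static final DateTimeFormatter DISPLAY =
            DateTimeFormatter.ofPattern("dd/MM/uuuu HH:mm:ss");

    // Converts a Wayback timestamp string into a date-time value.
    public static LocalDateTime parse(String timestamp) {
        return LocalDateTime.parse(timestamp, WAYBACK);
    }

    // Formats a Wayback timestamp for display.
    public static String display(String timestamp) {
        return parse(timestamp).format(DISPLAY);
    }

    // Returns the captured timestamps that have no instance record yet,
    // preserving order and skipping duplicates within the capture list.
    public static List<String> notYetImported(List<String> captured, Set<String> imported) {
        List<String> fresh = new ArrayList<>();
        for (String timestamp : captured) {
            if (!imported.contains(timestamp) && !fresh.contains(timestamp)) {
                fresh.add(timestamp);
            }
        }
        return fresh;
    }
}
```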
@kinmanli I created a new target, saved the record, and then hit Get New Instances. I know this site and doubt it is in the public Wayback, so this may have returned no new instances. Is that why I get the following? Execution exception[[RuntimeException: uk.bl.exception.ActException: javax.xml.bind.UnmarshalException - with linked exception: [java.io.FileNotFoundException: http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.fishbournepreschool.org.uk/]]] There should be better messaging if the check returns no instances.
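One way to get friendlier messaging, sketched under the assumption (suggested by the stack trace above) that the endpoint answers with a not-found response when a URL has no captures, which surfaces in Java as a `FileNotFoundException`. The `CaptureSource` interface and the message wording are hypothetical stand-ins, not w3act code.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

final class ImportMessages {
    /** Hypothetical stand-in for whatever fetches capture timestamps from Wayback. */
    public interface CaptureSource {
        List<String> fetch(String targetUrl) throws IOException;
    }

    // Turns the outcome of a Wayback lookup into a message suitable for the UI,
    // instead of letting the raw exception reach the user.
    public static String describeImport(CaptureSource source, String targetUrl) {
        try {
            List<String> captures = source.fetch(targetUrl);
            if (captures.isEmpty()) {
                return "No new instances found for " + targetUrl;
            }
            return captures.size() + " instance(s) found for " + targetUrl;
        } catch (FileNotFoundException e) {
            // Assumption: a missing resource means the URL is not in the archive.
            return "No instances in Wayback for " + targetUrl;
        } catch (IOException e) {
            return "Could not reach Wayback for " + targetUrl + ": " + e.getMessage();
        }
    }
}
```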
@anjackson @kinmanli just to confirm that it would be good if this endpoint was the QA Wayback soonish, so I can test the workflows |
@anjackson @kinmanli is it going to be possible to see the daily check in action before we go into production ? I imagine it might not be straightforward, but I don't think it is running so far - I would expect lots of instances to show up at http://www.webarchive.org.uk/actdev/qa/list?f=bbc.co.uk if it was |
Better import wayback message and set to QA wayback url
Friendlier message when there are no new instances or bad url. Also, set to QA wayback url |
instances in Wayback and saves them in w3act #231
@kinmanli to answer @peterwebster's question, is there a hook URL that will look for new instances for all targets? |
@anjackson @peterwebster I've created a script to do it (was thinking CRON) https://github.com/ukwa/w3act/blob/schemarefactor/waybackinstance_import.sh which calls a class Can easily attach it to a URL? |
Oh, that's fine then. I was suggesting a web hook because I thought that might be easier, but you've already written a script so that's simpler to cron. Let's ask @GilHoggarth if he'll add an act-dev hook that will run your script nightly... |
So, I should configure this 'waybackinstance_import.sh' script to run nightly. @kinmanli Given /actdev redeploys/restarts daily at 17, when should i schedule this script to run? |
@GilHoggarth depends when @peterwebster will want to start testing this functionality? |
@GilHoggarth @kinmanli well, I guess that out of normal hours is the best time, so this evening, after the service has redeployed ? No preference as to exact time |
@kinmanli Looking at the actual 'waybackinstance_import.sh' script, I'm afraid it won't work. The PLAY_HOME envar won't resolve because this script lives in /opt/w3act/, so 3 levels up is actually /, and I don't have a /tools/ directory for play 2.2.1 (even though, to clarify, we are using v2.2.1). I'm assuming therefore that your dev environment is different from the server environment.
@GilHoggarth fixed (needed to run import) |
/actdev is a "play target" service so its libraries are scattered, whereas w3act is a "play dist" service with all the libraries in one directory. So the triggering script (that runs the actual script) is slightly different on /actdev compared to w3act. Anyhoo, this is scheduled on both servers as a late night crontab, and both are being run at the moment to test that they work as expected.
/actdev script was referencing the prod.conf not application.conf, so changed this and rerunning. If run completes as expected, I'll close this ticket. |
Import on /actdev worked as expected. Closing ticket. |
Reopening so the same test is performed on w3act. |
Ran the |
@GilHoggarth if you can make my user an archivist then I'll be able to see the results. I've just checked the old listing code for instances and the results are only viewable by sysadm and archivist. Is this correct or should it be viewable by all roles @peterwebster? |
@kinmanli at which point in the UI is the view you're talking about ? |
@kinmanli Done, you're now a w3act archivist.
@GilHoggarth this looks good, with the latest ones being 24/02/2015. There are 3 from 25/02/2015 on http://www.webarchive.org.uk/actdev/wayback/*/http://www.theguardian.com/ but I guess they will get imported on the next run.
I think instances should be visible to everyone, surely? |
Yes, as users at all levels will want to see what has been crawled and when. |
@kinmanli Following Nicola's above comment I've tested https://www.webarchive.org.uk/act/instances/listbytarget/?t=790 to be accessible by the admin user and by me (gil.hoggarth, just user!) I couldn't see the results as me |
@GilHoggarth @nicolabingham my user with role 'viewer' can now view and search instances. checked in Also, I removed all trailing slashes '/' (after 'listbytarget'). So:
Is now:
Same for:
Note: To be tested after new w3act deployment
Script run completed and new imports seen - closing ticket. |
The original intention was that W3ACT would take the 'lead' URL for a site and auto-populate the instances based on what's in Wayback. e.g.
http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.bl.uk/bibliographic/ukmarc.html
Similarly, our internal Wayback instance has an XML query endpoint that allows you to look up the dates we have instances for.
The idea would be to automatically scan for new instances when a Target is visited, but to also have a special URL we can call every night that checks for new instances of all Targets.
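As a rough sketch of the lookup step, the capture dates could be pulled out of the XML query response along these lines. The element name `capturedate` and the sample document shape are assumptions about the Wayback XML output rather than something confirmed in this issue, and the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class CaptureDates {
    // Assumed name of the element holding each capture timestamp.
    static final String CAPTURE_ELEMENT = "capturedate";

    // Extracts every capture timestamp from an XML query response body.
    public static List<String> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(CAPTURE_ELEMENT);
            List<String> dates = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                dates.add(nodes.item(i).getTextContent().trim());
            }
            return dates;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers deal with one unchecked type.
            throw new IllegalArgumentException("Unparseable Wayback response", e);
        }
    }
}
```

A nightly job could then call this once per Target and create instance records only for timestamps it has not seen before.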