Auto-populate Instances for a given Target based on what's in Wayback #231

Closed
anjackson opened this issue Jan 12, 2015 · 93 comments

@anjackson
Contributor

The original intention was that W3ACT would take the 'lead' URL for a site and auto-populate the instances based on what's in Wayback. e.g.

http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.bl.uk/bibliographic/ukmarc.html

Similarly, our internal Wayback instance has an XML query endpoint that allows you to look up the dates we have instances for.

The idea would be to automatically scan for new instances when a Target is visited, but to also have a special URL we can call every night that checks for new instances of all Targets.
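
A minimal sketch of the lookup described above, in Python for brevity (W3ACT itself is a Java/Play application). The `xmlquery.jsp` URL pattern comes from the example in this comment. The response layout (`<result>` entries each holding a 14-digit `<capturedate>`) is an assumption about the Wayback XML query format, and `build_query_url` and `parse_capture_dates` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Public endpoint quoted in the comment above; production would point elsewhere.
WAYBACK_QUERY = "http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp"

def build_query_url(target_url, endpoint=WAYBACK_QUERY):
    """Return the XML query URL for one Target's lead URL."""
    # Keep ':' and '/' unescaped so the result matches the example URL form.
    return f"{endpoint}?url={quote(target_url, safe=':/')}"

def parse_capture_dates(xml_text):
    """Extract capture timestamps, assuming <result><capturedate> elements."""
    root = ET.fromstring(xml_text)
    return sorted(el.text.strip() for el in root.iter("capturedate") if el.text)

# Hand-written sample in the assumed layout, not real archive output.
SAMPLE = """<wayback><results>
  <result><capturedate>20140102030405</capturedate></result>
  <result><capturedate>20130607080910</capturedate></result>
</results></wayback>"""

print(build_query_url("http://www.bl.uk/bibliographic/ukmarc.html"))
print(parse_capture_dates(SAMPLE))
```

Each timestamp returned would correspond to one candidate Instance for the Target.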

@anjackson
Contributor Author

How's that, Peter?

@peterwebster

@anjackson I wonder whether dynamically checking each target record whenever a user opens it might slow things down. Perhaps check only on creation of a new record? So:

(i) User creates new target record -> ACT checks for instances of that target in Wayback -> creates a new instance record for each

AND

(ii) Nightly, ACT checks the Wayback XML endpoint for instances of any targets for which there are no existing instance records (i.e. instances with timestamps later than the last check) -> creates a new instance record for each.

@kinmanli how much work do you think this might be?
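
The two checks proposed above amount to a set difference between the timestamps Wayback reports and those already recorded. A rough sketch in Python, assuming instances are keyed by capture timestamp; `new_instance_timestamps` and `nightly_check` are hypothetical names and the data is made up.

```python
def new_instance_timestamps(wayback_timestamps, existing_timestamps):
    """Return timestamps present in Wayback but not yet stored, oldest first."""
    return sorted(set(wayback_timestamps) - set(existing_timestamps))

def nightly_check(targets, lookup):
    """Run the same check for every Target.

    `targets` maps a Target URL to the timestamps already recorded for it;
    `lookup` is any callable returning the Wayback timestamps for a URL.
    """
    created = {}
    for url, existing in targets.items():
        fresh = new_instance_timestamps(lookup(url), existing)
        if fresh:
            created[url] = fresh
    return created

# Made-up data: one Target has a new capture, the other has none.
fake_wayback = {
    "http://a.example/": ["20150101000000", "20150112000000"],
    "http://b.example/": ["20141201000000"],
}
targets = {
    "http://a.example/": ["20150101000000"],
    "http://b.example/": ["20141201000000"],
}
result = nightly_check(targets, lambda url: fake_wayback.get(url, []))
print(result)
```

Case (i), a newly created Target, is simply the situation where the existing list is empty, so every Wayback timestamp counts as new.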

@kinmanli
Contributor

@anjackson if I were dealing with a Target with the URL "https://www.gov.uk/government/publications", would the Wayback query URL be

"http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=https://www.gov.uk/government/publications"?

I ask because it gives me "Resource Not In Archive".

Or should it be the domain?

"http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=https://www.gov.uk/"

Cheers

@anjackson
Contributor Author

Hello, yes, that was the right URL for the public instance, which you can use for testing. Unfortunately, there was a problem with it, but it's fixed now.

However, note that in production, this should be pointing to our internal QA Wayback (which gets to see much more content much earlier than the public Wayback) so the location of this endpoint should be configurable.
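
One way to keep the endpoint configurable, sketched in Python. The environment variable name `WAYBACK_QUERY_ENDPOINT` and the QA host in the example are hypothetical; W3ACT would more likely read this setting from its own application configuration.

```python
import os

# Public instance from this thread, used as the fallback for testing.
DEFAULT_ENDPOINT = "http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp"

def wayback_endpoint(env=None):
    """Return the configured Wayback query endpoint, or the public default."""
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only setting as unset.
    return env.get("WAYBACK_QUERY_ENDPOINT", "").strip() or DEFAULT_ENDPOINT

print(wayback_endpoint({}))
print(wayback_endpoint({"WAYBACK_QUERY_ENDPOINT": "http://qa-wayback.internal.example/xmlquery.jsp"}))
```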

kinmanli added a commit that referenced this issue Jan 20, 2015
Checks if instance is already imported, configurable wayback url, date formatting and conversion
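
The date formatting and conversion mentioned in this commit presumably concerns Wayback's 14-digit `YYYYMMDDhhmmss` timestamps. A small Python sketch of such a conversion follows; the commit's actual Java code is not shown in this thread, and `parse_wayback_timestamp` is a hypothetical name.

```python
from datetime import datetime, timezone

def parse_wayback_timestamp(ts):
    """Convert a 14-digit Wayback timestamp to an aware UTC datetime."""
    if len(ts) != 14 or not ts.isdigit():
        raise ValueError(f"not a 14-digit timestamp: {ts!r}")
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

captured = parse_wayback_timestamp("20150120093000")
print(captured.isoformat())            # 2015-01-20T09:30:00+00:00
print(captured.strftime("%d/%m/%Y"))   # 20/01/2015
```
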
@peterwebster

@kinmanli I created a new target, saved the record; and then hit Get New Instances.

I know this site and doubt it is in the public Wayback, so the check may have returned no new instances. Is that why I get the following?

Execution exception[[RuntimeException: uk.bl.exception.ActException: javax.xml.bind.UnmarshalException - with linked exception: [java.io.FileNotFoundException: http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?url=http://www.fishbournepreschool.org.uk/]]]

There should be better messaging if the check returns no instances.
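
A sketch of the friendlier behaviour requested here: treat a missing resource as "no instances" rather than letting the exception reach the user. It is written in Python with a pluggable `fetch` callable so it runs without network access; the function name and the message wording are illustrative only.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError

def check_for_instances(query_url, fetch):
    """Return (xml_or_None, message) instead of raising on a failed lookup."""
    try:
        return fetch(query_url), "Lookup succeeded."
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be handled first.
        if err.code == 404:
            return None, "No instances found in Wayback for this Target."
        return None, f"Wayback returned HTTP {err.code}; please try again later."
    except URLError as err:
        return None, f"Could not reach Wayback: {err.reason}"

def fake_fetch_404(url):
    # Stand-in for urllib.request.urlopen(url).read() hitting a missing resource.
    raise HTTPError(url, 404, "Not Found", Message(), io.BytesIO())

body, message = check_for_instances("http://wayback.example/xmlquery.jsp?url=x", fake_fetch_404)
print(message)
```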

@peterwebster

@anjackson @kinmanli just to confirm that it would be good if this endpoint was the QA Wayback soonish, so I can test the workflows.

@peterwebster

@anjackson @kinmanli is it going to be possible to see the daily check in action before we go into production? I imagine it might not be straightforward, but I don't think it is running so far. If it were, I would expect lots of instances to show up at http://www.webarchive.org.uk/actdev/qa/list?f=bbc.co.uk

@peterwebster peterwebster assigned kinmanli and unassigned anjackson Jan 22, 2015
@kinmanli
Contributor

kinmanli added a commit that referenced this issue Jan 26, 2015
Better import wayback message and set to QA wayback url
@kinmanli
Contributor

Friendlier message when there are no new instances or the URL is bad. Also set to the QA Wayback URL.

@kinmanli kinmanli assigned peterwebster and unassigned kinmanli Jan 26, 2015
kinmanli added a commit that referenced this issue Jan 26, 2015
instances in Wayback and saves them in w3act

#231
@anjackson
Contributor Author

@kinmanli to answer @peterwebster's question, is there a hook URL that will look for new instances for all targets?

@kinmanli
Contributor

@anjackson @peterwebster I've created a script to do it (I was thinking of running it from cron): https://github.com/ukwa/w3act/blob/schemarefactor/waybackinstance_import.sh which calls a class.

I could easily attach it to a URL if that's preferred?

@anjackson
Contributor Author

Oh, that's fine then. I was suggesting a web hook because I thought that might be easier, but you've already written a script so that's simpler to cron. Let's ask @GilHoggarth if he'll add an act-dev hook that will run your script nightly...

@GilHoggarth
Contributor

So, I should configure this 'waybackinstance_import.sh' script to run nightly. @kinmanli Given that /actdev redeploys/restarts daily at 17:00, when should I schedule this script to run?

@kinmanli
Contributor

@GilHoggarth that depends on when @peterwebster wants to start testing this functionality.

@peterwebster

@GilHoggarth @kinmanli well, I guess that out of normal hours is the best time, so this evening, after the service has redeployed? No preference as to the exact time.

@GilHoggarth
Contributor

@kinmanli Looking at the actual 'waybackinstance_import.sh' script, I'm afraid it won't work. The PLAY_HOME envar won't resolve because this script lives at /opt/w3act/, so 3 levels up is actually /, and I don't have a /tools/ directory for play 2.2.1 (even though, to clarify, we are using v2.2.1). I'm assuming therefore that your dev environment is different from the server environment.
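
To illustrate the path problem: if the script derives PLAY_HOME by walking up from its own location, the result depends on where the script is installed. A Python sketch using pure paths follows. The developer-side location and the `tools/play-2.2.1` suffix are guesses made for illustration, since the script's contents are not quoted in this thread.

```python
from pathlib import PurePosixPath

def play_home(script_path, levels_up=3, suffix="tools/play-2.2.1"):
    """Resolve a Play home relative to the script, walking `levels_up` directories up."""
    base = PurePosixPath(script_path).parents[levels_up - 1]
    return base / suffix

# Hypothetical developer checkout versus the server location named above.
print(play_home("/home/dev/projects/w3act/waybackinstance_import.sh"))
print(play_home("/opt/w3act/waybackinstance_import.sh"))
```

On the server path the walk reaches the filesystem root, which is why the derived directory does not exist there.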

@kinmanli
Contributor

@GilHoggarth fixed (needed to run import)

@GilHoggarth
Contributor

/actdev is a "play target" service so its libraries are scattered, whereas w3act is a "play dist" service with all the libraries in one directory. So the triggering script (that runs the actual script) is slightly different on /actdev compared to w3act. Anyhoo, this is scheduled on both servers as a late night crontab, and both are being run at the moment to test that they work as expected.

@GilHoggarth
Contributor

The /actdev script was referencing prod.conf rather than application.conf, so I changed this and am rerunning. If the run completes as expected, I'll close this ticket.

@GilHoggarth
Contributor

Import on /actdev worked as expected. Closing ticket.

@GilHoggarth
Contributor

Reopening so the same test is performed on w3act.

@GilHoggarth GilHoggarth reopened this Feb 26, 2015
@GilHoggarth GilHoggarth assigned kinmanli and unassigned GilHoggarth Feb 26, 2015
@GilHoggarth
Contributor

Ran the waybackinstance_import.sh script on w3act. It logged its progress and claimed "finished". @kinmanli As you know what this does, can you please check via the UI that it has done what it should?

@kinmanli
Contributor

@GilHoggarth if you can make my user an archivist then I'll be able to see the results.

I've just checked the old listing code for instances and the results are only viewable by sysadm and archivist. Is this correct or should it be viewable by all roles @peterwebster?

@kinmanli kinmanli assigned GilHoggarth and unassigned kinmanli Feb 26, 2015
@peterwebster

@kinmanli at which point in the UI is the view you're talking about?

@GilHoggarth
Contributor

@kinmanli Done, you're now a w3act archivist.

@kinmanli
Contributor

@GilHoggarth this looks good, with the latest ones being from 24/02/2015

[Screenshot: 2015-02-27 at 11:00:47]

There are 3 from 25/02/2015 on http://www.webarchive.org.uk/actdev/wayback/*/http://www.theguardian.com/ but I guess they will get imported on the next run

@anjackson
Contributor Author

I think instances should be visible to everyone, surely?

@nicolabingham

Yes, as users at all levels will want to see what has been crawled and when.

@GilHoggarth GilHoggarth assigned kinmanli and unassigned GilHoggarth Feb 27, 2015
@GilHoggarth
Contributor

@kinmanli Following Nicola's comment above, I've tested whether https://www.webarchive.org.uk/act/instances/listbytarget/?t=790 is accessible as the admin user and as me (gil.hoggarth, just a 'user'!). I couldn't see the results as me.

@kinmanli
Contributor

@GilHoggarth @nicolabingham my user with role 'viewer' can now view and search instances.

checked in

[Screenshot: 2015-02-27 at 13:00:24]
[Screenshot: 2015-02-27 at 12:58:54]

Also, I removed all trailing slashes '/' (after 'listbytarget'). So:

https://www.webarchive.org.uk/act/instances/listbytarget/?t=790

Is now:

https://www.webarchive.org.uk/act/instances/listbytarget?t=790

Same for:

https://www.webarchive.org.uk/act/instances/list?f=http%3A%2F%2Fwww.islamic-relief.or
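
The route change above (dropping the slash before the query string) can be expressed as a small URL normalisation. This Python sketch only illustrates the before/after forms; it is not how the W3ACT routes were actually changed, and `strip_trailing_slash` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_trailing_slash(url):
    """Remove a trailing '/' from the path, leaving query and fragment intact."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"   # keep a bare root path as '/'
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(strip_trailing_slash("https://www.webarchive.org.uk/act/instances/listbytarget/?t=790"))
```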

@kinmanli kinmanli assigned GilHoggarth and unassigned kinmanli Feb 27, 2015
@GilHoggarth
Contributor

Note: to be tested after the new w3act deployment.

@GilHoggarth
Contributor

https://www.webarchive.org.uk/act/instances/listbytarget?t=790 can now be viewed by a 'user' account and by the admin account, but not by a visitor who is not logged in.
Running the import script again to check that new data still comes in...
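
The access rule arrived at in this thread (any logged-in role may view instances, anonymous visitors may not) could be sketched as follows in Python. The role names are those mentioned in the comments; the function and the use of `None` for a visitor who is not logged in are assumptions, not W3ACT's actual implementation.

```python
# Roles named in this thread; the real W3ACT role list may differ.
KNOWN_ROLES = {"sysadm", "archivist", "user", "viewer"}

def can_view_instances(role):
    """Return True for any recognised logged-in role, False otherwise."""
    return role is not None and role in KNOWN_ROLES

for who in ("sysadm", "user", "viewer", None):
    print(who, can_view_instances(who))
```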

@GilHoggarth
Contributor

Script run completed and new imports seen - closing ticket.


5 participants