
Ingest corporate archive images #1126

Open
jcateswellcome opened this issue May 1, 2024 · 12 comments
jcateswellcome commented May 1, 2024

To support the accession of corporate photography into the archive, we would like to find a way to automatically ingest them in Archivematica.

https://www.notion.so/wellcometrust/Ingest-corporate-archive-images-09d2b2fc47b846a0a377900a6c7e386d?pvs=4

@paul-butcher (Contributor)

Questions - just to make sure I understand what I'm supposed to be doing:

  1. Broadly speaking, this task is to write something that will...
    • iterate over the list of shoots, fetching each one from S3 as a folder - for each one
      • throw away the two redundant metadata files
      • create the appropriate metadata file for ingest
      • zip that all up
      • stick it in the right place on S3 for Archivematica to consume it
  2. There is something about "if a shoot needs to be broken up" - is that a size constraint? If so, what is it, or do I have to just suck it and see?
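The steps above could be sketched roughly like this. A minimal sketch only: the redundant filenames in `REDUNDANT`, the metadata.csv columns, and the `objects/` layout are all assumptions, and the S3 download/upload around it is left out.

```python
import csv
import io
import zipfile

# Hypothetical names for the two redundant per-shoot metadata files;
# the real filenames come from the shoots themselves.
REDUNDANT = {"info.xml", "checksums.md5"}

def package_shoot(shoot_id, files):
    """files: dict of filename -> bytes for one shoot.

    Drops the redundant metadata files, adds a minimal metadata.csv,
    and returns the zipped package as bytes, ready to be put where
    Archivematica will consume it.
    """
    keep = {name: data for name, data in files.items() if name not in REDUNDANT}

    # Build the ingest metadata file in memory.
    meta = io.StringIO()
    writer = csv.writer(meta)
    writer.writerow(["filename", "dc.identifier"])  # assumed columns
    for name in sorted(keep):
        writer.writerow([f"objects/{name}", shoot_id])

    # Zip it all up.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("metadata/metadata.csv", meta.getvalue())
        for name, data in keep.items():
            z.writestr(f"objects/{name}", data)
    return buf.getvalue()
```

Uploading the returned bytes to the transfer bucket would then be a single `put_object` call.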

paul-butcher (Contributor) commented May 23, 2024

Regarding the Glacier aspect, I think we can trigger a Bulk retrieval and then use notifications to trigger the next step.
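A sketch of that trigger, assuming a boto3 S3 client is passed in; the bucket/key layout is an assumption, and the notification wiring (an `s3:ObjectRestore:Completed` bucket notification feeding the next step) isn't shown.

```python
def restore_request(days=7, tier="Bulk"):
    # Request body for S3 restore-from-Glacier; Bulk is the cheapest
    # retrieval tier (and the slowest, typically 5-12 hours).
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

def restore_shoot(s3, bucket, keys, days=7):
    """Kick off a Bulk restore for every object in one shoot.

    `s3` is a boto3 S3 client; completion is signalled later via the
    bucket's ObjectRestore notification, not by this call returning.
    """
    for key in keys:
        s3.restore_object(Bucket=bucket, Key=key,
                          RestoreRequest=restore_request(days))
```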

@paul-butcher (Contributor)

There is a mention of this forming part of a goal that a "Repeatable pipeline is established for future ingests from S3 buckets" - are there other ingests expected in the near future? Knowing this would help me work out what kinds of options/parameters I might need to support.

@jcateswellcome (Author)

> There is a mention of this forming a "Repeatable pipeline is established for future ingests from S3 buckets" - are there some other ingests expected in the near future?

@paul-butcher my understanding is that the next, and most likely, ingest would be when the next year of corporate photography shoots is accessioned, so it is likely to be a largely similar/uniform kind of thing from a similar sort of place - if that is vaguely precise enough for now.

@jcateswellcome (Author)

> Questions - just to make sure I understand what I'm supposed to be doing:
>
> 1. Broadly speaking, this task is to write something that will...
>    • iterate over the list of shoots, fetching each one from S3 as a folder - for each one
>      • throw away the two redundant metadata files
>      • create the appropriate metadata file for ingest
>      • zip that all up
>      • stick it in the right place on S3 for Archivematica to consume it
> 2. There is something about "if a shoot needs to be broken up" - is that a size constraint? If so, what is it, or do I have to just suck it and see?

  1. Yes, sounds right.
  2. I think suck it and see - this will come down to Archivematica's ingest limitations. I can try to find out from Ashley whether there is a known or approximate number here.

@paul-butcher (Contributor)

> There is a mention of this forming a "Repeatable pipeline is established for future ingests from S3 buckets" - are there some other ingests expected in the near future?

That's great. At the lowest level, the main thing I was wondering about is whether Glacier will normally be involved, or whether, if it's only involved occasionally, it might be easiest to do that bit manually. I don't need an answer on this, as I'll probably work it out as I go along.

paul-butcher (Contributor) commented May 24, 2024

Are there any folders in this format that are currently not in Glacier (i.e. less than six months old)? Not necessarily on the list - just something where I can run a realistic not-quite-end-to-end test.

No worries - I've just put a folder under ST for it.

@jcateswellcome (Author)

Don't know - assume not. I am out on Tuesday/Wednesday, so I suggest you get in touch with Ashley.

@paul-butcher (Contributor)

Ah. I've just spotted the minor wrinkle that the two buckets are in different accounts. That's a little bit of a pain.

paul-butcher (Contributor) commented May 29, 2024

Archivematica can be a bit flaky when ingesting large amounts of data. We may need to do some kind of retry.

The limit on ingest is by number of files per packet - Ashley recalls that the maximum is probably 500. I will set the maximum to 250 in order to steer well clear of that.

If we have to retry because of ephemeral issues, I'd like to be pretty sure we aren't also failing because the packages are too big.
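Keeping packets under that cap could be a simple chunking step when building the zips; a sketch, with the 250 default taken from the comment above and filenames standing in for whatever listing a shoot produces.

```python
def split_shoot(filenames, max_files=250):
    """Split one shoot's file list into packets of at most max_files,
    so every zip stays well clear of the recalled ~500-file
    Archivematica limit."""
    return [filenames[i:i + max_files]
            for i in range(0, len(filenames), max_files)]
```

A 600-file shoot would then become three packets of 250, 250 and 100 files, each zipped and uploaded separately.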

paul-butcher (Contributor) commented Jun 11, 2024

I assume that the target for these is wellcomecollection-workflow-(stage-)?upload, with the presence of stage- depending on whether the process is being run for real or just to try it out.

Should it go into some (new?) subfolder in that bucket?

When ephemeral failures occur, is it just a matter of moving a zip from `failed` back to where it was originally uploaded, or do I have to store the zips elsewhere in order to resubmit them?

If they fail for a legitimate reason, would it be appropriate to download the failed zip from there, modify/split it, then upload? (as opposed to storing the zips elsewhere and fetching the one that corresponds to the failure)

aray-wellcome commented Jun 13, 2024

If we want to practice this (and I think we should, because it's really hard to delete things in storage if you mess up), then it'll need to go into the born-digital-accessions folder of /wellcomecollection-archivematica-staging-transfer-source. No other subfolders are needed - just put all zips into born-digital-accessions.

The /wellcomecollection-archivematica-staging-transfer-source bucket doesn't have a failed folder. You'll get either a success or a failure log for each zip you put in. In the case of a failure, you can open the log and see what's wrong with it. The Lambda that produces the logs looks for issues with the metadata.csv and the structure of the zip, I think.

If you need to resubmit the zip because it failed in Archivematica (usually because Archivematica fell over rather than anything being legitimately wrong with the zip), you can just copy the zip into the same location; it'll overwrite, and the Lambda should pick it up again.

If they fail for a legitimate reason, then you should be able to pick them back up out of /wellcomecollection-archivematica-staging-transfer-source, but we should check that it's not set to automatically clean up the successful items. I feel like Alex Chan did have some sort of cleanup code on this bucket, but I have no idea whether it's actually there and, if it is, whether it's still working.
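One wrinkle with the resubmit-by-copy approach: S3 rejects a copy of an object onto itself unless something about it changes, so a self-copy needs `MetadataDirective="REPLACE"`. A sketch, assuming a boto3 S3 client and that the Lambda's trigger includes `ObjectCreated:Copy` events; the marker metadata key is made up for illustration.

```python
def resubmit(s3, bucket, key):
    """Overwrite a failed zip with itself so the transfer Lambda
    picks it up again.

    S3 refuses an exact in-place self-copy, so we replace the object
    metadata with a marker to make the copy legal; the object bytes
    are unchanged.
    """
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        MetadataDirective="REPLACE",
        Metadata={"resubmitted": "true"},  # hypothetical marker
    )
```

If the bucket's cleanup behaviour turns out to delete successful items, the zips would need to be kept elsewhere and re-uploaded with `put_object` instead.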
