
Ingest corporate archive images #1126

Open
jcateswellcome opened this issue May 1, 2024 · 12 comments
jcateswellcome commented May 1, 2024

To support the accession of corporate photography into the archive, we would like to find a way to automatically ingest them in Archivematica.

https://www.notion.so/wellcometrust/Ingest-corporate-archive-images-09d2b2fc47b846a0a377900a6c7e386d?pvs=4

@paul-butcher (Contributor)

Questions - just to make sure I understand what I'm supposed to be doing:

  1. Broadly speaking, this task is to write something that will...
    • iterate over the list of shoots, fetching each one from S3 as a folder - for each one
      • throw away the two redundant metadata files
      • create the appropriate metadata file for ingest
      • zip that all up
      • stick it in the right place on S3 for Archivematica to consume it
  2. There is something about "if a shoot needs to be broken up" - is that a size constraint? If so, what is it, or do I have to just suck it and see?
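The steps above could be sketched roughly like this. A minimal sketch only: the redundant filenames in `REDUNDANT`, the metadata.csv columns, and the `objects/` layout are all assumptions, and the S3 download/upload around it is left out.

```python
import csv
import io
import zipfile

# Hypothetical names for the two redundant per-shoot metadata files;
# the real filenames come from the shoots themselves.
REDUNDANT = {"info.xml", "checksums.md5"}

def package_shoot(shoot_id, files):
    """files: dict of filename -> bytes for one shoot.

    Drops the redundant metadata files, adds a minimal metadata.csv,
    and returns the zipped package as bytes, ready to be put where
    Archivematica will consume it.
    """
    keep = {name: data for name, data in files.items() if name not in REDUNDANT}

    # Build the ingest metadata file in memory.
    meta = io.StringIO()
    writer = csv.writer(meta)
    writer.writerow(["filename", "dc.identifier"])  # assumed columns
    for name in sorted(keep):
        writer.writerow([f"objects/{name}", shoot_id])

    # Zip it all up.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("metadata/metadata.csv", meta.getvalue())
        for name, data in keep.items():
            z.writestr(f"objects/{name}", data)
    return buf.getvalue()
```

Uploading the returned bytes to the transfer bucket would then be a single `put_object` call.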

paul-butcher (Contributor) commented May 23, 2024

Regarding the Glacier aspect, I think we can trigger a Bulk retrieval and then use notifications to trigger the next step.
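A sketch of that trigger, assuming a boto3 S3 client is passed in; the bucket/key layout is an assumption, and the notification wiring (an `s3:ObjectRestore:Completed` bucket notification feeding the next step) isn't shown.

```python
def restore_request(days=7, tier="Bulk"):
    # Request body for S3 restore-from-Glacier; Bulk is the cheapest
    # retrieval tier (and the slowest, typically 5-12 hours).
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

def restore_shoot(s3, bucket, keys, days=7):
    """Kick off a Bulk restore for every object in one shoot.

    `s3` is a boto3 S3 client; completion is signalled later via the
    bucket's ObjectRestore notification, not by this call returning.
    """
    for key in keys:
        s3.restore_object(Bucket=bucket, Key=key,
                          RestoreRequest=restore_request(days))
```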

@paul-butcher (Contributor)

There is a mention of this forming part of a goal that a "Repeatable pipeline is established for future ingests from S3 buckets" - are there other ingests expected in the near future? Knowing this would help me work out what kinds of options/parameters I might need to support.

@jcateswellcome (Author)

> There is a mention of this forming a "Repeatable pipeline is established for future ingests from S3 buckets" - are there some other ingests expected in the near future?

@paul-butcher my understanding is that the next, and most likely, ingest would be when the next year of corporate photography shoots is accessioned, so it is likely to be a largely similar/uniform kind of thing from a similar sort of place - if that is vaguely precise enough for now.

@jcateswellcome (Author)

> Questions - just to make sure I understand what I'm supposed to be doing:
>
> 1. Broadly speaking, this task is to write something that will...
>    • iterate over the list of shoots, fetching each one from S3 as a folder - for each one
>      • throw away the two redundant metadata files
>      • create the appropriate metadata file for ingest
>      • zip that all up
>      • stick it in the right place on S3 for Archivematica to consume it
> 2. There is something about "if a shoot needs to be broken up" - is that a size constraint? If so, what is it, or do I have to just suck it and see?

  1. Yes, sounds right.
  2. I think suck it and see - this will come down to Archivematica's ingest limitations. I can try to find out from Ashley whether there is a known or approximate number here.

@paul-butcher (Contributor)

> There is a mention of this forming a "Repeatable pipeline is established for future ingests from S3 buckets" - are there some other ingests expected in the near future?

That's great. At the lowest level, the main thing I was wondering about is whether Glacier will normally be involved, or whether, if it's only involved occasionally, it might be easiest to do that bit manually. I don't need an answer on this, as I'll probably work it out as I go along.

paul-butcher (Contributor) commented May 24, 2024

Are there any folders in this format that are currently not in Glacier (i.e. less than six months old)? Not necessarily on the list - just something where I can run a realistic not-quite-end-to-end test.

No worries - I've just put a folder under ST for it.

@jcateswellcome (Author)

Don't know - assume not. I am out on Tuesday/Wednesday, so I suggest you get in touch with Ashley.

@paul-butcher (Contributor)

Ah. I've just spotted the minor wrinkle that the two buckets are in different accounts. That's a little bit of a pain.

paul-butcher (Contributor) commented May 29, 2024

Archivematica can be a bit flaky when ingesting large amounts of data. We may need to do some kind of retry.

The limit on ingest is by number of files per packet - Ashley recalls that the maximum is probably 500. I will set the maximum to 250 in order to steer well clear of that.

If we have to retry because of ephemeral issues, I'd like to be pretty sure we aren't also failing because the packages are too big.
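Keeping packets under that cap could be a simple chunking step when building the zips; a sketch, with the 250 default taken from the comment above and filenames standing in for whatever listing a shoot produces.

```python
def split_shoot(filenames, max_files=250):
    """Split one shoot's file list into packets of at most max_files,
    so every zip stays well clear of the recalled ~500-file
    Archivematica limit."""
    return [filenames[i:i + max_files]
            for i in range(0, len(filenames), max_files)]
```

A 600-file shoot would then become three packets of 250, 250 and 100 files, each zipped and uploaded separately.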

paul-butcher (Contributor) commented Jun 11, 2024

I assume that the target for these is wellcomecollection-workflow-(stage-)?upload, with the presence of stage- depending on whether the process is being run for real or just to try it out.

Should it go into some (new?) subfolder in that bucket?

When ephemeral failures occur, is it just a matter of moving a zip from `failed` back to where it was originally uploaded, or do I have to store the zips elsewhere in order to resubmit them?

If they fail for a legitimate reason, would it be appropriate to download the failed zip from there, modify/split it, then upload? (as opposed to storing the zips elsewhere and fetching the one that corresponds to the failure)

aray-wellcome commented Jun 13, 2024

If we want to practice this (and I think we should, because it's really hard to delete things in storage if you mess up), then it'll need to go into the born-digital-accessions folder of /wellcomecollection-archivematica-staging-transfer-source. No other subfolders are needed - just put all zips into born-digital-accessions.

The /wellcomecollection-archivematica-staging-transfer-source bucket doesn't have a failed folder. You'll get either a success or a failure log for each zip you put in. In the case of a failure, you can open the log and see what's wrong with it. The Lambda that produces the logs looks for issues with the metadata.csv and the structure of the zip, I think.

If you need to resubmit the zip because it failed in Archivematica (usually because Archivematica fell over rather than anything being legitimately wrong with the zip), you can just copy the zip into the same location; it'll overwrite, and the Lambda should pick it up again.

If they fail for a legitimate reason, then you should be able to pick them back up out of /wellcomecollection-archivematica-staging-transfer-source, but we should check that it's not set to automatically clean up the successful items. I feel like Alex Chan did have some sort of cleanup code on this bucket, but I have no idea whether it's actually there and, if it is, whether it's still working.
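One wrinkle with the resubmit-by-copy approach: S3 rejects a copy of an object onto itself unless something about it changes, so a self-copy needs `MetadataDirective="REPLACE"`. A sketch, assuming a boto3 S3 client and that the Lambda's trigger includes `ObjectCreated:Copy` events; the marker metadata key is made up for illustration.

```python
def resubmit(s3, bucket, key):
    """Overwrite a failed zip with itself so the transfer Lambda
    picks it up again.

    S3 refuses an exact in-place self-copy, so we replace the object
    metadata with a marker to make the copy legal; the object bytes
    are unchanged.
    """
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        MetadataDirective="REPLACE",
        Metadata={"resubmitted": "true"},  # hypothetical marker
    )
```

If the bucket's cleanup behaviour turns out to delete successful items, the zips would need to be kept elsewhere and re-uploaded with `put_object` instead.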
