Analysis to investigate requirement for implementing OAI-PMH #795

amyehodge · 2023-11-07T21:02:53Z

In particular, we want to test our implementation to make sure content is getting picked up by Unpaywall, which uses OAI-PMH. Rochelle mentioned that "Unpaywall feeds OA publications to Web of Science and Scopus, among others, so having a connection here would make a big impact."

We also want our OAI-PMH implementation to meet the requirements for CORE (https://core.ac.uk). These requirements can be found at https://docs.google.com/document/d/1sc8RSAhJT4kmYxUKgKvNbPSegejosr83VIqOo6CR7mY/edit.

This is some information I collected while researching requirements for Unpaywall that may be useful. The idea was to get them to pick up the OA versions of published articles that we have in SDR. Note that we may not actually have any items that have the necessary metadata, but we may be able to create/enhance a few sample items to have something for testing purposes.

Unpaywall (which uses OAI-PMH) provides information about the format of the data they provide.
Having the article's relation to the version published by a journal explicitly stated in the metadata appears to be required for harvesting of content by Unpaywall.
They are using DRIVER Guidelines v2.0 that specify how the relation is expressed.
There is a vocabulary for the versions.
There is a full example of implementation of this.
This form can be used to request that Unpaywall track our content.
The above form includes this info:
Your repository's OAI-PMH endpoint
Test the URL using "https://api.unpaywall.org/repository/endpoint/test/YOURURL" and make sure it says "SUCCESS" for both checks. Example of a working test for the OAI-PMH endpoint http://serval.unil.ch/oaiprovider can be seen at https://api.unpaywall.org/repository/endpoint/test/http://serval.unil.ch/oaiprovider
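
For anyone scripting this check, here is a minimal Python sketch of how the test URL is put together. It assumes only what the form states: the endpoint URL is appended unencoded to the test prefix, as in the serval.unil.ch example. The function name is made up for illustration.

```python
# Prefix taken from the Unpaywall form text quoted above.
UNPAYWALL_TEST_PREFIX = "https://api.unpaywall.org/repository/endpoint/test/"


def unpaywall_test_url(oai_endpoint: str) -> str:
    """Return the Unpaywall test URL for an OAI-PMH endpoint.

    The form's own example appends the endpoint as-is (no percent-encoding),
    so this sketch does the same.
    """
    return UNPAYWALL_TEST_PREFIX + oai_endpoint.strip()


print(unpaywall_test_url("http://serval.unil.ch/oaiprovider"))
```

Opening the printed URL in a browser should show the result of both checks.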

lwrubel · 2023-11-07T21:08:45Z

This is a research / analysis ticket to determine what would be involved in setting up an OAI-PMH endpoint that would support Unpaywall's harvesting requirements above.

justinlittman · 2023-11-08T13:59:24Z

I'm confused. Does Unpaywall need the DOI / identifier of the published article? If so, how/where would that be recorded in SDR?

amyehodge · 2023-11-08T15:18:46Z

@justinlittman I know that currently we aren't collecting that info in H2. I'm wondering if @arcadiafalcone or @andrewjbtw know of any items that might currently have this metadata in the right format.

I have some preliminary designs for implementing the collection of this information via H2 that I worked on with Rochelle, but that work did not make it into the last work cycle. But if we can get OAI-PMH set up, then when we start collecting the metadata there it can immediately be harvested.

justinlittman · 2023-11-08T17:13:30Z

Is there a specification of what "this metadata in the right format" means? That would be a helpful place to start.

justinlittman · 2023-11-08T17:33:28Z

Also, if I'm not mistaken, implementing OAI-PMH would require some sort of datastore. Since PURL doesn't have a datastore by design, this would need to be a completely separate application and would require some mechanism for keeping in sync with publishing.

justinlittman · 2023-11-08T17:49:06Z

https://github.com/code4lib/ruby-oai

amyehodge · 2023-11-08T19:28:42Z

> Is there a specification of what "this metadata in the right format" means? That would be a helpful place to start.

There are two examples in the Driver guidelines at https://wiki.surfnet.nl/display/DRIVERguidelines/DC+-+RELATION+and+Linking+related+objects of implementations of the relationship metadata.

There is also a vocabulary for expressing the types of article versions:
https://wiki.surfnet.nl/display/DRIVERguidelines/Version+vocabulary

Here's one of the examples for a paper that has been submitted for peer review:
<oai_dc:dc>
<dc:identifier>http://hdl.handle.net/1234/1111</dc:identifier>
<dc:type>info:eu-repo/semantics/paper</dc:type>
<dc:type>info:eu-repo/semantics/submittedVersion</dc:type>
<dc:relation>http://hdl.handle.net/1234/2222</dc:relation>
</oai_dc:dc>
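
As a rough illustration of how a record like this could be generated, here is a Python sketch using the standard library's `xml.etree.ElementTree`. The `build_oai_dc` helper is hypothetical; the two namespace URIs are the standard ones for `oai_dc` and Dublin Core elements, and the values are copied from the DRIVER example above.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Use the conventional prefixes when serializing.
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def build_oai_dc(identifier, types, relations):
    """Serialize one oai_dc record with identifier, type and relation elements."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(name, text):
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = text

    add("identifier", identifier)
    for type_uri in types:
        add("type", type_uri)
    for related in relations:
        add("relation", related)
    return ET.tostring(root, encoding="unicode")


record = build_oai_dc(
    "http://hdl.handle.net/1234/1111",
    ["info:eu-repo/semantics/paper", "info:eu-repo/semantics/submittedVersion"],
    ["http://hdl.handle.net/1234/2222"],
)
print(record)
```

In SDR the inputs would have to come from Cocina or MODS, which is the open question in this thread.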

justinlittman · 2023-11-08T19:54:15Z

I meant in cocina and/or mods.

amyehodge · 2023-11-08T20:55:42Z

@arcadiafalcone do you know the answer to @justinlittman 's question above?

andrewjbtw · 2023-11-08T20:59:50Z

I'm not aware of metadata where the DOI of the published version/version of record is specifically identified in Cocina. I'm sure people have included it in related item links at times but probably not in a way that sets it apart from other related items.

andrewjbtw · 2023-11-08T21:19:22Z

Here's an example where the article links to the published version in the "related items": https://purl.stanford.edu/bw723vz5327 (also links to it in the abstract)

amyehodge · 2023-11-08T21:20:06Z

If anyone is interested, I have two related requirements docs for H2 around this point at https://docs.google.com/document/d/1kk-jHgkovZ6ghKxPvmcEcwHthDYwJrnLyAH5ods8YWE/edit and https://docs.google.com/document/d/1Ci5BFpTfhw5QyDWkOfACfyvznm3YkHYOpq3XXsFqlYo/edit. I think the first one is what we would need here for H2 content. But it didn't make it into the last H2 work cycle and may be too complex to include here. It never got to the point of discussion with developers, and maybe not even with Arcadia, so likely needs work still to be actionable.

arcadiafalcone · 2023-11-09T17:28:25Z

It would be possible to represent the above example in MODS, if the metadata is collected from the user or derived from the linked resource.

lwrubel · 2023-11-14T19:41:54Z

Current status is waiting on information about other aggregators or use cases OAI-PMH would support (now or in the future). @amyehodge is finding out about the frequency of content needing to be updated. Design will be dependent on what the requirements are. Preliminary design discussion included @justinlittman.

edsu · 2023-11-14T19:45:35Z

Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.

https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ

It's kind of a long shot, but it would be nice!

amyehodge · 2023-11-14T19:48:01Z

Notes from OAI-PMH implementation discussion https://docs.google.com/document/d/1LYB_0ynJoHLsEobxJgdhU7q4jWoO0wDi4XF4omOEYTk/edit

amyehodge · 2023-11-16T17:10:14Z

I have confirmed that a monthly frequency for updating the content would be acceptable. She has also expressed interest in supporting core.ac.uk. I have received an introduction to those folks and am waiting to hear who my contact there will be so I can get the technical information we require, since I can't find it on the web.

lwrubel · 2023-11-28T19:20:44Z

We need to figure out how to identify the items that are needed for each service before implementation.

justinlittman · 2023-12-01T17:14:08Z

Assuming that:

The purl filesystem is available on the OAI-PMH server for indexing.
Either the existing DC metadata could be used or a mapping could be created to the metadata formats required by CORE and Unpaywall from MODS or Cocina.
The items to be included in the CORE and Unpaywall sets can be identified in the Cocina or public XML.

One can imagine an implementation using ruby-oai that:

Performed periodic indexing from the purl filesystem as described below.
Stored set membership and pre-generated metadata in a Postgres database.
Exposed an OAI-PMH endpoint with separate sets for CORE and Unpaywall.
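
To make the "separate sets" idea concrete, here is a small Python sketch of the request a harvester would send for one set. `verb`, `metadataPrefix`, `set` and `from` are standard OAI-PMH request arguments; the base URL and the set names `core` and `unpaywall` are placeholders, not anything that has been decided.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request limited to one set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    if from_date:
        # Optional selective harvesting: only records changed since this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and set names, for illustration only.
print(list_records_url("https://oai.example.edu/oai", "unpaywall"))
print(list_records_url("https://oai.example.edu/oai", "core", from_date="2023-11-01"))
```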

To index:

The purl filesystem would be crawled. For each crawled item, if a record already exists, it would be updated. If no record exists and the item is in any set, a record would be created. If no record exists and the item is not in any set, no record would be created.
When creating or updating a record, the set membership (included, not included, deleted) for each set would be recorded and any necessary metadata would be pre-generated and stored.
For each existing record that has not been recently crawled (as determined by the updated timestamp) and is a member of any set, the purl filesystem would be checked for the item. If the item exists, the record would be updated as described above. If the item does not exist, its set memberships would be set to deleted.
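
The indexing pass described above could look roughly like the Python sketch below. It is not a design. Plain dicts stand in for the purl filesystem and the Postgres table, and `membership_fn` and `metadata_fn` are hypothetical callables for the set rules and metadata mapping that still need to be defined. Two assumptions go beyond the description: an item that drops out of a set it was in is marked deleted for that set, and when an item disappears only its included memberships are flipped to deleted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

INCLUDED, NOT_INCLUDED, DELETED = "included", "not included", "deleted"
SETS = ("core", "unpaywall")  # placeholder set names


@dataclass
class Record:
    druid: str
    memberships: dict   # set name -> INCLUDED / NOT_INCLUDED / DELETED
    metadata: str       # pre-generated metadata
    updated_at: datetime


def next_state(previous, in_set_now):
    # Assumption: leaving a set is recorded as DELETED so harvesters notice.
    if in_set_now:
        return INCLUDED
    return DELETED if previous in (INCLUDED, DELETED) else NOT_INCLUDED


def upsert(store, druid, item, membership_fn, metadata_fn, now):
    in_sets = membership_fn(item)
    existing = store.get(druid)
    if existing is None and not in_sets:
        return  # never in any set: no record is created
    previous = existing.memberships if existing else {}
    memberships = {
        s: next_state(previous.get(s, NOT_INCLUDED), s in in_sets) for s in SETS
    }
    store[druid] = Record(druid, memberships, metadata_fn(item), now)


def index(purl_items, store, membership_fn, metadata_fn, now=None):
    """purl_items and store are dicts keyed by druid (stand-ins, see above)."""
    now = now or datetime.now(timezone.utc)
    # Pass 1: crawl every item currently on the purl filesystem.
    for druid, item in purl_items.items():
        upsert(store, druid, item, membership_fn, metadata_fn, now)
    # Pass 2: records in a set that this crawl did not touch.
    for druid, record in list(store.items()):
        if record.updated_at >= now or INCLUDED not in record.memberships.values():
            continue
        if druid in purl_items:
            upsert(store, druid, purl_items[druid], membership_fn, metadata_fn, now)
        else:
            # Assumption: only INCLUDED memberships become DELETED.
            record.memberships = {
                s: DELETED if v == INCLUDED else v
                for s, v in record.memberships.items()
            }
            record.updated_at = now
```

Because pass 1 visits everything in `purl_items`, the "item exists" branch in pass 2 never fires in this in-memory version. It is kept only to mirror the steps as written, where the crawl and the recheck could see different filesystem states.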

The risk in this implementation is ruby-oai's unclear support for sets:

There is some code written to support oai-pmh "sets" in the ActiveRecord::Wrapper, but it's somewhat inflexible, and not well-documented, and as I write this I don't understand it enough to say more. See code4lib/ruby-oai#67

See https://github.com/code4lib/ruby-oai/blob/master/lib/oai/provider.rb#L266C30-L266C76

justinlittman · 2023-12-04T20:00:10Z

@amyehodge I'm closing this as the technical analysis is complete. I'd suggest that metadata analysis is still required. If you disagree on closing, feel free to re-open.

amyehodge · 2023-12-04T20:12:41Z

That's fine @justinlittman . I haven't had a chance to look at this in detail yet, but I'll try to sort out next steps to move this along. Thanks.

amyehodge · 2024-01-11T15:17:43Z

> Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.
>
> https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ

I just checked on this note and Unpaywall did respond to say that they don't support collecting data from sitemaps. Thanks for checking @edsu

lwrubel added the analysis label Nov 7, 2023

lwrubel added the blocked label Nov 14, 2023

lwrubel removed the blocked label Nov 28, 2023

justinlittman closed this as completed Dec 4, 2023

amyehodge mentioned this issue Dec 6, 2023

Metadata mappings for CORE OAI-PMH #866

Open
