Analysis to investigate requirement for implementing OAI-PMH #795

Closed
amyehodge opened this issue Nov 7, 2023 · 22 comments

@amyehodge
Collaborator

amyehodge commented Nov 7, 2023

In particular, we want to test our implementation to make sure content is getting picked up by Unpaywall, which uses OAI-PMH. Rochelle mentioned that "Unpaywall feeds OA publications to Web of Science and Scopus, among others, so having a connection here would make a big impact."

We also want our OAI-PMH implementation to meet the requirements for CORE (https://core.ac.uk). These requirements can be found at https://docs.google.com/document/d/1sc8RSAhJT4kmYxUKgKvNbPSegejosr83VIqOo6CR7mY/edit.

This is some information I collected while researching requirements for Unpaywall that may be useful. The idea was to get them to pick up the OA versions of published articles that we have in SDR. Note that we may not actually have any items that have the necessary metadata, but we may be able to create/enhance a few sample items to have something for testing purposes.

@lwrubel
Contributor

lwrubel commented Nov 7, 2023

This is a research / analysis ticket to determine what would be involved in setting up an OAI-PMH endpoint that would support Unpaywall's harvesting requirements above.
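For orientation, here is a minimal sketch of the kind of request a harvester would send to such an endpoint. The `verb`, `metadataPrefix`, `set`, and `from` parameters are standard OAI-PMH; the base URL and the `unpaywall` set name are placeholders I made up, and whether Unpaywall actually harvests by set is not something confirmed here.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # optional: restrict the harvest to one set
    if from_date:
        params["from"] = from_date  # optional: only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and set name, for illustration only.
url = list_records_url(
    "https://oai.example.org/oai", set_spec="unpaywall", from_date="2023-11-01"
)
print(url)
```

Any endpoint we build would need to answer requests of this shape, including the `from` filter, which is why the server needs to know when each record last changed.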

@justinlittman
Contributor

justinlittman commented Nov 8, 2023

I'm confused. Does Unpaywall need the DOI / identifier of the published article? If so, how/where would that be recorded in SDR?

@amyehodge
Collaborator Author

@justinlittman I know that currently we aren't collecting that info in H2. I'm wondering if @arcadiafalcone or @andrewjbtw know of any items that might currently have this metadata in the right format.

I have some preliminary designs for implementing the collection of this information via H2 that I worked on with Rochelle, but that work did not make it into the last work cycle. But if we can get OAI-PMH set up, then when we start collecting the metadata there it can immediately be harvested.

@justinlittman
Contributor

justinlittman commented Nov 8, 2023

Is there a specification of what "this metadata in the right format" means? That would be a helpful place to start.

@justinlittman
Contributor

Also, if I'm not mistaken, implementing OAI-PMH would require some sort of datastore. Since PURL doesn't have a datastore by design, this would need to be a completely separate application and would require some mechanism for keeping in sync with publishing.

@amyehodge
Collaborator Author

> Is there a specification of what "this metadata in the right format" means? That would be a helpful place to start.

The DRIVER guidelines include two example implementations of the relationship metadata: https://wiki.surfnet.nl/display/DRIVERguidelines/DC+-+RELATION+and+Linking+related+objects

There is also a vocabulary for expressing the types of article versions:
https://wiki.surfnet.nl/display/DRIVERguidelines/Version+vocabulary

Here's one of the examples for a paper that has been submitted for peer review:
<oai_dc:dc>
  <dc:identifier>[http://hdl.handle.net/1234/1111]</dc:identifier>
  <dc:type>info:eu-repo/semantics/paper</dc:type>
  <dc:type>info:eu-repo/semantics/submittedVersion</dc:type>
  <dc:relation>[http://hdl.handle.net/1234/2222]</dc:relation>
</oai_dc:dc>
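
As a rough sketch of how a record like this could be generated, the snippet below builds the same elements with Python's standard library. The two namespace URIs are the standard `oai_dc` and Dublin Core ones; the `build_record` helper is hypothetical, and the square brackets around the handles are dropped on the assumption that they are wiki link markup rather than part of the values.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Use the conventional prefixes instead of auto-generated ns0/ns1.
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def build_record(identifier, types, relation):
    """Return an oai_dc:dc record as a string (hypothetical helper)."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    ET.SubElement(root, f"{{{DC}}}identifier").text = identifier
    for value in types:
        # One dc:type for the publication type, one for the version.
        ET.SubElement(root, f"{{{DC}}}type").text = value
    # dc:relation points at the related object, as in the example above.
    ET.SubElement(root, f"{{{DC}}}relation").text = relation
    return ET.tostring(root, encoding="unicode")


record_xml = build_record(
    "http://hdl.handle.net/1234/1111",
    ["info:eu-repo/semantics/paper", "info:eu-repo/semantics/submittedVersion"],
    "http://hdl.handle.net/1234/2222",
)
print(record_xml)
```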

@justinlittman
Contributor

I meant in cocina and/or mods.

@amyehodge
Collaborator Author

@arcadiafalcone do you know the answer to @justinlittman 's question above?

@andrewjbtw

I'm not aware of metadata where the DOI of the published version/version of record is specifically identified in Cocina. I'm sure people have included it in related item links at times but probably not in a way that sets it apart from other related items.

@andrewjbtw

Here's an example where the article links to the published version in the "related items": https://purl.stanford.edu/bw723vz5327 (also links to it in the abstract)

@amyehodge
Collaborator Author

If anyone is interested, I have two related requirements docs for H2 around this point at https://docs.google.com/document/d/1kk-jHgkovZ6ghKxPvmcEcwHthDYwJrnLyAH5ods8YWE/edit and https://docs.google.com/document/d/1Ci5BFpTfhw5QyDWkOfACfyvznm3YkHYOpq3XXsFqlYo/edit. I think the first one is what we would need here for H2 content. But it didn't make it into the last H2 work cycle and may be too complex to include here. It never got to the point of discussion with developers, and maybe not even with Arcadia, so likely needs work still to be actionable.

@arcadiafalcone
Collaborator

It would be possible to represent the above example in MODS, if the metadata is collected from the user or derived from the linked resource.

@lwrubel
Contributor

lwrubel commented Nov 14, 2023

Current status is waiting on information about other aggregators or use cases OAI-PMH would support (now or in the future). @amyehodge is finding out about the frequency of content needing to be updated. Design will be dependent on what the requirements are. Preliminary design discussion included @justinlittman.

@edsu

edsu commented Nov 14, 2023

Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.

https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ

It's kind of a long shot, but it would be nice!

@amyehodge
Collaborator Author

Notes from OAI-PMH implementation discussion https://docs.google.com/document/d/1LYB_0ynJoHLsEobxJgdhU7q4jWoO0wDi4XF4omOEYTk/edit

@amyehodge
Collaborator Author

I have confirmed that updating the content monthly would be acceptable. She has also expressed interest in supporting core.ac.uk. I have received an introduction to those folks and am waiting to hear who my contact there will be, so that I can get the technical information we require, since I can't find it on the web.

@lwrubel
Contributor

lwrubel commented Nov 28, 2023

We need to figure out how to identify the items that are needed for each service before implementation.

lwrubel removed the blocked label Nov 28, 2023

@justinlittman
Contributor

Assuming that:

  1. The purl filesystem is available on the OAI-PMH server for indexing.
  2. Either the existing DC metadata can be used, or a mapping can be created from MODS or Cocina to the metadata formats required by CORE and Unpaywall.
  3. The items to be included in the CORE and Unpaywall sets can be identified in the Cocina or public XML.

One can imagine an implementation using ruby-oai that:

  • Performed periodic indexing from the purl filesystem as described below.
  • Stored set membership and pre-generated metadata in a postgres database.
  • Exposed an OAI-PMH endpoint with separate sets for CORE and Unpaywall.
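
To make the storage idea concrete, here is one possible shape for the tables. This is only a sketch: it uses SQLite from the Python standard library in place of postgres so that it runs anywhere, and the table names, column names, and sample row are all invented for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE records (
    item_id    TEXT PRIMARY KEY,  -- identifier of the purl item
    updated_at TEXT NOT NULL,     -- ISO 8601 UTC timestamp of the last indexing
    metadata   TEXT NOT NULL      -- pre-generated metadata XML
);
CREATE TABLE set_memberships (
    item_id  TEXT NOT NULL REFERENCES records(item_id),
    set_spec TEXT NOT NULL,       -- 'core' or 'unpaywall'
    status   TEXT NOT NULL
             CHECK (status IN ('included', 'not included', 'deleted')),
    PRIMARY KEY (item_id, set_spec)
);
"""


def list_records(conn, set_spec, since):
    """Rows a ListRecords request for one set would expose, including deletions."""
    return conn.execute(
        """
        SELECT r.item_id, m.status, r.metadata
        FROM records r
        JOIN set_memberships m ON m.item_id = r.item_id
        WHERE m.set_spec = ? AND m.status != 'not included' AND r.updated_at >= ?
        ORDER BY r.item_id
        """,
        (set_spec, since),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented sample data: one item that is in the Unpaywall set but not in CORE.
conn.execute(
    "INSERT INTO records VALUES (?, ?, ?)",
    ("example-item-1", "2023-11-20T00:00:00Z", "<oai_dc:dc/>"),
)
conn.executemany(
    "INSERT INTO set_memberships VALUES (?, ?, ?)",
    [
        ("example-item-1", "unpaywall", "included"),
        ("example-item-1", "core", "not included"),
    ],
)
print(list_records(conn, "unpaywall", "2023-11-01T00:00:00Z"))
```

Keeping the membership status per set means a deletion can be reported to one harvester without affecting the other.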

To index:

  1. The purl filesystem is crawled. For each crawled item, if a record already exists, it is updated. If no record exists and the item is in any set, a record is created. If no record exists and the item is not in any set, no record is created or updated.
  2. When creating or updating a record, the set membership (included, not included, deleted) for each set is recorded and any necessary metadata is pre-generated and stored.
  3. For each existing record that has not been recently crawled (as determined by the updated timestamp) and is a member of any set, the purl filesystem is checked for the item. If the item exists, the record is updated as described above. If the item does not exist, the set memberships are set to deleted.
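
The three steps could be sketched roughly as follows. Everything here is a stand-in: `records` is an in-memory dict instead of the database, `crawled` and `lookup` stand in for the purl filesystem, and `in_set` and `render` are hypothetical callables for set membership and metadata generation. Marking only the "included" memberships as deleted is my reading of step 3, not something settled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

SETS = ("core", "unpaywall")


@dataclass
class Record:
    memberships: dict     # set name -> "included", "not included" or "deleted"
    metadata: str         # pre-generated metadata
    updated_at: datetime  # when this record was last indexed


def run_index(records, crawled, lookup, in_set, render, now):
    """One indexing pass over the crawled items, then over stale records."""

    def upsert(item_id, item):
        # Step 2: record membership per set and pre-generate the metadata.
        memberships = {
            s: "included" if in_set(item, s) else "not included" for s in SETS
        }
        records[item_id] = Record(memberships, render(item), now)

    # Step 1: update existing records; create new ones only for items in a set.
    for item_id, item in crawled.items():
        if item_id in records or any(in_set(item, s) for s in SETS):
            upsert(item_id, item)

    # Step 3: re-check set members that this pass did not touch.
    for item_id, record in list(records.items()):
        stale = record.updated_at < now
        member = "included" in record.memberships.values()
        if not (stale and member):
            continue
        item = lookup(item_id)
        if item is not None:
            upsert(item_id, item)
        else:
            for s in list(record.memberships):
                if record.memberships[s] == "included":
                    record.memberships[s] = "deleted"
            record.updated_at = now


# Illustrative run with made-up items.
in_set = lambda item, s: s in item["sets"]
render = lambda item: "<record/>"
records = {}
first = datetime(2023, 11, 1, tzinfo=timezone.utc)
second = datetime(2023, 12, 1, tzinfo=timezone.utc)
run_index(records, {"a": {"sets": {"core"}}, "b": {"sets": set()}},
          lambda item_id: None, in_set, render, first)
after_first = {k: dict(v.memberships) for k, v in records.items()}
run_index(records, {}, lambda item_id: None, in_set, render, second)
print(after_first, records["a"].memberships)
```

The sketch does not cover an item that still exists but has dropped out of a set; that case would need a rule for when "included" becomes "deleted" rather than "not included".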

The risk in this implementation is ruby-oai's unclear support for sets:

> There is some code written to support oai-pmh "sets" in the ActiveRecord::Wrapper, but it's somewhat inflexible, and not well-documented, and as I write this I don't understand it enough to say more. See code4lib/ruby-oai#67

See https://github.com/code4lib/ruby-oai/blob/master/lib/oai/provider.rb#L266C30-L266C76

@justinlittman
Contributor

@amyehodge I'm closing this as the technical analysis is complete. I'd suggest that metadata analysis is still required. If you disagree on closing, feel free to re-open.

@amyehodge
Collaborator Author

That's fine @justinlittman . I haven't had a chance to look at this in detail yet, but I'll try to sort out next steps to move this along. Thanks.

@amyehodge
Collaborator Author

> Just an aside: I dropped a note on the Unpaywall discussion list to see if they support (or plan to support) collecting data from sitemaps, since that's something we are adding for Google et al.
>
> https://groups.google.com/u/1/g/unpaywall/c/AT-GkGIcoMQ

I just checked on this note, and Unpaywall did respond to say that they don't support collecting data from sitemaps. Thanks for checking, @edsu.
