# Retrieval of German Open Access repositories that hold software records

This notebook retrieves all German Open Access repositories via the Sherpa API, then checks each of them for software records.

In [1]:
import os

from dotenv import load_dotenv

# Prepare the Sherpa API key
load_dotenv()

SHERPA_TOKEN = os.getenv("SHERPA_TOKEN")

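# os.getenv returns None when the variable is unset, so test truthiness instead of len()
assert SHERPA_TOKEN, "SHERPA_TOKEN is not set"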

First, we need to get all German open access repositories from the Sherpa API.

In [2]:
import json
import requests

endpoint = "https://v2.sherpa.ac.uk/cgi/retrieve"
lmt = 100  # Page size

# Query parameters; requests URL-encodes them, including the JSON filter
params = {
  "item-type": "repository",
  "api-key": SHERPA_TOKEN,
  "format": "Json",
  "limit": lmt,
  "filter": json.dumps([["country", "equals", "de"]], separators=(",", ":")),
  "order": "id",
}

repos_de = []

for page in range(20):  # Upper bound of 20 pages, i.e. 2000 repositories
  # Page through the results, lmt items at a time
  resp = requests.get(endpoint, params={**params, "offset": page * lmt})
  resp.raise_for_status()  # Fail loudly instead of silently skipping a page

  items = resp.json()["items"]
  if not items:
    break
  repos_de.extend(items)

Now we have a list of German open access repositories saved in `repos_de`.
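
The offset-based paging used above can also be written as a small reusable generator. The following is a minimal sketch, not part of the original notebook: `iter_items`, `fake_fetch` and `fake_records` are hypothetical names, and the API call is replaced by an in-memory stand-in so that the loop logic can be checked without network access.

```python
from typing import Callable, Iterator


def iter_items(
    fetch_page: Callable[[int, int], list],
    limit: int = 100,
    max_pages: int = 20,
) -> Iterator[dict]:
    """Yield items page by page until an empty page or max_pages is reached."""
    for page in range(max_pages):
        items = fetch_page(page * limit, limit)
        if not items:
            break
        yield from items


# Hypothetical stand-in for the API: 250 fake records served in slices
fake_records = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> list:
    return fake_records[offset:offset + limit]


all_items = list(iter_items(fake_fetch))
print(len(all_items))
```

In the notebook itself, `fetch_page` would wrap the `requests.get` call and return the `items` list of the response.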

For each repository, we are interested in:

- its name
- its type (one of `undetermined`, `institutional`, `disciplinary`, `aggregating`, `governmental`)
- its URL
- the URL of its primary OAI interface
- the software it uses
- its content types (in particular, whether they include any of the relevant `software`, `datasets`, `other_special_item_types`)
- its content subjects

We can now reduce the data to the parts we are interested in:

- Relevant repository types: any that are not `aggregating`
- Repositories that list `software` among their content types
- Only the data points listed above
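
The cell below tests for `software` specifically. The same checks can be sketched as a standalone predicate that also tolerates missing keys. This is an illustration under assumptions, not the notebook's own code: `is_relevant` is a hypothetical helper, and the sample records (including the content type `journal_articles`) are made up to mirror the fields this notebook reads.

```python
def is_relevant(metadata: dict) -> bool:
    """Return True for non-aggregating repositories that list software."""
    if metadata.get("type") == "aggregating":
        return False
    return "software" in metadata.get("content_types", [])


# Hypothetical records, shaped like the fields this notebook reads
samples = [
    {"type": "institutional", "content_types": ["datasets", "software"]},
    {"type": "aggregating", "content_types": ["software"]},
    {"type": "disciplinary", "content_types": ["journal_articles"]},
    {"type": "governmental"},  # content_types missing entirely
]

print([is_relevant(m) for m in samples])
```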

In [3]:
relevant_content_types = {"software", "datasets", "other_special_item_types"}

relevant_repos = []

for repo in repos_de:
  metadata = repo["repository_metadata"]
  # Skip aggregators
  if metadata["type"] == "aggregating":
    continue
  # Keep only repositories that list software among their content types
  if "software" not in metadata["content_types"]:
    continue
  relevant_repos.append({
    "name": metadata["name"][0]["name"],  # First listed name
    "type": metadata["type"],
    "url": metadata["url"],
    "oai_url": metadata.get("oai_url"),  # None if the key is missing
    "software": metadata["software"],
    "content_types": metadata["content_types"],
    "subjects": metadata["content_subjects"],
  })

Now we have a list of relevant repositories in `relevant_repos`.
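
For an overview of such a list, records can be grouped by type and then by software. Below is a minimal sketch of one way to do that while keeping every repository that shares the same type and software combination. It uses `collections.defaultdict` from the standard library; the records `Repo A` to `Repo C` are hypothetical and only carry the keys needed for grouping.

```python
from collections import defaultdict

# Hypothetical records with the same keys as the entries in relevant_repos
sample_repos = [
    {"name": "Repo A", "type": "institutional", "software": {"name": "dspace"}},
    {"name": "Repo B", "type": "institutional", "software": {"name": "dspace"}},
    {"name": "Repo C", "type": "disciplinary", "software": {"name": "eprints"}},
]

# type -> software name -> list of repository names
grouped = defaultdict(lambda: defaultdict(list))
for repo in sample_repos:
    grouped[repo["type"]][repo["software"]["name"]].append(repo["name"])

# Sorting the keys gives a stable order, independent of insertion order
for repo_type in sorted(grouped):
    print(repo_type)
    for software_name in sorted(grouped[repo_type]):
        print(" " * 4, software_name, grouped[repo_type][software_name])
```

Appending to a list per combination avoids losing repositories when two of them share both type and software.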

Print them, grouped by type, then by software.

In [4]:
def create_nested_dict(set1, set2):
    """Map every item of set1 to a dict that maps every item of set2 to None."""
    return {item1: {item2: None for item2 in set2} for item1 in set1}

types = {r["type"] for r in relevant_repos}
software = {r["software"]["name"] for r in relevant_repos}
data = create_nested_dict(types, software)

# Note: each (type, software) slot holds one repository, so a later repository
# with the same combination replaces an earlier one
for repo in relevant_repos:
    data[repo["type"]][repo["software"]["name"]] = [
        repo["name"],
        repo["url"],
        [ct for ct in repo["content_types"] if ct in relevant_content_types],
    ]

for t in data:
    print(t)
    for s in data[t]:
        print(" " * 4, s)
        r = data[t][s]
        if r is not None:
            print(" " * 8, r[0], " -- ", r[1], " -- ", r[2])

institutional
     dspace
         DaKS - University of Kassel's research data repository  --  https://daks.uni-kassel.de/  --  ['datasets', 'software']
     other
         UFZ - Publikationsverzeichnis  --  http://www.ufz.de/publikationen  --  ['datasets', 'software']
     dspace_cris
         REPOSIT  --  https://reposit.haw-hamburg.de/  --  ['datasets', 'software', 'other_special_item_types']
     eprints
         Open Access LMU  --  http://epub.ub.uni-muenchen.de/  --  ['software', 'other_special_item_types']
     mycore
         Duisburg-Essen Publications Online  --  https://duepublico2.uni-due.de  --  ['datasets', 'software', 'other_special_item_types']
disciplinary
     dspace
         PsychArchives  --  https://www.psycharchives.org/  --  ['datasets', 'software', 'other_special_item_types']
     other
     dspace_cris
     eprints
     mycore
         KartDok - Cartography Repository  --  https://kartdok.staatsbibliothek-berlin.de/  --  ['datasets', 'software', 'other_special