# Harvesting links from VIAF (via Libraries Australia)

Trove provides links to identifiers from Libraries Australia (the Australian Bibliographic Network). VIAF, in turn, aggregates identifiers from a range of bibliographic control systems, including LA. Using the LA identifiers, we can look for matches in VIAF. If a matching VIAF record exists, we can download all of its links to records in other systems.

There are three steps:

- use the LA identifier to construct a VIAF `sourceID` url (note that the id numbers are left-padded with zeros to make a 12-digit number)
- when you request this url, you're redirected to a VIAF record, so you can get the VIAF id for the entity
- you can use the VIAF id to construct a url to the `justlinks.json` file that contains the links to other systems

So, for example, we use the LA id `58809035` to construct the url `https://viaf.org/viaf/sourceID/NLA%7C000058809035` which redirects to `https://viaf.org/viaf/93485427/`.
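
As a minimal sketch, the first and last steps come down to string formatting. The function names `viaf_source_url` and `justlinks_url` are my own, not part of any VIAF client library, and no request is made here:

```python
def viaf_source_url(la_id):
    """Build the VIAF sourceID url for a Libraries Australia id."""
    # Left-pad the id with zeros to 12 digits; %7C is the url-encoded "|"
    return f"https://viaf.org/viaf/sourceID/NLA%7C{str(la_id).zfill(12)}"


def justlinks_url(viaf_url):
    """Build the justlinks.json url from the url of a VIAF record."""
    # Strip any trailing slash first so we don't end up with a double slash
    return f"{viaf_url.rstrip('/')}/justlinks.json"


print(viaf_source_url("58809035"))
print(justlinks_url("https://viaf.org/viaf/93485427/"))
```

The middle step, following the redirect to find the VIAF id, needs a live request, so it's left to the harvesting function.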

Representations of the VIAF record include:

- RDF: `https://viaf.org/viaf/93485427/rdf.xml`
- JSON (just the linked identifiers): `https://viaf.org/viaf/93485427/justlinks.json`

I'm only harvesting the linked identifiers below, but you could also load the RDF into RDFLib and extract things like names and dates. Note that the JSON file only returns ids, rather than full urls, for sources such as ULAN, so you'll need to construct the urls from the ids yourself.
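
Here's a rough sketch of how urls might be built from the ids. The source codes (`JPG` for ULAN, `WKP` for Wikidata) and the url templates are my assumptions based on each service's public url patterns, so check them against real records before relying on them. The function name `links_to_urls` and the sample ids are made up for illustration:

```python
# Url templates keyed by VIAF source code. Both entries are assumptions;
# verify them, and add other sources as you need them.
URL_TEMPLATES = {
    "JPG": "http://vocab.getty.edu/ulan/{}",  # Getty ULAN
    "WKP": "https://www.wikidata.org/wiki/{}",  # Wikidata
}


def links_to_urls(links):
    """Convert a justlinks-style dict of ids into a dict of lists of urls."""
    urls = {}
    for source, ids in links.items():
        template = URL_TEMPLATES.get(source)
        if template is None:
            # No template for this source, so skip it
            continue
        # Values may be a single id or a list of ids
        if not isinstance(ids, list):
            ids = [ids]
        urls[source] = [template.format(i) for i in ids]
    return urls


# Made-up ids for illustration only, not taken from a real VIAF record
sample = {"viafID": "0000", "JPG": ["500000000"], "WKP": ["Q0"]}
print(links_to_urls(sample))
```

Sources without a template are silently dropped, which keeps the sketch short; you might prefer to keep the raw ids for those instead.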

In [None]:
import datetime
import json
from pathlib import Path

import requests_cache
from requests.adapters import HTTPAdapter
from tqdm.auto import tqdm
from urllib3.util.retry import Retry

# Cache responses so that re-running the harvest doesn't hit VIAF again
s = requests_cache.CachedSession()
# Retry on temporary server errors, waiting longer between each attempt
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

In [None]:
def harvest_viaf_links(trove_links):
    """Look up each Libraries Australia id in VIAF and save all the linked identifiers as JSON."""
    viaf_links = []
    acde_source_name = trove_links[0]["acde_source"]
    for link in tqdm(trove_links):
        if link["related_source"] == "AuCNLKIN":
            # LA ids are left-padded with zeros to make a 12-digit number
            viaf_cluster = f"https://viaf.org/viaf/sourceID/NLA%7C{link['related_source_id'].zfill(12)}"
            # When we request the sourceID url we're redirected to the VIAF record
            try:
                r = s.get(viaf_cluster)
                r.raise_for_status()
            except Exception:
                print(f"Not found: {link['related_source_id']}")
            else:
                acde_source = {
                    "acde_source": link["acde_source"],
                    "or_id": link["or_id"],
                }
                viaf_url = r.url
                related_source = {
                    "related_source": "VIAF",
                    "related_source_id": viaf_url,
                }
                viaf_links.append({**acde_source, **related_source})
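                # The justlinks.json file lists the identifiers from other systems
                links_url = f"{viaf_url.rstrip('/')}/justlinks.json"
                r = s.get(links_url)
                if not r.ok:
                    print(f"No links found: {viaf_url}")
                    continue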
                for k, v in r.json().items():
                    if isinstance(v, list):
                        related_source_id = " | ".join(v)
                    else:
                        related_source_id = v
                    related_source = {
                        "related_source": k,
                        "related_source_id": related_source_id,
                    }
                    viaf_links.append({**acde_source, **related_source})
    with Path(
        f"{acde_source_name.lower()}_viaf_links_{datetime.datetime.now().strftime('%Y%m%d')}.json"
    ).open("w") as json_file:
        json.dump(viaf_links, json_file)

In [None]:
# Load the already harvested DAAO Trove links
daao_trove_links = json.loads(Path("daao_trove_links_20221004.json").read_text())
harvest_viaf_links(daao_trove_links)

In [None]:
# Load the already harvested AusStage Trove links
ausstage_trove_links = json.loads(
    Path("ausstage_trove_links_20221005.json").read_text()
)
harvest_viaf_links(ausstage_trove_links)
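
Once a harvest has finished, a quick count of rows per source is a handy sanity check. This is a minimal sketch, assuming rows shaped like those written by `harvest_viaf_links`. The function name `count_by_source` and the sample rows are made up for illustration:

```python
from collections import Counter


def count_by_source(viaf_links):
    """Count the harvested rows for each related source."""
    return Counter(link["related_source"] for link in viaf_links)


# Made-up rows for illustration; in practice, load the saved JSON file instead
sample_rows = [
    {"acde_source": "DAAO", "or_id": "1", "related_source": "VIAF", "related_source_id": "https://viaf.org/viaf/0/"},
    {"acde_source": "DAAO", "or_id": "1", "related_source": "WKP", "related_source_id": "Q0"},
    {"acde_source": "DAAO", "or_id": "2", "related_source": "VIAF", "related_source_id": "https://viaf.org/viaf/1/"},
]
print(count_by_source(sample_rows).most_common())
```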