# Exploring Identifiers

Identifiers and identifier systems play an important role in data management and curation. They are also central to the FAIR Guiding Principles for scientific data and software management. In this exercise, we'll explore a few examples of commonly used identifiers and identifier systems including:

* Universally unique identifiers (UUID)
* Internet and web standard identifiers including Uniform Resource Identifiers (URIs), Uniform Resource Names (URNs), and Uniform Resource Locators (URLs)
* URL redirection services including Persistent URLs (PURLs)
* Authority-based persistent identifiers including Digital Object Identifiers (DOIs)
* Content-based identification using hashes

Identifiers and identifier systems are both technical and social. The goal of this exercise is to explore the technical side of identifiers--i.e., interacting with identifiers and identifier systems through associated software and services.

## Universally Unique Identifiers (UUIDs)

The first identifier we'll look at is the Universally Unique Identifier (UUID). Defined in [IETF RFC 4122](https://datatracker.ietf.org/doc/html/rfc4122.html), UUIDs (also sometimes called Globally Unique Identifiers or GUIDs) are fixed-size, 128-bit identifiers that are unique for practical purposes: uniqueness is not strictly guaranteed, but the likelihood of a collision is very low.

UUIDs are generated via a set of algorithms and rely on a combination of current time, network hardware identifiers, sequence numbers, names, and hashes. The RFC defines four different algorithms:
* Version 1: Based on a host ID (e.g., hardware address), current time, and sequence number
* Version 3: Based on the MD5 hash of a namespace (e.g., DNS) and a name (e.g., "domain.org")
* Version 4: Host ID, time, and sequence number are randomly generated values
* Version 5: Based on SHA-1 hash of a namespace (e.g., DNS) and a name (e.g., "domain.org")

In Python, RFC 4122 UUIDs are implemented in the built-in [uuid](https://docs.python.org/3/library/uuid.html) library.

In [1]:
import uuid

By default, UUID version 1 generates an identifier based on the [MAC address](https://en.wikipedia.org/wiki/MAC_address) of a network interface, the current time, and a sequence number. You can call the `uuid.getnode()` function and compare the returned value to your network interface MAC address (via `ifconfig` or similar). If Python cannot determine a hardware address, `getnode()` falls back to a random 48-bit number, so the values may not match on every system.

In [2]:
print(hex(uuid.getnode()))

0x6a68b35c3e54


Each time we generate a V1 or V4 UUID, we get a new value:

In [3]:
print(uuid.uuid1()) 

print(uuid.uuid1())

print(uuid.uuid4())

print(uuid.uuid4())

a9c7b03a-9d51-11ef-8924-6a68b35c3e54
a9c7b206-9d51-11ef-8924-6a68b35c3e54
4a99a7f5-c740-4d80-873d-d07778bd00dd
ec9fbbd8-dbeb-4f7c-9cca-24cfd4d54413
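
The version number and, for version 1, the host ID are encoded in the identifier itself. As a small sketch, the cell below parses two of the values printed above with `uuid.UUID` and inspects them; the variable names are only illustrative.

```python
import uuid

# Values copied from the output above
v1 = uuid.UUID("a9c7b03a-9d51-11ef-8924-6a68b35c3e54")
v4 = uuid.UUID("4a99a7f5-c740-4d80-873d-d07778bd00dd")

# The version is stored in the first hex digit of the third group
print(v1.version, v4.version)

# For version 1, the last group is the node (host ID) returned by getnode()
print(hex(v1.node))

# Every UUID is 128 bits (16 bytes), written as 32 hex digits and 4 hyphens
print(len(v1.bytes), len(str(v1)))
```

Note that the node value matches the `getnode()` output shown earlier, which is one reason version 1 UUIDs can reveal information about the machine that generated them.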


UUID versions 3 and 5 generate the UUID based on a specified namespace and name. Unlike version 1 and 4 UUIDs, version 3 and 5 UUIDs are deterministic: given the same set of input values, you will get the same UUID.

Supported namespaces include domains, URLs, ISO object IDs, and X.500 distinguished names.

In [4]:
print(uuid.uuid3(uuid.NAMESPACE_DNS, 'illinois.edu'))

print(uuid.uuid3(uuid.NAMESPACE_DNS, 'illinois.edu'))

print(uuid.uuid3(uuid.NAMESPACE_URL, 'https://ischool.illinois.edu/'))

print(uuid.uuid3(uuid.NAMESPACE_URL, 'https://ischool.illinois.edu/'))

7f641484-d6c7-3402-9054-daa5b550e1f6
7f641484-d6c7-3402-9054-daa5b550e1f6
d74714bd-ed7f-3570-8601-70ad5e613524
d74714bd-ed7f-3570-8601-70ad5e613524
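
The same behavior holds for version 5, which uses SHA-1 instead of MD5. The following sketch checks that repeated calls agree and that changing only the namespace changes the result; the variable names are illustrative.

```python
import uuid

first = uuid.uuid5(uuid.NAMESPACE_DNS, "illinois.edu")
second = uuid.uuid5(uuid.NAMESPACE_DNS, "illinois.edu")

# Same namespace and name: identical UUIDs
print(first == second)
print(first.version)

# Same name in a different namespace: a different UUID
other = uuid.uuid5(uuid.NAMESPACE_URL, "illinois.edu")
print(first == other)
```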



Unlike a simple sequence or accession number, UUIDs are intended to be globally unique across systems and time. As we will see below, they are also part of a set of core Web standards. 

UUIDs are also URNs (discussed below). A central drawback of UUIDs is that they are not globally resolvable. 

## Uniform Resource Identifiers (URI)

Next, we'll look at a class of identifiers based on a set of Web standards. 

Uniform Resource Identifier (URI) is an internet standard that defines a generic syntax for the creation of Web-based identifiers (see [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)). URIs are defined as a compact sequence of characters used to identify abstract or physical resources. URIs can be classified as a locator, a name, or both. Below are examples of commonly used URI schemes:

```
https://datatracker.ietf.org/doc/html/rfc3986
mailto:john.doe@email.com
urn:isbn:006251587X
```

URIs include both Uniform Resource Locators (URLs), commonly thought of as web addresses, and Uniform Resource Names (URNs). URLs can be used to both identify (name) and locate a resource on a network, while URNs are used only for naming.

URIs are defined by their `schemes`, which are registered with [IANA](https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml). Schemes of interest to us include:
* `http` and `https`: Hypertext Transfer Protocol
* `urn`: Uniform Resource Name
* `doi`: DOI System (DOIs are generally represented as URLs, but have their own URI scheme)
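
As a quick illustration, the scheme is simply the part of the URI before the first colon. The sketch below uses `urlparse` from Python's standard library to extract it from the example URIs above, plus a DOI written with the `doi` scheme (the DOI value is the one used for the dataset explored later in this exercise).

```python
from urllib.parse import urlparse

examples = [
    "https://datatracker.ietf.org/doc/html/rfc3986",
    "mailto:john.doe@email.com",
    "urn:isbn:006251587X",
    "doi:10.18739/A20Z70Z1H",
]

# Print the scheme that urlparse detects for each URI
schemes = [urlparse(uri).scheme for uri in examples]
for scheme, uri in zip(schemes, examples):
    print(scheme, "->", uri)
```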

In this section, we'll look at examples of URIs--both URNs and URLs--in the context of data curation.

### Uniform Resource Name (URN)

Defined in [IETF RFC 2141](https://datatracker.ietf.org/doc/html/rfc2141), URNs were originally intended to be "persistent, location-independent resource identifiers" within specific namespaces. Official namespaces are managed by the [Internet Assigned Numbers Authority](https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml).

Namespaces of interest to us include:
* `uuid`: [Universally Unique Identifiers](https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/14/) 
* `doi`: [Digital Object Identifier](https://www.iana.org/assignments/urn-formal/doi) (more below)
* `isbn`: [International Standard Book Number](https://www.iana.org/assignments/urn-formal/isbn)

UUIDs, discussed above, are defined within the `uuid` URN namespace.  We can easily format our generated UUID as a URN as follows:

In [None]:
uuid.uuid4().urn

By formatting the UUID as a URN, users that encounter these values at least know that they are UUIDs and not some other type of identifier.
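
The URN form can also be read back. As a minimal sketch, the cell below splits a URN into its parts and shows that `uuid.UUID` accepts the URN form directly; the UUID value is one of the version 4 values generated earlier.

```python
import uuid

urn = "urn:uuid:4a99a7f5-c740-4d80-873d-d07778bd00dd"

# A URN has three colon-separated parts: scheme, namespace, namespace-specific string
scheme, namespace, value = urn.split(":", 2)
print(scheme, namespace, value)

# uuid.UUID accepts the URN form and can reproduce it
parsed = uuid.UUID(urn)
print(parsed.version)
print(parsed.urn == urn)
```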

In research data management, UUIDs are often used as identifiers for datasets and items within datasets. For example, [Metacat](https://nceas.github.io/metacatui/), scientific data repository software used by the DataONE network, uses UUIDs at multiple levels.

Explore the dataset https://arcticdata.io/catalog/view/doi:10.18739/A20Z70Z1H
* Note that it has been assigned UUID `urn:uuid:23d7c5c2-e695-4aae-97c9-e84fa41368ba`
* Look at the file `CMDL_fluxes_data_readme.csv` and note that it has UUID `urn:uuid:988f6846-5378-45d2-9a15-8db398ea2d4a`

(For more information see [Identifiers in DataONE](https://releases.dataone.org/online/api-documentation-v2.0.1/design/PIDs.html)).

An important detail about UUIDs (and URNs in general, more below) is that they are not globally resolvable. There is no global service one can use to look up all assigned URNs. DataONE provides its own resolution service, which must be used to translate the UUID to something that can be accessed.

The [Data Documentation Initiative (DDI)](https://ddialliance.org/) is an international standard for social, behavioral, and economics research data. DDI has recently proposed a [URN namespace for DDI](https://www.rfc-editor.org/rfc/rfc9517.html). DDI 3.3 includes support for assigning URNs to a variety of resources including variables, survey question items, controlled vocabulary terms, etc. 

### Uniform Resource Locator (URL)

Uniform Resource Locators (URLs)--or web addresses--are compact strings used to identify and locate resources available via the Internet. URLs are defined by a scheme (e.g., `https`) followed by scheme-specific information. In this exercise, we are primarily concerned with HTTP(S) URLs, which take the general form:

```
http(s)://<host>:<port>/<path>?<query>#<fragment>
```

We can use Python's `urllib` to illustrate the differences between a URN and a URL, in this case a URL that embeds a DOI in its path. A key distinction between the two is the `netloc` value. As discussed above, URNs have no network location and are not resolvable. Network locations (host and port) are central to URLs.

In [5]:
from urllib.parse import urlparse

# Use the urlparse() method to parse each of the following and print it
print(urlparse("urn:uuid:23d7c5c2-e695-4aae-97c9-e84fa41368ba"))
print(urlparse("https://arcticdata.io/catalog/view/doi:10.18739/A20Z70Z1H"))

ParseResult(scheme='urn', netloc='', path='uuid:23d7c5c2-e695-4aae-97c9-e84fa41368ba', params='', query='', fragment='')
ParseResult(scheme='https', netloc='arcticdata.io', path='/catalog/view/doi:10.18739/A20Z70Z1H', params='', query='', fragment='')


The URL includes a network location (host `arcticdata.io` with default HTTPS port 443) that can be used to actually locate the resource with the specified URL, in this case the landing page for a dataset.
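
To make the general form above concrete, the following sketch pulls the individual components out of two URLs. The second URL, on `example.org`, is hypothetical and exists only to show an explicit port, query, and fragment.

```python
from urllib.parse import urlparse

dataset = urlparse("https://arcticdata.io/catalog/view/doi:10.18739/A20Z70Z1H")
# No explicit port in the URL, so port is None and the scheme default (443) applies
print(dataset.hostname, dataset.port)

# Hypothetical URL with every component present
example = urlparse("https://example.org:8443/data/item?format=csv#row-10")
print(example.hostname, example.port)
print(example.path, example.query, example.fragment)
```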

Unlike URLs, URNs require an external resolution service. DataONE operates a separate resolver accessible via the URL `https://cn.dataone.org/cn/v2/resolve/`.

In the next cell, we'll use the `requests` package to resolve the URN of the file `CMDL_fluxes_data_readme.csv`.

In [6]:
import requests
from IPython.display import Code

## URN (UUID) of CMDL_fluxes_data_readme.csv
## Named file_urn so that it does not shadow the imported uuid module
file_urn = "urn:uuid:988f6846-5378-45d2-9a15-8db398ea2d4a"
response = requests.get(f"https://cn.dataone.org/cn/v2/resolve/{file_urn}")
Code(response.text[0:2000])

In this case, the DataONE resolver returns the contents (CSV) of the file with the assigned URN.

It is important to note that URNs do not need to resolve. They are intended as identifiers that may outlive the services and resources they represent. However, in many cases we want to be able to resolve and access identified resources when they are still available.

## The Problem with URLs: Link Rot and Content Drift

While URLs are widely used identifiers and locators for web-based resources, long-term access requires that individuals and organizations maintain them. [Link rot](https://en.wikipedia.org/wiki/Link_rot) describes the problem that, over time, many URLs (or hyperlinks) stop working: resources are moved or renamed, domain name registrations expire, and so on.

For example, the W3C Recommendation [Architecture of the World Wide Web](https://www.w3.org/TR/webarch/) cites the following paper:

> [Eng90] [Knowledge-Domain Interoperability and an Open Hyperdocument System](http://www.bootstrap.org/augment/AUGMENT/132082.html), D. C. Engelbart, June 1990.

The citation links to the URL `http://www.bootstrap.org/augment/AUGMENT/132082.html`.

The domain `bootstrap.org` now redirects to `dougengelbart.org` and the cited paper is now available under a different path `https://dougengelbart.org/content/view/114/`.  The result of accessing the original URL is the HTTP 404 "Not Found" error.

In [7]:
requests.get("http://www.bootstrap.org/augment/AUGMENT/132082.html")

<Response [404]>

Over the decades, a number of solutions (or interventions) have been proposed to address the problem of link rot.
* Use of redirection methods (such as HTTP 301) to automatically redirect users to relocated content.
* Use of persistent identifiers such as Archival Resource Keys (ARKs), Persistent URLs (PURLs), Handles, or Digital Object Identifiers (DOIs).
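
The HTTP status codes involved in link rot and redirection can be looked up in Python's standard library. The sketch below prints the standard reason phrase for the 404 seen above, for the 301 redirect mentioned in the list, and for a few other common codes; the choice of codes is only illustrative.

```python
from http import HTTPStatus

phrases = {}
for code in (200, 301, 302, 308, 404):
    status = HTTPStatus(code)
    phrases[code] = status.phrase
    # Codes in the 3xx range tell the client to look elsewhere
    is_redirect = 300 <= code < 400
    print(code, status.phrase, "(redirect)" if is_redirect else "")
```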



## Persistent URLs (PURL)

One of the earliest approaches to addressing the problem of broken links was the use of URL redirection services. Persistent URLs (PURLs) were introduced by OCLC in the mid-1990s as an intermediate solution while the IETF worked on the standardization of URNs. PURLs themselves are HTTP URLs that, via a PURL resolver, are configured to redirect to another location. Administrators create a PURL **domain** and can configure various redirection mechanisms for **subdomains** (or subpaths).

`purl.org` accounts are free to create and domain administrators have control over how URLs are redirected.
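
The idea behind a redirection service can be sketched in a few lines. This is a toy model, not how the `purl.org` resolver is implemented: the `REDIRECTS` table, the `resolve()` helper, and the `example.org` hosts are all hypothetical. The persistent path stays fixed while the administrator updates the target it points to.

```python
from typing import Optional

# Hypothetical redirect table: persistent path prefix -> current target prefix
REDIRECTS = {
    "/project/": "https://old-host.example.org/files/",
    "/project/docs/": "https://docs.example.org/",
}

def resolve(path: str) -> Optional[str]:
    """Return the redirect target for a persistent path, using the longest matching prefix."""
    matches = [prefix for prefix in REDIRECTS if path.startswith(prefix)]
    if not matches:
        return None
    prefix = max(matches, key=len)
    return REDIRECTS[prefix] + path[len(prefix):]

before = resolve("/project/data.csv")
docs = resolve("/project/docs/intro.html")

# The content moves: only the table changes, not the persistent path
REDIRECTS["/project/"] = "https://new-host.example.org/archive/"
after = resolve("/project/data.csv")

print(before)
print(docs)
print(after)
```

The persistence of such links depends entirely on someone keeping the table up to date and the resolver running.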

Perhaps one of the more widely used PURL domains is https://purl.org/dc, used by the Dublin Core Metadata Initiative. In the example below, we can see how requests to `purl.org` go through a series of HTTP redirects before reaching the final target at `dublincore.org`.

In [8]:
# Note: purl.org was down as of 2024-11-07, so this request hung and was interrupted (see output below)
response = requests.get("https://purl.org/dc/elements/1.1/")
for resp in response.history:
    print(resp.status_code, resp.url)

print(response.status_code, response.url)

KeyboardInterrupt: 

**Historical note**: The primary PURL resolver at https://purl.org was operated by OCLC from 1995 to 2016, when [responsibility was transferred to the Internet Archive](https://blog.archive.org/2016/09/27/persistent-url-service-purl-org-now-run-by-the-internet-archive/). OCLC was no longer able to maintain the service, which was unavailable for a period of time. During this time, PURLs did not resolve.

## Digital Object Identifier (DOI)

A digital object identifier (DOI) is a persistent identifier widely used in the publishing of journal articles and, more recently, research data. DOIs are implemented using the Handle System.

For example, the following paper published in the *Data Science Journal* has been assigned the DOI `10.5334/dsj-2017-009`:

> Klump, J., & Huber, R. (2017). 20 Years of Persistent Identifiers – Which Systems are Here to Stay? *Data Science Journal*, 16, 9–9. https://doi.org/10.5334/dsj-2017-009

Using a web browser, when you click on the DOI link in the reference above, you should be taken to the article published at the following URL:

https://datascience.codata.org/articles/10.5334/dsj-2017-009

This process is called **resolution**. The DOI `10.5334/dsj-2017-009` is submitted to a service at https://doi.org that returns information about the identified paper, which is by default a redirection to the URL of the paper hosted on the publisher's website.
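
A DOI name has two parts separated by the first slash: a prefix beginning with `10.` that identifies the registrant, and a suffix chosen by that registrant. The sketch below is a small illustration of that structure; `normalize_doi()` is a hypothetical helper, not part of any DOI library.

```python
def normalize_doi(value: str) -> str:
    """Strip common URI prefixes and return the bare DOI name."""
    value = value.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    return value

doi = normalize_doi("https://doi.org/10.5334/dsj-2017-009")

# Split on the first slash only; suffixes may themselves contain slashes
prefix, suffix = doi.split("/", 1)
print(prefix, suffix)

# The resolvable URL form is the resolver address plus the DOI name
print(f"https://doi.org/{doi}")
```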

In [14]:
response = requests.get("https://doi.org/10.5334/dsj-2017-009")
for resp in response.history:
    print(resp.status_code, resp.url)

print(response.status_code, response.url)

302 https://doi.org/10.5334/dsj-2017-009
301 http://datascience.codata.org/articles/10.5334/dsj-2017-009/
308 https://datascience.codata.org/articles/10.5334/dsj-2017-009/
200 https://datascience.codata.org/articles/10.5334/dsj-2017-009


Beyond simple redirection, DOI infrastructure supports [content negotiation](https://citation.crosscite.org/docs.html), a mechanism for serving different representations of the identified resource. For example, it is possible to request a formatted citation for the resource:

In [10]:
headers = {
    'Accept': 'text/x-bibliography; style=apa'
}
resp = requests.get("https://doi.org/10.5334/dsj-2017-009", headers = headers)
Code(resp.text)

It is also possible to request only the **metadata** describing the resource:


In [11]:
headers = {
    'Accept': 'application/vnd.crossref.unixref+xml'
}
resp = requests.get("https://doi.org/10.5334/dsj-2017-009", headers = headers)
Code(resp.text[0:2000])


Returning to the DataONE example above, we now recognize that the URL for the dataset contains a DOI. This dataset was registered via DataCite, so we can request the DataCite Kernel Metadata:

In [12]:
headers = {
    'Accept': 'application/vnd.datacite.datacite+json'
}
resp = requests.get("https://doi.org/10.18739/A20Z70Z1H", headers = headers)
Code(resp.text)

Unlike UUIDs, URNs, and URIs, DOIs cannot be self-generated. The process of generating a DOI, sometimes called **minting**, is done via a *registration agency*. Registration agencies also define the required metadata. Current registration agencies include Crossref and DataCite. These are membership-driven and fee-based organizations.

## Content-Based Identifiers

The final type of identifier we explore here is the content-based (or content-addressable) identifier. Unlike authority-based identifiers, content-based identifiers are derived from the data or content that they refer to. Content-based identifiers serve multiple purposes: identical content always yields the same identifier, they can serve as a means to version data, and they can be used to validate content.

Returning again to the DataONE example above: https://arcticdata.io/catalog/view/doi:10.18739/A20Z70Z1H, recall that the file `CMDL_fluxes_data_readme.csv` was assigned UUID `urn:uuid:988f6846-5378-45d2-9a15-8db398ea2d4a`, which we can use for resolution. Looking at the record for the file, we notice that it contains an MD5 hash--used for validation purposes--with the value `09657a90ba2af3a1c31ec9d3c8f9c2a7`. The purpose of the hash is to enable the user to confirm that they have the correct file. MD5 is one of a number of commonly used hashing or digest algorithms.

In the following example, we retrieve the file contents based on the UUID using the DataONE resolver and calculate the MD5 hash, which is the same as recorded in the metadata record.

In [13]:
import hashlib

## URN (UUID) of CMDL_fluxes_data_readme.csv
## Named file_urn so that it does not shadow the imported uuid module
file_urn = "urn:uuid:988f6846-5378-45d2-9a15-8db398ea2d4a"
response = requests.get(f"https://cn.dataone.org/cn/v2/resolve/{file_urn}")

hashlib.md5(response.content).hexdigest()


'09657a90ba2af3a1c31ec9d3c8f9c2a7'
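
The comparison against the recorded value can also be done in code. The following is a minimal sketch that works offline: `matches_checksum()` is a hypothetical helper and the sample bytes stand in for a downloaded file.

```python
import hashlib

def matches_checksum(data: bytes, expected_hex: str, algorithm: str = "md5") -> bool:
    """Return True if the digest of data equals the recorded hex digest."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.lower()

# Stand-in for file contents and the checksum stored in a metadata record
sample = b"site,flux\nA,0.12\n"
recorded = hashlib.md5(sample).hexdigest()

intact = matches_checksum(sample, recorded)
corrupted = matches_checksum(sample + b"extra", recorded)
print(intact, corrupted)
```

Passing `algorithm="sha1"` or `algorithm="sha256"` checks against digests recorded with those algorithms instead.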

Now that we are familiar with UUIDs, URNs, URLs, DOIs, and content-based identifiers, let's consider the above DataONE example in detail:

> Donatella Zona. (2023). Greenhouse gas flux measurements at the zero curtain, North Slope, Alaska, 2012-2023. Arctic Data Center. [doi:10.18739/A20Z70Z1H](https://doi.org/10.18739/A20Z70Z1H), version: urn:uuid:23d7c5c2-e695-4aae-97c9-e84fa41368ba.

We see the following:
* A dataset is assigned a DOI (via DataCite)
* A dataset is also identified within DataONE by a UUID (URN)
* Each file in the dataset is assigned a UUID
* DataONE operates a resolver (https://cn.dataone.org/cn/v2/resolve/) that can be used to get the contents of a file based on the UUID
* Computed hashes (MD5 or SHA-1) are also stored for each file for validation and integrity purposes

It has been proposed that, instead of assigning UUIDs and storing the computed hash for verification purposes, the hash itself be used as an identifier. While not commonly used in data curation, content-based identifiers are widely used in other systems. For example, the Git version control system uses `commit hashes` to identify every commit (each recorded set of changes) in a repository. Object stores often name files by a hash of their contents, which improves storage efficiency by preventing storage of duplicate copies.
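
The object-store idea can be sketched with a dictionary keyed by the SHA-256 digest of each item. This is a toy in-memory model under simple assumptions, not the design of any particular system; `store` and `put()` are hypothetical names.

```python
import hashlib

store = {}  # hypothetical in-memory object store: identifier -> bytes

def put(data: bytes) -> str:
    """Store data under the SHA-256 hex digest of its contents and return that identifier."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

original = b"site,flux\nA,0.12\n"
key_a = put(original)
key_b = put(original)                 # identical content, stored only once
key_c = put(b"site,flux\nA,0.13\n")   # one character changed

print(key_a == key_b, len(store))
print(key_a == key_c)
```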

## Summary



Identifiers and identifier systems are both technical and social. Identifiers are standards that are generally implemented in software and often require significant technical infrastructure. Identifiers differ in a number of ways:
* Authority-based, non-authority based, and content-based
* Support for location resolution
* Decoupling of content identification from content location (i.e., data can move without affecting resolvability)

Simply using a persistent identifier (or what Kunze calls a *persistable* identifier) does not guarantee persistence. Social and organizational infrastructure is still required to both correctly apply identifier schemes and ensure their persistence. Auditing of identifiers is still necessary.

## Discussion Questions


Post your responses to the following questions to the Canvas discussion:
* UUIDs can optionally be represented as URNs. What information does the URN format contain that the non-URN UUID format does not?
* Describe two key differences between UUIDs and DOIs.
* Given a dataset identified using the SHA-256 of its contents, what happens if the file changes?
