This notebook works through the process of retrieving information from the ORCID registry for ORCID identifiers recorded in the GeoKB as external identifier claims. As with USGS Staff Profiles, we cache the information we want for further processing on the item discussion pages of the person items that have ORCID claims. After some experimentation with other methods, we pull the full ORCID-specific JSON structure from the ORCID API, filter it to the parts we can leverage (dropping some of the null values), and store the result as YAML for later retrieval and processing.

Note: There is a lot of extraneous metadata that comes along with the ORCID response that we don't really need, and I may eventually run some additional pre-processing to clean up the cruft.

In [1]:
import os
from wbmaker import WikibaseConnection
import requests
import yaml
from datetime import datetime
from joblib import Parallel, delayed
from tqdm import tqdm

geokb = WikibaseConnection('GEOKB_CLOUD')

# ORCID API
For periodic updates or one-off checks, we can use content negotiation on ORCID URLs, sending an Accept header without any authentication. However, ORCID starts throttling after some number of requests from the same host. I set up an application under my own user account to provide the client ID and secret needed to obtain a token for API use. We'll eventually have to look at the best way to operationalize this.
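
A minimal sketch of the unauthenticated content negotiation route, using only the standard library. The function names, the retry settings, and the set of status codes treated as throttling are assumptions for illustration, not something confirmed by the ORCID documentation or used elsewhere in this notebook.

```python
import json
import time
import urllib.error
import urllib.request

# Assumption: these status codes signal throttling; check ORCID's current docs.
RETRYABLE_STATUS = {429, 503}


def build_public_request(orcid):
    """Build an unauthenticated request that asks the ORCID URL for JSON."""
    return urllib.request.Request(
        f"https://orcid.org/{orcid}",
        headers={"Accept": "application/json"},
    )


def fetch_public_record(orcid, retries=3, wait_seconds=5.0,
                        opener=urllib.request.urlopen):
    """Fetch one record, pausing and retrying when the host is throttled."""
    for attempt in range(retries):
        try:
            with opener(build_public_request(orcid), timeout=30) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            last_attempt = attempt == retries - 1
            if err.code not in RETRYABLE_STATUS or last_attempt:
                raise
            # Wait a little longer after each throttled attempt
            time.sleep(wait_seconds * (attempt + 1))
```

This is fine for a handful of one-off checks. For the few thousand records processed here, the token-based API route is the better fit.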

In [None]:
orcid_auth_params = {
    'client_id': os.environ['ORCID_CLIENT_ID'],
    'client_secret': os.environ['ORCID_CLIENT_SECRET'],
    'grant_type': 'client_credentials',
    'scope': '/read-public'
}

orcid_auth = requests.post('https://pub.orcid.org/oauth/token', data=orcid_auth_params).json()

# The ORCID API expects the token in a standard Authorization header
orcid_search_params = {
    'Authorization': f"Bearer {orcid_auth['access_token']}",
    'Accept': 'application/json'
}

# ORCID identifiers in GeoKB
In the following query, we pull all items with ORCID external identifier claims. We optionally pull two qualifiers that indicate when the ORCID information was last retrieved and cached in the item talk pages and what HTTP status code the request to the ORCID system returned. These qualifiers could drive a message queue approach in the future, where we decide when to check for updates. There is an ORCID API method that can be used for that.

We also optionally pull the reference URL value for these person items. This is an indication of which persons are or at least have been USGS employees and those we likely want to concentrate on in pulling ORCID information. However, building more of a knowledge base for understanding other people who have "entered our orbit" as co-contributors to products may also prove useful in understanding and assessing overall capacity. For instance, are there specific skillsets or disciplines that we "obtain" through collaboration vs. having within our own staff?

In [14]:
orcid_query = """
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?item ?orcid ?retrieved ?status_code ?ref_url
WHERE {
  ?item wdt:P106 ?orcid ;
        p:P106 ?orcid_statement .
  OPTIONAL {
    ?orcid_statement pq:P139 ?retrieved ;
                     pq:P151 ?status_code .
  }
  OPTIONAL {
    ?item wdt:P31 ?ref_url .
  }
}
"""

geokb_orcids = geokb.sparql_query(orcid_query)
geokb_orcids['qid'] = geokb_orcids['item'].apply(lambda x: x.split('/')[-1])

geokb_orcids.head()

index,item,orcid,retrieved,status_code,ref_url,qid
0,https://geokb.wikibase.cloud/entity/Q44424,0000-0003-3687-5979,2023-10-20T00:00:00Z,200,https://www.usgs.gov/staff-profiles/steve-aichele,Q44424
1,https://geokb.wikibase.cloud/entity/Q44436,0000-0003-3926-6941,2023-10-20T00:00:00Z,200,https://www.usgs.gov/staff-profiles/cameron-l-...,Q44436
2,https://geokb.wikibase.cloud/entity/Q44442,0000-0002-1602-482X,2023-10-20T00:00:00Z,200,https://www.usgs.gov/staff-profiles/jason-s-al...,Q44442
3,https://geokb.wikibase.cloud/entity/Q44443,0000-0001-7708-0203,2023-10-20T00:00:00Z,200,https://www.usgs.gov/staff-profiles/katrina-alger,Q44443
4,https://geokb.wikibase.cloud/entity/Q44448,0000-0001-7426-8382,2023-10-20T00:00:00Z,200,https://www.usgs.gov/staff-profiles/andy-j-allard,Q44448
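
The retrieved and status code qualifiers returned by the query could feed a simple staleness check when deciding which records to look at again. The sketch below is one way to do that; the function name, the 90-day default, and the rule that any non-200 status is worth retrying are all assumptions. It also assumes `retrieved` arrives as an ISO string like the ones shown in the table.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(retrieved, status_code, max_age_days=90, now=None):
    """Return True when a cached ORCID record should be fetched again."""
    now = now or datetime.now(timezone.utc)

    # Missing qualifiers (None or NaN in the query results) mean never cached
    if not isinstance(retrieved, str):
        return True

    # Assumption: any status other than 200 is worth retrying
    try:
        succeeded = int(float(status_code)) == 200
    except (TypeError, ValueError):
        succeeded = False
    if not succeeded:
        return True

    # The query returns timestamps like 2023-10-20T00:00:00Z
    cached_at = datetime.fromisoformat(retrieved.replace("Z", "+00:00"))
    return now - cached_at > timedelta(days=max_age_days)
```

Applied row by row, for example with `geokb_orcids.apply(lambda r: needs_refresh(r['retrieved'], r['status_code']), axis=1)`, this yields a boolean mask of the items to reprocess.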


# Processing Functions
The following functions provide the basic capabilities needed here:
* Retrieve a single ORCID record using the API and include a "meta" object with the exact URL used, the timestamp, and the HTTP status code. In the future, we could look at bulk methods here, but the various query options are hard to use given the variety of ways that USGS staff have obtained their ORCID registrations and populated them. There's no straightforward way to do something like pull all ORCIDs for people who are employed by the USGS without a lot of mapping work.
* Filter the raw ORCID record to just those parts we can leverage right now. We may need to revisit this to clean up some cruft and/or include other aspects of the ORCID record.
* Cache the ORCID information to the GeoKB in YAML format along with anything else stored on person discussion pages and update the ORCID claim with retrieved date and status code qualifiers.

These functions can be incorporated into some other kind of future pipeline.

In [11]:
def retrieve_orcid(orcid, params):
    # params holds the request headers (bearer token and Accept type)
    meta = {
        "url": f"https://pub.orcid.org/v3.0/{orcid}/record",
        "timestamp": datetime.now().isoformat(),
        "status_code": None
    }

    r = requests.get(meta['url'], headers=params)
    meta['status_code'] = r.status_code

    if r.status_code == 200:
        return {
            'meta': meta,
            'orcid': filter_orcid(r.json())
        }

    # Anything other than a 200 returns only the request metadata
    return {
        'meta': meta
    }

def filter_orcid(orcid_response):
    # Keep only the sub-sections that are dicts with a populated last-modified-date
    def populated(section):
        return {
            k: v for k, v in section.items()
            if isinstance(v, dict) and isinstance(v.get('last-modified-date'), dict)
        }

    return {
        'history': orcid_response['history'],
        'person': populated(orcid_response['person']),
        'activities': populated(orcid_response['activities-summary']),
    }

def cache_orcid(qid, orcid, orcid_record=None):
    geokb_item = geokb.wbi.item.get(qid)
    write_qualifiers = True
    orcid_qualifiers = geokb.models.Qualifiers()
    orcid_qualifiers.add(
        geokb.datatypes.Time(
            prop_nr=geokb.prop_lookup['retrieved'],
            time=datetime.now().strftime("+%Y-%m-%dT00:00:00Z"),
        )
    )

    if orcid_record is None:
        orcid_record = retrieve_orcid(orcid, orcid_search_params)

    orcid_qualifiers.add(
        geokb.datatypes.String(
            prop_nr=geokb.prop_lookup['status code'],
            value=str(orcid_record['meta']['status_code'])
        )
    )

    if orcid_record['meta']['status_code'] == 200:
        item_talk_page = geokb.mw_site.pages[f'Item_talk:{qid}']
        person_data = yaml.safe_load(item_talk_page.text())
        if person_data is None:
            person_data = {}
        person_data['orcid'] = orcid_record

        try:
            item_talk_page.save(yaml.dump(person_data), summary='Updated person data cache with ORCID information')
            print('Saved Cache:', f"https://geokb.wikibase.cloud/wiki/Item_talk:{qid}")
        except Exception as e:
            # Skip the qualifier update so the item does not point to a cache that was never written
            write_qualifiers = False
            print('Problem saving cache:', qid, e)

    geokb_item.claims.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup['ORCID iD'],
            value=orcid,
            qualifiers=orcid_qualifiers
        ),
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    if write_qualifiers:
        response = geokb_item.write(summary='Updated qualifiers for ORCID claim')
        print('Updated ORCID Qualifiers:', f"https://geokb.wikibase.cloud/wiki/Item:{qid}")
        return response.id
    else:
        return qid
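
To make the filtering rule concrete, here is a small standalone illustration on a toy dictionary shaped like the "person" part of an ORCID record. The helper name `keep_dated_sections` and the toy values are hypothetical; the rule itself (keep only sub-sections that are dicts with a populated `last-modified-date`) mirrors the filter function above.

```python
def keep_dated_sections(section):
    """Keep sub-sections that are dicts with a populated last-modified-date."""
    return {
        key: value
        for key, value in section.items()
        if isinstance(value, dict)
        and isinstance(value.get("last-modified-date"), dict)
    }


# Toy stand-in for the "person" part of an ORCID record (not real data)
toy_person = {
    "last-modified-date": None,
    "name": {
        "last-modified-date": {"value": 1697760000000},
        "given-names": {"value": "Example"},
    },
    "biography": None,
    "keywords": {"last-modified-date": None, "keyword": []},
    "path": "/0000-0000-0000-0000/person",
}

filtered = keep_dated_sections(toy_person)
print(sorted(filtered))  # ['name']
```

Null values, plain strings, and sections that were never populated all drop out, which is what keeps the cached YAML reasonably compact.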

        

# Known issue
I'm still seeing the problem documented in [this issue report](https://phabricator.wikimedia.org/T348256), where the SPARQL query service does not return values that it should. The following check looks in the query results for items that have a profile URL and an ORCID but show no retrieved or status code qualifier values on the ORCID claim. Looking at some of these items directly, we see that they do have those qualifiers, so values should have been returned. They also have cached ORCID data in their item discussion pages, so the original processing I ran worked just fine. We need this issue resolved soon if we're going to count on SPARQL queries to drive a process like this.

In [20]:
geokb_orcids[
    geokb_orcids['status_code'].isna()
    &
    ~geokb_orcids['ref_url'].isna()
]

index,item,orcid,retrieved,status_code,ref_url,qid
3791,https://geokb.wikibase.cloud/entity/Q46391,0000-0003-0657-0365,,,https://www.usgs.gov/staff-profiles/garth-e-gr...,Q46391
3792,https://geokb.wikibase.cloud/entity/Q46392,0000-0002-0873-0708,,,https://www.usgs.gov/staff-profiles/paul-grams,Q46392
3793,https://geokb.wikibase.cloud/entity/Q46397,0000-0002-5096-4305,,,https://www.usgs.gov/staff-profiles/alex-grant,Q46397
3794,https://geokb.wikibase.cloud/entity/Q46400,0000-0002-0761-3489,,,https://www.usgs.gov/staff-profiles/v-j-grauch,Q46400
3795,https://geokb.wikibase.cloud/entity/Q46401,0000-0001-9758-453X,,,https://www.usgs.gov/staff-profiles/robert-w-g...,Q46401
...,...,...,...,...,...,...
8395,https://geokb.wikibase.cloud/entity/Q44428,0000-0003-1534-3025,,,https://www.usgs.gov/staff-profiles/denise-m-akob,Q44428
8396,https://geokb.wikibase.cloud/entity/Q44429,0000-0002-2136-5098,,,https://www.usgs.gov/staff-profiles/robert-al-...,Q44429
8397,https://geokb.wikibase.cloud/entity/Q44431,0000-0002-6312-8269,,,https://www.usgs.gov/staff-profiles/janice-albers,Q44431
8398,https://geokb.wikibase.cloud/entity/Q44437,0000-0003-3494-6841,,,https://www.usgs.gov/staff-profiles/john-n-ale...,Q44437
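
Until the query service issue is resolved, one way to double-check a suspect item is to look at its entity JSON directly instead of relying on SPARQL. The sketch below assumes the entity has already been fetched as a dictionary in the standard Wikibase JSON layout (a `claims` mapping of property ID to statements, each with an optional `qualifiers` mapping). The function name is hypothetical, and the property IDs are the ones used in the query above.

```python
def missing_qualifiers(entity, claim_prop, qualifier_props):
    """List the qualifier properties absent from every statement of claim_prop."""
    statements = entity.get("claims", {}).get(claim_prop, [])
    present = set()
    for statement in statements:
        present.update(statement.get("qualifiers", {}).keys())
    return [prop for prop in qualifier_props if prop not in present]


# Toy entity (not real data): an ORCID claim that carries only one qualifier
toy_entity = {
    "claims": {
        "P106": [
            {
                "mainsnak": {"datavalue": {"value": "0000-0000-0000-0000"}},
                "qualifiers": {"P139": [{"datatype": "time"}]},
            }
        ]
    }
}

print(missing_qualifiers(toy_entity, "P106", ["P139", "P151"]))  # ['P151']
```

An empty list for an item that appears in the table above would confirm that the qualifiers exist and that the gap is on the query service side.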


# Operational Next Steps
The functions defined above handle the basic logic of retrieving ORCID records from the API, filtering responses to what we want to cache, and writing the cached ORCID data to discussion pages while updating the retrieval date and status code qualifiers on the items. We now have a baseline that provides data on keywords, bios, creative works, professional experience, etc. for a few thousand ORCID records owned by extant USGS personnel, so we can move on to evaluation and processing steps. We will circle back once the Wikimedia team resolves the SPARQL store issue to build a regular process that checks for updates and takes appropriate action.
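
As a starting point for that regular process, a sequential driver along the following lines could walk a set of rows, call the caching function for each, and pause between requests to stay clear of throttling. This is only a sketch: the function name, the pause length, and the result structure are assumptions, and the caching function is passed in so the loop can be exercised without touching the GeoKB.

```python
import time


def refresh_batch(rows, cache_fn, pause_seconds=0.5, sleep=time.sleep):
    """Call cache_fn(qid, orcid) for each row and collect the outcomes."""
    results = {"updated": [], "failed": []}
    for row in rows:
        try:
            cache_fn(row["qid"], row["orcid"])
            results["updated"].append(row["qid"])
        except Exception as err:
            # Record the failure and keep going so one bad item does not stop the run
            results["failed"].append((row["qid"], str(err)))
        sleep(pause_seconds)
    return results
```

With the objects defined in this notebook, a call might look like `refresh_batch(geokb_orcids.to_dict('records'), cache_orcid)`.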