In [3]:
import pickle
from copy import copy
from pylinkedcmd import orcid, pylinkedcmd
from joblib import Parallel, delayed
import tqdm
import os


In the iSAID - Staff Profile Accumulator notebook, we started from our online set of USGS Profile Pages and scraped those to generate our basic data structure of entities and claims. For demonstration purposes here, we dropped that into a file of unique_profiles that essentially mimics what we'll see in our master indexes when we move this online. We can start from that file here to tee up and run processing to gather additional information based on ORCID identifiers.

In [2]:
with open("unique_profiles.p", "rb") as f:
    profile_data = pickle.load(f)

The lovely thing about ORCIDs is that all we really need to trigger action is the identifier string itself. From that point, we can run through everything we need to do. ORCIDs could show up in our system through a variety of means and from different inbound vectors, so it would be helpful to set up an API Gateway entry point that does something really simple, like take the string input and tee up a message on a queue for operation. To help with both ORCID and DOI identifiers, I put a function in the pylinkedcmd.pylinkedcmd module to check for an actionable_id. This runs simple regex checks for ORCID or DOI identifiers and returns a dictionary with both the identifier form of the ID and its fully qualified http resolver path (returning None if the identifier can't be validated).

In [4]:
pylinkedcmd.actionable_id("0000-0003-1682-4031")

{'orcid': '0000-0003-1682-4031',
 'url': 'https://orcid.org/0000-0003-1682-4031'}
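
To make the idea of a simple regex check concrete, here is a minimal sketch of how such a validator could look. It is not the pylinkedcmd implementation: the function name check_identifier, the two patterns, and the DOI return keys are assumptions for illustration only, and the ORCID check digit is not verified.

```python
import re

# ORCID iD: four hyphen-separated blocks of four digits; the last character may be "X".
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")
# DOI: "10.", a 4-9 digit registrant code, "/", then a suffix without whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def check_identifier(value):
    """Hypothetical stand-in for actionable_id (assumed behavior, not the real code)."""
    if not isinstance(value, str):
        return None
    text = value.strip()
    # Accept either the bare identifier or its resolver URL.
    for prefix in ("https://orcid.org/", "https://doi.org/"):
        if text.startswith(prefix):
            text = text[len(prefix):]
    if ORCID_PATTERN.fullmatch(text):
        return {"orcid": text, "url": "https://orcid.org/" + text}
    if DOI_PATTERN.fullmatch(text):
        return {"doi": text, "url": "https://doi.org/" + text}
    return None


print(check_identifier("0000-0003-1682-4031"))
print(check_identifier("not-an-identifier"))
```

A pattern check like this only confirms the shape of the string. Whether the identifier actually resolves is something we only learn when we look it up.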

Here, I simply pull out our list of ORCID identifiers to iterate through in a parallel processing loop.

In [5]:
orcid_list = [i["entity"]["identifiers"]["orcid"] for i in profile_data if "orcid" in i["entity"]["identifiers"]]

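All of the logic needed to gather information from ORCIDs is embodied in the Lookup class of the pylinkedcmd.orcid module. The class accepts a number of parameters; here we pass return_errors=True so that a failed lookup comes back as a record with an "error" key that we can set aside.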

In [8]:
# Resume from earlier runs if cached results exist.
if os.path.exists("orcid_data.p"):
    with open("orcid_data.p", "rb") as f:
        orcid_data = pickle.load(f)
else:
    orcid_data = list()

if os.path.exists("orcid_errors.p"):
    with open("orcid_errors.p", "rb") as f:
        orcid_errors = pickle.load(f)
else:
    orcid_errors = list()

def accumulator(orcid_value):
    # Look up one ORCID and file the result under data or errors.
    orcid_record = orcid.Lookup(orcid_value, return_errors=True).document()
    if "error" in orcid_record:
        orcid_errors.append(orcid_record)
    else:
        orcid_data.append(orcid_record)


In [7]:
try:
    Parallel(n_jobs=20, prefer="threads")(
        delayed(accumulator)(i) for i in tqdm.tqdm(orcid_list)
    )
except Exception as e:
    print(e)
finally:
    # Save whatever was gathered, whether or not the run completed.
    with open("orcid_data.p", "wb") as f:
        pickle.dump(orcid_data, f)
    with open("orcid_errors.p", "wb") as f:
        pickle.dump(orcid_errors, f)


100%|██████████| 2040/2040 [01:51<00:00, 18.31it/s]
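
The loop above appends to shared lists from worker threads. An alternative is to have each worker return its record and split the results afterwards. The sketch below shows that pattern with the standard library's ThreadPoolExecutor rather than joblib, and with a made-up fetch_record function standing in for orcid.Lookup(...).document(); both are assumptions for illustration, not what this notebook runs.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_record(orcid_value):
    """Hypothetical stand-in for the real lookup; makes no network call."""
    if orcid_value.startswith("0000-0000"):
        # Pretend this identifier does not resolve.
        return {"orcid": orcid_value, "error": "HTTP Status Code: 404"}
    return {"entity": {"identifiers": {"orcid": orcid_value}}, "claims": []}


def gather(orcid_values, max_workers=20):
    """Fetch records in threads, then split them into data and errors."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and returns each worker's result.
        records = list(pool.map(fetch_record, orcid_values))
    data = [r for r in records if "error" not in r]
    errors = [r for r in records if "error" in r]
    return data, errors


sample_data, sample_errors = gather(
    ["0000-0003-1682-4031", "0000-0000-9485-3332"]
)
print(len(sample_data), len(sample_errors))
```

Returning values keeps the worker function free of side effects, which makes it easier to move the same logic behind a queue later.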


Using return_errors will tell us when something goes wrong and lets us put those records somewhere for later action. Most of these are incorrect ORCIDs (the 404 responses below) that would require some work on the underlying records; the connection errors are a different case and are worth retrying. We may want to put the invalid ones somewhere in our index so that we can expose them as necessary corrections, or use them to filter out ORCID identifiers coming from other sources that we've already shown to be invalid.

In [9]:
orcid_errors

[{'orcid': '0000-0001-9942-8602',
  'error': requests.exceptions.ConnectionError(urllib3.exceptions.MaxRetryError('None: Max retries exceeded with url: /v3.0/0000-0001-9942-8602 (Caused by None)'))},
 {'orcid': '0000-0000-9485-3332', 'error': 'HTTP Status Code: 404'},
 {'orcid': '0000-0002-8816-5920',
  'error': requests.exceptions.ConnectionError(urllib3.exceptions.MaxRetryError('None: Max retries exceeded with url: /v3.0/0000-0002-8816-5920 (Caused by None)'))},
 {'orcid': '0000-0002-3239-6909', 'error': 'HTTP Status Code: 404'},
 {'orcid': '0000-0002-1398-0894', 'error': 'HTTP Status Code: 404'}]
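
Since the error list mixes two kinds of failure, one possible next step is to separate identifiers that look invalid from lookups that simply failed to connect. This is a rough sketch under the assumption that error records keep the shape shown above; the split_errors name is made up, and the built-in ConnectionError stands in for the requests exception in the sample data.

```python
def split_errors(error_records):
    """Separate likely-invalid ORCIDs (404) from lookups worth retrying."""
    invalid, retry = [], []
    for record in error_records:
        error = record["error"]
        # In the output above, a 404 arrives as a plain string.
        if isinstance(error, str) and error.endswith("404"):
            invalid.append(record["orcid"])
        else:
            # Anything else (e.g. a connection exception) is treated as transient.
            retry.append(record["orcid"])
    return invalid, retry


sample_errors = [
    {"orcid": "0000-0001-9942-8602", "error": ConnectionError("Max retries exceeded")},
    {"orcid": "0000-0000-9485-3332", "error": "HTTP Status Code: 404"},
    {"orcid": "0000-0002-3239-6909", "error": "HTTP Status Code: 404"},
]

invalid_orcids, retry_orcids = split_errors(sample_errors)
print(invalid_orcids)
print(retry_orcids)
```

The retry list could be fed back through the same accumulator, while the invalid list is the one we would want to expose as corrections.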

The end result of ORCID processing also aligns with our entity/claims model. Claims here may or may not be unique to this source: publications are one of the major clues we pick up from ORCID records, and we also process publication records from the USGS Publications Warehouse. We will also be teeing up publication entities from this process, using DOIs from ORCID records as object identifiers, so over time we will be able to verify whether or not we already know about an entity. At this stage, we can probably throw all claims from this process into the index and sort things out further in the future.

Entity information here should already exist but may be slightly better coming from the ORCID source. In the immediate term, we can focus solely on contributing claims.
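
Focusing solely on claims could be as simple as flattening the per-person claim lists into one list ready for indexing. The sketch below assumes each record is a dictionary with "entity" and "claims" keys, as in the output in this notebook; the collect_claims helper and the trimmed sample records are illustrative only.

```python
def collect_claims(records):
    """Flatten the claims from every entity/claims record into one list."""
    claims = []
    for record in records:
        # Some people have an ORCID record but no claims yet.
        claims.extend(record.get("claims", []))
    return claims


sample_records = [
    {"entity": {"name": "Person A"}, "claims": []},
    {
        "entity": {"name": "Person B"},
        "claims": [
            {"claim_source": "ORCID", "subject_label": "Person B"},
            {"claim_source": "ORCID", "subject_label": "Person B"},
        ],
    },
]

all_claims = collect_claims(sample_records)
print(len(all_claims))
```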

In [10]:
print(len(orcid_data))
display(orcid_data[:5])

2035


[{'entity': {'identifiers': {'orcid': '0000-0003-0274-5180',
    'url': 'https://orcid.org/0000-0003-0274-5180'},
   'instance_of': 'Person',
   'entity_created': '2020-11-20T17:29:12.865460',
   'entity_source': 'ORCID',
   'reference': 'https://orcid.org/0000-0003-0274-5180',
   'name': 'Sean Ahdi',
   'alternateName': None,
   'url': None},
  'claims': []},
 {'entity': {'identifiers': {'orcid': '0000-0003-2988-9993',
    'url': 'https://orcid.org/0000-0003-2988-9993'},
   'instance_of': 'Person',
   'entity_created': '2020-11-20T17:29:12.870513',
   'entity_source': 'ORCID',
   'reference': 'https://orcid.org/0000-0003-2988-9993',
   'name': 'Phyllis Altheide',
   'alternateName': None,
   'url': None},
  'claims': [{'claim_created': '2020-11-20T17:29:14.167347',
    'claim_source': 'ORCID',
    'reference': 'https://orcid.org/0000-0003-2988-9993',
    'subject_instance_of': 'Person',
    'subject_label': 'Phyllis Altheide',
    'subject_identifiers': {'orcid': '0000-0003-2988-9993'