One of the use cases we are exploring through this code project is how we can use WikiData to help augment a process for auto-assembling research records for USGS staff. Having written a series of functions for interacting with the ScienceBase Directory as our starting point, for working out linkages to WikiData IDs, and for working with WikiData itself, we can start exploring what routes to useful information the WikiData records linked to USGS person records might produce. This notebook works through the necessary functions and provides a comprehensive look at the individual properties of USGS people in WikiData. These properties are considered statements or claims about a person and include identifiers for cross-referencing to other systems that may also be useful.

In [1]:
import pylinkedcmd
from IPython.display import display
import time

sb = pylinkedcmd.pylinkedcmd.Sciencebase()
wd = pylinkedcmd.pylinkedcmd.Wikidata()

First, we need to get relevant person records from the USGS ScienceBase Directory as our starting point. The SB Directory contains a record of everyone who currently works for the USGS or who has worked for the USGS in the years since the Directory was established (~2010). It also contains records for collaborators from other organizations and institutions who have been involved in USGS work and obtained an account with us at some point. It's not a perfect resource, but it represents the best publicly available source of structured information on USGS staff that is reasonably comprehensive. The sb_people function is pretty specialized for our use cases here: it filters to just people with a WikiData ID and simplifies the output to a small set of fields that we care about for this exercise. The function takes a little while to run, so we time it for reference.

In [3]:
%%time
people = sb.sb_people(fields="simple")
print(len(people))
display(people[99])

2158


{'uri': 'https://www.sciencebase.gov/directory/person/69061',
 'displayName': 'Courtney A Creamer',
 'email': 'ccreamer@usgs.gov',
 'identifiers': [{'id': 2232, 'key': '0000-0001-8270-9387', 'type': 'ORCID'},
  {'id': 2233, 'key': 'Q57408270', 'type': 'WikiData'}]}

CPU times: user 212 ms, sys: 72.1 ms, total: 284 ms
Wall time: 10.5 s
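
Since each simplified record carries a list of typed identifiers, a small helper can pull one out by type. This is only a sketch: `identifier_by_type` is a hypothetical name and not part of pylinkedcmd, and the sample record is trimmed from the output above.

```python
def identifier_by_type(person, id_type):
    """Return the first identifier key of the given type, or None if absent."""
    return next(
        (i["key"] for i in person.get("identifiers", []) if i.get("type") == id_type),
        None,
    )


# Sample record trimmed from the sb_people output shown in this notebook
sample_person = {
    "uri": "https://www.sciencebase.gov/directory/person/69061",
    "displayName": "Courtney A Creamer",
    "identifiers": [
        {"id": 2232, "key": "0000-0001-8270-9387", "type": "ORCID"},
        {"id": 2233, "key": "Q57408270", "type": "WikiData"},
    ],
}

print(identifier_by_type(sample_person, "WikiData"))
print(identifier_by_type(sample_person, "ORCID"))
```

Returning None for a missing type keeps the helper safe to use on records that lack one of the identifiers.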


Because we are going to want to come back and reference the SB record in a minute when it comes to exploring individual properties on people, this helper function retrieves the URL that will let us get back into the ScienceBase Directory to see who we are talking about.

In [6]:
def sb_person_link_by_wd_id(people, wd_id):
    """Return the ScienceBase Directory URI of the first person with this WikiData ID."""
    return next(
        (
            p["uri"] for p in people
            if any(i["key"] == wd_id for i in p["identifiers"])
        ),
        None,
    )
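
The helper above scans the whole list on every call, which is fine for a couple thousand records. If lookups become frequent, one alternative is to build a dictionary keyed on WikiData ID once and reuse it. This is a minimal sketch: `build_wd_index` is a hypothetical name, and the sample record is trimmed from this notebook's outputs.

```python
def build_wd_index(people):
    """Map each WikiData ID to the ScienceBase Directory URI of its person record."""
    index = {}
    for person in people:
        for identifier in person.get("identifiers", []):
            if identifier.get("type") == "WikiData":
                # Keep the first record seen for a given WikiData ID
                index.setdefault(identifier["key"], person["uri"])
    return index


# Sample record trimmed from the sb_people output shown in this notebook
sample_people = [
    {
        "uri": "https://www.sciencebase.gov/directory/person/69061",
        "identifiers": [
            {"id": 2232, "key": "0000-0001-8270-9387", "type": "ORCID"},
            {"id": 2233, "key": "Q57408270", "type": "WikiData"},
        ],
    },
]

wd_index = build_wd_index(sample_people)
print(wd_index.get("Q57408270"))
```

Unlike the scan, this version only matches identifiers typed as WikiData, so an ORCID can never be returned by accident.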

One of the functions we built into the WikiData class in this package will retrieve a given WikiData Item ("Q" identifier) and run some lookups to build out a reasonable document representation of the full record in WikiData. It runs through the statements/claims on a person and resolves property identifiers to provide a human-readable view of the properties of an item. The following codeblock demonstrates this for a slice of our list of USGS people with WikiData IDs. Each call to wd_item() takes a little time with this method because every property on the item has to be resolved.

Note: There are likely better ways of handling this function fully with SPARQL that we'll figure out eventually.

In [7]:
%%time
for person in people[20:22]:
    wd_id = next((i["key"] for i in person["identifiers"] if i["type"] == "WikiData"), None)
    display(wd.wd_item(wd_id, include_aliases=True))

{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/Q57414377.json',
 'lastrevid': 1202763177,
 'modified': '2020-06-09T06:30:58Z',
 'label_en': 'Muruleedhara N Byappanahalli',
 'description_en': 'researcher ORCID id 0000-0001-5376-597X',
 'statements': [{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P496.json',
   'title': 'Property:P496',
   'id': 'P496',
   'label_en': 'ORCID iD',
   'description_en': 'identifier for a person',
   'formatter_url': 'https://orcid.org/$1',
   'url': 'https://orcid.org/0000-0001-5376-597X',
   'value': '0000-0001-5376-597X'},
  {'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P31.json',
   'title': 'Property:P31',
   'id': 'P31',
   'label_en': 'instance of',
   'description_en': 'that class of which this subject is a particular example and member',
   'formatter_url': None,
   'url': 'https://www.wikidata.org/wiki/Special:EntityData/Q5',
   'value': 'human'},
  {'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P1153.j

{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/Q20675641.json',
 'lastrevid': 1191864531,
 'modified': '2020-05-27T21:58:07Z',
 'label_en': 'Mark Petersen',
 'description_en': 'seismologist',
 'aliases': ['Mark D. Petersen',
  'Mark Petersen',
  'Mark D Petersen',
  'M. D. Petersen',
  'M.D. Petersen'],
 'statements': [{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P31.json',
   'title': 'Property:P31',
   'id': 'P31',
   'label_en': 'instance of',
   'description_en': 'that class of which this subject is a particular example and member',
   'formatter_url': None,
   'url': 'https://www.wikidata.org/wiki/Special:EntityData/Q5',
   'value': 'human'},
  {'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P27.json',
   'title': 'Property:P27',
   'id': 'P27',
   'label_en': 'country of citizenship',
   'description_en': 'the object is a country that recognizes the subject as its citizen',
   'formatter_url': None,
   'url': 'https://www.wikidata.org/wiki/Specia

CPU times: user 746 ms, sys: 57.5 ms, total: 804 ms
Wall time: 42.8 s
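
As one possible direction for the SPARQL approach mentioned in the note above, the query below asks for every direct claim on a batch of items in one request. This is a sketch and not how the package currently works: `build_properties_query` is a hypothetical helper, the query has not been tuned, and it relies on the `wd:`, `wikibase:` and `bd:` prefixes that the Wikidata Query Service predefines.

```python
import re


def build_properties_query(wd_ids):
    """Build a SPARQL query listing direct claims for a batch of WikiData items."""
    for wd_id in wd_ids:
        # Guard against malformed IDs ending up inside the query text
        if not re.fullmatch(r"Q\d+", wd_id):
            raise ValueError(f"Not a WikiData item ID: {wd_id!r}")

    values = " ".join(f"wd:{wd_id}" for wd_id in wd_ids)
    return f"""
SELECT ?item ?propertyLabel ?value ?valueLabel WHERE {{
  VALUES ?item {{ {values} }}
  ?item ?claim ?value .
  ?property wikibase:directClaim ?claim .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


# The two item IDs come from the wd_item output shown above
query = build_properties_query(["Q57414377", "Q20675641"])
print(query)
```

The resulting string would still need to be sent to a SPARQL endpoint, and only "truthy" direct claims are returned, so qualifiers and references are not included.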


What we are really interested in through this immediate exercise is an examination of the breadth of properties in WikiData on USGS people. We want to understand what we can exploit from those claims made about our people and get some indication of where those statements have come from. The WikiData class contains another function that will take a list of specific item IDs and return all of the properties with English labels and values for those IDs. We can use this to comprehensively run through every USGS-affiliated WikiData ID and get a set of distinct properties to examine further.

To be kind to the WikiData API and deal with limitations in how large a SPARQL query we can make, we batch process and slow down our requests in sending all 2K+ IDs.

In [8]:
%%time
wd_ids = [next(i["key"] for i in p["identifiers"] if i["type"] == "WikiData") for p in people]

batch_size = 100
all_props = list()

for i in range(0, len(wd_ids), batch_size):
    this_run = wd_ids[i:i+batch_size]
    all_props.extend(wd.wd_item_properties(this_run, return_result="values"))
    time.sleep(0.2)
    
distinct_properties = list(set([p["property"] for p in all_props]))
display(distinct_properties)

['academic degree',
 'sex or gender',
 'country of citizenship',
 'GND ID',
 'Nationale Thesaurus voor Auteurs ID',
 'Bibliothèque nationale de France ID',
 'described at URL',
 'doctoral advisor',
 'University of Barcelona authority ID',
 'nickname',
 'NKCR AUT ID',
 'affiliation',
 'award received',
 'CONOR ID',
 'Loop ID',
 'ZooBank author ID',
 'NLA Trove ID',
 'RERO ID',
 'NORAF ID',
 'occupation',
 'Library of Congress authority ID',
 'IPNI author ID',
 'DBLP ID',
 'on focus list of Wikimedia project',
 'field of work',
 'date of birth',
 'Publons author ID',
 'ResearchGate contributions ID',
 'Dimensions author ID',
 'official website',
 'CiNii author ID (books)',
 'work location',
 'Academic Tree ID',
 'ISNI',
 'Harvard Index of Botanists ID',
 'employer',
 'Scopus Author ID',
 'NUKAT ID',
 'Microsoft Academic ID',
 'honorific prefix',
 'ORCID iD',
 'different from',
 'Mendeley person ID',
 'given name',
 'notable work',
 'VIAF ID',
 'educated at',
 'BHL creator ID',
 'residenc

CPU times: user 607 ms, sys: 118 ms, total: 725 ms
Wall time: 56.2 s
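
A distinct list shows the breadth of properties but not how common each one is. Counting the number of distinct items that carry each property is one way to rank them. This sketch assumes each entry in all_props has `property` and `item_id` keys, as the later cells in this notebook do; `property_coverage` is a hypothetical helper and the sample rows use placeholder IDs rather than real results.

```python
from collections import Counter


def property_coverage(all_props):
    """Count how many distinct items carry each property."""
    # A set of (property, item) pairs collapses repeated values on one item
    pairs = {(p["property"], p["item_id"]) for p in all_props}
    return Counter(prop for prop, _ in pairs)


# Placeholder rows for illustration only (not real WikiData results)
sample_props = [
    {"item_id": "Q-EXAMPLE-1", "property": "field of work"},
    {"item_id": "Q-EXAMPLE-1", "property": "field of work"},
    {"item_id": "Q-EXAMPLE-2", "property": "field of work"},
    {"item_id": "Q-EXAMPLE-2", "property": "ORCID iD"},
]

coverage = property_coverage(sample_props)
for prop, count in coverage.most_common():
    print(prop, count)
```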


Now that we have our suite of distinct properties, we can start looking for interesting tidbits of information, figure out where things came from, and think about what we might be able to do with the stuff we can get back through the USGS person to WikiData linkage. There seems to be some pretty interesting stuff in the collection of properties. The following codeblock can be tweaked to look at specific properties of interest. It takes a property name and provides a clickable set of URLs that will take you to the WikiData identity with that property and to the corresponding ScienceBase Directory item. At this point, we are just visually exploring the information to see what's happening with the data.

In [12]:
property_name = "field of work"
for record in [i for i in all_props if i["property"] == property_name]:
    print(f"https://www.wikidata.org/wiki/{record['item_id']}", sb_person_link_by_wd_id(people, record['item_id']))

https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820
https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820
https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820
https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820
https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820
https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820
https://www.wikidata.org/wiki/Q21502710 https://www.sciencebase.gov/directory/person/1754
https://www.wikidata.org/wiki/Q21502710 https://www.sciencebase.gov/directory/person/1754
https://www.wikidata.org/wiki/Q21502710 https://www.sciencebase.gov/directory/person/1754
https://www.wikidata.org/wiki/Q21502710 https://www.sciencebase.gov/directory/person/1754
https://www.wikidata.org/wiki/Q21502710 https://www.sciencebase.gov/directory/person/1754
https://ww
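
Because a person can have several values for one property, the loop above prints the same pair of links once per value. One way to tidy this up is to de-duplicate on item ID while keeping first-seen order. `unique_item_ids` is a hypothetical helper, and the sample rows only mimic the repetition seen in the output above.

```python
def unique_item_ids(all_props, property_name):
    """Return item IDs carrying property_name, in first-seen order without repeats."""
    matches = (r["item_id"] for r in all_props if r["property"] == property_name)
    # dict.fromkeys preserves insertion order and drops duplicates
    return list(dict.fromkeys(matches))


# Rows that mimic the repeated item IDs in the output above
sample_rows = [
    {"item_id": "Q58358553", "property": "field of work"},
    {"item_id": "Q58358553", "property": "field of work"},
    {"item_id": "Q21502710", "property": "field of work"},
    {"item_id": "Q58358553", "property": "field of work"},
    {"item_id": "Q21502710", "property": "employer"},
]

for item_id in unique_item_ids(sample_rows, "field of work"):
    print(f"https://www.wikidata.org/wiki/{item_id}")
```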

# Observations
* I was interested in the fact that "date of birth" showed up as a property on some records as that's often one of those "personally identifiable information" bits that we would never even have a property for in our government system. Most of these appear to be of lower than month/year precision, which is something that we can verify programmatically based on how WikiData handles date properties. An interesting case here that I glanced through is David M. Miller (https://www.wikidata.org/wiki/Q57729525, https://www.sciencebase.gov/directory/person/7005) who is now an emeritus scientist in USGS and who I worked with some time ago on a publication.
* Another interesting case is Luke Iwanowicz (https://www.wikidata.org/wiki/Q42260508, https://www.sciencebase.gov/directory/person/4946). His USGS record indicates that he is an active research ecologist at the Leetown Science Center, but his WikiData record does not indicate that he is an employee of the USGS. A view of the item's WikiData history shows that it has come about through a series of actions from 2017-2020 by both bots and what appear to be individual person edits. When I ran the process to link USGS people with ORCIDs to WikiData identities, I looked at only a match on ORCID and full name and did not filter out anyone who was not affiliated with the USGS in their WikiData record.
* Another interesting property I hadn't seen previously is "different from." A view there shows a cool disambiguation made in WikiData between the USGS Janet Carter (a Bureau Approving Official; https://www.wikidata.org/wiki/Q57408499 https://www.sciencebase.gov/directory/person/1620) and a Janet Carter in New Zealand in a completely different profession. That's pretty cool! According to the history, the "different from" disambiguation was made by a user, Daniel Mietchen, in June 2019. That tracks to when the item for Janet Carter from New Zealand was created, pointing to part of the power of the WikiData platform in helping to force disambiguation when items of similar name are added to the graph.
* A really interesting property that holds a lot of promise for one of our use cases dealing with capacity assessment is "field of work." I looked at records for Robert Stallard (https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820), who's a pretty famous ecologist who I believe is still based in Boulder. His WikiData record shows only his affiliation with the Smithsonian and provides field of work identifiers with references to his profile page there - https://stri.si.edu/scientist/robert-stallard. Sweet! We have another connection through to some other mineable asset. The values for field of work are relatively high level in terms of understanding specialties, but they are fully linked to associated depth within Wikipedia, which gives them a bit more weight and added context. In Robert's case, we see that he does work in reforestation among other things, which is a deeper level concept than we're going to see in anything USGS has classified out at this point.
* Another interesting case is Neal Woodman (https://www.wikidata.org/wiki/Q21339166 https://www.sciencebase.gov/directory/person/11434) who is a zoologist with the Patuxent Wildlife Research Center. Neal's field of work properties include some pretty specific references to species and not just general sciences, which is another interesting dynamic in itself that I'm exploring in WikiData. He links to the family of shrews and similar mammals, [Soricidae](https://www.wikidata.org/wiki/Q41631), which is a fairly specific reference. (I'm also exploring the phenomenon that WikiData has over 570,000 references to ITIS TSN identifiers that are linked to many other identification systems that we're interested in for other work I'm doing on taxa information assembly.) A view of the history of Neal's WikiData record shows that field of work values were added by a [real person](https://www.wikidata.org/wiki/User:Uncommon_fritillary), Diane Shaw, at the Smithsonian's Biodiversity Heritage Library. Cool!
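
The date precision point in the first observation can be checked programmatically, because time values in the WikiData entity JSON carry an integer `precision` (9 for year, 10 for month, 11 for day). The sketch below only interprets that field: the helper names are hypothetical, and the sample values are invented for illustration rather than taken from any person's record.

```python
# Precision codes used in WikiData time values
PRECISION_LABELS = {7: "century", 8: "decade", 9: "year", 10: "month", 11: "day"}


def describe_precision(time_value):
    """Label the precision of a WikiData time value dict."""
    return PRECISION_LABELS.get(time_value.get("precision"), "other")


def is_full_date(time_value):
    """True when the value is precise to the day or finer."""
    return time_value.get("precision", 0) >= 11


# Invented sample values for illustration only
sample_times = [
    {"time": "+1950-00-00T00:00:00Z", "precision": 9},
    {"time": "+1950-06-15T00:00:00Z", "precision": 11},
]

for value in sample_times:
    print(value["time"], describe_precision(value), is_full_date(value))
```

Flagging only day-level values would let us separate records that expose a full birth date from those that give just a year.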