One of the use cases we are exploring through this code project is how we can use WikiData to help augment a process for auto-assembling research records for USGS staff. After writing a series of functions for interacting with the ScienceBase Directory as our starting point, working out linkages to WikiData IDs, and building a set of functions for working with WikiData, we can start exploring what routes to useful information the WikiData records linked to USGS person records might produce. This notebook works through the necessary functions and provides a comprehensive look at the individual properties of USGS people in WikiData. These properties are considered statements or claims about a person and include identifiers for cross-reference to other systems that may also be useful.

In [1]:
import pylinkedcmd
from IPython.display import display
import time

sb = pylinkedcmd.pylinkedcmd.Sciencebase()
wd = pylinkedcmd.pylinkedcmd.Wikidata()

First, we need to get relevant person records from the USGS ScienceBase Directory as our starting point. The SB Directory contains a record of everyone who currently works for the USGS or who has worked for the USGS in the years since the Directory was established (~2010). It also contains records for collaborators from other organizations and institutions who have been involved in USGS work and obtained an account with us at some point. It's not a perfect resource, but it represents the best publicly available source of structured information on USGS staff that is reasonably comprehensive. The sb_people function is pretty specialized for our use cases here: it filters to just people with a WikiData ID and simplifies the output to the handful of fields that we care about for this exercise. The function takes a little bit of time to run through, so we time it for reference.

In [2]:
%%time
people = sb.sb_people(fields="simple")
print(len(people))
display(people[99])

2155


{'uri': 'https://www.sciencebase.gov/directory/person/4417',
 'displayName': 'James R Hein',
 'email': 'jhein@usgs.gov',
 'identifiers': [{'id': 1350, 'key': '0000-0002-5321-899X', 'type': 'ORCID'},
  {'id': 1351, 'key': 'Q67225029', 'type': 'WikiData'}]}

CPU times: user 255 ms, sys: 91.9 ms, total: 347 ms
Wall time: 7.91 s


Because we are going to want to come back and reference the SB record in a minute when it comes to exploring individual properties on people, this helper function retrieves the URL that will let us get back into the ScienceBase Directory to see who we are talking about.

In [3]:
def sb_person_link_by_wd_id(people, wd_id):
    # Return the ScienceBase Directory URI of the first person who carries
    # wd_id as an identifier key, or None if nobody matches.
    return next(
        (
            p["uri"] for p in people
            if any(i["key"] == wd_id for i in p["identifiers"])
        ),
        None,
    )
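
The helper above rescans the whole people list on every call, which is fine for spot checks. If we end up doing many lookups, one option is to build a dictionary once and look IDs up directly. This is only a sketch: `build_wd_to_sb_index` is a hypothetical name, not part of pylinkedcmd, and unlike the helper above it only indexes identifiers whose type is "WikiData". The sample record is the one displayed earlier, trimmed to the fields used here.

```python
def build_wd_to_sb_index(people):
    """Map each WikiData ID to the ScienceBase Directory URI of its person record."""
    index = {}
    for person in people:
        for identifier in person.get("identifiers", []):
            if identifier.get("type") == "WikiData":
                # setdefault keeps the first person seen for a given ID,
                # mirroring the "first match wins" behavior of the helper.
                index.setdefault(identifier["key"], person["uri"])
    return index


# Sample record copied from the sb_people output shown earlier.
sample_people = [
    {
        "uri": "https://www.sciencebase.gov/directory/person/4417",
        "displayName": "James R Hein",
        "identifiers": [
            {"id": 1350, "key": "0000-0002-5321-899X", "type": "ORCID"},
            {"id": 1351, "key": "Q67225029", "type": "WikiData"},
        ],
    }
]

wd_to_sb = build_wd_to_sb_index(sample_people)
print(wd_to_sb.get("Q67225029"))
```

An ID that is not in the index comes back as `None` from `.get()`, the same as the helper's fallback.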

One of the functions we built into the WikiData class in this package will retrieve a given WikiData Item ("Q" identifier) and run some lookups to build out a reasonable document representation of the full record in WikiData. It runs through the statements/claims on a person and resolves property identifiers to provide a human-readable view of the properties of an item. The following codeblock demonstrates this for a slice of our list of USGS people with WikiData IDs. Each call to wd_item() takes a little bit of time with this method because it has to resolve the properties.

Note: There are likely better ways of handling this function fully with SPARQL that we'll figure out eventually.
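
As a rough idea of what a SPARQL-only route might look like, the sketch below builds a single query that asks for the direct ("truthy") claims on a set of items, with English labels for the properties and values. This is not how wd_item() works today; `build_properties_query` is a hypothetical helper, and the query string has not been tuned or run against the endpoint here. It relies on the `wd:`, `wikibase:`, and `bd:` prefixes that the WikiData Query Service predefines.

```python
import re


def build_properties_query(wd_ids, language="en"):
    """Build a SPARQL query for the direct claims on a list of WikiData items."""
    # Keep only well-formed Q identifiers so nothing unexpected is
    # interpolated into the query text.
    valid_ids = [i for i in wd_ids if re.fullmatch(r"Q\d+", i)]
    if not valid_ids:
        raise ValueError("no valid WikiData Q identifiers supplied")

    values = " ".join(f"wd:{i}" for i in valid_ids)
    return f"""
SELECT ?item ?property ?propertyLabel ?value ?valueLabel WHERE {{
  VALUES ?item {{ {values} }}
  ?item ?claim ?value .
  ?property wikibase:directClaim ?claim .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
""".strip()


query = build_properties_query(["Q20675641", "Q57414377", "not an id"])
print(query)
```

The string could then be sent to the query service with whatever HTTP client the package already uses. Direct claims drop qualifiers and references, so this would complement rather than replace the fuller document that wd_item() builds.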

In [4]:
%%time
for person in people[20:22]:
    wd_id = next((i["key"] for i in person["identifiers"] if i["type"] == "WikiData"), None)
    display(wd.wd_item(wd_id, include_aliases=True))

{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/Q57414377.json',
 'lastrevid': 1202763177,
 'modified': '2020-06-09T06:30:58Z',
 'label_en': 'Muruleedhara N Byappanahalli',
 'description_en': 'researcher ORCID id 0000-0001-5376-597X',
 'statements': [{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P496.json',
   'title': 'Property:P496',
   'id': 'P496',
   'label_en': 'ORCID iD',
   'description_en': 'identifier for a person',
   'formatter_url': 'https://orcid.org/$1',
   'url': 'https://orcid.org/0000-0001-5376-597X',
   'value': '0000-0001-5376-597X'},
  {'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P31.json',
   'title': 'Property:P31',
   'id': 'P31',
   'label_en': 'instance of',
   'description_en': 'that class of which this subject is a particular example and member',
   'formatter_url': None,
   'url': 'https://www.wikidata.org/wiki/Special:EntityData/Q5',
   'value': 'human'},
  {'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P1153.j

{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/Q20675641.json',
 'lastrevid': 1191864531,
 'modified': '2020-05-27T21:58:07Z',
 'label_en': 'Mark Petersen',
 'description_en': 'seismologist',
 'aliases': ['Mark D. Petersen',
  'Mark Petersen',
  'Mark D Petersen',
  'M. D. Petersen',
  'M.D. Petersen'],
 'statements': [{'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P31.json',
   'title': 'Property:P31',
   'id': 'P31',
   'label_en': 'instance of',
   'description_en': 'that class of which this subject is a particular example and member',
   'formatter_url': None,
   'url': 'https://www.wikidata.org/wiki/Special:EntityData/Q5',
   'value': 'human'},
  {'uri': 'https://www.wikidata.org/wiki/Special:EntityData/P27.json',
   'title': 'Property:P27',
   'id': 'P27',
   'label_en': 'country of citizenship',
   'description_en': 'the object is a country that recognizes the subject as its citizen',
   'formatter_url': None,
   'url': 'https://www.wikidata.org/wiki/Specia

CPU times: user 651 ms, sys: 48.9 ms, total: 699 ms
Wall time: 27.1 s


What we are really interested in through this immediate exercise is an examination of the breadth of properties in WikiData on USGS people. We want to understand what we can exploit from those claims made about our people and get some indication of where those statements have come from. The WikiData class contains another function that will take a list of specific item IDs and return all of the properties with English labels and values for those IDs. We can use this to comprehensively run through every USGS-affiliated WikiData ID and get a set of distinct properties to examine further.

To be kind to the WikiData API and to stay within the limits on how large a SPARQL query we can make, we send the 2K+ IDs in batches and pause briefly between requests.

In [5]:
%%time
wd_ids = [next(i["key"] for i in p["identifiers"] if i["type"] == "WikiData") for p in people]

batch_size = 100
all_props = list()

for i in range(0, len(wd_ids), batch_size):
    this_run = wd_ids[i:i+batch_size]
    all_props.extend(wd.wd_item_properties(this_run, return_result="values"))
    time.sleep(0.2)
    
distinct_properties = list(set([p["property"] for p in all_props]))
display(distinct_properties)

['Loop ID',
 'on focus list of Wikimedia project',
 'languages spoken, written or signed',
 'member of',
 'honorific suffix',
 'FAST ID',
 'Project Gutenberg author ID',
 'Bibliothèque nationale de France ID',
 'field of work',
 'academic degree',
 'LinkedIn personal profile ID',
 'occupation',
 'Google Scholar author ID',
 'family name',
 'nickname',
 'ResearchGate contributions ID',
 'NORAF ID',
 'residence',
 'DBLP ID',
 'doctoral advisor',
 'employer',
 'Publons author ID',
 'IdRef ID',
 'botanist author abbreviation',
 'official website',
 'ISNI',
 'educated at',
 'ZooBank author ID',
 'GND ID',
 'ResearcherID',
 'Libraries Australia ID',
 'work location',
 'MusicBrainz artist ID',
 'IPNI author ID',
 'NKCR AUT ID',
 'different from',
 'Mendeley person ID',
 'Microsoft Academic ID',
 'BHL creator ID',
 'University of Barcelona authority ID',
 'Biblioteca Nacional de España ID',
 'Library of Congress authority ID',
 'CiNii author ID (books)',
 'given name',
 'WorldCat Identities ID

CPU times: user 560 ms, sys: 120 ms, total: 679 ms
Wall time: 1min
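
The list of distinct labels tells us what exists but not how widely each property is used. A small sketch of one way to count how many distinct people carry each property follows. It assumes each record in all_props has "item_id" and "property" keys, as used elsewhere in this notebook; `property_coverage` is a hypothetical helper and the sample records are made up for illustration.

```python
from collections import Counter


def property_coverage(all_props):
    """Count how many distinct items carry each property label."""
    # A person can have several values for one property, so reduce to
    # unique (item, property) pairs before counting.
    pairs = {(record["item_id"], record["property"]) for record in all_props}
    return Counter(prop for _, prop in pairs)


# Made-up records in the assumed shape, not real query results.
sample_props = [
    {"item_id": "Q20675641", "property": "occupation"},
    {"item_id": "Q20675641", "property": "occupation"},
    {"item_id": "Q20675641", "property": "employer"},
    {"item_id": "Q68630131", "property": "occupation"},
]

coverage = property_coverage(sample_props)
for prop, count in coverage.most_common():
    print(prop, count)
```

Sorting by count this way would be a quick guide to which properties are worth a closer look first.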


Now that we have our suite of distinct properties, we can start looking for interesting tidbits of information, figure out where things came from, and think about what we might be able to do with the stuff we can get back through the USGS person to WikiData linkage. There seems to be some pretty interesting stuff in the collection of properties. The following codeblock can be tweaked to look at specific properties of interest. It takes a property name and provides a clickable set of URLs that will take you to the WikiData identity with that property and to the corresponding ScienceBase Directory item. At this point, we are just visually exploring the information to see what's happening with the data.

In [7]:
property_name = "sex or gender"
ids = list()
for record in [i for i in all_props if i["property"] == property_name]:
    ids.append(record["item_id"])
    print(f"https://www.wikidata.org/wiki/{record['item_id']}", sb_person_link_by_wd_id(people, record['item_id']))
    
print(len(list(set(ids))))

https://www.wikidata.org/wiki/Q20675641 https://www.sciencebase.gov/directory/person/8045
https://www.wikidata.org/wiki/Q40203735 https://www.sciencebase.gov/directory/person/24855
https://www.wikidata.org/wiki/Q42789708 https://www.sciencebase.gov/directory/person/948
https://www.wikidata.org/wiki/Q42814439 https://www.sciencebase.gov/directory/person/6949
https://www.wikidata.org/wiki/Q50597603 https://www.sciencebase.gov/directory/person/6551
https://www.wikidata.org/wiki/Q54963192 https://www.sciencebase.gov/directory/person/2828
https://www.wikidata.org/wiki/Q56948802 https://www.sciencebase.gov/directory/person/60525
https://www.wikidata.org/wiki/Q57408099 https://www.sciencebase.gov/directory/person/15304
https://www.wikidata.org/wiki/Q57408268 https://www.sciencebase.gov/directory/person/68963
https://www.wikidata.org/wiki/Q57408275 https://www.sciencebase.gov/directory/person/58061
https://www.wikidata.org/wiki/Q57408402 https://www.sciencebase.gov/directory/person/8217
https:

https://www.wikidata.org/wiki/Q57077931 https://www.sciencebase.gov/directory/person/67396
https://www.wikidata.org/wiki/Q57408085 https://www.sciencebase.gov/directory/person/6867
https://www.wikidata.org/wiki/Q57408124 https://www.sciencebase.gov/directory/person/5586
https://www.wikidata.org/wiki/Q57408370 https://www.sciencebase.gov/directory/person/7260
https://www.wikidata.org/wiki/Q57408397 https://www.sciencebase.gov/directory/person/5791
https://www.wikidata.org/wiki/Q57408710 https://www.sciencebase.gov/directory/person/10895
https://www.wikidata.org/wiki/Q57412745 https://www.sciencebase.gov/directory/person/1016
https://www.wikidata.org/wiki/Q57414191 https://www.sciencebase.gov/directory/person/5420
https://www.wikidata.org/wiki/Q57414298 https://www.sciencebase.gov/directory/person/8789
https://www.wikidata.org/wiki/Q57414503 https://www.sciencebase.gov/directory/person/9562
https://www.wikidata.org/wiki/Q57415470 https://www.sciencebase.gov/directory/person/68957
https:/

https://www.wikidata.org/wiki/Q57844535 https://www.sciencebase.gov/directory/person/2745
https://www.wikidata.org/wiki/Q57845302 https://www.sciencebase.gov/directory/person/13818
https://www.wikidata.org/wiki/Q57899265 https://www.sciencebase.gov/directory/person/1772
https://www.wikidata.org/wiki/Q58040422 https://www.sciencebase.gov/directory/person/23535
https://www.wikidata.org/wiki/Q58220769 https://www.sciencebase.gov/directory/person/8340
https://www.wikidata.org/wiki/Q59815920 https://www.sciencebase.gov/directory/person/6850
https://www.wikidata.org/wiki/Q59821959 https://www.sciencebase.gov/directory/person/2654
https://www.wikidata.org/wiki/Q59832003 https://www.sciencebase.gov/directory/person/4383
https://www.wikidata.org/wiki/Q59832391 https://www.sciencebase.gov/directory/person/10922
https://www.wikidata.org/wiki/Q59832494 https://www.sciencebase.gov/directory/person/71843
https://www.wikidata.org/wiki/Q59832987 https://www.sciencebase.gov/directory/person/2044
https:

https://www.wikidata.org/wiki/Q68630131 https://www.sciencebase.gov/directory/person/685
https://www.wikidata.org/wiki/Q38802717 https://www.sciencebase.gov/directory/person/5729
https://www.wikidata.org/wiki/Q55187045 https://www.sciencebase.gov/directory/person/14527
https://www.wikidata.org/wiki/Q55189542 https://www.sciencebase.gov/directory/person/2208
https://www.wikidata.org/wiki/Q57408147 https://www.sciencebase.gov/directory/person/1395
https://www.wikidata.org/wiki/Q57414151 https://www.sciencebase.gov/directory/person/65351
https://www.wikidata.org/wiki/Q57414157 https://www.sciencebase.gov/directory/person/2514
https://www.wikidata.org/wiki/Q57414340 https://www.sciencebase.gov/directory/person/63590
https://www.wikidata.org/wiki/Q57414615 https://www.sciencebase.gov/directory/person/7980
https://www.wikidata.org/wiki/Q57777399 https://www.sciencebase.gov/directory/person/4275
https://www.wikidata.org/wiki/Q57844733 https://www.sciencebase.gov/directory/person/9998
https://

# Observations
* I was interested in the fact that "date of birth" showed up as a property on some records as that's often one of those "personally identifiable information" bits that we would never even have a property for in our government system. Most of these appear to be of lower than month/year precision, which is something that we can verify programmatically based on how WikiData handles date properties. An interesting case here that I glanced through is David M. Miller (https://www.wikidata.org/wiki/Q57729525, https://www.sciencebase.gov/directory/person/7005) who is now an emeritus scientist in USGS and who I worked with some time ago on a publication.
* Another interesting case is Luke Iwanowicz (https://www.wikidata.org/wiki/Q42260508, https://www.sciencebase.gov/directory/person/4946). His USGS record indicates that he is an active research ecologist at the Leetown Science Center, but his WikiData record does not indicate that he is an employee of the USGS. A view of the item's WikiData history shows that it has come about through a series of actions from 2017-2020 by both bots and what appear to be individual person edits. When I ran the process to link USGS people with ORCIDs to WikiData identities, I looked at only a match on ORCID and full name and did not filter out anyone who was not affiliated with the USGS in their WikiData record.
* Another interesting property I hadn't seen previously is "different from." A view there shows a cool disambiguation made in WikiData between the USGS Janet Carter (a Bureau Approving Official; https://www.wikidata.org/wiki/Q57408499 https://www.sciencebase.gov/directory/person/1620) and a Janet Carter in New Zealand in a completely different profession. That's pretty cool! The "different from" disambiguation was made by a user, Daniel Mietchen, in June 2019 according to the history. That tracks to when the item for Janet Carter from New Zealand was created, pointing to part of the power of the WikiData platform in helping to force disambiguation when items of similar name are added to the graph.
* A really interesting property that holds a lot of promise for one of our use cases dealing with capacity assessment is "field of work." I looked at records for Robert Stallard (https://www.wikidata.org/wiki/Q58358553 https://www.sciencebase.gov/directory/person/9820), who's a pretty famous ecologist who I believe is still based in Boulder. His WikiData record shows only his affiliation with the Smithsonian and provides field of work identifiers with references to his profile page there - https://stri.si.edu/scientist/robert-stallard. Sweet! We have another connection through to some other mineable asset. The values for field of work are relatively high level in terms of understanding specialties, but they are fully linked to associated depth within Wikipedia, which gives them a bit more weight and added context. In Robert's case, we see that he does work in reforestation among other things, which is a deeper level concept than we're going to see in anything USGS has classified out at this point.
* Another interesting case is Neal Woodman (https://www.wikidata.org/wiki/Q21339166 https://www.sciencebase.gov/directory/person/11434) who is a zoologist with the Patuxent Wildlife Research Center. Neal's field of work properties include some pretty specific references to species and not just general sciences, which is another interesting dynamic in itself that I'm exploring in WikiData. He links to the family of shrews and similar mammals, [Soricidae](https://www.wikidata.org/wiki/Q41631), which is a fairly specific reference. (I'm also exploring the phenomenon that WikiData has over 570,000 references to ITIS TSN identifiers that are linked to many other identification systems that we're interested in for other work I'm doing on taxa information assembly.) A view of the history of Neal's WikiData record shows that field of work values were added by a [real person](https://www.wikidata.org/wiki/User:Uncommon_fritillary), Diane Shaw, at the Smithsonian's Biodiversity Heritage Library. Cool!
* MusicBrainz artist ID - That sounds like an interesting property. That led me to discover Lori Phillips (https://www.wikidata.org/wiki/Q59832181 https://www.sciencebase.gov/directory/person/14202), a cartographer with NGTOC in Rolla, MO who also appears to be an accomplished classical musician. That brings up the dynamic of WikiData blending professional and personal connections that our corporate information systems would never accomplish, which will seem like a two-edged sword to some folks, I'm sure. One of the other interesting dynamics in Lori's record is the temporal qualifiers on employer, indicating a role with the Florida Department of Transportation between 2007 and 2009 before starting with USGS. That ability to track affiliation over time is something that we have not even yet tackled well in our internal system, and it may be reasonable to project that information out into a third party platform with a more suitable data model.
* The "member of" property is another really interesting one for a number of reasons, including its potential to influence our capacity assessment use case. Looking at Sarah Minson as an example (https://www.wikidata.org/wiki/Q59676610 https://www.sciencebase.gov/directory/person/58642), she actually has a number of really interesting properties in play that have come about through bots and humans, including [this user](https://www.wikidata.org/wiki/User:Trilotat) who I encountered already when modifying the USGS record. I can see from the WikiData record that Sarah is one of our [PECASE](https://www.wikidata.org/wiki/Q7241433) scientists, something that is not currently available as structured data anywhere from a USGS source. I have a connection to her doctoral advisor, which opens up another line of inquiry. And two professional society memberships indicate areas for data mining on conference involvement that may or may not have hit official publication channels.
* And then there's the wonderful Jayne Belnap (https://www.wikidata.org/wiki/Q68630131 https://www.sciencebase.gov/directory/person/685), also accessed via "member of" claims. Jayne is one of the world's foremost soil ecologists and is currently a USGS senior scientist (and an altogether wonderful person to hang out with)! Jayne's record has an interesting dynamic with an image property, something I've seen for others as well. Both "member of" and "award received," in this case, point to useful characteristics that could be used for further data mining.
* Examining the "described at URL" property led me to look at James Skinner (https://www.wikidata.org/wiki/Q62019888 https://www.sciencebase.gov/directory/person/9537), a research geologist at the USGS Astrogeology Science Center. The URL reference dynamic is an interesting one that we'll have to work through in terms of either semantic alignment or some data wrangling. There are many different properties that could have a URL pointing to a resource directly associated with a person and usable for further data mining and info gathering. Having them scattered across so many properties could make them a real challenge to nail down and exploit. One approach I may pursue is to simply go after everything that is a subproperty of [URL](https://www.wikidata.org/wiki/Property:P2699), assuming that most relevant properties have been at least classified to that level, and see if that gets us somewhere. James is also an interesting case (as am I), in that he has a whole lot of aliases listed in his WikiData identity, something else that is quite useful in this type of work. The "official website" property is better populated for USGS people. A look at the values shows that the majority point to our official staff profiles system, which is unfortunately not on a software-friendly platform currently.
* A look at USGS people with "ResearchGate profile ID" properties points out that a lot of us have them and maybe raises the question of why we're doing this work if ResearchGate has already figured it out. ResearchGate, Academia.edu, and some other platforms have done some great work both in a lot of this mining and in providing a platform with enough incentives to get folks interested in direct contributions of links and knowledge connections. Unfortunately, despite the occasional [attempt](https://github.com/kaleguy/scraper-api) at hacking their systems, they are closed, proprietary systems with business models that won't support open access. I'm not aiming to recreate what those guys have done at all, but I do want to explore the limits of what we can get to via linked open data channels and how we can project appropriate parts of our own closed corporate information (not proprietary because of policy but because of a lack of technological attention) out into those channels.
* And as one last point of curiosity, I was interested to find Barb Ralston (https://www.wikidata.org/wiki/Q14949164 https://www.sciencebase.gov/directory/person/8388), another of our USGS Bureau Approving Officials, via a quite specific property - "botanist author abbreviation." A look at the [history of that property](https://www.wikidata.org/wiki/Property_talk:P428) points to an interesting dynamic that we need to consider as we explore exploiting WikiData for our various use cases. Part of the power of the open dialog that drives everything in the "Wikiverse" is the traceability through the deliberative process. We will need to be equally deliberate in making choices about what properties we advocate for use in expounding on USGS items, leveraging fully the properties, definitions, and constraints being laid down by other WikiData contributors while lending our own expertise and judgment. And there are areas where we need to weigh in. For instance, in a conversation sparked by editing the USGS page, I've made a modification to the [DOI prefix](https://www.wikidata.org/wiki/Q18340367)'s formatterUrl parameter to point at a more robust resolver. I also need to weigh in on the single value constraint that WikiDataBots are flagging for that property, where I've added both the CrossRef and DataCite DOI ID spaces to our USGS record, as that is perfectly legitimate. To do so, I'll have to reach out to a Ukrainian group that seems to be the main user of that property.
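
On the "date of birth" observation above, the precision check can be sketched against the entity JSON that WikiData serves (the Special:EntityData documents referenced earlier). Time values carry an integer precision field, where 9 means year, 10 means month, and 11 means day. The helper name `birth_date_precisions` is hypothetical, and the sample entity is a made-up, trimmed structure rather than a real person's record.

```python
# Precision codes from the Wikibase time datatype (subset).
PRECISION_LABELS = {7: "century", 8: "decade", 9: "year", 10: "month", 11: "day"}


def birth_date_precisions(entity):
    """Return (time string, precision label) pairs for date of birth (P569) claims."""
    results = []
    for claim in entity.get("claims", {}).get("P569", []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue is None:
            # "unknown value" and "no value" snaks carry no datavalue.
            continue
        value = datavalue["value"]
        label = PRECISION_LABELS.get(value["precision"], "other")
        results.append((value["time"], label))
    return results


# Made-up entity fragment in the shape of the WikiData entity JSON.
sample_entity = {
    "claims": {
        "P569": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "datavalue": {
                        "type": "time",
                        "value": {"time": "+1950-00-00T00:00:00Z", "precision": 9},
                    },
                }
            }
        ]
    }
}

print(birth_date_precisions(sample_entity))
```

Running this over every linked item would tell us how many birth dates are recorded only to the year versus down to the day, which matters for how we think about personally identifiable information.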