This notebook scrapes the paginated USGS staff profiles listing to build a current inventory as structured data, then writes that inventory to the MediaWiki discussion page for the item representing that source in the GeoKB. A separate algorithm then processes the inventory content.

After discovering a limit on the size of data I can push, I stripped this process back to simply collecting the unique profile URLs (really just the name part of each profile URL). These pointers, tracked over time, are all I need to work out which profiles the GeoKB already knows about and which need to be pulled and added.
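As a rough illustration of what "the name part of the profile URL" means here, the sketch below pulls the final path segment out of either a relative profile path or a full URL. The function name `profile_name_from_url` is my own for this example and is not part of `isaid`.

```python
from urllib.parse import urlparse


def profile_name_from_url(profile_url: str) -> str:
    """Return the last path segment of a staff profile URL or relative path."""
    # urlparse handles both full URLs and relative paths like '/staff-profiles/x'
    path = urlparse(profile_url).path
    # Strip any trailing slash so the last segment is never empty
    return path.rstrip('/').split('/')[-1]


print(profile_name_from_url('/staff-profiles/brad-aagaard'))
print(profile_name_from_url('https://www.usgs.gov/staff-profiles/brad-aagaard/'))
```

Parsing the URL first, rather than splitting the raw string, keeps a query string or trailing slash from ending up in the stored name.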

In [253]:
from joblib import Parallel, delayed
from tqdm import tqdm
import pandas as pd  # needed below for the Parquet export
import yaml
import isaid
from wbmaker import WikibaseConnection

geokb = WikibaseConnection('GEOKB_CLOUD')

In [2]:
# I could get these dynamically, but they are the essential variables we need to run this
source_item = "Q44323"
profile_inventory_url = "https://www.usgs.gov/connect/staff-profiles"

# Scrape Inventory of Staff Profiles

In [3]:
# get the last page of the inventory
last_page_num = isaid.last_page(profile_inventory_url)
print(last_page_num)

487


In [4]:
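with Parallel(n_jobs=-1, prefer='threads') as parallel:
    inventories = parallel(
        delayed(isaid.get_inventory_page)(page_num)
        for page_num in tqdm(range(0, last_page_num + 1))
    )

# All pages have already been requested at this point, so there is no loop to exit;
# a None entry is taken to mean that a page could not be scraped
if None in inventories:
    print("At least one inventory page returned None.")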


100%|██████████| 488/488 [01:13<00:00,  6.60it/s]


In [9]:
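inventory_records = []
for page_records in inventories:
    # Skip pages that came back as None so extend() does not raise a TypeError
    if page_records is not None:
        inventory_records.extend(page_records)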


In [72]:
inventory_records[0]

{'date': '2024-01-04T08:27:26.817250',
 'page_num': 0,
 'name': 'Brad Aagaard',
 'profile': '/staff-profiles/brad-aagaard',
 'affiliations': [{'url': '/programs/earthquake-hazards',
   'name': 'Earthquake Hazards Program'}],
 'email': 'baagaard@usgs.gov',
 'telephone': '303-273-8481',
 'titles': ['Research Geophysicist']}

In [250]:
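# reincarnated_persons is not defined in the cells shown here; it is assumed to be
# a DataFrame with a 'profile' column. Build the lookup set once instead of per record.
excluded_profiles = set(reincarnated_persons['profile'])
scraped_inventory_records = [i for i in inventory_records if i['profile'] not in excluded_profiles]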

In [251]:
len(scraped_inventory_records)

5847

In [256]:
scraped_inventory_records[0]

{'date': '2024-01-04T08:27:26.817250',
 'page_num': 0,
 'name': 'Brad Aagaard',
 'profile': '/staff-profiles/brad-aagaard',
 'affiliations': [{'url': '/programs/earthquake-hazards',
   'name': 'Earthquake Hazards Program'}],
 'email': 'baagaard@usgs.gov',
 'telephone': '303-273-8481',
 'titles': ['Research Geophysicist']}

# Cache List of Scraped Profile Names
I tried to find a way to cache the entire data structure scraped from the inventory listing on the MediaWiki page for the source item, but the content is too large even when stripped down to a line-by-line structure. So I reverted to caching only the list of profile names. This gives us a snapshot through time of new names showing up and old names disappearing.

The way staff profile content is managed in the USGS Web system has varied a great deal over time. Current practice seems to be to strip out the last content for a person who has separated from the USGS and modify the title of their page to indicate that they are a former employee. This at least means that the last URL used is still live, unlike older cases where we have hundreds of people who simply "no longer exist."

In [254]:
source_page = geokb.mw_site.pages[f"Item_talk:{source_item}"]
current_profile_list = source_page.text().split(',')

In [278]:
source_page.save(",".join([i['profile'].split('/')[-1] for i in scraped_inventory_records]), summary='20240104')

OrderedDict([('result', 'Success'),
             ('pageid', 211045),
             ('title', 'Item talk:Q44323'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 472747),
             ('newrevid', 595723),
             ('newtimestamp', '2024-01-05T13:18:19Z'),
             ('watched', '')])
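
One way to put successive snapshots to use, sketched here with a handful of illustrative profile names: parse the comma-separated page text back into a set and compare it with the freshly scraped names. `parse_snapshot` and `diff_snapshots` are hypothetical helpers for this example, not functions from `isaid` or `wbmaker`.

```python
def parse_snapshot(page_text: str) -> set:
    """Turn the comma-separated page text into a set of profile names."""
    # Ignore empty fragments, which appear when the page is blank or ends with a comma
    return {name.strip() for name in page_text.split(',') if name.strip()}


def diff_snapshots(previous: set, current: set) -> dict:
    """Report names that appeared or disappeared between two snapshots."""
    return {
        'new': sorted(current - previous),
        'disappeared': sorted(previous - current),
    }


# Illustrative values only; real input would be source_page.text() and the fresh scrape
previous = parse_snapshot('brad-aagaard,justin-abel,kate-ackerman')
current = {'brad-aagaard', 'kate-ackerman', 'emery-saylor'}
changes = diff_snapshots(previous, current)
print(changes)
```

Reading the previous snapshot before saving the new one matters here, since the save overwrites the page text (earlier revisions remain in the page history).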

# Process New Inventory

I still have a major problem with being able to use the Wikibase instance effectively for this use case. At the moment, both the Blazegraph and Elasticsearch components of WBStack have problems. We can't turn up all records via SPARQL because the Blazegraph store is not picking up all the changes from the message queue, and there is some other problem with how the search index is functioning (resulting in an API error).

Given a set of profile name values, we have to be able to check the WB instance for those that we already know about vs. anything that has changed - profile names disappearing from the USGS web inventory and new profile names showing up.
* For "disappearing people," we need to determine if a person really is no longer with the USGS or if something about the URL changed, which might indicate that someone changed their name but could also just be a case where someone made a mistake at one point and then corrected it (I've seen this happen frequently). We will need to attempt to retrieve the profile URL itself for any cases where the URL is no longer in the inventory and decide what to do from that point.
    * If we get a redirect, we should record the new profile URL but keep the old one as well, recording the date and status code for the old URL.
    * If we get a 404, we can record that as well.
* For "new people," we will need to first determine if we already know about the individual by another identifier. If so, we can record new information on the entity from the profile. If not, we can create a new entity.
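
The handling of "disappearing people" could be sketched roughly as follows. This assumes the status code and any `Location` header have already been fetched for the old URL with an HTTP client that has redirect following turned off. The function name, the record fields, and the outcome labels are all placeholders of mine, and the example URLs are invented.

```python
from datetime import datetime, timezone
from typing import Optional


def classify_profile_response(old_url: str, status_code: int,
                              location: Optional[str] = None) -> dict:
    """Build a record describing what happened to a profile URL that left the inventory."""
    record = {
        'old_url': old_url,
        'status_code': status_code,
        'retrieved': datetime.now(timezone.utc).isoformat(),
        'new_url': None,
    }
    if status_code in (301, 302, 303, 307, 308) and location:
        # Redirect: keep the old URL and record where it now points
        record['new_url'] = location
        record['outcome'] = 'redirected'
    elif status_code == 404:
        record['outcome'] = 'not_found'
    elif status_code == 200:
        # Page still resolves even though it is missing from the listing
        record['outcome'] = 'still_live'
    else:
        record['outcome'] = 'needs_review'
    return record


# Invented URLs, for illustration only
moved = classify_profile_response(
    'https://www.usgs.gov/staff-profiles/old-name', 301,
    'https://www.usgs.gov/staff-profiles/new-name')
gone = classify_profile_response('https://www.usgs.gov/staff-profiles/old-name', 404)
print(moved['outcome'], gone['outcome'])
```

Keeping the classification separate from the HTTP request makes it easy to rerun the decision logic against cached status codes without hitting the USGS site again.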

In [282]:
pd.DataFrame(scraped_inventory_records).to_parquet('../data/staff_profile_inventory.parquet')

In [285]:
query_persons = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX ps: <https://geokb.wikibase.cloud/prop/statement/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?item ?itemLabel ?profile_url ?retrieved ?status_code ?orcid ?email
WHERE {
  ?item wdt:P1 wd:Q3 .
  OPTIONAL {
    ?item wdt:P31 ?profile_url .
  }
  OPTIONAL {
    ?item wdt:P106 ?orcid .
  }
  OPTIONAL {
    ?item wdt:P109 ?email .
  }
  OPTIONAL {
    ?item p:P31 ?ref_url_statement .
    ?ref_url_statement pq:P151 ?status_code ;
                       pq:P139 ?retrieved .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

df_geokb_people = geokb.sparql_query(query_persons)
df_geokb_people['qid'] = df_geokb_people['item'].apply(lambda x: x.split('/')[-1])
df_geokb_people['profile_name'] = df_geokb_people['profile_url'].apply(lambda x: x.split('/')[-1] if x else None)
df_geokb_people['email'] = df_geokb_people['email'].apply(lambda x: x.split(':')[-1] if x else None)

In [286]:
df_geokb_people

Unnamed: 0,item,itemLabel,profile_url,retrieved,status_code,orcid,email,qid,profile_name
0,https://geokb.wikibase.cloud/entity/Q44385,Brad Aagaard,https://www.usgs.gov/staff-profiles/brad-aagaard,2023-09-30T00:00:00Z,200,0000-0002-8795-9833,baagaard@usgs.gov,Q44385,brad-aagaard
1,https://geokb.wikibase.cloud/entity/Q44386,Lowell Abbadini,https://www.usgs.gov/staff-profiles/lowell-abb...,2023-09-30T00:00:00Z,200,,labbadin@usgs.gov,Q44386,lowell-abbadini
2,https://geokb.wikibase.cloud/entity/Q44387,Justin Abel,https://www.usgs.gov/staff-profiles/justin-abel,2023-09-30T00:00:00Z,200,,jabel@usgs.gov,Q44387,justin-abel
3,https://geokb.wikibase.cloud/entity/Q44389,Kate Ackerman,https://www.usgs.gov/staff-profiles/kate-ackerman,2023-09-30T00:00:00Z,200,0000-0003-3925-721X,kackerman@usgs.gov,Q44389,kate-ackerman
4,https://geokb.wikibase.cloud/entity/Q44392,"Amanda S Ackiss, PhD",https://www.usgs.gov/staff-profiles/amanda-s-a...,2023-09-30T00:00:00Z,200,0000-0002-8726-7423,aackiss@usgs.gov,Q44392,amanda-s-ackiss
...,...,...,...,...,...,...,...,...,...
10973,https://geokb.wikibase.cloud/entity/Q49109,Emery Saylor,https://www.usgs.gov/staff-profiles/emery-saylor,,,,esaylor@usgs.gov,Q49109,emery-saylor
10974,https://geokb.wikibase.cloud/entity/Q48081,Brianna Minor,https://www.usgs.gov/staff-profiles/brianna-minor,,,,bminor@usgs.gov,Q48081,brianna-minor
10975,https://geokb.wikibase.cloud/entity/Q49873,Taaja R Tucker-Silva,https://www.usgs.gov/staff-profiles/taaja-r-tu...,,,,ttucker-silva@usgs.gov,Q49873,taaja-r-tucker-silva
10976,https://geokb.wikibase.cloud/entity/Q47559,Tricia Lincoln,https://www.usgs.gov/staff-profiles/tricia-lin...,,,,tarenga@usgs.gov,Q47559,tricia-lincoln


In [288]:
df_geokb_people[(df_geokb_people['profile_name'].isnull()) & (df_geokb_people['orcid'].notnull())]


Unnamed: 0,item,itemLabel,profile_url,retrieved,status_code,orcid,email,qid,profile_name
6143,https://geokb.wikibase.cloud/entity/Q138416,Steve Colman,,,,0000-0002-0564-9576,,Q138416,
6144,https://geokb.wikibase.cloud/entity/Q138417,Sarah Ryker,,,,0000-0002-1004-5611,,Q138417,
6145,https://geokb.wikibase.cloud/entity/Q138418,John Brock,,,,0000-0002-5289-9332,,Q138418,
6146,https://geokb.wikibase.cloud/entity/Q138419,Daniel Roby,,,,0000-0001-9844-0992,,Q138419,
6147,https://geokb.wikibase.cloud/entity/Q138420,Michael Dettinger,,,,0000-0002-7509-7332,,Q138420,
...,...,...,...,...,...,...,...,...,...
10860,https://geokb.wikibase.cloud/entity/Q139426,Moritz Bigalke,,,,0000-0002-6793-6159,,Q139426,
10861,https://geokb.wikibase.cloud/entity/Q139446,This Rutishauser,,,,0000-0003-0561-2164,,Q139446,
10862,https://geokb.wikibase.cloud/entity/Q139425,Jean Soubestre,,,,0000-0002-9403-8821,,Q139425,
10863,https://geokb.wikibase.cloud/entity/Q139441,Carolina Pagli,,,,0000-0002-9072-3004,,Q139441,


In [289]:
df_geokb_people[df_geokb_people['itemLabel'] == 'Sarah Ryker']

Unnamed: 0,item,itemLabel,profile_url,retrieved,status_code,orcid,email,qid,profile_name
6144,https://geokb.wikibase.cloud/entity/Q138417,Sarah Ryker,,,,0000-0002-1004-5611,,Q138417,


At this point, we have records in the GeoKB for several thousand more people than the current USGS Staff Profiles system contains. These include many cases where we have now-defunct profile URLs that were valid at one time. I need to run some cleanup through these:

* Attempt to retrieve the URLs and record retrieved date and status code
* Check back to previous iSAID cache on CHS and pull in the last available profile scrapes where possible
* Consider using the internal source from FPPS to record the last pay period date for separated employees
* I need to also work through email addresses recorded for some of these records. At one time, these were valid within the USGS system, but that may no longer be the case. We still want to retain them, because there are systems where these were used as identifiers, and they could still be useful within that time-bound context.

We also have several thousand named individuals who have ORCID identifiers that come from one of our public catalogs (publications or datasets). These most likely represent co-authors that we want to pull some information together on for various analytical purposes. There may be cases where we have disambiguation problems, and some of these identified individuals may need to be merged together with other records.
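
A first pass at surfacing possible merge candidates might look something like the sketch below. It groups records whose labels reduce to the same crude key. The normalization rules are an assumed heuristic of mine, not an established GeoKB procedure, and the second QID in the sample data is invented. Name matching alone will produce false positives, so the output is a list for manual review rather than a set of merges to apply.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Reduce a person label to a crude comparison key (assumed heuristic)."""
    # Drop anything after a comma, e.g. the ', PhD' in 'Amanda S Ackiss, PhD'
    base = label.split(',')[0]
    # Lowercase and remove punctuation such as periods after initials
    base = re.sub(r'[^\w\s-]', '', base.lower())
    # Drop single-letter tokens, which are usually middle initials
    tokens = [t for t in base.split() if len(t) > 1]
    return ' '.join(tokens)


def merge_candidates(people: list) -> dict:
    """Group QIDs whose labels normalize to the same key; keep groups of 2+."""
    groups = defaultdict(list)
    for person in people:
        groups[normalize_label(person['itemLabel'])].append(person['qid'])
    return {key: qids for key, qids in groups.items() if len(qids) > 1}


# Illustrative records; the second QID is invented for the example
people = [
    {'qid': 'Q44392', 'itemLabel': 'Amanda S Ackiss, PhD'},
    {'qid': 'Q999999', 'itemLabel': 'Amanda Ackiss'},
    {'qid': 'Q44387', 'itemLabel': 'Justin Abel'},
]
print(merge_candidates(people))
```

Where two records in a group carry different ORCID identifiers, that is a strong sign they are distinct people and should not be merged.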