# Imaginary sitemap for staff profiles
USGS Staff Profiles provide an official landing page for USGS employees, contractors, and volunteers. Creating and maintaining a profile is optional, and only about half of USGS employees have an active profile at any time, mostly those who are also publishing papers or contributing to other public products. Profiles hold some information that may only be found on those pages (chiefly expertise terms), though much of the content may duplicate well-maintained ORCID records.

In this notebook, I tee up what is essentially an imaginary sitemap for staff profiles, since nothing currently serves this function. The staff inventory is a listing element of the Drupal CMS for the USGS web. Listing pages of 12 staff each can be accessed via URL paths and scraped together to provide a complete list at a point in time. Nothing in the page headers, or in the visible or hidden content of the HTML pages, indicates when a staff profile was last updated, so we are left having to run a periodic process to scrape everything and then take appropriate action.

There is also a challenge in that the underlying management process for staff profiles has changed in impactful ways over time.
* When a person's name changes, they may or may not get a new URL
* Name changes that do result in a new URL may or may not end up with a redirect on an old URL; that old URL may simply disappear
* Employees who separate from the USGS (for whatever reason) may or may not have their page/URL retained; newer practice seems to be to remove the person from the inventory listing and retain the URL, but strip the page of everything other than a note that the person is a former employee
* Some URLs seem to have been mistakes (e.g., an incorrect middle initial used in a URL); these may or may not be retained once created and may or may not be a redirect
* Strangely, there are some cases of redirected URLs for a name change/mistake while other cases have multiple URLs producing the same content (it's not clear if these are underlying aliases or what the Drupal mechanism used might be)
* Data quality has not always seemed to be a priority with cases of incorrect ORCID IDs (sometimes multiple people showing up with the same ORCID ID), name misspellings, and other issues
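
The URL behaviors listed above (redirect, live page, removed page) can be partly told apart from the HTTP response alone. The following is a minimal sketch, assuming a status-code check is a reasonable first pass; `classify_profile_response` and its labels are hypothetical names used for illustration, not part of this notebook or of the USGS web.

```python
from typing import Optional


def classify_profile_response(status_code: int, location: Optional[str] = None) -> str:
    """Label how a profile URL behaved, based on an un-followed HTTP response."""
    if status_code in (301, 302, 303, 307, 308):
        # A redirect is only useful if it tells us where the profile went
        return "redirect" if location else "redirect-without-target"
    if status_code == 200:
        # A live page may still be an alias or a former-employee stub
        return "live"
    if status_code in (404, 410):
        return "gone"
    return "other"
```

With `requests`, the inputs could come from `r = requests.head(url, allow_redirects=False)`, passing `r.status_code` and `r.headers.get("Location")`. Telling an alias apart from a distinct page would still require comparing the page content itself.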

There is no one perfect publicly visible unique, persistent, and resolvable identifier for staff. Email addresses are used in places, such as the Pubs Warehouse to link to an author, and may be the most unique personal identifier for a time for some people. ORCID IDs are in increasing use but are mostly leveraged for authors or contributors to information products. The profile URLs themselves are interesting in that they do get referenced in other places (e.g., people include their USGS profile URL in their ORCID profile at times), but the problems with the management of that namespace make them a challenge.

The only thing we can really do at this point is gather up the references to profiles over time, store those lists and keep track of them so we can retain older references and examine changes as needed, and then follow the links to scrape/harvest available details. While we could pick up some basic details (email, phone, affiliation) from the inventory pages, I chose to simply scrape all the links together and stash them in the GeoKB source item. This creates what is essentially a sitemap focused on just the staff inventory. This sets us up to then scrape individual profiles and take further action.
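
Examining changes between stored lists comes down to set arithmetic on two snapshots. This is a minimal sketch of that comparison; `diff_snapshots` is a hypothetical helper and the profile names are made up for illustration.

```python
def diff_snapshots(previous, current):
    """Compare two lists of profile names and report what changed."""
    previous_set, current_set = set(previous), set(current)
    return {
        "added": sorted(current_set - previous_set),
        "removed": sorted(previous_set - current_set),
        "retained": sorted(previous_set & current_set),
    }


# Made-up profile names, for illustration only
older = ["jane-q-doe", "john-smith"]
newer = ["jane-doe", "john-smith"]
changes = diff_snapshots(older, newer)
```

A name that shows up under both "removed" and "added" in the same comparison (as with the two made-up variants here) is a candidate for a name change or corrected URL, though confirming that would take a look at the pages themselves.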

The code in this notebook will move to a routine process run daily to simply refresh the imaginary sitemap stored in the talk page for the source item.

In [1]:
import requests
import pandas as pd
import time
import isaid
from joblib import Parallel, delayed
from tqdm import tqdm
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

from wbmaker import WikibaseConnection, Logger
geokb = WikibaseConnection('GEOKB_CLOUD')

In [2]:
# Set run variables
source_item = "Q44323"

# Set up logging
logger_instance = Logger('staff_profiles')
logger = logger_instance.get_logger()

logger.info('Starting staff profile inventory scrape')

In [3]:
# We need to get the range of pages to scrape
def last_page(profile_inventory_url="https://www.usgs.gov/connect/staff-profiles"):
    r_profile_inventory = requests.get(profile_inventory_url)
    soup_profile_inventory = BeautifulSoup(r_profile_inventory.content, 'html.parser')
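    # find() returns None when the pager link is absent, so check before reading href
    last_page_tag = soup_profile_inventory.find('a', {'title': 'Go to last page'})
    if last_page_tag is None or not last_page_tag.get('href'):
        return
    last_page_link = last_page_tag['href']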

    last_page_url = "".join([profile_inventory_url, last_page_link])
    parsed_url = urlparse(last_page_url)
    query_params = parse_qs(parsed_url.query)
    last_page_num = query_params.get("page")
    if last_page_num:
        return int(last_page_num[0])

# Get a list of profile names from links on a web page
def profile_names_from_links(soup):
    profile_links = soup.find_all('a', href=lambda href: href and '/staff-profiles/' in href)
    if not profile_links:
        return None

    return list(set([l['href'].split('/')[-1] for l in profile_links]))

# Scrape profile urls from staff inventory page
def staff_profile_names(page_num):
    inventory_url = f'https://www.usgs.gov/connect/staff-profiles?node_staff_profile_type%5B141721%5D=141721&node_staff_profile_type%5B141730%5D=141730&node_staff_profile_type%5B141727%5D=141727&node_staff_profile_type%5B141728%5D=141728&node_staff_profile_type%5B141726%5D=141726&node_staff_profile_type%5B141722%5D=141722&node_staff_profile_type%5B141723%5D=141723&node_staff_profile_type%5B141719%5D=141719&node_staff_profile_type%5B141718%5D=141718&node_staff_profile_type%5B141759%5D=141759&node_staff_profile_type%5B141729%5D=141729&node_staff_profile_type%5B141717%5D=141717&node_staff_profile_type%5B141725%5D=141725&node_staff_profile_type%5B141745%5D=141745&node_staff_profile_type%5B141724%5D=141724&node_staff_profile_type%5B141720%5D=141720&node_staff_profile_type%5B141716%5D=141716&node_staff_profile_type_1=All&node_topics=All&items_per_page=12&node_states=&search_api_fulltext=&page={str(page_num)}'
    r = requests.get(inventory_url)
    if r.status_code != 200:
        return None

    soup = BeautifulSoup(r.text, 'html.parser')
    profiles = profile_names_from_links(soup)
    if not profiles:
        return None
    return profiles

# Put the profile name lists together, skipping pages that returned None
def assemble_lists(list_of_lists):
    return list(set(item for sublist in list_of_lists if sublist for item in sublist))

# Write the profile names to the source item's wiki page
def cache_profile_names(profile_names, source_item=source_item):
    source_item_page = geokb.mw_site.pages[f'Item_talk:{source_item}']
    # Sort so successive revisions of the talk page diff cleanly
    source_item_page.save(",".join(sorted(profile_names)), summary="Updated staff profile inventory")


# Inventory scraper
Here, we need to figure out what the last page in the current inventory is by scraping the first page and picking out that element. We then tee up the range of page numbers we need to run and then run those HTTP requests in parallel to get each inventory page, pick out the appropriate links, and assemble a list of profile names (the last part of the URL).
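
Because the pager is zero-based, a last page number of N means N + 1 pages to fetch. The following standard-library sketch shows that step in isolation; `page_range_from_href` is a hypothetical helper, and the sample href is an assumption about what the pager link looks like rather than a value copied from the site.

```python
from urllib.parse import urlparse, parse_qs


def page_range_from_href(href):
    """Return the zero-based page numbers implied by a 'last page' pager href."""
    page_values = parse_qs(urlparse(href).query).get("page")
    if not page_values:
        # No page parameter means we cannot infer a range
        return []
    return list(range(int(page_values[0]) + 1))


# Hypothetical pager href, for illustration only
pages = page_range_from_href("?items_per_page=12&page=3")
```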

In [4]:
# Get the last page number from the inventory
last_inventory_page = isaid.last_page()
logger.info("Last Inventory Page: %s", last_inventory_page)
print(logger_instance.get_last_message())

# Create a list of page numbers from 0 to last_inventory_page
page_numbers = list(range(last_inventory_page + 1))
logger.info("Number of inventory pages to read: %s", len(page_numbers))
print(logger_instance.get_last_message())

# Run the function in parallel for each page number
profile_name_lists = Parallel(n_jobs=-1)(delayed(staff_profile_names)(page_number) for page_number in tqdm(page_numbers))
logger.info("Number of profile name lists: %s", len(profile_name_lists))
print(logger_instance.get_last_message())

# Build one list of unique profile names
current_profile_names = assemble_lists(profile_name_lists)
logger.info("Number of unique profile names: %s", len(current_profile_names))
print(logger_instance.get_last_message())

Last Inventory Page: 484
Number of inventory pages to read: 485


100%|██████████| 485/485 [01:26<00:00,  5.58it/s]


Number of profile name lists: 485
Number of unique profile names: 5811


# Cache imaginary sitemap
In this final step, we write the list of profile names to the item in the GeoKB that represents the source. This gives us a living, versioned dataset of profile names cached on wiki pages. Each version is timestamped for when it was cached, and the page text can be read programmatically and parsed into a list for operational use. We have a somewhat artificial periodic snapshot of some (but probably not all) profile names going back a few years, based on the different times we have run this overall process while it was being developed.
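
Reading the cache back is the reverse of the write: split the comma-separated page text into names. This is a minimal sketch of that parsing step; `parse_cached_profile_names` is a hypothetical helper, and the sample text stands in for the talk page content.

```python
def parse_cached_profile_names(page_text):
    """Split comma-separated talk page text back into a list of profile names."""
    # Strip whitespace and drop empty entries left by stray commas
    return [name.strip() for name in page_text.split(",") if name.strip()]


# Made-up page text, for illustration only
sample_text = "jane-doe,john-smith, "
names = parse_cached_profile_names(sample_text)
```

If `geokb.mw_site` is an mwclient `Site` (an assumption based on how the page is saved above), the page text would likely come from something like `geokb.mw_site.pages[f'Item_talk:{source_item}'].text()`.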

In [5]:
# Write staff profile names to inventory
cache_profile_names(current_profile_names)
logger.info(f"Wrote {len(current_profile_names)} staff profile names to inventory")
print(logger_instance.get_last_message())

Wrote 5811 staff profile names to inventory
