USGS profile pages are one of the best sources of information about people, and one of the most directly controlled by individual staff. Unfortunately, they are part of a "closed" system that offers no real way to access content via software. This could be resolved easily and appropriately by embedding structured linked data in the staff profile pages; aligning that data with schema.org elements would make the content digestible for code and go a long way toward improving public search. In the meantime, I wrote some helper functions to let us get at this information for capacity assessment and other work.
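To make the idea of embedded structured data concrete, here is a minimal sketch of a schema.org Person description serialized as JSON-LD. The function name and every value in the example are hypothetical; only the schema.org property names (name, email, jobTitle, url, worksFor) are real vocabulary terms. Nothing like this exists on the profile pages today.

```python
import json


def person_jsonld(name, email, job_title, org_name, org_url, profile_url):
    """Build a schema.org Person description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "email": email,
        "jobTitle": job_title,
        "url": profile_url,
        "worksFor": {
            "@type": "Organization",
            "name": org_name,
            "url": org_url,
        },
    }


# Hypothetical values, for illustration only
doc = person_jsonld(
    name="Jane Doe",
    email="jdoe@example.gov",
    job_title="Hydrologist",
    org_name="Example Water Science Center",
    org_url="https://example.gov/centers/example-water",
    profile_url="https://example.gov/staff-profiles/jane-doe",
)

# JSON-LD is usually embedded in a page inside a script element of this type
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(doc, indent=2)
    + "\n</script>"
)
print(snippet)
```

With something like this in each profile page, a harvester could parse the script element instead of scraping the visible HTML.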

This code sequence pulls the full inventory of staff so that we can link known information found in other sources (email address) with the URL of the relevant staff profile page. The inventory of staff is found at https://www.usgs.gov/connect/staff-profiles, and we have to walk through every page from that starting point to get everyone. A simple parallel process does that, hitting each page with a scraping function that gets every listing from the page and accumulating the results into a combined list of dictionaries that can be saved for later use.

In [1]:
import pylinkedcmd.pylinkedcmd as l_cmd
from joblib import Parallel, delayed
import tqdm

usgs_web = l_cmd.UsgsWeb()

The get_staff_inventory_pages() function hits the first page in the staff profile listing and extracts the page number for the last page from a specific link on that page. It then sets up and returns a list of URLs that we need to hit to get everyone.

In [2]:
inventory_urls = usgs_web.get_staff_inventory_pages()
len(inventory_urls)

282
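The page-walking logic described for get_staff_inventory_pages() can be sketched roughly as follows. This is not the pylinkedcmd implementation: the function name, the zero-based `page=N` query parameter, and the sample HTML are all assumptions made for illustration, and the sketch works on an HTML string rather than fetching anything.

```python
import re

BASE_URL = "https://www.usgs.gov/connect/staff-profiles"


def inventory_page_urls(first_page_html, base_url=BASE_URL):
    """Return one listing URL per page, based on the pagination links found."""
    # Assumption: pagination links carry a zero-based "page=N" query parameter,
    # so the largest N on the first page identifies the last page.
    page_numbers = [int(n) for n in re.findall(r"\bpage=(\d+)", first_page_html)]
    last_page = max(page_numbers, default=0)
    return [f"{base_url}?page={n}" for n in range(last_page + 1)]


# Made-up pagination markup, for illustration only
sample_html = (
    '<a href="?page=1">next</a> '
    '<a href="?page=3" title="Go to last page">last</a>'
)
urls = inventory_page_urls(sample_html)
print(len(urls), urls[0], urls[-1])
```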

There are lots of ways of handling the data spit out by this process. In this case, everything is pretty small, and we can run the whole process in memory and build a list of all the records we care about. This codeblock sets up the list that we'll build and then provides a simple accumulator function that runs the core function for processing/scraping a given page for its inventory.

In [3]:
staff_inventory = list()

def accumulator(url):
    staff_inventory.extend(usgs_web.get_staff_listing(url))

joblib is a Python package for running simple parallel processes over lists/batches of stuff we need to process. In this case, I set it to use threads and work through up to 20 URLs at a time from our list, accumulating into the inventory. The tqdm package does a nice job of showing how far along the process is.

In [4]:
%%time
Parallel(n_jobs=20, prefer="threads")(
    delayed(accumulator)(i) for i in tqdm.tqdm(inventory_urls)
)

100%|██████████| 282/282 [00:22<00:00, 12.37it/s]


CPU times: user 24.1 s, sys: 1.1 s, total: 25.2 s
Wall time: 25.4 s


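Because accumulator returns nothing, the parallel call itself only yields None values; the real results land in staff_inventory as a side effect. A variant that returns the records directly can be sketched with the standard library's ThreadPoolExecutor in place of joblib. The scrape_all helper and the stand-in scraper below are hypothetical; in this notebook the scraping function would be usgs_web.get_staff_listing.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def scrape_all(urls, scrape_page, max_workers=20):
    """Run scrape_page over urls in a thread pool and flatten the per-page lists."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs
        per_page = list(pool.map(scrape_page, urls))
    return list(chain.from_iterable(per_page))


def fake_scraper(url):
    """Stand-in for a real page scraper; returns one made-up record per page."""
    return [{"profile": url, "name": f"Person from {url}"}]


records = scrape_all(["page-0", "page-1", "page-2"], fake_scraper)
print(len(records))
```

Returning values instead of appending to a shared list keeps the scraping function free of side effects, which makes it easier to test on a handful of pages.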

In working through the data that come back from this scraping exercise, I ran into some curious issues that make it a bit difficult to work with the information in some circumstances. I wrote the get_unique_staff function to deal with these and return a workable list of unique staff members and the best available information we can gain from this source.

* In some cases there are complete duplicate records, where all attributes gathered for a given staff listing are the same. I remove those duplicates, assuming they may be artifacts of the pagination mechanism.
* There are quite a few cases where no email address is available for the person. Many of these people do have a profile page, and in checking those I see that no email is listed there either. Those individuals (or at least most of them) do have an email address in our main directory resource (publicly available on ScienceBase), and I do not know why their emails have been excluded from the staff profile system. There's not a lot we can do with these records in terms of linking them to other sources without some conjecture. To deal with this here, I assign an "identifier" property in the scraping process that is either the email or the name, with an exception for one more corner case where different people share a common email (ask@usgs.gov is the only one so far). This lets me group other properties together on those somewhat unique values. I did not use the profile URL as a possible identifier because there are cases where the same individual has two profile pages.
* Some individuals appear to have more than one profile page in the system. I assume this is a mistake or problem in the data platform behind this: looking through some of these cases (including one where the same person apparently has 7 profile pages), I see matching details (email and telephone), with one profile page populated and another not. My guess is that the system people use to create their profile pages has (or had) some problems that will need to be cleaned up.
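The identifier rule from the list above can be sketched as a small function. This is my reading of the rule, not the code in pylinkedcmd: the function name is hypothetical, falling back to the name for a shared email is an assumption, and so is the whitespace and case normalization.

```python
# Emails known to be shared by several people (only one seen so far)
SHARED_EMAILS = {"ask@usgs.gov"}


def assign_identifier(record, shared_emails=SHARED_EMAILS):
    """Pick a grouping key for a staff record: the email if usable, else the name."""
    # Assumption: normalize case and whitespace so trivially different
    # spellings of the same address group together.
    email = (record.get("email") or "").strip().lower()
    if email and email not in shared_emails:
        return email
    # Assumption: with no email, or a shared one, fall back to the name.
    return record.get("name")


# Made-up records, for illustration only
with_email = assign_identifier({"email": "JDoe@Example.gov", "name": "Jane Doe"})
without_email = assign_identifier({"email": None, "name": "Jane Doe"})
shared_email = assign_identifier({"email": "ask@usgs.gov", "name": "John Roe"})
print(with_email, without_email, shared_email)
```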

To deal with all these issues and come up with an actionable inventory, I weed out duplicate full records, read the list of dictionaries into a Pandas dataframe, group by the identifier property while putting everything else into lists of unique values, and then send it back out as either a dataframe or a list of dictionaries. This means we need to process everything as lists even though most properties are discrete values, but we have what we need to tie this information to other data sources.
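The dedupe-then-group logic can be illustrated in plain Python. Note that this swaps the Pandas groupby for dictionaries so the example has no dependencies; it is a sketch of the same idea, not the get_unique_staff implementation, and it assumes flat records whose values are strings or None.

```python
def unique_staff(records, key="identifier"):
    """Drop exact duplicate records, then collect unique values per identifier."""
    # Step 1: remove records that are identical in every field
    seen = set()
    distinct = []
    for rec in records:
        fingerprint = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            distinct.append(rec)

    # Step 2: group by identifier, keeping each distinct value once per field
    grouped = {}
    for rec in distinct:
        entry = grouped.setdefault(rec[key], {key: rec[key]})
        for field, value in rec.items():
            if field == key:
                continue
            values = entry.setdefault(field, [])
            if value not in values:
                values.append(value)
    return list(grouped.values())


# Made-up records: one exact duplicate and one second profile for the same person
first = {
    "identifier": "jdoe@example.gov",
    "name": "Jane Doe",
    "profile": "https://example.gov/staff-profiles/jane-doe",
    "title": "Hydrologist",
}
second = {
    "identifier": "jdoe@example.gov",
    "name": "Jane Q Doe",
    "profile": "https://example.gov/staff-profiles/jane-q-doe",
    "title": "Hydrologist",
}
result = unique_staff([first, second, dict(first)])
print(result)
```

Every non-identifier field comes back as a list, which is why downstream code has to treat even single-valued properties such as title as lists.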

Along the way of looking through these data, I did run into a couple issues in string values picked up. A couple of them look like HTML output artifacts, but there are a few cases where it appears that the data are likely a little wonky coming from the source. I presume that we have some kind of database where the profile/staff inventory information lives. Perhaps it's directly within the Drupal framework that manages the USGS web sites now. I'm guessing there is also no utility on that end that is making sure all the information in that database aligns with something like the DOI Active Directory, which would be a logical connecting point. This is, again, an area where we need to do some cleanup, maintenance, and consistency checks across all the systems that make up our Corporate Master Data in the absence of a single unifying system that everything uses.

In [5]:
unique_staff_inventory = usgs_web.get_unique_staff(staff_inventory)

To demonstrate some of the issues, here's a list of what appear to be the same staff person with multiple profile pages. We'll have to treat these cases with some care and decide what to do about the content we can gather.

In [6]:
[i for i in unique_staff_inventory if len(i["profile"]) > 1]

[{'identifier': 'acbooth@usgs.gov',
  'profile': ['https://usgs.gov/staff-profiles/amanda-c-booth',
   'https://usgs.gov/staff-profiles/amanda-booth'],
  'email': ['acbooth@usgs.gov'],
  'name': ['Amanda Booth', 'Amanda C Booth'],
  'organization_name': ['Caribbean Florida Water Science Center', None],
  'organization_link': ['https://www.usgs.gov/centers/car-fl-water', None],
  'telephone': ['239-275-8448'],
  'title': ['Physical Scientist']},
 {'identifier': 'acgill@usgs.gov',
  'profile': ['https://usgs.gov/staff-profiles/amy-gill',
   'https://usgs.gov/staff-profiles/amy-c-gill'],
  'email': ['acgill@usgs.gov'],
  'name': ['Amy Gill', 'Amy C Gill'],
  'organization_name': ['Lower Mississippi-Gulf Water Science Center'],
  'organization_link': ['https://www.usgs.gov/centers/lmg-water'],
  'telephone': ['334-395-4128'],
  'title': ['Hydrologist']},
 {'identifier': 'ackarlson@usgs.gov',
  'profile': ['https://usgs.gov/staff-profiles/andrew-c-karlson',
   'https://usgs.gov/staff-profil

It doesn't take that long to run this, but I'll stick things in a file for later use anyway.

In [8]:
import pickle

# Use a context manager so the file is closed even if pickling fails
with open("usgs_staff.pkl", "wb") as outfile:
    pickle.dump(unique_staff_inventory, outfile)
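For later use, the saved file can be read back with pickle.load. Here is a minimal round-trip sketch; the helper names and the sample record are hypothetical, and a temporary directory stands in for wherever the file would really live. Pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile


def save_inventory(records, path):
    """Write the list of staff records to a pickle file."""
    with open(path, "wb") as outfile:
        pickle.dump(records, outfile)


def load_inventory(path):
    """Read a list of staff records back from a pickle file."""
    with open(path, "rb") as infile:
        return pickle.load(infile)


# Made-up record, for illustration only
sample = [{"identifier": "jdoe@example.gov", "name": ["Jane Doe"]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "usgs_staff.pkl")
    save_inventory(sample, path)
    restored = load_inventory(path)

print(restored == sample)
```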