USGS staff profile pages are one of the best sources of information about people, and individual staff control their own content. Unfortunately, they are part of a "closed" system that offers no real way to access content via software. This could be resolved easily and appropriately by embedding structured linked data in the profile pages; aligning that data with schema.org elements would make the content digestible for code and go a long way toward improving public search. In the meantime, I wrote some helper functions to let us get at this information for capacity assessment and other work.
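To make the structured data suggestion concrete, here is a minimal sketch of what an embedded description of one person might look like as JSON-LD using the schema.org Person type. The helper name `person_jsonld` and the choice of properties (`name`, `url`, `email`) are my assumptions for illustration, not something the profile system provides.

```python
import json

def person_jsonld(name, profile_url, email):
    """Build a schema.org Person description as a JSON-LD dictionary.

    Hypothetical helper for illustration; the property selection is an assumption.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "url": profile_url,
        "email": email,
    }

doc = person_jsonld(
    "Kate E Allstadt, Ph.D.",
    "https://usgs.gov/staff-profiles/kate-e-allstadt",
    "kallstadt@usgs.gov",
)

# This string is what would sit inside a <script type="application/ld+json"> tag.
print(json.dumps(doc, indent=2))
```

With something like this embedded in each page, any JSON-LD aware client could read the same fields we scrape in the rest of this notebook.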

This code sequence pulls the full inventory of staff so that we can link information known from other sources (keyed on email address) to the URL of the relevant staff profile page. The inventory is found at https://www.usgs.gov/connect/staff-profiles, and we have to walk through every page from that starting point to get everyone. The process runs in parallel, hitting each page with a scraping function that extracts every listing on the page and accumulating the results into a combined list of dictionaries that can be saved for later use.

In [1]:
import pylinkedcmd.pylinkedcmd as l_cmd
from joblib import Parallel, delayed
import tqdm

usgs_web = l_cmd.UsgsWeb()

The get_staff_inventory_pages() function hits the first page in the staff profile listing and extracts the page number for the last page from a specific link on that page. It then sets up and returns a list of URLs that we need to hit to get everyone.

In [2]:
inventory_urls = usgs_web.get_staff_inventory_pages()
len(inventory_urls)

282
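The general pattern behind a function like this can be sketched as follows. This is not the pylinkedcmd implementation: the `page` query parameter, the zero-based numbering, and the function name `build_inventory_urls` are assumptions for illustration, and the last page number is passed in rather than scraped from the first page.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.usgs.gov/connect/staff-profiles"

def build_inventory_urls(last_page, base_url=BASE_URL):
    """Return one listing URL per page, from page 0 through last_page inclusive.

    Assumes (hypothetically) that pages are selected with a zero-based
    "page" query parameter.
    """
    return [
        base_url + "?" + urlencode({"page": n})
        for n in range(last_page + 1)
    ]

urls = build_inventory_urls(last_page=281)
print(len(urls))
print(urls[0], urls[-1])
```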

There are lots of ways to handle the data this process produces. In this case, everything is pretty small, so we can run the whole process in memory and build a list of all the records we care about. This code block sets up the list that we'll build and then provides a simple accumulator function that runs the core scraping function on a given page and adds that page's listings to the list.

In [3]:
staff_inventory = list()

def accumulator(url):
    staff_inventory.extend(usgs_web.get_staff_listing(url))

joblib is a Python package for running simple parallel processes over lists/batches of things we need to process. In this case, I set it to prefer threads and run 10 jobs at a time, each one accumulating the listings for one URL from our list into the inventory. The tqdm package does a nice job of showing how far along the process is.

In [4]:
%%time
Parallel(n_jobs=10, prefer="threads")(
    delayed(accumulator)(url) for url in tqdm.tqdm(inventory_urls)
)

100%|██████████| 282/282 [00:50<00:00,  5.54it/s]


CPU times: user 23.1 s, sys: 600 ms, total: 23.7 s
Wall time: 54.6 s


[None, None, None, ...]
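An alternative to appending to a shared list from worker threads is to have each task return its records and flatten the results afterwards, which also avoids the long list of None values that the call returns. The sketch below uses the standard library's `ThreadPoolExecutor` rather than joblib, and `fake_get_staff_listing` is a hypothetical stand-in for `usgs_web.get_staff_listing` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def fake_get_staff_listing(url):
    """Hypothetical stand-in for a page scraper: returns two records per URL."""
    return [{"profile": f"{url}#a"}, {"profile": f"{url}#b"}]

def collect_listings(urls, fetch, max_workers=10):
    """Run fetch over urls in a thread pool and flatten the per-page lists."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs
        per_page = list(pool.map(fetch, urls))
    return list(chain.from_iterable(per_page))

demo_urls = [f"https://example.org/staff?page={n}" for n in range(5)]
records = collect_listings(demo_urls, fake_get_staff_listing)
print(len(records))
```

Returning values keeps the worker function free of side effects, which makes it easier to test on a single page before running the full set.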

Here's what an individual staff listing looks like.

In [6]:
staff_inventory[99]

{'name': 'Kate E Allstadt, Ph.D.',
 'profile': 'https://usgs.gov/staff-profiles/kate-e-allstadt',
 'email': 'kallstadt@usgs.gov'}
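Because each record carries an email address, joining the inventory to other sources comes down to a lookup keyed on that address. Here is a minimal sketch, assuming emails should be compared case-insensitively; `index_by_email` and the `other_source` records are hypothetical illustrations, not part of pylinkedcmd.

```python
def index_by_email(records):
    """Map lower-cased email address to the staff record that carries it."""
    return {
        r["email"].strip().lower(): r
        for r in records
        if r.get("email")  # skip listings with no email
    }

inventory = [
    {
        "name": "Kate E Allstadt, Ph.D.",
        "profile": "https://usgs.gov/staff-profiles/kate-e-allstadt",
        "email": "kallstadt@usgs.gov",
    },
]

# Hypothetical record from some other system that only knows the email.
other_source = [{"email": "KAllstadt@usgs.gov", "project": "example"}]

by_email = index_by_email(inventory)
for item in other_source:
    match = by_email.get(item["email"].strip().lower())
    # Attach the profile URL when the email is found, otherwise leave it empty.
    item["profile"] = match["profile"] if match else None

print(other_source[0]["profile"])
```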

It doesn't take that long to run this, but I'll stick things in a file for later use anyway.

In [7]:
import pickle

# Use a context manager so the file is closed even if the dump fails
with open("usgs_staff.pkl", "wb") as outfile:
    pickle.dump(staff_inventory, outfile)
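Reading the inventory back in a later session is the mirror image of the dump. This sketch writes a small sample to a temporary directory so it does not touch usgs_staff.pkl; the sample record and the helper names `save_inventory` and `load_inventory` are illustrative assumptions.

```python
import os
import pickle
import tempfile

def save_inventory(records, path):
    """Write the list of staff records to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(records, f)

def load_inventory(path):
    """Read a list of staff records back from a pickle file."""
    # Only unpickle files you created yourself; pickle can execute code on load.
    with open(path, "rb") as f:
        return pickle.load(f)

sample = [
    {
        "name": "Example Person",
        "profile": "https://example.org/p",
        "email": "person@example.org",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "staff_sample.pkl")
    save_inventory(sample, path)
    restored = load_inventory(path)

print(restored == sample)
```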