Unfortunately, there is also no programmatic way to get at the information that USGS staff have put together for their profile pages, so I had to write a scraper for it. I put this into a single function that operates against a given staff profile URL, and for now it is limited to information that we can't get from other sources. The interesting bits include a set of keywords, drawn from the USGS Thesaurus, that represent self-assertions of expertise. The other main section that individuals control themselves is a large body of HTML. We might be able to work through this content with further processing, but for now I store it as a string and extract all of the links from it into a list. The links often include publications that are not part of the listing coming from official channels, which can be a useful addendum to a research record.
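
To make the link extraction step concrete, here is a minimal sketch using only the standard library. It is not the scraper in pylinkedcmd; `LinkCollector`, `extract_links`, and the sample HTML are hypothetical names and data used purely for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we care about here.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(body_html):
    """Return every href found in an HTML fragment (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(body_html)
    collector.close()
    return collector.links


# Made-up profile body, standing in for the scraped HTML string.
sample_body = (
    '<p>See <a href="https://doi.org/10.0000/example">a paper</a> and '
    '<a href="https://www.usgs.gov/">USGS</a>.</p>'
)
profile_links = extract_links(sample_body)
```

Keeping the raw HTML string alongside the extracted list means the body can be reprocessed later without scraping the page again.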

As an example of working with this capability, this notebook picks up on some work that Leslie Hsu started for the USGS Community for Data Integration (CDI). We focus on CDI members from USGS who have staff profiles, scrape all of their expertise keywords, and then work through the results. There's more to do here in terms of analysis with the keywords, which we'll get into elsewhere.

In [1]:
import pylinkedcmd.pylinkedcmd as l_cmd
from joblib import Parallel, delayed
import tqdm
import pickle
import json
import requests
import os
from sqlite_utils import Database

usgs_web = l_cmd.UsgsWeb()

CDI members are all part of a group in Confluence used for the CDI wiki. Unfortunately, the API for that is not public, so I had to build this as an authenticated API connection to get the group membership. As far as I know, anyone with a myUSGS account should be able to run this. I then keep just the CDI members who have usgs.gov email addresses (usernames) and figure out which ones have listings in our profiles data.

In [2]:
from getpass import getpass
from requests.auth import HTTPBasicAuth

def get_group(group_name, source="file", file_name="cdi-users.json", filter_usgs=True):
    if source == "file" and os.path.exists(file_name):
        with open(file_name, "r") as f:
            # The with block closes the file on exit, so no explicit close is needed.
            results = json.load(f)
    else:
        session = requests.Session()
        session.headers.update({'Accept': 'application/json'})
        session.headers.update({'Content-Type': 'application/json'}) 

        # This is HTTP Basic authentication (not digest); getpass() does not echo the password.
        basic_auth = HTTPBasicAuth(input("User Name: "), getpass())

        conf_url = f"https://my.usgs.gov/confluence/rest/api/group/{group_name}/member"

        results = session.get(conf_url, auth=basic_auth).json()
        
    user_list = [
        {
            "email": i["username"], 
            "name": i["displayName"]
        } for i in results["results"]
    ]
    
    if filter_usgs:
        user_list = [i for i in user_list if i["email"].split("@")[-1] in ["usgs.gov","contractor.usgs.gov"]]

    return user_list

def get_user_url(user_email):
    r = requests.get(f"https://www.sciencebase.gov/directory/people?format=json&email={user_email}").json()
    if len(r["people"]) != 1:
        return None

    return r["people"][0]["url"]
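
One thing to watch with the group lookup is paging. The sketch below assumes the Confluence member endpoint accepts `start` and `limit` query parameters and returns each page under a `results` key; I have not confirmed the page-size cap on this particular server. `collect_members` and `fake_fetch` are hypothetical helpers, and the fake stands in for the authenticated request.

```python
def collect_members(fetch_page, limit=200):
    """Gather all results from a paged endpoint.

    fetch_page(start, limit) must return a dict with a "results" list.
    Assumes the server honors the requested limit; a short page ends the loop.
    """
    members = []
    start = 0
    while True:
        page = fetch_page(start, limit)
        batch = page.get("results", [])
        members.extend(batch)
        if len(batch) < limit:
            break
        start += len(batch)
    # Same shape as a single response, so downstream code is unchanged.
    return {"results": members}


def fake_fetch(start, limit):
    """Stand-in for the real request: serves five made-up users in pages."""
    data = [
        {"username": f"user{n}@usgs.gov", "displayName": f"User {n}"}
        for n in range(5)
    ]
    return {"results": data[start:start + limit]}


all_members = collect_members(fake_fetch, limit=2)
```

In `get_group`, the real `fetch_page` would wrap the authenticated `session.get` call and pass `start` and `limit` through `params`.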

In [3]:
%%time
cdi_members_usgs = [dict(item, url=get_user_url(item["email"])) for item in get_group("cdi-users")]

CPU times: user 2.5 s, sys: 211 ms, total: 2.71 s
Wall time: 44.3 s


Taking a quick look at the numbers, we have 25 CDI members with USGS user names who do not appear to have staff profile pages.

In [4]:
cdi_members_usgs_staff_profiles = [
    i for i in cdi_members_usgs if i["url"] is not None and "/staff-profiles/" in i["url"]
]

cdi_member_profile_urls = [i["url"] for i in cdi_members_usgs_staff_profiles]

print("Total CDI members from USGS: ", len(cdi_members_usgs))
print("USGS CDI members with URLs: ", len(cdi_members_usgs_staff_profiles))

Total CDI members from USGS:  131
USGS CDI members with URLs:  106


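This accumulator function builds a list of dictionaries containing member profiles from our parallel process. Because it appends to a shared list and returns nothing, the parallel call itself returns a list of None values, one per URL.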

In [5]:
cdi_member_profiles = list()

def accumulator(url):
    cdi_member_profiles.append(usgs_web.scrape_profile(url))
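
An alternative to the shared accumulator is to have each worker return its result and record any failures. This sketch swaps in the standard library's `concurrent.futures` in place of joblib. `scrape_all` and `fake_scrape` are hypothetical names, and the fake stands in for `usgs_web.scrape_profile`.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(scrape, urls, max_workers=10):
    """Run scrape(url) for each URL in a thread pool.

    Returns (profiles, failures): successful results in input order, and
    (url, error message) pairs for the calls that raised.
    """
    def attempt(url):
        try:
            return url, scrape(url), None
        except Exception as exc:  # keep going and record what failed
            return url, None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs.
        outcomes = list(pool.map(attempt, urls))

    profiles = [result for _, result, error in outcomes if error is None]
    failures = [(url, error) for url, _, error in outcomes if error is not None]
    return profiles, failures


def fake_scrape(url):
    """Stand-in scraper: fails for one made-up URL."""
    if url.endswith("/broken"):
        raise ValueError("no profile found")
    return {"profile": url}


demo_urls = [
    "https://example.org/staff-profiles/a",
    "https://example.org/broken",
    "https://example.org/staff-profiles/b",
]
profiles, failures = scrape_all(fake_scrape, demo_urls)
```

With this shape, one bad page does not abort the run, and the failures list says which URLs to retry.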

In [6]:
Parallel(n_jobs=10, prefer="threads")(
    delayed(accumulator)
    (
        i
    ) for i in tqdm.tqdm(cdi_member_profile_urls)
)

100%|██████████| 106/106 [01:19<00:00,  1.34it/s]


[None, None, ..., None]

In [7]:
db = Database("usgs_profiles.db")
db["staff_profiles"].upsert_all(cdi_member_profiles, pk="profile")

<Table staff_profiles (profile, display_name, profile_image_url, organization_name, organization_link, email, orcid, body_content_links, scraped_body_html, expertise)>
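
When reading the table back, keep in mind that sqlite-utils documents list and dictionary values as being stored as JSON text. The sketch below assumes that is how `expertise` ends up in this database, which I have not checked against the file itself. It uses the standard library's `sqlite3` with an in-memory table and one made-up row; `expertise_by_profile` is a hypothetical helper.

```python
import json
import sqlite3

# In-memory stand-in for usgs_profiles.db with only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staff_profiles "
    "(profile TEXT PRIMARY KEY, display_name TEXT, expertise TEXT)"
)
conn.execute(
    "INSERT INTO staff_profiles VALUES (?, ?, ?)",
    (
        "https://example.org/staff-profiles/a",
        "Example Person",
        json.dumps(["hydrology", "data management"]),  # list stored as JSON text
    ),
)


def expertise_by_profile(connection):
    """Map each profile URL to its decoded list of expertise terms."""
    rows = connection.execute(
        "SELECT profile, expertise FROM staff_profiles WHERE expertise IS NOT NULL"
    )
    return {profile: json.loads(expertise) for profile, expertise in rows}


decoded = expertise_by_profile(conn)
```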

We're going to do some more work with expertise terms, so we pull those out here into their own table for further processing. I'm experimenting with harmonizing expertise terms, aligning them with vocabulary assets, and extracting additional terms from referenced sources.

In [17]:
expertise_terms = list()
for item in [i for i in cdi_member_profiles if i.get("expertise")]:
    for term in item["expertise"]:
        for t in term.split(","):
            d_term = {
                "term_source": "USGS Staff Profiles",
                "source_identifier": item["profile"],
                "term": t.strip()
            }
            d_term["identifier"] = ":".join(d_term.values())
            expertise_terms.append(d_term)
db["expertise_terms"].upsert_all(expertise_terms, pk="identifier")

<Table expertise_terms (term_source, source_identifier, term, identifier)>
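
As a first pass at harmonizing, terms can be normalized and counted by how many distinct profiles assert them. This is only a sketch of one possible approach, not the analysis itself. `normalize_term`, `count_terms`, and the sample records are hypothetical, with the records shaped like the rows in `expertise_terms`.

```python
from collections import Counter


def normalize_term(term):
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def count_terms(term_records):
    """Count how many distinct profiles assert each normalized term."""
    seen = {
        (normalize_term(record["term"]), record["source_identifier"])
        for record in term_records
        if record["term"].strip()  # skip empty terms left by trailing commas
    }
    return Counter(term for term, _ in seen)


# Made-up records; the second profile repeats a term with different casing.
sample_terms = [
    {"source_identifier": "profile-a", "term": "Data Management"},
    {"source_identifier": "profile-b", "term": "data  management"},
    {"source_identifier": "profile-b", "term": "Hydrology"},
    {"source_identifier": "profile-b", "term": "hydrology"},
]
term_counts = count_terms(sample_terms)
```

Counting profiles, not raw rows, keeps one person who lists near-duplicate variants of a term from inflating its apparent popularity.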