In the GeoKB, we often need to build connections to USGS people from projects, publications, data assets, and other entities organized into the graph. Sticking with our approach of only dealing with public information sources, we source people records from the USGS Staff Profile listing along with clues about additional personnel that we pick up from the publication record (Pubs Warehouse, USGS Science Data Catalog, etc.). These sources do not always agree with one another because each is fed from a different internal data system. Together, however, they cover everyone who has an essentially public record about themselves, by virtue of either creating and maintaining a public profile page or publishing creative works of one kind or another under their name and identity.

Working with the USGS web site is still kind of a pain in that we need to scrape HTML. This notebook builds on previous work under the iSAID project where we first worked through the scraping logic. Having the knowledgebase to work against changes the dynamic of how we operate this system a little. We treat entities that get created in the GeoKB as a starting point from which we can regularly go after additional information for claims. One source for instantiating person entities is the inventory of staff profiles, which we have to run through and scrape via a paginated process. We also pick up clues about people who didn't come from that "master" source via Pubs Warehouse, Science Data Catalog, Model Catalog, and other sources of interest where we have authors/contributors listed. In GeoKB processing, we try not to introduce duplicate entities anywhere, so we try to gather enough information about each person, including reasonably unique identifiers, to tell them apart.

In this exercise, I took a different tack from work I'd done previously on processing and caching the various web scraping routines that turn less-than-optimally structured information from the web into structured data we can work against. Since we need to scrape different kinds of pages over time into different resulting data structures, including from changes in design that occur over time, I opted to start building out a running "log" of sorts. I continued to use a PostgreSQL/RDS table, because it is one of the simplest data store connections to deal with on CHS and it is plenty performant when operating from CHS Pangeo. (This notebook is set up to operate on a Python kernel running from the Pangeo environment.) The new "isaid_web_scrape_cache" table contains the URL scraped, a timestamp, the HTTP status code received, and a scrape structure stored as a JSON string.
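To make the "running log" idea concrete, here is a minimal sketch of an append-only cache with the four columns described above. It uses an in-memory `sqlite3` database purely as a stand-in for the PostgreSQL/RDS table; the column types, the `log_scrape` helper, and the example URL are assumptions for illustration and not the actual table definition.

```python
import json
import sqlite3

# sqlite3 stands in for the PostgreSQL/RDS table; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE isaid_web_scrape_cache (
        url TEXT NOT NULL,
        timestamp TEXT NOT NULL,   -- ISO 8601 string
        status_code INTEGER,
        scrape TEXT                -- JSON document serialized as a string
    )
    """
)


def log_scrape(conn, url, timestamp, status_code, scrape):
    """Append one scrape result; earlier rows for the same URL are kept."""
    conn.execute(
        "INSERT INTO isaid_web_scrape_cache VALUES (?, ?, ?, ?)",
        (url, timestamp, status_code, json.dumps(scrape)),
    )


example_url = "https://www.usgs.gov/staff-profiles/example-person"  # made-up URL
log_scrape(conn, example_url, "2024-01-01T00:00:00", 200, {"name": "Example Person"})
log_scrape(conn, example_url, "2024-02-01T00:00:00", 404, {"name": None})

# The full history for a URL stays available, ordered by scrape time.
history = conn.execute(
    "SELECT timestamp, status_code FROM isaid_web_scrape_cache "
    "WHERE url = ? ORDER BY timestamp",
    (example_url,),
).fetchall()
```

Because each scrape is a new row rather than an update, a profile page that later disappears still leaves its last successful scrape in the record.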

I tweaked the schema I had been using for personnel profiles a little bit here in the new function contained in this notebook. I've been working on some new methods for summarizing and extracting linkable entities from "lumps of text" (documents) using LLMs. From the profile pages, this mostly consists of what I put in here as "body_html," essentially the full HTML content that some personnel have built out for their profiles. Along with the expertise terms and the new structures for education and professional experience, this is the unique content that is really only available from the Staff Profile pages themselves. The dynamic listings of publications, data, etc. included on the pages can all be built out from original catalog sources more efficiently or from other web scraping of "science activities."

In [1]:
import os
import pandas as pd
from wbmaker import WikibaseConnection
import dateutil.parser

In [2]:
geokb = WikibaseConnection("GEOKB_CLOUD")

The starting point here is to pull all URLs associated with people that point to the "staff-profiles" path on the USGS web. These are the URLs we will be able to scrape with the process in the staff_profile_scrape() function. The following SPARQL query gets every item that is an "instance of" person and also has a reference URL, filtering to those URLs with "staff-profiles" in the path.

In [None]:
wd_persons_w_profile_url = "https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fperson%20%3FpersonLabel%20%3Furl%0AWHERE%20%7B%0A%20%20%3Fperson%20wdt%3AP1%20wd%3AQ3%20.%0A%20%20%3Fperson%20wdt%3AP31%20%3Furl%20.%0A%20%20FILTER(contains(str(%3Furl)%2C'staff-profiles'))%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D"

df_wd_people = geokb.url_sparql_query(sparql_url=wd_persons_w_profile_url, output_format="dataframe")
df_wd_people["qid"] = df_wd_people["person"].apply(lambda x: x.split("/")[-1])
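The URL-encoded query string above is hard to read, so the sketch below spells out the same SPARQL in plain text and builds a request URL from it with the standard library. The `build_sparql_url` helper and the constant names are made up for illustration; the encoded output may differ character-for-character from the hand-pasted URL (for example in how parentheses are escaped) while decoding to the same query.

```python
from urllib.parse import quote, unquote

SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

# Decoded form of the query used above: P1 = instance of, Q3 = person,
# P31 = reference URL (as described in the text).
PROFILE_QUERY = """PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?person ?personLabel ?url
WHERE {
  ?person wdt:P1 wd:Q3 .
  ?person wdt:P31 ?url .
  FILTER(contains(str(?url),'staff-profiles')) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}"""


def build_sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Percent-encode a SPARQL query into a GET URL (hypothetical helper)."""
    return f"{endpoint}?query={quote(query, safe='')}"


sparql_url = build_sparql_url(PROFILE_QUERY)
```

Keeping the readable query in the notebook and encoding it at run time makes later changes to the filter easier to review.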

This newish function for scraping USGS Staff Profile pages is not all that dissimilar to previous methods. The main thing I'm doing here is trying to soak up as much potentially useful content from the pages as possible without over-processing. I can then pull it out of the cache as an entire JSON document or query inside the JSON using postgres casting methods. Storing these documents in more of a running log form with a timestamp should serve our purposes well for anything we want to do with this information in future.

From the GeoKB perspective, all I'm currently working up from this scrape is the ORCID identifier (not available publicly elsewhere). That gives us a partial connection on authors/editors/contributors to the Pubs Warehouse. I'll revisit the long slate of "expertise" terms along with processing of text elements. Expertise terms are treated solely as tags from the USGS web perspective and do not link to any source of definition or semantics, so they will need some thought before we introduce them into the knowledge graph. One approach would be to do what I did previously in iSAID, matching some terms to the USGS Thesaurus with the big assumption that those definitions would be agreed to by the person entering the terms. Another approach would be to initially dump everything into some kind of high-level classification as instances of a concept, link to them so they are in the graph, and then give them further definition and classification down the road somewhere. This is an overall philosophical approach we'll have to examine in many other cases as well.
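Since the ORCID is the one value currently carried into the GeoKB, it may be worth normalizing and validating it before using it as an identifier. The sketch below is not part of the original workflow: it assumes the scraped text may hold either a bare identifier or a full `https://orcid.org/...` URL, and it checks the final character using the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers. The function names are made up.

```python
import re

# Four groups of four characters; the last character may be the check digit "X".
ORCID_PATTERN = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalize_orcid(raw_text):
    """Return a bare ORCID iD from scraped text, or None if absent or invalid."""
    if not raw_text:
        return None
    match = ORCID_PATTERN.search(raw_text.upper())
    if not match:
        return None
    orcid = match.group(1)
    digits = orcid.replace("-", "")
    if orcid_check_digit(digits[:15]) != digits[15]:
        return None
    return orcid
```

Rejecting values that fail the checksum keeps typos entered on a profile page from becoming identifiers that other sources are matched against.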

In [None]:
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import unicodedata
import re
import json
from joblib import Parallel, delayed
from tqdm import tqdm

def staff_profile_scrape(profile_url):
    profile = {
        "url": profile_url,
        "timestamp": datetime.now().isoformat(),
        "name": None,
        "name_qualifier": None,
        "title": None,
        "organization_name": None,
        "organization_link": None,
        "email": None,
        "orcid": None,
        "intro_statements": None,
        "expertise_terms": None,
        "professional_experience": None,
        "education": None,
        "body_html": None
    }

    # A timeout keeps one unresponsive page from hanging the whole run.
    r = requests.get(profile_url, timeout=30)
    profile["status_code"] = r.status_code

    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')

        first_h1 = soup.find('h1')
        if first_h1:
            name_text = unicodedata.normalize("NFKD", first_h1.text.strip())
            if name_text.endswith(")"):
                name_qual = re.search(r'\((.*?)\)', name_text)
                if name_qual:
                    profile["name_qualifier"] = name_qual.group(1)
                    profile["name"] = name_text.split("(")[0].strip()
                else:
                    profile["name"] = name_text
            else:
                profile["name"] = name_text

        org_div = soup.find('div', {'class': 'field-org-primary'})
        if org_div:
            microsite_div = org_div.find('div', {'class': 'field-microsite'})
            if microsite_div:
                org_link = microsite_div.find('a')
                if org_link:
                    profile["organization_name"] = org_link.text.strip()
                    profile["organization_link"] = f"https://www.usgs.gov{org_link['href']}"

            title_div = org_div.find('div', {'class': 'field-title'})
            if title_div:
                profile["title"] = unicodedata.normalize("NFKD", title_div.text.strip())

        email_div = soup.find('div', {'class': 'field-email'})
        if email_div:
            profile["email"] = email_div.text.strip()

        orcid_div = soup.find('div', {'class': 'field--name--field-staff-orcid'})
        if orcid_div:
            profile["orcid"] = orcid_div.text.strip()

        intro_div = soup.find('div', {'class': 'field-intro'})
        if intro_div:
            profile["intro_statements"] = [unicodedata.normalize("NFKD", i.text.strip()) for i in intro_div.find_all('p')]

        expertise_divs = soup.find_all('div', {'class': 'field-staff-expertise'})
        if expertise_divs:
            profile["expertise_terms"] = [unicodedata.normalize("NFKD", i.text.strip()) for i in expertise_divs]

        professional_experience_items = soup.find_all('li', {'class': 'field-professional-experience'})
        if professional_experience_items:
            profile["professional_experience"] = [unicodedata.normalize("NFKD", i.text.strip()) for i in professional_experience_items]

        education_items = soup.find_all('li', {'class': 'field-education'})
        if education_items:
            profile["education"] = [unicodedata.normalize("NFKD", i.text.strip()) for i in education_items]

        body_div = soup.find('div', {'class': 'body'})
        if body_div:
            profile["body_html"] = body_div.prettify()

    return {
                "url": profile["url"],
                "timestamp": profile["timestamp"],
                "status_code": profile["status_code"],
                "scrape": json.dumps(profile)
            }
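The name handling in the function above is the least obvious step, so here it is pulled out on its own as a small sketch. The `split_name_qualifier` helper is a made-up name that mirrors the same logic; the sample headings in the usage note are invented rather than taken from real profile pages.

```python
import re
import unicodedata


def split_name_qualifier(heading_text):
    """Split an <h1> heading into (name, qualifier), mirroring the scrape logic."""
    name_text = unicodedata.normalize("NFKD", heading_text.strip())
    # Only treat a parenthetical as a qualifier when the heading ends with ")".
    if name_text.endswith(")"):
        match = re.search(r"\((.*?)\)", name_text)
        if match:
            return name_text.split("(")[0].strip(), match.group(1)
    return name_text, None
```

For example, `split_name_qualifier("Jane Q. Public (Former Employee)")` gives `("Jane Q. Public", "Former Employee")`. One side effect to keep in mind: NFKD turns non-breaking spaces into ordinary spaces, but it also decomposes accented letters into a base letter plus a combining mark, so names may need the same normalization (or NFC) applied on both sides before they are compared with names from other sources.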


We don't need to run these through all that often. My initial cut here was to refresh my index of scraped profiles for every person entity I've confirmed and created in the GeoKB instance. I've also gone back through to rework the last available cache I had for people who are no longer showing up with a profile page (there is inconsistency in how these cases are handled) to put those into the same structure for historical record. I'll work up a concept of operations for this in future that uses some kind of refresh cycle.
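As a starting point for that refresh cycle, one option is to re-scrape only the URLs whose most recent cache entry is older than some threshold. The sketch below is speculative: the `urls_due_for_refresh` helper, the 30-day default, and the row layout (dictionaries with `url` and ISO-format `timestamp` keys, as written by the scrape function) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def urls_due_for_refresh(cache_rows, max_age_days=30, now=None):
    """Return URLs whose latest scrape is older than max_age_days."""
    now = now or datetime.now()
    latest = {}
    for row in cache_rows:
        scraped_at = datetime.fromisoformat(row["timestamp"])
        # Track only the most recent scrape per URL.
        if row["url"] not in latest or scraped_at > latest[row["url"]]:
            latest[row["url"]] = scraped_at
    cutoff = now - timedelta(days=max_age_days)
    return sorted(url for url, ts in latest.items() if ts < cutoff)
```

The returned list could be passed to the same scraping step used for the full run, so the refresh and the initial load share one code path.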

In [None]:
scraped_profiles = Parallel(n_jobs=4)(delayed(staff_profile_scrape)(i) for i in tqdm(list(df_wd_people["url"].unique())))
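One caveat with the parallel run: `requests.get` raises an exception on connection failures, and an uncaught exception in any worker stops the whole batch. A possible guard, sketched below under the assumption that a failed scrape should still leave a row in the log, is a thin wrapper that catches the error and returns a record in the same shape. The `safe_scrape` name and the `error` key are made up for illustration.

```python
import json
from datetime import datetime


def safe_scrape(scrape_func, url):
    """Call scrape_func(url); on any exception, return an error record instead."""
    try:
        return scrape_func(url)
    except Exception as exc:
        timestamp = datetime.now().isoformat()
        return {
            "url": url,
            "timestamp": timestamp,
            "status_code": None,  # no HTTP response was received
            "scrape": json.dumps(
                {"url": url, "timestamp": timestamp, "error": repr(exc)}
            ),
        }


# Possible use with the existing call (not executed here):
# Parallel(n_jobs=4)(delayed(safe_scrape)(staff_profile_scrape, u) for u in urls)
```

Note that a `None` status code mixed with integers will make pandas store that column as floats with NaN, which may matter when the results are written to the cache table.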

This is the one component that requires code to be run on an internal system at the moment. The RDS instance I'm using here is on an account that is not yet set up for any kind of access from outside the USGS network/CHS environment.

In [4]:
isaid = geokb.pg_cnxn(
    db="isaid",
    db_user=os.environ["rds_username"],
    db_pass=os.environ["rds_password"],
    db_host=os.environ["rds_host"],
    db_port=os.environ["rds_port"]
)

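# Note: if_exists="replace" drops and recreates the table on every run;
# use "append" instead to keep earlier scrapes as a running log.
pd.DataFrame(scraped_profiles).to_sql(
    name="isaid_web_scrape_cache",
    con=isaid,
    index=False,
    if_exists="replace"
)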

526
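Once rows are in the cache, the two access patterns mentioned earlier (pulling the whole JSON document, or querying inside it with PostgreSQL casting) might look like the sketch below. It assumes the `scrape` column is stored as text, as described above; the SQL is shown as a string only and is not executed here, and the `unpack_scrapes` helper and sample row are made up for illustration.

```python
import json

# PostgreSQL: cast the text column to json and extract one key as text.
ORCID_FROM_CACHE_SQL = """
SELECT url, timestamp, (scrape::json ->> 'orcid') AS orcid
FROM isaid_web_scrape_cache
WHERE (scrape::json ->> 'orcid') IS NOT NULL
"""


def unpack_scrapes(cache_rows):
    """Parse the JSON string in each cache row back into a dictionary."""
    return [json.loads(row["scrape"]) for row in cache_rows]


# Made-up row in the same shape the scrape function returns.
sample_rows = [
    {
        "url": "https://www.usgs.gov/staff-profiles/example-person",
        "timestamp": "2024-01-01T00:00:00",
        "status_code": 200,
        "scrape": json.dumps({"name": "Example Person", "expertise_terms": ["hydrology"]}),
    }
]
profiles = unpack_scrapes(sample_rows)
```

The SQL route avoids pulling full `body_html` documents across the connection when only one field is needed, while the Python route is convenient when the whole profile is being processed anyway.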