I decided it would be useful to link the staff profile URLs to the other things we know about USGS staff and the locations of their data. Two places to record this linkage are the ScienceBase Directory, which has a URL property that has never been populated dynamically, and the Wikidata records for USGS people that I've been exploring for exactly this kind of linkage. The advantage of ScienceBase is that it is our own local resource under USGS control. Its disadvantages are that it is difficult to work with in terms of read/write stability and that it uses a custom, proprietary data model. Wikidata makes up for those disadvantages but is outside our direct control and does not yet include every person who has a profile page. I may end up going both routes, but I'm starting with ScienceBase as a reasonable first step.

Because some staff appear to have multiple profile pages, and the working assumption is that each person should have just one, I first need to select the best profile page to work with. This notebook handles that process and further exercises pylinkedcmd methods. After looking at some of the "duplicate" profile pages, it appears that users or the system have created multiple profiles with slight variations on the name, and one of these is usually not populated. It may not be completely accurate, but a reasonable approach at this point is to pick the page with the most content. Both pages are "valid" in that they appear in the staff listing and are publicly accessible online, so we can legitimately select either one as a starting point for later editing by the staff member or by some other maintenance process.

This notebook loads the previously scraped staff profile inventory records, splits off those that have multiple profile URLs for one email address, scrapes each of those URLs and records the length of its content, selects the profile with the most content, and then writes out a new set of staff profile information to use.
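
The selection step described above can be sketched on toy data, with no scraping involved. The identifiers, URLs, and lengths below are invented for illustration, and `select_longest` is a hypothetical helper rather than part of pylinkedcmd. Only the `identifier`, `profile_url`, and `profile_length` keys mirror what this notebook uses.

```python
# Toy stand-in for scraped results: one dict per (staff record, profile URL).
# All identifiers, URLs, and lengths here are made up for illustration.
profile_lengths = [
    {"identifier": "person-a", "profile_url": "https://example.gov/staff/a-1", "profile_length": 120},
    {"identifier": "person-a", "profile_url": "https://example.gov/staff/a-2", "profile_length": 4500},
    {"identifier": "person-b", "profile_url": "https://example.gov/staff/b-1", "profile_length": 900},
    {"identifier": "person-b", "profile_url": "https://example.gov/staff/b-2", "profile_length": 30},
]


def select_longest(records):
    """Return {identifier: profile_url}, keeping the longest page per identifier."""
    best = {}
    for rec in records:
        current = best.get(rec["identifier"])
        # Keep this record if it is the first or strictly longer than the current best.
        if current is None or rec["profile_length"] > current["profile_length"]:
            best[rec["identifier"]] = rec
    return {ident: rec["profile_url"] for ident, rec in best.items()}


selected = select_longest(profile_lengths)
print(selected)
```

Grouping by identifier before comparing lengths matters here: two different people can have pages of the same length, so the comparison has to stay within one person's set of URLs.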

In [1]:
import requests
import pickle
import pylinkedcmd
from joblib import Parallel, delayed
import tqdm
import validators

In [2]:
sb = pylinkedcmd.pylinkedcmd.Sciencebase()
usgs_web = pylinkedcmd.pylinkedcmd.UsgsWeb()

In [3]:
# Load the previously scraped staff inventory; the file is closed automatically
with open("usgs_staff.pkl", "rb") as infile:
    usgs_staff = pickle.load(infile)

In [4]:
usgs_staff_profiles = [
    i for i in usgs_staff 
    if len(i["profile"]) == 1
    and i["profile"][0] is not None
    and i["email"][0] is not None
]

usgs_staff_profiles_multi = [
    i for i in usgs_staff 
    if len(i["profile"]) > 1
]

In [5]:
multi_profiles = list()
for i in usgs_staff_profiles_multi:
    for p in [u for u in i["profile"] if u is not None]:
        multi_profiles.append((p, i["identifier"]))

In [6]:
profile_lengths = list()

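def accumulator(url_id):
    # url_id is a (profile_url, identifier) tuple. The length of the scraped
    # result's string form serves as a rough measure of how much content the page has.
    # Appending to the shared list from threads relies on list.append being atomic in CPython.
    profile_lengths.append({
        "identifier": url_id[1],
        "profile_url": url_id[0],
        "profile_length": len(str(usgs_web.scrape_profile(url_id[0])))
    })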

In [7]:
Parallel(n_jobs=20, prefer="threads")(
    delayed(accumulator)(i) for i in tqdm.tqdm(multi_profiles)
)

100%|██████████| 107/107 [00:09<00:00, 11.32it/s]


[None, None, None, ..., None]
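
The return values printed above are all `None` because `accumulator` only appends to a shared list and returns nothing. An alternative is to have each worker return its measurement and collect the returned values directly. The sketch below shows that pattern with the standard library's `ThreadPoolExecutor` rather than joblib. `fake_scrape` and `measure` are hypothetical names, with `fake_scrape` standing in for `usgs_web.scrape_profile` so the example runs without network access, and the URLs and identifiers are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_scrape(url):
    # Hypothetical stand-in for a real scraper: returns some "content" for a URL.
    return {"url": url, "body": "x" * len(url)}


def measure(url_id):
    # Return the measurement instead of appending to a shared list.
    url, identifier = url_id
    return {
        "identifier": identifier,
        "profile_url": url,
        "profile_length": len(str(fake_scrape(url))),
    }


# Made-up (profile_url, identifier) pairs in the same shape as multi_profiles.
multi_profiles = [
    ("https://example.gov/staff/a-1", "person-a"),
    ("https://example.gov/staff/a-2", "person-a"),
    ("https://example.gov/staff/b-1", "person-b"),
]

# executor.map preserves input order, so results line up with multi_profiles.
with ThreadPoolExecutor(max_workers=4) as executor:
    profile_lengths = list(executor.map(measure, multi_profiles))

print(len(profile_lengths))
```

Returning values keeps the worker free of side effects, which makes it easier to rerun the cell without accidentally appending duplicate entries to an existing list.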

In [8]:
selected_profiles = list()
for item in usgs_staff_profiles_multi:
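    # Only consider the scraped pages that belong to this staff record;
    # matching on length alone could pick up another person's URL
    candidates = [x for x in profile_lengths if x["identifier"] == item["identifier"]]
    # Keep the URL of the page with the most content
    best = max(candidates, key=lambda x: x["profile_length"])
    item["profile"] = [best["profile_url"]]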
    selected_profiles.append(item)

usgs_staff_profiles.extend(selected_profiles)

In [9]:
# Print the first failed validation result for each field; None means every value passed
print(next((e for e in [validators.email(i["email"][0]) for i in usgs_staff_profiles] if not e), None))
print(next((u for u in [validators.url(i["profile"][0]) for i in usgs_staff_profiles] if not u), None))

None
None
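
The check above prints a single value per field, which says whether something is wrong but not which records are affected. A minimal sketch of collecting every failing record follows. It assumes the same record shape used in this notebook (`email` and `profile` as single-item lists), uses invented sample records, and relies on deliberately simplified stand-in checks (`looks_like_email`, `looks_like_url`) rather than the `validators` package.

```python
from urllib.parse import urlparse


def looks_like_url(value):
    # Simplified stand-in check: needs an http(s) scheme and a host.
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def looks_like_email(value):
    # Simplified stand-in check: one "@", a non-empty local part, a dotted domain.
    if not isinstance(value, str) or value.count("@") != 1:
        return False
    local, domain = value.split("@")
    return bool(local) and "." in domain.strip(".")


# Made-up records in the same shape as usgs_staff_profiles.
records = [
    {"email": ["a@example.gov"], "profile": ["https://example.gov/staff/a"]},
    {"email": ["not-an-email"], "profile": ["https://example.gov/staff/b"]},
    {"email": ["c@example.gov"], "profile": [None]},
]

# Keep the whole record so the problem entries can be inspected or fixed.
bad_emails = [r for r in records if not looks_like_email(r["email"][0])]
bad_urls = [r for r in records if not looks_like_url(r["profile"][0])]

print(len(bad_emails), len(bad_urls))
```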


In [10]:
# Write the consolidated profile list; the file is closed automatically
with open("usgs_staff_profiles.pkl", "wb") as outfile:
    pickle.dump(usgs_staff_profiles, outfile)