This notebook works through preparing records from the USGS Publications Warehouse for inclusion in our graph, which assembles everything in USGS that we can gather in order to build logical catalogs of resources for various purposes. The Pubs Warehouse provides a number of unique, original records that we can't get elsewhere, along with additional properties that add value to existing records. To work effectively with the entire Pubs Warehouse recordset, we run a regular caching process that stores the original source content retrieved from a REST API. The cache lets us run aggregations and other queries that help in understanding and working with the catalog metadata.

There are over 160K records in the Pubs Warehouse. They are all interesting and useful in differing circumstances, but we don't necessarily need or want to pull every record into any given graph or index. More records could mean more noise in our queries and analyses, depending on what we're trying to accomplish. In the current case, I'm most interested in developing further intelligence on work that is ongoing now or works that were authored or edited by current or recent staff. In the following processing steps, I narrow down to pub records for staff who are already in our graph, meaning they were sourced from the master directory source we got from the ScienceBase Directory.

In [1]:
import isaid_helpers
import pandas as pd
import os
import pickle
import datetime
import click
from copy import copy
from pylinkedcmd import utilities
import validators
import re
import collections
import numpy as np
from nested_lookup import nested_lookup

The Pubs Warehouse cache is pretty large if we pull all records. Ultimately, we need to come up with more ways to take advantage of the cache and run our queries on the server where it lives instead of pulling everything into memory as I do here. In the meantime, the following code block either grabs the entire cache or loads a local stash file into memory.

In [2]:
%%time
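if click.confirm('Are you sure you want to run the process to get all Pubs Warehouse data from the cache?', default=False):
    pw_cache = isaid_helpers.cache_chs_cache("pw")
    # Stash the freshly retrieved records locally so later runs can skip the download
    with open(isaid_helpers.f_process_pw, "wb") as f:
        pickle.dump(pw_cache, f)
else:
    # Load the previously stashed records instead of pulling from the cache again
    with open(isaid_helpers.f_process_pw, "rb") as f:
        pw_cache = pickle.load(f)
    print("pw_cache loaded to memory from cache file")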

Are you sure you want to run the process to get all Pubs Warehouse data from the cache? [y/N]: 
pw_cache loaded to memory from cache file
CPU times: user 4.09 s, sys: 666 ms, total: 4.75 s
Wall time: 8.61 s
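
The fetch-or-load pattern above can be sketched on its own using only the standard library. This is a minimal illustration, not the notebook's helper code: `load_or_refresh`, `fake_fetch`, and the temporary stash path are hypothetical names, and `fake_fetch` stands in for whatever call actually retrieves the cache.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_refresh(cache_path, fetch_records, refresh=False):
    """Return records from a local pickle file, fetching and re-stashing when asked or when no file exists."""
    cache_path = Path(cache_path)
    if refresh or not cache_path.exists():
        records = fetch_records()
        # Write the fresh records so the next run can load them locally
        with cache_path.open("wb") as f:
            pickle.dump(records, f)
        return records
    with cache_path.open("rb") as f:
        return pickle.load(f)


fetch_calls = []


def fake_fetch():
    # Hypothetical stand-in for the remote cache retrieval
    fetch_calls.append(1)
    return [{"id": 1, "title": "Example publication"}]


with tempfile.TemporaryDirectory() as tmp:
    stash = Path(tmp) / "pw_cache.p"
    first = load_or_refresh(stash, fake_fetch)   # no stash yet, so this fetches and writes
    second = load_or_refresh(stash, fake_fetch)  # stash exists, so this reads the file
```

Opening the files in `with` blocks closes the handles even if pickling fails partway, which matters more as the stash file grows.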


The following function strips a Pubs Warehouse record down to its essential elements, renaming a few properties to the common names we use in our graph and setting up a couple of additional properties that are specific to this source.

In [3]:
def tabularize_pw_record(pw_record):
    pub = {
        "name": pw_record["title"],
        "source": "USGS Publications Warehouse",
        "year_published": pw_record["publicationYear"],
        "id_pw": pw_record["id"]
    }
    if "docAbstract" in pw_record and pw_record["docAbstract"] is not None and len(pw_record["docAbstract"]) > 0:
        pub["description"] = isaid_helpers.strip_tags(pw_record["docAbstract"])
        
    if "ipdsId" in pw_record and pw_record["ipdsId"] is not None and len(pw_record["ipdsId"]) > 0:
        pub["id_ipds"] = pw_record["ipdsId"]
    
    if "doi" in pw_record and pw_record["doi"] is not None and len(pw_record["doi"]) > 0:
        pub["doi"] = pw_record["doi"].strip()
        
    useful_link = None
    if "links" in pw_record and isinstance(pw_record["links"], list):
        useful_link = next((l["url"] for l in pw_record["links"] if "type" in l and l["type"]["text"] == "Index Page"), None)
        
    if useful_link is None and "doi" in pub:
        useful_link = f"https://doi.org/{pub['doi']}"
        
    if useful_link is None:
        useful_link = f"https://pubs.er.usgs.gov/publication/{pw_record['indexId']}"
        
    pub["url"] = useful_link

    if "contributors" in pw_record:
        if "authors" in pw_record["contributors"]:
            pub["author_emails"] = ",".join([i for i in nested_lookup("email", pw_record["contributors"]["authors"]) if validators.email(i)])
            pub["author_orcids"] = ",".join([i.split("/")[-1] for i in nested_lookup("orcid", pw_record["contributors"]["authors"]) if validators.url(i)])
    
        if "editors" in pw_record["contributors"]:
            pub["editor_emails"] = ",".join([i for i in nested_lookup("email", pw_record["contributors"]["editors"]) if validators.email(i)])
            pub["editor_orcids"] = ",".join([i.split("/")[-1] for i in nested_lookup("orcid", pw_record["contributors"]["editors"]) if validators.url(i)])

    return pub
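
The URL selection in the function above follows a three-step fallback: prefer a link typed "Index Page", then a DOI resolver URL, then a Pubs Warehouse URL built from the record's `indexId`. The sketch below isolates that logic so it can be tried on its own. The function name `pick_url` and the sample records are hypothetical, and the sketch is slightly more defensive than the original (it tolerates a link with a missing or null `type`).

```python
def pick_url(record):
    """Choose the most useful landing URL for a publication record."""
    links = record.get("links")
    if isinstance(links, list):
        for link in links:
            # Tolerate links where "type" is missing or None
            link_type = link.get("type") or {}
            if link_type.get("text") == "Index Page" and "url" in link:
                return link["url"]

    doi = (record.get("doi") or "").strip()
    if doi:
        return f"https://doi.org/{doi}"

    # Last resort: the Pubs Warehouse page for the record
    return f"https://pubs.er.usgs.gov/publication/{record['indexId']}"


# Hypothetical records, made up for illustration only
with_index_page = {
    "indexId": "example1",
    "doi": "10.0000/example",
    "links": [
        {"type": {"text": "Thumbnail"}, "url": "https://example.org/thumb.png"},
        {"type": {"text": "Index Page"}, "url": "https://example.org/index"},
    ],
}
doi_only = {"indexId": "example2", "doi": " 10.0000/example "}
neither = {"indexId": "example3", "links": None}

urls = [pick_url(r) for r in (with_index_page, doi_only, neither)]
```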

At this point, we only want to pull in publications that can be linked to people in our graph whom we've brought in from a master source. In a given workflow, it would be a good idea to update this master information in the graph before querying it for people. Here, we get all person records that have an email or ORCID identifier.

In [4]:
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    results = session.run("""
    MATCH (p:Person)
    WHERE NOT p.email IS NULL
    OR NOT p.orcid IS NULL
    RETURN p.email, p.orcid, p.name
    """)
    linkable_persons = results.data()

emails_in_graph = [i["p.email"] for i in linkable_persons if i["p.email"] is not None]
orcids_in_graph = [i["p.orcid"] for i in linkable_persons if i["p.orcid"] is not None]

print(emails_in_graph[:5])
print(orcids_in_graph[:5])

['nreinke@contractor.usgs.gov', 'krschulz@usgs.gov', 'alchildress@contractor.usgs.gov', 'mkkelley@usgs.gov', 'astanley@usgs.gov']
['0000-0002-5275-3077', '0000-0003-2409-5211', '0000-0003-1155-2815', '0000-0003-4147-7254', '0000-0001-5756-0373']


Now that we have the emails and ORCIDs from our graph and the full Pubs Warehouse cache in memory, we can figure out which pubs to pull into a subset for sending to our graph. We loop through the cached pub records, check each record for emails and ORCIDs using an elegant little package (nested_lookup), and process any record where at least one of those identifiers is in our graph. The tabularize_pw_record function simplifies and flattens each pub into a basic record. We then dump this table to a CSV file for processing into our graph.

In [5]:
%%time
graphable_pw = list()

for pub in pw_cache:
    emails_in_pub = [i for i in nested_lookup("email", pub) if validators.email(i)]
    orcids_in_pub = [i.split("/")[-1] for i in nested_lookup("orcid", pub) if validators.url(i)]
    if any(e in emails_in_graph for e in emails_in_pub) or any(o in orcids_in_graph for o in orcids_in_pub):
        graphable_pw.append(tabularize_pw_record(pub))

pd.DataFrame(graphable_pw).to_csv(isaid_helpers.f_graphable_pw, index=False)
print(len(graphable_pw))

24304
CPU times: user 14.7 s, sys: 159 ms, total: 14.8 s
Wall time: 15.4 s
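
As a possible refinement, the membership checks in the loop above could use sets rather than lists, so each lookup does not scan the full list of identifiers. The sketch below is one way to do that with only the standard library. It rests on several assumptions: `find_values` is a simplified stand-in for `nested_lookup`, not that package's implementation; `is_linkable` and the sample data are hypothetical; and the email and URL validation done with `validators` in the original is skipped here.

```python
def find_values(key, node):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(key, v)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(key, item)


def is_linkable(pub, known_emails, known_orcids):
    """True when the record carries at least one email or ORCID we already know."""
    emails = {v for v in find_values("email", pub) if isinstance(v, str)}
    # Keep only the trailing identifier of an ORCID URL
    orcids = {
        v.rstrip("/").split("/")[-1]
        for v in find_values("orcid", pub)
        if isinstance(v, str)
    }
    return not emails.isdisjoint(known_emails) or not orcids.isdisjoint(known_orcids)


# Hypothetical identifiers and records, made up for illustration only
known_emails = {"person.a@example.gov"}
known_orcids = {"0000-0000-0000-0001"}

pubs = [
    {"id": 1, "contributors": {"authors": [{"email": "person.a@example.gov"}]}},
    {"id": 2, "contributors": {"editors": [{"orcid": "https://orcid.org/0000-0000-0000-0001"}]}},
    {"id": 3, "contributors": {"authors": [{"email": "outside@example.org", "orcid": None}]}},
]

linkable_ids = [p["id"] for p in pubs if is_linkable(p, known_emails, known_orcids)]
```

In the notebook, the equivalent change would be to build `set(emails_in_graph)` and `set(orcids_in_graph)` once before the loop. Whether that measurably shortens the run on the full cache has not been tested here.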
