The USGS Publications Warehouse is a metadata source for all USGS Series Reports and for journal articles authored and edited by USGS staff. It supplements the CreativeWork items already in our graph, which came from DOIs we encountered elsewhere and pulled into that cache. For records that match an existing item, we can pick up abstracts and email identifiers for authors and editors, both of which are unique to this internal catalog source. We also pick up an important internal identifier, the IPDS number, which comes from the review and publishing workflow and is referenced in project records that we are pulling in from another internal USGS source. This will help us link projects to publication products.
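
As a rough illustration of how the IPDS number could tie projects to publications, here is a minimal sketch in plain Python. It is not code from this project: the project record layout (`project_id`, `ipds_numbers`) and all sample values are hypothetical, and only the `id_pw` and `id_ipds` field names come from the Pubs Warehouse data.

```python
from collections import defaultdict

# Hypothetical publication records; only the field names are taken from the data.
publications = [
    {"id_pw": "70000001", "id_ipds": "IP-000001"},
    {"id_pw": "70000002", "id_ipds": "IP-000002"},
    {"id_pw": "70000003", "id_ipds": None},  # no IPDS number recorded
]

# Hypothetical project records that reference IPDS numbers.
projects = [
    {"project_id": "P1", "ipds_numbers": ["IP-000001", "IP-000002"]},
    {"project_id": "P2", "ipds_numbers": ["IP-999999"]},  # no matching publication
]

# Index publications by IPDS number, skipping records without one.
pubs_by_ipds = {p["id_ipds"]: p["id_pw"] for p in publications if p["id_ipds"]}

# Collect the publication ids that each project can be linked to.
links = defaultdict(list)
for project in projects:
    for ipds in project["ipds_numbers"]:
        if ipds in pubs_by_ipds:
            links[project["project_id"]].append(pubs_by_ipds[ipds])

project_links = dict(links)
```

In the graph itself the same join would presumably be expressed as a match on the `id_ipds` property of CreativeWork nodes.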

For our USGS graph, I chose in the Pubs Warehouse data-building code to keep only those records that we can link, via author or editor email or ORCID identifiers, to people already in our graph from the master source in the ScienceBase Directory. This limits the set to roughly 20,000 publications, but it gives us a reasonable body of information for the kinds of queries we are exploring as we examine current capacity for various kinds of research.
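
The filter described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual build code: the helper names `split_ids` and `is_linkable` and the sample identifiers are hypothetical, and only the column names (`author_emails`, `author_orcids`, `editor_emails`, `editor_orcids`) come from the graphable file.

```python
def split_ids(value):
    """Split a comma-delimited identifier string into a set, ignoring blanks."""
    if not value:
        return set()
    return {item.strip() for item in value.split(",") if item.strip()}


def is_linkable(record, known_emails, known_orcids):
    """Return True if any author or editor identifier matches a known person."""
    emails = split_ids(record.get("author_emails")) | split_ids(record.get("editor_emails"))
    orcids = split_ids(record.get("author_orcids")) | split_ids(record.get("editor_orcids"))
    return bool(emails & known_emails or orcids & known_orcids)


# Hypothetical stand-ins for identifiers of people already in the graph.
known_emails = {"a_person@example.gov"}
known_orcids = {"0000-0000-0000-0001"}

# Hypothetical publication records.
records = [
    {"id_pw": "1", "author_emails": "a_person@example.gov,other@example.gov"},
    {"id_pw": "2", "author_orcids": "0000-0000-0000-0001"},
    {"id_pw": "3", "author_emails": "stranger@example.org"},
]

# Keep only the records that can be tied to at least one known person.
graphable = [r for r in records if is_linkable(r, known_emails, known_orcids)]
graphable_ids = [r["id_pw"] for r in graphable]
```

Record 3 is dropped because none of its identifiers belongs to a known person, which is the trade-off that shrinks the full catalog to the smaller graphable set.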

Still to do: The Cost Center information in Pubs Warehouse records deserves further investigation, though we may get at the relationship between USGS organizational units and publications by another route as we start to bring in project information. We also get an indirect relationship from organizations to publications via author affiliations, but affiliations change over time, so the affiliation on record may not match the one an author held when a work was published.

In [1]:
import isaid_helpers
import pandas as pd

In [3]:
pd.read_csv(isaid_helpers.f_graphable_pw).head()

Unnamed: 0,name,source,year_published,id_pw,description,id_ipds,doi,url,author_emails,author_orcids,editor_emails,editor_orcids
0,Comment on 'Kidron (2018): Biocrust research: ...,USGS Publications Warehouse,2020,70211636,Kidron (2018) uses a straw man argument in an ...,IP-118462,10.1002/eco.2215,https://doi.org/10.1002/eco.2215,jayne_belnap@usgs.gov,"0000-0002-1018-2376,0000-0002-5934-3214,0000-0...",,
1,Economic valuation of health benefits from usi...,USGS Publications Warehouse,2020,70209418,Background: Radon exposure is the second leadi...,IP-110968,10.1186/s12940-020-00589-8,https://doi.org/10.1186/s12940-020-00589-8,"cshapiro@usgs.gov,epindilli@usgs.gov","0000-0003-3579-8377,0000-0002-1598-6808,0000-0...",,
2,Mineral commodity summaries 2017,USGS Publications Warehouse,2017,70180197,This report is the earliest Government publica...,,10.3133/70180197,https://minerals.usgs.gov/minerals/pubs/mcs/,jober@usgs.gov,0000-0003-1608-5611,,
3,Using structure from motion photogrammetry to ...,USGS Publications Warehouse,2016,70179121,"Structure from Motion (SfM), a photogrammetric...",IP-079064,,https://pubs.er.usgs.gov/publication/70179121,"epeitzsch@usgs.gov,dan_fagre@usgs.gov","0000-0001-7624-0455,0000-0001-8552-9461",,
4,Effects of climate change on tidal marshes alo...,USGS Publications Warehouse,2016,70175154,Public SummaryThe coastal region of California...,IP-075871,10.3133/ofr20161125,https://doi.org/10.3133/ofr20161125,"kthorne@usgs.gov,kbuffington@usgs.gov,glenn_gu...","0000-0002-1381-0657,0000-0001-9741-1241,0000-0...",,
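
Some records have no DOI (row 3 above), which is why the load that follows runs in two passes: one that merges on `doi` and one that merges on `name`. A quick way to see how records split between the two passes is sketched here, using a tiny inline CSV with made-up values rather than the real file; only the column names are taken from the data.

```python
import csv
import io

# Made-up sample with the same column names as the graphable file.
sample_csv = """name,id_pw,doi
Report A,1,10.0000/example-a
Report B,2,
Report C,3,10.0000/example-c
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# An empty DOI field reads as an empty string, so a truthiness check separates the passes.
with_doi = [r for r in rows if r["doi"]]
without_doi = [r for r in rows if not r["doi"]]

counts = {"with_doi": len(with_doi), "without_doi": len(without_doi)}
```

Against the real file, the pandas equivalent would be `df["doi"].isna().sum()` for the records without a DOI.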


In [5]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    # First pass: records that have a DOI, merged onto CreativeWork nodes by DOI.
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row WHERE row.doi IS NOT NULL
            MERGE (w:CreativeWork {doi: row.doi})
            ON MATCH
                SET w.description = row.description,
                w.id_pw = row.id_pw,
                w.id_ipds = row.id_ipds
            ON CREATE
                SET w.name = row.name,
                w.description = row.description,
                w.year_published = row.year_published,
                w.url = row.url,
                w.id_pw = row.id_pw,
                w.id_ipds = row.id_ipds

            // Each linking step uses OPTIONAL MATCH plus collect() so that a row
            // with a missing field or no matching person still reaches later steps.
            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.email IN split(row.author_emails, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:AUTHOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)

            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.orcid IN split(row.author_orcids, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:AUTHOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)

            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.email IN split(row.editor_emails, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:EDITOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)

            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.orcid IN split(row.editor_orcids, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:EDITOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_pw
    })

CPU times: user 7.7 ms, sys: 5.15 ms, total: 12.8 ms
Wall time: 6min 24s


In [6]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    # Second pass: records without a DOI, merged onto CreativeWork nodes by name.
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row WHERE row.doi IS NULL
            MERGE (w:CreativeWork {name: row.name})
            ON MATCH
                SET w.description = row.description,
                w.id_pw = row.id_pw,
                w.id_ipds = row.id_ipds
            ON CREATE
                SET w.description = row.description,
                w.year_published = row.year_published,
                w.url = row.url,
                w.id_pw = row.id_pw,
                w.id_ipds = row.id_ipds

            // Each linking step uses OPTIONAL MATCH plus collect() so that a row
            // with a missing field or no matching person still reaches later steps.
            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.email IN split(row.author_emails, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:AUTHOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)

            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.orcid IN split(row.author_orcids, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:AUTHOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)

            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.email IN split(row.editor_emails, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:EDITOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)

            WITH w, row
                OPTIONAL MATCH (p:Person) WHERE p.orcid IN split(row.editor_orcids, ',')
            WITH w, row, collect(p) AS people
                FOREACH (person IN people |
                    MERGE (person)-[a:EDITOR_OF]->(w)
                        SET a.date_qualifier = row.year_published,
                        a.reference = row.url)
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_pw
    })

CPU times: user 5.45 ms, sys: 3.17 ms, total: 8.62 ms
Wall time: 4min 22s
