One general use case for our "USGS graph" is pulling together all of the data and information assets associated with a given topical area. With the graph, we have essentially stitched together the functionally disconnected catalogs and inventories we built up separately: our people and their skills, publications and datasets and the subject matters they address, and all the other pockets of information that describe and contain our work. Building the graph exercises all of that: evaluating and deciding on the relationships between things, harmonizing different content models, and putting everything together so that it can be queried as a whole.

One specific exercise of this concept is building toward a catalog of all USGS climate science capabilities. Taken broadly, this could be just about everything we do across earth system science, so we need to narrow in on the assets most closely connected to the topic area. This notebook explores the start of that process with a full text query across the nodes likely to contain relevant content. From the results, we can pull together a sub-graph based on the connections developed here.

In [1]:
import isaid_helpers
import pandas as pd

Figuring out what to query on is one part of the problem in any approach like this. From exercising things like our metadata keywords, we know that we don't yet have strict enough alignment with controlled vocabularies or reference sources to exploit them directly. We've teased some of this out, deciding what to include and how to include it in our graph, in the DefinedSubjectMatter and UndefinedSubjectMatter nodes.

For climate science and this exercise, I've pulled in a glossary of terms from EPA that seems like a reasonable starting point.

In [2]:
epa_climate_change_glossary = pd.read_excel("https://sor.epa.gov/sor_internet/registry/termreg/searchandretrieve/glossariesandkeywordlists/search.do;jsessionid=87TsZz9rOm0k-CQjrtbfA43w1rPWKUuBf9N25Wg0XwzV9z4-yqOh!1399359231?details=&d-1342820-e=11&6578706f7274=1&glossaryName=Glossary+Climate+Change+Terms&includeInReport=")
epa_climate_change_glossary.head()

,Term,Definitions,Acronym
0,100-Year Flood Levels,Severe flood levels with a one-in-100 likeliho...,
1,Abrupt Climate Change,"Sudden (on the order of decades), large change...",
2,Adaptation,Adjustment or preparation of natural or human ...,
3,Adaptive Capacity,The ability of a system to adjust to climate c...,
4,Aerosols,Small particles or liquid droplets in the atmo...,


To support this type of entry point into the graph, we set up a full text index on names and descriptions across the nodes likely to contain useful content. We can run our search against it by sending all of the terms from the EPA glossary joined with OR conjunctions.
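
For context, an index like the "allTexts" one queried in this notebook could be defined roughly as follows. This is a minimal sketch, not the definition actually used for this graph: the helper name `fulltext_index_statement` is hypothetical, the label list is a subset of the node types that show up in the results, the `description` property name is an assumption, and the `CREATE FULLTEXT INDEX` syntax assumes Neo4j 4.3 or later.

```python
def fulltext_index_statement(index_name, labels, properties):
    """Build a Cypher statement that creates a node full-text index."""
    # Labels are combined with "|" so one index spans several node types
    label_pattern = "|".join(labels)
    property_list = ", ".join(f"n.{prop}" for prop in properties)
    return (
        f"CREATE FULLTEXT INDEX {index_name} IF NOT EXISTS "
        f"FOR (n:{label_pattern}) ON EACH [{property_list}]"
    )


# Assumed labels and properties, for illustration only
statement = fulltext_index_statement(
    "allTexts",
    ["CreativeWork", "Dataset", "Project", "Person"],
    ["name", "description"],
)
print(statement)
```

The resulting statement would be run once against the database, for example with `session.run(statement)`, before any full text queries are issued.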

In [9]:
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    # Query the "allTexts" full text index with every glossary term joined by OR
    results = session.run("""
    CALL db.index.fulltext.queryNodes("allTexts", "%(search_list)s") YIELD node, score
    RETURN node.name AS name, labels(node)[0] AS item_type, node.url AS url, node.email AS email, node.orcid AS orcid, node.doi AS doi, node.id_basis_project AS basis_project_number, node.source AS source, score
    """ % {
        "search_list": "'"+"'""' OR '".join(list(epa_climate_change_glossary.Term))+"'"
    })
    # Convert the result records to a list of dictionaries
    query_results = results.data()
len(query_results)

57366
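
The query cell above interpolates the glossary terms straight into the Cypher text and wraps each term in single quotes. As far as I understand the Lucene query syntax, double quotes (not single quotes) delimit phrases, so that search most likely matches on the individual words of each term. A possible alternative, shown here only as a sketch, is to quote each term as a phrase and hand the whole string to the query as a parameter. The helper names `lucene_phrase` and `build_or_query` are hypothetical, and a phrase search would be expected to return a different, probably smaller, result set than the one reported here.

```python
def lucene_phrase(term):
    """Wrap a term in double quotes, escaping backslashes and embedded quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_or_query(terms):
    """Join non-empty string terms into a single OR query of quoted phrases."""
    # Skip missing values (e.g. NaN from pandas) and blank strings
    cleaned = [t.strip() for t in terms if isinstance(t, str) and t.strip()]
    return " OR ".join(lucene_phrase(t) for t in cleaned)


# A few terms from the glossary preview, used as example input
example_terms = ["Abrupt Climate Change", "Adaptation", "Adaptive Capacity"]
search_string = build_or_query(example_terms)
print(search_string)
```

With the Neo4j Python driver, the string could then be supplied as a query parameter, for example `session.run('CALL db.index.fulltext.queryNodes("allTexts", $search) YIELD node, score RETURN node.name AS name, score', search=search_string)`, which avoids building the Cypher text with string formatting.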

We get a result set that seems logical given what we know is in the graph at the moment. It represents about a third of our total nodes to date, retrieved simply by searching on one take on the key concepts associated with climate change science.

In [10]:
df_query_results = pd.DataFrame(query_results)
display(df_query_results[['name','item_type']].groupby(['item_type']).agg(['count']))
display(df_query_results[['name','source']].groupby(['source']).agg(['count']))

item_type,count
CreativeWork,25317
Dataset,16395
DefinedSubjectMatter,393
Event,25
Expertise,420
JobTitle,53
Organization,310
Person,956
Project,13328
UndefinedSubjectMatter,169


source,count
BASIS+ Project via SIPP Services,2077
BASIS+ Subtask via SIPP Services,2067
BASIS+ Task via SIPP Services,9235
DOI Record,155
ORCID,1479
SIPP Centers,86
USGS Profile Pages,543
USGS Publications Warehouse,18184
https://www.usgs.gov/science/mission-areas,2
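
The counts by source add up to 33,828, while the counts by item type add up to the full 57,366 rows. A likely explanation is that `groupby` leaves out rows whose grouping key is missing, so results with no `source` value do not appear in the second table. The sketch below uses a small made-up DataFrame (not the actual query results) to show how `value_counts(dropna=False)` keeps those rows visible.

```python
import pandas as pd

# Made-up rows standing in for df_query_results
sample = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "item_type": ["Project", "Dataset", "Dataset", "Person"],
    "source": ["ORCID", None, None, "ORCID"],
})

# Default groupby silently drops rows where source is missing
default_counts = sample.groupby("source")["name"].count()

# value_counts with dropna=False reports the missing group as well
all_counts = sample["source"].value_counts(dropna=False)

print(default_counts)
print(all_counts)
```

Applied to the real results, the same call on `df_query_results["source"]` would show how many matched nodes carry no source attribution.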


Some of these items are direct assets for our virtual catalog, while others are starting points for networking into the catalog via their relationships. For example, we only retrieved the people directly described with these terms, and we would want to follow the graph to get everyone who has contributed.

In [11]:
df_query_results

,name,item_type,url,email,orcid,doi,basis_project_number,source,score
0,Ozone effects on plant physiology,Project,,,,,ZD00BNI,BASIS+ Task via SIPP Services,45.374844
1,Abrupt Climate Change in the Eastern US,Project,,,,,LP00BMY,BASIS+ Task via SIPP Services,35.243271
2,Preparing for climate change: The potential co...,CreativeWork,http://www.climateimpacts.org/us-climate-asses...,,,,,USGS Publications Warehouse,35.020210
3,Calculation of solar radiation in mountainous ...,CreativeWork,https://doi.org/10.1016/0168-1923(87)90061-X,,,10.1016/0168-1923(87)90061-X,,USGS Publications Warehouse,33.812096
4,"NRPP: Exposure, Sensitivity, and Adaptive Capa...",Project,,,,,SB00C2C,BASIS+ Task via SIPP Services,33.351692
...,...,...,...,...,...,...,...,...,...
57361,Geology and ore deposits of the Whitepine area...,CreativeWork,https://doi.org/10.3133/ofr56103,,,10.3133/ofr56103,,USGS Publications Warehouse,0.166519
57362,Detrital zircon geochronology of some neoprote...,CreativeWork,https://doi.org/10.1130/2007.2431(07),,,10.1130/2007.2431(07),,USGS Publications Warehouse,0.155399
57363,NWHC-RT PCR for detecting G. destructants,Project,,,,,NC009U9,BASIS+ Task via SIPP Services,0.150113
57364,Global Iron,Project,,,,,ZS00DVM,BASIS+ Task via SIPP Services,0.128098
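
The idea of following the graph from directly matched assets to everyone who contributed to them can be illustrated with a toy example. This is only a sketch in plain Python: the people, assets, and edges are invented placeholders, and the function name `contributors_to` is hypothetical. It does not reflect the actual relationship types in the graph, which are not shown in this notebook.

```python
from collections import defaultdict

# Invented (person, asset) contribution edges, for illustration only
contributions = [
    ("Person A", "Asset 1"),
    ("Person B", "Asset 1"),
    ("Person B", "Asset 2"),
    ("Person C", "Asset 3"),
]


def contributors_to(assets, edges):
    """Return the set of people linked to any of the given assets."""
    people_by_asset = defaultdict(set)
    for person, asset in edges:
        people_by_asset[asset].add(person)
    found = set()
    for asset in set(assets):
        # Assets with no recorded contributors simply add nothing
        found |= people_by_asset.get(asset, set())
    return found


# Suppose only "Asset 1" was matched directly by the full text search
matched_assets = ["Asset 1"]
print(sorted(contributors_to(matched_assets, contributions)))
```

In the graph database itself, the equivalent step would be a Cypher pattern match from the matched nodes out to their connected Person nodes.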


In [12]:
df_query_results.to_csv("data/usgs_climate_change_science_catalog.csv", index=False)