One of the general use cases for our "USGS graph" is pulling together all of the data and information assets associated with a given topical area. With the graph, we've essentially stitched together the functionally disconnected catalogs and inventories we built up separately: our people and their skills, our publications and datasets and the subject matter they address, and all the other pockets of information that describe and contain our work. Building the graph exercises all of that: evaluating and deciding on the relationships between things, harmonizing different content models, and putting everything together so that it can be queried as a whole.

One specific exercise of this concept is building toward a catalog of all USGS climate science capabilities. Taken broadly, this could be just about everything we do across earth system science, so we need to narrow in on the assets most closely connected to the topic area. This notebook explores the start of that process via a full text query across the nodes likely to contain relevant content. From those results, we can pull together a sub-graph based on the connections developed here.

In [1]:
import isaid_helpers
import pandas as pd

Figuring out what to query on is one part of the problem in any approach like this. From exercising things like our metadata keywords, we know that we don't yet have very strict alignment with controlled vocabularies or reference sources that we could exploit. We've teased some of this out, making decisions about what to include in our graph and how to include it, in the area of DefinedSubjectMatter and UndefinedSubjectMatter nodes.

For climate science and this exercise, I've pulled in a glossary of terms from EPA that seems like a reasonable starting point.

In [2]:
epa_climate_change_glossary = pd.read_excel("https://sor.epa.gov/sor_internet/registry/termreg/searchandretrieve/glossariesandkeywordlists/search.do;jsessionid=87TsZz9rOm0k-CQjrtbfA43w1rPWKUuBf9N25Wg0XwzV9z4-yqOh!1399359231?details=&d-1342820-e=11&6578706f7274=1&glossaryName=Glossary+Climate+Change+Terms&includeInReport=")
epa_climate_change_glossary.head()

Unnamed: 0,Term,Definitions,Acronym
0,100-Year Flood Levels,Severe flood levels with a one-in-100 likeliho...,
1,Abrupt Climate Change,"Sudden (on the order of decades), large change...",
2,Adaptation,Adjustment or preparation of natural or human ...,
3,Adaptive Capacity,The ability of a system to adjust to climate c...,
4,Aerosols,Small particles or liquid droplets in the atmo...,


To support this type of vector into the graph, we set up a full text index on names and descriptions across all of the nodes likely to contain reasonably useful content. We can run our search against this index by sending all of the terms from the EPA glossary joined with OR conjunctions.

In [3]:
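with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    # Join every glossary term into a single search string with OR conjunctions.
    search_list = "'"+"'""' OR '".join(list(epa_climate_change_glossary.Term))+"'"
    # Pass the search string as a query parameter instead of formatting it into
    # the Cypher text, so quotes or backslashes in a term cannot break the query.
    results = session.run("""
    CALL db.index.fulltext.queryNodes("namesAndDescription", $search_list) YIELD node, score
    RETURN node.name AS name, labels(node)[0] AS node_type, node.url AS url, score
    """, search_list=search_list)
    # Materialize the result as a list of dicts, one per matching node.
    query_results = results.data()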
len(query_results)

27010
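
The search string above joins the raw glossary terms with OR. As a possible refinement, and only a sketch, each term could be escaped and wrapped in double quotes so that Lucene treats a multi-word term as a phrase. The helper names `escape_lucene` and `build_or_query` are hypothetical and not part of `isaid_helpers`, and switching to phrase matching would change the result counts reported in this notebook.

```python
import re

# Characters and operators with special meaning in Lucene's classic query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def escape_lucene(term):
    """Backslash-escape Lucene query syntax characters in a single term."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)

def build_or_query(terms):
    """Quote each non-empty string term as a phrase and join the phrases with OR."""
    # Skip missing values (e.g., NaN from pandas) and blank strings.
    cleaned = [t.strip() for t in terms if isinstance(t, str) and t.strip()]
    return " OR ".join('"%s"' % escape_lucene(t) for t in cleaned)

# Example with a few terms taken from the glossary preview.
print(build_or_query(["Abrupt Climate Change", "Adaptation", "Aerosols"]))
```

The resulting string could be passed to the full text query as a parameter in place of the joined term list.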

We do get a reasonable result set that seems logical for what we know we have in the graph at the moment. It represents about a third of our total nodes to date, retrieved simply by looking for one take on the key concepts associated with climate change science.

In [4]:
df_query_results = pd.DataFrame(query_results)
df_query_results[['name','node_type']].groupby(['node_type']).agg(['count'])

node_type,name (count)
CreativeWork,8498
Dataset,16203
DefinedSubjectMatter,378
Event,23
Expertise,420
JobTitle,53
Journal,95
Organization,309
Person,956
UndefinedSubjectMatter,75


Some of these items are direct assets we would want in our virtual catalog, while others are starting points for networking into the catalog via their relationships (e.g., we only got the people directly described in some way with these terms, and we'd want to follow the graph to get everyone who has contributed).
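
One way to read that split is to partition the raw result records by node label. This is a minimal sketch: treating `CreativeWork` and `Dataset` as the direct assets is my assumption here, the function name `partition_results` is hypothetical, and the sample records are made up for illustration rather than taken from the graph.

```python
# Assumption: these labels are treated as direct catalog assets; adjust as needed.
ASSET_LABELS = {"CreativeWork", "Dataset"}

def partition_results(records, asset_labels=ASSET_LABELS):
    """Split result records into direct assets and starting points for traversal."""
    assets, starting_points = [], []
    for record in records:
        if record.get("node_type") in asset_labels:
            assets.append(record)
        else:
            starting_points.append(record)
    return assets, starting_points

# Made-up records shaped like the rows returned by the full text query.
sample = [
    {"name": "example dataset", "node_type": "Dataset", "score": 2.5},
    {"name": "example person", "node_type": "Person", "score": 1.0},
    {"name": "example subject", "node_type": "DefinedSubjectMatter", "score": 3.0},
]
assets, starting_points = partition_results(sample)
print(len(assets), len(starting_points))
```

With the real results, `partition_results(query_results)` would give the list of people, subject matter, and other nodes from which to follow relationships further into the graph.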

In [5]:
df_query_results

Unnamed: 0,name,node_type,url,score
0,greenhouse gases,DefinedSubjectMatter,https://apps.usgs.gov/thesaurus/term-simple.ph...,27.417095
1,Solar radiation for National Hydrography Datas...,Dataset,"['https://doi.org/10.5066/P9OD7FAL', 'doi:10.5...",26.986904
2,Solar radiation for National Hydrography Datas...,Dataset,"['https://doi.org/10.5066/P9OD7FAL', 'doi:10.5...",26.622208
3,The Lethality of Hot Water and Ozone on Aquati...,Dataset,https://doi.org/10.5066/F7WW7GM5,26.277941
4,"The Lethality of Hot Water, Chlorine, and Ozon...",Dataset,https://doi.org/10.5066/F7WW7GM5,24.097460
...,...,...,...,...
27005,Stream habitat characteristics and relative ab...,Dataset,https://doi.org/10.5066/P9D7RV03,0.222652
27006,Three Streamflow Measurements from the Mississ...,Dataset,https://doi.org/10.5066/P9S7DJCX,0.212214
27007,"Debris-flow monitoring data, Chalk Cliffs, Col...",Dataset,https://doi.org/10.5066/P9MUWDFN,0.212214
27008,Geodatabase of the available top and bottom su...,Dataset,http://water.usgs.gov/lookup/getspatial?ds40ms...,0.186049
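
The scores above span a wide range, so one rough way to start narrowing in on the most closely connected nodes is to keep only the records at or above a score cutoff. This is a sketch under stated assumptions: `filter_by_score` is a hypothetical helper, the sample records are made up, and the cutoff value is a placeholder rather than a threshold derived from this analysis.

```python
def filter_by_score(records, min_score):
    """Return the records scoring at least min_score, highest score first."""
    kept = [r for r in records if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)

# Made-up records shaped like the rows returned by the full text query.
sample_scores = [
    {"name": "low relevance", "node_type": "Dataset", "score": 0.2},
    {"name": "high relevance", "node_type": "DefinedSubjectMatter", "score": 27.4},
    {"name": "medium relevance", "node_type": "Dataset", "score": 5.0},
]

# Placeholder cutoff for illustration only.
top = filter_by_score(sample_scores, min_score=1.0)
print([r["name"] for r in top])
```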
