# Datasets

In this stepwise look at a graphing process, we've already encountered some datasets but classed them initially as the more generalized CreativeWork nodes because we hadn't yet followed up to determine more details. In the next code blocks, we pull in the entire USGS Science Data Catalog (SDC), the closest thing to a comprehensive open data inventory for the USGS. This is the most massive process we've worked through yet in our effort to graph as much of the USGS as we can reasonably connect to, largely because of the huge collection of unique keywords represented in this set of fairly well documented datasets. The SDC indexing process (from source metadata documents into an Elasticsearch index) has done a reasonable job of splitting keywords into place names, defined/referenced terms from a number of thesaurus sources, and undefined terms. The place names and referenced terms will all need further validation to confirm that they actually exist in some viable source, but they provide a good start toward better-curated collections.

The API source for datasets provides a graph-optimized derivative of the SDC index. It makes a few choices about which concepts should be graphed as entities and relationships, and it separates nodes that carry a resolvable identifier our system already knows about from those that have none. For the moment, we are only going to process contacts with resolvable identifiers. For person contacts, we also only build relationships if the person already exists in our graph from some other source; the SDC alone doesn't give us enough information to decide whether we want entities in our graph that we can't resolve through one of the routes we are already processing. We're ignoring unidentified authors at this point, as they will need work to disambiguate and resolve before we're willing to add them to our graph.
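
The filtering rule described above can be sketched in pandas, outside the graph. This is only an illustration of the logic: the rows, the `known_people` table, and the `resolvable` function are hypothetical stand-ins, and the column names are borrowed from the relationship files previewed in this section. In the actual load, the `MATCH` clauses in the Cypher statements play this role.

```python
import pandas as pd

# Hypothetical contact rows shaped like the SDC relationship files.
contacts = pd.DataFrame(
    {
        "sdc_internal_id": ["USGS:a", "USGS:b", "USGS:c", "USGS:d"],
        "rel_type": ["POINT_OF_CONTACT", "POINT_OF_CONTACT", "AUTHOR_OF", "AUTHOR_OF"],
        "email": ["known@example.gov", "unknown@example.gov", None, None],
        "orcid": [None, None, "0000-0000-0000-0001", None],
    }
)

# Hypothetical stand-in for Person nodes already in the graph from other sources.
known_people = pd.DataFrame(
    {
        "email": ["known@example.gov", "other@example.gov"],
        "orcid": [None, "0000-0000-0000-0001"],
    }
)


def resolvable(rows, people, key):
    """Keep only rows whose `key` identifier is present and matches a known person."""
    with_id = rows.dropna(subset=[key])
    known_ids = people[key].dropna()
    return with_id[with_id[key].isin(known_ids)]


by_email = resolvable(contacts, known_people, "email")
by_orcid = resolvable(contacts, known_people, "orcid")
print(by_email["sdc_internal_id"].tolist())
print(by_orcid["sdc_internal_id"].tolist())
```

Rows with no identifier at all, like the last one, drop out of both results, which mirrors how unidentified authors are skipped for now.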

In [1]:
import isaid_helpers
import pandas as pd

In [2]:
pd.read_csv(isaid_helpers.f_graphable_sdc).head()

Unnamed: 0,sdc_internal_id,name,description,url,last_updated
0,USGS:58e3b415e4b09da67997ee01,Wells and water-level altitude in the alluvium...,Points representing locations of wells in whic...,https://doi.org/10.5066/F71G0JF6,20200814
1,USGS:59a03b8ae4b038630d0303d7,Tuckahoe Creek stream flow quantity and qualit...,This posting contains the stream flow and load...,https://doi.org/10.5066/F7DB80RP,20200831
2,USGS:5a42aff6e4b0d05ee8bbf5f7,Low Intensity Land-use Overlap Colorado Plateau,This dataset represents the spatial overlap in...,https://doi.org/10.5066/F72J6B1M,20200827
3,USGS:5b17fc85e4b092d96521969b,"Estimated daily loads of nutrients, sediment, ...",As part of the Great Lakes Restoration Initiat...,https://doi.org/10.5066/P9UBQFPJ,20200827
4,USGS:5b5f51c2e4b006a11f66e9b9,Coastwide Reference Monitoring System (CRMS) 2...,Wetland restoration efforts conducted by the C...,https://doi.org/10.5066/P90RE64M,20200830


In [3]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MERGE (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            ON CREATE
                SET d.name = row.name,
                d.description = row.description,
                d.url = row.url,
                d.last_updated = row.last_updated
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc
    })

CPU times: user 6.82 ms, sys: 5.03 ms, total: 11.9 ms
Wall time: 5min 18s


In [4]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_usgs_thesaurus).head()

Unnamed: 0,entity_type,declared_term_source,rel_type,term,sdc_internal_id,reference,date_qualifier
0,DefinedSubjectMatter,USGS Thesaurus,ADDRESSES_SUBJECT,Groundwater Level,USGS:58e3b415e4b09da67997ee01,https://data.usgs.gov/datacatalog/data/USGS:58...,20200814
1,DefinedSubjectMatter,USGS Thesaurus,ADDRESSES_SUBJECT,Hydrogeology,USGS:58e3b415e4b09da67997ee01,https://data.usgs.gov/datacatalog/data/USGS:58...,20200814
2,DefinedSubjectMatter,USGS Thesaurus,ADDRESSES_SUBJECT,groundwater,USGS:58e3b415e4b09da67997ee01,https://data.usgs.gov/datacatalog/data/USGS:58...,20200814
3,DefinedSubjectMatter,USGS Thesaurus,ADDRESSES_SUBJECT,water quality,USGS:59a03b8ae4b038630d0303d7,https://data.usgs.gov/datacatalog/data/USGS:59...,20200831
4,DefinedSubjectMatter,USGS Thesaurus,ADDRESSES_SUBJECT,agriculture,USGS:59a03b8ae4b038630d0303d7,https://data.usgs.gov/datacatalog/data/USGS:59...,20200831


In [5]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            WITH d, row
                MATCH (t:DefinedSubjectMatter {name: row.term})
                MERGE (d)-[dt:ADDRESSES_SUBJECT]->(t)
                    SET dt.reference = row.reference,
                    dt.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_usgs_thesaurus
    })

CPU times: user 13.5 ms, sys: 9.36 ms, total: 22.9 ms
Wall time: 12min 7s


In [2]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_md).head()

Unnamed: 0,sdc_internal_id,reference,date_qualifier,rel_type,entity_type,email,orcid
0,USGS:58e3b415e4b09da67997ee01,https://data.usgs.gov/datacatalog/data/USGS:58...,20200814,POINT_OF_CONTACT,Person,mholmber@usgs.gov,
1,USGS:59a03b8ae4b038630d0303d7,https://data.usgs.gov/datacatalog/data/USGS:59...,20200831,POINT_OF_CONTACT,Person,whively@usgs.gov,
2,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827,POINT_OF_CONTACT,Person,jbradford@usgs.gov,
3,USGS:5b17fc85e4b092d96521969b,https://data.usgs.gov/datacatalog/data/USGS:5b...,20200827,POINT_OF_CONTACT,Person,tdstunte@usgs.gov,
4,USGS:5b5f51c2e4b006a11f66e9b9,https://data.usgs.gov/datacatalog/data/USGS:5b...,20200830,POINT_OF_CONTACT,Person,couvillionb@usgs.gov,


In [3]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            
            WITH d, row
                MATCH (p:Person {email: row.email})
                MERGE (p)-[mc:METADATA_CONTACT]->(d)
                    SET mc.reference = row.reference,
                    mc.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_md
    })

CPU times: user 5.82 ms, sys: 3.37 ms, total: 9.19 ms
Wall time: 4min 31s


In [4]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_poc).head()

Unnamed: 0,sdc_internal_id,reference,date_qualifier,rel_type,entity_type,email,orcid
0,USGS:58e3b415e4b09da67997ee01,https://data.usgs.gov/datacatalog/data/USGS:58...,20200814,METADATA_CONTACT,Person,mholmber@usgs.gov,
1,USGS:59a03b8ae4b038630d0303d7,https://data.usgs.gov/datacatalog/data/USGS:59...,20200831,METADATA_CONTACT,Person,whively@usgs.gov,
2,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827,METADATA_CONTACT,Person,jbradford@usgs.gov,
3,USGS:5b17fc85e4b092d96521969b,https://data.usgs.gov/datacatalog/data/USGS:5b...,20200827,METADATA_CONTACT,Person,djsulliv@usgs.gov,
4,USGS:5b5f51c2e4b006a11f66e9b9,https://data.usgs.gov/datacatalog/data/USGS:5b...,20200830,METADATA_CONTACT,Person,couvillionb@usgs.gov,


In [5]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            
            WITH d, row
                MATCH (p:Person {email: row.email})
                MERGE (p)-[poc:POINT_OF_CONTACT]->(d)
                    SET poc.reference = row.reference,
                    poc.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_poc
    })

CPU times: user 5.02 ms, sys: 4.03 ms, total: 9.05 ms
Wall time: 4min 13s


In [6]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_author).head()

Unnamed: 0,sdc_internal_id,reference,date_qualifier,rel_type,entity_type,email,orcid
0,USGS:58e3b415e4b09da67997ee01,https://data.usgs.gov/datacatalog/data/USGS:58...,20200814,AUTHOR_OF,Person,,0000-0002-1316-0412
1,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827,AUTHOR_OF,Person,,0000-0003-2353-8500
2,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827,AUTHOR_OF,Person,,0000-0001-6707-4803
3,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827,AUTHOR_OF,Person,,0000-0001-9257-6303
4,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827,AUTHOR_OF,Person,,0000-0002-9643-2785


In [7]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            
            WITH d, row
                MATCH (p:Person {orcid: row.orcid})
                MERGE (p)-[mc:AUTHOR_OF]->(d)
                    SET mc.reference = row.reference,
                    mc.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_author
    })

CPU times: user 11.7 ms, sys: 6.25 ms, total: 18 ms
Wall time: 11min 18s


In [9]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_places).head()

Unnamed: 0,entity_type,declared_term_source,rel_type,term,sdc_internal_id,reference,date_qualifier
0,Location,,ADDRESSES_PLACE,Colorado headwaters-Plateau,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827.0
1,Location,,ADDRESSES_PLACE,Colorado,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827.0
2,Location,,ADDRESSES_PLACE,New Mexico,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827.0
3,Location,,ADDRESSES_PLACE,Arizona,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827.0
4,Location,,ADDRESSES_PLACE,Utah,USGS:5a42aff6e4b0d05ee8bbf5f7,https://data.usgs.gov/datacatalog/data/USGS:5a...,20200827.0
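
Note that `date_qualifier` shows up here as a float (`20200827.0`), unlike in the earlier previews. One possible explanation, which is an assumption and not something confirmed by the source file, is that some rows have no date, which makes pandas fall back to a float column. The sketch below uses a small made-up file to show that effect and how reading with `dtype=str` exposes the raw values.

```python
import io

import pandas as pd

# Hypothetical two-row file; the second row has no date_qualifier.
raw = io.StringIO(
    "term,sdc_internal_id,date_qualifier\n"
    "Colorado,USGS:example1,20200827\n"
    "Utah,USGS:example2,\n"
)

# Default parsing: the missing value forces the column to float.
default_read = pd.read_csv(raw)

# Reading everything as strings keeps the value exactly as written in the file.
raw.seek(0)
string_read = pd.read_csv(raw, dtype=str)

print(default_read["date_qualifier"].tolist())
print(string_read["date_qualifier"].tolist())
```

If the `.0` suffix is actually present in the file itself, `LOAD CSV` would carry it into the graph as part of the string, so it may be worth checking before loading.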


In [11]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            
            WITH d, row
                MATCH (l:Location {name: row.term})
                MERGE (d)-[pl:ADDRESSES_PLACE]->(l)
                    SET pl.reference = row.reference,
                    pl.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_places
    })

CPU times: user 14.1 ms, sys: 7.55 ms, total: 21.7 ms
Wall time: 14min 11s
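
The relationship-loading cells above all follow the same shape: match a Dataset by `sdc_internal_id`, match an existing node by one property, then merge a relationship and set `reference` and `date_qualifier` on it. A minimal sketch of factoring that pattern into one function follows. `build_rel_query` is hypothetical and not part of `isaid_helpers`, and the file URL in the usage line is made up.

```python
def build_rel_query(source_url, node_label, node_key, row_field, rel_type,
                    dataset_is_source=False):
    """Return a LOAD CSV statement linking existing nodes to existing Dataset nodes.

    dataset_is_source=True points the relationship from the Dataset to the
    other node (as with ADDRESSES_PLACE); otherwise it points to the Dataset.
    """
    if dataset_is_source:
        pattern = "(d)-[r:%s]->(n)" % rel_type
    else:
        pattern = "(n)-[r:%s]->(d)" % rel_type
    return (
        "LOAD CSV WITH HEADERS FROM '%s' AS row\n"
        "MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})\n"
        "MATCH (n:%s {%s: row.%s})\n"
        "MERGE %s\n"
        "SET r.reference = row.reference,\n"
        "    r.date_qualifier = row.date_qualifier"
    ) % (source_url, node_label, node_key, row_field, pattern)


# Usage: the author relationships, matched on ORCID.
query = build_rel_query(
    "file:///example/authors.csv", "Person", "orcid", "orcid", "AUTHOR_OF"
)
print(query)
```

The resulting string could be passed to `session.run` in the same way as the cells above. Keeping the relationship type and match property as arguments makes it harder for a file and its hard-coded relationship type to drift apart.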
