# Datasets

In this stepwise look at the graphing process, we have already encountered some datasets, but we initially classed them as more generalized CreativeWork nodes because we had not yet followed up to determine more details. In this next code block, we pull in the entire USGS Science Data Catalog (SDC), the closest thing to a comprehensive open data inventory for the USGS. This is the most massive process we have worked through so far in trying to graph as much of the USGS as we can reasonably connect to, mainly because of the very large collection of unique keywords in this subset of fairly well documented datasets. The SDC indexing process (from source metadata documents into an Elasticsearch index) has done a reasonable job of splitting out place names, defined/referenced terms from a number of thesaurus sources, and undefined terms. The place names and referenced terms will all need further validation to confirm that they actually exist in some viable source, but they provide a good start toward what will probably be better collections.

The API source for datasets provides a graph-optimized derivative of the SDC index. It makes a few choices about which concepts need to be graphed as entities and relationships, and it lays out arrays of nodes, separating those identified with a resolvable identifier that our system already knows about from those with no resolvable identifier. For the moment, we are only going to process contacts with resolvable identifiers. For person contacts, we also only build relationships if the entity already exists in our graph from some other source: the SDC does not give us enough information to decide whether we want entities in our graph that we cannot resolve through one of the routes we are already processing. We are ignoring the unidentified authors at this point, as they will need some work to disambiguate and resolve before we are willing to add them to our graph.
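
As a rough illustration of the resolvable/unresolved split described above, the following sketch partitions contact rows by whether they carry an email or ORCID. It is not the upstream API's logic: the function name `split_by_identifier` is hypothetical, the sample rows only mirror the column layout of the contact files previewed in this section, and the `USGS:EXAMPLE` row is invented to show the unresolved case.

```python
import csv
import io

# Sample rows in the shape of the contact CSV files; USGS:EXAMPLE is a
# made-up row with no identifier, included only for illustration.
SAMPLE = """sdc_internal_id,rel_type,entity_type,email,orcid
USGS:ASC10,POINT_OF_CONTACT,Person,bmarcot@fs.fed.us,
USGS:ASC108,AUTHOR_OF,Person,,0000-0003-2554-0831
USGS:EXAMPLE,AUTHOR_OF,Person,,
"""


def split_by_identifier(rows):
    """Separate rows that have an email or ORCID from rows that have neither."""
    resolvable, unresolved = [], []
    for row in rows:
        email = (row.get("email") or "").strip()
        orcid = (row.get("orcid") or "").strip()
        if email or orcid:
            resolvable.append(row)
        else:
            unresolved.append(row)
    return resolvable, unresolved


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
resolvable, unresolved = split_by_identifier(rows)
print(len(resolvable), "resolvable;", len(unresolved), "left for later disambiguation")
```

Having an identifier is only the first filter. The load cells below additionally require that a matching Person node already exists in the graph.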

In [1]:
import isaid_helpers
import pandas as pd

In [2]:
pd.read_csv(isaid_helpers.f_graphable_sdc).head()

Unnamed: 0,sdc_internal_id,name,description,url,last_updated
0,USGS:ASC10,Map images portraying flight paths of low-alti...,Maps portraying the flight paths for low altit...,http://alaska.usgs.gov/science/subprogram.php?...,20201125
1,USGS:ASC108,Gulf of Alaska Shelf and Slope Iron and Nitrat...,These are data from cruises carried out in Apr...,https://doi.org/10.5066/F7222S06,20201125
2,USGS:5e0a22f8e4b0b207aa0d55ea,Radiometric thermal aerial imagery from unmann...,These digital images were taken over an area o...,https://doi.org/10.5066/P970EQ7D,20200821
3,USGS:5835ef59e4b0d9329c801bab,Amite River Flood Map Files,Heavy rainfall occurred across Louisiana durin...,https://doi.org/10.5066/F7T43R6C,20200821
4,USGS:5873c447e4b0a829a31e3195,Digital Orthorectified Aerial Image of Cottonw...,Orthorectified image from aerial photograph of...,https://dx.doi.org/10.5066/F7DZ06GR,20200825


In [3]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MERGE (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            ON CREATE
                SET d.name = row.name,
                d.description = row.description,
                d.url = row.url,
                d.last_updated = row.last_updated
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc
    })

CPU times: user 6.74 ms, sys: 4.79 ms, total: 11.5 ms
Wall time: 5min 7s
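
Every load cell in this section wraps its Cypher in the same `LOAD CSV WITH HEADERS ... AS row` preamble. One possible way to cut the repetition is a small helper that builds the statement text. This is a sketch only: `load_csv_statement` is a hypothetical name, not part of `isaid_helpers`, and the path and file name in the usage lines are placeholders.

```python
def load_csv_statement(source_path, source_file, body):
    """Build a LOAD CSV statement that applies `body` to each CSV row.

    `body` is the Cypher that follows `WITH row` (MATCH/MERGE/SET clauses).
    """
    return (
        "LOAD CSV WITH HEADERS FROM '%s/%s' AS row\n"
        "WITH row\n"
        "%s" % (source_path, source_file, body.strip())
    )


# Placeholder path and file name, used only to show the resulting text.
statement = load_csv_statement(
    "file:///cache",
    "datasets.csv",
    """
    MERGE (d:Dataset {sdc_internal_id: row.sdc_internal_id})
    ON CREATE SET d.name = row.name
    """,
)
print(statement)
```

The resulting string could then be passed to `session.run(...)` in the same way as the inline queries in this section.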


In [4]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_usgs_thesaurus).head()

Unnamed: 0,sdc_internal_id,date_qualifier,reference,DefinedSubjectMatter_url,DefinedSubjectMatter_name,DefinedSubjectMatter_source,DefinedSubjectMatter_source_reference,DefinedSubjectMatter_concept_label,DefinedSubjectMatter_description
0,USGS:f3b2429c-b5b7-442a-bc39-8d8b7cbe3429,2020-09-08T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:f3...,https://apps.usgs.gov/thesaurus/term-simple.ph...,administrative and political boundaries,USGS Thesaurus,https://apps.usgs.gov/thesaurus/thesaurus.php?...,USGS_SCIENCE_TOPICS,Lines drawn on maps or described in documents ...
1,USGS:5b736251e4b0f5d5787c61df,2020-08-27T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:5b...,https://apps.usgs.gov/thesaurus/term-simple.ph...,administrative and political boundaries,USGS Thesaurus,https://apps.usgs.gov/thesaurus/thesaurus.php?...,USGS_SCIENCE_TOPICS,Lines drawn on maps or described in documents ...
2,USGS:5e0a22f8e4b0b207aa0d55ea,2020-08-21T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:5e...,https://apps.usgs.gov/thesaurus/term-simple.ph...,aerial photography,USGS Thesaurus,https://apps.usgs.gov/thesaurus/thesaurus.php?...,USGS_SCIENTIFIC_METHODS,The process of taking pictures with a camera f...
3,USGS:5873c447e4b0a829a31e3195,2020-08-25T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:58...,https://apps.usgs.gov/thesaurus/term-simple.ph...,aerial photography,USGS Thesaurus,https://apps.usgs.gov/thesaurus/thesaurus.php?...,USGS_SCIENTIFIC_METHODS,The process of taking pictures with a camera f...
4,USGS:58920ec4e4b072a7ac12de7f,2020-08-25T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:58...,https://apps.usgs.gov/thesaurus/term-simple.ph...,aerial photography,USGS Thesaurus,https://apps.usgs.gov/thesaurus/thesaurus.php?...,USGS_SCIENTIFIC_METHODS,The process of taking pictures with a camera f...


In [5]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            WITH d, row
                MERGE (t:DefinedSubjectMatter {url: row.DefinedSubjectMatter_url})
                    SET t.name = row.DefinedSubjectMatter_name,
                    t.source = row.DefinedSubjectMatter_source,
                    t.reference = row.DefinedSubjectMatter_source_reference,
                    t.concept_label = row.DefinedSubjectMatter_concept_label,
                    t.description = row.DefinedSubjectMatter_description
                MERGE (d)-[dt:ADDRESSES_SUBJECT]->(t)
                    SET dt.reference = row.reference,
                    dt.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_usgs_thesaurus
    })

CPU times: user 19.8 ms, sys: 13.6 ms, total: 33.4 ms
Wall time: 18min 47s


In [6]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_md).head()

Unnamed: 0,sdc_internal_id,reference,date_qualifier,rel_type,entity_type,email,orcid
0,USGS:ASC10,https://data.usgs.gov/datacatalog/data/USGS:ASC10,20201125,POINT_OF_CONTACT,Person,bmarcot@fs.fed.us,
1,USGS:ASC108,https://data.usgs.gov/datacatalog/data/USGS:AS...,20201125,POINT_OF_CONTACT,Person,jcrusius@usgs.gov,
2,USGS:5e0a22f8e4b0b207aa0d55ea,https://data.usgs.gov/datacatalog/data/USGS:5e...,20200821,POINT_OF_CONTACT,Person,mcashman@usgs.gov,
3,USGS:5835ef59e4b0d9329c801bab,https://data.usgs.gov/datacatalog/data/USGS:58...,20200821,POINT_OF_CONTACT,Person,bbreaker@usgs.gov,
4,USGS:5873c447e4b0a829a31e3195,https://data.usgs.gov/datacatalog/data/USGS:58...,20200825,POINT_OF_CONTACT,Person,dmushet@usgs.gov,


In [7]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            
            WITH d, row
                MATCH (p:Person {email: row.email})
                MERGE (p)-[mc:METADATA_CONTACT]->(d)
                    SET mc.reference = row.reference,
                    mc.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_md
    })

CPU times: user 7.18 ms, sys: 5.47 ms, total: 12.7 ms
Wall time: 6min 8s
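
The previews above show `date_qualifier` in two shapes: compact values such as `20201125` in the contact file and ISO-style timestamps such as `2020-09-08T00:00:00` in the thesaurus file. The load cells store whatever string they receive. If a consistent form is wanted later, a normalizer along these lines might work. It is a sketch that assumes these two formats are the only ones present, which has not been verified against the full files, and `normalize_date_qualifier` is a hypothetical name.

```python
from datetime import datetime

# The two formats seen in the previews; other formats are treated as unknown.
KNOWN_FORMATS = ("%Y%m%d", "%Y-%m-%dT%H:%M:%S")


def normalize_date_qualifier(value):
    """Return an ISO date string (YYYY-MM-DD), or None if the value is unrecognized."""
    text = str(value).strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in (20201125, "20200821", "2020-09-08T00:00:00", "not a date"):
    print(repr(raw), "->", normalize_date_qualifier(raw))
```

Returning `None` for unknown shapes, rather than raising, keeps a bulk pass from stopping on the first unexpected value.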


In [8]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_poc).head()

Unnamed: 0,sdc_internal_id,reference,date_qualifier,rel_type,entity_type,email,orcid
0,USGS:ASC10,https://data.usgs.gov/datacatalog/data/USGS:ASC10,20201125,METADATA_CONTACT,Person,bmarcot@fs.fed.us,
1,USGS:ASC108,https://data.usgs.gov/datacatalog/data/USGS:AS...,20201125,METADATA_CONTACT,Person,jcrusius@usgs.gov,
2,USGS:5e0a22f8e4b0b207aa0d55ea,https://data.usgs.gov/datacatalog/data/USGS:5e...,20200821,METADATA_CONTACT,Person,mcashman@usgs.gov,
3,USGS:5835ef59e4b0d9329c801bab,https://data.usgs.gov/datacatalog/data/USGS:58...,20200821,METADATA_CONTACT,Person,bbreaker@usgs.gov,
4,USGS:5873c447e4b0a829a31e3195,https://data.usgs.gov/datacatalog/data/USGS:58...,20200825,METADATA_CONTACT,Person,//www.npwrc.usgs.gov/contact,


In [9]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            
            WITH d, row
                MATCH (p:Person {email: row.email})
                MERGE (p)-[mc:POINT_OF_CONTACT]->(d)
                    SET mc.reference = row.reference,
                    mc.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_poc
    })

CPU times: user 6.08 ms, sys: 3.69 ms, total: 9.77 ms
Wall time: 4min 45s
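
The last row of the preview above carries `//www.npwrc.usgs.gov/contact` in the `email` column. Because the load only matches existing Person nodes by email, such a value simply finds no match and creates nothing. To report these rows rather than silently skip them, a loose shape check could be used. This is a sketch: the pattern is deliberately permissive and is not a full email validator, and `looks_like_email` is a hypothetical name.

```python
import re

# Permissive shape test: something@something.something, with no spaces or slashes.
EMAIL_SHAPE = re.compile(r"^[^@\s/]+@[^@\s/]+\.[^@\s/]+$")


def looks_like_email(value):
    """Return True if the value has the rough shape of an email address."""
    if not value:
        return False
    return EMAIL_SHAPE.match(value.strip()) is not None


# Values taken from the preview above.
for candidate in ("dmushet@usgs.gov", "//www.npwrc.usgs.gov/contact", ""):
    print(repr(candidate), looks_like_email(candidate))
```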


In [10]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_author).head()

Unnamed: 0,sdc_internal_id,reference,date_qualifier,rel_type,entity_type,email,orcid
0,USGS:ASC108,https://data.usgs.gov/datacatalog/data/USGS:AS...,20201125,AUTHOR_OF,Person,,0000-0003-2554-0831
1,USGS:5e0a22f8e4b0b207aa0d55ea,https://data.usgs.gov/datacatalog/data/USGS:5e...,20200821,AUTHOR_OF,Person,,0000-0002-6635-4309
2,USGS:5e0a22f8e4b0b207aa0d55ea,https://data.usgs.gov/datacatalog/data/USGS:5e...,20200821,AUTHOR_OF,Person,,0000-0003-3797-4207
3,USGS:58cc167ee4b0849ce97dce32,https://data.usgs.gov/datacatalog/data/USGS:58...,20200830,AUTHOR_OF,Person,,0000-0003-0110-0284
4,USGS:58cc167ee4b0849ce97dce32,https://data.usgs.gov/datacatalog/data/USGS:58...,20200830,AUTHOR_OF,Person,,0000-0002-2834-2243


In [11]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            
            WITH d, row
                MATCH (p:Person {orcid: row.orcid})
                MERGE (p)-[mc:AUTHOR_OF]->(d)
                    SET mc.reference = row.reference,
                    mc.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_author
    })

CPU times: user 14.6 ms, sys: 7.39 ms, total: 22 ms
Wall time: 12min 52s
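
Author relationships here hinge on the ORCID string matching an existing Person node exactly. ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 scheme, so a mistyped identifier can often be caught before the load. The following sketch implements that check. The function names are hypothetical, and a passing checksum only means the string is well formed, not that the iD is registered.

```python
import re


def orcid_check_character(base15):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base15:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Return True if the value is 16 characters (hyphens ignored) with a valid check character."""
    compact = orcid.replace("-", "").upper()
    if re.fullmatch(r"[0-9]{15}[0-9X]", compact) is None:
        return False
    return orcid_check_character(compact[:15]) == compact[15]


# The first value comes from the author preview above; the second alters its last digit.
for candidate in ("0000-0003-2554-0831", "0000-0003-2554-0832"):
    print(candidate, is_well_formed_orcid(candidate))
```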


In [12]:
pd.read_csv(isaid_helpers.f_graphable_sdc_rels_places).head()

Unnamed: 0,sdc_internal_id,date_qualifier,reference,DefinedSubjectMatter_name,DefinedSubjectMatter_source,DefinedSubjectMatter_source_reference,DefinedSubjectMatter_concept_label,DefinedSubjectMatter_url,DefinedSubjectMatter_description
0,USGS:5cf01a85e4b0b51330e22aa6,2020-08-27T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:5c...,Abbeville County,Wikidata US Counties,Wikidata county of state instances,US_COUNTY,http://www.wikidata.org/entity/Q306343,"county in South Carolina, United States"
1,USGS:5847137ee4b0f34b016ff271,2020-08-31T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:58...,Abu,Wikidata Global Volcanos,https://www.wikidata.org/wiki/Q8072,NAMED_VOLCANO,http://www.wikidata.org/entity/Q334728,"mountain in Yamaguchi Prefecture, Japan"
2,USGS:5eb1ca8782cefae35a29c3d3,2020-08-19T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:5e...,Acadia National Park,Wikidata US National Parks,https://www.wikidata.org/wiki/Q34918903,NATIONAL_PARK,http://www.wikidata.org/entity/Q337396,national park in the US state of Maine
3,USGS:5c018adae4b0815414cc70bc,2020-09-25T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:5c...,Acadia National Park,Wikidata US National Parks,https://www.wikidata.org/wiki/Q34918903,NATIONAL_PARK,http://www.wikidata.org/entity/Q337396,national park in the US state of Maine
4,USGS:5b92cffce4b0702d0e80a2d5,2021-06-01T00:00:00,https://data.usgs.gov/datacatalog/data/USGS:5b...,Acadia National Park,Wikidata US National Parks,https://www.wikidata.org/wiki/Q34918903,NATIONAL_PARK,http://www.wikidata.org/entity/Q337396,national park in the US state of Maine


In [14]:
%%time
with isaid_helpers.graph_driver.session(database=isaid_helpers.graphdb) as session:
    session.run("""
        LOAD CSV WITH HEADERS FROM '%(source_path)s/%(source_file)s' AS row
        WITH row
            MATCH (d:Dataset {sdc_internal_id: row.sdc_internal_id})
            WITH d, row
                MERGE (l:Location {url: row.DefinedSubjectMatter_url})
                ON CREATE
                    SET l.name = row.DefinedSubjectMatter_name,
                    l.source = row.DefinedSubjectMatter_source,
                    l.reference = row.DefinedSubjectMatter_source_reference,
                    l.concept_label = row.DefinedSubjectMatter_concept_label,
                    l.description = row.DefinedSubjectMatter_description
                MERGE (d)-[pl:ADDRESSES_PLACE]->(l)
                    SET pl.reference = row.reference,
                    pl.date_qualifier = row.date_qualifier
    """ % {
        "source_path": isaid_helpers.local_cache_path,
        "source_file": isaid_helpers.f_graphable_sdc_rels_places
    })

CPU times: user 14.8 ms, sys: 8.37 ms, total: 23.2 ms
Wall time: 13min 5s
