This notebook works through building a knowledge representation of the geoheritage sites that the USGS is currently compiling into a geodatabase. The work establishes a new entity type in the GeoKB: [USGS Geoheritage Site](https://geokb.wikibase.cloud/wiki/Item_talk:Q158700).

In [2]:
from wbmaker import WikibaseConnection
import pandas as pd
import geopandas as gpd

geokb = WikibaseConnection('GEOKB_CLOUD')

geoheritage_site_qid = "Q158700"
geoheritage_site_db_qid = "Q158701"

# Source Data
I'm starting here with a rough draft of the database, housed as a feature class in AGOL and pulled in for testing as GeoJSON.

In [3]:
geoheritage_trial = gpd.read_file('data/geoheritage.geojson')

In [30]:
geoheritage_trial.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   OBJECTID            56 non-null     int64   
 1   INDEX_              53 non-null     float64 
 2   GHS_ID              0 non-null      float64 
 3   RESPONSIBLE_PERSON  55 non-null     object  
 4   NAME                55 non-null     object  
 5   LITHOLOGY           0 non-null      float64 
 6   AGE                 0 non-null      float64 
 7   MANAGED_BY          55 non-null     object  
 8   DESIGNATION         55 non-null     object  
 9   COUNTY              0 non-null      float64 
 10  STATE               56 non-null     object  
 11  TYPE                52 non-null     object  
 12  SCIENTIFIC          52 non-null     object  
 13  EDUCATIONAL         0 non-null      float64 
 14  CULTURAL            0 non-null      float64 
 15  ECONOMIC            0 non-null    

## Type Classification
It will be useful for other purposes to introduce the geographical feature types used in the Geoheritage Sites database as entities (the TYPE field in the source database may contain multiple terms separated with "|"). From the GeoKB perspective, most of these concepts can likely be organized as subclasses of geographical feature (a new subclass of spatio-temporal entity). Classification within Wikidata varies for these concepts, and some of them are classified in detailed ways that we may not agree with, or at least may not want to deal with in our case. Some of these concepts, like "river" and "mesa," are also found within the SWEET ontology, but again we may or may not conform with or wish to incorporate SWEET's specific classification scheme. In any case, we should include "same as" or "see also" relationships to similar concepts from Wikidata and SWEET where possible.

Once established, the geoheritage sites will be classed as instances of one or more of the geographical feature classes along with being an instance of a geoheritage site.

There is a 2007 report of some type from NPS on NNL classification (track this down).

In [16]:
geoheritage_trial['geoheritage_types'] = geoheritage_trial['TYPE'].apply(lambda x: x.split('|') if isinstance(x, str) else None)
geoheritage_types = list(geoheritage_trial[['geoheritage_types']].explode('geoheritage_types')['geoheritage_types'].unique())
geoheritage_types

['paleontology',
 'sinkholes',
 'erosional passage',
 'waterfall',
 'river',
 'mountain',
 'butte',
 'mesa',
 'mesas',
 'bluff',
 'springs',
 'cave',
 'arch',
 'eroded cliffs',
 None,
 'glacier',
 'flowstone',
 'marsh',
 'spires',
 'cliffs',
 'bedrock',
 'dunes',
 'beach',
 'barrier island',
 'cirque',
 'rocky promontory',
 'erratic',
 'eroded cliff',
 'canyon',
 'unconformity',
 'geomorphic',
 'boulder',
 'banded metamorphics',
 'ravines',
 'ledges',
 'lake',
 'gorge',
 'strait',
 'columnar basalt',
 'metamorphic rock',
 'cliff',
 'volcano',
 'lava flows',
 'dome',
 'monolith',
 'welded tuff',
 'sills',
 'dikes',
 'volcanic island',
 'hot spring']
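The list above contains near-duplicates that differ only in number, such as 'mesa'/'mesas', 'cliff'/'cliffs', and 'eroded cliff'/'eroded cliffs'. Before minting feature-type entities, it may help to fold these together. The following is a minimal sketch, assuming a hand-maintained mapping is preferable to automated stemming for a vocabulary this small; `PREFERRED_TERMS` and `normalize_types` are hypothetical names, not part of the notebook or of `wbmaker`.

```python
# Hypothetical, hand-maintained mapping from variant terms to a preferred
# singular form. Only variants visible in the TYPE output above are listed.
PREFERRED_TERMS = {
    "mesas": "mesa",
    "cliffs": "cliff",
    "eroded cliffs": "eroded cliff",
}


def normalize_types(raw):
    """Split a pipe-delimited TYPE value into a de-duplicated list of terms."""
    if not isinstance(raw, str):
        return []  # NaN or None in the source data
    terms = []
    for term in raw.split("|"):
        term = term.strip().lower()
        term = PREFERRED_TERMS.get(term, term)
        # Skip empty strings and terms already collected for this site
        if term and term not in terms:
            terms.append(term)
    return terms


print(normalize_types("cave|eroded cliffs"))
```

With the DataFrame from this notebook, this could be applied as `geoheritage_trial['TYPE'].apply(normalize_types)`. Terms such as 'springs' versus 'hot spring' are left alone on purpose, since deciding whether they are the same concept is a modeling question and not a spelling one.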

## Description
Some of these descriptions can be quite useful, and we'll need to talk about how to flesh out and systematize them into item descriptions of 250 characters or fewer.
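As a stopgap while that conversation happens, a long description could be trimmed mechanically to fit the limit. This is only a sketch under the assumption that cutting at a word boundary and marking the cut with an ellipsis is acceptable; `to_item_description` is a hypothetical helper, and a curated description will usually read better than a truncated one.

```python
import re

MAX_DESCRIPTION_LENGTH = 250  # limit taken from the text above


def to_item_description(text, limit=MAX_DESCRIPTION_LENGTH):
    """Collapse whitespace and trim text to at most `limit` characters."""
    if not isinstance(text, str):
        return None  # NaN or None in the source data
    cleaned = re.sub(r"\s+", " ", text).strip()
    if len(cleaned) <= limit:
        return cleaned or None
    # Reserve three characters for the ellipsis, then back up to a word boundary
    cut = cleaned[: limit - 3]
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip(" ,;:") + "..."


sample = "The trails at Leon Sinks weave around numerous karst features. " * 6
print(len(to_item_description(sample)))
```

Placeholder notes that appear in the DESCRIPTION field would pass through this helper unchanged, so they still need to be screened by hand.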

In [35]:
geoheritage_trial['DESCRIPTION'].head(20).unique()

array(['NNL',
       'The trails at Leon Sinks near Tallahassee, Florida weave around numerous karst features such as sink holes, natural bridges and disappearing streams, providing a window into the Floridan Aquifer beneath.',
       'need point Ask Andrew',
       'sandstone falls, sandstone, shale, gravel, fluvial', nan,
       "Pompeys?Pillar National Monument?encompasses?51 acres?on the banks of the Yellowstone River?with?a massive sandstone outcrop?covering about 2 acres at its?base and rising?120 feet high?toward Montana's Big Sky. The monument's premier location at a natural ford in the Yellowstone River, and its geologic distinction as the only major sandstone formation in the area, have made?Pompeys?Pillar a celebrated landmark and outstanding observation point for?more than?eleven thousand years of human occupation.?Hundreds of markings, petroglyphs, and inscriptions left by visitors including William Clark and the Lewis and Clark Expedition have transformed this geologic ph

## Responsible Person
It will be useful to connect geoheritage sites to GeoKB entities representing people. The closest predicate we have in place now is "point of contact." I went in after the fact and made Christina DeVera a point of contact linked to the one site in our initial pilot set. We need to map the other points of contact to identified entities. One option would be for the source data to incorporate an ORCID identifier for the "RESPONSIBLE_PERSON." We can keep the shorthand there if it is otherwise useful, but the identifier will give us the hook for building real linkages.

In [19]:
geoheritage_trial['RESPONSIBLE_PERSON'].unique()

array(['CD', 'JR', 'TimH', 'TomC', nan], dtype=object)
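If ORCID identifiers are added to the source data, a checksum test can catch typos before we try to build linkages on them. ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 algorithm. The sketch below validates the hyphenated 16-character form only; `is_valid_orcid` is a hypothetical helper, and the iD in the example is the sample iD used in ORCID's own documentation, not one belonging to anyone in this dataset.

```python
import re

# Four groups of four characters; the last character may be the letter X
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def is_valid_orcid(orcid):
    """Check the format and the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    if not isinstance(orcid, str) or not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:-1]:
        total = (total + int(char)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[-1] == expected


print(is_valid_orcid("0000-0002-1825-0097"))
```

A passing checksum only shows that the string is well formed. Confirming that the iD belongs to the intended person still requires a lookup against ORCID or against the person entities already in the GeoKB.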

## Images
The schema includes a URL for an image stored as an attachment on an [item in ScienceBase](https://www.sciencebase.gov/catalog/item/64f7827cd34ed30c20544c32). This is another case where we could use some work to firm up the best approach.

* Wikidata uses a relationship with Wikimedia Commons for this kind of functionality. One approach that we should likely pursue anyway is establishing that capability. This is mostly in place other than the specialized display functionality within the Wikibase UI. I added an example image of the Landscape Arch from Wikimedia Commons to a new "image" property for that [item](https://geokb.wikibase.cloud/wiki/Item:Q158705). I'll work with the Wikibase.cloud folks on adding this functionality along with the map preview that I'd asked about previously.
    * In cases where we can legitimately donate other images to Wikimedia Commons that we want to use, that may be a reasonable path.
* If we need/want to use ScienceBase as an image repository, we should follow the same pattern used for the [USGS Library Historic Photo Collection](https://www.sciencebase.gov/catalog/item/519ba0a3e4b0e4e151ef5dd9) where each individual photo is its own item complete with a little bit of metadata. There is some specialized functionality in ScienceBase for static 2D images like this that provides multiple downscaling and file conversion options. However, ScienceBase is really not designed as a photo repository, with existing collections adding somewhat to ScienceBase's technical debt problems.
* We may want to pursue some other capability that is expressly tailored to handle 2D images as a new type of enterprise repository in USGS. This could be our own Wikimedia Commons implementation or something else entirely. We have other use cases for this coming along within EMMA.

In [27]:
geoheritage_trial[geoheritage_trial['THUMBNAIL_URL'].notnull()]['THUMBNAIL_URL'].head().to_list()

['https://www.sciencebase.gov/catalog/file/get/64f7827cd34ed30c20544c32?f=__disk__40%2F15%2F85%2F401585bdf8219db8c44a0a5e5b89b2ec9e3c9ca0&community=Geoheritage+Sites+of+the+Nation&allowOpen=true',
 'https://www.sciencebase.gov/catalog/file/get/64f7827cd34ed30c20544c32?f=__disk__a3%2F4d%2F4d%2Fa34d4db191c8a6a1492162dfe59b2663b5386f70&community=Geoheritage+Sites+of+the+Nation&allowOpen=true',
 'https://www.sciencebase.gov/catalog/file/get/64f7827cd34ed30c20544c32?f=__disk__a9%2Ff8%2F08%2Fa9f8088ced89e101409cbcb112e9f43025aa2881&community=Geoheritage+Sites+of+the+Nation&allowOpen=true',
 'https://www.sciencebase.gov/catalog/file/get/64f7827cd34ed30c20544c32?f=__disk__6d%2F89%2F32%2F6d8932e83b36e35230f947a955d7122a85810a3f&community=Geoheritage+Sites+of+the+Nation&allowOpen=true',
 'https://www.sciencebase.gov/catalog/file/get/64f7827cd34ed30c20544c32?f=__disk__db%2Fc7%2F80%2Fdbc780a433ae1724b968feb99d83bb6e828ada2a&community=Geoheritage+Sites+of+the+Nation&allowOpen=true']
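All of the thumbnails above hang off the same ScienceBase item and differ only in the `f` query parameter. If we end up restructuring these into one item per image, it may be useful to pull the pieces apart first. This is a minimal sketch that assumes the URL layout seen in the output above; that layout is an observation from this data and not a documented ScienceBase contract, and `parse_sciencebase_file_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse


def parse_sciencebase_file_url(url):
    """Split a ScienceBase file URL into item ID, file key, and community."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)  # also decodes %2F and + in the values
    return {
        # Assumes the item ID is the last path segment, as in the URLs above
        "item_id": parsed.path.rstrip("/").split("/")[-1],
        "file_key": query.get("f", [None])[0],
        "community": query.get("community", [None])[0],
    }


example = (
    "https://www.sciencebase.gov/catalog/file/get/64f7827cd34ed30c20544c32"
    "?f=__disk__40%2F15%2F85%2F401585bdf8219db8c44a0a5e5b89b2ec9e3c9ca0"
    "&community=Geoheritage+Sites+of+the+Nation&allowOpen=true"
)
print(parse_sciencebase_file_url(example))
```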

## Values
It looks like the schema is heading toward linking each site to concepts that express its scientific and other "values." So far, "SCIENTIFIC" is the only one of these fields populated with concepts; a separate VALUE field contains a basic flag for the value types that apply. The concepts here make sense to incorporate into the GeoKB in some way for other purposes, but we'll need to work through what the significance of these values is and where that end of the schema is headed. Many of them are essentially geologic features, but they likely mean something more in the context of a "scientific value system," where they serve as the justification for characterizing a specific geographic feature as a geoheritage site.

In [29]:
geoheritage_trial['VALUE'].unique()

array(['Scientific|Cultural|Economic', 'Scientific',
       'Scientific|Cultural', 'Scientific|Aesthetic',
       'Scientific|Cultural|Aesthetic', nan, 'Scientific|Educational',
       'Scientific|Cultural|Economic|Aesthetic', 'Scientific|Economic',
       'Scientific|Cultural|Educational'], dtype=object)

In [32]:
geoheritage_trial['scientific_values'] = geoheritage_trial['SCIENTIFIC'].apply(lambda x: x.split('|') if isinstance(x, str) else None)
scientific_values = list(geoheritage_trial[['scientific_values']].explode('scientific_values')['scientific_values'].unique())
scientific_values

['History of the Earth',
 'Cave and Karst',
 'Erosional Systems',
 'Tectonic Systems',
 'Meteorite Impacts',
 'River, lake, and delta systems',
 'Cave and Karst Systems',
 '',
 None,
 'Glacial and Periglacial Systems',
 'Coastal',
 'Marine',
 'Desert and Semi-desert',
 'Glacial and Periglacial',
 'Glacial ',
 'Volcanic Systems',
 'River, Lake and Delta Systems',
 'Marine Systems',
 'Volcanic']
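Several of the terms above look like variants of one concept ('Cave and Karst' and 'Cave and Karst Systems', or the two spellings of the river, lake, and delta entry), and the list also contains an empty string. Grouping the variants makes it easier to decide on preferred labels. This is a sketch that assumes case, commas, surrounding whitespace, and a trailing "Systems" are not meaningful distinctions; `value_key` and `group_variants` are hypothetical helpers.

```python
import re


def value_key(term):
    """Return a comparison key for a SCIENTIFIC term, or None if it is empty."""
    if not isinstance(term, str):
        return None
    key = term.strip().lower().replace(",", "")
    key = re.sub(r"\s+systems?$", "", key)  # drop a trailing "system(s)"
    key = re.sub(r"\s+", " ", key)
    return key or None


def group_variants(terms):
    """Group the original spellings of each term under a shared key."""
    groups = {}
    for term in terms:
        key = value_key(term)
        if key is None:
            continue  # skip None and empty strings
        groups.setdefault(key, []).append(term)
    return groups


observed = ["Cave and Karst", "Cave and Karst Systems", "", None,
            "Marine", "Marine Systems", "Glacial ", "Glacial and Periglacial"]
for key, variants in group_variants(observed).items():
    print(key, variants)
```

Note that 'Glacial ' stays in its own group here. Whether it is shorthand for 'Glacial and Periglacial Systems' is a question for the data owners, not something to infer in code.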

## Data Source
There is a mishmash of notes in this field currently, with the majority pointing toward some type of citation or reference work. The linkage to a publication of some kind is something we can easily build in. We already have many thousands of publication representations from the Pubs Warehouse through other work, and we can add either one-offs or another tap to a source (e.g., it might be otherwise useful to develop a relationship with an NPS catalog). The main thing we need to work out, though, is the semantics of this relationship. We've already included the USGS Geoheritage database as our data source via references on specific claims. We could add these other types of assets as an additional data source reference, at least on the "instance of" claim indicating that something is a "Geoheritage Site." Or we might add them as their own statement of some kind indicating an original source. One way or the other, this is a useful design pattern to develop, as we have many other cases where USGS builds some kind of amalgamation of data from many other original sources.

In [33]:
geoheritage_trial['DataSource'].unique()

array([nan,
       'Geologic Features of our National Forests and Grasslands (usda.gov)',
       'Thornberry-Ehrlich, T. 2009. Catoctin Mountain Park Geologic Resources Inventory Report. Natural Resource Report NPS/NRPC/GRD/NRR—2009/120. National Park Service, Denver, Colorado.',
       'Pompeys Pillar National Monument | Bureau of Land Management (blm.gov)',
       'National Natural Landmarks - National Natural Landmarks (U.S. National Park Service) (nps.gov)',
       'Active Process Monitoring Example—Landscape Arch (U.S. National Park Service) (nps.gov)',
       'Dickas, 2012',
       'Thornberry-Ehrlich, T. 2022. Little River Canyon National Preserve: Geologic resources inventory report. Natural Resource Report NPS/NRSS/GRD/NRR—2022/2469. National Park Service, Fort Collins, Colorado',
       'https://www.usgs.gov/news/fire-island-wilderness-breach-help-or-hindrance',
       'https://iugs-geoheritage.org/geoheritage_sites/the-great-unconformity-at-grand-canyon/',
       'Thornberry
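Before settling the semantics, it may help to triage the field into rough kinds of source, since URLs, pasted web page titles, and formal citations will each need different handling. The rules below are a sketch based only on the patterns visible in the output above; the category names and the `classify_data_source` helper are hypothetical, and the regular expressions will certainly miss cases.

```python
import re
from urllib.parse import urlparse

# A pasted page title ending in a parenthesized domain, e.g. "... (nps.gov)"
PAGE_TITLE_PATTERN = re.compile(r"\((?:[\w-]+\.)+(?:gov|org|com|edu)\)\s*$")
# A four-digit year, taken here as a weak signal of a bibliographic citation
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def classify_data_source(value):
    """Assign a rough category to a DataSource value."""
    if not isinstance(value, str) or not value.strip():
        return "missing"
    text = value.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if PAGE_TITLE_PATTERN.search(text):
        return "web page title"
    if YEAR_PATTERN.search(text):
        return "citation"
    return "other"


print(classify_data_source("Dickas, 2012"))
```

Applied with `geoheritage_trial['DataSource'].apply(classify_data_source)`, this would give a quick count of how many values could be matched to publication entities versus how many need a URL or a follow-up with the person who entered them.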

# Park Lookup
A key aspect of a USGS Geoheritage Site is the park or other protected area the site is located within. In other work, we've brought those items into the GeoKB. Here we look up the entities we will link to.

In [20]:
query_geokb_parks = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P1 wd:Q158222 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

geokb_parks = geokb.sparql_query(
    query=query_geokb_parks,
    endpoint=geokb.sparql_endpoint,
    output="dataframe"
)
geokb_parks['location_qid'] = geokb_parks['item'].apply(lambda x: x.split('/')[-1])

In [21]:
unit_linkable_sites = pd.merge(
    left=geoheritage_trial,
    right=geokb_parks[['location_qid','itemLabel']].rename(columns={'itemLabel': 'DESIGNATION'}),
    how="inner",
    on="DESIGNATION"
)

unit_linkable_sites

Unnamed: 0,OBJECTID,INDEX_,GHS_ID,RESPONSIBLE_PERSON,NAME,LITHOLOGY,AGE,MANAGED_BY,DESIGNATION,COUNTY,...,NGMDB_LINK,DataSource,Lat_DD,Long_DD,NOTES,VALUE,THUMBNAIL_URL,geometry,geoheritage_types,location_qid
0,3,20.0,,CD,Cumberland Gap,,,National Park Service,Cumberland Gap National Historical Park,,...,,,36.604,-83.673,"pilot site, USGS EarthExplorer coordinates",Scientific|Cultural,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-83.67300 36.60400),[erosional passage],Q158334
1,5,23.0,,CD,Catoctin Mountain,,,National Park Service,Catoctin Mountain Park,,...,,"Thornberry-Ehrlich, T. 2009. Catoctin Mountain...",39.6479,-77.4664,USGS EarthExplorer coordinates,Scientific|Cultural|Economic,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-77.46640 39.64790),[mountain],Q158307
2,6,24.0,,CD,Cadillac Mountain,,,National Park Service,Acadia National Park,,...,,,44.3513,-68.2265,USGS EarthExplorer coordinates,Scientific|Aesthetic,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-68.22650 44.35130),[mountain],Q158227
3,10,39.0,,CD,Bromide Hill,,,National Park Service,Chickasaw National Recreation Area,,...,,,34.4951,-96.982,USGS EarthExplorer coordinates,Scientific|Cultural,,POINT (-96.98200 34.49510),"[bluff, springs]",Q158319
4,12,48.0,,CD,Landscape Arch,,,National Park Service,Arches National Park,,...,,Active Process Monitoring Example—Landscape Ar...,38.7905,-109.6068,"NPS Geoheritage Sites, IUGS second 100 shortlist",Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-109.60680 38.79050),[arch],Q158250
5,13,52.0,,CD,Devils Island,,,National Park Service,Apostle Islands National Lakeshore,,...,,,47.067941,-90.735702,"coordinates from Google, northernmost point in...",Scientific|Cultural,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-90.73570 47.06794),"[cave, eroded cliffs]",Q158247
6,15,1.0,,JR,Exit Glacier,,,National Park Service,Kenai Fjords National Park,,...,,"Dickas, 2012",60.181111,-149.638611,"on AG list, USGS EarthExplorer coordinates (60...",Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-149.63861 60.18111),[glacier],Q158474
7,16,2.0,,JR,Little River Canyon (pin the falls),,,National Park Service,Little River Canyon National Preserve,,...,,"Thornberry-Ehrlich, T. 2022. Little River Cany...",34.4146,-85.6174,"NPS Geodiversity Atlas, USGS EarthExplorer coo...",Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-85.61740 34.41460),[waterfall],Q158497
8,18,13.0,,JR,Salt marsh (by visitors center viewing platform),,,National Park Service,Cumberland Island National Seashore,,...,,,30.795659,-81.501117,coordinates from Google (Cumberland Island Nat...,Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-81.50112 30.79566),[marsh],Q158335
9,20,19.0,,JR,Limestone field (p 2 of IRMA report),,,National Park Service,Tallgrass Prairie National Preserve,,...,,,38.441834,-96.559548,"coordinates from Google, Fox Creek School (mig...",Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-96.55955 38.44183),[bedrock],Q158637


# Managed Unit Linkages

In a knowledge representation, we try to make everything linkable that can possibly be linked, because the links let us connect the dots to all kinds of other information and sometimes infer characteristics based on the linkage. I need to do some follow-up work to track down the best available USFS and BLM sources for their managed areas. Also, while we have all 471 NPS units in the GeoKB now, the names don't align completely, and we will need to do some work with the USGS Geoheritage source data on this. Ideally, the source data would carry an appropriate unique identifier for each unit, from whichever agency manages it.

In [22]:
geoheritage_trial[~geoheritage_trial['NAME'].isin(unit_linkable_sites['NAME'])]['DESIGNATION'].unique()

array(['Garden Park Fossil Area  Cañon City',
       'Apalachiocola National Forest',
       'Kisatchie Bayou Recreation Complex ',
       'Mark Twain National Forest', 'Pompeys Pillar National Monument',
       'Dakota Prairie Grasslands',
       'Natchez Trace National Historic Trail', 'White Sands',
       'Ozark-St. Francis National Forest', 'Shawnee National Forest',
       'White Mountain National Forest', 'El Yunque National Forest',
       'Sachuest Point National Wildlife Refuge',
       'Green Mountain National Forest', 'New River National Park',
       'Weir Farm National Historic Site', 'Rock Creek National Park',
       'Great Falls Park National Park',
       'First State National Historic Park',
       'Guam National Wildlife Refuge', "Hawai'i Volcanoes National Park",
       'Hells Canyon National Recreation Area (Managed by the Wallowa-Whitman National Forest',
       'Pisgah National Forest', 'Paterson Great Falls National Park',
       'Hickison Petroglyph Recreation
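Some of the unmatched designations above fail an exact join only because of spelling or stray whitespace (for example, 'Apalachiocola National Forest' or the trailing space after 'Kisatchie Bayou Recreation Complex'). Until identifiers are available, approximate string matching can propose candidates for a person to confirm. This is a sketch using `difflib` from the standard library; `suggest_unit_matches` is a hypothetical helper, the cutoff of 0.85 is an untuned guess, and the candidate labels in the example are stand-ins, since USFS units are not yet loaded into the GeoKB.

```python
import difflib


def suggest_unit_matches(designation, unit_labels, cutoff=0.85, n=3):
    """Return up to `n` unit labels that closely resemble a DESIGNATION value."""
    if not isinstance(designation, str):
        return []
    # Collapse repeated and trailing whitespace before comparing
    cleaned = " ".join(designation.split())
    return difflib.get_close_matches(cleaned, unit_labels, n=n, cutoff=cutoff)


# Stand-in labels for illustration only; real labels would come from a
# query against the GeoKB or an agency source.
candidate_labels = [
    "Apalachicola National Forest",
    "Mark Twain National Forest",
    "Shawnee National Forest",
]
print(suggest_unit_matches("Apalachiocola National Forest", candidate_labels))
```

Suggestions like these should feed a review list, not an automatic merge. Entries such as 'White Sands' or 'New River National Park' are abbreviated or renamed forms that string similarity alone may not resolve.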

# Wikidata Linkage
One of the things we are including in our knowledge representation schema, wherever possible, is either a "same as" or a "see also" relationship to Wikidata representations of the same or similar entities. This creates an opportunity for other groups to use new or improved information that USGS puts into our knowledgebase to improve information in the global knowledge commons.

The following query pulls a large tranche (5000+) of entities (features) that are located within U.S. National Parks according to Wikidata. We need to continue working through these entities with further exploration, but it gives us a few that we can build simple name matches on for now.

In [23]:
query_wd_features_in_parks = """
SELECT ?item ?itemLabel ?located_in ?located_inLabel
WHERE {
  ?item wdt:P3018 ?located_in .
  ?located_in wdt:P31 wd:Q34918903 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

wd_features_in_parks = geokb.sparql_query(
    query=query_wd_features_in_parks,
    endpoint="https://query.wikidata.org/sparql",
    output="dataframe"
)

wd_features_in_parks['wd_qid'] = wd_features_in_parks['item'].apply(lambda x: x.split('/')[-1])

# Build Entities
For now, I'm just getting a small number of items where we can a) make a successful match on DESIGNATION from the source data to a National Park unit in the GeoKB and b) make a successful match on NAME to a named feature within a National Park in Wikidata. We'll pull everything into the GeoKB from the USGS Geoheritage Site source data in the AGOL service once we have the schema better fleshed out.

In [24]:
linkable_entities = pd.merge(
    left=unit_linkable_sites,
    right=wd_features_in_parks[['wd_qid','itemLabel']].rename(columns={'itemLabel': 'NAME'}),
    how="inner",
    on="NAME"
)

linkable_entities

Unnamed: 0,OBJECTID,INDEX_,GHS_ID,RESPONSIBLE_PERSON,NAME,LITHOLOGY,AGE,MANAGED_BY,DESIGNATION,COUNTY,...,DataSource,Lat_DD,Long_DD,NOTES,VALUE,THUMBNAIL_URL,geometry,geoheritage_types,location_qid,wd_qid
0,12,48.0,,CD,Landscape Arch,,,National Park Service,Arches National Park,,...,Active Process Monitoring Example—Landscape Ar...,38.7905,-109.6068,"NPS Geoheritage Sites, IUGS second 100 shortlist",Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-109.60680 38.79050),[arch],Q158250,Q2256910
1,15,1.0,,JR,Exit Glacier,,,National Park Service,Kenai Fjords National Park,,...,"Dickas, 2012",60.181111,-149.638611,"on AG list, USGS EarthExplorer coordinates (60...",Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-149.63861 60.18111),[glacier],Q158474,Q1383867
2,37,44.0,,TimH,Weston Lake,,,National Park Service,Congaree National Park,,...,,33.81972,-80.819212,coordinates from Google,Scientific,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-80.81921 33.81972),[lake],Q158328,Q47474661
3,39,46.0,,TimH,Guadalupe Peak,,,National Park Service,Guadalupe Mountains National Park,,...,,31.8914,-104.8606,"IUGS second 100 shortlist, USGS EarthExplorer ...",Scientific|Cultural|Aesthetic,https://www.sciencebase.gov/catalog/file/get/6...,POINT (-104.86060 31.89140),[mountain],Q158430,Q32793
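The merge above matches on NAME alone, which works for this small set but could pair a site with a same-named feature in a different park once the full dataset is loaded. A more cautious variant matches on the feature name and the park name together, after normalizing both. This is a sketch with hypothetical helpers (`match_key`, `pair_key`); it assumes the park labels in the GeoKB and in Wikidata are close enough to agree after normalization, which will not always hold.

```python
import re
import unicodedata


def match_key(label):
    """Build a case-, accent-, and punctuation-insensitive key for a label."""
    if not isinstance(label, str):
        return None
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace every run of other characters with a single space
    text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return text or None


def pair_key(name, park):
    """Combine feature and park keys; None if either side is missing."""
    name_key, park_key = match_key(name), match_key(park)
    if name_key is None or park_key is None:
        return None
    return (name_key, park_key)


print(pair_key("Landscape Arch", "Arches National Park"))
```

In the notebook, the keys could be added as columns on both DataFrames (from `NAME` and `DESIGNATION` on one side, `itemLabel` and `located_inLabel` on the other) and passed together to the `on` argument of `pd.merge`.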


In [14]:
# Shared reference citing the USGS Geoheritage Sites database item as the data source
source_ref = geokb.models.References()
source_ref.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['data source'],
        value=geoheritage_site_db_qid
    )
)

# Create one new GeoKB item per matched site
for index, row in linkable_entities.iterrows():
    item = geokb.wbi.item.new()

    item.labels.set('en', row['NAME'])
    item.descriptions.set('en', f"a geological heritage site located in the {row['DESIGNATION']}")

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['instance of'],
            value=geoheritage_site_qid,
            references=source_ref
        )
    )

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['geographical location'],
            value=row['location_qid'],
            references=source_ref
        )
    )

    # Link to the matching Wikidata entity (URL-valued claim, added without a reference)
    item.claims.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['same as'],
            value=f"https://www.wikidata.org/wiki/{row['wd_qid']}"
        )
    )

    response = item.write(
        summary="Added initial couple of properties for Geoheritage Site"
    )
    print(row['NAME'], response.id)

Exit Glacier Q158702
Weston Lake Q158703
Guadalupe Peak Q158704
Landscape Arch Q158705
