This notebook builds a knowledge representation of the geoheritage sites that the USGS is compiling into a geodatabase. It establishes a new entity type in the GeoKB: [USGS Geoheritage Site](https://geokb.wikibase.cloud/wiki/Item_talk:Q158700).

In [12]:
from wbmaker import WikibaseConnection
import pandas as pd

geokb = WikibaseConnection('GEOKB_CLOUD')

geoheritage_site_qid = "Q158700"
geoheritage_site_db_qid = "Q158701"

# Source Data
I'm starting here with a simple table shared via email. We'll ultimately tap the ArcGIS Online (AGOL) service where the database is being built.

In [6]:
geoheritage_trial = pd.read_excel('./data/geoheritage_trial.xlsx')
geoheritage_trial.head()

Unnamed: 0,NAME,MANAGED_BY,DESIGNATION,STATE
0,Exit Glacier,National Park Service,Kenai Fjords National Park,AK
1,Little River Canyon (pin the falls),National Park Service,Little River Canyon National Preserve,AL
2,Vai'ava Strait (Pola Island),National Park Service,National Park of American Samoa,American Samoa
3,The Giant Flowstone (Blanchard Springs Caverns),Federal (USFS),Ozark-St. Francis National Forest,AR
4,The Great Unconformity at Grand Canyon,National Park Service,Grand Canyon National Park,AZ


# Park Lookup
A key aspect of a USGS Geoheritage Site is the park or other protected area the site is located within. In other work, we've brought those items into the GeoKB. Here we look up the entities we will link to.

In [3]:
query_geokb_parks = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P1 wd:Q158222 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

geokb_parks = geokb.sparql_query(
    query=query_geokb_parks,
    endpoint=geokb.sparql_endpoint,
    output="dataframe"
)
geokb_parks['location_qid'] = geokb_parks['item'].apply(lambda x: x.split('/')[-1])

In [7]:
unit_linkable_sites = pd.merge(
    left=geoheritage_trial,
    right=geokb_parks[['location_qid','itemLabel']].rename(columns={'itemLabel': 'DESIGNATION'}),
    how="inner",
    on="DESIGNATION"
)

unit_linkable_sites

Unnamed: 0,NAME,MANAGED_BY,DESIGNATION,STATE,location_qid
0,Exit Glacier,National Park Service,Kenai Fjords National Park,AK,Q158474
1,Little River Canyon (pin the falls),National Park Service,Little River Canyon National Preserve,AL,Q158497
2,Vai'ava Strait (Pola Island),National Park Service,National Park of American Samoa,American Samoa,Q158537
3,The Great Unconformity at Grand Canyon,National Park Service,Grand Canyon National Park,AZ,Q158418
4,Devils Postpile,National Park Service,Devils Postpile National Monument,CA,Q158344
5,Salt marsh (by visitors center viewing platform),National Park Service,Cumberland Island National Seashore,GA,Q158335
6,Hanging Rock,National Park Service,Effigy Mounds National Monument,IA,Q158351
7,West Beach (point?),National Park Service,Indiana Dunes National Park,IN,Q158457
8,Limestone field (p 2 of IRMA report),National Park Service,Tallgrass Prairie National Preserve,KS,Q158637
9,Cumberland Gap,National Park Service,Cumberland Gap National Historical Park,KY,Q158334
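
The inner merge above silently drops every site whose DESIGNATION has no exact label match in the GeoKB. As a minimal sketch of one way to keep those rows visible, pandas can run a left merge with `indicator=True`. The small `sites` and `parks` frames here are hypothetical stand-ins for `geoheritage_trial` and `geokb_parks`, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for geoheritage_trial and geokb_parks
sites = pd.DataFrame({
    "NAME": ["Exit Glacier", "The Giant Flowstone (Blanchard Springs Caverns)"],
    "DESIGNATION": ["Kenai Fjords National Park", "Ozark-St. Francis National Forest"],
})
parks = pd.DataFrame({
    "location_qid": ["Q158474"],
    "DESIGNATION": ["Kenai Fjords National Park"],
})

# how="left" keeps every site; indicator=True adds a _merge column
# holding "both" for matched rows and "left_only" for unmatched rows
audit = pd.merge(sites, parks, how="left", on="DESIGNATION", indicator=True)

matched = audit[audit["_merge"] == "both"].drop(columns="_merge")
unmatched = audit[audit["_merge"] == "left_only"].drop(columns=["_merge", "location_qid"])
```

With this shape, `matched` plays the role of `unit_linkable_sites`, and `unmatched` is the follow-up list of sites that still need a managed unit.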


# Managed Unit Linkages

In a knowledge representation, we try to link everything that can be linked, because links let us connect to other information and sometimes infer characteristics from the linkage itself. The cell below lists the designations that did not match a GeoKB item. I still need to track down the best available USFS and BLM sources for their managed areas. Also, while all 471 NPS units are now in the GeoKB, the names do not fully align with the USGS Geoheritage source data, so some reconciliation is needed. Ideally, the source data would carry an appropriate unique identifier for each unit, from whichever agency manages it.

In [8]:
geoheritage_trial[~geoheritage_trial['NAME'].isin(unit_linkable_sites['NAME'])]['DESIGNATION'].unique()

array(['Ozark-St. Francis National Forest',
       'Garden Park Fossil Area\xa0\xa0Cañon City',
       'Weir Farm National Historic Site', 'Rock Creek National Park',
       'First State National Historic Park',
       'Apalachiocola National Forest', 'Guam National Wildlife Refuge',
       "Hawai'i Volcanoes National Park",
       'Hells Canyon\xa0National Recreation Area (Managed by the Wallowa-Whitman National Forest',
       'Shawnee National Forest',
       'Kisatchie Bayou\xa0Recreation Complex\xa0',
       'Mark Twain National Forest', 'Pompeys Pillar National Monument',
       'Pisgah National Forest', 'Dakota Prairie Grasslands',
       'White Mountain National Forest',
       'Paterson Great Falls National Park', 'White Sands',
       'Hickison Petroglyph Recreation Area', 'Deschutes National Forest',
       'El Yunque National Forest',
       'Sachuest Point National Wildlife Refuge',
       'Natchez Trace National Historic Trail',
       'Great Falls Park National Park', 'G
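
Several of the unmatched designations above contain non-breaking spaces (`\xa0`) or trailing whitespace, which would defeat an exact string match even where the name is otherwise right. A minimal sketch of a comparison key follows, assuming we are willing to match case-insensitively. The names `normalize_name` and `match_key` are hypothetical, not part of the notebook. This will not repair misspellings or genuinely different names (for example 'White Sands'), which still need manual reconciliation or a shared identifier.

```python
import re
import unicodedata

def normalize_name(name: str) -> str:
    """Return a comparison key: NFKC-normalized, whitespace-collapsed, case-folded."""
    # NFKC maps the non-breaking space (\xa0) to an ordinary space
    text = unicodedata.normalize("NFKC", name)
    # Collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()

# One of the designations from the output above
key = normalize_name("Kisatchie Bayou\xa0Recreation Complex\xa0")
```

One way to use this would be to add a `match_key` column to both data frames with `df['DESIGNATION'].apply(normalize_name)` and merge on that column instead of the raw label.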

# Wikidata Linkage
Wherever possible, our knowledge representation schema includes either a "same as" or a "see also" relationship to the Wikidata representation of the same or a similar entity. This creates an opportunity for other groups to use new or improved information that USGS puts into our knowledgebase to improve information in the global knowledge commons.

The following query pulls a large tranche (5000+) of entities (features) that are located within U.S. National Parks according to Wikidata. We need to continue working through these entities with further exploration, but it gives us a few that we can build simple name matches on for now.

In [9]:
query_wd_features_in_parks = """
SELECT ?item ?itemLabel ?located_in ?located_inLabel
WHERE {
  ?item wdt:P3018 ?located_in .
  ?located_in wdt:P31 wd:Q34918903 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""

wd_features_in_parks = geokb.sparql_query(
    query=query_wd_features_in_parks,
    endpoint="https://query.wikidata.org/sparql",
    output="dataframe"
)

wd_features_in_parks['wd_qid'] = wd_features_in_parks['item'].apply(lambda x: x.split('/')[-1])

# Build Entities
For now, I'm just getting a small number of items where we can a) make a successful match on DESIGNATION from the source data to a National Park unit in the GeoKB and b) make a successful match on NAME to a named feature within a National Park in Wikidata. We'll pull everything into the GeoKB from the USGS Geoheritage Site source data in the AGOL service once we have the schema better fleshed out.

In [10]:
linkable_entities = pd.merge(
    left=unit_linkable_sites,
    right=wd_features_in_parks[['wd_qid','itemLabel']].rename(columns={'itemLabel': 'NAME'}),
    how="inner",
    on="NAME"
)

linkable_entities

Unnamed: 0,NAME,MANAGED_BY,DESIGNATION,STATE,location_qid,wd_qid
0,Exit Glacier,National Park Service,Kenai Fjords National Park,AK,Q158474,Q1383867
1,Weston Lake,National Park Service,Congaree National Park,SC,Q158328,Q47474661
2,Guadalupe Peak,National Park Service,Guadalupe Mountains National Park,TX,Q158430,Q32793
3,Landscape Arch,National Park Service,Arches National Park,UT,Q158250,Q2256910
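
A name-only match can pair a site with a same-named feature in a different park. As a hedged sketch of a stricter variant, the merge can also require the park label to agree, using the `located_inLabel` column that the Wikidata query already returns. This assumes the Wikidata park label is spelled exactly like the GeoKB DESIGNATION, which may not always hold. The frames below are hypothetical stand-ins, and `Q_PLACEHOLDER` is a made-up value for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for unit_linkable_sites and wd_features_in_parks
sites = pd.DataFrame({
    "NAME": ["Landscape Arch", "Hanging Rock"],
    "DESIGNATION": ["Arches National Park", "Effigy Mounds National Monument"],
})
wd_features = pd.DataFrame({
    "wd_qid": ["Q2256910", "Q_PLACEHOLDER"],  # second value is invented
    "itemLabel": ["Landscape Arch", "Hanging Rock"],
    "located_inLabel": ["Arches National Park", "Some Other National Park"],
})

# Align column names so both the feature name and the park label are join keys
candidates = wd_features.rename(
    columns={"itemLabel": "NAME", "located_inLabel": "DESIGNATION"}
)

# Only rows that agree on both name and park survive the inner merge
strict = pd.merge(sites, candidates, how="inner", on=["NAME", "DESIGNATION"])
```

Here the "Hanging Rock" row is dropped because its Wikidata park label differs, while a name-only merge would have kept it.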


In [14]:
# Shared reference pointing sourced claims back to the geoheritage database item
source_ref = geokb.models.References()
source_ref.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['data source'],
        value=geoheritage_site_db_qid
    )
)

for index, row in linkable_entities.iterrows():
    item = geokb.wbi.item.new()

    item.labels.set('en', row['NAME'])
    item.descriptions.set('en', f"a geological heritage site located in the {row['DESIGNATION']}")

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['instance of'],
            value=geoheritage_site_qid,
            references=source_ref
        )
    )

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['geographical location'],
            value=row['location_qid'],
            references=source_ref
        )
    )

    item.claims.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['same as'],
            value=f"https://www.wikidata.org/wiki/{row['wd_qid']}"
        )
    )

    response = item.write(
        summary="Added initial couple of properties for Geoheritage Site"
    )
    print(row['NAME'], response.id)

Exit Glacier Q158702
Weston Lake Q158703
Guadalupe Peak Q158704
Landscape Arch Q158705
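
Because the loop starts from a new item on every pass, running the cell again would likely create a second item for each site. A minimal sketch of a guard is to filter out names that already exist before writing. The `existing` frame is a hypothetical stand-in for the result of a label query on current USGS Geoheritage Site items (for example, a SPARQL query like the park lookup above, but on `wd:Q158700`), and `linkable` stands in for `linkable_entities`.

```python
import pandas as pd

# Hypothetical stand-in for linkable_entities
linkable = pd.DataFrame({"NAME": ["Exit Glacier", "Weston Lake", "Guadalupe Peak"]})

# Hypothetical stand-in for labels of Geoheritage Site items already in the GeoKB
existing = pd.DataFrame({"itemLabel": ["Exit Glacier", "Weston Lake"]})

# Keep only the rows whose NAME is not already the label of an existing item
to_create = linkable[~linkable["NAME"].isin(existing["itemLabel"])]
```

Iterating over `to_create` instead of the full frame would make the write step safe to repeat, assuming labels are unique enough to serve as the check.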
