# Introduction

This notebook examines the possible data sources for trails from the U.S. Bureau of Land Management. It focuses on trails in the Gunnison Gorge National Conservation Area as a test case that should have some transferability to other BLM-managed areas. The intent for this work is to develop a code-based method for combining, organizing, and curating information about public trails into "trail" entities in Wikidata; essentially a semi-automated curation mechanism.

I focus here on BLM sources as opposed to other sources of information on trails for two main reasons:

1. BLM data and information are generally public domain with no license restrictions, making them appropriate reference material for Wikidata and other data/information platforms from the Wikimedia Foundation.

2. BLM is the authoritative source for data/information about the lands and resources under their management jurisdiction as opposed to other aggregations or derivations of these data.

I focus here on two main sources - advertised public geospatial data and what may be a transitory information source that nevertheless contains the most complete and organized information about trails. I describe the challenges that exist with using these sources and ways they might be addressed.

# Environment

This notebook is developed and run in the Deepnote environment, synced with a GitHub repository. I'm mostly using standard Python packages already available in the base Deepnote environment with a few exceptions. I did need some foundational components not available in the Deepnote core and used a local Dockerfile to install those along with a requirements.txt for some additional package needs.

In [1]:
import pandas as pd
import geopandas as gpd
import fiona
import requests
import json
from tabula import read_pdf
from tabulate import tabulate

# GIS Data Sources

Like any land management agency, BLM has a robust GIS program to support its operations. Trails are considered part of "transportation," and since 2015 the Ground Transportation Linear Features data standard has been in effect. This is important to know when tracking down BLM trails data, because "GTLF" ends up being the unique keyword that helps to find these data wherever they happen to be. The web page for the BLM policy links to three PDF documents that provide further detail on the standard and how it is to be implemented; these will be important when we start trying to understand these data.

BLM's GIS data page links to 4 different "data portals." Three of these are .gov resources and the first listed is an Esri ArcGIS Hub instance (a technology that Esri has been selling to gov agencies that have contracts with that company). The three .gov portals do all have metadata records for what BLM titles the "BLM Natl GTLF Public Managed Trails" dataset, which is what I'll concentrate on for now. Their transportation data includes several other "packages," all nominally following the GTLF standard, that may have records for other types of trails (e.g., roads that may be used to connect between trails or trail segments). The public managed trails dataset shows up in the three .gov portals via the following links currently:

- BLM Natl GTLF Public Managed Trails at data.doi.gov

- BLM Natl GTLF Public Managed Trails at data.gov

- BLM Natl GTLF Public Managed Trails at geoplatform.gov

The data.doi.gov and data.gov records seem to come from what may be the same source metadata, but they list different update dates when I looked at them. All links to access data in both of these records point to sources from the arcgis.com location, which appears to be BLM's preferred point of access. The geoplatform.gov record points to a single point of data access, which appears to be one of BLM's own Esri ArcGIS Servers (an ArcGIS MapServer interface). The ArcGIS Hub record for this dataset provides a number of data access options, including ArcGIS REST services as well as downloads.

Interestingly enough, the first Google search result to come up on a query for "blm gtlf trails data" is from databasin.org, a nonprofit conservation organization that pulls together a bunch of government data into their own systems. They reference a download of a package from a different location on a BLM.gov site. The package contains an Esri file geodatabase that looks like it has all the different layers packaged up, including the other types of roads and trails that are shown in the ArcGIS Hub and other catalog sources as separate datasets. When looking just at the layer that package calls "gtlf_public_managed_trails_ln," we see that it has 114 fewer records than other forms of the data. Apart from metadata dates that may or may not accurately reflect currency of the data, it appears that DataBasin may be working with an older version of the data.

Given what appears to be BLM's push toward using their ArcGIS Hub instance as the primary access point for these and other BLM data, that's likely where I need to pull data to work from. Without doing a record by record comparison, I can't say whether there is some difference in what BLM serves from their own infrastructure, but a quick look at the query services shows that they are at least not operating the same version and configuration of ArcGIS Server.

- arcgis.com instance

- blm.gov instance 

We can start with a quick pull of a handful of records and see what we're dealing with.

In [2]:
# Query parameters are a bit different between the different versions of the services.
# Parameters that were empty or left at their defaults in the original query string are omitted.
r_trails_arcgis_hub = requests.get(
    "https://services1.arcgis.com/KbxwQRRfWyEYLgp4/ArcGIS/rest/services/BLM_Natl_GTLF_Public_Managed_Trails/FeatureServer/2/query",
    params={"where": "1=1", "outFields": "*", "returnGeometry": "false", "resultRecordCount": 10, "f": "pjson"},
)

trails_arcgis_hub = r_trails_arcgis_hub.json()
df_trails_arcgis_hub = pd.DataFrame([i["attributes"] for i in trails_arcgis_hub["features"]])

df_trails_arcgis_hub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 23 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   OBJECTID                        10 non-null     int64  
 1   DSTRBTE_EXTRNL_CODE             10 non-null     object 
 2   PLAN_ROUTE_DSGNTN_AUTH          10 non-null     object 
 3   PLAN_ASSET_CLASS                10 non-null     object 
 4   PLAN_OHV_ROUTE_DSGNTN           9 non-null      object 
 5   PLAN_MODE_TRNSPRT               10 non-null     object 
 6   Original_GlobalID               10 non-null     object 
 7   ADMIN_ST                        10 non-null     object 
 8   PLAN_ADD_MODE_TRNSPRT_RSTRT_CD  0 non-null      object 
 9   OBSRVE_MODE_TRNSPRT             10 non-null     object 
 10  OBSRVE_SRFCE_TYPE               10 non-null     object 
 11  OBSRVE_FUNC_CLASS               10 non-null     object 
 12  OBSRVE_ROUTE_USE_CLASS          10 non-

I can see some attributes here, like PLAN_ASSET_CLASS, that show up in the standards documentation for GTLF. I can find them in the metadata, but for full details such as the acceptable values for a property and what they mean, I have to go dig through one of the PDF documents associated with the data standard. There's a place for this level of detail in both the FGDC and ISO 19115 metadata standards, and it's a bit annoying that it's not included here.

One significant point of discomfort here is that there is no date/currency information included in this particular derivative form of the data. The GTLF standard seems to call for useful fields including CREATE_DATE and MODIFY_DATE along with CREATE_BY and MODIFY_BY. After digging around the other available forms of the data, I can't find these attributes anywhere in the national datasets. So, there was a design decision somewhere to not include those fields in the publicly available form of the data. In the current system, there doesn't seem to be any way for anyone using these data to know how current any given record of interest is or what might have changed. Without any type of versioning mechanism in place, the best anyone could do would be to pull full datasets periodically into a cache and diff them to figure out what's changing over time.
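
The cache-and-diff idea can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not something I've run against the full datasets: it assumes each snapshot is a list of attribute dictionaries (like the `attributes` objects the query service returns), and it assumes OBJECTID may be renumbered between snapshots, so that field is left out of the comparison. The function names `fingerprint` and `diff_snapshots` are made up for illustration.

```python
import hashlib
import json


def fingerprint(record, ignore=("OBJECTID",)):
    """Return a stable hash of one attribute record, skipping ignored fields."""
    kept = {k: v for k, v in record.items() if k not in ignore}
    # sort_keys makes the serialization independent of key order
    payload = json.dumps(kept, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_snapshots(old_records, new_records, ignore=("OBJECTID",)):
    """Compare two snapshots by content hash.

    A record whose attributes changed shows up in both lists:
    its old form in only_in_old and its new form in only_in_new.
    """
    old = {fingerprint(r, ignore): r for r in old_records}
    new = {fingerprint(r, ignore): r for r in new_records}
    return {
        "only_in_old": [old[h] for h in old.keys() - new.keys()],
        "only_in_new": [new[h] for h in new.keys() - old.keys()],
    }
```

Because nothing in the records identifies a trail across snapshots, this can only say that some content appeared or disappeared; it cannot say which old record a new one replaced.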

In somewhat typical fashion, though, the BLM produces these data on a state by state basis. Each state where BLM has a State Office does appear to have its own section of the ArcGIS Hub. Oregon/Washington seem to have the most robust GIS program with their own data management area complete with what appear to be more State-level data standards and models. Colorado, which is what I'm looking to in this case, has several specific trails-related data assets via their ArcGIS Hub catalog. The state-level dataset with GTLF in the description does have many more properties than the National aggregation and includes fields for created and modified dates.

I don't know this for sure yet, but the likely reason the National dataset is slimmed down may be a lack of consistency across the State-level data sources and an effort to harmonize at the lowest common denominator. The "BLM CO Roads and Trails" dataset is also sourced from a BLM data server (gis.blm.gov/coarcgis), with the ArcGIS Hub record simply a metadata record in that catalog pointing to a BLM source. A quick look at a couple of other areas shows that Utah doesn't have any trails data source listed in their transportation category, though they do have a dataset for UT's part of the National Scenic and Historic Trails. Other states like Idaho also don't have any trails data that can be found readily in the ArcGIS Hub. My guess is that what the BLM ArcGIS Hub presents is something of an ideal that BLM is striving to achieve with some level of consistency across the States, but it looks like they are a long way off from achieving this.

Identifiers are also questionable in all of the datasets. OBJECTID is listed in the standards documentation, but this appears to be a simple sequential numeric identifier for geospatial features in the dataset that shouldn't be used for anything other than internal consistency. GlobalID is also discussed in the standard but is documented as a "software generated value" that shouldn't be used as a meaningful unique ID. The presence of a property labeled Original_GlobalID makes me wonder a bit at the internal data management workings over time and why these properties are even included if they are not truly usable for anything. The only other identifier property that seems to be in this form of the data is FAMS_ID, which is documented as a link to another BLM system, the Facility Asset Management System, and is not populated for every record.

We'll look at the full dataset in a moment, but I suspect that there is no persistent, resolvable identifier in the public trails dataset that we could count on to detect and work through changes over time. Rather, this is a typical GIS dataset where the majority of the focus is on the geometry part of the model, making sure that the data needed to put lines in the right place on a map and run various geospatial operations is right. Other things like the properties we would be interested in using to characterize aspects of trails may not be as robust, complete, or consistently used.
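
One way to put numbers behind that suspicion is to check how well each candidate identifier column is populated and whether its values are unique. The following is a small sketch, assuming the attributes are already in a pandas DataFrame; the helper name `identifier_report` is hypothetical, and a column passing both checks would still say nothing about whether the values persist between releases.

```python
import pandas as pd


def identifier_report(df, candidate_columns):
    """Summarize completeness and uniqueness for candidate identifier columns."""
    rows = []
    for col in candidate_columns:
        values = df[col].dropna()
        rows.append({
            "column": col,
            # share of records that carry any value at all
            "populated_fraction": len(values) / len(df) if len(df) else 0.0,
            # True only when no populated value repeats
            "unique_among_populated": bool(values.is_unique),
        })
    return pd.DataFrame(rows).set_index("column")
```

For example, `identifier_report(df_trails_arcgis_hub, ["OBJECTID", "Original_GlobalID"])` would run it over the sample pulled above.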

Even though it's not necessarily ideal, the first place to go for BLM's GIS data on trails is probably the State level where data are available. In this case, I pulled down the simple CSV version of the properties for CO Public Roads and Trails so we can take a look at some of the dynamics for what looks like a richer information source than the National aggregation. I then ran through a couple of quick tests digging into what I'll need to work through in trails information for the GGNCA.

In [3]:
df_blm_co_trails_csv = pd.read_csv('/datasets/blmcogtlf/BLM_CO_GTLF.csv')
df_blm_co_trails_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61476 entries, 0 to 61475
Data columns (total 39 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   OBJECTID                        61476 non-null  int64  
 1   PLAN_ALLOW_MODE_TRNSPRT         58320 non-null  object 
 2   OHV_ROUTE_DSGNTN_LIM            15542 non-null  object 
 3   OHV_DSGNTN_LIM_EXPLAIN          15262 non-null  object 
 4   ROUTE_PRMRY_NM                  22526 non-null  object 
 5   ROUTE_PRMRY_NUM                 22055 non-null  object 
 6   ROUTE_SPCL_DSGNTN_TYPE          188 non-null    object 
 7   PLAN_ROUTE_DSGNTN_AUTH          61476 non-null  object 
 8   PLAN_ASSET_CLASS                59246 non-null  object 
 9   PLAN_MODE_TRNSPRT               59338 non-null  object 
 10  PLAN_ACCESS_RSTRCT              60403 non-null  object 
 11  PLAN_SEASON_RSTRCT_CODE         56380 non-null  object 
 12  PLAN_ADD_MODE_TRNSPRT_RSTRT_CD  

In [8]:
# Show number of records with null name and number
len(df_blm_co_trails_csv[df_blm_co_trails_csv.ROUTE_PRMRY_NM.isnull() & df_blm_co_trails_csv.ROUTE_PRMRY_NUM.isnull()])

35599

In [19]:
df_blm_co_trails_csv["name"] = df_blm_co_trails_csv.ROUTE_PRMRY_NM.apply(lambda x: x.strip() if isinstance(x, str) and len(x.strip()) > 0 else None)

named_co_trails = df_blm_co_trails_csv[
    df_blm_co_trails_csv["name"].notnull()
]
ids = named_co_trails["name"]
list(named_co_trails[ids.isin(ids[ids.duplicated()])].sort_values("name").name.unique())

['001',
 '002',
 '003',
 '003A',
 '004',
 '005',
 '006',
 '007',
 '007A',
 '009',
 '009A',
 '010A',
 '011',
 '011A',
 '012',
 '012B',
 '013',
 '013A',
 '013B',
 '013C',
 '013E',
 '013H',
 '013K',
 '013O',
 '013S',
 '014',
 '015',
 '015A',
 '015B',
 '016',
 '017',
 '017B',
 '018',
 '020',
 '021',
 '022',
 '022A',
 '023',
 '023A',
 '023B',
 '023C',
 '023D',
 '023H',
 '023J',
 '023P',
 '023Q',
 '023R',
 '023T',
 '023U',
 '023X',
 '024',
 '025',
 '025A',
 '026',
 '026A',
 '027',
 '028',
 '030',
 '031',
 '031A',
 '032',
 '034',
 '035',
 '036',
 '036A',
 '037',
 '037A',
 '038',
 '039',
 '040',
 '042',
 '042A',
 '044',
 '045',
 '046',
 '047',
 '048',
 '078',
 '079A',
 '079B',
 '080',
 '080A',
 '080B',
 '080C',
 '080E',
 '081',
 '081A',
 '082',
 '082A',
 '082B',
 '083',
 '083B',
 '083C',
 '084',
 '085',
 '086',
 '087',
 '087A',
 '088',
 '089',
 '090',
 '090A',
 '090B',
 '092',
 '092A',
 '092B',
 '092C',
 '100',
 '101',
 '101A',
 '102',
 '103',
 '103B',
 '103D',
 '104',
 '104B',
 '104C',
 '105'

In [20]:
# Misnamed trail in the State-level originating dataset
named_co_trails[named_co_trails.name == "Chuckar Trail"]

Unnamed: 0,OBJECTID,PLAN_ALLOW_MODE_TRNSPRT,OHV_ROUTE_DSGNTN_LIM,OHV_DSGNTN_LIM_EXPLAIN,ROUTE_PRMRY_NM,ROUTE_PRMRY_NUM,ROUTE_SPCL_DSGNTN_TYPE,PLAN_ROUTE_DSGNTN_AUTH,PLAN_ASSET_CLASS,PLAN_MODE_TRNSPRT,...,COORD_SRC2,DEF_FET_TYPE,DEF_FET2,ACCURACY_FT,MODIFY_DATE,CREATE_DATE,GlobalID,SERIAL_NUM,Shape__Length,name
22862,147897,EQU_HIK_ONLY,,,Chuckar Trail,3414T,,BLM,Transportation System - Trail,Non-Mechanized,...,UNK,UNK,,-1,2021/01/25 21:41:54+00,2015/11/25 19:57:31+00,{1B1BE269-FFA8-4EE3-9A9C-E2549B8C132E},,1628.431593,Chuckar Trail


In [24]:
# Sidewinder trail shows up in multiple records with different properties indicating a non-motorized section
named_co_trails[named_co_trails.name == "Sidewinder"]

Unnamed: 0,OBJECTID,PLAN_ALLOW_MODE_TRNSPRT,OHV_ROUTE_DSGNTN_LIM,OHV_DSGNTN_LIM_EXPLAIN,ROUTE_PRMRY_NM,ROUTE_PRMRY_NUM,ROUTE_SPCL_DSGNTN_TYPE,PLAN_ROUTE_DSGNTN_AUTH,PLAN_ASSET_CLASS,PLAN_MODE_TRNSPRT,...,COORD_SRC2,DEF_FET_TYPE,DEF_FET2,ACCURACY_FT,MODIFY_DATE,CREATE_DATE,GlobalID,SERIAL_NUM,Shape__Length,name
20591,142835,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:41+00,2015/11/25 19:57:31+00,{3AF10556-DE0A-49FC-90E6-318ED3CFF329},,11065.842584,Sidewinder
21378,144515,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:43+00,2015/11/25 19:57:31+00,{A02C44E2-496C-402E-8ECB-5B01148ED456},,2199.464452,Sidewinder
22321,146682,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:45+00,2015/11/25 19:57:31+00,{63FE67ED-680E-4B2A-A0D5-0BC23D69F81F},,945.873236,Sidewinder
22322,146683,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:45+00,2015/11/25 19:57:31+00,{F64F6C59-4FBB-48F1-BC40-7C3116531512},,0.369854,Sidewinder
22705,147540,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:45+00,2015/11/25 19:57:31+00,{B25A6230-0E19-4B87-B985-17DC39A61391},,649.415011,Sidewinder
25070,152856,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:50+00,2015/11/25 19:57:31+00,{BEA3D048-B5D6-499F-9289-59EBE09ACBDE},,5311.476683,Sidewinder
25929,154677,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:51+00,2015/11/25 19:57:31+00,{309C1DC0-3D98-45A2-B68E-EDC148A0D74F},,4149.929149,Sidewinder
28877,161384,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:59+00,2015/11/25 19:57:31+00,{F4F74CD6-1AE0-4FF3-8B10-B0190D1EB9BD},,3525.516289,Sidewinder
29124,161906,MTC_SHARED,,,Sidewinder,,,BLM,Transportation System - Trail,Motorized,...,UNK,UNK,,-1,2021/01/25 21:42:59+00,2015/11/25 19:57:31+00,{8765F1ED-B582-433F-BCF8-E4CDBAA9F8CF},,2383.192327,Sidewinder
41884,233275,NON_MOTO_SHARED,,,Sidewinder,42206T,,BLM,Transportation System - Trail,Non-Motorized,...,,UNK,,-1,2021/12/16 20:33:20+00,2021/06/17 20:00:48+00,{869AAB6D-01C8-46CF-B942-3113430DF09D},,403.484477,Sidewinder
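
Since a single named trail like Sidewinder is spread across many segment records, a per-name rollup is one way to see those records as a candidate "trail" entity. This is only a sketch: it assumes the `name` column built above plus the OBJECTID, Shape__Length, and PLAN_MODE_TRNSPRT columns from the State-level CSV, and the function name `summarize_named_routes` is made up. The units of Shape__Length aren't stated in what I've looked at, so the total should be read as a relative measure only.

```python
import pandas as pd


def summarize_named_routes(df):
    """Roll segment records up to one row per route name."""
    grouped = df.dropna(subset=["name"]).groupby("name")
    return grouped.agg(
        segments=("OBJECTID", "count"),
        total_shape_length=("Shape__Length", "sum"),
        # distinct planned transport modes across the segments, as one string
        transport_modes=(
            "PLAN_MODE_TRNSPRT",
            lambda s: ", ".join(sorted(s.dropna().unique())),
        ),
    )
```

A name with more than one value in `transport_modes` flags a route where the segments disagree, which is the situation the Sidewinder records show.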


# Public Trails Info on the Web from BLM

Before I get into working with the full GIS dataset, in terms of both geospatial features and feature attributes, it is useful to take a look at a completely different information source - essentially how BLM presents information about the trails in their various managed land units for public consumption. I'm focused on the Gunnison Gorge National Conservation Area right now as part of some volunteer work, so that's the narrowed down use case I'm working through here.

The most condensed and yet robust and complete representation of trails and other features of interest on the Gunnison Gorge NCA is found in the PDF brochure, which is online and is also printed and available at trailheads and other locations. Some trails seem to have a landing page kind of thing going on like the Chukar Trail, one trail in the Gunnison Gorge NCA that I've explored physically (and digitally due to a name misspelling that I'll touch on again). But this does not seem to be consistent. The map that comes up on the Chukar Trail page allows for navigation and shows the GIS line work for the other trails on the GGNCA, but there is no information about them and they do not seem to have their own URL.

So, we're left with great information worked up and laid out in a PDF brochure that will be challenging to work with in software and a GIS dataset that may contain different information on the trails in a more machine-readable but perhaps less comprehensive or useful state. The following code blocks make an attempt at reading the PDF brochure for the GGNCA. This requires the tabula-py wrapper on a Java process, and so a JVM needed to be installed in the Deepnote environment. After looking at both this option and another Python approach with Camelot, I'm concluding that this particular PDF is probably not going to be conducive to me parsing it out for trails information from code. Bummer!

In [4]:
ggnca_brochure_tables = read_pdf("/datasets/ggncabrochure/GGNCAbro_webfriendly.pdf", pages="2")

Got stderr: Sep 12, 2022 1:34:00 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
Sep 12, 2022 1:34:00 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
Sep 12, 2022 1:34:00 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
Sep 12, 2022 1:34:00 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>

In [5]:
ggnca_brochure_tables[0]

Unnamed: 0.1,Boating – Launch Sites,Other Trails,Unnamed: 0
0,Chukar Trail and Boat Launch are accessible vi...,,
1,,TRAILS AND ALLOWED DESCRIPTION,MILES
2,the Chukar Trailhead. The trail can be accesse...,STAGING USES,
3,"via a primitive, rough road, often requiring f...",AREAS,
4,wheel drive that ends at the wilderness bounda...,Flat Top-Peach Staging Area Off-route cross-co...,100+ miles of trails
5,From there it’s a mile-long hike down Chukar a...,Valley OHV designated “open play” areas. Peach...,"covering 9,800 acres."
6,"Trail to the river. All gear, including boats,...",Recreation Area OQI ing area offers a beginner...,
7,be carried down the trail to the river. No carts,,
8,or wheeled devices are allowed in the wilderne...,Red Rocks- Nighthorse Trail JKO jeep road sect...,"13 miles, one-way."
9,A commercial horse packing service is availabl...,wilderness rim from the national park boundary,


# Problems with BLM as Primary Source

From BLM directly, we have GIS data that is tuned for use in GIS applications like building maps and likely internal planning activities, a handful of web links for major trails with some type of significant public interest (e.g., access to historical sites), and highly variable curated information sources like brochures that would need to be read and turned into data by humans. We have no really good way to use BLM data/information sources as the primary references for injecting new trail entities into Wikidata, at least not programmatically and at scale. If I end up taking an approach where I essentially go through and carefully curate the knowledge about trails one trail at a time, even if I do that in the form of a database that gets synced into Wikidata, this might be fine. If I wanted to do something completely with code, I don't see a way to get there from these sources alone or even as the initial starting point. At the very least I would need to come in with an a priori set of trail names, find those in the BLM GIS data, and then work with what I find.
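
Starting from an a priori list of trail names and looking for them in the GIS data could look something like the sketch below. It leans on `difflib` from the Python standard library for approximate matching, since the State-level data spells one trail "Chuckar Trail" where the brochure says "Chukar Trail". The function name `match_trail_names` and the 0.85 cutoff are my own assumptions, not anything tuned against the real data.

```python
import difflib


def match_trail_names(known_names, dataset_names, cutoff=0.85):
    """Map each known trail name to close matches among dataset names."""
    # Compare case-insensitively but report the dataset's original spelling
    lookup = {}
    for name in dataset_names:
        lookup.setdefault(name.strip().lower(), name)

    matches = {}
    for known in known_names:
        close = difflib.get_close_matches(
            known.strip().lower(), list(lookup), n=3, cutoff=cutoff
        )
        matches[known] = [lookup[c] for c in close]
    return matches
```

Anything that comes back with an empty list, or with more than one candidate, would still need a human look.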

# OpenStreetMap

Another option on this, or perhaps an additional data starting point, is to work with what's in OpenStreetMap currently. This is what commercial groups like Natural Atlas and AllTrails have done to produce their own proprietary datasets. Let's look at a couple examples of trails we know about based simply on looking them up by unique ID via the OSM Overpass API.

In [25]:
from OSMPythonTools.api import Api
from OSMPythonTools.overpass import overpassQueryBuilder, Overpass
from OSMPythonTools.nominatim import Nominatim

osm_api = Api()
overpass = Overpass()
nominatim = Nominatim()

In [26]:
cool_rock_canyon = osm_api.query('way/46023512')
cool_rock_canyon_trailhead = osm_api.query('node/3726754308')
sidewinder_trail = osm_api.query('way/239024531')

[api] downloading data: way/46023512
[api] downloading data: node/3726754308
[api] downloading data: way/239024531


In [27]:
display(sidewinder_trail.tags())
display(cool_rock_canyon.tags())
display(cool_rock_canyon_trailhead.tags())

{'bicycle': 'yes',
 'highway': 'path',
 'motorcycle': 'yes',
 'mtb:scale:imba': '3',
 'name': 'Sidewinder Trail',
 'surface': 'ground'}

{'intermittent': 'yes',
 'name': 'Cool Rock Canyon',
 'nhd:com_id': '75547519;75563569',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'amenity': 'parking', 'name': 'Cool Rock Canyon Trailhead', 'operator': 'BLM'}

There's something of a significant issue here that is also pointing toward the need for public data curation. The Sidewinder Trail is a specifically constructed trail geared toward off-road motorcycles and mountain bikes with secondary uses for hiking and horseback riding (that's from my own local knowledge of that trail plus a bit of what BLM says in their brochure). Cool Rock Canyon is another designated trail that even has its own parking area that OSM knows about as a node (also shown above).

However, Cool Rock Canyon is not actually known to OSM as a "trail" with any tags that would let us discover it as such. Looking at some history behind the way, it came from someone spending some time with the National Hydrography Dataset, turning that into OSM data. So, OSM knows Cool Rock Canyon as an intermittent stream, which is true but that's not all it is.

So, we could nominally hit the OSM API with a query for all "ways" within the GGNCA boundary and certain attributes that describe trails and still not come up with all of the trail records we want to work with. And what do we do with things that are classified as waterways but are also "highways"? The OSM data model and the Wikidata model share some commonalities, and a fair bit of work has been done to establish cross-referencing between the systems. However, this seems like an area where the Wikidata model has an advantage in being able to classify (instance of) an entity like "Cool Rock Canyon" as both an intermittent stream and a "footpath."

In the following code blocks, I run through a couple of different queries toward an operational notion of querying OSM for data that would have a relationship to the trail concept.

In [None]:
# Find the GGNCA as an "area" using OSM's Nominatim name search so that we can pass this as our bounding area
ggnca = nominatim.query('Gunnison Gorge National Conservation Area')
print(ggnca.areaId())

3605900825


In [None]:
# Build and run queries for "paths" and "streams", which will give us what might be some of our useful trail records
path_query = overpassQueryBuilder(
    area=ggnca.areaId(), 
    elementType='way', 
    selector='"highway"="path"', 
    out='body'
)

stream_query = overpassQueryBuilder(
    area=ggnca.areaId(), 
    elementType='way', 
    selector='"waterway"="stream"', 
    out='body'
)

ggnca_paths = overpass.query(path_query)
print(ggnca_paths.countElements())

ggnca_streams = overpass.query(stream_query)
print(ggnca_streams.countElements())


10
163


In [None]:
# Display the named paths and streams we found
for i in ggnca_paths.elements():
    if 'name' in i.tags():
        display(i.tags())

for i in ggnca_streams.elements():
    if 'name' in i.tags():
        display(i.tags())

{'bicycle': 'yes',
 'highway': 'path',
 'motorcycle': 'yes',
 'mtb:scale:imba': '3',
 'name': 'Sidewinder Trail',
 'surface': 'ground'}

{'bicycle': 'no',
 'foot': 'yes',
 'highway': 'path',
 'horse': 'yes',
 'name': 'West River Trail',
 'source': 'BLM, Bing'}

{'foot': 'yes',
 'highway': 'path',
 'horse': 'yes',
 'name': 'Chukar Trail',
 'source': 'Bing',
 'surface': 'dirt',
 'wikidata': 'Q112667890'}

{'bicycle': 'yes',
 'foot': 'yes',
 'highway': 'path',
 'horse': 'yes',
 'motor_vehicle': 'yes',
 'name': 'Eagle Valley',
 'surface': 'dirt'}

{'intermittent': 'yes',
 'name': 'Red Canyon',
 'nhd:com_id': '64442888;64458607;64458729;64458714;64458628;64442783;64458591;64458691;64482052;64442835;64442823;64442818;64458765;64458754;64458611;64458744;64458623;64458640;64442907;64458664',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Long Gulch',
 'nhd:com_id': '64442775;64442884;64443012',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Lime Kiln Gulch',
 'nhd:com_id': '64442791',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Birthday Canyon',
 'nhd:com_id': '75563571;75547411',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Cool Rock Canyon',
 'nhd:com_id': '75547519;75563569',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Sun Cliff Canyon',
 'nhd:com_id': '75568885;75568803;75567627',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Rabbit Gulch',
 'nhd:com_id': '64464048',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Loutsenhizer Arroyo',
 'nhd:com_id': '<many>',
 'nhd:gnis_id': '00186688',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'name': 'Smith Fork',
 'nhd:com_id': '<many>',
 'nhd:gnis_id': '00186702',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'layer': '-1',
 'name': 'Cool Rock Canyon',
 'nhd:com_id': '75547519;75563569',
 'source': 'NHD, Bing',
 'tunnel': 'culvert',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Cool Rock Canyon',
 'nhd:com_id': '75547519;75563569',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'name': 'Sun Cliff Canyon',
 'nhd:com_id': '75568885;75568803;75567627',
 'source': 'NHD, Bing',
 'waterway': 'stream'}

{'intermittent': 'yes',
 'layer': '-1',
 'name': 'Sun Cliff Canyon',
 'nhd:com_id': '75568885;75568803;75567627',
 'source': 'NHD, Bing',
 'tunnel': 'culvert',
 'waterway': 'stream'}

Well, bottom line here is that OSM isn't really all the way there yet either in terms of useful content that might help us get at fully fledged public data records on trails in the GGNCA (and presumably any other public land we might try to work through). But there is some useful data here, particularly from the standpoint of an alternate and potentially better source of geospatial data from what the BLM has. At least with the name and geospatial data for a given trail from OSM and the BLM source on their ArcGIS Hub, we could potentially connect the dots. Improving OSM with additional useful tags at the same time we improve or add new records to Wikidata could be part of a code-driven process operated from an intermediary curation point we set up.
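
Connecting the dots by name runs straight into small naming differences: OSM has "Sidewinder Trail" where the BLM State-level data has "Sidewinder". A rough sketch of pairing names on a normalized key follows. The normalization rule (lowercase, collapse whitespace, drop a trailing "trail") and both function names are assumptions for illustration, and real data will certainly need more rules than this.

```python
import re


def normalize_trail_name(name):
    """Lowercase, collapse whitespace, and drop a trailing 'trail'."""
    cleaned = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"\s+trail$", "", cleaned)


def pair_by_normalized_name(blm_names, osm_names):
    """Map each BLM name to the OSM names sharing its normalized form."""
    osm_by_key = {}
    for name in osm_names:
        osm_by_key.setdefault(normalize_trail_name(name), []).append(name)

    pairs = {}
    for name in blm_names:
        key = normalize_trail_name(name)
        if key in osm_by_key:
            pairs[name] = osm_by_key[key]
    return pairs
```

An exact comparison like this will not pair a misspelled name such as "Chuckar Trail" with "Chukar Trail", so it is a first pass at best.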

The following code blocks are probably a clunky way to do this, but I run a query for all ways and nodes and then put all the tags for those that have a name into a dataframe. This might be one way to start a data assembly exercise - pulling in all the possible records from OSM that we might be able to connect to other public sources of information. We can then start connecting dots and improving data we want to encode into the knowledge commons, combining both claims for Wikidata and tags for OSM.

In [None]:
all_ways_and_nodes_query = overpassQueryBuilder(
    area=ggnca.areaId(), 
    elementType=['way','node'], 
    out='body'
)

all_ways_and_nodes_result = overpass.query(all_ways_and_nodes_query)

all_ways_and_nodes = []
for i in all_ways_and_nodes_result.elements():
    if i.tags() is not None and 'name' in i.tags():
        item_data = dict(i.tags())
        item_data["osm_item_type"] = i.type()
        item_data["osm_id"] = i.id()
        item_data["osm_path"] = "/".join([i.type(), str(i.id())])
        all_ways_and_nodes.append(item_data)

df_all_ways_and_nodes = pd.DataFrame(all_ways_and_nodes)

In [None]:
df_all_ways_and_nodes

Unnamed: 0,ele,gnis:county_id,gnis:created,gnis:feature_id,gnis:state_id,name,place,osm_item_type,osm_id,osm_path,...,water,bicycle,motorcycle,mtb:scale:imba,parking,name_1,foot,horse,wikidata,tunnel
0,1718,029,10/13/1978,186487,08,Smiths Mountain,locality,node,358922443,node/358922443,...,,,,,,,,,,
1,1660,029,10/13/1978,186690,08,Middle Peach Valley Dam,,node,358922563,node/358922563,...,,,,,,,,,,
2,1888,085,10/13/1978,186833,08,State Tunnel Dam,,node,358922668,node/358922668,...,,,,,,,,,,
3,1586,,10/13/1978,186507,,Sulphur Mine,,node,369164059,node/369164059,...,,,,,,,,,,
4,1889,,10/13/1978,186834,,State Tunnel,,node,369164066,node/369164066,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,,,,,,Sun Cliff Canyon,,way,369125651,way/369125651,...,,,,,,,,,,
85,,,,,,Sun Cliff Canyon,,way,369125652,way/369125652,...,,,,,,,,,,culvert
86,,,,,,North Fork Gunnison River,,way,373537216,way/373537216,...,,,,,,,,,,
87,,,,,,Eagle Valley,,way,845490443,way/845490443,...,,yes,,,,,yes,yes,,
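Most of the named records are summits, dams, and streams rather than trails. One rough way to narrow things down, sketched here, is to keep only the records that carry at least one access tag (`foot`, `horse`, `bicycle`, `motorcycle`), which show up as columns in the output above. Treating those tags as a signal for "trail-like" is my assumption, and the `has_access_tags` helper and the small sample list are hypothetical stand-ins for the full `all_ways_and_nodes` list.

```python
ACCESS_TAGS = ("foot", "horse", "bicycle", "motorcycle")


def has_access_tags(tags, keys=ACCESS_TAGS):
    """Return True when a record carries at least one non-empty access tag."""
    return any(tags.get(key) not in (None, "") for key in keys)


# Small stand-in for the all_ways_and_nodes list, using values visible in the
# dataframe output.
sample_records = [
    {"name": "Smiths Mountain", "place": "locality", "osm_path": "node/358922443"},
    {"name": "Sun Cliff Canyon", "tunnel": "culvert", "osm_path": "way/369125652"},
    {"name": "Eagle Valley", "bicycle": "yes", "foot": "yes", "horse": "yes",
     "osm_path": "way/845490443"},
]

trail_candidates = [rec for rec in sample_records if has_access_tags(rec)]
print([rec["name"] for rec in trail_candidates])
```

A filter like this will miss trails that have a name but no access tags, so it is better suited to building a starting list than to deciding what counts as a trail.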



# Conclusions

This exercise leaves me with a couple of interesting conclusions. I can see why no one has yet done this comprehensively. Nicely curated and clean public data about trails just doesn't exist outside the areas where someone has taken enough interest to develop such a resource. I can also see the tremendous intellectual capital investment that commercial companies like Natural Atlas, Gaia GPS, AllTrails and others have laid out to build their own sources. The one effort here in the U.S. through the USGS National Digital Trails dataset is more of a simple amalgamation at this point, relying on best available declared sources of trails data from mostly Federal providers and not yet dealing with all the deeper semantics of a full integration.

A combination of Wikidata and OSM could be a powerful platform upon which to carry out a more semantically robust integration of trails data and information from multiple sources, creating a valuable public domain asset. The knowledge model is inherent in those two platforms, and the slight differences between them make up for weaknesses either one might have on its own. One challenge is the pre-existing data in both systems that has been contributed somewhat by happenstance rather than through carefully thought-out design. This could create collisions when trying to do something more systematic.

## Potential operating model

I think what I've proven to myself with this exercise is that there's not really a way to go directly from point A (source data/information) to point B (knowledge commons). I've also seen how I need to focus on both Wikidata and OSM and the relationships between them in this kind of exercise when the entities in question are also inherently geospatial features.

I will have to create some type of intermediary data structure where the curatorial process can occur. While I may sometimes be able to run code against an information source to work up my initial list of trails for a given area, I'll need something with trail names that I can use as a starting point to go after information for claims (Wikidata) and tags (OSM). This is the type of scenario I've seen from other Wikidata contributors, who end up using something like OpenRefine to work up the final details for a dataset that got into that form by some other means. Having used that same approach in other data integration exercises, I've got a lot of respect for OpenRefine and its method of recording provenance.

My goal is to make sure that there is little to nothing that can't be linked to or shown with code between the original sources I decide I can use and the point where I need to sit down and make choices that establish trail entities and their claims and tags. In doing so, I will establish a provenance trace that should both promote trustworthiness in the end result and show that it's possible to improve things at the sources such that more direct contributions to the knowledge commons could occur at some point. Demonstrating and recording everything that has to be done with records from one place to another should be a helpful lesson. It should also let me automate at least some of this process within certain constraints.
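To make the idea of an intermediary structure with a provenance trace a little more concrete, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical: the `TrailCandidate` and `SourceRef` layouts, the `fingerprint` and `add_source` helpers, and the field names are one possible shape for this, not a settled design.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SourceRef:
    """Where a record came from, plus a fingerprint of it as retrieved."""
    source: str       # e.g. "osm" or "blm"
    identifier: str   # e.g. "way/845490443"
    retrieved: str    # ISO 8601 timestamp (UTC)
    sha256: str       # hash of the raw record


@dataclass
class TrailCandidate:
    """One entry in the intermediary curation structure (hypothetical layout)."""
    name: str
    wikidata_claims: dict = field(default_factory=dict)
    osm_tags: dict = field(default_factory=dict)
    sources: list = field(default_factory=list)


def fingerprint(record):
    """SHA-256 of a record serialized with sorted keys, so key order is ignored."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_source(candidate, source, identifier, record):
    """Attach a provenance entry for a raw source record to a candidate."""
    candidate.sources.append(
        SourceRef(
            source=source,
            identifier=identifier,
            retrieved=datetime.now(timezone.utc).isoformat(),
            sha256=fingerprint(record),
        )
    )


# Example using tag values seen in the OSM results for this notebook.
osm_record = {"name": "Eagle Valley", "bicycle": "yes", "foot": "yes", "horse": "yes"}
candidate = TrailCandidate(name=osm_record["name"], osm_tags=dict(osm_record))
add_source(candidate, "osm", "way/845490443", osm_record)
print(candidate.sources[0].identifier, candidate.sources[0].sha256[:12])
```

Storing a hash of each raw record alongside its identifier is one way to show later that a curated claim or tag traces back to a specific version of a source record.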
