# 03. Identify Candidate Intersections OSM

Build a CSV file with a list of candidate intersections to sample from Google Street View based purely on an OpenStreetMap XML extract.  We look for "ways" that have the tag "cycleway" and then find any nodes on those ways that are shared with another way.
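As a rough illustration of the idea, the sketch below applies the same rule to a tiny, made-up OSM fragment.  The sample XML, the street names and the helper name `find_cycleway_intersections` are all hypothetical and exist only for this example; the notebook's real implementation streams the file instead of loading it in one go.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature OSM extract: two named ways that share node 2
SAMPLE_OSM = """<osm>
  <node id="1" lat="-37.80" lon="144.96"/>
  <node id="2" lat="-37.81" lon="144.96"/>
  <node id="3" lat="-37.81" lon="144.97"/>
  <way id="10">
    <nd ref="1"/><nd ref="2"/>
    <tag k="name" v="Example Street"/>
    <tag k="cycleway" v="lane"/>
  </way>
  <way id="11">
    <nd ref="2"/><nd ref="3"/>
    <tag k="name" v="Sample Road"/>
  </way>
</osm>"""

def find_cycleway_intersections(osm_xml):
    """Return {node_id: sorted way names} for cycleway nodes shared with a differently named way."""
    root = ET.fromstring(osm_xml)
    names_per_node = defaultdict(set)  # distinct way names seen at each node
    cycleway_nodes = set()             # nodes that belong to at least one cycleway
    for way in root.iter('way'):
        tags = {t.get('k'): t.get('v') for t in way.iter('tag')}
        name = tags.get('name')
        if name is None:
            continue  # unnamed ways (e.g. coastline) are ignored
        refs = [nd.get('ref') for nd in way.iter('nd')]
        for ref in refs:
            names_per_node[ref].add(name.upper())
        if any(key.startswith('cycleway') for key in tags):
            cycleway_nodes.update(refs)
    return {n: sorted(names_per_node[n])
            for n in cycleway_nodes if len(names_per_node[n]) > 1}

print(find_cycleway_intersections(SAMPLE_OSM))
# {'2': ['EXAMPLE STREET', 'SAMPLE ROAD']}
```

Node 1 is on the cycleway but only touches one named way, so it is not reported; node 2 is shared by two differently named ways, so it is a candidate intersection.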

You can download OpenStreetMap XML data for Australia from:

http://download.geofabrik.de

And use the "osmium" tool to reduce it to a smaller bounding box.

Or you can use the online service:

https://extract.bbbike.org

To select the required extract.  Please consider donating money to help them run that service, if you use it.

The data for Victoria is 2.82GB uncompressed, so it is not included in the GitHub repository.  Please use one of the above services to obtain it.

If downloading from Geofabrik, the .osm.bz2 file is a compressed version of the XML file you want, but you'll only be able to download Australia as a whole.  I recommend you download the .osm.pbf file (a smaller file in a non-XML format) and then use the "osmium" tool to cut it down to a bounding box defined by a pair of latitude/longitude coordinates.

To install and use "osmium", see:

https://osmcode.org/osmium-tool/
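For reference, here is a minimal sketch of how the bounding-box cut might be scripted from Python.  It only builds the `osmium extract` command line; the helper name `osmium_extract_command`, the file names and the bounding-box coordinates are illustrative assumptions (the coordinates are a rough box around Victoria, not an authoritative boundary), so check them against your own download before running anything.

```python
import shlex

def osmium_extract_command(src_pbf, dst_osm, west, south, east, north):
    """Build an 'osmium extract' command that cuts src_pbf down to a bounding box.

    osmium expects the box as LEFT,BOTTOM,RIGHT,TOP, i.e. lon,lat,lon,lat.
    """
    bbox = '{0},{1},{2},{3}'.format(west, south, east, north)
    return ['osmium', 'extract', '--bbox', bbox, src_pbf, '--output', dst_osm]

# Hypothetical file names and an approximate box around Victoria
cmd = osmium_extract_command(
    'australia-latest.osm.pbf', 'planet_victoria.osm',
    west=140.9, south=-39.2, east=150.1, north=-33.9
)
print(shlex.join(cmd))
```

Once the printed command looks right, it could be executed with `subprocess.run(cmd, check=True)`, assuming the "osmium" tool is installed and on the PATH.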

## Configuration

Any configuration that is required to run this notebook can be customized in the next cell.

In [1]:
# The name of the output CSV file with a list of intersections to sample,
# based on cycleways in the OSM extract
# Will be saved to the 'data_sources' directory
output_intersections_file = 'osm_intersections.csv'

# Filename of OpenStreetMap XML extract that includes Victoria
# Expected to be found in the 'data_sources' directory
osm_file = 'planet_victoria.osm'
#osm_file = 'Locality_Mount_Eliza_Sample.osm'

## Code

In [2]:
# General imports
import os
import sys

import pandas as pd

from datetime import datetime

from tqdm.notebook import tqdm

# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET
from collections import defaultdict

from geographiclib.geodesic import Geodesic

In [3]:
# Functions to help write log messages that keep track of how long everything took
timestamp_starting = 0

def log_starting(msg):
    global timestamp_starting
    timestamp_starting = datetime.now()
    print(str(timestamp_starting) + ' START - ' + msg, flush=True)

def log_finished(msg):
    global timestamp_starting
    timestamp_finished = datetime.now()
    timestamp_duration = timestamp_finished - timestamp_starting
    print(str(timestamp_finished) + ' END   - ' + msg
        + '(' + str(timestamp_duration.total_seconds()) + ')',
        flush=True
    )

Load raw OpenStreetMap data for the area into memory.  This will be used to find every intersection along each named cycleway.

The extract was downloaded from http://extract.bbbike.org.  The region was limited to one small town for initial testing, then expanded to a box that encompasses all of Victoria.

The "Victoria" data took around 10 minutes to load into memory, and the resulting Python process used approximately 9GB of memory.
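Much of that memory goes on the lookup dictionaries, but the XML tree itself can also grow during streaming if finished elements stay attached to the root.  The following is a minimal sketch of a common `iterparse` pattern that clears the root each time a top-level element is complete.  The function name `count_elements_streaming` and the sample data are hypothetical, and the effect on the Victoria extract has not been measured here.

```python
import io
import xml.etree.ElementTree as ET

def count_elements_streaming(source):
    """Count <node> and <way> elements, freeing each top-level element once it is complete."""
    counts = {'node': 0, 'way': 0}
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root <osm> element
    depth = 0                # nesting depth below the root
    for event, elem in context:
        if event == 'start':
            depth += 1
        else:
            depth -= 1
            if elem.tag in counts:
                counts[elem.tag] += 1
            if depth == 0:
                # A direct child of the root has just ended: drop it from the tree
                root.clear()
    return counts

# Hypothetical sample data standing in for a real extract
SAMPLE = (b'<osm>'
          b'<node id="1" lat="-37.80" lon="144.96"/>'
          b'<node id="2" lat="-37.81" lon="144.96"/>'
          b'<way id="10"><nd ref="1"/><nd ref="2"/></way>'
          b'</osm>')

print(count_elements_streaming(io.BytesIO(SAMPLE)))
# {'node': 2, 'way': 1}
```

Attributes are read before the element is cleared, so anything worth keeping must be copied into a dictionary first, as the cell below does.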

In [4]:
osm_file_path = os.path.join(os.path.abspath(os.pardir), 'data_sources', osm_file)

log_starting('Read raw OpenStreetMap data')

# In-Memory caching via dictionary objects
nodes_per_way      = defaultdict(list) # List of nodes in each way
ways_per_node      = defaultdict(list) # List of way IDs associated with each node
way_count_per_node = {}                # Count distinct way names per node
ways_by_id         = {}                # Name of each named way by way id
cycleways_by_id    = {}                # Name of each way by way id where the way is a cycleway
node_lat           = {}                # Latitude of each node by osm_id
node_lon           = {}                # Longitude of each node by osm_id

# Read the OpenStreetMap XML file
context = ET.iterparse(osm_file_path, events=('start', 'end'))
context = iter(context)

way_id  = 0  # Keep track of which "way" object we are reading from XML, 0=none
node_id = 0  # Keep track of which "node" (nd) object we are reading from XML, 0=none

# Iterate through every XML element in the file as it starts or finishes
# This approach allows us to "stream" the XML rather than try to load it all into
# memory at once.  We only cache what is important to us.
way_name    = None
is_cycleway = False
recorded    = False
    
for event, elem in context:
    tag   = elem.tag
    value = elem.text
    
    if value:
        value = value.encode('utf-8').strip()
        
    # Process the start of an XML tag
    if event == 'start':
        # Process "way" objects
        if tag == 'way':
            way_id = elem.get('id', 0)
                            
            # Record that we have not found the name yet, nor evidence that it is a cycleway
            way_name    = None
            is_cycleway = False
            recorded    = False
            
        # Process "node" (nd) objects inside (associated with) a "way"
        elif tag == 'nd':
            node_id = elem.get('ref', 0)
            if way_id != 0:
                # Record that this node was inside this way
                nodes_per_way[way_id].append(node_id)
                ways_per_node[node_id].append(way_id)
                
        # Process "tag" objects that give a street name for each "way"
        elif tag == 'tag':
            k = elem.get('k', '?').upper()
            v = elem.get('v', '?').upper()
            #print('Tag: [{0:s}] = [{1:s}]'.format(k, v))
            
            if way_id != 0 and k == 'NAME':
                way_name = v
                ways_by_id[way_id] = way_name.upper()
            
            elif way_id != 0 and k.startswith('CYCLEWAY'):
                is_cycleway = True
            
            # If the way has a name and it is a cycleway that we have not yet recorded, do so now
            if way_name is not None and is_cycleway and not recorded:
                cycleways_by_id[way_id] = way_name.upper()
                recorded = True

        # Record the latitude/longitude for each "node" by its osm_id
        if tag == 'node':
            node_id = elem.get('id', 0)
            lat     = elem.get('lat', 0)
            lon     = elem.get('lon', 0)
            
            node_lat[node_id] = float(lat)
            node_lon[node_id] = float(lon)
            
    # At the end of an XML tag, if it was a "way" then record that we are no longer
    # in the middle of reading a "way"
    if event == 'end' and tag == 'way':
        way_id = 0

    elem.clear()

log_finished('Read raw OpenStreetMap data')


# A way can be divided up into multiple segments when a characteristic changes, e.g. speed limit change
# We do not want to treat these segment boundaries as intersections, because they are not
# So get a count of distinct way names per node.  If there is more than one distinct name, THEN it is an intersection
log_starting('Find distinct way names per node')

for node_id in ways_per_node.keys():
    way_names = []
    
    for way_id in ways_per_node[node_id]:
        # Watch for unnamed ways that were deliberately excluded, e.g. coastline, creek
        if way_id in ways_by_id:
            way_name = ways_by_id[way_id]
        
            if way_name not in way_names:
                way_names.append(way_name)
    
    way_count_per_node[node_id] = len(way_names)
    
log_finished('Find distinct way_names per node')

2021-10-08 16:33:05.918598 START - Read raw OpenStreetMap data
2021-10-08 16:37:17.708082 END   - Read raw OpenStreetMap data(251.789484)
2021-10-08 16:37:17.708578 START - Find distinct way names per node
2021-10-08 16:37:38.688859 END   - Find distinct way_names per node(20.980281)
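The distinct-name count can also be expressed with sets, which avoids the repeated `not in` list scan.  This is only a sketch of an equivalent formulation: the helper name `count_distinct_names` and the sample dictionaries are made up for illustration, and any speed difference on the full extract is untested.

```python
def count_distinct_names(ways_per_node, ways_by_id):
    """Return {node_id: number of distinct way names}, skipping ways that have no name."""
    return {
        node_id: len({ways_by_id[way_id] for way_id in way_ids if way_id in ways_by_id})
        for node_id, way_ids in ways_per_node.items()
    }

# Hypothetical data: ways 10 and 11 are two segments of the same street
sample_ways_per_node = {'1': ['10', '11'], '2': ['10', '12'], '3': ['13']}
sample_ways_by_id = {'10': 'HIGH STREET', '11': 'HIGH STREET', '12': 'STATION ROAD'}

print(count_distinct_names(sample_ways_per_node, sample_ways_by_id))
# {'1': 1, '2': 2, '3': 0}
```

Node 1 joins two segments of the same street, so it counts one name and is not an intersection.  Node 2 joins two different streets.  Node 3 only touches an unnamed way, so its count is zero.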


In [5]:
# Find intersection nodes in each cycleway, but try to work out the bearings at the same time
# Write the output to a CSV

log_starting('Write candidate intersections to CSV')

output_intersections_path = os.path.join(os.path.abspath(os.pardir), 'data_sources', output_intersections_file)

# Open an output CSV file for writing, emulate the basic structure of the one we produced from PBN cycleways
f = open(output_intersections_path, 'w')
f.write(',objectid,local_street,town_suburb,city_count,intersection_street,intersection_lat,intersection_lon,bearing,bearing_lat,bearing_lon\n')

for way_id in cycleways_by_id.keys():
    node_list = nodes_per_way[way_id]
    for i in range(0, len(node_list)):
        this_node = node_list[i]
        
        if way_count_per_node[this_node] > 1:
            # This is an intersection, we want to output it, but we need a bearing first
            # The bearing is assumed to be the average of the bearing from the previous node to this node,
            # and the bearing from this node to the next node
        
            if i > 0:
                prev_node = node_list[i-1]
                prev_bearing = Geodesic.WGS84.Inverse(node_lat[prev_node], node_lon[prev_node], node_lat[this_node], node_lon[this_node])['azi1']
                if prev_bearing < 0:
                    prev_bearing += 360
            else:
                prev_bearing = None
            
            if i < len(node_list) - 1:
                next_node = node_list[i+1]
                next_bearing = Geodesic.WGS84.Inverse(node_lat[this_node], node_lon[this_node], node_lat[next_node], node_lon[next_node])['azi1']
                if next_bearing < 0:
                    next_bearing += 360
            else:
                next_bearing = None
        
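            if prev_bearing is not None and next_bearing is not None:
                # Average along the shorter arc so that bearings either side of north
                # (e.g. 350 and 10) average to 0 rather than 180
                turn = (next_bearing - prev_bearing + 180) % 360 - 180
                bearing = float((prev_bearing + turn / 2) % 360)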
            elif prev_bearing is not None:
                bearing = float(prev_bearing)
            elif next_bearing is not None:
                bearing = float(next_bearing)
            else:
                bearing = 0
                
            # Output a record
            f.write('0,{0:s},{1:s},-,-,-,1,{2:s},{3:.6f},{4:.6f},{5:.1f},{3:.6f},{4:.6f}\n'.format(
                way_id,
                ways_by_id[way_id],
                this_node,
                node_lat[this_node],
                node_lon[this_node],
                bearing
            ))
            
f.close()

log_finished('Write candidate intersections to CSV')

2021-10-08 16:37:38.702747 START - Write candidate intersections to CSV
2021-10-08 16:37:42.718854 END   - Write candidate intersections to CSV(4.016107)
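Averaging two compass bearings needs care near north, where values wrap from 360 back to 0.  Below is a minimal sketch of a wrap-safe average that can be checked in isolation.  The helper names `mean_bearing` and `naive_mean` are hypothetical, and the approach (stepping half of the signed turn from the first bearing towards the second) is one of several reasonable choices.

```python
def naive_mean(bearing_a, bearing_b):
    """Plain arithmetic mean, which goes wrong when the bearings straddle north."""
    return (bearing_a + bearing_b) / 2

def mean_bearing(bearing_a, bearing_b):
    """Average two bearings (degrees, 0-360) along the shorter arc between them."""
    # Signed turn from a to b, normalised into the range [-180, 180)
    turn = (bearing_b - bearing_a + 180) % 360 - 180
    return (bearing_a + turn / 2) % 360

print(naive_mean(350.0, 10.0))     # 180.0, pointing south
print(mean_bearing(350.0, 10.0))   # 0.0, pointing north
print(mean_bearing(90.0, 180.0))   # 135.0
```

When the two bearings are exactly opposite, no single average direction is meaningful; this sketch then returns a direction perpendicular to both.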
