# 01. Reverse-Geocode PBN

We want to use the Principal Bicycle Network (PBN) dataset to generate a list of intersections along existing, on-road bicycle lane routes. First, we need to work around some limitations of the dataset.

The Principal Bicycle Network dataset contains a sequence of coordinates (latitude/longitude) for each bike path, but it does not always have a street name, never has a town or suburb name, and has no information about intersections. We therefore need to fill in those gaps by reverse-geocoding.

This process requires a local instance of Nominatim to perform bulk reverse-geocoding.  Instructions on how to set that up can be found here:

https://nominatim.org/release-docs/latest/admin/Installation/

You will need to load data for Australia, or at least a bounding box that covers all of Victoria, into the server, so that it can handle requests for coordinates within Victoria.
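
Before starting a bulk run, it can help to confirm that the server answers for a single coordinate inside Victoria. The sketch below only builds the request URL for Nominatim's `/reverse` endpoint; the helper name `build_reverse_url`, the host `localhost/nominatim`, and the sample coordinate are assumptions for illustration, not values from this project.

```python
from urllib.parse import urlencode

def build_reverse_url(scheme, domain, lat, lon):
    """Build a Nominatim /reverse request URL for one coordinate."""
    query = urlencode({"lat": lat, "lon": lon, "format": "jsonv2"})
    return f"{scheme}://{domain}/reverse?{query}"

# Hypothetical host; the coordinate is a point in central Melbourne
url = build_reverse_url("http", "localhost/nominatim", -37.8136, 144.9631)
print(url)
```

Opening the printed URL in a browser (with your own host substituted) should return a JSON document containing an `address` object if the Victorian data has been loaded.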

The Principal Bicycle Network dataset is open data. If you do not wish to go through this process, you can use the saved results, if provided by the author (see the thesis for a link). If you want to run the process yourself, or perform the operation on a newer version of the PBN dataset, then please use this notebook.

## Configuration

Any configuration that is required to run this notebook can be customized in the next cell.

In [1]:
# Principal Bicycle Network filename, downloaded from data.gov.au or included
# in your download of this code from GitHub
# Must be in the 'data_sources' subdirectory
pbn_filename = 'Principal_Bicycle_Network_(PBN).geojson'

# Output GeoJSON file where each point in the original PBN dataset has been
# "exploded" into its own record, and reverse-geocoded
# Will be saved to the 'data_sources' subdirectory
exploded_output_file = 'pbn_exploded.geojson'

# Connection details for a local Nominatim instance
nominatim_domain  = '192.168.1.197/nominatim'
nominatim_scheme  = 'http'
nominatim_timeout = 10
nominatim_agent   = 'myGeocoder'

## Code

In [2]:
# General imports
import os
import sys

import pandas as pd

# Workaround for message: "PROJ: proj_create_from_database: Cannot find proj.db"
# when importing geopandas (adjust these paths to match your own environment)
os.environ['GDAL_DATA'] = '/home/server/anaconda3/share/gdal'
os.environ['PROJ_LIB'] = '/home/server/anaconda3/share/proj'

import geopandas as gpd  # needs pyproj <= 2.6 rather than 3.1

import geopy
import geopy.distance
from geopy import Point
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

from shapely.geometry.linestring import LineString
from shapely.geometry.multilinestring import MultiLineString

from geographiclib.geodesic import Geodesic

from datetime import datetime

from tqdm.notebook import tqdm

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from collections import defaultdict

import requests

import gmaps

PROJ: proj_create_from_database: Cannot find proj.db


In [3]:
# Functions to help write log messages that keep track of how long everything took
timestamp_starting = 0

def log_starting(msg):
    global timestamp_starting
    timestamp_starting = datetime.now()
    print(str(timestamp_starting) + ' START - ' + msg, flush=True)

def log_finished(msg):
    global timestamp_starting
    timestamp_finished = datetime.now()
    timestamp_duration = timestamp_finished - timestamp_starting
    print(str(timestamp_finished) + ' END   - ' + msg
        + '(' + str(timestamp_duration.total_seconds()) + ')',
        flush=True
    )

Load the Principal Bicycle Network dataset from a GeoJSON file.

In [4]:
log_starting('Load original PBN dataset')

pbn_path = os.path.join(os.path.abspath(os.pardir), 'data_sources', pbn_filename)

pbn = gpd.read_file(pbn_path)

log_finished('Load original PBN dataset')

2021-10-08 12:53:14.690948 START - Load original PBN dataset
2021-10-08 12:53:18.626194 END   - Load original PBN dataset(3.935246)


Connect to a Nominatim service and perform bulk reverse-geocoding of the data

In [5]:
locator = Nominatim(
    user_agent = nominatim_agent,
    domain     = nominatim_domain,
    scheme     = nominatim_scheme,
    timeout    = nominatim_timeout
)

Create a function to perform reverse geocoding from the 'geometry' column

It will work through each coordinate in the 'geometry' column, because a path could cut through multiple town/suburb/city values, or even multiple streets. We may get some false positives when we geocode a coordinate near an intersection and the reverse-geocoder concludes that the point is on the wrong street. However, by looking at every point on the path, we should still pick up the real street, and we will develop strategies to deal with the false positives. For example:

* If there is only one point (node) on a street, then exclude it or assign it low confidence
* Check if the point (node) is an intersection based on the raw OpenStreetMap data
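
The first strategy above can be sketched with a simple count per path. This is a minimal illustration on plain Python lists, not the notebook's implementation; the function name `flag_single_hits` and the sample road names are hypothetical.

```python
from collections import Counter

def flag_single_hits(roads):
    """Pair each road name with True when it occurs only once in the path."""
    counts = Counter(roads)
    return [(road, counts[road] == 1) for road in roads]

# Road names as reverse-geocoded for the points of one path (made-up sample)
path_roads = ["Main Street", "Main Street", "Side Lane", "Main Street"]
for road, low_confidence in flag_single_hits(path_roads):
    print(road, "low confidence" if low_confidence else "ok")
```

Here the single "Side Lane" hit is flagged, which is the kind of point that may be a false positive picked up near an intersection.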

We must be careful to handle both LineString and MultiLineString!  There is exactly ONE row
with a MultiLineString in the PBN dataset!
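
Both geometry types reduce to the same flat sequence of points. The sketch below shows that idea without any dependencies, assuming the geometry is available as a GeoJSON-style mapping (for instance via Shapely's `mapping()` function). The helper `iter_lat_lon` is hypothetical. GeoJSON stores each position as (longitude, latitude), so the pair is swapped to the (latitude, longitude) order that a reverse-geocoder expects.

```python
def iter_lat_lon(geometry):
    """Yield (lat, lon) for every point of a LineString or MultiLineString mapping."""
    kind = geometry["type"]
    if kind == "LineString":
        lines = [geometry["coordinates"]]
    elif kind == "MultiLineString":
        lines = geometry["coordinates"]
    else:
        raise ValueError(f"Unsupported geometry type: {kind}")
    for line in lines:
        for lon, lat, *_ in line:  # ignore an optional third (elevation) value
            yield (lat, lon)

single = {"type": "LineString",
          "coordinates": [[145.25577, -37.99385], [145.25570, -37.99390]]}
multi = {"type": "MultiLineString",
         "coordinates": [[[145.0, -37.0], [145.1, -37.1]],
                         [[145.2, -37.2], [145.3, -37.3]]]}
print(list(iter_lat_lon(single)))
print(list(iter_lat_lon(multi)))
```

Raising an error for any other geometry type makes an unexpected row visible instead of silently producing an empty result.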

In [6]:
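def geocode_coords(row):
    # Reverse-geocode every point of one geometry and return the raw results
    retval = []

    if type(row) is LineString:
        for xy in row.coords:
            # Shapely gives (x, y) = (longitude, latitude); Point expects (latitude, longitude)
            retval.append(locator.reverse(Point(xy[1], xy[0])).raw)

    elif type(row) is MultiLineString:
        # Use .geoms: iterating a MultiLineString directly fails in Shapely 2.0
        for ls in row.geoms:
            for xy in ls.coords:
                retval.append(locator.reverse(Point(xy[1], xy[0])).raw)

    return retval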

Geocode every coordinate in every row of the PBN dataset, with a progress bar

In [7]:
log_starting('Bulk reverse-geocoding')

tqdm.pandas()

pbn['geocode_list'] = pbn['geometry'].progress_apply(geocode_coords)

log_finished('Bulk reverse-geocoding')

2021-10-08 12:53:18.673810 START - Bulk reverse-geocoding


  0%|          | 0/40252 [00:00<?, ?it/s]

2021-10-08 13:13:38.010685 END   - Bulk reverse-geocoding(1219.336875)
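
Where consecutive path segments share end points, the same coordinate may be looked up more than once. One possible way to avoid repeated requests is to memoize lookups per coordinate. The sketch below uses `functools.lru_cache` with a stand-in function (`fake_reverse`, hypothetical) in place of `locator.reverse`, so it runs without a Nominatim server. Whether the PBN data contains enough repeated coordinates for this to matter is an assumption that has not been measured here.

```python
from functools import lru_cache

calls = []  # records every lookup that actually reaches the stand-in geocoder

def fake_reverse(lat, lon):
    """Stand-in for a real reverse-geocoding call such as locator.reverse()."""
    calls.append((lat, lon))
    return {"lat": lat, "lon": lon, "address": {"road": "Example Road"}}

@lru_cache(maxsize=None)
def cached_reverse(lat, lon):
    """Look up a coordinate once, then serve repeats from the in-memory cache."""
    return fake_reverse(lat, lon)

points = [(-37.99385, 145.25577), (-37.99390, 145.25570), (-37.99385, 145.25577)]
results = [cached_reverse(lat, lon) for lat, lon in points]
print(len(results), "results from", len(calls), "geocoder calls")
```

Note that cached results are shared objects, so a caller that modifies a returned dictionary would also change what later lookups receive.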


Show a sample of the geocoded data, with new 'geocode_list' column

In [8]:
pbn.head(3)

Unnamed: 0,objectid,network,type,status,strategic_cycling_corridor,local_name,local_type,rd_num,name,side,...,facility_right,surface_right,width_right,lighting,verified_date,bearing,scc_name,comments,geometry,geocode_list
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,,,0.0,,1899-12-30T00:00:00,0.0,Dandenong to Pakenham,,"LINESTRING (145.25577 -37.99385, 145.25570 -37...","[{'place_id': 3455946, 'licence': 'Data © Open..."
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,,,0.0,,1899-12-30T00:00:00,0.0,Dandenong to Pakenham,,"LINESTRING (145.26087 -37.99526, 145.26058 -37...","[{'place_id': 3455946, 'licence': 'Data © Open..."
2,3,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,,,0.0,,1899-12-30T00:00:00,0.0,Carrum to Warburton Trail,,"LINESTRING (145.22693 -37.99352, 145.22674 -37...","[{'place_id': 3543007, 'licence': 'Data © Open..."


"Explode" geocode_list into a row for each point, because each path might span multiple streets or towns/suburbs/cities

In [9]:
log_starting('Explode geocode_list')

pbn_explode = pbn.explode('geocode_list')

log_finished('Explode geocode_list')

2021-10-08 13:13:38.065741 START - Explode geocode_list
2021-10-08 13:13:38.418397 END   - Explode geocode_list(0.352656)
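
To make the effect of the explode step concrete, here is a plain-Python sketch of the same idea on a list of dictionaries: every element of the list-valued field becomes its own row, and the other fields are copied. The helper `explode_rows` and the sample data are hypothetical. Unlike pandas, which keeps a row with a missing value for an empty list, this sketch simply produces no row in that case.

```python
def explode_rows(rows, column):
    """Return one output row per element of row[column], copying the other fields."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)
            new_row[column] = item
            exploded.append(new_row)
    return exploded

paths = [
    {"objectid": 1, "geocode_list": [{"road": "A Street"}, {"road": "B Street"}]},
    {"objectid": 2, "geocode_list": [{"road": "C Street"}]},
]
for row in explode_rows(paths, "geocode_list"):
    print(row["objectid"], row["geocode_list"]["road"])
```

This is also why the exploded dataset shown further down repeats the same index and 'objectid' across several consecutive rows.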


Define functions to tease out key fields from within geocoded data, then add them to the dataset as (more accessible) columns.

At the same time, we will define a new field 'local_street'. If 'local_name' is defined in the original PBN record, we use that, along with 'local_type'. Otherwise, we take the 'road' field that was provided by the reverse-geocode process. The value is stored as given (the sample output below shows mixed case, e.g. 'Hallam Bypass Trail'), so it is not converted to upper case at this stage.

In [10]:
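# General functions to pull out a "vanilla" field from the geocoded result.
# Each returns 'n/a' if the key is missing or the input is not a dictionary.
def get_address(geocode):
    try:
        return geocode['address']
    except (KeyError, TypeError):
        return 'n/a'

# get_road and get_postcode are applied to the 'address' dictionary
def get_road(address):
    try:
        return address['road']
    except (KeyError, TypeError):
        return 'n/a'

def get_postcode(address):
    try:
        # The key must be quoted; an unquoted name always fell through to 'n/a'
        return address['postcode']
    except (KeyError, TypeError):
        return 'n/a'

def get_display_name(geocode):
    try:
        return geocode['display_name']
    except (KeyError, TypeError):
        return 'n/a'

def get_osm_id(geocode):
    try:
        return geocode['osm_id']
    except (KeyError, TypeError):
        return 'n/a'
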
    
# Three functions to pull out town/suburb/city
# These are special because they are inconsistently populated.
# If we want the 'town', we will use that if it is defined, otherwise we will try 'suburb', then 'city'
# If we want the 'suburb' we will resort to 'town', then 'city'
# If we want the 'city' we will resort to 'suburb', then 'town'

def get_suburb(address):
    try:
        return address['suburb']
    except Exception:
        try:
            return address['town']
        except Exception:
            try:
                return address['city']
            except Exception:
                return 'n/a'

def get_town(address):
    try:
        return address['town']
    except Exception:
        try:
            return address['suburb']
        except Exception:
            try:
                return address['city']
            except Exception:
                return 'n/a'
        
def get_city(address):
    try:
        return address['city']
    except Exception:
        try:
            return address['suburb']
        except Exception:
            try:
                return address['town']
            except Exception:
                return 'n/a'
        
# This function will define a new field 'local_street'.
# Our first preference is to use 'local_name' and 'local_type' if they were explicitly
# given to us in the original PBN dataset.  Otherwise we will use the 'road' field
# from the reverse-geocoded data.
def get_local_street(row):
    retval = '' 
    if row['local_name'] is not None and row['local_name'].strip() != '':
        retval = row['local_name']
        if row['local_type'] is not None and row['local_type'].strip() != '':
            retval = retval + ' ' + row['local_type']
    else:
        retval = row['road']
    return retval


# Extract or derive geo fields from geocode_list

log_starting('Extract or derive geo fields')

pbn_explode['address']      = pbn_explode['geocode_list'].apply(get_address)
pbn_explode['road']         = pbn_explode['address'].apply(get_road)
pbn_explode['suburb']       = pbn_explode['address'].apply(get_suburb)
pbn_explode['town']         = pbn_explode['address'].apply(get_town)
pbn_explode['city']         = pbn_explode['address'].apply(get_city)
pbn_explode['postcode']     = pbn_explode['address'].apply(get_postcode)
pbn_explode['display_name'] = pbn_explode['geocode_list'].apply(get_display_name)
pbn_explode['local_street'] = pbn_explode.apply(get_local_street, axis=1)
pbn_explode['osm_id']       = pbn_explode['geocode_list'].apply(get_osm_id)

log_finished('Extract or derive geo fields')

2021-10-08 13:13:38.441213 START - Extract or derive geo fields
2021-10-08 13:13:42.068978 END   - Extract or derive geo fields(3.627765)
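
The town/suburb/city fallbacks in the cell above all follow one pattern: try a list of keys in order of preference and return the first one that exists. A more compact way to express that pattern is sketched below, assuming the address is a plain dictionary; `first_present` is a hypothetical helper and is not part of the notebook.

```python
def first_present(address, keys, default="n/a"):
    """Return the value of the first key in `keys` that exists in `address`."""
    for key in keys:
        if key in address:
            return address[key]
    return default

address = {"road": "Hallam Bypass Trail", "suburb": "Doveton", "city": "Melbourne"}
print(first_present(address, ["town", "suburb", "city"]))  # 'town' preferred
print(first_present(address, ["city", "suburb", "town"]))  # 'city' preferred
print(first_present({}, ["suburb", "town", "city"]))       # nothing available
```

With this helper, each of the three functions would differ only in the order of the key list it passes in.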


Show an example of what the data looks like.

In [11]:
pbn_explode.head(30)

Unnamed: 0,objectid,network,type,status,strategic_cycling_corridor,local_name,local_type,rd_num,name,side,...,geocode_list,address,road,suburb,town,city,postcode,display_name,local_street,osm_id
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115


Save output to a new file, so that we can re-read it later, without having to re-do the whole reverse-geocoding process each time.

In [12]:
log_starting('Save exploded PBN dataset')

exploded_output_path = os.path.join(os.path.abspath(os.pardir), 'data_sources', exploded_output_file)

pbn_explode.to_file(exploded_output_path, driver='GeoJSON', encoding='utf-8')

log_finished('Save exploded PBN dataset')

2021-10-08 13:13:42.140872 START - Save exploded PBN dataset
2021-10-08 13:15:38.047044 END   - Save exploded PBN dataset(115.906172)
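
The reason for saving is that a later run can skip the slow step when the file already exists. A minimal sketch of that load-or-compute pattern follows, using plain JSON and a temporary directory so that it runs anywhere. The names `load_or_compute` and `expensive_step` and the file name are hypothetical; the notebook itself writes GeoJSON through geopandas.

```python
import json
import os
import tempfile

def load_or_compute(path, compute):
    """Return cached JSON from `path` if present; otherwise compute, save, and return it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = compute()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

def expensive_step():
    """Placeholder for the slow reverse-geocoding run."""
    expensive_step.runs += 1
    return [{"objectid": 1, "road": "Example Road"}]

expensive_step.runs = 0  # counts how often the slow step really ran

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "pbn_exploded_demo.json")
    first = load_or_compute(cache_path, expensive_step)
    second = load_or_compute(cache_path, expensive_step)

print(first == second, expensive_step.runs)
```

The second call reads the saved file, so the placeholder for the slow step runs only once.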
