# Load and Geocode the Principal Bicycle Network dataset

Load the Principal Bicycle Network dataset in .geojson format:

https://discover.data.vic.gov.au/dataset/principal-bicycle-network-pbn

Then perform geocoding using a LOCAL instance of Nominatim.  This is necessary for cleansing.

* All versions of the PBN dataset come with "local_name" and "local_type" fields, which MAY be a street name, but they are often blank

* No version of the PBN dataset comes with a town/suburb/city field.  If we see a "Main Street", which one is it?  We need to disambiguate these if we want to find intersections where bicycle lane logos are likely to appear

* The .geojson version of the PBN dataset contains a "geometry" field with a "LineString" or (in one case) a "MultiLineString" series of coordinates.  These "draw" the path on the map.

Therefore, to fill in a town/suburb/city, and to reliably fill in a "local_street" field, we perform a reverse-geocode operation on each latitude/longitude coordinate in the series.
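One detail to watch when walking the coordinate series: GeoJSON stores each position as [longitude, latitude], whereas a reverse-geocode lookup is normally expressed as (latitude, longitude).  The sketch below shows the swap on a plain GeoJSON-style dictionary.  The helper name `to_lat_lon` and the coordinate values are made up for illustration; they are not part of the notebook or the PBN dataset.

```python
# A hypothetical GeoJSON-style geometry; the coordinates are illustrative only.
geometry = {
    "type": "LineString",
    "coordinates": [[145.0, -37.0], [145.001, -37.001]],
}

def to_lat_lon(geometry):
    """Return (latitude, longitude) tuples for a LineString or MultiLineString."""
    if geometry["type"] == "LineString":
        lines = [geometry["coordinates"]]
    elif geometry["type"] == "MultiLineString":
        lines = geometry["coordinates"]
    else:
        raise ValueError("Unsupported geometry type: " + geometry["type"])
    # GeoJSON positions are [longitude, latitude], so swap them
    return [(pos[1], pos[0]) for line in lines for pos in line]

points = to_lat_lon(geometry)
print(points)  # [(-37.0, 145.0), (-37.001, 145.001)]
```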

The "local" instance of Nominatim is required because the public instance only allows one query per second, maximum, and discourages bulk queries.  Our local instance was created on a Hyper-V virtual machine running under Windows, with Ubuntu 20.04, 16GB of RAM, and 100GB of SSD.  (Only 40% of the 100GB disk space was used.)

The "Australia and Oceania" .osm.pbf file was downloaded from the "Geofabrik" site and imported into the Nominatim database.

Refs:

* https://towardsdatascience.com/reverse-geocoding-in-python-a915acf29eb6
* https://operations.osmfoundation.org/policies/nominatim/
* https://nominatim.org/release-docs/latest/admin/Installation/
* https://download.geofabrik.de

In [1]:
import pandas as pd
import geopandas as gpd
import geopy
from geopy import Point
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

from shapely.geometry.linestring import LineString
from shapely.geometry.multilinestring import MultiLineString

from datetime import datetime
from tqdm.notebook import tqdm


timestamp_starting = 0

def log_starting(msg):
    global timestamp_starting
    timestamp_starting = datetime.now()
    print(str(timestamp_starting) + ' START - ' + msg, flush=True)

def log_finished(msg):
    global timestamp_starting
    timestamp_finished = datetime.now()
    timestamp_duration = timestamp_finished - timestamp_starting
    print(str(timestamp_finished) + ' END   - ' + msg
        + '(' + str(timestamp_duration.total_seconds()) + ')',
        flush=True
    )

Load a local copy of the Principal Bicycle Network dataset in .geojson format (which includes co-ordinate geometry)

In [2]:
log_starting('Load original PBN dataset')

pbn = gpd.read_file('Principal_Bicycle_Network_(PBN).geojson')

log_finished('Load original PBN dataset')

2021-08-10 14:47:30.751132 START - Load original PBN dataset
2021-08-10 14:47:35.287975 END   - Load original PBN dataset(4.536843)


Create an object to perform "reverse" geocoding using a local instance of the Nominatim service, based on "Australia and Oceania" data from the OpenStreetMap database

In [3]:
locator = Nominatim(user_agent="myGeocoder", domain="geo.local/nominatim", scheme="http", timeout=10)

Create a function to perform reverse geocoding from the 'geometry' column

It will work through each coordinate in the 'geometry' column, because a path could cut through multiple town/suburb/city values, or even multiple streets.  We may get some false positives when we geocode a coordinate near an intersection and the reverse geocode concludes that the point is on the wrong street.  However, by looking at every point on the path, we should pick up the real street.  We will also develop strategies to deal with the false positives, for example:

* If there is only one point (node) on a street, then exclude it or assign it low confidence
* Check if the point (node) is an intersection based on the raw OpenStreetMap data
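As a rough sketch of the first strategy only, the snippet below counts how many points of one path were assigned to each road and marks any road that appears exactly once as low confidence.  The function name `flag_low_confidence` and the sample road list are hypothetical, and "exactly one point" is an assumed threshold rather than a rule taken from this notebook.

```python
from collections import Counter

def flag_low_confidence(roads):
    """Map each road name to True if only one point of the path landed on it."""
    counts = Counter(roads)
    return {road: count == 1 for road, count in counts.items()}

# Road names returned for the points of one path, in order (illustrative values)
roads = ["Hallam Bypass Trail"] * 3 + ["Example Street"] + ["Hallam Bypass Trail"] * 2

flags = flag_low_confidence(roads)
print(flags)  # {'Hallam Bypass Trail': False, 'Example Street': True}
```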

We must be careful to handle both LineString and MultiLineString!  There is exactly ONE row with a MultiLineString in the PBN dataset!

In [4]:
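def geocode_coords(geom):
    """Reverse-geocode every coordinate of a LineString or MultiLineString."""
    retval = []

    if isinstance(geom, LineString):
        lines = [geom]
    elif isinstance(geom, MultiLineString):
        # Use .geoms: iterating a multi-part geometry directly is
        # deprecated in Shapely 1.8 and removed in Shapely 2.0
        lines = geom.geoms
    else:
        lines = []

    for ls in lines:
        for xy in ls.coords:
            # Shapely gives (x, y) = (longitude, latitude); geopy's Point takes (latitude, longitude)
            retval.append(locator.reverse(Point(xy[1], xy[0])).raw)

    return retval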
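If the same coordinate occurs on more than one path (for instance where two paths share an end point, which is an assumption not verified here), it will be sent to Nominatim more than once.  One possible refinement is to memoise the lookup by (latitude, longitude).  The sketch below uses `functools.lru_cache` around a stand-in function, `fake_reverse`, which is hypothetical and simply records its calls; in the notebook it would be replaced by the real reverse-geocode call.

```python
from functools import lru_cache

calls = []

def fake_reverse(lat, lon):
    """Hypothetical stand-in for the real reverse-geocode request."""
    calls.append((lat, lon))
    return {"lat": lat, "lon": lon}

@lru_cache(maxsize=None)
def cached_reverse(lat, lon):
    # Only reaches the (slow) lookup the first time a coordinate is seen
    return fake_reverse(lat, lon)

# The first and last coordinates are identical, so only two lookups are made
for lat, lon in [(-37.0, 145.0), (-37.001, 145.001), (-37.0, 145.0)]:
    cached_reverse(lat, lon)

print(len(calls))  # 2
```

Note that a cache like this hands back the same dictionary object for repeated coordinates, so the results should be treated as read-only.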

Geocode every coordinate in every row, with a progress bar

In [5]:
log_starting('Bulk reverse-geocoding')

tqdm.pandas()

pbn['geocode_list'] = pbn['geometry'].progress_apply(geocode_coords)

log_finished('Bulk reverse-geocoding')

2021-08-10 14:47:35.302770 START - Bulk reverse-geocoding


  0%|          | 0/40252 [00:00<?, ?it/s]

2021-08-10 15:25:31.289446 END   - Bulk reverse-geocoding(2275.986676)
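To put the run time in context, here is a back-of-the-envelope comparison with the public instance's limit of one query per second.  It uses the row count from the progress bar and the logged duration.  The total number of coordinates is not shown in this notebook, so the sketch assumes the minimum of two coordinates per geometry, which makes both results lower bounds.

```python
rows = 40252                 # geometries processed, from the progress bar
local_seconds = 2275.986676  # logged duration of the local bulk run

# Assumption: every geometry has at least two coordinates
min_points = rows * 2

# At one request per second, the public instance would need at least this long
public_hours = min_points / 3600
print(round(public_hours, 1))  # 22.4

# Lower bound on the throughput the local instance achieved (requests per second)
local_rate = min_points / local_seconds
print(local_rate > 35)  # True
```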


Show a sample of the geocoded data, with new 'geocode_list' column

In [6]:
pbn.head(3)

Unnamed: 0,objectid,network,type,status,strategic_cycling_corridor,local_name,local_type,rd_num,name,side,...,facility_right,surface_right,width_right,lighting,verified_date,bearing,scc_name,comments,geometry,geocode_list
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,,,0.0,,1899-12-30T00:00:00+00:00,0.0,Dandenong to Pakenham,,"LINESTRING (145.25577 -37.99385, 145.25570 -37...","[{'place_id': 3455946, 'licence': 'Data © Open..."
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,,,0.0,,1899-12-30T00:00:00+00:00,0.0,Dandenong to Pakenham,,"LINESTRING (145.26087 -37.99526, 145.26058 -37...","[{'place_id': 3455946, 'licence': 'Data © Open..."
2,3,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,,,0.0,,1899-12-30T00:00:00+00:00,0.0,Carrum to Warburton Trail,,"LINESTRING (145.22693 -37.99352, 145.22674 -37...","[{'place_id': 3543007, 'licence': 'Data © Open..."


"Explode" geocode_list into a row for each point, because each path might span multiple streets or towns/suburbs/cities

In [9]:
log_starting('Explode geocode_list')

pbn_explode = pbn.explode('geocode_list')

log_finished('Explode geocode_list')

2021-08-10 16:28:36.926120 START - Explode geocode_list
2021-08-10 16:28:37.334086 END   - Explode geocode_list(0.407966)


Define functions to tease out key fields from within geocoded data, then add them to the dataset as (more accessible) columns.

At the same time, we will define a new field 'local_street'.  If 'local_name' is defined in the original PBN record, we use that, along with 'local_type'.  Otherwise, we take the 'road' field that was provided by the reverse-geocode process.  Note that this step keeps whatever case the source value has (for example 'Hallam Bypass Trail'); it does not convert the result to upper case.

In [10]:
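# General functions to pull out a "vanilla" field from the geocoded result.
# Each returns 'n/a' if the key is missing (KeyError) or if the input is
# not a dictionary at all (TypeError), e.g. a missing geocode result.
def get_address(geocode):
    try:
        return geocode['address']
    except (KeyError, TypeError):
        return 'n/a'

def get_road(address):
    try:
        return address['road']
    except (KeyError, TypeError):
        return 'n/a'

def get_postcode(address):
    try:
        # The key must be quoted; an unquoted name raises NameError
        return address['postcode']
    except (KeyError, TypeError):
        return 'n/a'

def get_display_name(geocode):
    try:
        return geocode['display_name']
    except (KeyError, TypeError):
        return 'n/a'

def get_osm_id(geocode):
    try:
        return geocode['osm_id']
    except (KeyError, TypeError):
        return 'n/a'
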
    
# Three functions to pull out town/suburb/city
# These are special because they are inconsistently populated.
# If we want the 'town', we will use that if it is defined, otherwise we will try 'suburb', then 'city'
# If we want the 'suburb' we will resort to 'town', then 'city'
# If we want the 'city' we will resort to 'suburb', then 'town'

def get_suburb(address):
    try:
        return address['suburb']
    except Exception:
        try:
            return address['town']
        except Exception:
            try:
                return address['city']
            except Exception:
                return 'n/a'

def get_town(address):
    try:
        return address['town']
    except Exception:
        try:
            return address['suburb']
        except Exception:
            try:
                return address['city']
            except Exception:
                return 'n/a'
        
def get_city(address):
    try:
        return address['city']
    except Exception:
        try:
            return address['suburb']
        except Exception:
            try:
                return address['town']
            except Exception:
                return 'n/a'
        
# This function will define a new field 'local_street'.
# Our first preference is to use 'local_name' and 'local_type' if they were explicitly
# given to us in the original PBN dataset.  Otherwise we will use the 'road' field
# from the reverse-geocoded data.
def get_local_street(row):
    retval = '' 
    if row['local_name'] is not None and row['local_name'].strip() != '':
        retval = row['local_name']
        if row['local_type'] is not None and row['local_type'].strip() != '':
            retval = retval + ' ' + row['local_type']
    else:
        retval = row['road']
    return retval


# Extract or derive geo fields from geocode_list

log_starting('Extract or derive geo fields')

pbn_explode['address']      = pbn_explode['geocode_list'].apply(get_address)
pbn_explode['road']         = pbn_explode['address'].apply(get_road)
pbn_explode['suburb']       = pbn_explode['address'].apply(get_suburb)
pbn_explode['town']         = pbn_explode['address'].apply(get_town)
pbn_explode['city']         = pbn_explode['address'].apply(get_city)
pbn_explode['postcode']     = pbn_explode['address'].apply(get_postcode)
pbn_explode['display_name'] = pbn_explode['geocode_list'].apply(get_display_name)
pbn_explode['local_street'] = pbn_explode.apply(get_local_street, axis=1)
pbn_explode['osm_id']       = pbn_explode['geocode_list'].apply(get_osm_id)

log_finished('Extract or derive geo fields')

2021-08-10 16:28:43.907878 START - Extract or derive geo fields
2021-08-10 16:28:47.177157 END   - Extract or derive geo fields(3.269279)
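The town/suburb/city fallbacks above all follow the same pattern: try one key, then the next, then give up.  As an alternative sketch, a single helper could take the keys in order of preference.  The name `first_present` is hypothetical, and the sample address is a cut-down illustration rather than a full Nominatim result.

```python
def first_present(address, keys, default='n/a'):
    """Return the value of the first key found in address, else default."""
    if not isinstance(address, dict):
        # Covers the 'n/a' placeholder used when no address was returned
        return default
    for key in keys:
        if key in address:
            return address[key]
    return default

# Cut-down example address with no 'town' key
address = {'road': 'Hallam Bypass Trail', 'suburb': 'Doveton'}

print(first_present(address, ('town', 'suburb', 'city')))  # Doveton
print(first_present('n/a', ('town', 'suburb', 'city')))    # n/a
```

With this helper, the order of the keys expresses the preference, so 'suburb', 'town' and 'city' differ only in the tuple that is passed in.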


Show an example of what the data looks like.

In [13]:
pbn_explode.head(30)

Unnamed: 0,objectid,network,type,status,strategic_cycling_corridor,local_name,local_type,rd_num,name,side,...,geocode_list,address,road,suburb,town,city,postcode,display_name,local_street,osm_id
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
0,1,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115
1,2,PBN & MTN,Off Road,Existing,Yes,,,0.0,,,...,"{'place_id': 3455946, 'licence': 'Data © OpenS...","{'road': 'Hallam Bypass Trail', 'suburb': 'Dov...",Hallam Bypass Trail,Doveton,Eumemmerring,Melbourne,,"Hallam Bypass Trail, Doveton, Eumemmerring, Me...",Hallam Bypass Trail,23458115


Save output to a new file, so that we can re-read it later, without having to re-do the whole reverse-geocoding process each time.

In [12]:
log_starting('Save exploded PBN dataset')

pbn_explode.to_file('pbn_exploded.geojson', driver='GeoJSON')

log_finished('Save exploded PBN dataset')

2021-08-10 16:28:56.975372 START - Save exploded PBN dataset
2021-08-10 16:30:46.066762 END   - Save exploded PBN dataset(109.09139)
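The "compute once, then re-read" idea can also be wrapped in a small helper that only runs the slow step when no saved copy exists.  The sketch below illustrates the pattern with plain JSON and a temporary directory.  The names `load_or_compute`, `slow_step` and `geocode_cache.json` are hypothetical, and the notebook itself saves GeoJSON through geopandas rather than using this helper.

```python
import json
import tempfile
from pathlib import Path

def load_or_compute(path, compute):
    """Return the saved result at path if it exists, else compute and save it."""
    path = Path(path)
    if path.exists():
        with path.open(encoding='utf-8') as f:
            return json.load(f)
    result = compute()
    with path.open('w', encoding='utf-8') as f:
        json.dump(result, f)
    return result

calls = []

def slow_step():
    """Stand-in for the expensive reverse-geocoding run."""
    calls.append(1)
    return [{'road': 'Example Street'}]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'geocode_cache.json'
    first = load_or_compute(cache, slow_step)   # computes and saves
    second = load_or_compute(cache, slow_step)  # re-reads the saved copy

print(len(calls))  # 1
```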
