# Address Interpolation

Learn to project addresses onto streets without prior knowledge of the local address reference system.

## Variables

* House number
* Street name
* Street tags (optional)
* Admin hierarchy (optional)
* Coordinates of projection onto street (target)

## Objective

Minimize the distance between the target projection and the predicted projection.
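One way to make this objective concrete is a great-circle distance in metres between the two points. The sketch below assumes both projections are `(lat, lon)` pairs in WGS84 degrees; the names `haversine_m` and `mean_projection_error` are illustrative and not part of the notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mean_projection_error(targets, predictions):
    """Mean distance in metres over paired (lat, lon) tuples."""
    distances = [haversine_m(*t, *p) for t, p in zip(targets, predictions)]
    return sum(distances) / len(distances)
```

A squared or otherwise differentiable variant of this distance may be a better fit as a training loss; the version above is meant for evaluation.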

## Dataset Extraction

* House number and street name come from OpenAddresses.
* Street tags and geometry come from OSM. Street geometry is used to derive the target variable. Streets in OSM have a `highway` tag. We might need to figure out how to construct streets from their constituent ways. See pelias/interpolation for polyline construction code.
* Admin hierarchy will be fetched from a point-in-polygon service.
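Deriving the target variable from street geometry amounts to finding the closest point on a polyline. The following is a minimal sketch, assuming the street is already available as an ordered list of `(lat, lon)` vertices and that a local flat-earth approximation is accurate enough at street scale. The function name is hypothetical; pelias/interpolation has its own implementation, which this does not reproduce.

```python
import math


def project_onto_polyline(point, polyline):
    """Return the closest (lat, lon) on the polyline to the given (lat, lon) point.

    The polyline needs at least two vertices; otherwise None is returned.
    """
    lat0, lon0 = point
    # Longitude degrees shrink with latitude; scale them to match latitude degrees.
    scale = math.cos(math.radians(lat0))

    def to_xy(lat, lon):
        # Local planar coordinates with the query point at the origin.
        return ((lon - lon0) * scale, lat - lat0)

    best, best_d2 = None, math.inf
    for (lat_a, lon_a), (lat_b, lon_b) in zip(polyline, polyline[1:]):
        ax, ay = to_xy(lat_a, lon_a)
        bx, by = to_xy(lat_b, lon_b)
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        # Position of the orthogonal projection along the segment, clamped to [0, 1].
        t = 0.0 if seg_len2 == 0 else max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len2))
        px, py = ax + t * dx, ay + t * dy
        d2 = px * px + py * py
        if d2 < best_d2:
            best_d2 = d2
            best = (lat_a + t * (lat_b - lat_a), lon_a + t * (lon_b - lon_a))
    return best
```

Calling it with the OpenAddresses coordinates of an address and the vertices of its street would yield the projected point to use as the target.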

## Training

Due to the sequential nature of addresses, a recurrent neural network might be well-suited for the task.


In [5]:
import pandas as pd
import numpy as np
import json_lines
import subprocess
import urllib.parse
import json
import operator
import functools

### Load OpenAddresses Data

Downloaded from https://openaddresses.io/

In [84]:
df = pd.read_csv('./data/tmp/berlin.csv')

In [85]:
df.tail()

Unnamed: 0,LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH
375335,13.42943,52.491522,1,Liberdastraße,,Berlin,,,0,,df207073db4e670f
375336,13.14031,52.453649,20 C,Sakrower Landstraße,,Berlin,,,0,,2029a7431cca1768
375337,13.20838,52.558394,15,Streitstraße,,Berlin,,,0,,082573c0d67906c7
375338,13.386425,52.515338,14,Behrenstraße,,Berlin,,,0,,5a17069f536d2b44
375339,13.411902,52.581912,27,Siegfriedstraße,,Berlin,,,0,,edba5ab987d1973a


In [86]:
df.columns = df.columns.str.lower()

In [87]:
df['street'].value_counts()

Hauptstraße                  971
Berliner Straße              809
Köpenicker Straße            704
Waldstraße                   682
Dorfstraße                   589
Heerstraße                   575
Schillerstraße               551
Kastanienallee               538
Lindenstraße                 534
Schulzendorfer Straße        534
Mariendorfer Damm            493
Goethestraße                 479
Uhlandstraße                 466
Kladower Damm                460
Müllerstraße                 458
Kurfürstenstraße             458
Wendenschloßstraße           455
Bahnhofstraße                438
Scharnweberstraße            426
Rudower Straße               413
Charlottenstraße             400
Pilgramer Straße             396
Bernauer Straße              388
Brunsbütteler Damm           387
Parkstraße                   379
Johannisthaler Chaussee      369
Hildburghauser Straße        366
Ahornallee                   366
Ringstraße                   363
Landsberger Allee            362
          

### Number of unique streets

In [88]:
df['street'].nunique()

9011

In [89]:
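def street_min_max(street):
    """Return the smallest and largest house number on a street.

    While the `number` column still holds strings, the comparison is lexicographic.
    """
    addrs = df[df['street'] == street]
    return min(addrs['number']), max(addrs['number'])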

def get_street(street):
    return df[df['street'] == street]

def missing_elements(L):
    L = sorted(L)
    start, end = L[0], L[-1]
    return set(range(start, end + 1)).difference(L)

In [90]:
street_min_max('Schwedter Straße')

('1', '90')

In [91]:
df['number'] = df['number'].apply(lambda x: x.split()[0])

In [92]:
df['number'] = pd.to_numeric(df['number'])

### Number of missing house numbers on Schwedter Straße:

In [93]:
len(missing_elements(list(get_street('Schwedter Straße')['number'])))

174

In [94]:
def has_missing_addr(street):
    return len(missing_elements(list(get_street(street)['number']))) > 0

### Percentage of streets with missing addresses:

In [107]:
%%time
missing = [has_missing_addr(street) for street in df['street'].unique()]

CPU times: user 3min 27s, sys: 854 ms, total: 3min 27s
Wall time: 3min 34s
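The run above filters the whole dataframe once per street, which probably accounts for most of the runtime. A single pass that groups house numbers by street should give the same answer much faster. The sketch below is an assumption-laden alternative, not the notebook's method: `streets_with_gaps` is a hypothetical helper that takes `(street, number)` pairs with integer house numbers.

```python
from collections import defaultdict


def streets_with_gaps(rows):
    """Map each street to True if numbers between its min and max are missing.

    `rows` is an iterable of (street, number) pairs with integer house numbers.
    """
    numbers_by_street = defaultdict(set)
    for street, number in rows:
        numbers_by_street[street].add(int(number))
    # A street has a gap when it holds fewer distinct numbers than its range spans.
    return {
        street: (max(nums) - min(nums) + 1) > len(nums)
        for street, nums in numbers_by_street.items()
    }
```

With the dataframe from this notebook it could be called as `streets_with_gaps(zip(df['street'], df['number']))` once `number` has been converted to integers.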


In [109]:
m = np.array(missing)
vals, counts = np.unique(m, return_counts=True)

In [124]:
"{:.2f}%".format(counts[1] / counts.sum() * 100)

'79.59%'

In [127]:
print(vals, counts)

[False  True] [1839 7172]


### Prepare Street Data

#### Using Osmosis

##### Prerequisites
* Download OSM data in pbf format from https://download.geofabrik.de/
* Install osmosis https://github.com/openstreetmap/osmosis
* Install osmtogeojson https://github.com/tyrasd/osmtogeojson

##### Run from the command line
<pre>
osmosis --read-pbf berlin-latest.osm.pbf --tf accept-ways highway=* --used-node --write-xml data/tmp/berlin-highways.osm

osmtogeojson data/tmp/berlin-highways.osm > data/berlin-streets.geojson
</pre>

#### Using pbf2json

##### Prerequisites
* Download OSM data in pbf format from https://download.geofabrik.de/
* Download pbf2json https://github.com/pelias/pbf2json

##### Run from the command line
<pre>
./build/pbf2json.darwin-x64 -tags="highway+name" /pelias/data/openstreetmap/berlin-latest.osm.pbf > data/berlin-streets.json
</pre>

In [8]:
streets = []

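def gen_street_data():
    """Append one flat dict (id, centroid lat/lon, tags) per street to `streets`."""
    with open('./data/berlin-streets.json', 'rb') as f:
        # broken=True lets the reader tolerate a truncated or partially corrupt file
        for line in json_lines.reader(f, broken=True):
            # Skip records without a centroid; they cannot be located
            if 'centroid' not in line:
                continue
            centroid = line['centroid']
            tags = line['tags']
            id_ = {'id': str(line['id'])}
            street = {**id_, **centroid, **tags}
            streets.append(street)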

In [9]:
gen_street_data()

In [135]:
streets[:5]

[{'id': '4045150',
  'lat': '52.3743965',
  'lon': '13.6086490',
  'highway': 'residential',
  'maxspeed': '50',
  'name': 'Waldstraße',
  'postal_code': '15732',
  'sidewalk': 'both',
  'surface': 'asphalt'},
 {'id': '4045194',
  'lat': '52.4932706',
  'lon': '13.5314388',
  'description': 'Ursula Goetze (1907-1943), member of the German Resistence, senteced to death by the Reich Court-martial',
  'highway': 'residential',
  'lit': 'yes',
  'maxspeed': '30',
  'name': 'Ursula-Goetze-Straße',
  'old_name': 'Waterbergstraße',
  'parking:condition:left': 'free',
  'parking:lane:left': 'parallel',
  'parking:lane:left:parallel': 'on_street',
  'postal_code': '10318',
  'sidewalk': 'both',
  'source:maxspeed': 'DE:zone30',
  'surface': 'asphalt'},
 {'id': '4045220',
  'lat': '52.4891055',
  'lon': '13.5252426',
  'highway': 'residential',
  'lanes': '1',
  'lit': 'yes',
  'maxspeed': '30',
  'name': 'Hönower Straße',
  'postal_code': '10318',
  'sidewalk': 'both',
  'surface': 'cobblestone

### Address Dataframe
We can go through each street in the PBF extract, request an interpolation of every address on that street, and store each address together with the street tags in a dataframe.

#### Known tags:
Booleans are represented as "yes" or "no".

* lit: boolean
* maxspeed: float
* postal_code: string
* surface: string
* highway: string
* oneway: boolean
* bicycle: boolean
* parking:condition:left: string
* parking:condition:right: string
* parking:condition:both: string
* parking:lane:left: string
* parking:lane:right: string
* parking:lane:both: string
* lanes: int
* cycleway:left: string
* cycleway:right: string
* cycleway:both: string
* cycleway: string
* foot: string
* sidewalk: string
* smoothness: string
* lit_by_gaslight: boolean
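The tag values arrive as strings, so they need converting before they can be used as features. The sketch below follows the types listed above; `parse_tag` and `parse_tags` are hypothetical helpers. Returning `None` for values that do not parse is an assumption, made because OSM tags are free-form (for example `maxspeed=walk` or `oneway=-1`).

```python
BOOLEAN_TAGS = {'lit', 'oneway', 'bicycle', 'lit_by_gaslight'}
FLOAT_TAGS = {'maxspeed'}
INT_TAGS = {'lanes'}


def parse_tag(key, value):
    """Convert a raw tag string to its listed type; None if it does not parse."""
    if key in BOOLEAN_TAGS:
        # Anything other than "yes" or "no" maps to None.
        return {'yes': True, 'no': False}.get(value)
    if key in FLOAT_TAGS or key in INT_TAGS:
        try:
            return float(value) if key in FLOAT_TAGS else int(value)
        except ValueError:
            return None
    # All remaining tags are kept as strings.
    return value


def parse_tags(tags):
    """Apply parse_tag to every entry of a tag dict."""
    return {key: parse_tag(key, value) for key, value in tags.items()}
```

Applied to a street dict from `streets`, this leaves `id`, `lat`, `lon` and `name` as strings, so coordinates would still need their own conversion.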


In [11]:
# https://github.com/pelias/interpolation

def run_interpolation_extract(lat, lon, street, filename):
    """Create a JSON file containing the addresses on a street and their attributes."""
    subprocess.run(f"/Users/yunussh/workspace/interpolation/interpolate extract /pelias/address.db /pelias/street.db '{lat}' '{lon}' '{street}' 'geojson' | jq '.features | map(.properties) | map(select(.housenumber))' > data/streets/{filename}.json", shell=True)

def create_street_json(street):
    filename = urllib.parse.quote_plus(street['id'] + '__' + street['name'])
    run_interpolation_extract(street['lat'], street['lon'], street['name'], filename)
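Because the command is run with `shell=True` and wraps the street name in single quotes, a name containing an apostrophe would break it. A possible safeguard is to quote every argument with `shlex.quote`. The sketch below only builds the command string; `build_extract_command` and its parameters are hypothetical, and the `interpolate extract` arguments and `jq` filter are copied from the cell above.

```python
import shlex


def build_extract_command(binary, address_db, street_db, lat, lon, street, out_path):
    """Build the `interpolate extract ... | jq ... > out_path` pipeline with quoted arguments."""
    args = [binary, 'extract', address_db, street_db, lat, lon, street, 'geojson']
    extract = ' '.join(shlex.quote(str(arg)) for arg in args)
    jq_filter = '.features | map(.properties) | map(select(.housenumber))'
    return f"{extract} | jq {shlex.quote(jq_filter)} > {shlex.quote(out_path)}"
```

The resulting string can be passed to `subprocess.run(..., shell=True)` in place of the f-string used above.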

In [None]:
for street in streets:
    create_street_json(street)

In [None]:
def extract_addresses(street):
    filename = urllib.parse.quote_plus(street['id'] + '__' + street['name'])

    with open(f'data/streets/{filename}.json') as f:
        return json.load(f)

In [None]:
%%time
nested = [extract_addresses(x) for x in streets]

In [None]:
%%time
addresses = functools.reduce(operator.iconcat, nested, [])
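Flattening `nested` this way drops the link between an address and its street. To store each address together with its street tags, the two lists can be walked in parallel instead. This is a minimal sketch: `attach_street_tags` and the `street_` prefix are hypothetical, and the address dicts are treated as opaque because their exact fields depend on the interpolation output.

```python
def attach_street_tags(streets, nested_addresses, tag_keys=('highway', 'maxspeed', 'surface')):
    """Return one flat dict per address, extended with selected fields of its street."""
    rows = []
    for street, street_addresses in zip(streets, nested_addresses):
        # Prefix street fields so they cannot collide with address fields.
        street_fields = {f'street_{key}': street.get(key) for key in ('id', 'name', *tag_keys)}
        for address in street_addresses:
            rows.append({**address, **street_fields})
    return rows
```

The result can be turned into a dataframe with `pd.DataFrame(attach_street_tags(streets, nested))`; tags a street does not carry come out as missing values.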