# OpenStreetMap Data Case Study
______________________________________________________
### Rob Holtfreter

### Shoreline, WA, United States

* [OSM Shoreline, WA Map](https://www.openstreetmap.org/search?query=Shoreline%2C%20WA#map=12/47.7558/-122.3432)

This is a map of where I currently live. I'm curious if I can learn more about the area by querying the database that results from this project.

### Importing modules that will be needed for cleaning XML data and converting it to csv format.


In [170]:
#Importing modules.
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint
from collections import defaultdict
import re
import csv
import codecs
import cerberus
import schema
import sqlite3



### Checking out the size of my map.

In [298]:
ls -l full_map.osm 

 Volume in drive C is TI10673200G
 Volume Serial Number is 5E9D-3D3F

 Directory of C:\Users\Rob

04/07/2021  02:08 PM        69,115,084 full_map.osm
               1 File(s)     69,115,084 bytes
               0 Dir(s)  591,182,467,072 bytes free


### Checking out the tags: nodes, ways, and relations.

In [299]:
# Get element function.
def get_element(filename, tags=('node', 'way', 'relation')):
    context = iter(ET.iterparse(filename, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

In [181]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Key type function.

def key_type(element, keys):
    if element.tag == "tag":
        if lower.match(element.attrib['k']):
            keys["lower"] += 1
        elif lower_colon.search(element.attrib['k']):
            keys["lower_colon"] += 1
        elif problemchars.search(element.attrib['k']):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1
        
    return keys

### Counting the element tags in the file.

In [300]:
# Count tags function.
def count_tags(filename):
    """Return a dict mapping each element tag name to its number of occurrences."""
    tags = {}
    for event, elem in ET.iterparse(filename):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags

OSM_FILE = "full_map.osm"
tags = count_tags(OSM_FILE)  # iterparse opens the file itself
pprint.pprint(tags)

{'bounds': 1,
 'member': 32512,
 'meta': 1,
 'nd': 325548,
 'node': 282515,
 'note': 1,
 'osm': 1,
 'relation': 547,
 'tag': 185752,
 'way': 38728}
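
To show what the counting loop does without needing the 69 MB file, here is a minimal sketch that runs the same idea on a tiny in-memory document. The sample XML and the names `SAMPLE_OSM` and `count_tags_sketch` are made up for illustration, and `collections.Counter` is swapped in for the plain dict used above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample, not taken from full_map.osm.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="47.75" lon="-122.34">
    <tag k="addr:street" v="Aurora Avenue North"/>
  </node>
  <node id="2" lat="47.76" lon="-122.35"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

def count_tags_sketch(source):
    """Count element tag names while streaming through the document."""
    counts = Counter()
    # With no events argument, iterparse reports only "end" events.
    for _, elem in ET.iterparse(source):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's contents once it has been counted
    return counts

counts = count_tags_sketch(io.BytesIO(SAMPLE_OSM))
print(dict(counts))
```

Because `iterparse` accepts either a filename or a file object, the same function should also work when given the path to an .osm file.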


### Checking out the formatting scheme for the k attribute in the tags.

In [183]:
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Key type function.
def key_type(element, keys):
    """Classify the k attribute of a <tag> element and update the counts in keys."""
    if element.tag == "tag":
        k = element.attrib['k']
        if lower.search(k):
            keys['lower'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif problemchars.search(k):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys

# Process keys map function.
def process_keys_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

keys = process_keys_map(OSM_FILE)  # iterparse opens the file itself
pprint.pprint(keys)

{'lower': 82920, 'lower_colon': 101414, 'other': 1418, 'problemchars': 0}
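
To make the four categories concrete, here is a small sketch that applies the same three patterns to a handful of keys. The function name `classify_key` and the sample keys are my own, chosen for illustration rather than taken from the dataset.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def classify_key(key):
    """Return the category a tag key falls into, checked in the same order as key_type."""
    if lower.search(key):
        return "lower"
    if lower_colon.search(key):
        return "lower_colon"
    if problemchars.search(key):
        return "problemchars"
    return "other"

# Illustrative keys only.
for sample in ["highway", "addr:street", "addr street", "name_1"]:
    print(sample, "->", classify_key(sample))
```

Keys that contain digits, uppercase letters, or more than one colon match none of the three patterns and land in "other".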


## Problems Encountered

The output of the following code shows that street name abbreviations are used inconsistently in the map.osm file. For example, "Ave" and "Avenue" are used interchangeably, as are "N" and "North". As the map.osm file is a subset of the full_map.osm file, this problem must also exist in the larger file. The street name audit code further below is intended to correct this issue.

In [291]:
# Collect the values (tag.attrib['v']) for one key (tag.attrib['k']) to inspect the data.

def values_for_unique_keys(filename, key='addr:street'):
    """Print every value recorded for a single tag key.

    The key is supplied one at a time rather than looping over all distinct
    keys, because collecting the values for every key would take a long time.
    """
    values = []
    for element in get_element(filename, tags=('node', 'way', 'relation')):
        for tag in element.iter('tag'):
            if tag.attrib['k'] == key:
                values.append(tag.attrib['v'])
        element.clear()
    print(key)
    pprint.pprint(values)

# Using the smaller map.osm file as input to audit the addr:street key.
values_for_unique_keys('map.osm')

addr:street
['Aurora Avenue North',
 '8th Avenue Northwest',
 'Northwest Richmond Beach Road',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street',
 'North 205th Street - 244th Street Southwest',
 'North 205th Street',
 'Richmond Beach Rd',
 'North 205th Street',
 'North 200th Street',
 'Aurora Avenue North',
 'North 205th Street',
 'North 205th Street',
 'North 185th Street',
 '18336 AURORA AVE N',
 'Firdale Avenue',
 'Aurora Avenue North',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 'Firdale Avenue',
 

### Auditing street names.

In [307]:
# Auditing Street Names

'''
Creating a regex for street names, stored in street_type_re 
and a default dictionary that will include pairs of street names, 
where inconsistent abbreviations are matched with consistent terms.
The following code audits the datafile to look for street names that 
have an ending that is different from the values in the expected list.

'''
OSM_FILE = "full_map.osm"
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# List of expected street names.
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

# Current values and what we would like to change them to.
mapping = { "St": "Street",
            "St.": "Street",
            "Ave": "Avenue",
            "Rd": "Road",
            "Rd.": "Road",
            "Ct": "Court",
            "CT": "Court",
            "Ct.": "Court",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "BLVD": "Boulevard",
            "Dr": "Drive",
            "Dr.": "Drive",
            "DR": "Drive",
            " Ctr": " Centre",
            " Pl ": " Place ",
            " Ln ": " Lane ",
            " Cir ": " Circle ",
            " Wy": " Way",
            " S ": " South ",
            " E ": " East ",
            " W ": " West ",
            " N ": " North "
          }

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)


def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")


def audit(filename):
    street_types = defaultdict(set)
    # Use "end" events: on "start" the child <tag> elements may not be parsed yet.
    for event, elem in ET.iterparse(filename, events=("end",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
            elem.clear()
    return street_types


def update_name(name, mapping):
    for key,value in mapping.items():
        if key in name:
            return name.replace(key,value)
    return name        
# Using the smaller map.osm file as input to audit the street names.


st_types = audit('map.osm')

#pprint.pprint(dict(st_types))
for st_type, ways in st_types.items():
    for name in ways:
        better_name = update_name(name, mapping)
        print (name, "=>", better_name)


Meridian Avenue North => Meridian Avenuenue North
Aurora Avenue North => Aurora Avenuenue North
Fremont Avenue North => Fremont Avenuenue North
Aurora Ave North => Aurora Avenue North
8th Avenue Northwest => 8th Avenuenue Northwest
15th Avenue Northwest => 15th Avenuenue Northwest
244th Street Southwest => 244th Streetreet Southwest
243rd Place Southwest => 243rd Place Southwest
242nd Place Southwest => 242nd Place Southwest
North 205th Street - 244th Street Southwest => North 205th Streetreet - 244th Streetreet Southwest
Richmond Beach Rd => Richmond Beach Road
18336 AURORA AVE N => 18336 AURORA AVE N
NE 205th St => NE 205th Street
Highway 99 => Highway 99
Lake Ballinger Way => Lake Ballinger Way
5th Ave NE => 5th Avenue NE
N 202nd Pl => N 202nd Pl
91st Avenue West => 91st Avenuenue West
90th Avenue West => 90th Avenuenue West
107th Place West => 107th Place West
89th Place West => 89th Place West
87th Place West => 87th Place West
78th Place West => 78th Place West
101st Avenue West 

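The output shows that some abbreviations in the smaller map.osm were expanded correctly (for example, "Richmond Beach Rd" became "Richmond Beach Road"). However, because update_name replaces substrings, names that were already spelled out were also altered ("Avenue" became "Avenuenue" and "Street" became "Streetreet"), and some abbreviated names were left unchanged ("N 202nd Pl", "18336 AURORA AVE N"). Next, I run the same audit on the larger dataset: full_map.osm.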

In [308]:
# street_type_re, expected, mapping, audit() and update_name() are reused
# unchanged from the previous cell; only the input file differs.
# Using the full_map.osm file as input to audit street abbreviations.


st_types = audit('full_map.osm')

#pprint.pprint(dict(st_types))
for st_type, ways in st_types.items():
    for name in ways:
        better_name = update_name(name, mapping)
        print (name, "=>", better_name)


Northeast 187th Way => Northeast 187th Way
Alaskan Way => Alaskan Way
Edmonds Way => Edmonds Way
Northeast Perkins Way => Northeast Perkins Way
NE Bothell Way => NE Bothell Way
Northeast Bothell Way => Northeast Bothell Way
McAleer Way => McAleer Way
Lake Ballinger Way => Lake Ballinger Way
Cedar Way => Cedar Way
2nd Avenue Northeast => 2nd Avenuenue Northeast
63rd Place Northeast => 63rd Place Northeast
63rd Lane Northeast => 63rd Lane Northeast
15th Avenue Northeast => 15th Avenuenue Northeast
5th Avenue Northeast => 5th Avenuenue Northeast
8th Avenue Northeast => 8th Avenuenue Northeast
62nd Court Northeast => 62nd Court Northeast
81st Avenue Northeast => 81st Avenuenue Northeast
66th Court Northeast => 66th Court Northeast
58th Lane Northeast => 58th Lane Northeast
36th Court Northeast => 36th Court Northeast
10th Avenue Northeast => 10th Avenuenue Northeast
64th Place Northeast => 64th Place Northeast
125th Avenue Northeast => 125th Avenuenue Northeast
Erickson Place Northeast => 
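
Both outputs above contain names such as "Avenuenue" and "Streetreet". These appear because a substring test like `"Ave" in name` is also true for "Avenue". One possible way around this, sketched below under my own assumptions, is to compare whole words instead of substrings. The function name `expand_abbreviations` and the `ABBREVIATIONS` table are hypothetical; the table is a lowercased variant of the mapping above, with the compound directions (NE, NW, SE, SW) added as an assumption. A word-based approach has its own limits: "St" would still be expanded to "Street" even where it means "Saint".

```python
# Hypothetical lookup table: lowercase abbreviation (without a trailing period) -> full word.
ABBREVIATIONS = {
    "st": "Street", "ave": "Avenue", "rd": "Road", "ct": "Court",
    "blvd": "Boulevard", "dr": "Drive", "pl": "Place", "ln": "Lane",
    "cir": "Circle", "wy": "Way",
    "n": "North", "s": "South", "e": "East", "w": "West",
    # Assumed additions, not part of the original mapping.
    "ne": "Northeast", "nw": "Northwest", "se": "Southeast", "sw": "Southwest",
}

def expand_abbreviations(name, mapping=ABBREVIATIONS):
    """Expand abbreviations word by word so full words like 'Avenue' are left alone."""
    words = []
    for word in name.split():
        lookup = word.rstrip('.').lower()  # "Rd." and "RD" both look up as "rd"
        words.append(mapping.get(lookup, word))
    return ' '.join(words)

for sample in ["Meridian Avenue North", "Richmond Beach Rd",
               "18336 AURORA AVE N", "N 202nd Pl"]:
    print(sample, "=>", expand_abbreviations(sample))
```

Words that are not in the table pass through untouched, so a name that is already fully spelled out comes back unchanged.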

## Preparing data to create csv files and SQL database.

### Running the schema file provided in the course.

In [292]:
# Schema code provided by Udacity.
schema = {
    'node': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'lat': {'required': True, 'type': 'float', 'coerce': float},
            'lon': {'required': True, 'type': 'float', 'coerce': float},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'node_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    },
    'way': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'way_nodes': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'node_id': {'required': True, 'type': 'integer', 'coerce': int},
                'position': {'required': True, 'type': 'integer', 'coerce': int}
            }
        }
    },
    'way_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    }
}


In [172]:
OSM_PATH = "full_map.osm"

NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

SCHEMA = schema

# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']

In [173]:
def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements

    def shape_tag(child):
        """Shape one <tag> child into a dict; return None if its key has problem characters."""
        key = child.attrib['k']
        # search() scans the whole key; match() would only test its first character.
        if problem_chars.search(key):
            return None
        tag = {'id': element.attrib['id'], 'value': child.attrib['v']}
        if LOWER_COLON.match(key):
            # "addr:street" becomes type "addr" and key "street".
            tag['type'], tag['key'] = key.split(':', 1)
        else:
            tag['type'] = default_tag_type
            tag['key'] = key
        return tag

    if element.tag == 'node':
        for attrib in element.attrib:
            if attrib in node_attr_fields:
                node_attribs[attrib] = element.attrib[attrib]

        for child in element:
            if child.tag == 'tag':
                node_tag = shape_tag(child)
                if node_tag:
                    tags.append(node_tag)

        return {'node': node_attribs, 'node_tags': tags}

    elif element.tag == 'way':
        for attrib in element.attrib:
            if attrib in way_attr_fields:
                way_attribs[attrib] = element.attrib[attrib]

        position = 0
        for child in element:
            if child.tag == 'tag':
                way_tag = shape_tag(child)
                if way_tag:
                    tags.append(way_tag)
            elif child.tag == 'nd':
                way_nodes.append({'id': element.attrib['id'],
                                  'node_id': child.attrib['ref'],
                                  'position': position})
                position += 1

        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
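
To illustrate how a tag key is split into a type and a key, here is a minimal, self-contained sketch that shapes the tags of one made-up node. The helper `shape_tags` and the sample element are hypothetical; the helper is a simplified stand-in for the tag handling in shape_element, not a replacement for it.

```python
import re
import xml.etree.ElementTree as ET

LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# Made-up sample element, not taken from full_map.osm.
SAMPLE_NODE = ET.fromstring(
    '<node id="1">'
    '<tag k="addr:street" v="Aurora Avenue North"/>'
    '<tag k="amenity" v="cafe"/>'
    '<tag k="bad key" v="dropped"/>'
    '</node>'
)

def shape_tags(element, default_tag_type='regular'):
    """Hypothetical helper: return the tag rows for one node or way element."""
    rows = []
    for child in element.iter('tag'):
        key = child.attrib['k']
        if PROBLEMCHARS.search(key):
            continue  # skip keys containing problem characters
        if LOWER_COLON.match(key):
            # Split only on the first colon: "addr:street" -> ("addr", "street").
            tag_type, tag_key = key.split(':', 1)
        else:
            tag_type, tag_key = default_tag_type, key
        rows.append({'id': element.attrib['id'], 'key': tag_key,
                     'value': child.attrib['v'], 'type': tag_type})
    return rows

rows = shape_tags(SAMPLE_NODE)
for row in rows:
    print(row)
```

The key with a space in it is dropped, "addr:street" is stored with type "addr" and key "street", and "amenity" keeps the default type "regular".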



### Helper Functions

In [176]:
# Getting and validating the element.
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(iter(validator.errors.items()))  # dict.iteritems() is Python 2 only
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)
        
        raise Exception(message_string.format(field, error_string))

# Unicode DictWriter.
# Note: under Python 3, str.encode() returns bytes, and csv writes bytes values as b'...'.
class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
             k: (v.encode('utf-8') if isinstance(v, str) else v) for k, v in row.items()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

### Main Function

In [178]:
# Main function for creating csv files.
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w', "utf-8") as nodes_file, \
     codecs.open(NODE_TAGS_PATH, 'w', "utf-8") as nodes_tags_file, \
     codecs.open(WAYS_PATH, 'w', "utf-8") as ways_file, \
     codecs.open(WAY_NODES_PATH, 'w', "utf-8") as way_nodes_file, \
     codecs.open(WAY_TAGS_PATH, 'w', "utf-8") as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])

if __name__ == '__main__':
    # Note: Validation is ~ 10X slower. For the project consider using a small
    # sample of the map when validating.
    process_map(OSM_PATH, validate=True)
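
Under Python 3 the csv module accepts str values directly, so encoding each value to UTF-8 before writing is not needed and changes what ends up in the file. The sketch below, which assumes Python 3 and writes to an in-memory buffer rather than to the project's csv files, compares the two behaviours. The function `write_rows` and the sample row are hypothetical.

```python
import csv
import io

def write_rows(rows, fieldnames, encode_values):
    """Write rows with csv.DictWriter and return the resulting text."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        if encode_values:
            # Same transformation as UnicodeDictWriter.writerow applies to each value.
            row = {k: (v.encode('utf-8') if isinstance(v, str) else v)
                   for k, v in row.items()}
        writer.writerow(row)
    return buffer.getvalue()

sample = [{'id': 1, 'user': 'SeattleImport'}]  # illustrative row only
encoded = write_rows(sample, ['id', 'user'], encode_values=True)
plain = write_rows(sample, ['id', 'user'], encode_values=False)

print(encoded)
print(plain)
```

When writing to real files, opening them in text mode with `encoding='utf-8'` and `newline=''` and passing them to a plain `csv.DictWriter` should give the same result as the second call.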


# Data Overview and Additional Ideas
______________________________________________________

#### This section contains:
* Code for cleaning up problems in the csv files.
* Code for creating a database.
* Basic statistics about the dataset.
* SQL queries used to gather statistics.
* Additional ideas for using the dataset.

In [188]:
import pandas as pd
import numpy as np

In [234]:
nodes = pd.read_csv('nodes.csv')
nodes_tags = pd.read_csv('nodes_tags.csv')
ways = pd.read_csv('ways.csv')
ways_nodes = pd.read_csv('ways_nodes.csv')
ways_tags = pd.read_csv('ways_tags.csv')

### Checking out the size of each file.

In [265]:
ls -s nodes.csv

 Volume in drive C is TI10673200G
 Volume Serial Number is 5E9D-3D3F

 Directory of C:\Users\Rob

04/08/2021  11:38 AM        32,118,943 nodes.csv
               1 File(s)     32,118,943 bytes
               0 Dir(s)  592,109,125,632 bytes free


In [266]:
ls -s nodes_tags.csv

 Volume in drive C is TI10673200G
 Volume Serial Number is 5E9D-3D3F

 Directory of C:\Users\Rob

04/08/2021  11:38 AM         1,449,073 nodes_tags.csv
               1 File(s)      1,449,073 bytes
               0 Dir(s)  592,108,994,560 bytes free


In [267]:
ls -s ways.csv

 Volume in drive C is TI10673200G
 Volume Serial Number is 5E9D-3D3F

 Directory of C:\Users\Rob

04/08/2021  11:38 AM         3,189,719 ways.csv
               1 File(s)      3,189,719 bytes
               0 Dir(s)  592,108,994,560 bytes free


In [268]:
ls -s ways_nodes.csv

 Volume in drive C is TI10673200G
 Volume Serial Number is 5E9D-3D3F

 Directory of C:\Users\Rob

04/08/2021  11:38 AM         9,768,360 ways_nodes.csv
               1 File(s)      9,768,360 bytes
               0 Dir(s)  592,108,847,104 bytes free


In [269]:
ls -s ways_tags.csv

 Volume in drive C is TI10673200G
 Volume Serial Number is 5E9D-3D3F

 Directory of C:\Users\Rob

04/08/2021  11:38 AM         7,260,986 ways_tags.csv
               1 File(s)      7,260,986 bytes
               0 Dir(s)  592,108,797,952 bytes free


### File sizes
______________________________________________________

* full_map.osm: 69.1 MB
* nodes.csv: 32.1 MB
* nodes_tags.csv: 1.4 MB
* ways.csv: 3.2 MB
* ways_nodes.csv: 9.8 MB
* ways_tags.csv: 7.3 MB



### Cleaning up problems encountered with the csv files.

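After creating the csv files, I noticed that every string value (including the column headings) had been written in the form b'...', with a leading "b" and surrounding single quotes. This happens because UnicodeDictWriter calls encode('utf-8') on each string, and under Python 3 the csv module writes the text representation of the resulting bytes object. I wrote the following code to rename the columns and strip the leading "b". The surrounding single quotes remain, which is why they appear in the query results further down.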

In [233]:
nodes=nodes.rename(columns={"b'id'":"id"})
nodes=nodes.rename(columns={"b'lat'":"lat"})
nodes=nodes.rename(columns={"b'lon'":"lon"})
nodes=nodes.rename(columns={"b'user'":"user"})
nodes=nodes.rename(columns={"b'uid'":"uid"})
nodes=nodes.rename(columns={"b'version'":"version"})
nodes=nodes.rename(columns={"b'changeset'":"changeset"})
nodes=nodes.rename(columns={"b'timestamp'":"timestamp"})

In [236]:
nodes_tags=nodes_tags.rename(columns={"b'id'":"id"})
nodes_tags=nodes_tags.rename(columns={"b'key'":"key"})
nodes_tags=nodes_tags.rename(columns={"b'value'":"value"})
nodes_tags=nodes_tags.rename(columns={"b'type'":"type"})

In [244]:
ways=ways.rename(columns={"b'id'":"id"})
ways=ways.rename(columns={"b'changeset'":"changeset"})
ways=ways.rename(columns={"b'timestamp'":"timestamp"})
ways=ways.rename(columns={"b'user'":"user"})
ways=ways.rename(columns={"b'uid'":"uid"})
ways=ways.rename(columns={"b'version'":"version"})

In [251]:
ways_nodes=ways_nodes.rename(columns={"b'id'":"id"})
ways_nodes=ways_nodes.rename(columns={"b'node_id'":"node_id"})
ways_nodes=ways_nodes.rename(columns={"b'position'":"position"})

In [259]:
ways_tags=ways_tags.rename(columns={"b'id'":"id"})
ways_tags=ways_tags.rename(columns={"b'key'":"key"})
ways_tags=ways_tags.rename(columns={"b'value'":"value"})
ways_tags=ways_tags.rename(columns={"b'type'":"type"})

In [225]:
nodes['id'] = nodes.id.str[1:]
nodes['lat'] = nodes.lat.str[1:]
nodes['lon'] = nodes.lon.str[1:]
nodes['user'] = nodes.user.str[1:]
nodes['uid'] = nodes.uid.str[1:]
nodes['version'] = nodes.version.str[1:]
nodes['changeset'] = nodes.changeset.str[1:]
nodes['timestamp'] = nodes.timestamp.str[1:]

In [239]:
nodes_tags['id'] = nodes_tags.id.str[1:]
nodes_tags['key'] = nodes_tags.key.str[1:]
nodes_tags['value'] = nodes_tags.value.str[1:]
nodes_tags['type'] = nodes_tags.type.str[1:]

In [245]:
ways['id'] = ways.id.str[1:]
ways['user'] = ways.user.str[1:]
ways['uid'] = ways.uid.str[1:]
ways['version'] = ways.version.str[1:]
ways['changeset'] = ways.changeset.str[1:]
ways['timestamp'] = ways.timestamp.str[1:]

In [254]:
ways_nodes['id'] = ways_nodes.id.str[1:]
ways_nodes['node_id'] = ways_nodes.node_id.str[1:]

In [260]:
ways_tags['id'] = ways_tags.id.str[1:]
ways_tags['key'] = ways_tags.key.str[1:]
ways_tags['value'] = ways_tags.value.str[1:]
ways_tags['type'] = ways_tags.type.str[1:]
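
Slicing off the first character removes the "b" but leaves the surrounding quotes. As an alternative sketch, the whole b'...' wrapper could be parsed back into the original text with `ast.literal_eval`. The helper name `strip_bytes_repr` is hypothetical, and the sample values are illustrative.

```python
import ast

def strip_bytes_repr(value):
    """Turn the text "b'abc'" back into "abc"; return anything else unchanged."""
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        try:
            parsed = ast.literal_eval(value)  # safely parses the bytes literal
        except (ValueError, SyntaxError):
            return value  # not a well-formed bytes literal, leave it alone
        if isinstance(parsed, bytes):
            return parsed.decode('utf-8')
    return value

for sample in ["b'SeattleImport'", "b'47.7558'", "b'caf\\xc3\\xa9'", 12]:
    print(repr(sample), "->", repr(strip_bytes_repr(sample)))
```

With pandas, a helper like this could be applied to a column through `Series.map`, after which numeric columns could be converted with `pd.to_numeric`.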

### Creating the database.

In [229]:
from sqlalchemy import create_engine

engine = create_engine('sqlite://', echo=False)

### Creating 'nodes' table.

In [230]:
nodes.to_sql('nodes', con=engine)

### Number of nodes.

In [231]:
engine.execute("SELECT count(DISTINCT(id)) FROM nodes;").fetchall()

[(282515,)]

### Creating 'nodes_tags' table.

In [241]:
nodes_tags.to_sql('nodes_tags', con=engine)

### Number of nodes that have tags.

In [248]:
engine.execute("SELECT count(DISTINCT(id)) FROM nodes_tags;").fetchall()

[(8950,)]

### Creating 'ways' table.

In [247]:
ways.to_sql('ways', con=engine)

### Number of ways.

In [249]:
engine.execute("SELECT count(DISTINCT(id)) FROM ways;").fetchall()

[(38728,)]

### Creating 'ways_nodes' table.

In [256]:
ways_nodes.to_sql('ways_nodes', con=engine)

### Number of ways that reference nodes.

In [257]:
engine.execute("SELECT count(DISTINCT(id)) FROM ways_nodes;").fetchall()

[(38728,)]

### Creating 'ways_tags' table.

In [261]:
ways_tags.to_sql('ways_tags', con=engine)

### Number of ways that have tags.

In [262]:
engine.execute("SELECT count(DISTINCT(id)) FROM ways_tags;").fetchall()

[(38535,)]
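
The tag queries above use count(DISTINCT(id)), which counts the elements that carry at least one tag rather than the tags themselves. The sketch below shows the difference on a tiny in-memory table, using the standard library sqlite3 module instead of the SQLAlchemy engine. The three rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")
# Made-up rows: node 1 has two tags, node 2 has one.
conn.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?, ?)",
    [
        (1, 'street', 'Aurora Avenue North', 'addr'),
        (1, 'housenumber', '18336', 'addr'),
        (2, 'amenity', 'cafe', 'regular'),
    ],
)

tag_rows = conn.execute("SELECT COUNT(*) FROM nodes_tags").fetchone()[0]
tagged_nodes = conn.execute("SELECT COUNT(DISTINCT id) FROM nodes_tags").fetchone()[0]
conn.close()

print("tag rows:", tag_rows)
print("nodes with at least one tag:", tagged_nodes)
```

On the real tables, running `SELECT COUNT(*)` alongside the distinct counts would give the total number of tag rows as well.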

### Number of unique users.

In [270]:
engine.execute("SELECT COUNT(DISTINCT(e.uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) AS e;").fetchall()

[(879,)]

### Top 10 contributing users

In [274]:
engine.execute("SELECT e.user, COUNT(*) as num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e GROUP BY e.user ORDER BY num DESC LIMIT 10;").fetchall()

[("'SeattleImport'", 70247),
 ("'patricknoll_import'", 51383),
 ("'AndrewKvalheim_import'", 33121),
 ("'Natfoot'", 17282),
 ("'Glassman_Import'", 12816),
 ("'Glassman'", 11718),
 ("'STBrenden'", 6970),
 ("'sctrojan79'", 5387),
 ("'compdude'", 4525),
 ("'KiloCrimson'", 4036)]

### Node tag types and counts.

In [271]:
engine.execute("SELECT type , count(*) as num  FROM nodes_tags group by type order by num desc;").fetchall()

[("'addr'", 16150),
 ("'regular'", 11922),
 ("'gtfs'", 554),
 ("'source'", 177),
 ("'brand'", 145),
 ("'gnis'", 139),
 ("'survey'", 88),
 ("'seamark'", 51),
 ("'contact'", 32),
 ("'railway'", 24),
 ("'bridge'", 22),
 ("'sdot'", 16),
 ("'checked_exists'", 15),
 ("'name'", 11),
 ("'ref'", 9),
 ("'opening_hours'", 9),
 ("'recycling'", 7),
 ("'construction'", 6),
 ("'traffic_signals'", 5),
 ("'tower'", 4),
 ("'service'", 4),
 ("'operator'", 3),
 ("'healthcare'", 3),
 ("'check_date'", 3),
 ("'census'", 3),
 ("'crossing'", 2),
 ("'toilets'", 1),
 ("'social_facility'", 1),
 ("'payment'", 1),
 ("'fire_hydrant'", 1),
 ("'disused'", 1),
 ("'diet'", 1),
 ("'communication'", 1),
 ("'capacity'", 1),
 ("'building'", 1),
 ("'access'", 1),
 ("'abandoned'", 1)]

### Cuisine types and count.

In [272]:
engine.execute("select value,count(*) as num from (select key,value from nodes_tags UNION ALL select key,value from ways_tags) as e where e.key like '%cuisine%' group by value order by num desc limit 20;").fetchall()

[("'coffee_shop'", 24),
 ("'pizza'", 16),
 ("'sandwich'", 14),
 ("'burger'", 12),
 ("'mexican'", 11),
 ("'chinese'", 9),
 ("'american'", 7),
 ("'thai'", 6),
 ("'vietnamese'", 4),
 ("'mediterranean'", 3),
 ("'korean'", 3),
 ("'chicken'", 3),
 ("'tex-mex'", 2),
 ("'seafood'", 2),
 ("'frozen_yogurt'", 2),
 ("'barbecue'", 2),
 ("'asian'", 2),
 ("'vietnamese;sandwich;bubble_tea;boba;coffe;milkshake'", 1),
 ("'teriyaki'", 1),
 ("'taiwanese'", 1)]

### Looking for nodes and ways tags associated with the light rail.

In [281]:
engine.execute("select count(*) from (select key,value from nodes_tags UNION ALL select key,value from ways_tags) as e  where key like '%rail%';").fetchall()

[(158,)]

In [283]:
engine.execute("select value,count(*) as num from (select key,value from nodes_tags UNION ALL select key,value from ways_tags) as e where e.key like '%rail%' group by value order by num desc limit 20;").fetchall()

[("'abandoned'", 62),
 ("'construction'", 19),
 ("'light_rail'", 18),
 ("'switch'", 11),
 ("'signal'", 11),
 ("'rail'", 10),
 ("'razed'", 6),
 ("'station'", 4),
 ("'level_crossing'", 3),
 ("'derail'", 3),
 ("'yes'", 2),
 ("'site'", 2),
 ("'crossing'", 2),
 ("'bad'", 2),
 ("'intermediate'", 1),
 ("'excellent'", 1),
 ("'buffer_stop'", 1)]

### Looking for nodes and ways tags associated with other public transit.

In [287]:
engine.execute("select count(*) from (select key,value from nodes_tags UNION ALL select key,value from ways_tags) as e  where key like '%bus%';").fetchall()

[(419,)]

### Percentage of nodes containing references to public transit.

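The preceding two queries return 577 matching tags in total (158 for rail and 419 for bus). Relative to the 282,515 nodes in the dataset, that is about 0.2% (577/282,515). Because the counts cover both node and way tags, and a single element can carry several matching tags, the share of nodes that reference public transit, which in this case includes the light rail and city buses, is at most this figure.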
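
Working the ratio through in a few lines of arithmetic, using only the counts reported by the queries above:

```python
rail_tags = 158       # tags whose key contains "rail"
bus_tags = 419        # tags whose key contains "bus"
total_nodes = 282515  # distinct node ids in the nodes table

transit_tags = rail_tags + bus_tags
share = transit_tags / total_nodes

# Show the value both as a fraction and as a percentage.
print(f"{transit_tags} / {total_nodes} = {share:.4f} = {share:.2%}")
```

As a fraction the result is about 0.0020, which corresponds to roughly 0.20% when expressed as a percentage.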


### Ideas for additional improvements

Information concerning proximity to public transit could be added to the dataset for each node. Each node includes a pair of coordinates in lat/lon. In QGIS, an OSM tool exists that will batch process the distance between all of the points (i.e. nodes) in one dataset and all of the points in another dataset. That data could be added for each node in the dataset, along with the type of transit, so that a user could find the nearest light rail station to their favorite coffee shop, for example.
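
As a rough illustration of the distance calculation itself, here is a sketch in plain Python rather than QGIS. It uses the haversine formula to find the closest of a few candidate points. The station names and coordinates are hypothetical placeholders, not values from the dataset, and the Earth radius is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, approximate

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest(point, candidates):
    """Return the (name, lat, lon) candidate closest to point, a (lat, lon) pair."""
    return min(candidates,
               key=lambda c: haversine_km(point[0], point[1], c[1], c[2]))

# Hypothetical coordinates for illustration only.
coffee_shop = (47.7558, -122.3432)
stations = [("station_a", 47.76, -122.32), ("station_b", 47.70, -122.33)]

closest = nearest(coffee_shop, stations)
print(closest[0], round(haversine_km(*coffee_shop, closest[1], closest[2]), 2), "km")
```

Comparing every node against every station this way grows with the product of the two counts, which is the scaling problem described in the next section.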

### Anticipated problems in implementing the improvement

It would take an enormous amount of time to batch process the above mentioned data in QGIS. If multiple users broke the dataset up into smaller parts and each ran a batch process, it likely could be done quickly and without frying any computers.

### Conclusion

My Shoreline, Washington dataset was moderately large, but was relatively clean considering almost 900 users contributed to creating the OSM. Additional changes would likely improve the cleanliness of the data; however, the data isn't in bad shape now. As mentioned above, one potential improvement might be the inclusion of a proximity to public transit stat for each node.
