Data Wrangling with OpenStreetMap Data
======================================

Step One - Finish Lesson 6
-------------------------

Done.

Step Two - Review the Rubric and Sample Project
--------------------------------------------

Done.

Step Three - Choose Your Map Area
-------------------------------

We use [Mapzen](https://mapzen.com/data/metro-extracts) to download the preselected metro extract for San Francisco, California. After uncompressing, the dataset size is 648.9 MB.

Step Four - Process Your Dataset
------------------------------

### 4.1 Audit the dataset

#### Sanity check on element types

As a sanity check, we go through the dataset and list all element types. We expect to see the following:
  * `osm`: the top-level root element
  * `node`, `way` and `relation`: instances of the data primitives
  * `tag`: a general-purpose element holding a key/value pair
  * `nd`: used inside a `way` to reference a `node` element
  * `member`: used inside a `relation`

In [1]:
import xml.etree.cElementTree as ET

xml_file = "data/san-francisco_california.osm"
elem_types = set()
for _, elem in ET.iterparse(xml_file):
    # If the element type is not seen before, add it to the set.
    if elem.tag not in elem_types:
        elem_types.add(elem.tag)

print elem_types

set(['node', 'nd', 'bounds', 'member', 'tag', 'relation', 'way', 'osm'])


The only "unexpected" element type is `bounds`, which occurs only once in the dataset to indicate the bounding box of this dataset.
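
The same sanity check can also report how often each element type occurs. The following is a minimal sketch, not part of the original notebook: the function name `count_element_types` and the tiny inline sample are assumptions made for illustration, and it uses `xml.etree.ElementTree` so that it also runs on current Python versions. It clears each top-level element once it has been processed, which keeps memory use low on a file of this size.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_element_types(source):
    """Count how often each element type occurs in an OSM XML source."""
    counts = Counter()
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        counts[elem.tag] += 1
        depth -= 1
        # depth == 1 means a direct child of <osm> has just been closed;
        # its attributes and children are no longer needed.
        if depth == 1:
            elem.clear()
    return counts

# Hypothetical miniature extract, used only to exercise the function.
sample = b"""<osm>
  <bounds minlat="37.0" minlon="-123.0" maxlat="38.0" maxlon="-122.0"/>
  <node id="1" lat="37.77" lon="-122.42"><tag k="addr:street" v="Haight Street"/></node>
  <node id="2" lat="37.78" lon="-122.41"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

element_counts = count_element_types(io.BytesIO(sample))
print(sorted(element_counts.items()))
```

Only top-level elements are cleared. Clearing every element on its `end` event would wipe the attributes of `tag` children before their parent `node` is processed.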

#### Street address types

Next, we look at the address types that appear on `node`s and count how many times each one occurs in the dataset. The address type is the `XXX` part of the key in a `<tag k="addr:XXX" v="xxxxxx">` child element.

In [2]:
import operator

address_types = {}
for _, elem in ET.iterparse(xml_file):
    if elem.tag == "node":
        for tag in elem:
            if tag.attrib['k'].startswith("addr:"):
                k = tag.attrib['k'][5:]
                address_types.setdefault(k, 0)
                address_types[k] += 1

for k,v in sorted(address_types.items(), key=operator.itemgetter(1), reverse=True):
    print k, ':', v

housenumber : 17746
street : 14982
city : 12790
postcode : 2483
state : 1620
country : 727
housename : 75
unit : 52
full : 31
county : 14
floor : 2
housenumber:source : 2
suite : 2
interpolation : 2
pier : 1
province : 1
place : 1
door : 1


The most frequent address types are `housenumber` (17746), `street` (14982), `city` (12790) and `postcode` (2483). The rare ones are `floor`, `housenumber:source`, `suite`, `interpolation`, `pier`, `province`, `place` and `door`, each of which appears only once or twice in the dataset.

#### Street names

With the address types in hand, we audit the street name types in the dataset. The street name type is extracted as the last word of the value in a `<tag k="addr:street" v="xxxxxx">` tag and then lowercased. We also keep up to two examples for each street name type.
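
To make the "last word" rule concrete, here is a small sketch that applies the same regular expression to a few street values taken from the audit output. The helper name `street_type` is an assumption introduced for illustration.

```python
import re

# Same pattern as in the audit: the last whitespace-delimited word,
# including a trailing period if there is one.
last_word_re = re.compile(r'\b\S+\.?$')

def street_type(street):
    """Return the lowercased last word of a street value, or None."""
    match = last_word_re.search(street)
    return match.group().lower() if match else None

for s in ["Haight Street", "S Delaware St", "Floribunda Ave.",
          "18th Street Ste 410", "Avenue H"]:
    print("%s -> %s" % (s, street_type(s)))
```

Note that a trailing period is kept in the extracted word, so `Ave.` and `Ave` are counted as separate types during the audit.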

In [3]:
from collections import defaultdict
import re

street_name_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

street_names = defaultdict(int)
street_name_examples = defaultdict(list)

for _, elem in ET.iterparse(xml_file):
    if elem.tag == "node":
        for tag in elem:
            if tag.attrib['k'] == "addr:street":
                street = tag.attrib['v']
                match = street_name_re.search(street)
                if match:
                    name = match.group().lower()
                    street_names[name] += 1
                    if street_names[name] < 3:
                        street_name_examples[name].append(street)

for k,v in sorted(street_names.items(), key=operator.itemgetter(1), reverse=True):
    print k, ':', v, '[', ', '.join(street_name_examples[k]), (v>=3 and ",...]" or ']')

street : 7879 [ Haight Street, Haight Street ,...]
avenue : 3672 [ College Avenue, College Avenue ,...]
road : 806 [ Gouldin Road, Primrose Road ,...]
way : 544 [ Bancroft Way, Martin Luther King Jr Way ,...]
drive : 468 [ Laird Drive, Middlefield Drive ,...]
boulevard : 319 [ West Hillsdale Boulevard, Fremont Boulevard ,...]
broadway : 227 [ Broadway, Broadway ,...]
real : 215 [ South El Camino Real, El Camino Real ,...]
court : 207 [ Prescott Court, Prescott Court ,...]
place : 82 [ William Saroyan Place, Romolo Place ,...]
lane : 64 [ Laurel Lane, Laurel Lane ,...]
st : 52 [ Park St, S Delaware St ,...]
alameda : 50 [ The Alameda, The Alameda ,...]
circle : 47 [ Holly Park Circle, Columbia Circle ,...]
ave : 39 [ Floribunda Ave, Paloma Ave ,...]
plaza : 26 [ Lakeshore Plaza, Civic Center Plaza ,...]
plz : 23 [ Woodside Plz, Woodside Plz ,...]
center : 21 [ Fort Mason Center, Seramonte Center ,...]
square : 18 [ Shattuck Square, Jack London Square ,...]
path : 18 [ Parnassus Path, Parnassus Path ,...]

From the audit above, the common street name types (the ones we expected to see) are:
  * street (7879)
  * avenue (3672)
  * road (806)
  * way (544)
  * drive (468)
  * boulevard (319)
  * broadway (227)
  * lane (64)
  * plaza (26)
  * square (18)
  * parkway (11)
  * highway (9)
  * walk (9)
    
Some common abbreviations are:

  * st (52) => street
  * ave (39) => avenue
  * plz (23) => plaza
  * blvd (17) => boulevard
  * st (10) => street
  * dr (8) => drive
  * rd (6) => road
  * ave. (5) => avenue
  * blvd. (1) => boulevard

There are also some valid but less common street name types (ones we did not expect to see):

  * real : 215 [ South El Camino Real, El Camino Real ,...]
  * court : 207 [ Prescott Court, Prescott Court ,...]
  * place : 82 [ William Saroyan Place, Romolo Place ,...]
  * alameda : 50 [ The Alameda, The Alameda ,...]
  * circle : 47 [ Holly Park Circle, Columbia Circle ,...]
  * center : 21 [ Fort Mason Center, Seramonte Center ,...]
  * path : 18 [ Parnassus Path, Parnassus Path ,...]
  * terrace : 16 [ Greenwood Terrace, Hawthorne Terrace ,...]
  * embarcadero : 10 [ The Embarcadero, The Embarcadero ,...]
  * bridgeway : 4 [ Bridgeway, Bridgeway ,...]
  
The remaining street names are either very uncommon or erroneously formatted (e.g. `leimert : 2 [ Leimert, Leimert ]`, `hall : 1 [ McCone Hall ]`, `410 : 1 [ 18th Street Ste 410 ]`, `h : 1 [ Avenue H ]`).

### 4.2 Clean the dataset and convert to JSON

#### Clean street names

Based on the audit above, we clean the street names with the following rules:
  * keep the valid street name types, both the common ones (e.g. `street`, `road`) and the uncommon ones (e.g. `real`, `embarcadero`);
  * convert abbreviations to the full name, e.g. `st => street`, `blvd => boulevard`;
  * drop the very uncommon or erroneously formatted names, e.g. `410 : [ 18th Street Ste 410 ]`, `h : [ Avenue H ]`.
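
The three rules above could be sketched as follows. This is an illustrative sketch rather than the notebook's actual cleaning code: the names `VALID_TYPES`, `ABBREVIATIONS` and `clean_street_name` are assumptions, the type lists are copied from the audit, and writing the expanded type with a capital first letter (e.g. `Street`) is a choice made here to match the existing data.

```python
import re

# Street name types that are kept unchanged (from the audit lists).
VALID_TYPES = set([
    "street", "avenue", "road", "way", "drive", "boulevard", "broadway",
    "lane", "plaza", "square", "parkway", "highway", "walk", "real",
    "court", "place", "alameda", "circle", "center", "path", "terrace",
    "embarcadero", "bridgeway",
])

# Abbreviations seen in the audit and their full names.
ABBREVIATIONS = {
    "st": "Street", "ave": "Avenue", "plz": "Plaza",
    "blvd": "Boulevard", "dr": "Drive", "rd": "Road",
}

street_type_re = re.compile(r'\b\S+\.?$')

def clean_street_name(street):
    """Return the cleaned street name, or None if it should be dropped."""
    match = street_type_re.search(street)
    if not match:
        return None
    # Ignore case and a trailing period when looking up the type.
    key = match.group().lower().rstrip(".")
    if key in VALID_TYPES:
        return street
    if key in ABBREVIATIONS:
        return street[:match.start()] + ABBREVIATIONS[key]
    return None

for s in ["El Camino Real", "Park St", "Floribunda Ave.", "Avenue H"]:
    print("%s -> %s" % (s, clean_street_name(s)))
```

Returning `None` for dropped names lets the caller decide whether to omit only the street field or the whole address.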

In [24]:
def get_element_created(elem):
    """Get a 'created' dictionary from an element, that has the following keys:
    'version', 'changeset', 'timestamp', 'uid', 'user'."""
    created = {}
    for k in ["version", "changeset", "timestamp", "uid", "user"]:
        if k in elem.attrib:
            created[k] = elem.attrib[k]
    return created

#### Audit street address types

In [23]:
address_types = {}
for _, elem in ET.iterparse(xml_file):
    if elem.tag == "node":
        for tag in elem:
            if tag.attrib['k'].startswith("addr:"):
                k = tag.attrib['k'][5:]
                address_types.setdefault(k, 0)
                address_types[k] += 1
for k,v in address_types.items():
    print k, ':', v

pier : 1
city : 12790
full : 31
province : 1
floor : 2
country : 727
housenumber:source : 2
county : 14
place : 1
state : 1620
street : 14982
housename : 75
postcode : 2483
suite : 2
door : 1
housenumber : 17746
unit : 52
interpolation : 2


In [25]:
def get_node_address(node):
    pass
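
The notebook leaves `get_node_address` as a stub. One possible shape for it, under the assumption that the address should be a dictionary keyed by the address types audited above, is sketched below. It is given the hypothetical name `sketch_node_address` to make clear that it is not the author's implementation.

```python
import xml.etree.ElementTree as ET

def sketch_node_address(node):
    """Collect a node's `addr:*` tags into a dict keyed by address type."""
    address = {}
    for tag in node.iter("tag"):
        key = tag.attrib.get("k", "")
        if key.startswith("addr:"):
            # Strip the "addr:" prefix, as in the address type audit.
            address[key[len("addr:"):]] = tag.attrib.get("v")
    return address

# Hypothetical node, used only to exercise the function.
sample_node = ET.fromstring(
    '<node id="1" lat="37.77" lon="-122.42">'
    '<tag k="addr:housenumber" v="1500"/>'
    '<tag k="addr:street" v="Haight Street"/>'
    '<tag k="amenity" v="cafe"/>'
    '</node>')

sample_address = sketch_node_address(sample_node)
print(sample_address)
```

A function like this could be called from `shape_node` to attach an `address` field to nodes that carry address tags.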

In [16]:
import json
import xml.etree.cElementTree as ET

xml_file = "data/san-francisco_california.osm"
json_file = xml_file + ".json"

def shape_node(elem):
    node = {"type": "node"}
    node["id"] = elem.attrib["id"]
    node["coord"] = [float(elem.attrib["lat"]), 
                     float(elem.attrib["lon"])]
    node["created"] = get_element_created(elem)
    return node

def shape_way(elem):
    way = {"type": "way"}
    way["created"] = get_element_created(elem)
    return way

def shape_relation(elem):
    relation = {"type": "relation"}
    relation["created"] = get_element_created(elem)
    return relation

with open(json_file, 'w') as f:
    for _, elem in ET.iterparse(xml_file):
        # Convert the element to json dictionary according to type.
        if elem.tag == "node":
            j = shape_node(elem)
        elif elem.tag == "way":
            j = shape_way(elem)
        elif elem.tag == "relation":
            j = shape_relation(elem)
        else:
            j = None
        # Dump the json dictionary to file.
        if j:
            f.write(json.dumps(j) + "\n")

### 4.3 Import into a MongoDB database and run queries

In [None]:
!mkdir -p db
!mongod --dbpath db

In [17]:
!mongoimport --db openstreetmap --collection sanfrancisco --drop \
             --file data/san-francisco_california.osm.json

2015-06-08T14:31:29.477-0400	connected to: localhost
2015-06-08T14:31:29.479-0400	dropping: openstreetmap.sanfrancisco
2015-06-08T14:31:32.457-0400	[........................] openstreetmap.sanfrancisco	25.1 MB/624.9 MB (4.0%)
2015-06-08T14:31:35.461-0400	[#.......................] openstreetmap.sanfrancisco	50.6 MB/624.9 MB (8.1%)
2015-06-08T14:31:38.460-0400	[##......................] openstreetmap.sanfrancisco	76.4 MB/624.9 MB (12.2%)
2015-06-08T14:31:41.458-0400	[###.....................] openstreetmap.sanfrancisco	101.5 MB/624.9 MB (16.2%)
2015-06-08T14:31:44.457-0400	[####....................] openstreetmap.sanfrancisco	128.7 MB/624.9 MB (20.6%)
2015-06-08T14:31:47.456-0400	[#####...................] openstreetmap.sanfrancisco	155.1 MB/624.9 MB (24.8%)
2015-06-08T14:31:50.457-0400	[######..................] openstreetmap.sanfrancisco	180.2 MB/624.9 MB (28.8%)
2015-06-08T14:31:53.459-0400	[#######.................] openstreetmap.sanfrancisco	205.2 MB/624.9 MB (32.8%)
2015-06-08T14:

In [18]:
from pymongo import MongoClient

client = MongoClient()
db = client.openstreetmap
collection = db.sanfrancisco

In [21]:
print "Number of <node>s:", collection.find({"type":"node"}).count()
print "Number of <way>s:", collection.find({"type":"way"}).count()
print "Number of <relation>s:", collection.find({"type":"relation"}).count()

print "Number of distinct users:", len(collection.distinct("created.user"))

Number of <node>s: 3018836
Number of <way>s: 324910
Number of <relation>s: 3322
Number of distinct users: 1970
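
A natural follow-up to the distinct user count is to rank users by the number of elements they created. The sketch below only builds the aggregation pipeline. The function name `top_contributors_pipeline` is an assumption, and the pipeline has not been run against this dataset, so no results are shown.

```python
def top_contributors_pipeline(limit=10):
    """Build an aggregation pipeline ranking users by document count."""
    return [
        # One group per user name, counting that user's documents.
        {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
        # Most active users first.
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]

pipeline = top_contributors_pipeline(5)
print(pipeline)

# With the `collection` object from the cells above, the pipeline
# would be passed to the server like this:
#   results = collection.aggregate(pipeline)
```

How the results are read depends on the PyMongo version: recent versions return a cursor that can be iterated directly, while older 2.x releases return a dictionary with a `result` key.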


Step Five - Document Your Work
---------------------------