# **Data Wrangling of OpenStreetMap Dataset**
###### by Yuan Wang (Andy) in fulfillment of Udacity Nanodegree, Project 3

## Project Summary <a name="top"></a>
Name: Yuan Wang

**Map area:**
+ Location: Brooklyn, New York
+ <a href=https://s3.amazonaws.com/metro-extracts.mapzen.com/san-jose_california.osm.bz2> Mapzen URL for Brooklyn, U.S.A. </a> 

**Objective:**

Clean and transform location data from a large OpenStreetMap XML file into a structured data format.

Assess the quality of the data for validity, accuracy, completeness, consistency and uniformity.

Parse and gather data from popular file formats such as .json, .xml, .csv and .html.

**References:**

Udacity "Data Wrangling with MongoDB" - Lesson 6

<a href=http://www.cceo.org/addressing/documents/StreetAbbreviationsGuide.pdf> CCEO Street Abbreviations Guide PDF </a> 


In [21]:
import xml.etree.cElementTree as ET
import pprint
import re
import collections

In [2]:
bk_data = "brooklyn_new-york.osm"

> I use xml.etree.cElementTree to parse the Brooklyn dataset, and built the 'count_tags' function to count how many times each element type occurs, as a first look at the content of this dataset.

In [3]:
def count_tags(filename):
        tags = {}
        for event, elem in ET.iterparse(filename):
            if elem.tag in tags: 
                tags[elem.tag] += 1
            else:
                tags[elem.tag] = 1
        return tags
bk_tags = count_tags(bk_data)
pprint.pprint(bk_tags)

{'bounds': 1,
 'member': 14551,
 'nd': 3494969,
 'node': 2484785,
 'osm': 1,
 'relation': 1701,
 'tag': 2819240,
 'way': 490294}
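
For a file of this size, the same count can be done while releasing each element once it has been seen. The block below is a minimal sketch of that idea, not the code used for the numbers above: `count_tags_streaming` and the tiny in-memory sample are hypothetical, and it is written for Python 3, where `xml.etree.ElementTree` takes the place of `cElementTree`.

```python
import collections
import io
import xml.etree.ElementTree as ET


def count_tags_streaming(source):
    """Count element tags, clearing each element after it is counted."""
    counts = collections.Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's children, text and attributes; they are not needed again.
        elem.clear()
    return counts


# Hypothetical sample standing in for the real .osm file.
sample = io.BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='name' v='A'/></node>"
    b"<way id='2'><nd ref='1'/></way>"
    b"</osm>"
)
sample_counts = count_tags_streaming(sample)
print(dict(sample_counts))
```

Clearing elements lowers memory use during the pass, although the emptied elements remain attached to the root until parsing ends.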


> The 'key_type' function counts tag keys in four categories and stores the counts in a dictionary: "lower" for keys that contain only lowercase letters and underscores and are valid; "lower_colon" for otherwise valid keys with a colon in their names; "problemchars" for keys with problematic characters; and "other" for keys that fit none of these categories.


In [4]:
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        for tag in element.iter('tag'):
            k = tag.get('k')
            if lower.search(k):
                keys['lower'] += 1
            elif lower_colon.search(k):
                keys['lower_colon'] += 1
            elif problemchars.search(k):
                keys['problemchars'] += 1
            else:
                keys['other'] += 1
    return keys


def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

bk_keys = process_map(bk_data)
pprint.pprint(bk_keys)

{'lower': 1052673,
 'lower_colon': 1745044,
 'other': 7239,
 'problemchars': 14284}
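
To make the four categories concrete, the sketch below applies the same three regexes to a few example keys. The helper `classify_key` and the sample keys are hypothetical and chosen only for illustration; they are not taken from the Brooklyn counts above.

```python
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category name for a single tag key, in the same order as key_type."""
    if lower.search(k):
        return "lower"
    if lower_colon.search(k):
        return "lower_colon"
    if problemchars.search(k):
        return "problemchars"
    return "other"


# Hypothetical sample keys.
for key in ["highway", "addr:street", "step count", "FIXME", "addr:street:name"]:
    print(key, "->", classify_key(key))
```

Keys with uppercase letters, digits or more than one colon fall into "other", because they match none of the three patterns.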


> A second 'process_map' function finds how many unique users have contributed to the map of the Brooklyn area. So far, 1384 unique users have worked on it.

In [5]:
# people involved in the map editing
def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        for e in element:
            if 'uid' in e.attrib:
                users.add(e.attrib['uid'])
    return users
users = process_map(bk_data)
len(users)

1384
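
The same idea can be checked on a small in-memory sample. This is a sketch under assumptions: `unique_users` and the sample XML are hypothetical, the code targets Python 3, and it reads 'uid' from each element directly rather than from its children as the function above does.

```python
import io
import xml.etree.ElementTree as ET


def unique_users(source):
    """Collect the distinct 'uid' attribute values found on any element."""
    users = set()
    for _, elem in ET.iterparse(source):
        uid = elem.get("uid")
        if uid is not None:
            users.add(uid)
    return users


# Hypothetical sample: three elements edited by two distinct users.
sample = io.BytesIO(
    b"<osm>"
    b"<node id='1' uid='10' user='a'/>"
    b"<node id='2' uid='11' user='b'/>"
    b"<way id='3' uid='10' user='a'><nd ref='1'/></way>"
    b"</osm>"
)
sample_users = unique_users(sample)
print(len(sample_users))
```

Because a set keeps one copy of each value, a user who edited several elements is counted once.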

> Inconsistent street name abbreviations are one of the big problems in this dataset. In the following code, we build a regex that matches the last word of the string, which is usually where the street type sits. We then define a list of expected street types that need no cleaning and a mapping from abbreviated or miscased forms to their full names.

In [12]:
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Avenue", "Boulevard", "Commons", "Court", "Drive", "Lane", "Parkway", 
                         "Place", "Road", "Square", "Street", "Trail"]

mapping = {'Ave'  : 'Avenue',
           'Blvd' : 'Boulevard',
           'Dr'   : 'Drive',
           'Ln'   : 'Lane',
           'Pkwy' : 'Parkway',
           'Rd'   : 'Road',
           'Rd.'   : 'Road',
           'St'   : 'Street',
           'street' :"Street",
           'Ct'   : "Court",
           'Cir'  : "Circle",
           'Cr'   : "Court",
           'ave'  : 'Avenue',
           'Hwg'  : 'Highway',
           'Hwy'  : 'Highway',
           'Sq'   : "Square"}
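
A quick way to see what the regex captures is to run it on a few names. The sketch below is illustrative only: the helper `last_token` is hypothetical, and the sample names are short strings of the kind that appear in the data.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)


def last_token(street_name):
    """Return the final whitespace-free token of a street name, or None if nothing matches."""
    m = street_type_re.search(street_name)
    return m.group() if m else None


for name in ["5th Ave", "Aviation Rd.", "Smith St & Bergen St", "Broadway"]:
    print(repr(name), "->", repr(last_token(name)))
```

Only the final token is captured, so a single-word name such as "Broadway" is treated as its own street type, and an intersection such as "Smith St & Bergen St" yields just the last "St".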


> The 'audit_street_type' function searches the input string with the regex. If there is a match and it is not in the "expected" list, the match is added as a key and the street name is added to that key's set. The 'is_street_name' function checks whether the attribute k equals "addr:street". The 'audit' function returns a dictionary of the street names that pass both checks, grouped by street type.


In [13]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit(osmfile):
    street_types = defaultdict(set)
    # The with-block closes the file once parsing is done.
    with open(osmfile, "r") as osm_file:
        # "end" events guarantee that each element's child tags are fully parsed.
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])

    return street_types

In [14]:
bk_street_types = audit(bk_data)
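
The shape of the audit result is easier to see on a small sample. The block below is a self-contained sketch, not the run on the Brooklyn file: `audit_sample`, the shortened `expected` list and the sample XML are hypothetical, and the code targets Python 3 and listens for "end" events so that each element's child tags are fully parsed.

```python
import io
import re
from collections import defaultdict
import xml.etree.ElementTree as ET

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Avenue", "Street"]  # shortened list, for this example only


def audit_sample(source):
    """Group addr:street values by their last token when that token is unexpected."""
    street_types = defaultdict(set)
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way"):
            for tag in elem.iter("tag"):
                if tag.get("k") == "addr:street":
                    m = street_type_re.search(tag.get("v"))
                    if m and m.group() not in expected:
                        street_types[m.group()].add(tag.get("v"))
    return street_types


# Hypothetical sample with two abbreviated names and one clean name.
sample = io.BytesIO(
    b"<osm>"
    b"<node id='1'><tag k='addr:street' v='5th Ave'/></node>"
    b"<node id='2'><tag k='addr:street' v='Park Ave'/></node>"
    b"<way id='3'><tag k='addr:street' v='Union Street'/></way>"
    b"</osm>"
)
sample_types = audit_sample(sample)
print(dict(sample_types))
```

Each key is an unexpected street type and each value is the set of full street names that end with it, so "Union Street" does not appear in the result.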

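> The 'update_name' function is the last step of the auditing process: it replaces the abbreviated street type at the end of a name with its full form from 'mapping', which reduces the inconsistency in street name abbreviations. Because the regex matches only the final word, a name such as "Smith St & Bergen St" is only partially updated.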

In [17]:
def update_name(name, mapping, regex):
    m = regex.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            name = re.sub(regex, mapping[street_type], name)

    return name

for street_type, ways in bk_street_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping, street_type_re)
        if better_name != name:
            print name, "=>", better_name

Aviation Rd => Aviation Road
5th street => 5th Street
Union street => Union Street
Columbia street => Columbia Street
Hudson street => Hudson Street
Lafayette street => Lafayette Street
Mulberry street => Mulberry Street
Pearl street => Pearl Street
Mott street => Mott Street
East 5th street => East 5th Street
Rivington street => Rivington Street
5th Ave => 5th Avenue
Park Ave => Park Avenue
Norman Ave => Norman Avenue
4th Ave => 4th Avenue
6th Ave => 6th Avenue
5th ave => 5th Avenue
Madison St => Madison Street
Jackson St => Jackson Street
Grand St => Grand Street
Washington St => Washington Street
2nd St => 2nd Street
Monroe St => Monroe Street
362nd Grand St => 362nd Grand Street
1st St => 1st Street
Union St => Union Street
Hudson St => Hudson Street
Schermerhorn St => Schermerhorn Street
Bloomfield St => Bloomfield Street
8th St => 8th Street
Smith St & Bergen St => Smith St & Bergen Street
River St => River Street
State St & Water St => State St & Water Street
Garden St => Garden