# Wrangle OpenStreetMap data
## Udacity Data Analyst Nanodegree - Project 3

### Introduction

Todo:
* Add case study Python scripts
* Create OSM sample file (see code below)

#### Map
* Cologne, http://www.openstreetmap.org/#map=11/50.9387/6.8740
* Cologne was my home for several years before I moved to Milan, Italy, where I currently live. I know the city quite well and am curious to improve its OSM data.

### Resources
* Udacity course materials
* [Python 3 documentation](https://docs.python.org/3/)
* [MongoDB documentation](https://docs.mongodb.com/manual/)
* [MongoDB driver documentation](https://docs.mongodb.com/ecosystem/drivers/python/)
* [Markdown documentation](https://daringfireball.net/projects/markdown/syntax)

### Case study scripts

#### Iterative parsing

In [1]:
# tbd
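
A minimal sketch of iterative parsing, not the case study script itself: it counts how often each element tag occurs without loading the whole file into memory. The function name `count_tags` and the in-memory `SAMPLE_XML` are assumptions made for illustration; for the real data, a path to the OSM file would be passed instead.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element tag occurs, parsing iteratively."""
    counts = Counter()
    # iterparse yields each element once its closing tag has been read
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children and attributes to limit memory use
    return counts


# Hypothetical sample standing in for an OSM extract
SAMPLE_XML = b"""<osm>
  <node id="1" lat="50.94" lon="6.96"/>
  <node id="2" lat="50.95" lon="6.97">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="3">
    <nd ref="1"/>
    <nd ref="2"/>
  </way>
</osm>"""

tag_counts = count_tags(io.BytesIO(SAMPLE_XML))
print(tag_counts)
```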

#### Tagging types

In [None]:
# tbd
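
One possible way to classify tag keys is sketched below. It reuses the three regular expressions from the "Preparing for database" script; the function name `classify_key` and the category labels are assumptions for illustration.

```python
import re

# Same patterns as in the "Preparing for database" script
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category of a tag key (hypothetical labels)."""
    if problemchars.search(key):
        return "problemchars"
    if lower.match(key):
        return "lower"
    if lower_colon.match(key):
        return "lower_colon"
    return "other"


for key in ("amenity", "addr:street", "addr:street:name", "name 1"):
    print(key, "->", classify_key(key))
```

Keys that land in `other`, such as those with upper-case letters or a second colon, are the ones worth inspecting by hand.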

#### Exploring users

In [3]:
# tbd
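
A minimal sketch for exploring users, assuming that contributors are identified by the `uid` attribute that also appears in the `CREATED` list further below. The function name `unique_users` and the sample XML are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def unique_users(source):
    """Collect the distinct uid values found on any element."""
    users = set()
    for _, elem in ET.iterparse(source):
        uid = elem.attrib.get("uid")
        if uid is not None:
            users.add(uid)
    return users


# Hypothetical sample standing in for an OSM extract
SAMPLE_XML = b"""<osm>
  <node id="1" uid="100" user="alice"/>
  <node id="2" uid="200" user="bob"/>
  <way id="3" uid="100" user="alice"/>
</osm>"""

users = unique_users(io.BytesIO(SAMPLE_XML))
print(len(users), "unique users")
```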

#### Improving street names

In [None]:
# tbd
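
The sketch below shows one way street names could be normalized. It assumes that abbreviated forms such as "Str." occur in the Cologne data, which has not been verified yet; the pattern and the function name `update_name` are illustrative only.

```python
import re

# Assumption: street names may end in the abbreviation "Str", "Str.", "str" or "str."
street_abbrev_re = re.compile(r'(?P<stem>[Ss])tr\.?$')


def update_name(name):
    """Expand a trailing street abbreviation to "Straße" / "straße"."""
    return street_abbrev_re.sub(lambda m: m.group('stem') + 'traße', name)


for name in ("Aachener Str.", "Hauptstr.", "Hohe Straße"):
    print(name, "->", update_name(name))
```

Names that already end in the full word are left unchanged, so the function can be applied to every street value.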

#### Preparing for database

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint
import re
import codecs
import json
"""
Your task is to wrangle the data and transform the shape of the data
into the model we mentioned earlier. The output should be a list of dictionaries
that look like this:

{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}

You have to complete the function 'shape_element'.
We have provided a function that will parse the map file, and call the function with the element
as an argument. You should return a dictionary, containing the shaped data for that element.
We have also provided a way to save the data in a file, so that you could use
mongoimport later on to import the shaped data into MongoDB. 

Note that in this exercise we do not use the 'update street name' procedures
you worked on in the previous exercise. If you are using this code in your final
project, you are strongly encouraged to use the code from previous exercise to 
update the street names before you save them to JSON. 

In particular the following things should be done:
- you should process only 2 types of top level tags: "node" and "way"
- all attributes of "node" and "way" should be turned into regular key/value pairs, except:
    - attributes in the CREATED array should be added under a key "created"
    - attributes for latitude and longitude should be added to a "pos" array,
      for use in geospatial indexing. Make sure the values inside "pos" array are floats
      and not strings. 
- if the second level tag "k" value contains problematic characters, it should be ignored
- if the second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
- if the second level tag "k" value does not start with "addr:", but contains ":", you can
  process it in a way that you feel is best. For example, you might split it into a two-level
  dictionary like with "addr:", or otherwise convert the ":" to create a valid key.
- if there is a second ":" that separates the type/direction of a street,
  the tag should be ignored, for example:

<tag k="addr:housenumber" v="5158"/>
<tag k="addr:street" v="North Lincoln Avenue"/>
<tag k="addr:street:name" v="Lincoln"/>
<tag k="addr:street:prefix" v="North"/>
<tag k="addr:street:type" v="Avenue"/>
<tag k="amenity" v="pharmacy"/>

  should be turned into:

{...
"address": {
    "housenumber": "5158",
    "street": "North Lincoln Avenue"
},
"amenity": "pharmacy",
...
}

- for "way" specifically:

  <nd ref="305896090"/>
  <nd ref="1719825889"/>

should be turned into
"node_refs": ["305896090", "1719825889"]
"""


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]


def shape_element(element):
    
    # create dicts
    node = {}
    address = {}
    created = {}
    
    # create lists
    node_refs = []
    pos = []
    
    # check for valid tags
    if element.tag in ("node", "way"):
        
        # save valid tag to node dict
        node['type'] = element.tag

        # check for lat and lon attributes
        if 'lat' in element.attrib and 'lon' in element.attrib:
            
            # try to convert lat and lon values to float
            try:
                lat = float(element.attrib['lat'])
                lon = float(element.attrib['lon'])
                
                # save values to pos list
                pos = [lat, lon]
                
            except ValueError:
                # leave pos empty if a coordinate is not a valid number
                pass
        
        # loop over attribute items
        for k, m in element.attrib.items():
            
            # skip lat and lon attributes, which are stored in pos
            if k not in ('lat', 'lon'):
                
                # check if attributes match CREATED list
                if k in CREATED:
                    created[k] = m
                else:
                    node[k] = m
        
        # loop over children
        for child in element:
                
            # check second level tag nd
            if child.tag == "nd":
                node_refs.append(child.attrib['ref'])
                
            # check second level tag "tag"
            if child.tag == "tag":
                key = child.attrib['k']
                value = child.attrib['v']

                # ignore keys with problematic characters
                if problemchars.search(key):
                    continue

                if key.startswith("addr:"):
                    # keep "addr:street" as "street"; omit keys with a
                    # second ":" such as "addr:street:name"
                    if key.count(":") == 1:
                        address[key[len("addr:"):]] = value

                # turn other keys containing ":" into valid keys
                elif ":" in key:
                    node[key.replace(":", "_")] = value

                # save plain keys such as "amenity" as regular key/value pairs
                else:
                    node[key] = value
        
        # integrate created dict into node dict
        if created:
            node['created'] = created
        
        # integrate pos list into node dict
        if pos:
            node['pos'] = pos
            
        # integrate address dict into node dict
        if address:
            node['address'] = address
            
        # integrate node_refs list into node dict
        if node_refs:
            node['node_refs'] = node_refs
        
        return node
    else:
        return None


def process_map(file_in, pretty = False):
    # You do not need to change this file
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

def test():
    # NOTE: if you are running this code on your computer, with a larger dataset, 
    # call the process_map procedure with pretty=False. The pretty=True option adds 
    # additional spaces to the output, making it significantly larger.
    data = process_map('example.osm', True)
    #pprint.pprint(data)
    
    correct_first_elem = {
        "id": "261114295", 
        "visible": "true", 
        "type": "node", 
        "pos": [41.9730791, -87.6866303], 
        "created": {
            "changeset": "11129782", 
            "user": "bbmiller", 
            "version": "7", 
            "uid": "451048", 
            "timestamp": "2012-03-28T18:31:23Z"
        }
    }
    assert data[0] == correct_first_elem
    assert data[-1]["address"] == {
                                    "street": "West Lexington St.", 
                                    "housenumber": "1412"
                                      }
    assert data[-1]["node_refs"] == [ "2199822281", "2199822390",  "2199822392", "2199822369", 
                                    "2199822370", "2199822284", "2199822281"]

if __name__ == "__main__":
    test()
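
The address rules from the docstring above can also be checked in isolation. The following is a minimal sketch, not part of the course code: `shape_address` is a hypothetical helper, and the XML is the example from the docstring wrapped in a made-up `way` element.

```python
import xml.etree.ElementTree as ET

# Example tags from the docstring, wrapped in a hypothetical way element
WAY_XML = """<way id="1">
  <tag k="addr:housenumber" v="5158"/>
  <tag k="addr:street" v="North Lincoln Avenue"/>
  <tag k="addr:street:name" v="Lincoln"/>
  <tag k="addr:street:prefix" v="North"/>
  <tag k="addr:street:type" v="Avenue"/>
  <tag k="amenity" v="pharmacy"/>
</way>"""


def shape_address(element):
    """Build the "address" dict from an element's addr:* tags."""
    address = {}
    for tag in element.iter("tag"):
        parts = tag.attrib["k"].split(":")
        # keep "addr:<field>" only; a second ":" means the tag is ignored
        if parts[0] == "addr" and len(parts) == 2:
            address[parts[1]] = tag.attrib["v"]
    return address


address = shape_address(ET.fromstring(WAY_XML))
print(address)
```

Splitting the key on ":" makes both rules explicit: the first part must be `addr`, and exactly one more part may follow.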

### Data exploration

#### Generate sample OSM file

In [9]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.etree.ElementTree as ET  # Uses the C accelerator automatically on Python 3.3+; try lxml if too slow

OSM_FILE = "cologne_germany.osm"  # Replace this with your osm file
SAMPLE_FILE = "cologne_germany_sample.osm"

k = 10 # Parameter: take every k-th top level element

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    # Write bytes objects instead of strings so that the script works with Python 3.x
    #
    # Reference:
    # http://stackoverflow.com/questions/33054527/python-3-5-typeerror-a-bytes-like-object-is-required-not-str
    output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write(b'<osm>\n  ')

    # Write every kth top level element
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write(b'</osm>')

In [14]:
# parse XML file

# list tags

# identify problems

# investigate problems

# fix problems
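
For the "list tags" step in the outline above, one option is to count the keys of all second-level `tag` elements. This is a sketch under assumptions: `count_tag_keys` is a hypothetical name and the sample XML stands in for `cologne_germany_sample.osm`.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tag_keys(source):
    """Count how often each "k" value occurs on tag elements."""
    keys = Counter()
    for _, elem in ET.iterparse(source):
        if elem.tag == "tag":
            keys[elem.attrib["k"]] += 1
    return keys


# Hypothetical sample standing in for the sample OSM file
SAMPLE_XML = b"""<osm>
  <node id="1">
    <tag k="addr:street" v="Venloer Str."/>
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2">
    <tag k="addr:street" v="Aachener Stra&#223;e"/>
  </node>
</osm>"""

key_counts = count_tag_keys(io.BytesIO(SAMPLE_XML))
for key, count in key_counts.most_common():
    print(key, count)
```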

### Problems encountered
tbd

### Data overview
tbd

### Additional ideas
tbd

### Conclusion
tbd