# OpenStreetMap data

## Project Overview
The aim of this project is to choose any area of the world in https://www.openstreetmap.org and use data wrangling techniques to clean the OpenStreetMap data for that area. This includes assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity. Once the data has been cleaned, SQL will be used to query and aggregate it.

### OSM Dataset
The area selected for this project is Dublin city in Ireland. Dublin is the capital city of Ireland with a population of 1,347,359 people. I chose Dublin as I've lived there for nearly 10 years and am quite familiar with the area. 

* Location: Dublin, Ireland
* [OpenStreetMap URL](https://www.openstreetmap.org/export#map=11/53.3549/-6.2512)
* [MapZen URL](https://mapzen.com/data/metro-extracts/metro/dublin_ireland/) 

## Data Overview

Let's get an idea of which tags appear in the OSM file. Since the file is quite large, iterative parsing will be used to process it. Counting how often each tag occurs gives a feel for how much of each kind of data to expect in the map.

In [1]:
OSMFILE = "dublin_ireland.osm"

In [2]:
import top_level_tags
top_level_tags.count_tags(OSMFILE)

{'bounds': 1,
 'member': 91716,
 'nd': 2028541,
 'node': 1469711,
 'osm': 1,
 'relation': 4989,
 'tag': 1060747,
 'way': 269763}

The file is 369.4 MB and contains nearly 5,000,000 elements in total across these tag types (4,925,469 by the counts above).
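
The code of `top_level_tags.count_tags` is not shown here, so the following is only a minimal sketch of how such a count could be done with iterative parsing. It assumes the function takes a file name or file object, and the tiny XML sample is made up purely to show the shape of the result.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element name occurs, reading the file incrementally."""
    counts = Counter()
    # iterparse yields each element as soon as its closing tag has been read.
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's attributes and children once counted
    return dict(counts)


# Made-up sample, not taken from the Dublin extract.
SAMPLE = b"""<osm>
  <node id="1"><tag k="amenity" v="cafe"/></node>
  <node id="2"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

print(count_tags(io.BytesIO(SAMPLE)))
```

Because each element is cleared after it is counted, the attributes and children of processed elements do not pile up in memory, which matters for a file of this size.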

## Check for potential problems in tags

Let's explore the data a bit more. Before processing the data and adding it into a database, we can check the
"k" value for each "tag" tag and see if there are any potential problems.

We can get a count of each of four tag categories in a dictionary:
* "lower", for tags that contain only lowercase letters and are valid,
* "lower_colon", for otherwise valid tags with a colon in their names,
* "problemchars", for tags with problematic characters, and
* "other", for other tags that do not fall into the other three categories.

In [3]:
import tagtype
tagtype.process_map(OSMFILE)

{'lower': 673577, 'lower_colon': 342839, 'other': 44331, 'problemchars': 0}
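
The four categories can be told apart with regular expressions. The sketch below is one possible way to do it, not necessarily what `tagtype.process_map` does: `PROBLEMCHARS` is the pattern used later in this document, while the `LOWER` and `LOWER_COLON` patterns and the `key_type` helper are assumptions made for illustration.

```python
import re
from collections import Counter

LOWER = re.compile(r'^([a-z]|_)*$')                   # assumed pattern
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')  # assumed pattern
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(key):
    """Return the category name for a single "k" value."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


def count_key_types(keys):
    """Tally the categories for an iterable of "k" values."""
    counts = Counter({"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0})
    counts.update(key_type(k) for k in keys)
    return dict(counts)


print(count_key_types(["highway", "addr:street", "addr street", "FIXME"]))
```

The order of the checks matters: a key is only tested for problematic characters after it has failed both "valid" patterns, and anything left over falls into "other".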

## Number of contributors

OpenStreetMap consists of data contributed by many people. Each piece of data in the OSMFILE is accompanied by the user_id of the person who entered it. We can find out how many people have contributed to the data that makes up the map of Dublin:

In [4]:
import users
users.number_of_users(OSMFILE)

Number of unique contributors: 1602
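
A count like this can be obtained by collecting the contributor ids into a set while parsing. The sketch below assumes the id is stored in the `uid` attribute of each element, as in standard OSM XML; the `unique_users` function and the sample are illustrative and may differ from `users.number_of_users`.

```python
import io
import xml.etree.ElementTree as ET


def unique_users(source):
    """Return the set of distinct uid values found on any element."""
    users = set()
    for _, elem in ET.iterparse(source, events=("end",)):
        uid = elem.get("uid")
        if uid is not None:  # elements such as <tag> and <nd> carry no uid
            users.add(uid)
        elem.clear()
    return users


# Made-up sample with three elements edited by two contributors.
SAMPLE = b"""<osm>
  <node id="1" uid="10"/>
  <node id="2" uid="11"/>
  <way id="3" uid="10"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

print("Number of unique contributors:", len(unique_users(io.BytesIO(SAMPLE))))
```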


## Data Auditing

The following steps will be taken to audit the OSMFILE:
1. Create a variable, 'mapping', that maps incorrect or inconsistent entries to the appropriate names or formatting. The mapping covers only the problems found in this OSMFILE rather than attempting a generalized solution, since the problems depend on the particular area being audited.
2. Write a function that fixes the street name, postcode or phone number. The function takes the string to fix as an argument and returns the corrected version.

### 1. Fix street names

In [2]:
import audit
audit.audit_street(OSMFILE)

defaultdict(set,
            {'1-13': {'The Rise 1-13'},
             '1-9': {'Manor Court 1-9'},
             '10-21': {'Manor Court 10-21'},
             '11': {"James Business Park, St Margaret's Road, Finglas North, Dublin 11",
              'Unit 6, North Park Business Park, Finglas, Dublin 11'},
             '14-28': {'The Rise 14-28'},
             '15': {'Rathborne Close71 RATHBORNE AVENUE RATHBORNE DUBLIN 15'},
             '2': {'Dame Court, Dublin 2'},
             '26': {'26'},
             '27-31': {'Supple Park 27-31'},
             '32-39': {'Supple Park 32-39'},
             '4': {'Serpentine Avenue, Ballsbridge, Dublin 4'},
             '40-44': {'Supple Park 40-44'},
             '48-': {'Supple Park 48-'},
             'Abbey': {'Fonthill Abbey',
              'Leopardstown Abbey',
              "Mary's Abbey",
              'Rothe Abbey',
              'Seachnall Abbey'},
             'Adair': {'Adair'},
             'Airport': {'Dublin Airport'},
             'Alba

Although plenty of extra street names show up that weren't in the expected list, most of them are less common street names that are acceptable. There are a few abbreviated street names and some spelling mistakes that can be fixed using mapping.
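
The keys in the audit output are the last word of each street name. A grouping of that kind could be sketched as follows. The regular expression, the `EXPECTED` set and the assumption that street names live in `addr:street` tags are all illustrative; the actual `audit.audit_street` may be written differently.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

STREET_TYPE_RE = re.compile(r'\b\S+\.?$')  # last whitespace-delimited word
EXPECTED = {"Street", "Avenue", "Road", "Lane", "Court", "Square"}  # hypothetical list


def audit_street_type(street_types, street_name):
    """File the name under its last word unless that word is expected."""
    match = STREET_TYPE_RE.search(street_name)
    if match and match.group() not in EXPECTED:
        street_types[match.group()].add(street_name)


def audit_street(source):
    street_types = defaultdict(set)
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way"):
            for tag in elem.iter("tag"):
                if tag.get("k") == "addr:street":
                    audit_street_type(street_types, tag.get("v"))
            elem.clear()  # only clear once the parent's tags have been read
    return street_types


# Made-up sample reusing a few names from the output above.
SAMPLE = b"""<osm>
  <node id="1"><tag k="addr:street" v="Fonthill Abbey"/></node>
  <node id="2"><tag k="addr:street" v="Dame Street"/></node>
  <way id="3"><tag k="addr:street" v="Grafton street"/></way>
</osm>"""

print(dict(audit_street(io.BytesIO(SAMPLE))))
```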

In [3]:
audit.update_street_name(OSMFILE)

First Ave => First Avenue
Griffith Ave => Griffith Avenue
Spruce Ave => Spruce Avenue
Novara road => Novara Road
Suffolk street => Suffolk Street
Grafton street => Grafton Street
Earl street => Earl Street
St Johns Rd => St Johns Road
Woodview Heichts => Woodview Heights
library square => library Square
Old Dublin Roafd => Old Dublin Road
Strand Rd. => Strand Road
Charlestown Shopping Cente => Charlestown Shopping Centre
Oak Ridge Cres => Oak Ridge Crescent
New market hall => New market Hall
O'Reilly Aveune => O'Reilly Avenue
The Rise,Belgard heights => The Rise,Belgard Heights
Ballinclea heights => Ballinclea Heights
Charlemont St. => Charlemont Street
Upper Gardiner St. => Upper Gardiner Street
Hanbury lane => Hanbury Lane
Warner's lane => Warner's Lane
Francis St => Francis Street



### 2.

### 3.

## Preparing Data for Database

After auditing is complete, the next step is to prepare the data for insertion into a SQL database. To do so, we parse the elements in the OSM XML file and transform them from document format to tabular format, which makes it possible to write them to .csv files. These .csv files can then easily be imported into a SQL database as tables.

The process for this transformation is as follows:
- Use iterparse to iteratively step through each top level element in the XML
- Shape each element into several data structures using a custom function
- Utilize a schema and validation library to ensure the transformed data is in the correct format
- Write each data structure to the appropriate .csv files

In [None]:
import csv
import codecs
import pprint
import re
import xml.etree.cElementTree as ET

import cerberus

import schema

OSM_PATH = "dublin_ireland.osm"

NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

SCHEMA = schema.schema

# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']


def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""

    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements

    # TODO: transform each element into the correct format
    if element.tag == 'node':
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}

# ================================================== #
#               Helper Functions                     #
# ================================================== #
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""

    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)
        
        raise Exception(message_string.format(field, error_string))


class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


# ================================================== #
#               Main Function                        #
# ================================================== #
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""

    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])
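
The `shape_element` function above still contains a TODO. One way it might be filled in is sketched below. The field lists and the `PROBLEMCHARS` pattern are copied from the code above. The rule for splitting keys on the first colon (so that `addr:street` becomes type `addr`, key `street`), the decision to skip keys with problematic characters, and the `shape_tags` helper are assumptions. Values are kept as strings, so any type conversion required by the schema is left out.

```python
import re
import xml.etree.ElementTree as ET

NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def shape_tags(element, problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Build one dict per <tag> child (hypothetical helper)."""
    tags = []
    for tag in element.iter('tag'):
        key = tag.get('k')
        if problem_chars.search(key):
            continue  # assumption: keys with problematic characters are dropped
        if ':' in key:
            tag_type, key = key.split(':', 1)  # assumption: split on first colon
        else:
            tag_type = default_tag_type
        tags.append({'id': element.get('id'), 'key': key,
                     'value': tag.get('v'), 'type': tag_type})
    return tags


def shape_element(element):
    """Shape a node or way XML element into dicts matching the csv fields."""
    tags = shape_tags(element)
    if element.tag == 'node':
        node_attribs = {field: element.get(field) for field in NODE_FIELDS}
        return {'node': node_attribs, 'node_tags': tags}
    if element.tag == 'way':
        way_attribs = {field: element.get(field) for field in WAY_FIELDS}
        way_nodes = [{'id': element.get('id'), 'node_id': nd.get('ref'), 'position': i}
                     for i, nd in enumerate(element.iter('nd'))]
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
    return None


# Made-up way element to show the resulting structure.
way = ET.fromstring(
    '<way id="3" user="u" uid="7" version="1" changeset="9" '
    'timestamp="2017-01-01T00:00:00Z"><nd ref="1"/><nd ref="2"/>'
    '<tag k="addr:street" v="Grafton Street"/></way>')
print(shape_element(way))
```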


## 4. Data Exploration

### TODO:
Database queries are used to provide a statistical overview of the dataset, like:
* size of the file
* number of unique users
* number of nodes and ways
* number of nodes of a chosen type, such as cafes or shops

Additional statistics not in the list above are computed. For SQL submissions some queries make use of more than one table.
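
The queries for this section have not been written yet. As a rough sketch of what they could look like, the example below uses `sqlite3` with an in-memory database. It assumes the tables are named after the .csv files (`nodes`, `ways`, `nodes_tags`), and the handful of rows are made up so that the example runs on its own. They are not results from the Dublin data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified tables with made-up rows; the real tables have more columns.
cur.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, uid INTEGER);
CREATE TABLE ways (id INTEGER PRIMARY KEY, uid INTEGER);
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
""")
cur.executemany("INSERT INTO nodes VALUES (?, ?)", [(1, 10), (2, 11), (3, 10)])
cur.executemany("INSERT INTO ways VALUES (?, ?)", [(100, 11), (101, 12)])
cur.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?)", [
    (1, "amenity", "cafe", "regular"),
    (2, "amenity", "cafe", "regular"),
    (3, "amenity", "pub", "regular"),
])


def scalar(sql):
    """Run a query that returns a single value."""
    return cur.execute(sql).fetchone()[0]


n_nodes = scalar("SELECT COUNT(*) FROM nodes")
n_ways = scalar("SELECT COUNT(*) FROM ways")
# Contributors can appear in either table, so both are combined first.
n_users = scalar("""
    SELECT COUNT(DISTINCT uid)
    FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways)
""")
amenities = cur.execute("""
    SELECT value, COUNT(*) AS num
    FROM nodes_tags
    WHERE key = 'amenity'
    GROUP BY value
    ORDER BY num DESC, value
""").fetchall()

print(n_nodes, n_ways, n_users, amenities)
```

The unique-user query is an example of a query that draws on more than one table, since a contributor may have edited only ways or only nodes.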

## 5. Additional ideas

### TODO:
* One or more additional suggestions for improving the data or its analysis. The suggestions are backed up by at least one investigative query.
* Discussion about the benefits as well as some anticipated problems in implementing the improvement.

## Conclusion

### Files

### TODO: Put all Python code snippets into separate .py files and list them here