# OpenStreetMap Data

## Project Overview
The goal is to choose any area of the world on https://www.openstreetmap.org and apply data wrangling techniques to clean the OpenStreetMap data for that area, assessing its quality in terms of validity, accuracy, completeness, consistency and uniformity. The cleaned data is then stored in a SQL database, where it can be queried and aggregated.

### OSM Dataset
The area selected for this project is Dublin city in Ireland. 

* Location: Dublin, Ireland
* [OpenStreetMap URL](https://www.openstreetmap.org/export#map=11/53.3549/-6.2512)
* [MapZen URL](https://mapzen.com/data/metro-extracts/metro/dublin_ireland/)

## Data Overview

Let's get an idea of what the top-level tags are. Since the file is quite large, instead of reading it into memory as an XML tree I'll use iterative parsing to process the map file and count how often each tag occurs, to get a feel for how much of each kind of data the map contains.

In [None]:
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pprint

OSMFILE = "dublin_ireland.osm"

def count_tags(filename):
    """Return a dict mapping each tag name to the number of times it occurs in the map."""
    tags = {}
    # Parse iteratively; iterparse yields each element once its closing tag is read.
    for _, elem in ET.iterparse(filename):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
        elem.clear()  # release the element's contents to keep memory use down
    return tags

pprint.pprint(count_tags(OSMFILE))

The file is 369.4 MB and contains nearly 5,000,000 top-level tags.

### Nodes

A node is one of the core elements in the OpenStreetMap data model. It consists of a single point in space defined by its latitude, longitude and node id. A third, optional dimension (altitude) can also be included. A node can also be defined as part of a particular layer=* or level=*, where distinct features pass over or under one another; say, at a bridge.
Nodes can be used to define standalone point features, but are more often used to define the shape or "path" of a way.
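
To make the node structure concrete, here is a minimal sketch that reads one node element into a dictionary. The XML snippet, its id, coordinates and tag values are made up for illustration, and `parse_node` is a hypothetical helper, not part of the project code.

```python
import xml.etree.ElementTree as ET

# Hand-written sample node (hypothetical id, coordinates and tags).
NODE_XML = """
<node id="1001" lat="53.3498" lon="-6.2603">
  <tag k="name" v="Sample Cafe"/>
  <tag k="amenity" v="cafe"/>
</node>
"""

def parse_node(elem):
    """Return the id, position and tags of a <node> element as a dict."""
    return {
        "id": int(elem.attrib["id"]),
        "lat": float(elem.attrib["lat"]),
        "lon": float(elem.attrib["lon"]),
        # Each child <tag> contributes one key/value pair.
        "tags": {t.attrib["k"]: t.attrib["v"] for t in elem.findall("tag")},
    }

node = parse_node(ET.fromstring(NODE_XML))
print(node)
```

A node without child tags would simply get an empty `tags` dictionary, which is the common case for nodes that only shape a way.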

### Ways

A way is an ordered list of nodes which normally also has at least one tag or is included within a relation. A way can be open or closed. A closed way is one whose last node is also its first node. A closed way may be interpreted as a closed polyline, an area, or both. An open way is a way describing a linear feature which does not share a first and last node. Many roads, streams and railway lines are open ways.
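
The open/closed distinction can be checked directly from the node references of a way. The sketch below assumes two hand-written sample ways; `node_refs` and `is_closed` are hypothetical helper names introduced only for this example.

```python
import xml.etree.ElementTree as ET

def node_refs(way_elem):
    """Return the ordered list of node ids referenced by a <way> element."""
    return [nd.attrib["ref"] for nd in way_elem.findall("nd")]

def is_closed(way_elem):
    """A way is closed when its first and last node references are the same."""
    refs = node_refs(way_elem)
    return len(refs) > 1 and refs[0] == refs[-1]

# Hypothetical sample data: a building outline and a street segment.
CLOSED_WAY = ET.fromstring(
    '<way id="1"><nd ref="1"/><nd ref="2"/><nd ref="3"/><nd ref="1"/>'
    '<tag k="building" v="yes"/></way>'
)
OPEN_WAY = ET.fromstring(
    '<way id="2"><nd ref="1"/><nd ref="2"/><nd ref="3"/>'
    '<tag k="highway" v="residential"/></way>'
)

print(is_closed(CLOSED_WAY), is_closed(OPEN_WAY))
```

Whether a closed way represents an area or just a looping line still depends on its tags, so this check alone does not settle the interpretation.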

### Relations

Relations are used to model logical (and usually local) or geographic relationships between objects. They are not designed to hold loosely associated but widely spread items.
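
As a rough illustration, a relation lists its members by element type, referenced id and role. The sample below is invented (ids and roles are assumptions), and `relation_members` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample relation (hypothetical ids): an area with a hole.
RELATION_XML = """
<relation id="5001">
  <member type="way" ref="100" role="outer"/>
  <member type="way" ref="101" role="inner"/>
  <tag k="type" v="multipolygon"/>
</relation>
"""

def relation_members(rel_elem):
    """Return (type, ref, role) for every <member> of a <relation> element."""
    return [
        (m.attrib["type"], m.attrib["ref"], m.attrib.get("role", ""))
        for m in rel_elem.findall("member")
    ]

members = relation_members(ET.fromstring(RELATION_XML))
print(members)
```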

### Check for potential problems in tags

Let's explore the data a bit more. Before processing the data and adding it to a database, I'll check the
"k" value of each "tag" element for potential problems.
The goal is to change the data model and expand keys of the form "addr:street" into a nested dictionary like this:
{"address": {"street": "Some value"}}. So we need to see whether such tags exist, and whether any tags contain
problematic characters.

In [None]:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import re

"""
we want to get a count of each of four tag categories in a dictionary:
  "lower", for tags that contain only lowercase letters and are valid,
  "lower_colon", for otherwise valid tags with a colon in their names,
  "problemchars", for tags with problematic characters, and
  "other", for other tags that do not fall into the other three categories.
"""
#Regex patterns
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

# check the "k" value for each "<tag>"
def key_type(element, keys):
    # Only <tag> elements carry a "k" attribute worth classifying.
    if element.tag == "tag":
        key = element.attrib['k']
        if lower.search(key):
            keys['lower'] += 1
        elif lower_colon.search(key):
            keys['lower_colon'] += 1
        elif problemchars.search(key):
            print(element.attrib)  # show the offending tag for inspection
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys

def process_map(filename):
    keys = {"lower": 0, 
            "lower_colon": 0, 
            "problemchars": 0, 
            "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys 

process_map("dublin_ireland.osm")
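
The reshaping of "addr:street" style keys described above could be sketched as follows. This is only one possible approach under stated assumptions: `shape_tags` is a hypothetical helper, keys with problematic characters are skipped, keys with more than one colon are ignored, and the sample tags are invented.

```python
import re

# Same pattern as in the audit above.
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def shape_tags(raw_tags):
    """Turn flat OSM tags into a dict with "addr:*" keys nested under "address"."""
    shaped = {}
    address = {}
    for key, value in raw_tags.items():
        if PROBLEMCHARS.search(key):
            continue  # assumption: drop keys with problematic characters
        if key.startswith("addr:"):
            parts = key.split(":")
            if len(parts) == 2:  # assumption: ignore deeper keys like addr:street:name
                address[parts[1]] = value
        else:
            shaped[key] = value
    if address:
        shaped["address"] = address  # note: replaces any raw "address" tag
    return shaped

example = shape_tags({
    "addr:street": "Some value",
    "addr:housenumber": "1",
    "amenity": "cafe",
    "bad key": "x",
})
print(example)
```

Collecting the address parts separately and attaching them at the end keeps the nested dictionary intact even if the element also has a plain "address" tag.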

### Number of contributors

I'm interested in finding out how many unique users have contributed to this map.

In [None]:
# -*- coding: utf-8 -*-
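import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

def process_map(filename):
    """Return the set of unique user ids (uid) found in the map."""
    users = set()
    for _, element in ET.iterparse(filename):
        # Child elements such as <tag> and <nd> carry no uid attribute.
        uid = element.attrib.get('uid')
        if uid is not None:
            users.add(uid)
    return users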


def test():
    users = process_map("dublin_ireland.osm")
    print('Number of unique contributors:', len(users))

test()

## 1. Problems Encountered in the Map

1.

2.

3.

4.

## 2. Preparing Data for Database

## 3. Data Exploration

## 4. Additional Ideas

## Conclusion

### Files