# OpenStreetMap Data

## Project Overview
The aim of this project is to choose an area of the world on https://www.openstreetmap.org and use data wrangling techniques, such as assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity, to clean the OpenStreetMap data for that area. Once the data has been cleaned, SQL will be used to query and aggregate it.

### OSM Dataset
The area selected for this project is Dublin city in Ireland. Dublin is the capital city of Ireland with a population of 1,347,359 people. I chose Dublin as I've lived there for nearly 10 years and am quite familiar with the area. 

* Location: Dublin, Ireland
* [OpenStreetMap URL](https://www.openstreetmap.org/export#map=11/53.3549/-6.2512)
* [MapZen URL](https://mapzen.com/data/metro-extracts/metro/dublin_ireland/) 

## Data Overview

Let's get an idea of what tags are in the OSM file. Since the file is quite large, iterative parsing is used to process it and find out which tags are present and how many of each there are, which gives a feel for how much of each kind of data to expect in the map.

In [1]:
OSMFILE = "dublin_ireland.osm"

In [2]:
import top_level_tags
top_level_tags.count_tags(OSMFILE)

{'bounds': 1,
 'member': 91716,
 'nd': 2028541,
 'node': 1469711,
 'osm': 1,
 'relation': 4989,
 'tag': 1060747,
 'way': 269763}

The file is 369.4 MB and contains nearly 5 million XML elements (4,925,469 in total, counting nested elements such as `nd` and `tag`).
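
The `top_level_tags` module itself is not shown here. As a minimal sketch of the iterative parsing idea, assuming the standard library's `xml.etree.ElementTree` is used, a function along these lines would produce a dictionary like the one above:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element name occurs, streaming through the file."""
    counts = Counter()
    # iterparse yields each element when its closing tag is read ("end" event),
    # so the file is processed as a stream rather than loaded in one go.
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's attributes and children once counted
    return dict(counts)
```

`source` can be a file name such as `OSMFILE` or an open binary file object.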

## Check for potential problems in tags

Let's explore the data a bit more. Before processing the data and adding it into a database, we can check the
"k" value for each "tag" tag and see if there are any potential problems.

We can get a count of each of four tag categories in a dictionary:
* "lower", for tags that contain only lowercase letters and are valid,
* "lower_colon", for otherwise valid tags with a colon in their names,
* "problemchars", for tags with problematic characters, and
* "other", for other tags that do not fall into the other three categories.

In [3]:
import tagtype
tagtype.process_map(OSMFILE)

{'lower': 673577, 'lower_colon': 342839, 'other': 44331, 'problemchars': 0}
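
The four categories can be expressed as regular expressions. The patterns below are assumptions for illustration (the real `tagtype` module may define them differently), but they show how each key could be sorted into a category:

```python
import re
import xml.etree.ElementTree as ET

# Assumed patterns; the real tagtype module may define them differently.
LOWER = re.compile(r"^([a-z]|_)*$")
LOWER_COLON = re.compile(r"^([a-z]|_)*:([a-z]|_)*$")
PROBLEMCHARS = re.compile(r"[=+/&<>;'\"?%#$@,. \t\r\n]")


def key_type(key):
    """Return the category name for a single tag key."""
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    if PROBLEMCHARS.search(key):
        return "problemchars"
    return "other"


def process_map(source):
    """Tally the category of the "k" attribute of every <tag> element."""
    counts = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, elem in ET.iterparse(source):
        if elem.tag == "tag":
            counts[key_type(elem.get("k", ""))] += 1
    return counts
```

With these patterns, keys containing uppercase letters, digits or more than one colon end up in "other".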

## Number of contributors

OpenStreetMap consists of data contributed by many people. Each piece of data in the OSMFILE is accompanied by the user ID of the person who entered it. We can find out how many people have contributed to the data that makes up the map of Dublin:

In [4]:
import users
users.number_of_users(OSMFILE)

Number of unique contributors: 1602
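
One way to arrive at such a count is to collect the distinct user IDs in a set. This is a sketch rather than the actual `users` module, and it assumes the ID is read from the `uid` attribute that OSM XML puts on nodes, ways and relations:

```python
import xml.etree.ElementTree as ET


def unique_user_ids(source):
    """Collect the distinct "uid" attribute values found in the file."""
    uids = set()
    for _, elem in ET.iterparse(source):
        uid = elem.get("uid")
        if uid is not None:  # child elements such as nd and tag have no uid
            uids.add(uid)
        elem.clear()
    return uids
```

The number of contributors is then `len(unique_user_ids(OSMFILE))`.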


## Data Auditing

The following steps will be taken to audit the OSMFILE:
1. Create a variable, `mapping`, that maps incorrect or inconsistent entries to the appropriate names and formatting. The mapping only covers problems actually found in this OSMFILE rather than attempting a generalized solution, since the problems depend on the particular area being audited.
2. Write a function that fixes the street name. The function takes a street name string as its argument and returns the fixed version.

## 1. Fix street names

In [5]:
import audit
audit.audit_street(OSMFILE)

defaultdict(set,
            {'1-13': {'The Rise 1-13'},
             '1-9': {'Manor Court 1-9'},
             '10-21': {'Manor Court 10-21'},
             '11': {"James Business Park, St Margaret's Road, Finglas North, Dublin 11",
              'Unit 6, North Park Business Park, Finglas, Dublin 11'},
             '14-28': {'The Rise 14-28'},
             '15': {'Rathborne Close71 RATHBORNE AVENUE RATHBORNE DUBLIN 15'},
             '2': {'Dame Court, Dublin 2'},
             '26': {'26'},
             '27-31': {'Supple Park 27-31'},
             '32-39': {'Supple Park 32-39'},
             '4': {'Serpentine Avenue, Ballsbridge, Dublin 4'},
             '40-44': {'Supple Park 40-44'},
             '48-': {'Supple Park 48-'},
             'Abbey': {'Fonthill Abbey',
              'Leopardstown Abbey',
              "Mary's Abbey",
              'Rothe Abbey',
              'Seachnall Abbey'},
             'Adair': {'Adair'},
             'Airport': {'Dublin Airport'},
             'Alba

Although plenty of extra street names show up that weren't in the expected list, most of them are less common street names that are acceptable. There are a few abbreviated street names and some spelling mistakes that can be fixed using mapping.
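
The grouping in the audit output, where each street name is filed under its last word, can be sketched as follows. The `EXPECTED` set is an illustrative subset, not the list actually used for Dublin, and the function names are hypothetical:

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Last whitespace-delimited word of the name, with an optional trailing period.
STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Illustrative subset only; the expected list used for Dublin is not shown here.
EXPECTED = {"Street", "Road", "Avenue", "Lane", "Square", "Court", "Park"}


def audit_street_type(street_types, street_name):
    """File the name under its last word unless that word is expected."""
    match = STREET_TYPE_RE.search(street_name)
    if match and match.group() not in EXPECTED:
        street_types[match.group()].add(street_name)


def audit_street(source):
    """Group unexpected street names by their final word."""
    street_types = defaultdict(set)
    for _, elem in ET.iterparse(source):
        if elem.tag == "tag" and elem.get("k") == "addr:street":
            audit_street_type(street_types, elem.get("v", ""))
    return street_types
```

Because only the last word is inspected, entries that end in a house number range or a postal district, such as "The Rise 1-13" or "Dame Court, Dublin 2", are grouped under that trailing token.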

In [6]:
audit.update_street_name(OSMFILE)

Strand Rd. => Strand Road
Upper Gardiner St. => Upper Gardiner Street
Charlemont St. => Charlemont Street
Old Dublin Roafd => Old Dublin Road
Griffith Ave => Griffith Avenue
First Ave => First Avenue
Spruce Ave => Spruce Avenue
The Rise,Belgard heights => The Rise,Belgard Heights
Ballinclea heights => Ballinclea Heights
Charlestown Shopping Cente => Charlestown Shopping Centre
library square => library Square
Francis St => Francis Street
Woodview Heichts => Woodview Heights
Oak Ridge Cres => Oak Ridge Crescent
Suffolk street => Suffolk Street
Earl street => Earl Street
Grafton street => Grafton Street
O'Reilly Aveune => O'Reilly Avenue
New market hall => New market Hall
St Johns Rd => St Johns Road
Novara road => Novara Road
Warner's lane => Warner's Lane
Hanbury lane => Hanbury Lane


'Clonkeen'
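
A mapping-based fix of this kind could look like the sketch below. The `MAPPING` entries are reconstructed from the corrections printed above and are only a partial list; `fix_street_name` is a hypothetical helper that works on a single name, unlike `audit.update_street_name`, which takes the file:

```python
import re

STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Partial mapping reconstructed from the corrections printed above.
MAPPING = {
    "Rd.": "Road", "Rd": "Road", "road": "Road", "Roafd": "Road",
    "St.": "Street", "St": "Street", "street": "Street",
    "Ave": "Avenue", "Aveune": "Avenue",
    "Cres": "Crescent",
    "heights": "Heights", "Heichts": "Heights",
    "lane": "Lane", "square": "Square", "hall": "Hall",
    "Cente": "Centre",
}


def fix_street_name(name, mapping=MAPPING):
    """Replace the final word of a street name when the mapping covers it."""
    match = STREET_TYPE_RE.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name
```

Only the last word is rewritten, which is why "library square" becomes "library Square" rather than "Library Square". Names whose last word is not in the mapping, such as "Clonkeen", are returned unchanged.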

## 2. Fix Eircodes (Postal codes)

In [7]:
import eircodes
eircodes.audit(OSMFILE)

defaultdict(set,
            {'0000': {'0000'},
             '1': {'1'},
             '12': {'12'},
             '13': {'13'},
             '17': {'17'},
             '18': {'18'},
             '2': {'2'},
             '22': {'22'},
             '3': {'3'},
             '4': {'4'},
             '8': {'8'},
             'A94 A0D0': {'A94 A0D0'},
             'A94 AN81': {'A94 AN81'},
             'A94 E4X2': {'A94 E4X2'},
             'A94 FA39': {'A94 FA39'},
             'A94 KD78': {'A94 KD78'},
             'A94 P206': {'A94 P206'},
             'A94 PC61': {'A94 PC61'},
             'A94 PC95': {'A94 PC95'},
             'A94 R299': {'A94 R299'},
             'A94 W209': {'A94 W209'},
             'A94 X660': {'A94 X660'},
             'A94 X6C1': {'A94 X6C1'},
             'A94 X886': {'A94 X886'},
             'A96 A021': {'A96 A021'},
             'A96 AD99': {'A96 AD99'},
             'A96 CD72': {'A96 CD72'},
             'A96 D376': {'A96 D376'},
             'A96 D850': {'A9

The list above shows the Eircodes found. The acceptable format is XXX XXXX. Some are missing the space in the middle, so we'll fix them.

Ireland recently introduced Eircodes, each of which represents a single building. Some entries still use the older postcode convention, which consisted of "Dublin" followed by a number indicating the region of Dublin, e.g. "Dublin 2". The bare numbers at the top of the list are likely from this older system, where it was common to write a postcode as "Dublin 2", "D2" or just "2".

These old postcodes can't easily be translated to Eircodes. The old convention represented a region within Dublin, whereas an Eircode represents a specific building, so each old postcode now corresponds to hundreds of unique Eircodes. Converting to the new system would require the time-consuming manual work of pinpointing each building and looking up its newly assigned Eircode. So we will just tidy up the formatting of the Eircodes that have already been entered.

In [8]:
eircodes.update_eircode(OSMFILE)

D01X2P2 => D01 X2P2
D15KPW7 => D15 KPW7
D02X285 => D02 X285
D05N7F2 => D05 N7F2
D08P 89W => D08 P89W
D6WXK28 => D6W XK28
d09 f6x0 => D09 F6X0
D09VY19 => D09 VY19


'D01 NW14'
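
The formatting fix can be sketched as: strip all whitespace, convert to uppercase, then re-insert a single space after the three-character routing key. The pattern below is a loose assumption about what counts as an Eircode (it accepts any letter or digit in the last four positions), and `format_eircode` is a hypothetical helper rather than the function in `eircodes.py`:

```python
import re

# Routing key: one letter and two digits, or the special case D6W.
# Unique identifier: four characters, loosely matched as letters or digits.
EIRCODE_RE = re.compile(r"^([A-Z][0-9]{2}|D6W)([0-9A-Z]{4})$")


def format_eircode(value):
    """Return the value as 'XXX XXXX' if it looks like an Eircode, else unchanged."""
    compact = re.sub(r"\s+", "", value).upper()
    match = EIRCODE_RE.match(compact)
    if match:
        return f"{match.group(1)} {match.group(2)}"
    return value
```

Old-style values such as "2" or "0000" do not match the pattern and are left as they are, in line with the decision above not to convert them.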

## 3. Preparing Data for Database

After auditing is complete, the next step is to prepare the data for insertion into a SQL database. To do so, we parse the elements in the OSM XML file and transform them from document format to tabular format so that they can be written to .csv files. These csv files can then easily be imported into a SQL database as tables.

The process for this transformation is as follows:
- Use iterparse to iteratively step through each top level element in the XML
- Shape each element into several data structures using a custom function
- Utilize a schema and validation library to ensure the transformed data is in the correct format
- Write each data structure to the appropriate .csv files
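
The shaping and writing steps might look like the sketch below for a single node. The column names are assumptions (the real layout is defined in `schema.py`), the splitting of keys such as `addr:street` into a type and a key is one possible convention, and the schema validation step is left out:

```python
import csv
import xml.etree.ElementTree as ET

# Assumed column layout; the real one is defined in schema.py.
NODE_FIELDS = ["id", "lat", "lon", "user", "uid", "version", "changeset", "timestamp"]
TAG_FIELDS = ["id", "key", "value", "type"]


def shape_node(elem):
    """Turn a <node> element into one nodes row and a list of nodes_tags rows."""
    node = {field: elem.get(field) for field in NODE_FIELDS}
    tags = []
    for tag in elem.iter("tag"):
        key = tag.get("k")
        # "addr:street" becomes type "addr" and key "street"; keys without a
        # colon keep their name and get the type "regular".
        tag_type, _, rest = key.partition(":")
        if rest:
            row = {"id": node["id"], "key": rest, "value": tag.get("v"), "type": tag_type}
        else:
            row = {"id": node["id"], "key": key, "value": tag.get("v"), "type": "regular"}
        tags.append(row)
    return node, tags


def write_rows(handle, fieldnames, rows):
    """Write dict rows to an open text file as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

Ways would be shaped in the same manner, with an extra list of rows recording each `nd` reference and its position for ways_nodes.csv.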

The data.py file generates the following csv files:
* nodes_tags.csv
* nodes.csv 
* ways_nodes.csv
* ways_tags.csv
* ways.csv
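
Importing one of these files into a database can be done with the standard library alone. The sketch below assumes SQLite and uses a hypothetical helper, `load_csv`; the table and column names passed in are up to the caller and should match the csv header:

```python
import csv
import sqlite3


def load_csv(conn, table, columns, csv_path):
    """Create a table with the given columns and fill it from a CSV file."""
    # Table and column names are interpolated into the SQL text, so they must
    # come from trusted code, never from user input.
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_list})")
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = ([row[col] for col in columns] for row in reader)
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()
```

For example, `load_csv(sqlite3.connect("dublin.db"), "nodes_tags", ["id", "key", "value", "type"], "nodes_tags.csv")`, where the database file name is only a placeholder. The columns are created without declared types, so every value is stored as text.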

## 4. Data Exploration

### TODO:
Database queries are used to provide a statistical overview of the dataset, such as:
* size of the file
* number of unique users
* number of nodes and ways
* number of nodes of a chosen type, such as cafes or shops

Additional statistics not in the list above are computed. For SQL submissions, some queries make use of more than one table.
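
As a starting point for this section, the overview queries could be collected as in the sketch below. It assumes a SQLite database with tables named after the csv files (`nodes`, `ways`, `nodes_tags`), a `uid` column on nodes and ways, and `key`/`value` columns on the tag table; no results are shown because the queries have not been run here:

```python
import sqlite3

QUERIES = {
    "nodes": "SELECT COUNT(*) FROM nodes",
    "ways": "SELECT COUNT(*) FROM ways",
    # A contributor may have edited nodes, ways, or both, so the two tables
    # are combined (UNION removes duplicates) before counting.
    "unique_users": """
        SELECT COUNT(*) FROM (
            SELECT uid FROM nodes UNION SELECT uid FROM ways
        )
    """,
    "cafes": """
        SELECT COUNT(*) FROM nodes_tags
        WHERE key = 'amenity' AND value = 'cafe'
    """,
}


def overview(conn):
    """Run each overview query and return its single value by name."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}
```

The unique users query is an example of a query that draws on more than one table. The file size is not a database question and can be read with `os.path.getsize`.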

## 5. Additional ideas

### TODO:
* One or more additional suggestions for improving the data or its analysis. The suggestions are backed up by at least one investigative query.
* Discussion about the benefits as well as some anticipated problems in implementing the improvement.

Suggestions:
* Updating postcodes. Ireland only recently got national postcodes, known as Eircodes. Before that, only Dublin had postcodes. If OSM converted all letters in Eircodes to uppercase, there would be a bit more consistency.
* There should be documentation on standard practices, e.g. whether phone numbers should use brackets, dashes and spaces. Some fields had underscores filled in, probably because the user who entered the data didn't know that piece of information. Values should also be checked to make sure that only digits have been entered, with no special characters except +, which is often used in international dialling codes.
* Additionally, OSM could have guidelines on whether addresses should contain full or abbreviated names, e.g. "Avenue" or "Ave.".
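
The phone number suggestion could be backed by a small check like the one below. The notion of a "clean" number used here (an optional leading +, then groups of digits separated by single spaces) is an assumption, since no standard is defined in this project, and `phone_issues` is a hypothetical helper. The input values could come from a query such as `SELECT value FROM nodes_tags WHERE key = 'phone'`, assuming that table layout:

```python
import re

# Assumed notion of "clean": an optional leading +, then groups of digits
# separated by single spaces. This is one possible convention, not a standard.
CLEAN_PHONE_RE = re.compile(r"^\+?\d+( \d+)*$")


def phone_issues(values):
    """Return the phone values that do not match the assumed clean format."""
    return [value for value in values if not CLEAN_PHONE_RE.match(value)]
```

Values with brackets, dashes or underscore placeholders are returned for review, which gives a rough measure of how inconsistent the field is.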

## Conclusion

### Files

* top_level_tags.py -> Find out what the top level tags are and how many of them there are.
* tagtype.py -> Check the "k" value for each "tag" tag and see if there are any potential problems.
* users.py -> Find out how many users contributed to the Dublin map.
* audit.py -> Audit data for street names.
* eircodes.py -> Audit data for Eircodes.
* schema.py -> Schemas to be generated.
* data.py -> Clean data and store it in generated csv files.