# OpenStreetMap Project

## Data for project

Melbourne, Australia metropolitan area (Node: 21579127) downloaded from:
- https://mapzen.com/data/metro-extracts/metro/melbourne_australia/
- https://www.openstreetmap.org/node/21579127

## Overview of dataset

The downloaded .osm file for Melbourne has a size of 855125721 bytes (approximately 855 MB).

*Looping through the ElementTree object, I was able to gather some general stats for the number of nodes, tags, unique contributors, etc.*

List of tags (and count) in 'melbourne_australia.osm' file:
* 'node': 3823741
* 'nd': 4494943
* 'bounds': 1
* 'member': 101777
* 'tag': 2230709
* 'relation': 4546
* 'way': 526873
* 'osm': 1

Number of unique contributors is: 2499
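
The counting pass described above could look roughly like the sketch below. It is not the original script: the function name `count_tags_and_users` and the tiny `SAMPLE_OSM` document are made up for illustration, and it assumes contributors are identified by the `uid` attribute.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_and_users(source):
    """Count element tags and collect unique contributor ids in one pass."""
    tag_counts = Counter()
    user_ids = set()
    for _, element in ET.iterparse(source, events=("end",)):
        tag_counts[element.tag] += 1
        # Assumption: a contributor is identified by the 'uid' attribute.
        uid = element.attrib.get("uid")
        if uid is not None:
            user_ids.add(uid)
        # Free the element's contents to keep memory use low on large files.
        element.clear()
    return tag_counts, user_ids


# Tiny made-up sample standing in for 'melbourne_australia.osm'.
SAMPLE_OSM = b"""<osm>
  <node id="1" uid="10" lat="-37.81" lon="144.96">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" uid="11" lat="-37.82" lon="144.97"/>
  <way id="3" uid="10">
    <nd ref="1"/>
    <nd ref="2"/>
  </way>
</osm>"""

tag_counts, user_ids = count_tags_and_users(io.BytesIO(SAMPLE_OSM))
print(dict(tag_counts))
print(len(user_ids))
```

For the real file, a file path (or a file opened in binary mode) can be passed in place of the `BytesIO` object.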

# Auditing dataset

## Audit #1: cleaning the street 'types'

A function was written using a regular expression to isolate the last word in the 'street name' (assumed to be the street 'type', e.g. 'Road').

This was then checked against an 'expected set' of values, with the output values either added to the expected set, or corrected via a 'mapping' dictionary depending on my assessment of their validity.

This code was used to write a .py script called 'street_type_cleaner.py' that can be called from the shape_element() function later when writing the data to .csv format.

Note: Most errors were with abbreviations ("St", "Ave", etc.) or lowercase ("street", "road"), although there were some typos ("Stree"); these were all converted to capitalized full words (e.g. "Street", "Road").

Also, it should be noted that "North"/"South"/"East"/"West" were added to the list of acceptable types, as they often occur at the end of the street string (e.g. "Balwyn Road South"). However, others may prefer to change how these cases are handled.

*A test function was written to show which entries had been modified, and what their 'corrected' output was.*

**From the test(), the street types appeared to be cleaned up correctly with this auditing function.**
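
As a rough illustration of this audit (not the contents of 'street_type_cleaner.py'), the sketch below isolates the last word of a street name and corrects it through a mapping. The names `STREET_TYPE_RE`, `EXPECTED`, `MAPPING` and `update_street_name` are assumptions, and the mapping only holds a handful of example entries.

```python
import re

# Matches the last whitespace-delimited word of a street name,
# including a trailing full stop (e.g. "St.").
STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Illustrative subsets only; the real sets were built up during the audit.
EXPECTED = {"Street", "Road", "Avenue", "Crescent",
            "North", "South", "East", "West"}

MAPPING = {
    "St": "Street",
    "St.": "Street",
    "Ave": "Avenue",
    "street": "Street",
    "road": "Road",
    "Stree": "Street",
    "Cresecnt": "Crescent",
}


def update_street_name(name):
    """Return the street name with a known-bad street type corrected."""
    match = STREET_TYPE_RE.search(name)
    if match is None:
        return name
    street_type = match.group()
    if street_type in EXPECTED or street_type not in MAPPING:
        # Either already fine, or unknown: leave it for manual review.
        return name
    return name[:match.start()] + MAPPING[street_type]


for raw in ["Smith St", "Main St.", "High street", "Balwyn Road South"]:
    print("{} => {}".format(raw, update_street_name(raw)))
```

Unknown street types are deliberately left untouched, so that new problem cases show up in the audit output rather than being silently rewritten.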

## Audit #2: looking at the 'amenities' in Melbourne

I wanted to see what types of 'amenities' were listed for various nodes in Melbourne.

I started by printing out the full list of amenities. The list was small enough to browse in its entirety.

It looked pretty clean. The only issue was an amenity called "yes", which seemed like an incorrect input.

Next, I tried to find the entry/entries with this amenity value so they could be removed or updated.

In [None]:
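import xml.etree.ElementTree as ET

def find_yes(osmfile):
    """Return the ids of all nodes whose 'amenity' tag has the value "yes"."""
    # is_amenity_type() and MELB_DATASET are defined elsewhere in the notebook.
    amenity_is_yes = set()
    with open(osmfile, "rb") as osm_file:
        # Use "end" events so each node's child <tag> elements are fully parsed.
        for event, element in ET.iterparse(osm_file, events=("end",)):
            if element.tag == "node":
                for tag in element.iter("tag"):
                    if is_amenity_type(tag) and (tag.attrib['v'] == "yes"):
                        amenity_is_yes.add(element.attrib['id'])
    return amenity_is_yes

print(find_yes(MELB_DATASET))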

Only one entry (a "node" with "id=2674416306") had "yes" as its "amenity" value.

This specific entry also had an attribute "name = barbeque". As there are other amenity entries as "bbq", I chose to update this one to also have "amenity='bbq'". This was done in the shape_element() function by finding the amenity with "yes" as a value, then applying a simple function saved in the 'amenity_cleaner.py' file that updated the attribute value to "bbq".

In [None]:
if (tag.attrib["k"] == "amenity") and (tag.attrib["v"] == "yes"):
    # update_amenity_type() comes from 'amenity_cleaner.py' and returns the replacement value ("bbq")
    new_amenity_value = am.update_amenity_type()
    node_tag_dict.update({"key": tag.attrib["k"], "value": new_amenity_value, "type": "regular"})

As the rest of the amenity entries appeared ok, no further cleaning was applied.

## Audit #3: check 'postcode' validity

A regex was used to identify problematic postcodes in the dataset. A postcode was treated as problematic if it was not a four-digit number starting with 3.

A script was written to identify any non-conforming postcodes.

**Note:** Initially, only the "way" elements were assessed for postcode entries. After creating the SQLite database at a later stage, it was noticed that there were problematic entries for the tags in the "node" element also. Therefore, an 'elif element.tag == "node":' statement was added, which appeared to resolve the issue after the mapping dictionary was updated with the newly-identified problematic entries.

The output of this function was then used to build a 'mapping' dictionary that could correct any problematic entries.

Some searching was required to identify the actual postcodes of some non-conforming entries, or what they should be based on other attributes listed in the relevant tags. For example:
* Searching for "Yarra Street" (way.id=266729151) revealed that the "addr:street" and "addr:postcode" attributes have each other's values! The postcode should be "3220".
* Searching for "Victoria" (way.id=48333668) showed that the entry is based in Clayton. Therefore, the postcode should be updated to "3168".

A test() function was written as above, before adapting the code for the 'postcode_cleaner.py' script to be called directly from the shape_element function.
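
The postcode check and correction could be sketched as follows. This is only an approximation of what 'postcode_cleaner.py' does: the names `VALID_POSTCODE_RE`, `POSTCODE_MAPPING`, `is_valid_postcode` and `update_postcode` are assumptions, and the mapping contains just the two corrections described above.

```python
import re

# A conforming postcode is exactly four digits and starts with 3.
VALID_POSTCODE_RE = re.compile(r"^3\d{3}$")

# Only the two corrections discussed in the text; the real mapping was
# built from the full audit output.
POSTCODE_MAPPING = {
    "Yarra Street": "3220",  # street and postcode values were swapped
    "Victoria": "3168",      # entry is located in Clayton
}


def is_valid_postcode(value):
    """Return True if the value is a four-digit postcode starting with 3."""
    return VALID_POSTCODE_RE.match(value.strip()) is not None


def update_postcode(value):
    """Return a corrected postcode, or the input if no fix is known."""
    if is_valid_postcode(value):
        return value.strip()
    return POSTCODE_MAPPING.get(value, value)


samples = ["3168", "Victoria", "Yarra Street", "30000"]
problems = [value for value in samples if not is_valid_postcode(value)]
print(problems)
print([update_postcode(value) for value in samples])
```

Values that are invalid but have no mapping entry (such as the made-up "30000") are returned unchanged, so they still appear in a later audit.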

## Writing the xml data into csv files

The code to write csv files from the input xml/osm file was adapted from Lesson 13: Case Study.

The cleaning/auditing functions developed above were incorporated into the relevant section of the shape_element() function.

See the 'osm_melb_csv_writer.py' file for the full code.
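
To give an idea of how a cleaner can be hooked into the tag-shaping step, here is a minimal sketch. It is not the code in 'osm_melb_csv_writer.py': `shape_tag`, `CLEANERS` and the two stand-in cleaner functions are hypothetical, and the split of keys such as "addr:street" into a type ("addr") and a key ("street") is assumed from the way the tables are queried later on.

```python
def expand_street_abbreviation(value):
    """Stand-in for the street type cleaner (handles a trailing ' St' only)."""
    if value.endswith(" St"):
        return value[:-2] + "Street"
    return value


def fix_amenity(value):
    """Stand-in for the amenity cleaner: replace "yes" with "bbq"."""
    return "bbq" if value == "yes" else value


# Hypothetical lookup from the raw tag key to its cleaning function.
CLEANERS = {
    "addr:street": expand_street_abbreviation,
    "amenity": fix_amenity,
}


def shape_tag(element_id, key, value, cleaners, default_type="regular"):
    """Build one row for a *_tags table, cleaning the value on the way."""
    tag_type, separator, tag_key = key.partition(":")
    if not separator:
        # No colon in the key: the whole key is kept and the type is the default.
        tag_type, tag_key = default_type, key
    cleaner = cleaners.get(key)
    if cleaner is not None:
        value = cleaner(value)
    return {"id": element_id, "key": tag_key, "value": value, "type": tag_type}


print(shape_tag(1, "addr:street", "Smith St", CLEANERS))
print(shape_tag(2674416306, "amenity", "yes", CLEANERS))
```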

## Creating the SQLite database with the 'cleaned' data

**The newly created .csv files were loaded into various tables of a new database using Python DB-API (adapted from several threads in the Udacity forums)**

Using the schema provided, the tables 'nodes', 'nodes_tags', 'ways', 'ways_tags', and 'ways_nodes' were created and populated with the relevant csv files.

See 'sql_db_generator.py' file for code.
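
For readers without the script to hand, loading one csv file through the DB-API might look like the sketch below. It is an assumption-laden stand-in for 'sql_db_generator.py': the `load_csv` helper, the simplified 'nodes_tags' column types and the in-memory sample data are all made up for illustration.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, columns, csv_file):
    """Insert every row of an open csv file (with a header) into a table."""
    reader = csv.DictReader(csv_file)
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names are interpolated, so they must come from
    # trusted code, never from user input.
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders)
    rows = ([row[column] for column in columns] for row in reader)
    conn.executemany(sql, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
# Simplified table definition; the project used the schema provided.
conn.execute(
    "CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")

# Made-up rows standing in for the generated csv file.
sample_csv = io.StringIO(
    u"id,key,value,type\n"
    u"1,street,Camellia Crescent,addr\n"
    u"2,amenity,bbq,regular\n")

load_csv(conn, "nodes_tags", ["id", "key", "value", "type"], sample_csv)
count = conn.execute("SELECT COUNT(*) FROM nodes_tags").fetchone()[0]
print(count)
```

With real files, `sample_csv` would be replaced by a file opened on the csv written in the previous step, and `sqlite3.connect()` would be given a database file name instead of ":memory:".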

## Querying the new SQLite database

The querying was performed using the Python DB-API.

### Check street names for cleanliness

A basic test to see if the 'clean' street names had been loaded into the database was performed by searching for both the problematic version (identified during the data auditing step), and a 'cleaned' name.

*For example:*

In [None]:
SELECT tags.value FROM (SELECT * FROM nodes_tags) tags WHERE tags.key='street' 
    AND tags.value='Camellia Cresecnt' GROUP BY tags.value;

SELECT tags.value FROM (SELECT * FROM nodes_tags) tags WHERE tags.key='street' 
    AND tags.value='Camellia Crescent' GROUP BY tags.value;

The dirty name was not found, whereas the clean one was. This was repeated for several entries to confirm the presence of the cleaned data.

**Note:** When this check was run on the first version of the database, several erroneous entries were found (it is assumed that bad entries will have low 'counts', so only those with count < 5 need to be examined).

**Similar checks were run for both the 'postcode' and the 'amenity' fields (code not shown). These tests indicate that the database data had been cleaned successfully.**
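
The "low count" heuristic mentioned in the note can be expressed as a single query. The sketch below runs it against a small in-memory table with made-up rows, so the table contents and the `RARE_STREETS_SQL` name are assumptions; only the threshold of 5 comes from the note above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT)")

# Made-up data: one common street name and one rare, misspelled one.
rows = [(i, "street", "Smith Street", "addr") for i in range(1, 7)]
rows.append((7, "street", "Camellia Cresecnt", "addr"))
conn.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?)", rows)

# Street names that occur fewer than N times are candidates for review.
RARE_STREETS_SQL = """
SELECT value, COUNT(*) AS count
FROM nodes_tags
WHERE key = 'street'
GROUP BY value
HAVING COUNT(*) < ?
ORDER BY count, value;
"""

rare = conn.execute(RARE_STREETS_SQL, (5,)).fetchall()
print(rare)
```

A low count does not prove that a name is wrong (short streets are legitimately rare), so the output is a shortlist for manual inspection rather than a list of errors.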

## Other statistics about the Melbourne dataset

### Number of nodes

In [4]:
SELECT COUNT(*) FROM nodes;

[(3823741,)]


3823741

This matches the number calculated from the original xml file.

### Number of unique users

In [5]:
SELECT COUNT(DISTINCT(tag.uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) tag;

[(2493,)]


2493

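This is slightly lower than the value calculated from the raw xml (2499). This may indicate that some entries were removed from the dataset during the cleaning/file conversion process, e.g. those with problematic characters. Another possible cause, if the raw count covered every element type, is that the database only holds nodes and ways, so contributors who only edited relations would not be counted here. Either way, the difference is relatively minor.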

### Top 10 types of "surface" for "ways"
i.e. tag.attrib['surface']

In [8]:
SELECT value, COUNT(*) as count FROM ways_tags WHERE key='surface' GROUP BY value ORDER BY count DESC LIMIT 10;

[(u'asphalt', 29084),
 (u'paved', 8918),
 (u'concrete', 7152),
 (u'unpaved', 4423),
 (u'gravel', 1989),
 (u'dirt', 1134),
 (u'grass', 817),
 (u'cobblestone', 375),
 (u'ground', 249),
 (u'wood', 167)]


asphalt (29084)  
paved (8918)  
concrete (7152)  
unpaved (4423)  
gravel (1989)  
dirt (1134)  
grass (817)  
cobblestone (375)  
ground (249)  
wood (167)  

### No. of churches and no. of post offices

In [14]:
SELECT value, COUNT(*) as count FROM nodes_tags WHERE key='amenity' 
    AND (value='place_of_worship' or value='post_office') GROUP BY value;

[(u'place_of_worship', 494), (u'post_office', 266)]


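There are 494 nodes tagged as places of worship (a category that covers churches as well as other religious sites) and 266 nodes tagged as post offices.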

### What are the top 5 "barriers" in Melbourne

In [17]:
SELECT value, COUNT(*) as count FROM ways_tags WHERE key='barrier' GROUP BY value ORDER BY count DESC LIMIT 5;

[(u'fence', 26759),
 (u'wall', 266),
 (u'hedge', 228),
 (u'retaining_wall', 152),
 (u'wire_fence', 133)]


fence (26759)  
wall (266)  
hedge (228)  
retaining_wall (152)  
wire_fence (133)  

Fences dominate by a wide margin, although there are still a fair number of hedges!

# Additional Ideas

### Missing data fields

It seems to me that the biggest problem with the dataset is missing attributes. Specifically, values for 'amenity' appear to be entered infrequently. For example, there are 494 nodes tagged as 'place_of_worship', but only 3 tagged as 'internet_cafe'. Given that I can see more than 3 internet cafes from where I'm sitting, this indicates that the dataset is incomplete.

One way I propose that this could be addressed is by cross-referencing with third-party datasets obtained from social media (Facebook, Twitter) or review-based applications (e.g. Yelp) where users are 'tagged' as being at a location, often with the type of establishment they are at (e.g. 'restaurant'). Therefore, the location can act as the foreign key to update a given node's 'amenity' field.

This kind of implementation may have the following benefits/anticipated problems:

**Benefits**  
  * May be able to automate the process to handle the large amounts of data required
  * Would allow for real-time/up-to-date cross-checking of amenity types. For example, near where I live there is a lot of business turnover, e.g. one day a location might be a 'restaurant', whereas one week later it's a 'hairdresser'. I imagine these are not being updated frequently in the osm data.

**Anticipated Problems**  
  * Conflicting entries. The lat./lon. location data would probably need to be matched within some accepted margin of error to allow cross-compatibility between the datasets. This may lead to conflicting entries. However, it's foreseeable that fine-tuning may minimize these occurrences.
  * Accessibility of user data from third parties. Privacy concerns/issues would need to be addressed.
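
To make the 'margin of error' idea more concrete, here is a rough sketch of location-based matching. Everything in it is hypothetical: the function names, the 25 m tolerance and the sample coordinates are illustrative choices, and a real implementation would need a spatial index rather than comparing every node with every third-party place.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def suggest_amenities(nodes, places, max_distance_m=25.0):
    """Suggest an amenity per node from nearby third-party places.

    nodes:  iterable of (node_id, lat, lon)
    places: iterable of (category, lat, lon)
    Returns (suggestions, conflicts); a node with more than one distinct
    category within range is reported as a conflict instead of updated.
    """
    suggestions, conflicts = {}, {}
    for node_id, lat, lon in nodes:
        nearby = sorted({category for category, p_lat, p_lon in places
                         if haversine_m(lat, lon, p_lat, p_lon) <= max_distance_m})
        if len(nearby) == 1:
            suggestions[node_id] = nearby[0]
        elif len(nearby) > 1:
            conflicts[node_id] = nearby
    return suggestions, conflicts


# Made-up coordinates around Melbourne.
nodes = [(1, -37.8136, 144.9631), (2, -37.8200, 144.9700), (3, -37.9000, 145.0000)]
places = [
    ("cafe", -37.81361, 144.96311),
    ("restaurant", -37.82001, 144.97000),
    ("hairdresser", -37.82000, 144.97001),
]

suggestions, conflicts = suggest_amenities(nodes, places)
print(suggestions)
print(conflicts)
```

Reporting conflicts separately, rather than picking a winner, keeps the ambiguous cases visible so the tolerance can be tuned.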

# Conclusion

The osm dataset for Melbourne is extremely large, which is to be expected given the city's relatively large population and huge land area (i.e. low population density). There are many entries in this dataset with incomplete and/or missing attributes; however, given the size of the dataset, the attributes that were inspected in this study (i.e. street names, postcodes, and amenities) were relatively clean. This allowed the auditing/cleaning process to be conducted by directly writing dictionaries/maps to replace, update, or remove entries. If the sets of problematic entries were much larger, they would need to be cleaned more programmatically. That being said, there is a huge wealth of information already available in this dataset thanks to the 2490-ish users who have contributed!