# Wrangle OpenStreetMap Data
**Author**: Shawn Lin  
**Date**: 2017-03-30
## Introduction
In this project, we study data from the OpenStreetMap (OSM) project. We audit and clean the data, then look for interesting information in it.
### OpenStreetMap
From Wikipedia:
>OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. The creation and growth of OSM has been motivated by restrictions on use or availability of map information across much of the world, and the advent of inexpensive portable satellite navigation devices. OSM is considered a prominent example of volunteered geographic information.

>Created by Steve Coast in the UK in 2004, it was inspired by the success of Wikipedia and the predominance of proprietary map data in the UK and elsewhere. Since then, it has grown to over 2 million registered users, who can collect data using manual survey, GPS devices, aerial photography, and other free sources. These crowdsourced data are then made available under the Open Database Licence. The site is supported by the OpenStreetMap Foundation, a non-profit organisation registered in England and Wales.

A snapshot of the OSM webpage:
![OpenStreetMap](./OSMwebpage.JPG)


### Location: New York City
New York City, arguably the most diverse city in the world, has always been one of my favourite cities. Being Chinese myself, I have always craved proper Chinese food. Fortunately, New York has plenty of it, which made my time at Columbia University much easier. In this case study, I have downloaded the OpenStreetMap data for the [New York Metro Area](https://mapzen.com/data/metro-extracts/metro/new-york_new-york/), and will examine whether OSM data can be a reliable source for finding great food, or even for planning my daily commute.

## Auditing the data
OSM is a collaborative project that relies on volunteers to contribute geographic information. Because the data are entered by humans, entry errors, inconsistent standards, incomplete entries and other problems can make the data messy. We will first apply some validation rules to audit the data and make it cleaner. As the OSM dataset for the New York Metro Area is very large (~2.6GB), we will start by sampling ~0.5% of the data to get a flavour of the kinds of problems we might encounter.
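Sampling of this kind can be sketched as follows. This is a minimal illustration, not the project's actual `sample_osm.py`: it assumes that keeping every `k`-th top-level element (with `k = 200` for roughly 0.5%) is an acceptable way to sample, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_top_level(osm_file, tags=("node", "way", "relation")):
    """Yield top-level OSM elements one at a time, without loading the whole file."""
    context = iter(ET.iterparse(osm_file, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the <osm> root
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # drop already-processed children to keep memory flat


def sample_elements(osm_file, k=200):
    """Yield every k-th top-level element (k=200 keeps about 0.5%)."""
    for i, elem in enumerate(iter_top_level(osm_file)):
        if i % k == 0:
            yield elem


def write_sample(osm_file, sample_file, k=200):
    """Write the sampled elements to a new, smaller OSM XML file."""
    with open(sample_file, "wb") as out:
        out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
        for elem in sample_elements(osm_file, k):
            out.write(ET.tostring(elem, encoding="utf-8"))
        out.write(b"</osm>\n")
```

Calling `write_sample("new-york.osm", "sample.osm")` (the input file name here is an assumption) would stream through the large file once and keep one element in every 200.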

### OSM XML file:
The following is a sample OSM XML file found on the [OpenStreetMap Wiki](http://wiki.openstreetmap.org/wiki/OSM_XML):
```
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.0.2">
 <bounds minlat="54.0889580" minlon="12.2487570" maxlat="54.0913900" maxlon="12.2524800"/>
 <node id="298884269" lat="54.0901746" lon="12.2482632" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
 <node id="261728686" lat="54.0906309" lon="12.2441924" user="PikoWinter" uid="36744" visible="true" version="1" changeset="323878" timestamp="2008-05-03T13:39:23Z"/>
 <node id="1831881213" version="1" changeset="12370172" lat="54.0900666" lon="12.2539381" user="lafkor" uid="75625" visible="true" timestamp="2012-07-20T09:43:19Z">
  <tag k="name" v="Neu Broderstorf"/>
  <tag k="traffic_sign" v="city_limit"/>
 </node>
 ...
 <node id="298884272" lat="54.0901447" lon="12.2516513" user="SvenHRO" uid="46882" visible="true" version="1" changeset="676636" timestamp="2008-09-21T21:37:45Z"/>
 <way id="26659127" user="Masch" uid="55988" visible="true" version="5" changeset="4142606" timestamp="2010-03-16T11:47:08Z">
  <nd ref="292403538"/>
  <nd ref="298884289"/>
  ...
  <nd ref="261728686"/>
  <tag k="highway" v="unclassified"/>
  <tag k="name" v="Pastower Straße"/>
 </way>
 <relation id="56688" user="kmvar" uid="56190" visible="true" version="28" changeset="6947637" timestamp="2011-01-12T14:23:49Z">
  <member type="node" ref="294942404" role=""/>
  ...
  <member type="node" ref="364933006" role=""/>
  <member type="way" ref="4579143" role=""/>
  ...
  <member type="node" ref="249673494" role=""/>
  <tag k="name" v="Küstenbus Linie 123"/>
  <tag k="network" v="VVW"/>
  <tag k="operator" v="Regionalverkehr Küste"/>
  <tag k="ref" v="123"/>
  <tag k="route" v="bus"/>
  <tag k="type" v="route"/>
 </relation>
 ...
</osm>
```
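One quick way to get a feel for a file with this structure is to count how often each element type occurs. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`; the function name `count_tags` is chosen here for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(osm_file):
    """Count how many times each XML tag (node, way, nd, tag, ...) appears."""
    counts = Counter()
    # With no `events` argument, iterparse reports only "end" events,
    # so each element is counted once, after it has been fully read.
    for _, elem in ET.iterparse(osm_file):
        counts[elem.tag] += 1
        elem.clear()  # free the element's contents once counted
    return counts
```

`osm_file` can be a file name or an open binary file object, so the same function should work on both the full extract and a smaller sample.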

The element `<bounds>` is fairly straightforward: it is simply the map area that this OSM XML file covers. The other elements, such as `<node>` and `<way>`, look slightly more complicated. The [Wiki](http://wiki.openstreetmap.org/wiki/Elements) gives definitions for these elements and their attributes:

#### Elements
##### Node
A node represents a specific point on the earth's surface defined by its latitude and longitude. Each node comprises at least an id number and a pair of coordinates.  
Nodes can be used to define standalone point features. For example, a node could represent a park bench or a water well.  
Nodes are also used to define the shape of a way. When used as points along ways, nodes usually have no tags, though some of them could.

##### Way
A way is an ordered list of between 2 and 2,000 nodes that define a polyline. Ways are used to represent linear features such as rivers and roads.  
Ways can also represent the boundaries of areas (solid polygons) such as buildings or forests. In this case, the way's first and last node will be the same. This is called a "closed way".  
Note that closed ways occasionally represent loops, such as roundabouts on highways, rather than solid areas. The way's tags must be examined to discover which it is.
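The closed-way rule described above can be checked directly from a way's `<nd>` references. This is a small sketch with a hypothetical helper name; it only tests whether the first and last node references match, and does not try to decide from the tags whether the way is an area or a loop.

```python
import xml.etree.ElementTree as ET


def is_closed_way(way):
    """Return True if the way's first and last node references are the same."""
    refs = [nd.get("ref") for nd in way.findall("nd")]
    return len(refs) >= 2 and refs[0] == refs[-1]


# Illustrative data: a triangle (closed) and a plain two-node segment (open).
closed = ET.fromstring('<way id="1"><nd ref="10"/><nd ref="11"/><nd ref="12"/><nd ref="10"/></way>')
opened = ET.fromstring('<way id="2"><nd ref="10"/><nd ref="11"/></way>')
print(is_closed_way(closed), is_closed_way(opened))
```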

##### Relation
A relation is a multi-purpose data structure that documents a relationship between two or more data elements (nodes, ways, and/or other relations). Examples include:
* A route relation, which lists the ways that form a major (numbered) highway, a cycle route, or a bus route.
* A turn restriction that says you can't turn from one way into another way.
* A multipolygon that describes an area (whose boundary is the 'outer way') with holes (the 'inner ways'). 

##### Tag
All types of data element (nodes, ways and relations) can have tags. An element can have no tags. Tags describe the meaning of the particular element to which they are attached.  
A tag consists of two free-format text fields: a 'key' and a 'value'. Each of these is a Unicode string of up to 255 characters. For example, `highway=residential` defines the way as a road whose main function is to give access to people's homes. An element cannot have two tags with the same key; keys must be unique. For example, an element cannot have both `amenity=restaurant` and `amenity=bar`.  
For conventions on tags, refer to [Map Features](http://wiki.openstreetmap.org/wiki/Map_Features).
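Because keys are unique within an element, an element's tags map naturally onto a Python dictionary. The helper below is a hypothetical convenience for illustration, using the tagged node from the sample XML above.

```python
import xml.etree.ElementTree as ET


def tags_of(elem):
    """Collect an element's <tag k="..." v="..."/> children into a dict."""
    return {tag.get("k"): tag.get("v") for tag in elem.findall("tag")}


node = ET.fromstring(
    '<node id="1831881213" lat="54.0900666" lon="12.2539381">'
    '<tag k="name" v="Neu Broderstorf"/>'
    '<tag k="traffic_sign" v="city_limit"/>'
    '</node>'
)
node_tags = tags_of(node)
print(node_tags["traffic_sign"])
```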

#### Attributes
We will only be going through some common attributes here:
##### id
Used for identifying the element. Element types have their own ID space, so there could be a node with id=100 and a way with id=100, which are unlikely to be related or geographically near to each other.  
##### user
Display name of the user who last modified the object. A user can change their display name at any time.
##### uid
The numeric identifier of the user who last modified the object. A user identifier never changes.
##### timestamp
Time of the last modification.
##### visible
Whether the object is deleted or not in the database. If visible="false", the object should only be returned by history calls.
##### version
The edit version of the object. Newly created objects start at version 1 and the value is incremented by the server when a client uploads a new version of the object. The server will reject a new version of an object if the version sent by the client does not match the current version of the object in the database.

### Sampling OSM data
Using [sample_osm.py](https://github.com/shawnlinxl/Udacity/blob/master/Projects/P3%20Wrangle%20OpenStreetMap%20Data/sample_osm.py), we have sampled about 0.5% of the original 2.61GB New York Metro Area OSM data. At about 14.4MB, this sample is much smaller. The following is raw output from the resulting file [sample.osm](https://raw.githubusercontent.com/shawnlinxl/Udacity/master/Projects/P3%20Wrangle%20OpenStreetMap%20Data/sample.osm):


### Auditing Sample
#### Street names
One of the many problems visible in the raw output from sample.osm is that street names are abbreviated. For example, Bushwick Avenue has been abbreviated as Bushwick Ave in some cases. We have therefore built an initial set of standard street types, and use [audit.py]() to check how many street names do not conform to it. Our initial standard street types are:  
* Street 
* Avenue
* Boulevard
* Drive
* Court
* Place
* Square
* Lane
* Road
* Trail
* Parkway
* Commons
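An audit against this list might look like the sketch below. It is not the project's `audit.py`; it assumes that the street type is the last word of the name and that street names are stored in `addr:street` tags on nodes and ways. The regular expression and function names are illustrative.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Last whitespace-delimited token of the name, with an optional trailing period.
street_type_re = re.compile(r"\b\S+\.?$")

EXPECTED = {"Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Square", "Lane", "Road", "Trail", "Parkway", "Commons"}


def audit_street_type(street_types, street_name):
    """Record the street name under its last word if that word is unexpected."""
    match = street_type_re.search(street_name)
    if match and match.group() not in EXPECTED:
        street_types[match.group()].add(street_name)


def audit(osm_file):
    """Group non-conforming street names by their final word."""
    street_types = defaultdict(set)
    for _, elem in ET.iterparse(osm_file):  # "end" events: children are complete
        if elem.tag in ("node", "way"):
            for tag in elem.iter("tag"):
                if tag.get("k") == "addr:street":
                    audit_street_type(street_types, tag.get("v"))
            elem.clear()
    return street_types
```

Names whose last word is in the expected set are skipped, so the returned dictionary contains only the cases that need a closer look.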

After auditing, there are indeed some street names that do not conform to our standard, as well as some legitimate street types that we have not yet taken into account. The following is a sample of the audit output:
```
{'109': set(['Route 109']),
 '3': set(['ROAD 3']),
 'A': set(['Avenue A']),
 'Americas': set(['Avenue of the Americas']),
 'Ave': set(['Hudson Ave', 'Stewart Ave', 'Underhill Ave']),
 'B': set(['Avenue B', 'Jefferson Ave #B']),
 'Broadway': set(['Broadway']),
 'C': set(['Avenue C', 'GIBBONS CIRCLE BLDG C']),
 'Center': set(['Metrotech Center']),
 'Cir': set(['Greenview Cir',
             'Legends Cir',
             'Lewis Cir',
             'White Spruce Cir',
             'Winnecomac Cir']),
 'Circle': set(['Dawson Circle', 'Pelican Circle', 'Tompkins Circle']),
 'Concourse': set(['Grand Concourse']),
 'Cres': set(['August Cres', 'Rita Cres']),
 'Crescent': set(['71st Crescent',
                  'Brian Crescent',
                  'Cromwell Crescent',
                  'Dieterle Crescent',
                  'Ellwell Crescent',
                  'Pershing Crescent',
                  'Slocum Crescent']),
...
 ```
In this sample, one entry I find interesting is `Avenue of the Americas`. It is also known as `Sixth Avenue`, which is how most people refer to it, so I will replace every `Avenue of the Americas` with `Sixth Avenue` later.  
Another common type of street name is an avenue named with a single letter. These can be found in the East Village as well as in Brooklyn. In fact, Brooklyn has avenues named for almost every letter from A to Z:
![Avenues ending in alphabets](./avenuealpha.jpg)
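The clean-up suggested by this audit could be sketched as follows. The mapping is an assumption built only from abbreviations visible in the sample output (`Ave`, `Cir`, `Cres`), and `update_name` is a hypothetical helper, not the project's final cleaning code. Lettered avenues such as `Avenue A` are treated as legitimate and left unchanged.

```python
import re

# Last whitespace-delimited token of the name, with an optional trailing period.
street_type_re = re.compile(r"\b\S+\.?$")

# Illustrative abbreviation mapping, taken from the audit sample only.
MAPPING = {"Ave": "Avenue", "Cir": "Circle", "Cres": "Crescent"}

# Whole-name replacements.
SPECIAL = {"Avenue of the Americas": "Sixth Avenue"}


def update_name(name):
    """Return a cleaned street name, or the original if no rule applies."""
    if name in SPECIAL:
        return SPECIAL[name]
    match = street_type_re.search(name)
    if match and match.group() in MAPPING:
        # Keep everything before the street type and expand the abbreviation.
        return name[:match.start()] + MAPPING[match.group()]
    return name


for raw in ["Hudson Ave", "Rita Cres", "Avenue of the Americas", "Avenue A"]:
    print(raw, "=>", update_name(raw))
```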