In [2]:
from map_project import *

# Wrangling OpenStreetMap Data 


## Preliminaries 


**Topic** 
The chosen subject is the State of Hawai'i. The dataset for the state (excluding Kaua'i) is 158 MB. For efficient auditing and debugging, a smaller sample was used: the Island of Hawai'i dataset (67.9 MB).

## Data acquisition 

**Data source:** 
The assignment provided data sources that are part of the OpenStreetMap project.  Custom extracts of raw OSM (XML) data were obtained from [mapzen](https://mapzen.com).  

### General description of data: 

"[OpenStreetMap](https://www.openstreetmap.org/about) is built by a community of mappers that contribute and maintain data about roads, trails, `cafés`, railway stations, and much more, all over the world."

Since OpenStreetMap is an open source project maintained by volunteers, human error is likely the main source of error, namely inconsistencies in data entry.  The [OSM XML wiki](http://wiki.openstreetmap.org/wiki/OSM_XML) provides documentation for the data.  

The OSM XML provides the framework for `elements` that represent physical features on the map.  `Elements` consist of `nodes`, `ways`, and `relations` (http://wiki.openstreetmap.org/wiki/Elements).  Each instance of an `element` is provided as a 'block': an XML element that encloses tags with key / value attributes.  The documentation describes the ['certainties and uncertainties'](http://wiki.openstreetmap.org/wiki/OSM_XML#Certainties_and_Uncertainties) of a given dataset.  


## Data wrangling  

### Data audit 

Data quality is assessed to verify assumptions about the type, shape and value of the data.  Errors and outliers are identified, and missing values are accounted for.  Measures of validity, accuracy, completeness, consistency and uniformity describe the quality of data.  

First, the XML elements in the data file were counted: 

In [3]:
tags = count_tags(OSMFILE)

In [4]:
print tags

{'node': 765029, 'nd': 884680, 'bounds': 1, 'member': 4828, 'tag': 239102, 'osm': 1, 'way': 61483, 'relation': 982}


**'node', 'nd', 'bounds', 'member', 'tag', 'osm', 'way', 'relation'** constitute the xml elements in the OSM dataset.  
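
The implementation of `count_tags` is not shown in this notebook. A minimal sketch of how such a tally could be produced with a streaming parse follows; the function name `count_tags_sketch` and the tiny sample document are made up for illustration and are not taken from `map_project`.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags_sketch(source):
    """Tally XML element names using a streaming parse (hypothetical helper)."""
    counts = Counter()
    # iterparse yields each element once its closing tag has been read
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children and attributes to limit memory use
    return dict(counts)


# Made-up OSM fragment, for illustration only
SAMPLE = b"""<osm>
  <bounds minlat="0" minlon="0" maxlat="1" maxlon="1"/>
  <node id="1" lat="0.1" lon="0.1"/>
  <node id="2" lat="0.2" lon="0.2"><tag k="name" v="Example"/></node>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

sample_counts = count_tags_sketch(io.BytesIO(SAMPLE))
```

Streaming matters here because the full extract is 158 MB; loading the whole tree at once would use far more memory than the tally needs.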

Of these, 'way', 'node', and 'relation' are OSM 'elements', the basic components of OpenStreetMap's data model.  

'osm' is the XML element that encloses the entire OSM data structure in the .osm file. 

'bounds' contains attributes that define the boundary coordinates of the map.

'member' is an XML element under the OSM element 'relation' (for example, the 'multipolygon' relation) that describes how the members of the 'relation' are related.  

'tag's exist as children of the element trees, fleshing out the details of the element.  

'nd's are child elements of 'way's that reference the 'node's that make up each 'way'.  

### Data Validity

As shown, there are 765029 'node', 61483 'way', and 982 'relation' element instances in the dataset.  There are no extraneous elements.  

A validator will be used in the subsequent SQL database intake. 

#### Verification / validation scheme of the tags in the osm dataset
The following aspects were interrogated: 

* osm: verify that there is only one.  
* bounds: verify coordinates of bounds.  
* member: are all 'members' 'ways'?  
* nds: are all 'nd's 'nodes' in the dataset?  
* tags: what features are represented?  

As seen above, there is only one instance each of 'osm' and 'bounds'.  The coordinates in 'bounds' are the following: 

In [5]:
bounds = get_attrib('bounds')
print bounds

{'minlat': '18.6982854', 'maxlon': '-154.6325683', 'minlon': '-158.4338378', 'maxlat': '21.8411047'}


The coordinates can be verified with a 'reverse lookup' through a `Nominatim` geocoder (bound to `geolocator` below): 

In [6]:
>>> geolocator = Nominatim()
>>> minloc = geolocator.reverse("{},{}".format(bounds['minlat'], bounds['minlon']))
>>> maxloc = geolocator.reverse("{},{}".format(bounds['maxlat'], bounds['maxlon']))
>>> print maxloc

96816


96816 is a ZIP code in Honolulu. The coordinates are also easily verified in Google Maps.  
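
The bounds can also be sanity-checked offline, without a geocoding service. The sketch below converts the string attributes to floats, confirms that each minimum is below its maximum, and tests whether a few reference points fall inside the box. The helper names are hypothetical, and the city coordinates are rounded approximations supplied for illustration rather than values from the dataset.

```python
def parse_bounds(bounds):
    """Convert the string attributes of a 'bounds' element to floats."""
    return {key: float(value) for key, value in bounds.items()}


def contains(box, lat, lon):
    """Return True if (lat, lon) lies inside the bounding box."""
    return (box['minlat'] <= lat <= box['maxlat']
            and box['minlon'] <= lon <= box['maxlon'])


# Values copied from the 'bounds' output above
bounds = {'minlat': '18.6982854', 'maxlon': '-154.6325683',
          'minlon': '-158.4338378', 'maxlat': '21.8411047'}
box = parse_bounds(bounds)

# Assumption: approximate (rounded) coordinates, not taken from the dataset
honolulu = (21.31, -157.86)
hilo = (19.72, -155.09)
lihue = (21.98, -159.37)  # on Kaua'i, which the extract excludes

box_is_well_formed = box['minlat'] < box['maxlat'] and box['minlon'] < box['maxlon']
inside = [contains(box, *point) for point in (honolulu, hilo, lihue)]
```

Under these assumed coordinates, Honolulu and Hilo fall inside the box while the Kaua'i point falls outside, which is consistent with the extract described in the Preliminaries.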

Check membership of 'relation' elements: 
Obtain all member ids in 'relation' and see if all of them match with ids in 'way'.  

In [7]:
>>> relation_member_refs = get_allof_childattrib('relation', 'member', 'ref')
>>> way_ids = get_allof_attrib('way','id')

In [8]:
>>> what_in_what(relation_member_refs, way_ids)

not_in: 317, is_in: 3108


317 of the member ids are not 'way' ids.  Members of a 'relation' can also be nodes.  

In [9]:
>>> node_ids = get_allof_attrib('node', 'id')

In [10]:
>>> what_in_what(relation_member_refs, way_ids.union(node_ids))

not_in: 11, is_in: 3414


All but 11 members of the 'relation' elements are in the union of 'node' and 'way' ids.  What else can they be?  

In [12]:
>>> relation_ids = get_allof_attrib('relation', 'id')
>>> what_in_what(relation_member_refs, way_ids.union(node_ids).union(relation_ids))

not_in: 0, is_in: 3425


Apparently, 11 of the members of 'relation' elements are themselves 'relation's.  
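
The membership checks above amount to set intersection and set difference. The sketch below shows that logic on made-up ids; `what_in_what_sketch` is a hypothetical stand-in that returns the two sets instead of printing their sizes, and it is not the notebook's actual `what_in_what`.

```python
def what_in_what_sketch(refs, ids):
    """Split refs into those missing from ids and those present in ids."""
    refs = set(refs)
    is_in = refs & set(ids)   # references that resolve to a known id
    not_in = refs - is_in     # references with no matching id
    return not_in, is_in


# Made-up ids, for illustration only
way_ids = {"10", "11"}
node_ids = {"1", "2"}
relation_ids = {"100", "101"}
member_refs = {"10", "11", "1", "101"}

# Widen the candidate id set step by step, as in the cells above
missing_ways, _ = what_in_what_sketch(member_refs, way_ids)
missing_ways_nodes, _ = what_in_what_sketch(member_refs, way_ids | node_ids)
missing_all, found_all = what_in_what_sketch(
    member_refs, way_ids | node_ids | relation_ids)
```

In OSM XML each `member` element also carries a `type` attribute (`node`, `way`, or `relation`), so the same breakdown could be cross-checked by tallying that attribute directly.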

Verifying that all 'nd' ids in the 'way' elements are also instances of 'node's: 

In [13]:
>>> way_nd_refs = get_allof_childattrib('way', 'nd', 'ref')
>>> what_in_what(way_nd_refs, node_ids) 

not_in: 0, is_in: 759598
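
The helpers `get_allof_attrib` and `get_allof_childattrib` are not shown here. One way such collectors could be written is sketched below, under the assumption that they return sets of attribute values; the function names and the sample XML are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def attrib_values(source, tag, attrib):
    """Collect one attribute from every element named `tag`."""
    return {elem.attrib[attrib]
            for _, elem in ET.iterparse(source, events=("end",))
            if elem.tag == tag and attrib in elem.attrib}


def child_attrib_values(source, parent_tag, child_tag, attrib):
    """Collect one attribute from the direct children of each parent element."""
    values = set()
    for _, parent in ET.iterparse(source, events=("end",)):
        if parent.tag == parent_tag:
            for child in parent.findall(child_tag):
                if attrib in child.attrib:
                    values.add(child.attrib[attrib])
    return values


# Made-up fragment in which way 11 references a node (99) that is absent
SAMPLE = b"""<osm>
  <node id="1"/><node id="2"/><node id="3"/>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
  <way id="11"><nd ref="2"/><nd ref="99"/></way>
</osm>"""

node_ids = attrib_values(io.BytesIO(SAMPLE), "node", "id")
way_nd_refs = child_attrib_values(io.BytesIO(SAMPLE), "way", "nd", "ref")
dangling = way_nd_refs - node_ids  # 'nd' references with no matching node
```

An empty `dangling` set corresponds to the `not_in: 0` result reported above for the Hawai'i dataset.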


#### Map features 

The tags in an .osm file are not easily translated into human-readable entities.  Because OSM is a community project, there are no strict rules on how physical features are tagged.  However, the tables in the wiki provide conventions: 
http://wiki.openstreetmap.org/wiki/Map_Features

The tables were scraped to obtain a list of valid map features.  This can be used to make sense of what is on this map.  

In [44]:
from scrape_wiki import *

In [52]:
reload(scrape_wiki)

<module 'scrape_wiki' from 'scrape_wiki.py'>

In [53]:
featuresinHawaii = process_attrib(OSMFILE, feature_tally, 'v')

In [54]:
len(featuresinHawaii)

775

The above dictionary (featuresinHawaii) provides a tally of all the features on this map.  
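
The internals of `process_attrib` and `feature_tally` are not shown. As a rough illustration of the idea, the sketch below counts `(key, value)` pairs for `tag` elements whose key appears in a set of known feature keys. The set used here is a small hand-picked assumption standing in for the list scraped from the wiki, and the function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: a hand-picked subset of feature keys; the list scraped
# from the Map Features wiki page would take the place of this set.
FEATURE_KEYS = {"highway", "amenity", "building", "natural", "shop"}


def tally_features(source, feature_keys=FEATURE_KEYS):
    """Count (key, value) pairs of 'tag' elements whose key is a known feature."""
    tally = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tag" and elem.get("k") in feature_keys:
            tally[(elem.get("k"), elem.get("v"))] += 1
    return tally


# Made-up fragment, for illustration only
SAMPLE = b"""<osm>
  <node id="1"><tag k="amenity" v="cafe"/><tag k="name" v="Example"/></node>
  <way id="10"><tag k="highway" v="residential"/></way>
  <way id="11"><tag k="highway" v="residential"/></way>
</osm>"""

feature_counts = tally_features(io.BytesIO(SAMPLE))
```

Keys outside the known set (such as `name` in the sample) are skipped, which is what separates map features from free-text annotations.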

### Data accuracy and uniformity 

Besides the accuracy of the features described in the OSM elements, proper nomenclature of map objects (streets, buildings, etc.) in the Hawaiian language is another dimension to consider.  The Hawaiian language did not exist in written form (except in petroglyph symbols) until the 1820s; in its current use, written Hawaiian uses 12 letters of the English alphabet, plus a glottal stop (the 'okina).  Vowels can also carry macrons (which look like a hyphen above the letter) that affect their pronunciation.  In practice, many pidgin/creole and colloquial terms exist, as the language has fluidly absorbed foreign words.  For these reasons, the Hawaiian language is difficult to formalize.  Nevertheless, the landmarks presumably follow formal nomenclature that gives proper respect to the culture of Hawai'i.  The text data can be assessed for accuracy by comparing with an outside source (an official lexicon), while 'allophones' can be consolidated to allow uniformity in names.  
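
One way to consolidate spelling variants is to compare names through a folded key that ignores the 'okina and macrons. The sketch below is an assumption about how that could be done with the standard `unicodedata` module; it is not the notebook's method, and the function names are hypothetical.

```python
import unicodedata

OKINA = u"\u02bb"  # MODIFIER LETTER TURNED COMMA, the character used for the 'okina
# Characters often typed in place of the 'okina (assumed list, not exhaustive)
OKINA_LOOKALIKES = (u"'", u"`", u"\u2018", u"\u2019")


def normalize_okina(name):
    """Replace common stand-ins with the 'okina character.

    Caveat: this also rewrites apostrophes in English possessives.
    """
    for ch in OKINA_LOOKALIKES:
        name = name.replace(ch, OKINA)
    return name


def strip_diacritics(name):
    """Remove combining marks such as the macron over a vowel."""
    decomposed = unicodedata.normalize("NFD", name)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


def comparison_key(name):
    """Fold a name to lowercase letters without 'okina or macrons, for matching only."""
    folded = strip_diacritics(normalize_okina(name)).replace(OKINA, u"")
    return folded.lower()


variants = [u"Hawai\u02bbi", u"Hawai'i", u"Hawaii"]
keys = {comparison_key(v) for v in variants}
```

The key is meant only for grouping variants. The stored name should keep the 'okina and macrons, since they distinguish words in Hawaiian.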

Textual input of street names can give rise to many variants.  The '`addr`' category in map features has specific sub-fields for respective components of a postal address (street name, number, postal code, etc.).  User omission, typos, miscategorization, abbreviations, etc. are common sources of variability.  

The end word of street names was audited to account for variations.  A typical ending is expected ('street', 'avenue', etc.) for street names.  In Hawai'i, Hawaiian street names typically include the expected ending, but exceptions may exist.  
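
A minimal sketch of such an audit follows. It groups street names by their last word and reports the endings that are not in an expected set. The expected endings and the sample names are assumptions chosen for illustration, and `audit_street_endings` is a hypothetical function rather than the contents of `audit_streetnames`.

```python
from collections import defaultdict

# Assumption: an illustrative list of expected endings, not the project's list
EXPECTED_ENDINGS = {"Street", "Avenue", "Boulevard", "Drive", "Road",
                    "Highway", "Place", "Lane", "Way", "Loop"}


def audit_street_endings(street_names, expected=EXPECTED_ENDINGS):
    """Group street names by their last word when it is not an expected ending."""
    unexpected = defaultdict(set)
    for name in street_names:
        words = name.split()
        if words and words[-1] not in expected:
            unexpected[words[-1]].add(name)
    return dict(unexpected)


# Illustrative inputs, not values read from the dataset
sample_names = ["Ali'i Drive", "Kalakaua Ave", "Kamehameha Highway",
                "Kinoole St.", "Banyan Drive"]
unexpected = audit_street_endings(sample_names)
```

Grouping by ending makes abbreviations ("Ave", "St.") easy to spot and map to a full form, and it also surfaces names that legitimately lack a conventional ending.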

In [60]:
import audit_streetnames

NameError: name 'OSMFILE' is not defined

In [57]:
!ls

Makefile                  _templates                osm_project.py
Project Notes-Copy1.ipynb audit_streetnames         sample.osm
Project Notes.ipynb       conf.py                   schema
README.md                 docs                      schema.py
SofHexK.osm               index.rst                 schema.pyc
Untitled.ipynb            make.bat                  scrape_wiki.py
_build                    map_project.py            scrape_wiki.pyc
_static                   map_project.pyc           setup.py
