In [79]:
OSMFILE = "SofHexK.osm"
DBFILE = "SofHawaii.db"

# Wrangling OpenStreetMap Data 


## Preliminaries 


#### Topic

The chosen subject is the State of Hawai'i. The dataset for the state (excluding Kaua'i) was 158 MB. For efficient auditing and debugging, a smaller sample was used (the Island of Hawai'i dataset, 67.9 MB).

## Data acquisition 

#### Data Source
The assignment provided data sources that are part of the OpenStreetMap project.  Custom extracts of raw OSM (XML) data were obtained from [mapzen](https://mapzen.com).  

### General description of data: 

"[OpenStreetMap](https://www.openstreetmap.org/about) is built by a community of mappers that contribute and maintain data about roads, trails, `cafés`, railway stations, and much more, all over the world."

Because OpenStreetMap is an open, community-edited project, human error is likely the main source of error, namely inconsistencies in data entry.  The [OSM XML wiki](http://wiki.openstreetmap.org/wiki/OSM_XML) provides documentation for the data.  

The OSM XML provides the framework for `elements` that represent physical features on the map.  `Elements` consist of `nodes`, `ways`, and `relations` (http://wiki.openstreetmap.org/wiki/Elements).  Each element appears as a 'block': an XML element that encloses tags with key / value attributes.  The documentation also describes the ['certainties and uncertainties'](http://wiki.openstreetmap.org/wiki/OSM_XML#Certainties_and_Uncertainties) of a given dataset.  


## Data wrangling  

### Data audit 

Data quality is assessed to verify assumptions about the type, shape and value of the data.  Errors and outliers are identified, and missing values are accounted for.  Measures of validity, accuracy, completeness, consistency and uniformity describe the quality of data.  

First, the osm elements in the data file were inspected: 

In [80]:
import map_project
tags = map_project.count_tags(OSMFILE)

In [81]:
print tags

{'node': 765029, 'nd': 884680, 'bounds': 1, 'member': 4828, 'tag': 239102, 'osm': 1, 'way': 61483, 'relation': 982}


**'node', 'nd', 'bounds', 'member', 'tag', 'osm', 'way', 'relation'** constitute the xml elements in the OSM dataset.  

Of these, 'way', 'node' and 'relation' are osm 'elements', the basic components of OpenStreetMap's data model.  

'osm' is the xml element that encloses the entire osm data structure in the .osm file. 

'bounds' contains attributes that define the boundary coordinates of the map.

'member' is an xml element under the osm element 'relation' (namely the 'multipolygon' relation) that is used to describe how the 'way's in the 'relation' are related.  

'tag's exist as children of the elements, fleshing out the details of each element.  

'nd's are child elements of 'way's that reference the 'node's that make up the 'way's.  
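
The element counts above can be reproduced with a streaming parse. The following is a minimal sketch, assuming `map_project.count_tags` works roughly like this; the real implementation is not shown in this notebook, and the `SAMPLE` document is a hand-made stand-in for the .osm file. It is written for Python 3.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(osm_source):
    """Tally every XML element name in an OSM file while streaming through it."""
    counts = Counter()
    # An "end" event fires once an element (and all its children) is complete.
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop children and attributes to keep memory use low
    return dict(counts)


# Tiny hand-made document standing in for the real .osm file.
SAMPLE = b"""<osm>
  <bounds minlat="18.6" minlon="-158.4" maxlat="21.8" maxlon="-154.6"/>
  <node id="1" lat="19.7" lon="-155.1"><tag k="name" v="Hilo"/></node>
  <node id="2" lat="19.6" lon="-155.9"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

sample_counts = count_tags(io.BytesIO(SAMPLE))
print(sample_counts)
```

Streaming with `iterparse` avoids loading a file of this size into memory as a single tree.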

### Data Validity

As shown, there are 765029 'node', 61483 'way' and 982 'relation' element instances in the dataset.  There are no extraneous elements in the dataset that are unaccounted for.  

A validator will be used in the subsequent sql database intake. 

#### Verification / validation scheme of the tags in the osm dataset
The following aspects of the elements were interrogated: 

* osm: verify that there is only one.  
* bounds: verify coordinates of bounds.  
* member: are all 'members' 'ways'?  
* nds: are all 'nd's 'nodes' in the dataset?  
* tags: what features are represented?  

**Bounds:** As seen above, there is only one instance each of 'osm' and 'bounds'.  The coordinates in 'bounds' are the following: 

In [82]:
bounds = map_project.get_attrib('bounds')
print bounds

{'minlat': '18.6982854', 'maxlon': '-154.6325683', 'minlon': '-158.4338378', 'maxlat': '21.8411047'}


The coordinates can be verified by using the `Nominatim` geolocator to perform a reverse lookup: 

In [83]:
>>> geolocator = map_project.Nominatim()
>>> minloc = geolocator.reverse("{},{}".format(bounds['minlat'], bounds['minlon']))
>>> maxloc = geolocator.reverse("{},{}".format(bounds['maxlat'], bounds['maxlon']))
>>> print maxloc

96816


96816 is a ZIP code in Honolulu. The coordinates are also easily verified in Google Maps.  

**Relations:** Check membership of 'relation' elements: 
Obtain all member ids in 'relation' and see if all of them match with ids in 'way'.  

In [84]:
>>> relation_member_refs = map_project.get_allof_childattrib('relation', 'member', 'ref')
>>> way_ids = map_project.get_allof_attrib('way','id')

>>> map_project.a_in_b(relation_member_refs, way_ids)

not_in: 317, is_in: 3108


317 of the member ids are not 'way' ids.  Members of a 'relation' can also be nodes.  
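
The membership check used in these cells comes down to set arithmetic. A minimal sketch, assuming `a_in_b` compares two collections of ids as sets (the actual helper in `map_project` is not shown; the ids below are made up for illustration):

```python
def a_in_b(a, b):
    """Report how many members of `a` are missing from, or present in, `b`."""
    a, b = set(a), set(b)
    not_in = a - b  # ids in a with no match in b
    is_in = a & b   # ids in a that also appear in b
    print("not_in: {}, is_in: {}".format(len(not_in), len(is_in)))
    return not_in, is_in


# Made-up ids: two members are ways, one is a node, one is unknown.
member_refs = {"10", "11", "1", "99"}
way_ids = {"10", "11", "12"}
node_ids = {"1", "2"}

missing_from_ways, _ = a_in_b(member_refs, way_ids)
missing_from_both, _ = a_in_b(member_refs, way_ids | node_ids)
```

Returning the unmatched set, not just its size, makes it easy to inspect which ids still need explaining.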

In [85]:
>>> node_ids = map_project.get_allof_attrib('node', 'id')

>>> map_project.a_in_b(relation_member_refs, way_ids.union(node_ids))

not_in: 11, is_in: 3414


All but 11 members of 'relation' are in the union of 'node' and 'way'.  What else can they be?  

In [86]:
>>> relation_ids = map_project.get_allof_attrib('relation', 'id')
>>> map_project.a_in_b(relation_member_refs, way_ids.union(node_ids).union(relation_ids))

not_in: 0, is_in: 3425


Apparently, 11 of the members of 'relation' are themselves 'relation's.  
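
A more direct cross-check is possible because each `member` in OSM XML carries a `type` attribute (`node`, `way` or `relation`). The sketch below tallies that attribute. The function name `count_member_types` and the sample document are hypothetical, and the tally counts member occurrences, not unique ids, so the numbers need not match the set-based counts above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_member_types(osm_source):
    """Tally the `type` attribute of every <member> element."""
    counts = Counter()
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag == "member":
            counts[elem.get("type")] += 1
        elem.clear()  # the attribute was read above, so clearing is safe
    return dict(counts)


# Hand-made sample: one relation nests another relation as a member.
SAMPLE = b"""<osm>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
    <member type="way" ref="11" role="inner"/>
    <member type="node" ref="1" role=""/>
  </relation>
  <relation id="101">
    <member type="relation" ref="100" role=""/>
  </relation>
</osm>"""

member_types = count_member_types(io.BytesIO(SAMPLE))
print(member_types)
```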

**'nd's:**  Verifying that all 'nd' ids in the 'way' elements are also instances of 'node's: 

In [87]:
>>> way_nd_refs = map_project.get_allof_childattrib('way', 'nd', 'ref')
>>> map_project.a_in_b(way_nd_refs, node_ids) 

not_in: 0, is_in: 759598


#### Map features 

The tags in an .osm file are not easily translated into human-readable entities.  Because OSM is a community project, there are no strict rules on how physical features are tagged.  However, the tables in the wiki provide conventions: 
http://wiki.openstreetmap.org/wiki/Map_Features

The tables were scraped to obtain a list of valid map features.  This can be used to make sense of what is on this map.  

In [88]:
import map_features

In [89]:
mfeatures = map_features.scrape_wiki(
    'http://wiki.openstreetmap.org/wiki/Map_Features')

number of tables: 32


In [90]:
featuresinHawaii = map_features.get_tally(OSMFILE, mfeatures,'v')

The above dictionary (featuresinHawaii) provides a tally of all the features on this map.  There are 250 distinct features represented in the main dataset.  

In [91]:
len(featuresinHawaii.items())

250
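
The tally step can be sketched as follows. This assumes the scraped feature list can be reduced to a set of known tag values and that the third argument names the `tag` attribute to inspect. Neither is confirmed by the notebook, so `tally_features`, `KNOWN_VALUES` and the sample document are all hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_features(osm_source, known_values, attrib="v"):
    """Count <tag> elements whose `attrib` value appears in `known_values`."""
    tally = Counter()
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag == "tag":
            value = elem.get(attrib)
            if value in known_values:
                tally[value] += 1
        elem.clear()
    return dict(tally)


# Stand-in for the values scraped from the Map Features wiki tables.
KNOWN_VALUES = {"cafe", "residential", "school"}

SAMPLE = b"""<osm>
  <node id="1" lat="19.7" lon="-155.1">
    <tag k="amenity" v="cafe"/><tag k="name" v="Hilo"/>
  </node>
  <way id="10"><nd ref="1"/><tag k="highway" v="residential"/></way>
  <way id="11"><nd ref="1"/><tag k="highway" v="residential"/></way>
</osm>"""

features = tally_features(io.BytesIO(SAMPLE), KNOWN_VALUES)
print(len(features), features)
```

Free-text values such as names fall outside the known set and are ignored, which is what keeps the tally limited to documented map features.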

### Data accuracy and uniformity 

Besides the accuracy of the features described in the OSM elements, proper nomenclature of map objects (streets, buildings, etc.) in the Hawaiian language is another dimension to consider.  The Hawaiian language did not exist in written form (except in petroglyph symbols) until the 1820s; in its current use, written Hawaiian uses 12 letters of the English alphabet, plus a glottal stop (the 'okina).  Vowels can also carry macrons (which look like a hyphen on top) that affect their pronunciation.  In practice, many pidgin/creole and colloquial terms exist, as the language has fluidly absorbed foreign words.  For these reasons, the Hawaiian language is difficult to formalize.  Nevertheless, landmarks presumably follow a formal nomenclature that gives proper respect to the culture of Hawai'i.  The text data can be assessed for accuracy by comparing it with an outside source (an official lexicon), while 'allophones' can be consolidated to make names uniform.  

Textual input of street names can give rise to many variants.  The '`addr`' category in map features has specific sub-fields for respective components of a postal address (street name, number, postal code, etc.).  User omission, typos, miscategorization, abbreviations, etc. are common sources of variability.  

The end word of street names was audited to account for variations.  A typical ending is expected ('street', 'avenue', etc.) for street names.  In Hawai'i, Hawaiian street names typically include the expected ending, but exceptions may exist.  

### Data consistency 

The ending word (street, lane, blvd, etc.) has many permutations that need to be unified. A script was created to audit the street names and produce uniform nomenclature by mapping each permutation to a standard list. 

In [92]:
import audit_streetnames

In [106]:
execfile("audit_streetnames.py")

street count is: 192
HINA AVE => HINA Avenue
Paradise Ala Kai => Paradise Ala Kai
Kamehameha Hwy => Kamehameha Highway
Wainee St => Wainee Street
Lusitania St => Lusitania Street
corrected name count is: 4
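
The correction step behind this output can be sketched as a regular expression that isolates the last word of a street name plus a lookup table of standard endings. This is a minimal sketch, not the contents of `audit_streetnames.py`; the `MAPPING` entries and the case-insensitive lookup are assumptions chosen to reproduce the corrections printed above.

```python
import re

# Matches the last whitespace-delimited token, including a trailing period.
STREET_TYPE_RE = re.compile(r"\b\S+\.?$")

# Assumed mapping from lower-cased abbreviations to standard endings.
MAPPING = {
    "st": "Street",
    "ave": "Avenue",
    "hwy": "Highway",
    "blvd": "Boulevard",
    "rd": "Road",
    "ln": "Lane",
}


def update_name(name, mapping=MAPPING):
    """Replace an abbreviated street ending with its standard form."""
    match = STREET_TYPE_RE.search(name)
    if not match:
        return name
    key = match.group().rstrip(".").lower()  # "AVE" and "Ave." both become "ave"
    if key in mapping:
        return name[:match.start()] + mapping[key]
    return name  # unknown endings (e.g. Hawaiian words) are left untouched


for raw in ["HINA AVE", "Paradise Ala Kai", "Kamehameha Hwy", "Wainee St"]:
    print("{} => {}".format(raw, update_name(raw)))
```

Leaving unrecognised endings unchanged matters here, since names such as "Paradise Ala Kai" legitimately lack an English street type.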


### Data completeness 

Data completeness for this dataset is difficult (but not impossible) to assess.  It can be assumed that the project is never complete.  Map features can be compared to other maps or statistics that are publicly available.  

## Data handling: .osm to .csv to sql

As an intermediate step to the creation of an sql database, 'node' and 'way' element data were extracted from the .osm file and organized into .csv datasets.  A validation process was included to ensure that the data fit the schema. The schema shapes the osm data structure into a normalized table template.  

In [94]:
import osmfile_to_csvfiles

In [95]:
# Process .osm file into .csvs
# Note: 'True' is passed as a string here; any non-empty string is truthy in Python
osmfile_to_csvfiles.process_map(OSMFILE, 'True')
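
The shaping and validation described above can be illustrated with a small sketch. The schema, the field list and the helper names below are assumptions; the notebook does not show the actual schema or validator, and a hand-rolled type check stands in for whichever one was used. The `type`/`key` split follows the `nodes_tags` columns that appear in the SQL queries later on.

```python
import xml.etree.ElementTree as ET

# Assumed schema: field name -> function that coerces the raw XML string.
NODE_SCHEMA = {"id": int, "lat": float, "lon": float,
               "user": str, "timestamp": str}


def shape_node(elem):
    """Split one <node> element into a flat node row and a list of tag rows."""
    node = {field: elem.get(field) for field in NODE_SCHEMA}
    tags = []
    for tag in elem.iter("tag"):
        key = tag.get("k")
        # "addr:street" becomes type "addr", key "street"; plain keys are "regular".
        if ":" in key:
            tag_type, key = key.split(":", 1)
        else:
            tag_type = "regular"
        tags.append({"id": node["id"], "key": key,
                     "value": tag.get("v"), "type": tag_type})
    return node, tags


def validate_node(node, schema=NODE_SCHEMA):
    """Coerce every field; raise ValueError if one is missing or malformed."""
    clean = {}
    for field, coerce in schema.items():
        if node.get(field) is None:
            raise ValueError("missing field: " + field)
        clean[field] = coerce(node[field])  # int()/float() raise ValueError on bad input
    return clean


elem = ET.fromstring(
    '<node id="1" lat="19.72" lon="-155.09" user="alice" '
    'timestamp="2016-08-15T04:01:55Z">'
    '<tag k="addr:street" v="Laimana Street"/><tag k="name" v="Hilo"/></node>'
)
node, tags = shape_node(elem)
clean_node = validate_node(node)
print(clean_node, tags)
```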

#### Problems encountered 
Python 2's `csv` library cannot handle unicode input directly.  Some user-entered text entries (e.g. `'\xe3\x83\x92\xe3\x83\xad'`, the UTF-8 bytes for 'ヒロ') were in unicode, which raised errors when passed into `csvreader`.  A wrapper for the reader module was available to convert unicode to UTF-8.   
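
The conversion such a wrapper performs can be sketched as encoding every text field to UTF-8 bytes before it reaches the csv machinery. The helper name `encode_row` and the sample row are hypothetical. Under Python 3 the `csv` module accepts text directly, so this step applies only to the Python 2 setup used here.

```python
def encode_row(row, encoding="utf-8"):
    """Return a copy of `row` with every text value encoded to bytes."""
    text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3
    return {
        key: value.encode(encoding) if isinstance(value, text_type) else value
        for key, value in row.items()
    }


# A made-up tag row whose value is the katakana string "ヒロ".
row = {"id": 1, "key": u"name:ja", "value": u"\u30d2\u30ed"}
encoded = encode_row(row)
print(repr(encoded["value"]))
```

Non-text values such as the integer id pass through unchanged, so the same helper can be applied to whole rows.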


### Database creation and table insertion

The data in .csv files are now inserted into an sql database.  Appropriate tables (one per osm element and respective tags) are created.  

In [96]:
import csvfiles_to_sql
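
The insertion step can be sketched with the standard `sqlite3` and `csv` modules. This is an illustration under assumptions, not the contents of `csvfiles_to_sql`: the column list is limited to the `nodes` fields that the later queries use, the CSV text is made up, and an in-memory database stands in for the real file. It is written for Python 3.

```python
import csv
import io
import sqlite3

# Made-up CSV content standing in for the generated nodes file.
NODES_CSV = u"""id,lat,lon,user,timestamp
1,19.72,-155.09,alice,2016-08-15T04:01:55Z
2,19.70,-155.08,bob,2016-08-16T07:30:31Z
"""


def load_nodes(conn, csv_file):
    """Create the nodes table if needed and bulk-insert rows from a CSV file."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS nodes (
               id INTEGER PRIMARY KEY,
               lat REAL, lon REAL, user TEXT, timestamp TEXT)"""
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["id"], r["lat"], r["lon"], r["user"], r["timestamp"])
            for r in reader]
    # Placeholders let sqlite3 handle quoting of user-entered text.
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_nodes(conn, io.StringIO(NODES_CSV))
row_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
print(inserted, row_count)
```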

## Basic EDA 
Study the top 'node' contributors to the dataset. 

In [97]:
import sql_eda

These are the top contributors of the map: 

In [98]:
query = '''
SELECT user, count(*) as num
FROM nodes 
GROUP BY user
ORDER BY num DESC
limit 5;
'''
output1 = sql_eda.do_sql(DBFILE, query)

In [99]:
# Find the top 20 osm element tags for user
query = ''' 
SELECT nodes_tags.key, nodes_tags.value, nodes_tags.type, nodes.id, COUNT(*) AS num
FROM nodes_tags, nodes
WHERE nodes.id = nodes_tags.id
AND user = 'Tom_Holland'
GROUP BY nodes_tags.key
ORDER BY num DESC 
LIMIT 20;
'''
output2 = sql_eda.do_sql(DBFILE, query)

In [100]:
# Retrieve all osm element tags for user
query = ''' 
SELECT nodes_tags.key, nodes_tags.value, nodes_tags.type, nodes.id, COUNT(*) AS num
FROM nodes_tags, nodes
WHERE nodes.id = nodes_tags.id
AND user = 'Tom_Holland'
GROUP BY nodes_tags.key
ORDER BY num DESC 
;
'''
output3 = sql_eda.do_sql(DBFILE, query) 

In [101]:
# Retrieve tag keys with node coordinates and timestamps for user
query = ''' 
SELECT nodes_tags.key, nodes_tags.value, nodes.lat, nodes.lon, nodes.timestamp, COUNT(*) AS num
FROM nodes_tags, nodes
WHERE nodes.id = nodes_tags.id
AND user = 'Tom_Holland'
GROUP BY nodes_tags.key
ORDER BY num DESC 
;
'''
output4 = sql_eda.do_sql(DBFILE, query) 
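
The queries above group by `nodes_tags.key` while also selecting non-aggregated columns such as `value`. SQLite permits this, but takes those columns from an arbitrary row within each group. When the pairing of key and value matters, grouping by both is one alternative. A minimal sketch on made-up rows (the table contents and user names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, user TEXT);
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
INSERT INTO nodes VALUES (1, 'u1'), (2, 'u1'), (3, 'u2');
INSERT INTO nodes_tags VALUES
    (1, 'amenity', 'cafe', 'regular'),
    (2, 'amenity', 'cafe', 'regular'),
    (2, 'name', 'Hilo Cafe', 'regular'),
    (3, 'amenity', 'school', 'regular');
""")

# Count each (key, value) pair for one user, so every selected column is
# either part of the GROUP BY or an aggregate.
query = """
SELECT nodes_tags.key, nodes_tags.value, COUNT(*) AS num
FROM nodes_tags JOIN nodes ON nodes.id = nodes_tags.id
WHERE nodes.user = ?
GROUP BY nodes_tags.key, nodes_tags.value
ORDER BY num DESC, nodes_tags.key;
"""
pairs = conn.execute(query, ("u1",)).fetchall()
print(pairs)
```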

In [102]:
# Rows of output4 are (key, value, lat, lon, timestamp, num)
for a1 in output4[:20]: 
    print unicode(geolocator.reverse("{},{}".format(a1[2], a1[3]))), a1[4]
    print "\n"
    

Alulike Trail, Waikii, Kohala, Hawaii, United States of America 2016-08-21T18:17:59Z


Lower Napo'opo'o Road, Honaunau-Napoopoo CDP, Kau, Hawaii, 96704, United States of America 2016-09-10T03:24:22Z


Keoneele Cove, Honaunau Beach Road, Honaunau-Napoopoo CDP, Kau, Hawaii, 96704, United States of America 2016-09-09T18:12:20Z


98, Laimana Street, Pu‘u‘eo, Hilo CDP, North Hilo, Hawaii, 96720, United States of America 2016-08-15T04:01:55Z


College Hall, 200, West Kawili Street, Waiākea, Hilo CDP, South Hilo, Hawaii, 96720, United States of America 2016-08-16T07:30:31Z


Keoneele Cove, Honaunau Beach Road, Honaunau-Napoopoo CDP, Kau, Hawaii, 96704, United States of America 2016-09-09T18:12:20Z


College Hall, 200, West Kawili Street, Waiākea, Hilo CDP, South Hilo, Hawaii, 96720, United States of America 2016-08-16T07:30:31Z


College Hall, 200, West Kawili Street, Waiākea, Hilo CDP, South Hilo, Hawaii, 96720, United States of America 2016-08-16T07:30:31Z


College Hall, 200, West Kawili S

The following references can provide more resources to aid in Hawaiian nomenclature:  

- USGS GNIS search: http://geonames.usgs.gov/apex/f?p=136:1:0::NO::P1_COUNTY%2CP1_COUNTY_ALONG:n%2C (server down)
- Hawaii State Highways: https://en.wikipedia.org/wiki/List_of_Hawaii_state_highways 
- Hawaiian landmarks: http://ulukau.org/elib/cgi-bin/library?e=d-0pepn-000Sec--11haw-50-20-frameset-book--1-010escapewin&a=d&d=D0.2&toc=0
- [S.939 - Hawaiian National Park Language Correction Act of 2000](https://www.congress.gov/106/bills/s939/BILLS-106s939es.pdf)
- [Documentation for ISO 639 identifier: haw](http://www-01.sil.org/iso639-3/documentation.asp?id=haw)
- [Pūnana Leo](https://en.wikipedia.org/wiki/P%C5%ABnana_Leo)
- [Hawaiian Language wikipedia](https://en.wikipedia.org/wiki/Hawaiian_language)

Hawaii OSM project wiki: 
- http://wiki.openstreetmap.org/wiki/Hawaii