## Map Area

I chose to look at OpenStreetMap (OSM) data from the city of Guadalajara in the state of Jalisco, Mexico.  I spent some time living in Guadalajara and was curious to know about contributions made to the city's OSM database as well as how differences in language were handled.  The map area can be found on the OSM website here:

http://www.openstreetmap.org/export#map=10/20.6174/-103.3271

To get the data, I created a Mapzen custom extract.  It turns out these can be finicky after some time, so I've also included an alternative download via the Overpass API (the geo bounding box is slightly different):

- https://s3.amazonaws.com/mapzen.odes/ex_tvaWMoe8QMguj4B6VLUnzwrfCbNc4.osm.bz2 (compressed, quicker)

- http://overpass-api.de/api/map?bbox=-104.3536,20.0753,-102.3006,21.1575 (slower, more reliable)

After getting the data (see `get_data.py`), I performed a quick audit of the different [elements](https://wiki.openstreetmap.org/wiki/Elements) (parent and child) in the OSM file.

In [11]:
from audit_fields import audit_elements
audit_elements("gdl.osm")

[('nd', 449143),
 ('node', 352351),
 ('tag', 183288),
 ('way', 54197),
 ('member', 1717),
 ('relation', 531),
 ('bounds', 1),
 ('osm', 1)]

I also audited the tag keys (i.e. the `k` attribute of each XML `tag` element).  The 20 most frequent tag keys are shown below.

In [12]:
from audit_fields import audit_tag_keys
data = audit_tag_keys("gdl.osm")
data[0:20] # top 20

[('highway', 149433),
 ('name', 91206),
 ('amenity', 27141),
 ('source', 24597),
 ('oneway', 23622),
 ('addr:street', 21912),
 ('operator', 21444),
 ('addr:postcode', 21063),
 ('source:date', 20289),
 ('SEP:CLAVEESC', 20085),
 ('is_in:suburb', 14061),
 ('description', 11682),
 ('power', 9084),
 ('desciption', 8430),
 ('building', 7617),
 ('surface', 5481),
 ('access', 5355),
 ('width:street', 5268),
 ('leisure', 3657),
 ('maxspeed', 3366)]
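
For readers without access to `audit_fields.py`, here is a minimal sketch of how counts like the ones above could be produced. It is not the actual implementation of `audit_elements` or `audit_tag_keys`; the function name `count_elements_and_keys` and the tiny inline sample are hypothetical and only illustrate the idea of streaming through the XML and tallying element names and `k` attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO


def count_elements_and_keys(source):
    """Tally element names and tag keys from an OSM XML file or file object."""
    elements = Counter()
    keys = Counter()
    # iterparse streams the file, so the whole document is not loaded at once
    for _, elem in ET.iterparse(source, events=('end',)):
        elements[elem.tag] += 1
        if elem.tag == 'tag':
            keys[elem.get('k')] += 1
        elem.clear()  # drop attributes and children we no longer need
    return elements.most_common(), keys.most_common()


# Made-up sample data, only to show the shape of the result
sample = BytesIO(
    b'<osm>'
    b'<node id="1"><tag k="amenity" v="school"/><tag k="name" v="x"/></node>'
    b'<node id="2"><tag k="name" v="y"/></node>'
    b'</osm>'
)
element_counts, key_counts = count_elements_and_keys(sample)
```

Both return values are lists of `(name, count)` pairs sorted by descending count, which matches the shape of the output shown above.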

---

## Problems in the Map

The first problem that I noticed, which can be seen above, was the misspelling of the `description` key as `desciption`.  I fixed the spelling for these occurrences.  To identify other problems in the map, I audited several of the above keys (see `audit_fields.py`).  As I suspected, the difference in language, along with the accents used in Spanish, made things messy.  There were, however, some keys whose values were free of errors, such as country and state.  For simplicity, I chose to focus on the following keys:

- amenity (`amenity`)
- street names (`addr:street`)
- postcode (`addr:postcode`)
- city (`addr:city`)
- cuisine (`cuisine`)

#### Amenity

I didn't know what the `SEP:CLAVEESC` key was (it appears in the top 20 above), so I did some research.  It turns out to be a unique school identifier from a national database.  For elements with this key, I checked whether any lacked a school-related value for the `amenity` key (see `audit_fields.py`).  About a third of them (6,695 against 20,085 occurrences of the key) did not have one.

In [1]:
from audit_fields import cross_ref_sep
print cross_ref_sep("gdl.osm")

6695


To fix this, I added an `amenity` tag with the value `school` to these elements.
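
The fix can be sketched roughly as follows. This is an illustration, not the code in `update_fields.py`: the helper name `ensure_school_amenity` and the set `SCHOOL_AMENITIES` are hypothetical, and overwriting a non-school `amenity` value (rather than leaving it alone) is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed set of "school-related" amenity values; adjust as needed
SCHOOL_AMENITIES = {'school', 'kindergarten', 'college', 'university'}


def ensure_school_amenity(elem):
    """Give elements carrying SEP:CLAVEESC a school amenity tag.

    Returns True if the element was changed, False otherwise.
    """
    tags = {t.get('k'): t for t in elem.findall('tag')}
    if 'SEP:CLAVEESC' not in tags:
        return False
    amenity = tags.get('amenity')
    if amenity is None:
        # No amenity tag at all: add one
        ET.SubElement(elem, 'tag', {'k': 'amenity', 'v': 'school'})
        return True
    if amenity.get('v') in SCHOOL_AMENITIES:
        return False
    # Assumption: a non-school amenity on a school record is overwritten
    amenity.set('v', 'school')
    return True
```

The function is meant to be called on each `node`, `way`, or `relation` element while streaming through the file.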

#### Street Names

The street name fields were messy.  It is unclear whether some names inadvertently left out the street type and in fact refer to another street already in the data, such as 'Av. Vallarta' and 'Vallarta'.  It is entirely possible that some of the names without street types are different streets from their apparent counterparts, so I chose not to make any assumptions.  I did, however, expand street abbreviations to the full word, such as 'av' to 'Avenida', 'esq' to 'Esquina', and 'prol' to 'Prolongación'.  Below we can see all the street names beginning with an abbreviation for 'Avenida'.

In [2]:
import re
from audit_fields import audit_tag_values
data = audit_tag_values("gdl.osm", "addr:street")
av = [pair for pair in data if re.search(r'^av(\s|\.)', pair[0], re.IGNORECASE)]; av

[('Av. del Bosque', 21),
 ('Av Constitucion', 9),
 ('Av. Vallarta', 6),
 ('Av. Del Bosque', 6),
 ('Av. Netzahualcoyotl', 3),
 ('Av. Azahares y Violetas', 3),
 ('Av.Clvn.Division del Nte 415,Jardines Alcalde', 3),
 ('Av. Central Guillermo Gonzalez Camarena', 3),
 (u'Av.Adolfo L\xf3pez Mateos sur', 3),
 ('Av. Guadalupe', 3),
 ('Av. Chapultepec', 3),
 (u'Av. R\xedo Nilo', 3),
 (u'Av. Federal\xedstas', 3),
 (u'Av.Circunvalaci\xf3n', 3),
 (u'Av. M\xe9xico', 3),
 (u'Av. General Ram\xf3n Corona', 3),
 ('Av. Acueducto', 3),
 ('Av. Paseo de los Emperadores', 3),
 ('Av.Tepeyac', 3),
 ('Av. Vallarta 4327', 3),
 ('Av. Francisco Javier Mina', 3)]
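
A minimal sketch of the abbreviation expansion described above is shown here. The names `ABBREVIATIONS`, `ABBREV_RE`, and `expand_street` are hypothetical and not taken from `update_fields.py`, and the sketch only handles an abbreviation at the start of the name, which is an assumption.

```python
import re

# Abbreviation -> full word, as listed in the text above
ABBREVIATIONS = {
    'av': 'Avenida',
    'esq': 'Esquina',
    'prol': u'Prolongaci\xf3n',
}

# Leading abbreviation, optional period, optional whitespace.
# The \b keeps full words such as 'Avenida' from matching 'av'.
ABBREV_RE = re.compile(r'^(av|esq|prol)\b\.?\s*', re.IGNORECASE)


def expand_street(name):
    """Expand a leading street-type abbreviation to the full word."""
    match = ABBREV_RE.match(name)
    if not match:
        return name
    full = ABBREVIATIONS[match.group(1).lower()]
    return (full + ' ' + name[match.end():]).strip()
```

With this sketch, `expand_street('Av.Tepeyac')` gives `'Avenida Tepeyac'`, while a name that already starts with 'Avenida' is returned unchanged. Entries that embed house numbers or neighborhoods, such as 'Av. Vallarta 4327', would still need separate handling.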

#### Postal Codes

Several postal codes in the data had trailing zeros or whitespace, and others were not valid postal codes for the state of Jalisco, Mexico.  I confirmed valid postal codes via Wikipedia: valid codes are 5 digits long and begin with 44 through 48.  Any postal codes not beginning with these numbers were left out.  For codes with trailing zeros or whitespace, I assumed the extra characters were mistakes and removed them.  The results below show the postcodes that do not match the valid pattern (5 digits beginning with 44 through 48).

In [4]:
data = audit_tag_values("gdl.osm", "addr:postcode")
invalid = [pair for pair in data if not re.search(r'^4[4-8]\d{3}$', pair[0])]; invalid

[('58350', 3),
 ('1300', 3),
 ('445100', 3),
 ('454009', 3),
 ('7002079', 3),
 ('30983', 3),
 ('454030', 3),
 ('38901', 3)]
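
The cleaning rule can be sketched as below. This is one possible reading of the rule, not the code in `update_fields.py`: the name `clean_postcode` is hypothetical, and stripping trailing zeros only while the code is longer than 5 characters is an assumption (otherwise valid codes ending in zero would be damaged).

```python
import re

# Valid Jalisco postcodes: 5 digits starting with 44 through 48
VALID_POSTCODE = re.compile(r'^4[4-8]\d{3}$')


def clean_postcode(raw):
    """Return a cleaned 5-digit postcode, or None if it cannot be salvaged."""
    code = raw.strip()
    # Assumption: extra trailing zeros are only removed from over-long codes
    while len(code) > 5 and code.endswith('0'):
        code = code[:-1]
    return code if VALID_POSTCODE.match(code) else None
```

Under these assumptions, '445100' becomes '44510', while '58350' and '454009' are rejected and return `None`.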

#### City Names

Some city names were completely out of place, such as 'Morelia', which is not in the state of Jalisco.  There were also inconsistencies in spelling, accents, capitalization, and the inclusion of the state name along with the city name, for example 'GUADALAJARA JALISCO', 'Guadalajara Jalisco', and 'Guadalajara, Jalisco'.  These types of errors occurred in only a few of the main municipalities in the Guadalajara area and were corrected with a simple search-and-replace function.  To make sure that other city names were legitimate, I scraped a Wikipedia page listing city names in the state of Jalisco and compared them with those encountered in the OpenStreetMap file (see `scrape_cities.py`).  If a city name was not in this list, it was omitted.  Below is an audit (value and number of occurrences) of the city names containing the word 'Guadalajara'.

In [9]:
import re
from audit_fields import audit_tag_values
data = audit_tag_values("gdl.osm", "addr:city")
guad = [pair for pair in data if re.search(r'guad', pair[0], re.IGNORECASE)]
guad

[('Guadalajara', 222),
 ('Guadalajara , Jal.', 6),
 ('Guadalajara, Jalisco', 6),
 ('guadalajara', 3),
 ('GUADALAJARA', 3),
 ('Guadalajara ,Jal.', 3),
 ('Terranova, Guadalajara, Jalisco', 3),
 ('Guadalajara Jalisco', 3),
 ('GUADALAJARA JALISCO', 3)]
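
One way to sketch the city clean-up is shown below. It is not the actual search-and-replace function: `STATE_SUFFIX` and `clean_city` are hypothetical names, and `valid_cities` stands in for the list scraped by `scrape_cities.py`. The sketch does not try to repair missing accents.

```python
import re

# A trailing state name, separated from the city by a comma or whitespace
STATE_SUFFIX = re.compile(r'(?:\s*,\s*|\s+)(?:jalisco|jal\.?)\s*$', re.IGNORECASE)


def clean_city(raw, valid_cities):
    """Strip the state suffix, then match case-insensitively against valid names.

    Returns the canonical spelling from valid_cities, or None if no match.
    """
    canonical = {city.lower(): city for city in valid_cities}
    name = STATE_SUFFIX.sub('', raw.strip())
    return canonical.get(name.lower())
```

For example, with `valid_cities = ['Guadalajara', 'Zapopan']`, both 'GUADALAJARA JALISCO' and 'Guadalajara , Jal.' map to 'Guadalajara', while 'Morelia' and 'Terranova, Guadalajara, Jalisco' return `None` and would be omitted.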

#### Cuisine Types

The cuisine attribute contained several different types of inconsistencies.  Naming variants such as 'Hot_Dogs', 'hot_dogs', or 'Hot_Dogs_Gourmet' were all updated to a simple derivative such as 'hot_dogs'.  There were also problems with language, as the data contained both 'seafood' and 'mariscos'.  Because most values were in English (values are in English even in the Spanish translation of the OSM wiki), I changed any redundant Spanish value to its English counterpart.  Some values had more than one cuisine type, such as 'sushi,_burgers'.  These were split into a list for the 'cuisine' field.
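
The cuisine rules above could be sketched like this. The mapping `CUISINE_MAP` contains only the examples mentioned in the text and the function name `clean_cuisine` is hypothetical; the full mapping used in `update_fields.py` is not shown here. Splitting on semicolons as well as commas is an assumption.

```python
import re

# Only the examples from the text; a real mapping would be longer
CUISINE_MAP = {
    'mariscos': 'seafood',
    'hot_dogs_gourmet': 'hot_dogs',
}


def clean_cuisine(raw):
    """Normalize a raw cuisine value into a list of lowercase cuisine types."""
    cleaned = []
    for part in re.split(r'[,;]', raw.lower()):
        value = part.strip(' _')  # drop stray spaces and underscores
        if value:
            cleaned.append(CUISINE_MAP.get(value, value))
    return cleaned
```

Under these assumptions, 'sushi,_burgers' becomes `['sushi', 'burgers']` and 'Hot_Dogs_Gourmet' becomes `['hot_dogs']`.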

## Data Overview

After cleaning the problems found in the map (see `update_fields.py`), I converted the original OSM XML data to JSON (see `process_data.py`) and inserted the JSON into a MongoDB database (see `import_mongo.py`).  The results of querying the database follow.