## P3: Wrangle OpenStreetMap Data: Yucatán Peninsula
Thomas Hrabchak <br>
April 2016

### Introduction
In this project we apply data munging techniques to clean the OpenStreetMap (OSM) data for the Yucatán Peninsula in southeastern Mexico, assessing the data for validity, accuracy, completeness, consistency, and uniformity. We then import the cleaned data into a MongoDB instance so that it can be queried.

The region we use is defined by the following latitude/longitude bounding box. You can visualize this region using the [OSM Export Tool](http://www.openstreetmap.org/export#map=7/20.262/-89.363).

#### Yucatán Peninsula GPS Location Box
- S: 18.521
- W: -92.670
- N: 23.403
- E: -86.440


### Getting the Data
There are two methods for downloading the OSM data used in this project: manually submitting an OSM query through an online export tool, or running a Python script that downloads the area specific to this project.

#### Manual Online Export
The precise box of data used for this project is shown above. It corresponds to the following OSM query:

```
(node(18.521,-92.670, 23.403,-86.440);<;);out meta;
```
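As a small illustration of how the query relates to the bounding box, the sketch below builds the same query string from the four coordinates. The function name `build_overpass_query` is hypothetical and is not taken from the project's scripts; the coordinate order (south, west, north, east) is the order used in the query above.

```python
# Bounding box from the section above, in the order used by the query:
# (south, west, north, east).
YUCATAN_BBOX = (18.521, -92.670, 23.403, -86.440)


def build_overpass_query(south, west, north, east):
    """Return a query for all nodes in the box, plus (via the recurse-up
    operator `<`) the ways and relations that reference them."""
    bbox = "{:.3f},{:.3f},{:.3f},{:.3f}".format(south, west, north, east)
    return "(node({});<;);out meta;".format(bbox)


query = build_overpass_query(*YUCATAN_BBOX)
print(query)
```

The printed string differs from the query above only in whitespace.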

Copy and paste this query into the [Overpass API Query Form](http://overpass-api.de/query_form.html) and run `Query`. This downloads the OSM data as a file. Move the file to this project's folder and rename it to `data.xml`.

#### Python Script
I created a Python script that downloads the OSM data specific to this project and saves it in a file named `data.xml`. Run the following command in the project folder:

```
python ./download_yucatan_osm_data.py
```

To download a randomized smaller portion of the area, append `--small`. For example:

```
python ./download_yucatan_osm_data.py --small
```
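For readers curious how such a download script might be structured, here is a rough sketch. It is not the contents of `download_yucatan_osm_data.py`: the function names, the way `--small` picks a random sub-box, the 10% default fraction, and the use of the public Overpass interpreter endpoint are all assumptions made for illustration.

```python
import argparse
import random
import shutil
import urllib.parse
import urllib.request

FULL_BBOX = (18.521, -92.670, 23.403, -86.440)  # south, west, north, east
OVERPASS_URL = "http://overpass-api.de/api/interpreter"  # assumed endpoint


def random_sub_box(bbox, fraction=0.1, rng=random):
    """Pick a random box inside `bbox` whose sides are `fraction` as long."""
    south, west, north, east = bbox
    height = (north - south) * fraction
    width = (east - west) * fraction
    sub_south = rng.uniform(south, north - height)
    sub_west = rng.uniform(west, east - width)
    return (sub_south, sub_west, sub_south + height, sub_west + width)


def download(bbox, path="data.xml"):
    """POST the query for `bbox` to Overpass and stream the reply to `path`."""
    query = "(node({},{},{},{});<;);out meta;".format(*bbox)
    body = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(OVERPASS_URL, data=body) as response:
        with open(path, "wb") as out:
            shutil.copyfileobj(response, out)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Download Yucatán OSM data")
    parser.add_argument("--small", action="store_true",
                        help="download a random smaller portion of the area")
    return parser.parse_args(argv)
```

A caller would then choose the box with something like `bbox = random_sub_box(FULL_BBOX) if parse_args().small else FULL_BBOX` and pass it to `download(bbox)`.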

### Transforming the Data
The OSM data is downloaded as an XML file, which must be converted to JSON before it can be loaded into a MongoDB instance. Additionally, we only want to keep the XML data that is relevant.

The following transformation rules (from Data Wrangling lesson 6) will be used:
- Process only 2 types of top level tags: "node" and "way"
- All attributes of "node" and "way" should be turned into regular key/value pairs, except:
    - Attributes in the CREATED array should be added under a key "created"
    - Attributes for latitude and longitude should be added to a "pos" array,
      for use in geospatial indexing. Make sure the values inside the "pos" array are floats
      and not strings.
- If the second level tag "k" value contains problematic characters, it should be ignored
- If the second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
- If the second level tag "k" value does not start with "addr:", but contains ":", you can
  process it in a way that you feel is best. For example, you might split it into a two-level
  dictionary like with "addr:", or otherwise convert the ":" to create a valid key.
- If there is a second ":" that separates the type/direction of a street,
  the tag should be ignored
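The rules above can be sketched as a single function that turns one XML element into a dictionary. This is an illustrative sketch rather than the code in `transform_data.py`: the contents of `CREATED`, the `PROBLEM_CHARS` pattern, the added `"type"` key, and the choice to replace `:` with `_` in non-address keys are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Assumed contents of the CREATED array.
CREATED = ["version", "changeset", "timestamp", "user", "uid"]
# Assumed definition of "problematic characters" in a tag key.
PROBLEM_CHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def shape_element(element):
    """Convert a <node> or <way> element to a dict; return None otherwise."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag}  # assumed extra key recording the element kind
    for key, value in element.attrib.items():
        if key in CREATED:
            doc.setdefault("created", {})[key] = value
        elif key not in ("lat", "lon"):
            doc[key] = value
    if "lat" in element.attrib and "lon" in element.attrib:
        # Stored as floats, in the order the rule lists them (lat, lon).
        doc["pos"] = [float(element.attrib["lat"]),
                      float(element.attrib["lon"])]
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if PROBLEM_CHARS.search(key):
            continue  # ignore keys with problematic characters
        parts = key.split(":")
        if parts[0] == "addr":
            if len(parts) == 2:
                doc.setdefault("address", {})[parts[1]] = value
            # keys with a second ":" (e.g. addr:street:type) are ignored
        elif len(parts) > 1:
            doc["_".join(parts)] = value  # one possible way to handle ":"
        else:
            doc[key] = value
    return doc
```

For example, a node tagged `addr:street=Calle 60` would produce a document containing `"address": {"street": "Calle 60"}`.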

To perform the transformation, the `transform_data.py` script takes as input the `data.xml` file downloaded in the previous step and creates a `transformed_data.json` file.

To transform the data, run the following command:

```
python ./transform_data.py
```
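Because the full extract can be large, one reasonable approach is to stream the XML rather than load it all at once. The sketch below shows that idea with `xml.etree.ElementTree.iterparse`. It is an assumption that the output holds one JSON document per line; the actual layout of `transformed_data.json` is not specified here. The `shape` argument stands in for whatever function converts an element to a dictionary.

```python
import json
import xml.etree.ElementTree as ET


def transform(xml_path, json_path, shape):
    """Stream `xml_path` and write one JSON document per line to `json_path`.

    `shape` is any callable that maps an element to a dict (or None to skip).
    Returns the number of documents written.
    """
    written = 0
    with open(json_path, "w", encoding="utf-8") as out:
        for _, element in ET.iterparse(xml_path, events=("end",)):
            if element.tag in ("node", "way"):
                doc = shape(element)
                if doc is not None:
                    # ensure_ascii=False keeps names like "Mérida" readable
                    out.write(json.dumps(doc, ensure_ascii=False) + "\n")
                    written += 1
                # Drop the element's children and attributes once processed.
                element.clear()
    return written
```

A call such as `transform("data.xml", "transformed_data.json", shape)` would then process the file in a single pass.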

### Problems Encountered in the Map
After the data has been transformed into the JSON format, we will then clean the data. We clean the data in three ways: ensuring street names are uniformly formatted, ensuring phone numbers are uniformly formatted, and ensuring that the websites listed are still responsive.

#### Over-abbreviated street names
Similar to the lesson 6 case study, I found inconsistencies in how street names are abbreviated. I ran the `find_abbreviations.py` script to determine the different abbreviations that exist in the JSON file.

I found that the following street prefixes and abbreviations exist:
- Av
- Ave
- Avienda
- Calle
- Col.
- Av.
- ... (Calle ..)
    
To standardize these names, run the `clean_street_names.py` script:


```
python ./clean_street_names.py
```
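To make the cleaning step concrete, here is a minimal sketch of prefix-based standardization. The mapping is an assumption based on the abbreviations listed above (treating `Av`, `Av.`, `Ave`, and the misspelling `Avienda` as "Avenida", and `Col.` as "Colonia"); the real script may use different rules.

```python
import re

# Assumed mapping from the abbreviations found in the data to full names.
MAPPING = {
    "Av": "Avenida",
    "Av.": "Avenida",
    "Ave": "Avenida",
    "Avienda": "Avenida",
    "Col.": "Colonia",
}

# In Spanish street names the street type comes first ("Av. Tulum"),
# so only the leading word is inspected.
FIRST_WORD = re.compile(r"^\S+")


def clean_street_name(name):
    """Expand a known abbreviation at the start of a street name."""
    name = name.strip()
    match = FIRST_WORD.match(name)
    if match and match.group() in MAPPING:
        return MAPPING[match.group()] + name[match.end():]
    return name
```

Names that already start with a full word, such as "Calle 60", pass through unchanged.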

#### Phone Number Format
There are a variety of formats for phone numbers, so I decided to standardize them across all nodes. I created a script ... to show the different phone number formats used in the DB.


```
python ./clean_phone_numbers.py
```
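As an illustration of what standardizing could look like, the sketch below normalizes numbers to a single `+52 XXX XXX XXXX` layout. The target format, the assumption that numbers are Mexican 10-digit numbers with country code 52, and the decision to leave unrecognized values untouched are all illustrative choices, not a description of `clean_phone_numbers.py`.

```python
import re


def clean_phone_number(raw, country_code="52"):
    """Return `raw` in the form '+52 XXX XXX XXXX' when it can be recognized.

    Values that do not look like a 10-digit national number (optionally
    preceded by the country code) are returned stripped but otherwise as-is.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if digits.startswith("00"):
        digits = digits[2:]  # international prefix written as 00
    if len(digits) == 10:
        national = digits
    elif (len(digits) == 10 + len(country_code)
          and digits.startswith(country_code)):
        national = digits[len(country_code):]
    else:
        return raw.strip()
    return "+{} {} {} {}".format(
        country_code, national[:3], national[3:6], national[6:])
```

For instance, `clean_phone_number("(998) 884-1234")` and `clean_phone_number("+52 998 8841234")` both give `"+52 998 884 1234"`.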

#### Website Verification
A handful of nodes have a website field. I wanted to verify that the referenced websites are still valid, so I created a script ... that checks whether those websites still return data. If a website no longer responded, I removed it from the JSON file.


```
python ./clean_websites.py
```
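A check like this can be sketched with the standard library alone. The code below is one possible approach under stated assumptions: the field is named `website`, a site counts as "responsive" if a request succeeds without an HTTP error, and the timeout of 10 seconds is arbitrary. The checker is passed in as a parameter so the filtering logic can be exercised without network access.

```python
import http.client
import urllib.error
import urllib.request


def is_responsive(url, timeout=10):
    """Return True if fetching `url` succeeds without an HTTP error."""
    if "://" not in url:
        url = "http://" + url  # OSM values sometimes omit the scheme
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException,
            ValueError, OSError):
        return False


def drop_dead_websites(docs, check=is_responsive):
    """Yield each document, without its 'website' key if the check fails."""
    for doc in docs:
        if "website" in doc and not check(doc["website"]):
            # Build a copy so the caller's document is not modified.
            doc = {k: v for k, v in doc.items() if k != "website"}
        yield doc
```

Checking sites one at a time can be slow for many documents; the sketch favors simplicity over speed.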


### Importing into a MongoDB Instance

#### Initializing the MongoDB instance
Follow the instructions on https://docs.mongodb.org/manual/installation/ to install MongoDB on your operating system.

#### Importing the OSM data

```
python ./import_data_to_mongodb.py
```
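For orientation, an import step along these lines could look like the sketch below. It assumes the JSON file holds one document per line, that `pymongo` is installed, and that MongoDB is running locally; the database name `osm` and collection name `yucatan` are placeholders, not names taken from `import_data_to_mongodb.py`.

```python
import json


def read_documents(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def import_documents(collection, documents, batch_size=1000):
    """Insert documents in batches via `collection.insert_many`.

    Returns the number of documents inserted.
    """
    batch, total = [], 0
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # insert the final, partially filled batch
        collection.insert_many(batch)
        total += len(batch)
    return total


def main():
    # Imported here so the helpers above work without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    collection = client["osm"]["yucatan"]  # placeholder names
    count = import_documents(collection,
                             read_documents("transformed_data.json"))
    print("Inserted {} documents".format(count))
```

Calling `main()` performs the import; batching keeps memory use bounded for large files.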

### Data Overview


```
python ./get_db_stats.py
```
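The statistics listed in the subsections below could be gathered with queries along the following lines. This is a sketch, not the contents of `get_db_stats.py`: it assumes each document has a `type` key (`node` or `way`) and stores the contributor under `created.user`, and it expects a `pymongo` collection object to be passed in.

```python
import os

# Group documents by contributing user and keep the largest group.
TOP_CONTRIBUTOR_PIPELINE = [
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 1},
]

# Group by user, keep users with exactly one document, then count them.
ONE_POST_USERS_PIPELINE = [
    {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
    {"$match": {"count": 1}},
    {"$group": {"_id": None, "users": {"$sum": 1}}},
]


def file_size_mb(path):
    """Return the size of `path` in megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1e6


def collect_stats(collection):
    """Return the overview statistics for a collection as a dict."""
    top = list(collection.aggregate(TOP_CONTRIBUTOR_PIPELINE))
    single = list(collection.aggregate(ONE_POST_USERS_PIPELINE))
    return {
        "documents": collection.count_documents({}),
        "nodes": collection.count_documents({"type": "node"}),
        "ways": collection.count_documents({"type": "way"}),
        "unique_users": len(collection.distinct("created.user")),
        "top_contributor": top[0] if top else None,
        "one_post_users": single[0]["users"] if single else 0,
    }
```

With a live connection, `collect_stats(client["osm"]["yucatan"])` would return all the counts in one dictionary, where the database and collection names are again placeholders.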

#### File Sizes

#### Number of Documents

#### Number of Nodes

#### Number of Ways

#### Number of Unique Users

#### Top Contributing User

#### Number of Single-Post Users


### Additional Ideas

### Conclusion

### References

### Appendix

#### Complete Python Script Execution Order
The following is the order of execution of all Python scripts for this project:

```
python ./download_yucatan_osm_data.py
python ./transform_data.py
python ./clean_street_names.py
python ./clean_phone_numbers.py
python ./clean_websites.py
python ./import_data_to_mongodb.py
python ./get_db_stats.py
```

#### Data Wrangling Lesson 6 Code
All lesson 6 code can be found in the lesson 6 folder.