# OpenStreetMap Analysis
Providence, RI and surrounding area

## Initial download and review
The target for this analysis is the OpenStreetMap data for the city of Providence, Rhode Island, and the contiguous urban area. This includes a large portion of Rhode Island and parts of Massachusetts, covering cities from Warwick and Bristol in RI up to Attleboro and Rehoboth in MA.

The downloaded XML file is approximately 150 MB in size. At that size, loading the whole document into memory for analysis is impractical, so the file has to be processed one element at a time. My immediate goal was a naive parsing of the file into JSON format so the data could be loaded into MongoDB and reviewed there. Following the initial review in MongoDB, I would then update the parsing process to better clean and capture the XML data. So, for the first pass, my central task was to parse all the data without the script choking.
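As a rough illustration of element-at-a-time parsing, the sketch below streams a tiny made-up OSM snippet with `xml.etree.ElementTree.iterparse`. The helper name `iter_top_level` and the sample data are assumptions for this example, not code from the project's audit file.

```python
import io
import xml.etree.ElementTree as ET

def iter_top_level(source, wanted=("node", "way", "relation")):
    """Yield each top-level OSM element, then release it from the tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the <osm> root
    for event, elem in context:
        if event == "end" and elem.tag in wanted:
            yield elem
            root.clear()  # detach processed elements so memory use stays flat

# Hypothetical sample standing in for the real 150 MB download.
sample = io.BytesIO(b"""<osm>
  <node id="1" lat="41.82" lon="-71.41"/>
  <way id="2"><nd ref="1"/><tag k="highway" v="residential"/></way>
</osm>""")

# Record each element name together with its number of direct children.
seen = [(elem.tag, len(elem)) for elem in iter_top_level(sample)]
```

Clearing the root after each yielded element is one common way to keep a streaming parse from accumulating the whole tree; the real script may handle this differently.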

Looking at the OSM Wiki (https://wiki.openstreetmap.org/wiki/OSM_XML), there's an immediate problem. The OSM XML format doesn't have an XSD schema associated with it, so there's no way to validate the contents of the XML file. We're relying on the OSM software to spit out the data in a consistent format. It most likely will, but you can't make any assumptions when using other people's data. If there are irregularities in the XML data (inconsistent element attributes, irregular nesting of elements, irregular element contents, etc.), this could halt the script or lead to anomalies in the JSON output.

The wiki does offer some good news, however. We get a full list of the "data primitives" in the OSM XML: https://wiki.openstreetmap.org/wiki/Elements. Specifically, there are `node`, `way`, and `relation` elements. We also learn that these elements can have child elements. All three can have `tag` elements as children; additionally, `relation` elements can have `member` children, and `way` elements can have `nd` children. We also get a list of element attributes, but there are no promises that they are consistently present.

The main function in the audit file, `parse_osm_xml`, includes a number of assertions to confirm that all the data adds up. As the function parses each individual XML element, it updates several running counts of simple information. These are:
* the total number of each element (including both parent and child elements)
* the total number of each attribute, grouped by element
* the total number of child elements, grouped by child element, grouped by parent element
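The three running counts above could be kept in nested counters, roughly as sketched here. This is a minimal illustration under stated assumptions: the function name `count_elements` and the sample XML are invented, and for brevity the sample is parsed from an in-memory string rather than streamed as the real file would need to be.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

def count_elements(xml_text):
    element_counts = Counter()               # element name -> total
    attribute_counts = defaultdict(Counter)  # element name -> attribute -> total
    child_counts = defaultdict(Counter)      # parent name -> child name -> total

    def visit(elem):
        element_counts[elem.tag] += 1
        for name in elem.attrib:
            attribute_counts[elem.tag][name] += 1
        for child in elem:
            child_counts[elem.tag][child.tag] += 1
            visit(child)

    # Skip the <osm> wrapper and count everything beneath it.
    for top in ET.fromstring(xml_text):
        visit(top)
    return element_counts, attribute_counts, child_counts

# Hypothetical sample data.
sample = """<osm>
  <node id="1" lat="41.82" lon="-71.41"><tag k="name" v="A"/></node>
  <node id="2" lat="41.83" lon="-71.40"/>
  <way id="3"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
</osm>"""

element_counts, attribute_counts, child_counts = count_elements(sample)
```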

Prior to writing the JSON output, the assertions confirm that 1) the total number of each element attribute equals the total number of elements (so all attributes are always present) and 2) the total number of possible child elements (i.e., `nd`, `tag`, `member`) equals the number of those elements that are in fact children of parent elements. There is a third assertion, confirming that none of the elements include text content.

While building the counts for validation, the script also grabs sample values of various data. These provide an easy way to get a sense of what the data looks like, without having to open up the XML file. And anyhow, we're already going through the file, so we may as well grab whatever we can while we're there. The samples are:
* each element attribute and one of its values, grouped by element
* each key (`k`) and value (`v`) attribute value on every `tag` element
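The three consistency checks described above might look something like the following. The function `check_counts`, its parameters, and the example numbers are assumptions made for illustration; the actual assertions live in `parse_osm_xml` and may be organized differently.

```python
CHILD_TAGS = ("nd", "tag", "member")

def check_counts(element_counts, attribute_counts, child_counts, text_count=0):
    # 1) Every attribute seen on an element type appears on every instance of it.
    for tag, attrs in attribute_counts.items():
        for name, n in attrs.items():
            total = element_counts.get(tag, 0)
            assert n == total, f"{tag}@{name}: {n} of {total}"
    # 2) Every nd/tag/member element was counted as a child of some parent.
    for child in CHILD_TAGS:
        as_child = sum(c.get(child, 0) for c in child_counts.values())
        total = element_counts.get(child, 0)
        assert as_child == total, f"{child}: {as_child} of {total}"
    # 3) No element carried text content.
    assert text_count == 0, f"{text_count} elements had text"
    return True

# Hypothetical counts for two nodes and one way.
ok = check_counts(
    element_counts={"node": 2, "way": 1, "nd": 2, "tag": 1},
    attribute_counts={"node": {"id": 2, "lat": 2, "lon": 2}, "way": {"id": 1}},
    child_counts={"way": {"nd": 2, "tag": 1}},
)
```

If any attribute were missing on even one element, the first assertion would fail with a message naming the element and attribute, which makes the irregularity easy to locate.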

After running the script a few times and outputting these counts and samples to the console, I realized it made more sense to write their contents to a file for future reference. The output is stored in the included `data_key.json` file, which proved helpful in developing the project.
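Writing the counts and samples out could be as simple as the sketch below. The layout with top-level `counts` and `samples` keys is an assumption for this example and may not match the structure of the actual `data_key.json`; `write_data_key` is likewise a hypothetical helper name.

```python
import json
import os
import tempfile

def write_data_key(counts, samples, path="data_key.json"):
    """Dump the audit summary as readable, stable JSON."""
    payload = {"counts": counts, "samples": samples}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True, ensure_ascii=False)
    return payload

# Write to a temporary directory here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "data_key.json")
    write_data_key({"node": 2}, {"node": {"id": "1"}}, out_path)
    with open(out_path, encoding="utf-8") as fh:
        reloaded = json.load(fh)
```

Sorting the keys keeps the file stable between runs, so changes in the data show up cleanly in a diff.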