# Wrangling OpenStreetMap Data (Dallas Metro Area)

In this project, I have used data munging techniques, such as assessing the quality of the data for 
* validity,
* accuracy, 
* completeness, 
* consistency and
* uniformity,

to clean the OpenStreetMap data for a part of the Dallas Metro Area.

## Project code structure

### Importing all required packages from Python

In [10]:
import codecs
import re
import json
import pprint
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

In this project we read the Dallas.osm file, which is in XML format, so all of its data is organised as a tree. To parse the XML we use Python's **xml.etree.ElementTree** package (the older **xml.etree.cElementTree** alias was deprecated in Python 3.3 and removed in Python 3.9).

The output of our code is JSON: the wrangled data, ready for further analysis. Python's **json** package handles the JSON operations.

The XML is created by humans and is therefore susceptible to data entry errors. To catch such errors we match the data against common mistakes using regular expressions, which are provided by Python's **re** package.

The JSON output can be difficult to read in plain text form. To display it in a readable layout we use Python's **pprint** package.

### Setting up all the regular expression variables 

In [12]:
# to get the street type
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
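
As a quick illustration of what this pattern captures, the sketch below runs it against a few street names. The sample names are made up for this example and are not taken from the Dallas extract.

```python
import re

# Same pattern as above: the last whitespace-free token of the string,
# optionally followed by a period.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Sample street names (made up for illustration).
samples = ["North Main St.", "Preston Rd", "Legacy Drive"]

# search() returns a match object; group() gives the matched street type.
street_types = [street_type_re.search(s).group() for s in samples]
print(street_types)
```

Note that the pattern simply takes the last token, so a name ending in a number or a direction would yield that token instead of a street type.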

### Setting up required dictionaries and lists

In [13]:
# elements to be nested in a json object
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
# expected name of the streets
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
		"Trail", "Parkway", "Commons"]
# mapping of street abbreviations to corrected names
mapping = { "St": "Street",
        "St.": "Street",
        "Ave": "Avenue",
        "Ave.": "Avenue",
        "Av": "Avenue",
        "Blvd": "Boulevard",
        "Blvd.": "Boulevard",
        "blvd": "Boulevard",
        "Dr": "Drive",
        "dr": "Drive",
        "Dr.": "Drive",
        "E": "East",
        "Expy": "Expressway",
        "Ln.": "Lane",
        "N": "North",
        "P": "Park",
        "Pkwy": "Parkway",
        "Rd": "Road",
        "Rd.": "Road"
        }

* The list (expected) and dictionary (mapping) above are used later in the code to check the quality of the street names and to correct them.
* The **CREATED** list is the list of attributes to be nested under the 'created' key of the resulting JSON object.
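
One way the expected list can support a quality check is sketched below: street names are grouped by their last token whenever that token is not an expected street type. The helper name `audit_street_types` and the sample names are assumptions made for this example; the notebook itself does not define such a function.

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Square", "Lane", "Road", "Trail", "Parkway", "Commons"]

def audit_street_types(street_names):
    """Hypothetical helper: group names by their last token if it is unexpected."""
    unexpected = defaultdict(set)
    for name in street_names:
        m = street_type_re.search(name)
        if m and m.group() not in expected:
            unexpected[m.group()].add(name)
    return unexpected

# Sample street names (made up for illustration).
sample_names = ["Elm St", "Commerce Street", "Ross Ave", "Lamar St"]
report = audit_street_types(sample_names)
for street_type, names in sorted(report.items()):
    print(street_type, sorted(names))
```

A report like this is a convenient way to decide which abbreviations are worth adding to the mapping dictionary.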

### Correcting the street names in addresses

In [15]:
# method for correcting the street names
def corrected_name(name):
    # find the street type (the last token of the name)
    m = street_type_re.search(name)
    if m:
        # get the string value matched
        street_type = m.group()
        # check if the value is not in the expected list
        if street_type not in expected:
            # if the value is found in mapping, replace only the street type
            # and keep the rest of the name
            if street_type in mapping:
                return name[:m.start()] + mapping[street_type]
    # if the expression is not found, return the initial string
    return name

This method returns the corrected name when the street type is present in the mapping dictionary; otherwise it returns the name unchanged.
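
A worked example on made-up names may help show the intended effect of the lookup: only the trailing street type changes and the rest of the name is kept. The helper `replace_street_suffix` below is a hypothetical standalone variant written for this illustration, with a small subset of the mapping repeated so that it runs on its own.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
# A small subset of the mapping above, repeated so the example is self-contained.
mapping = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Rd.": "Road"}

def replace_street_suffix(name):
    """Hypothetical helper: swap an abbreviated last token for its full form."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        # Keep everything before the match and append the full street type.
        return name[:m.start()] + mapping[m.group()]
    return name

# Sample street names (made up for illustration).
for raw in ["Elm St.", "Ross Ave", "Legacy Drive"]:
    print(raw, "->", replace_street_suffix(raw))
```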

### Shaping the data in required format

In [17]:
# get the element from the analysedata method
def shape_element(element):
    # initialize empty node dictionary
    node = {}
    # parse only the "node" and "way" tags
    if element.tag == "node" or element.tag == "way":
        if element.tag == "node":
            node['created'] = {}
            # type cast latitude and longitude values from string to float
            node['pos'] = [float(element.attrib['lat']), float(element.attrib['lon'])]
            node['type'] = "node"
            for k, v in element.attrib.items():
                if k in CREATED:  # attributes in CREATED go into the 'created' dict
                    node['created'][k] = v
                elif k == "lat" or k == "lon":  # already parsed into 'pos', so skip
                    continue
                else:  # everything else goes directly into the node dictionary
                    node[k] = v

        elif element.tag == "way":
            node['created'] = {}
            node['type'] = "way"
            for k, v in element.attrib.items():
                if k in CREATED:  # attributes in CREATED go into the 'created' dict
                    node['created'][k] = v
                else:  # everything else goes directly into the node dictionary
                    node[k] = v
            # collect the 'ref' of every 'nd' child element of the way
            nd = [n.attrib['ref'] for n in element.findall('nd')]
            # if nd is not empty, insert it into the node dictionary
            if nd:
                node["node_refs"] = nd

        # parse the address tags and insert them into the node dictionary
        addr = {}
        # find all the child elements named 'tag'
        for t in element.findall('tag'):
            # only consider 'k' values with exactly one ':' (e.g. "addr:street");
            # keys with two ':' are skipped
            if t.attrib['k'].count(':') == 1:
                prefix, field = t.attrib['k'].split(':')
                if prefix == "addr":
                    # correct the value with corrected_name before storing it
                    addr[field] = corrected_name(t.attrib['v'])
        # if addr is not empty, insert it into node
        if addr:
            node['address'] = addr
        # return the final node
        return node
    else:  # if it is not a node or way element, return None
        return None

* The above method checks for a 'node' or 'way' tag and parses it accordingly.
* The nested elements are handled in this method.
* The 'nd' array is created and filled in the branch for 'way'.
* Finally, the address is collected for both tags, and the 'corrected_name' method is called to correct each value.
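
The address handling can be sketched in isolation on a single element. The XML snippet below is made up for illustration (its values are not from the Dallas extract), and the sketch leaves out the name correction step to focus on how the 'k' attribute is split.

```python
import xml.etree.ElementTree as ET

# A made-up OSM node for illustration only.
xml_snippet = """
<node id="1" lat="32.78" lon="-96.80">
  <tag k="addr:street" v="Elm Street"/>
  <tag k="addr:housenumber" v="100"/>
  <tag k="addr:street:prefix" v="N"/>
  <tag k="amenity" v="cafe"/>
</node>
"""

element = ET.fromstring(xml_snippet)
address = {}
for tag in element.findall('tag'):
    key = tag.attrib['k']
    # Only keys with exactly one colon, such as "addr:street", are considered.
    if key.count(':') == 1:
        prefix, field = key.split(':')
        if prefix == "addr":
            address[field] = tag.attrib['v']

print(address)
```

The key with two colons and the key without the "addr" prefix are both ignored, so only the street and house number end up in the address dictionary.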

### The analysedata method

In [18]:
def analysedata(filename, pretty=False):
    data = []  # list that holds the shaped elements and is returned as the result
    file_out = "{0}.json".format(filename)  # output is saved as <filename>.json
    with codecs.open(file_out, 'w') as fo:
        for _, element in ET.iterparse(filename):
            # call shape_element to get the information present in the element and its sub-elements
            el = shape_element(element)
            # check that the returned data is not None
            if el:
                data.append(el)
                # if the pretty parameter is set to True, write the output in indented format
                if pretty:
                    fo.write(json.dumps(el, indent=2) + "\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    # return the data list with all the objects
    return data

* The analysedata method writes the wrangled and structured data to the output file.
* It also returns all of the data to the calling method for further processing.
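
The order in which iterparse hands over elements matters here. The sketch below, which uses a tiny made-up OSM document held in memory, shows that with the default settings an element is yielded only after its closing tag has been read, so child elements arrive before their parent.

```python
import io
import xml.etree.ElementTree as ET

# A tiny made-up OSM document held in memory instead of a file on disk.
osm_bytes = b"""<osm>
  <node id="1" lat="32.78" lon="-96.80">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="2">
    <nd ref="1"/>
  </way>
</osm>"""

# By default iterparse reports only "end" events, i.e. an element is yielded
# once its closing tag has been read.
seen = [element.tag for _, element in ET.iterparse(io.BytesIO(osm_bytes))]
print(seen)
```

Because a 'node' or 'way' is yielded after its children, the calls to findall('tag') and findall('nd') in shape_element see the complete set of child elements.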

### Calling the initial method  

In [33]:
def main():
	data = analysedata('Dallas_small.osm')
	pprint.pprint(data[0:10])

The main method calls analysedata with the name of the target .osm file to be analysed and pretty-prints the first ten results.
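
With pretty left at False, the output file holds one JSON object per line, so it can be read back line by line for later analysis. The following is a minimal sketch of that round trip; the records and the file name are made up for illustration, and a temporary directory stands in for the real output location.

```python
import json
import os
import tempfile

# Made-up records standing in for shaped elements.
records = [{"id": "1", "type": "node"}, {"id": "2", "type": "way"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.osm.json")
    # Write one JSON object per line, as analysedata does when pretty is False.
    with open(path, "w", encoding="utf-8") as fo:
        for record in records:
            fo.write(json.dumps(record) + "\n")
    # Read the file back, parsing each line as its own JSON document.
    with open(path, encoding="utf-8") as fi:
        loaded = [json.loads(line) for line in fi]

print(loaded == records)
```

This line-by-line reading does not apply to the indented output produced with pretty set to True, because each object then spans several lines.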

### Starting point of our code

In [34]:
# call the main method
if __name__ == '__main__':
	main()

[{'created': {'changeset': '641383',
              'timestamp': '2008-10-31T13:10:04Z',
              'uid': '9065',
              'user': 'brianboru',
              'version': '4'},
  'id': '26450261',
  'pos': [32.9901295, -97.0027785],
  'type': 'node'},
 {'created': {'changeset': '232647',
              'timestamp': '2007-03-09T23:15:37Z',
              'uid': '6514',
              'user': 'user_6514',
              'version': '1'},
  'id': '26450262',
  'pos': [32.9905615, -97.0033364],
  'type': 'node'},
 {'created': {'changeset': '641383',
              'timestamp': '2008-10-31T13:10:08Z',
              'uid': '9065',
              'user': 'brianboru',
              'version': '2'},
  'id': '26450263',
  'pos': [32.9890496, -96.9993453],
  'type': 'node'},
 {'created': {'changeset': '641383',
              'timestamp': '2008-10-31T13:10:04Z',
              'uid': '9065',
              'user': 'brianboru',
              'version': '4'},
  'id': '26450265',
  'pos': [32.9888157, -

### The Reduced Output of the wrangled data.