# Wrangle OpenStreetMap Data (Houston Area):

### Introduction:

#### Data wrangling is one of the most important phases in data science. It is often said to account for about 70 percent of data analysis work. It is the process of gathering, extracting, cleaning and storing data [1]. Much of the data available today comes in complex formats: data from social networking sites, video streaming sites and similar sources arrives as XML, JSON or other standard formats. Data wrangling begins with gathering data from one of these sources. In this project, the data is gathered from OpenStreetMap (OSM), which distributes free geographic data for the whole world [2] and provides map data in many different formats. This project applies several data wrangling techniques to check the quality of the OSM data.

Programming Language: Python

Database: Mongo DB

Modules used: Cleaning.py

#### Download link: https://mapzen.com/data/metro-extracts/metro/houston_texas/

In [43]:
'''
                                           ##################
                                           #  Project Flow: #
                                           ##################
        
        1) Auditing:
            1.1) Loading the OSM file.
            1.2) Studying the file structure (tags, attributes, etc.)
            1.3) Auditing different tags to identify recurring inconsistencies.
        
        2) Cleaning:
            2.1) Defining the cleaning function for each attribute
            2.2) Shaping the element
            2.3) Writing the shaped element to JSON file.
            
        3) Mongo DB:
            3.1) Importing the JSON file to Mongo DB
            3.2) Validating the counts
            3.3) Statistical overview of dataset using DB queries
        
        4) Ideas for additional improvements
            
'''
import os
import xml.etree.ElementTree as ET
from collections import defaultdict
import pprint
import re
import phonenumbers
import codecs
import json

### Auditing:

In [44]:
'''The houston_texas.osm file is downloaded from Mapzen metro extracts.
   The download is compressed (.osm.bz2); the actual file can be extracted
   with WinRAR. The extracted file is large: the Houston OSM file is
   approx. 700 MB.'''

print ("Size of houston_texas.osm file in MB:")
print (os.path.getsize("houston_texas.osm")/1000000)

Size of houston_texas.osm file in MB:
691


In [3]:
import pprint
osm_file = open("houston_texas.osm", "r")
'''The data in the OSM file is organized as XML, with a root node and
   many child nodes. Python provides several ways to parse an XML file.
   Because the file is huge, it needs to be parsed iteratively, so that
   elements are handed to us one at a time instead of the whole document
   being read first. To identify the elements that are to be audited and
   cleaned, we first need to know which types of tags the file contains.'''

all_tags = {}

for _,element in ET.iterparse(osm_file):
    if element.tag in all_tags:
        all_tags[element.tag] += 1
    else:
        all_tags[element.tag] = 1
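One caveat about iterative parsing: `ET.iterparse` still attaches every parsed element to the tree unless it is cleared, so memory use can keep growing on a large file. Below is a minimal sketch of the same tag count that clears each element after counting it. The `count_tags` helper and the tiny in-memory sample are assumptions made for illustration, not code from this project.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count element tags while streaming through an XML source."""
    counts = Counter()
    # iterparse yields each element once its closing tag has been read.
    for _, element in ET.iterparse(source, events=("end",)):
        counts[element.tag] += 1
        # Only the tag name is needed here, so drop the element's
        # attributes and children to keep memory use low.
        element.clear()
    return counts


# Tiny hypothetical sample standing in for houston_texas.osm.
sample = io.BytesIO(
    b'<osm><node id="1"><tag k="amenity" v="cafe"/></node>'
    b'<way id="2"><nd ref="1"/></way></osm>'
)
tag_counts = count_tags(sample)
```

Clearing every element is only safe because nothing but the tag name is read. When the child `tag` elements of a node or way are needed later, the clearing has to wait until the parent element has been processed.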

In [4]:
'''All the tags of the OSM file are displayed below. OSM has three main data structures [3]:
    1) Node
    2) Way
    3) Relation
    
    Each tag describes a geographic attribute of the feature being shown by that specific node, 
    way or relation.
    
    A node is one of the core elements in the OpenStreetMap data model. It consists of a single point in 
    space defined by its latitude, longitude and node id.
    
    A way is an ordered list of nodes which normally also has at least one tag or is included within a 
    Relation. 
    
    A relation consists of one or more tags and also an ordered list of one or more nodes, ways and/or 
    relations as members which is used to define logical or geographic relationships between other 
    elements.
    
    '''

pprint.pprint (all_tags)

{'bounds': 1,
 'member': 27113,
 'nd': 3634961,
 'node': 3039649,
 'osm': 1,
 'relation': 2467,
 'tag': 2089952,
 'way': 368288}


In [5]:
'''
  This project audits and cleans only two types of top-level tags (node, way) and their attributes.
  Following the classroom case study, these rules are used for auditing and cleaning:

 - Process only 2 types of top level tags: "node" and "way"
 
 - All attributes of "node" and "way" should be turned into regular key/value pairs, except:
     - attributes in the CREATED array should be added under a key "created"
     - attributes for latitude and longitude should be added to a "pos" array,
       for use in geospatial indexing. Make sure the values inside the "pos" array are floats
       and not strings.
       
 - If the second level tag "k" value contains problematic characters, it should be ignored.
 
 - If the second level tag "k" value starts with "addr:", it should be added to a dictionary "address"
 
 - The value of addr:street should be audited, and unexpected street types should be cleaned to the
   appropriate type from the expected list provided by Houston city council [4]. The street name may
   appear in both node and way tags.
   For example: St.  => Street
                Blvd => Boulevard
   
 - If the second level tag "k" value does not start with "addr:", but contains ":", it can be
   processed in any way you like.
    
 - If an "addr:" key has a second ":" that separates the type/direction of a street,
   the tag should be ignored.
   
 - The value of addr:city should be audited and unexpected city names should be cleaned to
   the plain city name.
   For example: "Pearland, TX" => "Pearland"
   
 - The value of addr:postcode should be audited and unexpected post codes should be cleaned to
   the standard five-digit format.
   For example: "TX 77009"   => "77009"
                "77340-3124" => "77340"  
            
 - Otherwise, if there is a second ":", it should be replaced with "_" and the tag added as a
   key/value pair.

 - The value of phone should be audited and converted to a standard format. There are many
   standard formats for phone numbers, but I would like to store every number in one format in
   my DB: (XXX) XXX-XXXX.
   For example: +1 281-776-0143 => (281) 776-0143
  '''
print "The Houston metro extract has: %d nodes and %d ways" %(all_tags["node"],all_tags["way"])

The Houston metro extract has: 3039649 nodes and 368288 ways
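To make the shaping rules above concrete, here is a minimal sketch of a shaping function. It is not the code in Cleaning.py. The `shape_element` name, the `CREATED` list (taken from the fields that appear under "created" in the stored documents) and the choice to replace every ":" in non-address keys with "_" are assumptions for illustration, and the value-cleaning steps (street, city, postcode, phone) are left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed list of attributes that go under the "created" key.
CREATED = ["version", "changeset", "timestamp", "user", "uid"]
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def shape_element(element):
    """Turn a <node> or <way> element into a dictionary, or return None."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "created": {}}
    for key, value in element.attrib.items():
        if key in CREATED:
            doc["created"][key] = value
        elif key not in ("lat", "lon"):
            doc[key] = value
    if "lat" in element.attrib and "lon" in element.attrib:
        # Positions are stored as floats for geospatial indexing.
        doc["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    for tag in element.iter("tag"):
        key, value = tag.attrib["k"], tag.attrib["v"]
        if PROBLEMCHARS.search(key):
            continue  # ignore keys with problematic characters
        if key.startswith("addr:"):
            parts = key.split(":")
            if len(parts) == 2:  # skip keys such as addr:street:type
                doc.setdefault("address", {})[parts[1]] = value
        else:
            doc[key.replace(":", "_")] = value
    node_refs = [nd.attrib["ref"] for nd in element.iter("nd")]
    if node_refs:
        doc["node_refs"] = node_refs
    return doc


# Hypothetical sample node.
sample = ET.fromstring(
    '<node id="1" lat="29.76" lon="-95.36" user="someone" uid="7" '
    'version="1" changeset="10" timestamp="2015-01-01T00:00:00Z">'
    '<tag k="addr:street" v="Main Street"/>'
    '<tag k="addr:street:type" v="Street"/>'
    '<tag k="amenity" v="cafe"/></node>'
)
shaped = shape_element(sample)
```

For the sample node, `shaped` holds the position as floats under "pos", the street under "address", and the amenity as a regular key/value pair.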


In [45]:
# Defining regular expressions:
street = re.compile(r'^(addr:street$)')

city = re.compile(r'^(addr:city$)')

postcode = re.compile(r'^(addr:postcode$)')

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
# The regular expression above captures the last part of the street name.
# For example: street_type_re.search("Hillcroft Rd.").group() => "Rd."

city_re = re.compile(r'TX$', re.IGNORECASE)

zero_one_colon = re.compile(r'^(\w+:?\w*$)')

zip_code1_re = re.compile(r'(-\d*$)')
zip_code2_re = re.compile(r'(^TX)',re.IGNORECASE)

phone_start_re = re.compile(r'^(\(|\+1|[2-9])')


lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


'''
The approved list is provided by Houston city council. It is available in appendix III of [4].
'''

Houston_approved_st_types = ["Avenue","Boulevard","Bridge","Bypass","Circle","Court","Crossing",
                             "Crossroad","Drive","Expressway","Fork","Freeway","Highway",
                             "Lane","Loop","Motorway","Oval","Parkway","Passage","Path","Place",
                             "Road","Street","Throughway","Trail","Tunnel","Way"]

In [46]:
# Auditing the necessary attributes:
street_types = defaultdict(set)
post_code_set = set()
city_set = set()
phone_set = set()

def audit_osm(osmfile):
    """Collect unexpected street types, city names, post codes and phone numbers."""
    # The with-statement closes the file once the audit is finished.
    with open(osmfile, "r") as osm_file:
        for _, element in ET.iterparse(osm_file):
            if element.tag == "way" or element.tag == "node":
                for tag in element.iter("tag"):
                    key = tag.attrib["k"]
                    if street.search(key):
                        # Keep street names whose last word is not an approved type
                        m = street_type_re.search(tag.attrib["v"])
                        if m:
                            street_type = m.group()
                            if street_type not in Houston_approved_st_types:
                                street_types[street_type].add(tag.attrib["v"])
                    elif city.search(key):
                        city_set.add(tag.attrib["v"])
                    elif postcode.search(key):
                        post_code_set.add(tag.attrib["v"])
                    elif key == "phone":
                        phone_set.add(tag.attrib["v"])

In [47]:
file_in = "houston_texas.osm"
audit_osm(file_in)
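The phone numbers collected by the audit are meant to end up as (XXX) XXX-XXXX. The notebook imports the phonenumbers package [8], and the real cleaning lives in Cleaning.py. The sketch below is a standard-library approximation only. The `update_phone` name and the rule of returning `None` for anything that is not a 10-digit US number are assumptions.

```python
import re


def update_phone(raw):
    """Return the number as (XXX) XXX-XXXX, or None if it is not a 10-digit US number."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    if len(digits) != 10:
        return None
    return "({}) {}-{}".format(digits[:3], digits[3:6], digits[6:])


cleaned_phone = update_phone("+1 281-776-0143")
```

Returning `None` instead of guessing keeps malformed values visible, so they can be reviewed in a later audit pass.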

In [48]:
print(street_types.keys())

['1142', 'Walk', 'Ridge', 'FM646', 'Lake', 'F-3', 'Rd', '130', 'Bailey', 'Oaks', 'TX-332', 'Texas', '1774', 'Mews', 'Business', '90A', 'Cypress', 'Dallas', '6475', 'Mall', '1464', '285', 'Fuzzel', '1462', 'Maroneal', '1092', 'Montrose', 'L', 'Pkwy', '121', 'T', 'one', '701', '59', 'Westheimer', 'Durham', 'Broadway', '1663', '925', '290', 'S.', '146', 'Beechnut', '596', 'Driscoll', 'West', '316', '270', 'Stree', '77027', 'street', '227A', '110', '1093', '1488', '87', 'Road)', 'Run', 'Hillcroft', 'Park', '77598', 'Isle', '90a', '521', '362', '529', 'Elm', '309', 'C', '303', 'Ave.', 'G', '300', 'Plaza', 'O', 'Plaze', 'Felipe', 'S', 'Fwy', '242', '103', '249', 'HIGHWAY', '105', 'Square', '1640', 'Point', 'MacGregor', 'Westhimer', '36', 'Speedway', '518', '646', 'Welford', 'Hidalgo', 'Es', '240', '9k', 'T2008', 'Blossom', '2920', 'St.', 'Rock', '575', '332', 'Richmond', '65', 'A-527', 'North', 'Crosstimbers', '704', '180', 'FM517', '650', '6', 'Blvd.', '502', 'B', 'N,', '1/2', 'F', '185', '
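Many of the unexpected street types above are abbreviations or misspellings of approved types. A minimal sketch of how such values could be mapped to the approved names follows. The `STREET_MAPPING` dictionary covers only a handful of the audited values, and both it and the `update_street_name` helper are assumptions for illustration rather than the code in Cleaning.py.

```python
import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical mapping built from a few of the audited values.
STREET_MAPPING = {
    "St.": "Street", "Stree": "Street", "street": "Street",
    "Rd": "Road", "Ave.": "Avenue", "Blvd.": "Boulevard",
    "Pkwy": "Parkway", "Fwy": "Freeway", "HIGHWAY": "Highway",
}


def update_street_name(name, mapping=STREET_MAPPING):
    """Replace the trailing street type if it has a known correction."""
    match = street_type_re.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name


fixed_street = update_street_name("Hillcroft Rd")
```

Names that end in a number or a proper name (for example '1142' or 'Westheimer') are left untouched by this sketch, since they cannot be fixed with a simple lookup.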

In [49]:
'''
   Auditing the city names shows a few inconsistent names that have ", TX" (or a variant) appended.
   They can be made consistent by keeping the city name alone.
   For example: 'Angleton,TX' => 'Angleton'
   
'''
print city_set

set(['KATY', 'La Porte', 'katy', 'El Lago', 'The Woodlands', 'Jersey Village', 'Porter', 'Klein', 'Nassau Bay', 'Cypress', 'Wallis', 'Hockley', 'Angleton,TX', 'Conroe', 'DEER PARK', 'La Marque', 'Crystal Beach', 'Spring, TX', 'Lake Jackson, TX', 'Webster', 'Katy, TX', 'Tomball, Tx', 'Katy', 'Fresno', 'Laks Jackson', 'TEXAS CITY', 'League City, TX', 'Winnie', 'clear lake shores', 'Deer Park', 'Houston, TX', 'Friendswood, TX', 'Pearland', 'Dickinson', 'Liberty', 'Fulshear', 'Little York', 'Baytown', 'Beasley', 'HOUSTON', 'Hempstead', 'Angleton, TX', 'Rosharon', 'Santa Fe, TX', 'Tomball', 'West Columbia, TX', 'Atascocita', 'Woodlands', 'San Leon', 'Seabrook', 'Hedwig Village', 'Clear Lake Shores', 'Plantersville', 'Sugar Land', 'New Caney', 'Alvin', 'Sealy', 'Rosenberg', 'Friendswood', 'Santa Fe', 'houston', 'Alvin, TX', 'West University', 'MAGNOLIA', 'Houston, Texas', 'Magnolia', 'Bellaire', 'Richmond', 'Pasadena', 'Humble, TX', 'Cypress, TX', 'Pasadena, TX', 'Kemah', 'Shenandoah', 'West
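A minimal sketch of the city cleanup described above, assuming a hypothetical `update_city` helper. It strips a trailing ", TX" in any capitalisation, and also ", Texas", which is an extension beyond the rule stated above. It does not normalise case, so 'KATY' and 'katy' would still need separate handling.

```python
import re

# Matches a trailing state suffix such as ",TX", ", Tx" or ", Texas".
state_suffix_re = re.compile(r'\s*,\s*(TX|Texas)\s*$', re.IGNORECASE)


def update_city(name):
    """Drop a trailing state suffix and surrounding whitespace."""
    return state_suffix_re.sub("", name).strip()


cleaned_city = update_city("Angleton,TX")
```

Because the pattern requires a comma, a name such as 'TEXAS CITY' is not touched.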

In [50]:
'''
    Auditing the postal codes shows a few inconsistent codes that begin with 'TX'.
    Those postal codes can be updated to one consistent format. The second type of cleaning
    applies to nine-digit postal codes, which can be shortened to the five-digit code.

    For example: a) "TX 77086"    => "77086"
                 b) "77025-9998"  => "77025" 
'''

print (post_code_set)

set(['77047', '77007-2121', '77045', '77044', '77043', '77042', '77041', '77040', '77575', '77573', '77571', '77451', '77379', '77375', '77377', '77373', '77477', '77474', '77017', '77471', '77049', '77478', '77479', '77083', '73032', '77027-6850', '77584-', '77338', '77568', '77051', '77357', '77053', '77054', '77055', '77056', '77007-2113', '77058', '77059', 'TX 77009', '77565', '77566', '77388', 'TX 77494', '77365', 'TX 77086', '77363', '74404', '77441', '77447', '773867386', '77449', '77590', '77389', '77025-9998', '77025', '77024', '77027', '77021', '77020', '77023', '77022', '77515', '77511', '77510', '77355', '77354', '77530', '77591', '77459', '77450', '77598', '77019-1999', '77204', '77345', '77346', '77042-9998', '77036', '77037', '77034', '77035', '77032', '77030', '77031', '77506', '77504', '77505', '77502', '77503', '77038', '77039', '77429', '77587', '77584', '77583', '77580', '77581', 'Weslayan Street', '77423', '77422', '77665', 'tx 77042', '77498', '77089', '77088', '7
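The two postcode fixes can be sketched as follows. This is an illustration under assumptions, not the `update_zipcode` function from Cleaning.py: the regular expressions here are my own, and values that match neither pattern (for example 'Weslayan Street') are returned unchanged.

```python
import re

state_prefix_re = re.compile(r'^TX\s*', re.IGNORECASE)
zip_plus4_re = re.compile(r'^(\d{5})-\d*$')


def update_zipcode(code):
    """Strip a leading 'TX' and reduce ZIP+4 codes to five digits."""
    code = state_prefix_re.sub("", code.strip())
    match = zip_plus4_re.match(code)
    if match:
        code = match.group(1)
    return code


cleaned_zip = update_zipcode("TX 77009")
```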

### Cleaning:

In [21]:
'''
    process_map shapes each element using the update functions defined in the Cleaning.py script
    and writes the shaped element to a JSON file. For more details, see Cleaning.py.
'''
from Cleaning import process_map

In [23]:
process_map(file_in, False)
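Since Cleaning.py is not shown here, the following is only a rough sketch of what a function like `process_map` could do: stream the OSM file, shape each node and way, and write one JSON document per line, which is a layout mongoimport accepts. The `write_json_lines` name, its `shape` parameter and the sample data are all assumptions.

```python
import codecs
import io
import json
import os
import tempfile
import xml.etree.ElementTree as ET


def write_json_lines(osm_source, json_path, shape):
    """Shape each node/way with `shape` and write one JSON document per line."""
    written = 0
    with codecs.open(json_path, "w", encoding="utf-8") as out:
        for _, element in ET.iterparse(osm_source, events=("end",)):
            if element.tag in ("node", "way"):
                doc = shape(element)
                if doc:
                    out.write(json.dumps(doc) + "\n")
                    written += 1
            if element.tag in ("node", "way", "relation"):
                # Free the element only after its children were used.
                element.clear()
    return written


# Hypothetical sample and a trivial shaping function for demonstration.
sample = io.BytesIO(
    b'<osm><node id="1"/><way id="2"><nd ref="1"/></way><relation id="3"/></osm>'
)
out_path = os.path.join(tempfile.mkdtemp(), "sample.osm.json")
count = write_json_lines(
    sample, out_path, lambda e: {"id": e.attrib["id"], "type": e.tag}
)
```

Passing the shaping function in as a parameter keeps the streaming and file-writing logic separate from the cleaning rules, which makes each part easier to test on a small sample.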

### MongoDB:

The JSON file is imported into MongoDB from the command prompt. I tried importing with pymongo but ran into the same problems
that most of the other students faced. The discussion forum saved me: it turned out that the best way to import is through the command prompt.
    
    Command:
    
    mongoimport -d openstreetmap -c houston --file "C:\Users\vemul\Data_Analyst_Nano_Degree\P3-Open_Street_Map_Data_Wrangling-Python_MongoDB\houston_texas.osm.json"

In [25]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.openstreetmap

In [26]:
# Statistics of OSM collection:
print db.command("dbstats")

{u'storageSize': 265359360.0, u'ok': 1.0, u'avgObjSize': 246.62568234095878, u'views': 0, u'db': u'openstreetmap', u'indexes': 1, u'objects': 3407937, u'collections': 1, u'numExtents': 0, u'dataSize': 840484788.0, u'indexSize': 34189312.0}


In [27]:
'''
    Some fun with the counts:
'''
num_docs = db.houston.find().count()
print "Number of documents inserted:", num_docs

node_query = {"type":"node"}
num_nodes = db.houston.find(node_query).count()
print "Number of nodes inserted:", num_nodes

way_query = {"type":"way"}
num_ways = db.houston.find(way_query).count()
print "Number of ways inserted:", num_ways

Number of documents inserted: 3407937
Number of nodes inserted: 3039646
Number of ways inserted: 368276


In [28]:
'''
    There is a count mismatch: some documents inserted into MongoDB have a type other than
    node or way. There are 15 such documents (3407937 - 3039646 - 368276 = 15).
'''
pipeline = [
            {"$match":{"type":{"$ne":"node"}}},
            {"$match":{"type":{"$ne":"way"}}}      
]

In [29]:
'''
    Print all the inconsistent types:
'''
for doc in db.houston.aggregate(pipeline):
    print doc["type"]

Public
building
dance
public
oil
outdoor
gas
gas
gas
gas
civil
Public
public
multipolygon
multipolygon


#### After looking into those specific documents, I noticed that they are actually nodes or ways that carry an extra tag called "type". That tag's value replaced the existing type value (node or way) in the dictionary. As there are only 15 such documents, I decided to leave them as they are, without deleting or updating them.
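If these documents ever needed fixing, one option would be to guard the structural fields while shaping. The sketch below is an assumption-laden illustration: the `add_tag` helper, the `RESERVED_KEYS` set (based on the field names seen in the stored documents) and the "tag_" prefix are all hypothetical choices, not part of Cleaning.py.

```python
# Fields the shaping step itself writes; tags must not overwrite them.
RESERVED_KEYS = {"type", "id", "pos", "created", "address", "node_refs"}


def add_tag(doc, key, value, reserved=RESERVED_KEYS):
    """Store a tag on the shaped document without clobbering structural fields."""
    if key in reserved:
        key = "tag_" + key  # hypothetical prefix to avoid the collision
    doc[key] = value
    return doc


doc = {"type": "way", "id": "42"}
add_tag(doc, "type", "multipolygon")
add_tag(doc, "building", "yes")
```

With this guard the document keeps "type": "way", and the OSM tag is preserved separately as "tag_type": "multipolygon".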

In [30]:
'''
    Number of unique users:
'''
users = db.houston.distinct('created.user')
print "Number of users who contributed to OSM-Houston area:", len(users)

Number of users who contributed to OSM-Houston area: 1648


In [31]:
def aggregate(pipeline):
    return [doc for doc in db.houston.aggregate(pipeline,allowDiskUse=True)]

In [32]:
# A few basic stats (the ones suggested in the project requirements)
'''
    Top ten creators
'''
user_pipeline = [{"$group":{"_id": "$created.user",
                          "count":{"$sum":1}}},
                 {"$sort":{"count":-1}},
                 {"$limit":10}
                ]
users = aggregate(user_pipeline)
pprint.pprint(users)

[{u'_id': u'woodpeck_fixbot', u'count': 567584},
 {u'_id': u'TexasNHD', u'count': 538419},
 {u'_id': u'afdreher', u'count': 478666},
 {u'_id': u'scottyc', u'count': 204770},
 {u'_id': u'cammace', u'count': 192874},
 {u'_id': u'claysmalley', u'count': 136311},
 {u'_id': u'brianboru', u'count': 117229},
 {u'_id': u'skquinn', u'count': 86202},
 {u'_id': u'RoadGeek_MD99', u'count': 81986},
 {u'_id': u'Memoire', u'count': 56656}]


In [33]:
'''
    Top ten sources
'''
source_pipeline = [{"$match":{"source":{"$exists":1}}},
                   {"$group":{"_id":"$source",
                             "count":{"$sum":1}}},
                   {"$sort":{"count":-1}},
                   {"$limit":10}
                  ]
sources = aggregate(source_pipeline)
pprint.pprint(sources)

[{u'_id': u'Bing', u'count': 5926},
 {u'_id': u'USGS Geonames', u'count': 876},
 {u'_id': u'Yahoo', u'count': 589},
 {u'_id': u'bing', u'count': 300},
 {u'_id': u'Mapbox', u'count': 282},
 {u'_id': u'PGS', u'count': 221},
 {u'_id': u'Yahoo,TIGER', u'count': 160},
 {u'_id': u'http://www.epa.gov/enviro/geo_data.html', u'count': 80},
 {u'_id': u'TIGER/Line\xae 2008 Place Shapefiles (http://www.census.gov/geo/www/tiger/)',
  u'count': 70},
 {u'_id': u'ground', u'count': 59}]


In [34]:
'''
    Amenities in The Woodlands (my locality). I learnt how to use a logical operator within
    the $match stage [5].
'''
woodlands_pipeline = [{"$match":{
                           "$or":[
                                   {"address.city":{"$eq":"The Woodlands"}},
                                   {"address.city":{"$eq":"woodlands"}} 
                                 ]}},
                       {"$match":{"amenity":{"$exists":1}}},
                       {"$project":{"amenity":"$amenity",
                                  "name":"$name",
                                   "_id":0}}
                     ]
woodlands_amenity = aggregate(woodlands_pipeline)
pprint.pprint(woodlands_amenity)

[{u'amenity': u'pharmacy', u'name': u'H-E-B Pharmacy'},
 {u'amenity': u'dentist', u'name': u'Portofino Dental'},
 {u'amenity': u'pharmacy', u'name': u'Randalls Pharmacy'},
 {u'amenity': u'restaurant', u'name': u"Fleming's Steakhouse"},
 {u'amenity': u'restaurant', u'name': u'Atsumi'},
 {u'amenity': u'veterinary', u'name': u'Windvale Pet Hospital'},
 {u'amenity': u'restaurant', u'name': u'Fogo de Ch\xe3o Brazilian Steakhouse'},
 {u'amenity': u'place_of_worship', u'name': u'The Woodlands Methodist Church'},
 {u'amenity': u'cafe', u'name': u'Starbucks'},
 {u'amenity': u'bank', u'name': u'Citizens Bank'},
 {u'amenity': u'fast_food', u'name': u'Chick-fil-A'},
 {u'amenity': u'fast_food', u'name': u'Whataburger'},
 {u'amenity': u'restaurant', u'name': u'Sweet Tomatoes'},
 {u'amenity': u'school', u'name': u'Knox Junior High School'},
 {u'amenity': u'school', u'name': u'The Woodlands College Park High School'},
 {u'amenity': u'place_of_worship',
  u'name': u'Northwoods Unitarian Universalist Ch

In [35]:
'''
    Top five cities in the Houston area by number of amenities:
'''
city_amenities_pipeline = [{"$match":{"amenity":{"$exists":1}}},
                           {"$match":{"address.city":{"$exists":1}}},
                           {"$group":{"_id":"$address.city",
                             "count":{"$sum":1}}},
                           {"$sort":{"count":-1}},
                           {"$limit":5}
                          ]
city_amenities = aggregate(city_amenities_pipeline)
pprint.pprint(city_amenities)

[{u'_id': u'Houston', u'count': 1007},
 {u'_id': u'Kingwood', u'count': 96},
 {u'_id': u'Katy', u'count': 92},
 {u'_id': u'Sugar Land', u'count': 74},
 {u'_id': u'Tomball', u'count': 69}]


In [36]:
'''
    Eateries with Chinese cuisine:
'''
chinese_cuisine_pipeline = [{"$match":{
                            "$or":[
                                   {"amenity":{"$eq":"restaurant"}},
                                   {"amenity":{"$eq":"fast_food"}} 
                                 ]}},
                            {"$match":{"cuisine":{"$eq":"chinese"}}},
                            {"$project":{
                                  "name":"$name",
                                   "_id":0}}
                           ]
chinese_cuisine = aggregate(chinese_cuisine_pipeline)
pprint.pprint(chinese_cuisine)

[{u'name': u'Chinese Buffet'},
 {u'name': u'Grand Buffet'},
 {u'name': u"P. F. Chang's China Bistro"},
 {u'name': u'Chopsticks'},
 {u'name': u'Chinois Orient Bistro'},
 {u'name': u'Fairwan Hunan Restaurant'},
 {u'name': u'Wan Fu'},
 {u'name': u'Panda Express'},
 {u'name': u"Hu's Garden"},
 {u'name': u'Pei Wei'},
 {u'name': u'Hunan River Bistro'},
 {u'name': u'D Wok Express'},
 {u'name': u'Chinese Buffet'},
 {u'name': u'Yu Garden'},
 {u'name': u'China Stix'},
 {u'name': u'888 Chinese Restaurant'},
 {u'name': u"Fang's Cafe"},
 {u'name': u'Panda Express'},
 {u'name': u'Kim Son'},
 {u'name': u'Los Chinos Rico'},
 {u'name': u'Vstar'},
 {u'name': u"Chang's Chinese"},
 {u'name': u"Hin's Garden"},
 {u'name': u'Happy Lamp Restaurant '},
 {u'name': u'Panda Express'},
 {u'name': u'Cafe East Chinese Restaurant and Buffet'},
 {u'name': u'Oriental Gardens Chinese Restaurant'},
 {u'name': u'Hunan Garden Restaurant'},
 {u'name': u'East Star Chinese Buffet'},
 {u'name': u'Szechuan Garden'},
 {u'name': 

In [37]:
'''
    What is the most referenced node in the Houston area? A bar? A place of worship? Let's check it out!
'''
node_ref_pipeline = [{"$match":{"type":{"$eq":"way"}}},
                     {"$unwind":"$node_refs"},
                     {"$group":{"_id":"$node_refs",
                             "count":{"$sum":1}}},
                     {"$sort":{"count":-1}},
                     {"$limit":1}
                    ]
node_ref = aggregate(node_ref_pipeline)
pprint.pprint(node_ref)

# Initially the $group stage exceeded the 100 MB memory limit and the query failed. After some research I learnt about
# the allowDiskUse=True option in the aggregation framework [6] and updated the aggregate function accordingly.

[{u'_id': u'3759210966', u'count': 14}]


In [38]:
'''
    Extracting the most referenced node details:
'''
ref = [{"$match":{"id":{"$eq":node_ref[0]["_id"]}}}
      ]
n = aggregate(ref)
pprint.pprint(n)

# Unfortunately we do not have any details about the most referenced node except its position

[{u'_id': ObjectId('595b0177a7c5d286d0f6c9ee'),
  u'created': {u'changeset': u'34249805',
               u'timestamp': u'2015-09-25T18:18:05Z',
               u'uid': u'3119079',
               u'user': u'cammace',
               u'version': u'1'},
  u'id': u'3759210966',
  u'pos': [29.7214822, -95.3403131],
  u'type': u'node'}]


In [42]:
'''
    Time to validate our update_zipcode function:
'''
zipcode_pipeline = [{"$match":{"address.postcode":{"$exists":1}}},
                    {"$group":{"_id":"$address.postcode"}}
                   ]
zipcode = aggregate(zipcode_pipeline)

'''
    The update_zipcode function cleans only two types of inconsistencies (TX 77086 and 77340-7856).
    The result below shows that those two inconsistencies are cleaned well. But we do have other
    invalid formats, such as a street name in the zipcode field, a 9-digit zipcode without '-', or a
    zipcode with fewer than 5 digits. If zipcode quality is critical, those inconsistencies have to
    be cleaned by revisiting the cleaning process (data wrangling is an iterative process).
'''

# Let's print a few cleaned zipcodes. The zipcode list is very long, so only every tenth one is printed.
for number,item in enumerate(zipcode):
    if number%10 == 0:
        print item.values()

[u'77545']
[u'77363']
[u'77012']
[u'77028']
[u'77093']
[u'77061']
[u'77520']
[u'77081']
[u'77573']
[u'77486']
[u'77590']
[u'77056']
[u'77066']
[u'77385']
[u'77023']
[u'77450']
[u'77006']
[u'77554']
[u'77030']
[u'77506']
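A next audit pass could separate the values that already look like five-digit zipcodes from the ones that still need attention. This is a minimal sketch: the `split_zipcodes` helper is an assumption, and the sample list mixes values seen in the audit with a made-up short code ("7746").

```python
import re

valid_zip_re = re.compile(r'^\d{5}$')


def split_zipcodes(codes):
    """Return (valid, invalid) lists, where valid means exactly five digits."""
    valid, invalid = [], []
    for code in codes:
        (valid if valid_zip_re.match(code) else invalid).append(code)
    return valid, invalid


valid, invalid = split_zipcodes(["77545", "773867386", "Weslayan Street", "7746"])
```

In the notebook, the input could be the `_id` values of the documents returned by `zipcode_pipeline`. The check is about format only. It does not tell whether a five-digit code actually belongs to the Houston area.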


### Ideas for additional improvements

##### 1) Improvements for the OSM forum: After reviewing the amenities in The Woodlands area, I found that not all amenities are present in the OpenStreetMap data. It would be more helpful if the data were closer to complete. Only 10% of The Woodlands area amenities are reported. OSM could also standardize the tags that must be present. For example, the phone number attribute in the address tag could be made mandatory for all eateries (restaurants, cafes, bars, etc.).

##### 2) Project improvements: In the initial phases of the project, most of my time was spent parsing the huge OSM file many times. I believe iterative parsing is an efficient technique for saving resources. Lately I have seen many articles on improving efficiency in Python programs, for example the Psyco module, threading in a multiprocessor environment, and divide and conquer. I would like to explore these techniques further. However, implementing them requires extra knowledge of the operating system of the local machine, which would be additional overhead if the data analyst is not well versed in OS concepts. If system resources are critical and the file is very large, we may have to explore optimization techniques other than iterative parsing.

### Conclusion:

##### The main steps in data wrangling are covered in this project. As the data was entered by humans, it is bound to contain errors. I have identified a few inconsistencies and cleaned them, and I am fairly sure there are many other errors. The data inserted into MongoDB is not gold standard, as the process of auditing and cleaning was done only once. In practice it is done iteratively.

### References:
[1] Data Wrangling course- Udacity

[2] https://wiki.openstreetmap.org/wiki/Using_OpenStreetMap

[3] https://wiki.openstreetmap.org/wiki/Map_Features

[4] http://www.houstontx.gov/council/

[5] https://stackoverflow.com/questions/20469712/using-and-with-match-in-mongodb

[6] https://stackoverflow.com/questions/27272699/cant-get-allowdiskusetrue-to-work-with-pymongo

[7] Udacity discussion forum [Main reference]

[8] https://pypi.python.org/pypi/phonenumbers