# Udacity Data Analyst Nanodegree - P3

### Open Street Map - Data Wrangling

### Yasser Alnalhli

## Background:

   - The OpenStreetMap data for the city of Amsterdam, the Netherlands, is used.
    
   - The data was downloaded in XML format from [mapzen](https://mapzen.com) on Thursday, June 15, 2017.
   
   <img src="AmstbiggerOSM.png">
    
   - Only the old center of the city was selected. It is the most visited area of Amsterdam. "It is known for its traditional architecture, canals, shopping, and many coffeeshops. Dam Square is considered its ultimate centre, but just as interesting are the areas around Nieuwmarkt and Spui. The Red Light District is also a part of the Old Centre." ([for more information](http://wikitravel.org/en/Amsterdam/Old_Centre)).
   
   
   <img src="AmstOSM.png">
   
    
   - I spent some time in Amsterdam between 2012 and 2013 and liked the culture there. There are very big differences between Arabic and Dutch culture. I liked the Old Center the most: it is like a big open-air museum, with many things to see and discover. That is why I chose the Old Center for this project.
    
   - The Amsterdam OSM XML file was wrangled by exploring, auditing, and cleaning the dataset before further analysis with MongoDB.
    
    
    

## Problems Faced: 

 - Because of my computer's hardware and my poor internet connection, I decided to use a file only slightly above the suggested minimum size (>50MB). I had to use another internet connection and computer to download an XML file of the proper size, which took some time.
 
 - Choosing just a small part of the city means the findings do not reflect the city as a whole. My current work covers only the old center of Amsterdam, and the results might differ if I had chosen a larger area.
 
 - Deciding between MySQL and MongoDB also took some time. I had to look at both, as I did not know much about either.
 
 - The OSM XML file of Amsterdam is quite clean compared to those of other cities. I saw Chicago's data in the Udacity videos and checked the files for Riyadh, KSA, where I am from, and they are a little messy. Although Dutch is similar to English, it is mostly English that is used in the Amsterdam OSM XML file. This actually made it a challenge to decide what needed cleaning in the file, so I came up with another idea :)
 
 - One of the biggest cultural differences I faced when I was in Amsterdam concerned homosexuality. Amsterdam is a gay-friendly city, and there are many gay-friendly bars and other places, which I want to locate.
 
 - Almost everywhere in the world, a coffeeshop is a place to have a nice coffee; sometimes it is called a cafe. In Amsterdam, however, a coffeeshop is a drug shop, a place to legally smoke cannabis. This misunderstanding will be addressed in later steps.
 
 

## Investigating the downloaded XML file:

Loading all the libraries used in the Python code:


In [44]:
import xml.etree.ElementTree as ET
from collections import defaultdict
import pprint
import re
import codecs
import json
from pymongo import MongoClient
import os


In [45]:
#The size of the xml file
osm_file = "Amsterdam61M.osm"
print (os.path.getsize(osm_file))


63300649


As shown above, this is quite a big file to test the code with. It is better to generate a sample file of less than 2 MB from the original file and use it during code testing.

In [46]:
#this is from file creatSample.py


osm_file = "Amsterdam61M.osm"
SAMPLE_FILE = "Amst_Sample.osm"


def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag
    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


with open(SAMPLE_FILE, 'wb') as output:
    # The file is opened in binary mode, so bytes literals are written
    output.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write(b'<osm>\n  ')

    # Write every 50th top level element
    for i, element in enumerate(get_element(osm_file)):
        if i % 50 == 0:
            output.write(ET.tostring(element, encoding='utf-8'))

    output.write(b'</osm>')

The output of the above program is a sample of the original file of only about 1.3 MB. The exact size is shown below:

In [47]:
print (os.path.getsize('Amst_Sample.osm'))

1275367


Using iterative parsing to process the XML file and find out which tags it contains and how many of each there are:

In [48]:
# From file count_all.py


osm_file = "Amsterdam61M.osm"


# Define a function to count all types of tags in the xml file
def count_tags(filename):
    tags_dict = defaultdict(int)
    for event, elem in ET.iterparse(filename, events=("start",)):
        tags_dict[elem.tag] += 1
    return tags_dict



if __name__ == '__main__':

    tags = count_tags(osm_file)
    print("counts of primary tags:")
    pprint.pprint(tags)
    

counts of primary tags:
defaultdict(<type 'int'>, {'node': 218600, 'nd': 228655, 'bounds': 1, 'member': 7678, 'tag': 515651, 'osm': 1, 'way': 24823, 'relation': 654})
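The same counting idea can be tried on a tiny document without touching the 61 MB file. The sketch below is only an illustration: `SAMPLE_XML` and `count_tags_counter` are hypothetical names that are not part of the project files, and it uses `collections.Counter` with `end` events instead of the `defaultdict` and `start` events used above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature OSM document, used only for illustration.
SAMPLE_XML = b"""<osm>
  <node id="1" lat="52.37" lon="4.89">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="52.38" lon="4.90"/>
  <way id="3">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""


def count_tags_counter(source):
    """Count element names, clearing each element once it has been counted."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's attributes and children to limit memory use.
        elem.clear()
    return counts


tag_counts = count_tags_counter(io.BytesIO(SAMPLE_XML))
print(tag_counts.most_common())
```

Because `iterparse` also accepts a file name, the same function could be pointed at `Amsterdam61M.osm`.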


Going deeper into the XML file and looking only at the keys of the secondary tags:

In [49]:
#From the file count_secondary_tag.py

def count_secondary_tag(filename):
    tag_keys = {}
    # count how often each key occurs in the tag elements
    for event, elem in ET.iterparse(filename):
        if elem.tag == 'tag' and 'k' in elem.attrib:
            key = elem.get('k')
            tag_keys[key] = tag_keys.get(key, 0) + 1
    # sort the keys by count, in descending order
    import operator
    sorted_keys = sorted(tag_keys.items(), key=operator.itemgetter(1))
    sorted_keys.reverse()
    return sorted_keys



if __name__ == '__main__':
    
    
    # audit/count secondary tag
    secondary_tag = count_secondary_tag(osm_file)
    print("counts of secondary tags:")
    print (len(secondary_tag))
    

counts of secondary tags:
771


Next, the tag keys are checked for problems. This helps to understand the data a bit more and to check that the keys are valid for MongoDB:

In [50]:
# This is  from file Tag_Type.py

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        k_value = element.attrib['k']
        if lower.search(k_value) is not None:
            keys['lower'] += 1
        elif lower_colon.search(k_value) is not None:
            keys['lower_colon'] += 1
        elif problemchars.search(k_value) is not None:
            keys["problemchars"] += 1
        else:
            keys['other'] += 1

    return keys



def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    return keys

keys = process_map('Amsterdam61M.osm')
pprint.pprint(keys)

{'lower': 159586, 'lower_colon': 355377, 'other': 688, 'problemchars': 0}
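To make the four categories more concrete, here is a small sketch that applies the same three regular expressions to a handful of keys. The helper `classify_key` and the example keys are assumptions made for illustration; they are not taken from the Amsterdam file.

```python
import re

# Same patterns as in Tag_Type.py above.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(key):
    """Return the category name that key_type() would increment for this key."""
    if lower.search(key) is not None:
        return "lower"
    if lower_colon.search(key) is not None:
        return "lower_colon"
    if problemchars.search(key) is not None:
        return "problemchars"
    return "other"


# Hypothetical example keys, chosen to hit every branch.
examples = ["amenity", "addr:street", "opening hours", "addr:street:name", "FIXME"]
categories = {key: classify_key(key) for key in examples}
for key in examples:
    print(key, "->", categories[key])
```

Keys with more than one colon, with digits, or with upper-case letters match none of the first three patterns, which is the kind of key that ends up in the `other` bucket.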


This time, only the contributing users are investigated more deeply:

In [51]:
# this code is from the file Users.py

def get_user(element):
    return element.get("user")

#find and count
def count_unique(filename):
    users = set()
    print(users)
    for _, element in ET.iterparse(filename):
        if "user" in element.attrib:
            users.add(get_user(element))
    print(users)
    print(len(users))

 
count_unique(osm_file)


set([])
set(['Andyngo', 'beetletun', 'monena41', 'alv', 'OSMF Redaction Account', 'eggie', u'J\xe9r\xe9my Bachy', 'RichRico', 'alina udodova', 'xybot', 'Rejo Zenger', 'wimvantklooster', 'Ale Bels', 'outofofficeagain', 'Friendly15', 'pierlux', 'Manu1400', 'Linda_Esperanto', 'Martin2009', 'Spencer Peck', 'BCNorwich', 'stanton', 'mabapla', 'Joost van Os', 'sebastic_BAG', 'hybridOL', u'Beselch Gonzalez Pe\xf1a', 'bigbug21', 'eugenebata', 'padvinder', 'Spyd7r', 'MA-PH', 'pilotrobert', 'AppAmapper', 'wika-osm', 'Aliossandro', 'rob72', 'W-PH', 'Cieper', 'Nikita Mashkin', 'Brunovanm', 'Vicky Coelho', 'Batum Sanctum', 'beweta', 'Hanno Lans', 'wolfv', '42429', 'LeTopographeFou', 'usaf8', 'Chris Parker', 'morsi', 'GayLinc', 'Denis Kamenshchikov', 'Hendrikklaas', 'mboeringa', 'ImreSamu', 'thebonnetplume', 'A10dep', 'saschahaberkorn', 'woodpeck_repair', 'Blauwdruk', 'optimus_ed', 'Amaroussi', 'Dianne1990', 'randomjunk', 'Lancelot van Duin', 'KnotNuts', 'denilsonsa', 'stvn', '_sev', 'Thibaut75011', 

## Looking deeper into the XML file:

As I mentioned above, one main problem I faced when going out in Amsterdam with my wife was that we sometimes ended up in a gay bar or pub. That is fine, but it took me a long time to distinguish them from other cafes. It is just a cultural difference, especially for my Muslim wife, who wears a hijab. I have many funny stories, but I will keep them to myself.

So, I decided to first count how many "gay" places there are in that particular area, and then find their names and locations. Here we go:

In [52]:
"""
To find how many gay-friendly places there are in Amsterdam's old center,
I look into the XML file and use this function to count them first.

This is from the countGay.py file.
"""



osm_file = "Amsterdam61M.osm"


def find_gay(osmfile):
    count = 0
    # Parse the file. Note: on a "start" event the children of an element
    # may not have been read yet, so "end" events would be the safer choice.
    for event, elem in ET.iterparse(osmfile, events=("start",)):
        # check both node and way elements
        if elem.tag == "node" or elem.tag == "way":
            # look at every tag of the element
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "gay" and tag.attrib['v'] == "yes":
                    count += 1
    print ("There are " + str(count) + " gay friendly place at the old center of Amsterdam")
					
					

find_gay(osm_file)



There are 38 gay friendly place at the old center of Amsterdam
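As a cross-check of this kind of count, the sketch below does the same search on `end` events, where every child `tag` of a node or way is guaranteed to have been parsed. The sample document and the function name `count_tagged` are hypothetical and only serve as an illustration; the function counts matching elements rather than matching tags, and the result on the real file is not shown here because it was not run.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature OSM document, used only for illustration.
SAMPLE_XML = b"""<osm>
  <node id="1" lat="52.3746" lon="4.8970">
    <tag k="name" v="Example Bar"/>
    <tag k="gay" v="yes"/>
  </node>
  <node id="2" lat="52.3700" lon="4.8900">
    <tag k="name" v="Example Cafe"/>
  </node>
  <way id="3">
    <nd ref="1"/>
    <tag k="gay" v="yes"/>
  </way>
</osm>"""


def count_tagged(source, key, value):
    """Count nodes and ways that carry a tag with the given key and value."""
    count = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way"):
            if any(t.get("k") == key and t.get("v") == value
                   for t in elem.iter("tag")):
                count += 1
            # The element is complete at this point, so it can be cleared.
            elem.clear()
    return count


gay_friendly = count_tagged(io.BytesIO(SAMPLE_XML), "gay", "yes")
print(gay_friendly)
```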


In [53]:
"""Now I will check the names of those places and their locations.
For this, I used FindGay.py"""



osm_file = "Amsterdam61M.osm"


def find_gay(osmfile):
    for event, elem in ET.iterparse(osmfile, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            found_gay = False
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "gay" and tag.attrib['v'] == "yes":
                    found_gay = True
                    break

            if found_gay:

                # we have found an element with the gay tag, so print its name
                for tag in elem.iter("tag"):
                    if tag.attrib['k'] == "name":
                        print(tag.attrib['v'])
                        break

                # also print the lat and lon values of the element;
                # ways do not have lat/lon attributes, only nodes do
                if elem.tag == 'node':
                    print("lat: " + elem.attrib['lat'] + ", lon: " + elem.attrib['lon'])
				

find_gay(osm_file)

IHLIA Homodok
lat: 52.3762463, lon: 4.9081983
De Barderij
lat: 52.3758329, lon: 4.9004544
Mankind
lat: 52.3607475, lon: 4.888122
Café Rouge
lat: 52.3669068, lon: 4.8961508
Pink Point
lat: 52.3741824, lon: 4.8844544
Eagle
lat: 52.3746491, lon: 4.897024
Vrolijk
lat: 52.3722426, lon: 4.8898085
Getto
lat: 52.375358, lon: 4.898286
Vivelavie
lat: 52.3663271, lon: 4.8980529
Saarein
lat: 52.370287, lon: 4.8802861
Amstel 54
lat: 52.3668687, lon: 4.895663
HotSpot
lat: 52.366956, lon: 4.8968557
Fame
lat: 52.3668819, lon: 4.8954738
Gollem
lat: 52.3660948, lon: 4.8999054
Montmartre
lat: 52.366561, lon: 4.895936
Music Box
lat: 52.3666392, lon: 4.8983205
Entre Nous
lat: 52.3665536, lon: 4.8957327
Reve Museum
lat: 52.3759303, lon: 4.9084423
't Mandje
lat: 52.374815, lon: 4.900983
Motor Sportclub Amsterdam
lat: 52.3763379, lon: 4.9022053
Sultana
lat: 52.3666507, lon: 4.8899875
Downtown
lat: 52.3665466, lon: 4.8903128
Ludwig
lat: 52.3664942, lon: 4.8906805
Soho
lat: 52.3662911, lon: 4.8909603
Taboo
lat:

Another problem that many tourists face is the "coffeeshops" in Amsterdam. The Dutch use the word coffeeshop for a drug shop where you can legally smoke cannabis. I will audit this to make things clearer.

In [54]:
"""This code is from the file Audit_coffeeshops.py


# refer to the link for the methods used
# https://docs.python.org/2/library/xml.etree.elementtree.html"""

def update_save(oldfile, newfile):
    tree = ET.parse(oldfile)
    root = tree.getroot()


    for tag in root.iter('tag'):
        # here we have all the tag elements
        # check for key == shop and a value containing "drug"
        if tag.attrib['k'] == 'shop' and 'drug' in tag.attrib['v']:
            # update the value to drug store
            tag.set('v', 'drug store')
        # if value == coffee_shop and key == cuisine
        elif tag.attrib['v'] == 'coffee_shop' and tag.attrib['k'] == 'cuisine':
            # update the value to drug store
            tag.set('v', 'drug store')
    
    tree.write(newfile,encoding="UTF-8", xml_declaration=True, default_namespace=None, method="xml")

update_save('Amsterdam61M.osm', 'updatedAmst.osm')

Another problem was found: there is another classification (soft_drugs) in both cuisine and shop. It also needs to be changed, so that all such shops and cuisines are unified under the new suggested name (drug store).

In [None]:
# This is from the file Soft_Drug_Audit.py


def update_save(oldfile, newfile):
    tree = ET.parse(oldfile)
    root = tree.getroot()


    for tag in root.iter('tag'):
        #here we have all the tag elements
        # check for the value if it contains soft_drugs
        if tag.attrib['k'] == 'shop' and tag.attrib['v'] == "soft_drugs":
            # update the value to drug store
            tag.set('v', 'drug store')
        # if the v == soft_drugs and k = cuisine
        elif tag.attrib['v'] == 'soft_drugs' and tag.attrib['k'] == 'cuisine':
            # update the value to drug store
            tag.set('v', 'drug store')
    
    tree.write(newfile,encoding="UTF-8", xml_declaration=True, default_namespace=None, method="xml")

update_save('updatedAmst.osm', 'updatedAmst2.osm')

The updated file is cleaned and organized, and it now contains information that is reliable at least for me and for tourists who, unlike the Dutch, do not use the word coffeeshop this way. It will be used in the further investigation with MongoDB.
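The two cleaning passes above apply the same replacement, so they could in principle be combined into one. The sketch below is one possible way to do that on an in-memory example; `needs_relabel`, `relabel`, and the sample document are hypothetical names and data, not part of the project files. Note that the substring test `'drug' in value` already covers `soft_drugs` for the `shop` key, and that it would also match any other shop value containing "drug".

```python
import xml.etree.ElementTree as ET

NEW_VALUE = "drug store"


def needs_relabel(key, value):
    """Return True for the key/value combinations changed in the two passes above."""
    if key == "shop":
        # Substring test from Audit_coffeeshops.py; it also matches "soft_drugs".
        return "drug" in value
    if key == "cuisine":
        return value in ("coffee_shop", "soft_drugs")
    return False


def relabel(root):
    """Rewrite matching tag values in place and return how many were changed."""
    changed = 0
    for tag in root.iter("tag"):
        key, value = tag.get("k", ""), tag.get("v", "")
        if value != NEW_VALUE and needs_relabel(key, value):
            tag.set("v", NEW_VALUE)
            changed += 1
    return changed


# Hypothetical miniature OSM document, used only for illustration.
SAMPLE_XML = """<osm>
  <node id="1"><tag k="shop" v="soft_drugs"/></node>
  <node id="2"><tag k="cuisine" v="coffee_shop"/></node>
  <node id="3"><tag k="cuisine" v="soft_drugs"/></node>
  <node id="4"><tag k="shop" v="bakery"/></node>
</osm>"""

root = ET.fromstring(SAMPLE_XML)
changed = relabel(root)
values = [t.get("v") for t in root.iter("tag")]
print(changed, values)
```

Running `relabel` a second time on the same tree changes nothing, because values that are already `drug store` are skipped.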

## OSM data exploration using MongoDB

First of all, the OSM XML file is converted to JSON format and then inserted into MongoDB. I used the code from (http://napitupulu-jon.appspot.com/posts/wrangling-openstreetmap.html).

In [56]:
""" This code is from the file insert_osm_data.py

# source : http://napitupulu-jon.appspot.com/posts/wrangling-openstreetmap.html """

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
addresschars = re.compile(r'addr:(\w+)')
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
osm_file = 'updatedAmst.osm'

def shape_element(element):
    #node = defaultdict(set)
    node = {}
    if element.tag == "node" or element.tag == "way" :
        # create the dictionary based on exactly the values in the element attributes
        node = {'created':{}, 'type':element.tag}
        for k in element.attrib:
            try:
                v = element.attrib[k]
            except KeyError:
                continue
            if k == 'lat' or k == 'lon':
                continue
            if k in CREATED:
                node['created'][k] = v
            else:
                node[k] = v
        try:
            node['pos']=[float(element.attrib['lat']),float(element.attrib['lon'])]
        except KeyError:
            pass
        
        if 'address' not in node.keys():
            node['address'] = {}
        #Iterate the content of the tag
        for stag in element.iter('tag'):
            # read the key and value of this tag

            k = stag.attrib['k']
            v = stag.attrib['v']
            #Checking if indeed prefix with 'addr' and no ':' afterwards
            if k.startswith('addr:'):
                if len(k.split(':')) == 2:
                    content = addresschars.search(k)
                    if content:
                        node['address'][content.group(1)] = v
            else:
                node[k]=v
        if not node['address']:
            node.pop('address',None)
        #Special case when the tag == way: collect all the nd references
        if element.tag == "way":
            node['node_refs'] = []
            for nd in element.iter('nd'):
                node['node_refs'].append(nd.attrib['ref'])

        return node
    else:
        return None


def process_map(file_in, pretty = False):
    """
    Process the osm file to json file to be 
    prepared for input file to mongodb
    """
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

data = process_map(osm_file)
pprint.pprint(data[10])


{'created': {'changeset': '47179797',
             'timestamp': '2017-03-26T16:54:07Z',
             'uid': '2154858',
             'user': 'mboeringa',
             'version': '6'},
 'id': '26585151',
 'pos': [52.3773401, 4.9121287],
 'type': 'node'}


The above is just one sample document from the new data. It is in JSON format, which can easily be inserted into MongoDB:
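The sample document above has no tags, so it does not show how `addr:` keys are handled. The following sketch isolates that part of the shaping logic in a simplified form; `nest_address` and the example node are hypothetical and written only for illustration. It splits the key on the colon instead of using the `addresschars` pattern.

```python
import json
import xml.etree.ElementTree as ET


def nest_address(element):
    """Collect 'addr:<field>' tags under an 'address' sub-document.

    Simplified from shape_element(): keys with a second colon
    (for example 'addr:street:type') are skipped, other keys are copied as-is.
    """
    doc = {}
    address = {}
    for tag in element.iter("tag"):
        key, value = tag.get("k"), tag.get("v")
        if key.startswith("addr:"):
            parts = key.split(":")
            if len(parts) == 2:
                address[parts[1]] = value
        else:
            doc[key] = value
    if address:
        doc["address"] = address
    return doc


# Hypothetical node, used only for illustration.
node = ET.fromstring(
    '<node id="1" lat="52.37" lon="4.89">'
    '<tag k="name" v="Example Cafe"/>'
    '<tag k="addr:street" v="Damrak"/>'
    '<tag k="addr:housenumber" v="1"/>'
    '<tag k="addr:street:type" v="ignored"/>'
    '</node>'
)
shaped = nest_address(node)
print(json.dumps(shaped, indent=2, sort_keys=True))
```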

In [57]:
# This code is from the file insert_osm_data.py

db_name = 'AmsOSM'
# Connect to Mongo DB
client = MongoClient('localhost:27017')  
db = client[db_name]  
c = db.AmsMAP
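# insert() is deprecated in PyMongo 3; insert_many() is its replacement for a list of documents
c.insert_many(data)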
pprint.pprint(c)




Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'AmsOSM'), u'AmsMAP')


Now, some analysis of the data loaded into MongoDB:

In [58]:
# This code is from the file insert_osm_data.py


# the number of data
print('The number of data',db.AmsMAP.count()) 

# the Numbers of way
print ('The number of ways',db.AmsMAP.find({'type':'way'}).count())  

# the number of nodes
print ('The number of nodes',db.AmsMAP.find({'type':'node'}).count()) 

# how many bicycle parkings in Amsterdam old center
print ('The number of bicycle parkings',db.AmsMAP.find({'amenity':'bicycle_parking'}).count())


# how many tourism attractions in Amsterdam old center 
print ('The number of tourism attractions',db.AmsMAP.find({'tourism':'attraction'}).count())


('The number of data', 2341746)
('The number of ways', 238796)
('The number of nodes', 2102932)
('The number of bicycle parkings', 1006)
('The number of tourism attractions', 597)


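Note that the totals above are about 9.6 times the 218600 nodes and 24823 ways counted in the XML file earlier, which suggests that the data was inserted into the collection more than once across notebook runs. The counts in this section are therefore probably inflated by a similar factor.

Looking more deeply into the data with MongoDB: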

In [59]:
# This code is from the file insert_osm_data.py


cuisine = db.AmsMAP.aggregate([
        {"$match" : {"cuisine" : {"$exists" : 1}}},
        {"$group" : {"_id" : "$cuisine",
                     "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}},
        {"$limit" : 5}
    ])
print ('The top 5 cuisine:') 
pprint.pprint([doc for doc in cuisine])

The top 5 cuisine:
[{u'_id': u'italian', u'count': 549},
 {u'_id': u'burger', u'count': 296},
 {u'_id': u'thai', u'count': 229},
 {u'_id': u'regional', u'count': 207},
 {u'_id': u'indian', u'count': 202}]
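For readers less familiar with the aggregation framework, the stages of the pipeline above can be mirrored in plain Python. This is only a rough sketch of what `$match`, `$group`, `$sort`, and `$limit` compute; the helper `top_values` and the example documents are made up for illustration and are not drawn from the database.

```python
from collections import Counter

# Hypothetical documents, shaped like the ones stored in the AmsMAP collection.
docs = [
    {"amenity": "restaurant", "cuisine": "italian"},
    {"amenity": "restaurant", "cuisine": "thai"},
    {"amenity": "restaurant", "cuisine": "italian"},
    {"amenity": "bench"},
    {"amenity": "fast_food", "cuisine": "burger"},
    {"amenity": "restaurant", "cuisine": "italian"},
    {"amenity": "fast_food", "cuisine": "burger"},
]


def top_values(documents, field, limit):
    """Mimic $match ($exists) + $group ($sum: 1) + $sort (descending) + $limit."""
    # $match and $group: keep documents that have the field, count each value.
    counts = Counter(d[field] for d in documents if field in d)
    # $sort and $limit: highest counts first, truncated to `limit` entries.
    return [{"_id": value, "count": n} for value, n in counts.most_common(limit)]


top_cuisines = top_values(docs, "cuisine", 2)
print(top_cuisines)
```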


Next, the five most common amenities (the first group, which collects the documents without an amenity, is skipped):

In [60]:
# This code is from the file insert_osm_data.py

amenity = db.AmsMAP.aggregate([ 
                { "$group" : { "_id" : "$amenity","count": {"$sum": 1 }}},
                { "$sort" : { "count" : -1 }},
                { "$skip" : 1 },
                { "$limit" : 5 }
               ])
print ('top 5 common amenities:') 
pprint.pprint([doc for doc in amenity])

top 5 common amenities:
[{u'_id': u'restaurant', u'count': 5144},
 {u'_id': u'cafe', u'count': 2368},
 {u'_id': u'pub', u'count': 1916},
 {u'_id': u'fast_food', u'count': 1759},
 {u'_id': u'bench', u'count': 1026}]
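The `$skip` stage in the amenity query relies on the group of documents without an `amenity` field being the largest one. A possibly more explicit variant, sketched below, filters on the field with `$match` instead, in the same way as the cuisine query does. The variable name `amenity_pipeline` is hypothetical, and the pipeline is only defined here, not run against the database.

```python
# Hypothetical alternative to the "$skip" stage: filter explicitly on the field.
amenity_pipeline = [
    {"$match": {"amenity": {"$exists": True}}},
    {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5},
]

# Usage, assuming the `db` handle from the cells above:
# amenity = db.AmsMAP.aggregate(amenity_pipeline)

# List the stage operators in order as a quick sanity check.
stage_names = [list(stage)[0] for stage in amenity_pipeline]
print(stage_names)
```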


Next, the most popular shops:

In [61]:
## This code is from the file insert_osm_data.py

shops = db.AmsMAP.aggregate([
        {"$match" : {"shop" : {"$exists" : 1}}},
        {"$group" : {"_id" : "$shop",
                     "count" : {"$sum" : 1}}},
        {"$sort" : {"count" : -1}},
        {"$limit" : 10}
    ])
print ('The top 10 common shops:') 
pprint.pprint([doc for doc in shops])

The top 10 common shops:
[{u'_id': u'clothes', u'count': 2527},
 {u'_id': u'gift', u'count': 1074},
 {u'_id': u'shoes', u'count': 630},
 {u'_id': u'jewelry', u'count': 458},
 {u'_id': u'supermarket', u'count': 431},
 {u'_id': u'convenience', u'count': 373},
 {u'_id': u'bakery', u'count': 324},
 {u'_id': u'books', u'count': 306},
 {u'_id': u'hairdresser', u'count': 279},
 {u'_id': u'alcohol', u'count': 265}]


When I was in Amsterdam, I also found it interesting that the University of Amsterdam has many locations. I will check this dataset and see what I get:


In [62]:
## This code is from the file insert_osm_data.py


#University
# the "amenity" key may appear only once in the $match document
Uni = db.AmsMAP.aggregate([{"$match":{"amenity": "university", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}])
print ('The Universities in Amsterdam:') 
pprint.pprint([doc for doc in Uni])

The Universities in Amsterdam:
[{u'_id': u'Oudemanhuispoort (UvA)', u'count': 40},
 {u'_id': u'Amstelcampus', u'count': 9},
 {u'_id': u'UvA Binnengasthuisterrein', u'count': 9},
 {u'_id': u'UvA Bungehuis', u'count': 9},
 {u'_id': u'UvA PC Hoofthuis', u'count': 9},
 {u'_id': u'Service & Informatiecentrum UvA', u'count': 9},
 {u'_id': u'UvA BG5', u'count': 9},
 {u'_id': u'P', u'count': 9},
 {u'_id': u'Universiteit van Amsterdam', u'count': 9},
 {u'_id': u'Aula UvA', u'count': 9},
 {u'_id': u'Ruimtelijke Wetenschappen (FMG)', u'count': 9},
 {u'_id': u'UvA', u'count': 9},
 {u'_id': u'UvA Faculteit der Maatschappij-en Gedragswetenschappen',
  u'count': 9},
 {u'_id': u'Spui25', u'count': 9},
 {u'_id': u'UvA Faculteit der Economische Wetenschappen en Econometrie',
  u'count': 9},
 {u'_id': u'Kohnstammhuis', u'count': 9},
 {u'_id': u'Theo Thijssenhuis', u'count': 9},
 {u'_id': u'Roeterseiland (UvA)', u'count': 9}]


## What else can be done to the dataset?

 - The Amsterdam OSM dataset is very nice and tidy. Many things can be studied and analysed.
 
 - Amsterdam is bicycle friendly and is one of the most popular cycling cities in the world. It would be interesting to check the ways that contain bicycle paths.
 
 - It would also be nice to count the places that have wheelchair access and those that do not. The Amsterdam city council, or Gemeente as the Dutch call it, could use such information to improve the quality of life for those who need such a service.
 
 - Bars and pubs could also be checked to find the most popular ones.
 
 - Amsterdam also has many museums. I noticed that the OSM data contains many of them, so they might be a good choice to check.
 
 - Amsterdam's old center has a lot of canals, which are very nicely built. The data should contain at least the name of each canal, or Gracht as the Dutch call it. The canals are like streets in Amsterdam, which Amsterdammers and tourists enjoy using to move around the city, for sightseeing and for having fun.
 
 - Amsterdam is also famous for its open-air free markets. I did not look deeply for them in the current dataset, but it would be great if they were included and could be studied and analysed.
 
 - [mapzen](https://mapzen.com), where I downloaded the dataset, also provides separate GeoJSON files (datasets grouped into individual layers by OpenStreetMap tags (IMPOSM)), and there is an importer called [Imposm](https://imposm.org/docs/imposm/latest/) which can be used. We could insert those files directly into MongoDB and do the analysis.
 
 - I downloaded all those files, and you can have a look at them in a separate file included with the submission.
 
 - For anyone interested, there are other ways to manipulate OSM data. I will mention just a few of them here:

     - [Overpass API](http://wiki.openstreetmap.org/wiki/Overpass_API)
     - [Node-Mongosm](https://github.com/sammerry/node-mongosm)
     - [overpy 0.4](https://pypi.python.org/pypi/overpy/)
     
   and many others. 
   
   (I will add more ideas as suggested by the first reviewer.) 
   
  - OSM data could be fixed or improved by making it interactive for users. A mobile application could be a solution for this: through the application, users could edit the data directly and add new information, descriptions, or ratings for a particular place. This would improve and enrich the dataset very quickly. 
  
  - However, one main disadvantage of this idea is the quality of the added information: how reliable and how useful it is.
  
  - Another idea is to link it with Google Maps. However, what I do not like about Google Maps is that it is entirely owned by Google, which might affect OSM as an open-source project, and the data could then be used for advertising.
  
  - We could link OSM to the Wikipedia XML data for more information and descriptions. MongoDB would be useful here.
  
  - Social media plays a significant role in internet usage. Linking OSM to social media, especially Twitter, might help in different ways: [example1](https://www.theatlantic.com/technology/archive/2013/11/how-online-mapmakers-are-helping-the-red-cross-save-lives-in-the-philippines/281366/) and [example2](https://www.newscientist.com/article/dn24565-social-media-helps-aid-efforts-after-typhoon-haiyan#.U-QaA2MmUro).
  
  - The above are very nice examples of using OSM. However, we should encourage users to do such great work by offering prizes and competitions for the best contributions. 
  
  - The current dataset contains many secondary tags. These could be summarised and grouped. However, we might lose some details that are important for some users who want to answer a particular question.
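As a starting point for the wheelchair idea in the list above, the values of a single tag key can be tallied directly from the XML. The sketch below runs on a made-up sample document; `count_tag_values` is a hypothetical helper, and no result for the real Amsterdam file is claimed here.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature OSM document, used only for illustration.
SAMPLE_XML = b"""<osm>
  <node id="1" lat="52.37" lon="4.89">
    <tag k="amenity" v="cafe"/>
    <tag k="wheelchair" v="yes"/>
  </node>
  <node id="2" lat="52.38" lon="4.90">
    <tag k="amenity" v="pub"/>
    <tag k="wheelchair" v="no"/>
  </node>
  <node id="3" lat="52.36" lon="4.88">
    <tag k="tourism" v="museum"/>
    <tag k="wheelchair" v="yes"/>
  </node>
  <node id="4" lat="52.35" lon="4.87">
    <tag k="amenity" v="bench"/>
  </node>
</osm>"""


def count_tag_values(source, key):
    """Tally the values of all <tag> elements whose k attribute equals key."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tag" and elem.get("k") == key:
            counts[elem.get("v")] += 1
    return counts


wheelchair_counts = count_tag_values(io.BytesIO(SAMPLE_XML), "wheelchair")
print(dict(wheelchair_counts))
```

The same helper could be reused for other keys from the list, for example to see which values a `bicycle` key takes.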
  
        

## Conclusion:

This kind of data format has a lot of advantages. It is clear from the analysis above that much valuable information can be extracted. The cleaning process was also informative and made it easy to get at a particular kind of information. Moreover, auditing is another way to handle such information and to change many values easily. Finally, many other ideas can be explored in this kind of dataset, and further analysis is needed to cover them and to address some issues in order to improve the quality of the work. 

## References:

 - https://docs.python.org/2/library/xml.etree.elementtree.html
 
 - Udacity Data Wrangling with MongoDB - Exercises
 http://fch808.github.io/Data%20Wrangling%20with%20MongoDB%20-%20Exercises.html
 
 - http://napitupulu-jon.appspot.com/posts/wrangling-openstreetmap.html