# OpenStreetMap Data - Part of Hyderabad

The file 'Hyderabad_part' was downloaded from https://mapzen.com/data/metro-extracts/. I selected a part of Hyderabad to audit and clean the data before importing it into MongoDB.

In [10]:
import xml.etree.ElementTree as ET

node = 0
way = 0
relation = 0
tags = 0
for even,elem in ET.iterparse('Hyderabad_part.osm'): 
    if elem.tag == 'node':
        node += 1
        tag_list = elem.findall('tag')
        tags += len(tag_list)
    if elem.tag == 'way':
        way += 1
    if elem.tag == 'relation':
        relation += 1
print("no of nodes ", node, "\n", "no of 'tag's within nodes ", tags,  "\n", "no of ways ", way, "\n", "no of relations ", relation)

no of nodes  397179 
 no of 'tag's within nodes  5174 
 no of ways  93328 
 no of relations  626


The statistics above show how many nodes, 'tag' elements within nodes, ways and relations are present in our sample data of Hyderabad.
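The same counts can also be gathered with a `collections.Counter`. The following is only a minimal sketch, not the code used for the numbers above: `SAMPLE_OSM` and `count_elements` are hypothetical names, and the tiny inline document stands in for 'Hyderabad_part.osm'. Calling `elem.clear()` once a top-level element has been counted releases its children, which should lower memory use on a large extract.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for the real extract (assumption: same element layout).
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="17.40" lon="78.40" uid="10"><tag k="amenity" v="cafe"/></node>
  <node id="2" lat="17.41" lon="78.41" uid="11"/>
  <way id="3" uid="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
  <relation id="4" uid="12"><member type="way" ref="3" role=""/></relation>
</osm>"""


def count_elements(source):
    """Count nodes, ways, relations and the 'tag' children of nodes in one pass."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way", "relation"):
            counts[elem.tag] += 1
            if elem.tag == "node":
                counts["node_tags"] += len(elem.findall("tag"))
            # Clear only top-level elements, after their children were inspected.
            elem.clear()
    return counts


counts = count_elements(io.BytesIO(SAMPLE_OSM))
print(dict(counts))
```

On the real file, `count_elements('Hyderabad_part.osm')` would take the place of the `io.BytesIO` call, since `ET.iterparse` accepts either a file name or a file object.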

In [13]:
# let us take a look at number of unique users

unique_id = set()
for even,elem in ET.iterparse('Hyderabad_part.osm'):
    if elem.tag == 'node' or elem.tag == 'way' or elem.tag == 'relation':
        unique_id.add(elem.attrib['uid'])
len(unique_id)

397

A total of 397 users have contributed to this part of the map.

## Familiarising with Data Set

Let's go through the 'tag' children of each 'node', 'way' and 'relation' and collect the unique values of the 'k' attribute, to get a glimpse of this data.

In [14]:
# get unique values of the 'k' attribute across all 'node', 'way' and 'relation' elements


k_set = set() # empty set to collect all the unique values of 'k'
for even,elem in ET.iterparse('Hyderabad_part.osm'):
    if elem.tag == 'node' or elem.tag == 'way' or elem.tag == 'relation':
        tag_list = elem.findall('tag')
        for tag in tag_list:
            k_set.add(tag.attrib['k'])
print(k_set)
            
  

{'oneway', 'surface', 'removed:amenity', 'id', 'atm', 'agricultural', 'AND_a_nosr_r', 'health_facility:type', 'amenity', 'population', 'fuel:cng', 'restrictions', 'gauge', 'fuel:octane_91', 'direction', 'branch', 'contact:website', 'fixme', 'craft', 'denomination', 'height', 'fuel:diesel', 'platforms', 'parking:lane:both', 'type', 'wikipedia', 'wheelchair', 'internet_access:ssid', 'ref', 'operator', 'local_ref', 'name:ru', 'horse', 'shop', 'name', 'full_id', 'goods', 'old_name', 'maxspeed', 'leisure', 'lit', 'internet_access', 'cycleway', 'material', 'usage', 'name:ja', 'phone', 'ref_old', 'power', 'alt_name', 'traffic_calming', 'junction', 'religion', 'smoking', 'lanes', 'opening_hours', 'fee', 'historic', 'level', 'phone_1', 'country', 'train', 'cables', 'motor_vehicle', 'name:ta', 'created_by', 'layer', 'name:mr', 'supervised', 'barrier', 'end_date', 'sport', 'Temple', 'construction', 'ref:old', 'waterway', 'diet:vegetarian', 'disused', 'source', 'start_date', 'building', 'voltage',

Looking at this set of unique 'k' values, some of them, such as 'mahesh' (the name of a person) and 'fixme', look problematic to me. Let's go through those tags, see the values of their 'v' attributes, and decide what we can do about them.
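Counting how often each key occurs, rather than only collecting the distinct keys, can make one-off keys such as 'mahesh' stand out without reading the whole set. This is a hedged sketch on made-up sample data: `SAMPLE_OSM`, `count_keys` and the 'Sample ...' values are hypothetical, and a key that occurs once is not necessarily wrong, only worth a look.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample (assumption: same structure as the real extract).
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="17.43" lon="78.44">
    <tag k="mahesh" v="Mahesh Kumar"/>
    <tag k="addr:city" v="Hyderabad"/>
  </node>
  <node id="2" lat="17.42" lon="78.45">
    <tag k="name" v="Sample Cafe"/>
    <tag k="addr:city" v="Hyderabad"/>
  </node>
  <way id="3"><tag k="fixme" v="check connection"/><tag k="name" v="Sample Road"/></way>
</osm>"""


def count_keys(source):
    """Return a Counter of 'k' values over all node, way and relation tags."""
    key_counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("node", "way", "relation"):
            key_counts.update(tag.attrib["k"] for tag in elem.findall("tag"))
            elem.clear()
    return key_counts


key_counts = count_keys(io.BytesIO(SAMPLE_OSM))
# Keys that appear exactly once are candidates for a manual check.
rare_keys = sorted(key for key, n in key_counts.items() if n == 1)
print(rare_keys)
```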

## Problems Encountered in the Data Set

In [15]:
# go through each problematic tag and print the corresponding value of its 'v' attribute

for even, elem in ET.iterparse('Hyderabad_part.osm'):
    if elem.tag == 'node' or elem.tag == 'way' or elem.tag == 'relation':
        tag_list = elem.findall('tag')
        for tag in tag_list:
            if tag.attrib['k'] == 'mahesh' or tag.attrib['k'] == 'fixme' :
                print( tag.attrib['k'], " ", tag.attrib['v'])
                

fixme   check connection
fixme   check connection
fixme   name
fixme   I don't think this is a railway crossing.
mahesh   Mahesh Kumar
fixme   Please disconnect from highway
fixme   This is out of place
fixme   check connection
fixme   check connection
fixme   add correct type of hospital
fixme   precise location and aa a mosq
fixme   precise tag
fixme   It's a courier service shop. Put an appropriate category of labels.
fixme   check_highway_type
fixme   Is this the outline of a building or the outline of the area which the station occupies?
fixme   check_highway_type


Looks like I have got some interesting work here. Let's see how we can fix these and have fun! Tags whose 'k' attribute is 'fixme' mark objects that are not yet mapped correctly in OpenStreetMap itself.
>The fixme key allows contributors to mark objects and places that need further attention. These can be in the form of a "note to self" or request for additional mapping resources.

Source : http://wiki.openstreetmap.org/wiki/Key:fixme

These tags need to be fixed in the OSM data itself, which is out of scope for this project.

Let's take a look at the key 'mahesh' and see what we can do about it.

In [16]:
# print latitude,longitude of the node and  all the tag attributes under the node whose one of the 'k' values is 'mahesh', 
# to figure out what that node represents


for even, elem in ET.iterparse('Hyderabad_part.osm'):
    if elem.tag == 'node' or elem.tag == 'way':
        tag_list = elem.findall('tag')
        for tag in tag_list:      
            if tag.attrib['k'] == 'mahesh' :
                tags_mahesh_node = elem.findall('tag')
                print( elem.attrib['lat'], " ", elem.attrib['lon'])
                for tag_mahesh_node in tags_mahesh_node:
                    print( tag_mahesh_node.attrib['k'], " ", tag_mahesh_node.attrib['v'])

17.4319158   78.4415517
phone   +91-9391284890
mahesh   Mahesh Kumar
addr:city   Hyderabad
addr:street   Yellareddyguda Road
addr:postcode   500073
addr:housenumber   8-3-799/2


This is probably the address of a person named 'Mahesh Kumar', so instead of 'mahesh', the key of this tag should be 'name'. Let us take care of this while importing the data into MongoDB.


In [17]:
# take a look at all the postal code to make sure they are valid 

for even,elem in ET.iterparse('Hyderabad_part.osm'):
    if elem.tag == 'node' or elem.tag == 'way':
        tag_list = elem.findall('tag')
        for tag in tag_list:      
            if tag.attrib['k'] == 'addr:postcode':
                print(tag.attrib['v'])

                    
                


500038
500044
500081
500034
500082
500034
500082
500082
500082
500 095
500044
500028
500028
500028
500028
500028
500028
500028
500028
500028
500028
500028
500073
500029
500038
500038
500038
500038
500016
500001
500034
500029
500082
500082
500082
500082
500082
500082
500082
500082
500082
500003
500003
500016
500082
500016
500082
500027
500027
500057
500082
500053
500003
500062
500013
500057
500034
500034
500034
500082
500082
500034
500016
500001
500037
500082
500034
500016
500016
500034
500034
500063
500003
500034
500016
500016
500034
500082
500034
500016
500057
500029
500029
500026
500001
500001
500034
500038
500082
500034
500018
500038
500082
500038
500003
500003
500003
500003
500003
500029
500029
500029
500029
500029
500029
500034
500016
500016
500003
500038
500073
500073
500044
500026
500027
500082
500082
500082
500025
500016
500082
500003
500003
500003
500003
500082
500038
500082
500002
500002
500016
500001
500057
500028
500001
500016
500020
500003
500003
500003
500038
500034
50000

The postcodes look valid except for one value, '500 095', which contains a space as a separator. So before inserting into the database, let us make sure we remove the space from that value.
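Instead of reading the list by eye, the check can be automated. The sketch below assumes that a valid postcode here is exactly six digits (Indian PIN codes have six digits); `audit_postcode` and the status labels are hypothetical names introduced for illustration, not part of the notebook.

```python
import re

# Assumption: a valid postcode consists of exactly six digits.
SIX_DIGITS = re.compile(r"\d{6}")


def audit_postcode(raw):
    """Return (value, status), where status is 'ok', 'fixed' or 'invalid'."""
    if SIX_DIGITS.fullmatch(raw):
        return raw, "ok"
    compact = re.sub(r"\s+", "", raw)  # drop any whitespace used as a separator
    if SIX_DIGITS.fullmatch(compact):
        return compact, "fixed"
    return raw, "invalid"


for value in ["500082", "500 095", "5000"]:
    print(value, "->", audit_postcode(value))
```

Values reported as 'invalid' are returned unchanged, so they can be listed and reviewed by hand rather than being silently altered.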

In [16]:
# now let us audit the street names
# only a few elements carry 'tag' children, and even fewer of those have 'addr:street'

import re
import collections
import pprint

expected = ['Road', 'Avenue', 'Colony', 'Lane', 'Nagar']

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

street_types = collections.defaultdict(set)

def is_street(attrib):
    return attrib == 'addr:street'

def audit_type(street_types, street):
    match = street_type_re.search(street)
    if match:
        street_type = match.group()
        if street_type not in expected:
            street_types[street_type].add(street)
    
for even,elem in ET.iterparse('Hyderabad_part.osm'):
    if elem.tag == 'node' or elem.tag == 'way':
        tag_list = elem.findall('tag')
        for tag in tag_list:
            k_value = tag.attrib['k']
            if is_street(k_value):
                audit_type(street_types, tag.attrib['v'])
                
pprint.pprint(street_types)

defaultdict(<class 'set'>,
            {'1': {'Quena Square, Banjara Hills Road No. 1',
                   'lane number 1',
                   'street number 1'},
             '10': {'Street No. 10'},
             '12': {'Road No. 12'},
             '3': {'Banjarahiils, Rd. No. 3',
                   'Road No 3',
                   'Siddartha Nagar Road number 3'},
             '4': {'Roas No. 4', '4'},
             'Abids': {'Abids'},
             'Amderpet': {'Amderpet'},
             'Ameerpet': {'Ameerpet'},
             'Balkampet': {'Balkampet'},
             'Balkumpet': {'Balkumpet'},
             'Barkatpura': {'Barkatpura'},
             'Bd': {'Soldier Bd'},
             'Begumpet': {'1st Floor,Legend Apartments, Motilal Nehru Marg, '
                          'Begumpet',
                          'Boorugu Vihar, Begumpet',
                          'Kundanbagh, Begumpet',
                          'Naik Estate, Begumpet',
                          'Nishat Bagh Colony Road, 

These values show that many street names do not end with a street type at all: they end with a locality name, and in one case with the name of the state! Such values should either use a different key or be broken down into several keys, which is also out of scope for this project. <br>
Let us make the following changes:

Rd (or) Rd. (or) rd -> Road <br>

Let's also make the changes below to remove inconsistencies: <br>
number (or) Number (or) No -> No.

There's also 'Soldier Bd, Somajiguda', where Bd refers to Board (this is the city in which I live, so I got this information from friends).
So let us also change this for better understanding.

Bd -> Board

As for the street names that don't end with one of the expected street types, there is no need to handle them separately. This is common in India, where most streets have no names and are instead identified by schools or other landmarks.

There are even flyovers that are mapped within the addr:street tag.

'Flyover': {'Punjagutta Flyover', 'Airport Flyover'}

These should be reported to OSM to check whether this usage is allowed. Before deciding how to store them in our MongoDB data, let us look at the elements that carry them.



In [19]:
# take a look at those nodes where street name contain 'flyover' :

for even,elem in ET.iterparse('Hyderabad_part.osm'):
    if elem.tag == 'node' or elem.tag == 'way':
        tag_list = elem.findall('tag')
        for tag in tag_list:      
            if tag.attrib['k'] == 'addr:street':
                if tag.attrib['v'].find('Flyover') != -1 :
                    tags_flyover_node = elem.findall('tag')
                    print( elem.attrib['lat'], " ", elem.attrib['lon'])
                    for tag_flyover_node in tags_flyover_node:
                        print( tag_flyover_node.attrib['k'], ":", tag_flyover_node.attrib['v'])
                    


17.4262464   78.4511365
name : Himalaya Book World
shop : books
addr:city : hyderabad
addr:street : Punjagutta Flyover
17.4440522   78.4691906
addr:street : Airport Flyover
amenity : cafe
name : HeartCup Coffee


Looking at the nodes above, we can see that they are a bookstore and a cafe. The flyovers are near them, and whoever mapped these nodes used the flyover names as street names. For this project I am going to keep these values as they are, so that someone can at least find these shops with the given information. But I am going to report this to OSM, so that the map can be improved further.

In [17]:
# a mapping to improve street names during data injection into MongoDb

mapping = {'Rd' : 'Road',
           'Rd.' : 'Road',
           'rd' : 'Road',
           'rd.' : 'Road',
           'number' : 'No.',
           'Number' : 'No.',
           'Bd' : 'Board'}

Let us use the following data model to import the OSM data into MongoDB


In [18]:
'''
{
"id": "",     
"type": "",
"visible":"",
"created": {
          "version":"",
          "changeset":"",
          "timestamp":"",
          "user":"",
          "uid":""
        },
"pos": [,],
"address": {
          "housenumber": "",
          "postcode": "",
          "street": ""
        },
"amenity": "",
"cuisine": "",
"name": "",
"phone": ""
}'''

''' Data Model for MongoDb '''

' Data Model for MongoDb '

In [19]:
# function to process street names and return the corrected street name

def process_street(street, mapping):
    match = street_type_re.search(street)
    if match and match.group() == 'Telangana':
        # the value ends with the state name: drop it together with
        # the comma and spaces that preceded it
        street = street[:match.start()].rstrip(', ')
    words = []
    for word in street.split():
        # replace whole words only, so that an abbreviation such as 'rd'
        # is not rewritten when it occurs inside a longer word
        core = word.rstrip(',')   # keep a trailing comma out of the lookup
        words.append(mapping.get(core, core) + word[len(core):])
    return ' '.join(words)
    

In [20]:
def process_postcode(postcode):
    if len(postcode) == 6:
        postcode = int(postcode)
    elif len(postcode) == 7:
        postcode = postcode.replace(" ","")
        postcode = int(postcode)
    return postcode

In [21]:
CREATED = [ "version", "changeset", "timestamp", "user", "uid"]

problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def shape_element(element):
    node = {}
    node['created'] = {}
    node['pos'] = []
    node_addr = {}
    node_ref = []
    if element.tag == "node" or element.tag == "way" :
        node['type'] = element.tag
        if 'id' in element.attrib.keys():
            node['id'] = element.attrib['id']
        if 'visible' in element.attrib.keys():
            node['visible'] = element.attrib['visible']
        if 'lat' in element.attrib.keys():
            node['pos'].append(float(element.attrib['lat']))
        if 'lon' in element.attrib.keys():
            node['pos'].append(float(element.attrib['lon']))
        for create in CREATED:
            if create in element.attrib.keys():
                node['created'][create] = element.attrib[create]
        for tag in element.iter("tag"):
            if problemchars.search(tag.attrib['k']):
                continue
            if tag.attrib['k'] == "mahesh":     # this is just for one tag, where address of person 'mahesh' is being stored
                node_addr["name"] = tag.attrib['v']
            if tag.attrib['k'] == "addr:housenumber":
                node_addr["housenumber"] = tag.attrib['v']
            if tag.attrib['k'] == "addr:postcode":
                node_addr["postcode"] = process_postcode(tag.attrib['v'])
            if tag.attrib['k'] == "addr:street":
                node_addr["street"] = process_street(tag.attrib['v'],mapping)
            if tag.attrib['k'] == 'amenity':
                node['amenity'] = tag.attrib['v']
            if tag.attrib['k'] == 'name':
                node['name'] = tag.attrib['v']
            if tag.attrib['k'] == 'phone':
                node['phone'] = tag.attrib['v']
        if node_addr != {}:
            node['address'] = node_addr
        for nd in element.iter('nd'):
            node_ref.append(nd.attrib['ref'])
        if node_ref != []:
            node['node_refs'] = node_ref
                
        return node
    else:
        return None

## Importing to MongoDb

In [22]:
import codecs
import json

# write each document as a line to a json file, using the same code provided in lesson 6 problem

def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in ET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")

                    
    return data

In [23]:
# shape the OSM data and write it to 'Hyderabad_part.osm.json', one document per line

from pymongo import MongoClient

process_map('Hyderabad_part.osm', False)

[{'created': {'changeset': '26587379',
   'timestamp': '2014-11-06T07:57:18Z',
   'uid': '2419471',
   'user': 'Baluhyd',
   'version': '7'},
  'id': '245640524',
  'pos': [17.3951469, 78.4338348],
  'type': 'node'},
 {'created': {'changeset': '26587379',
   'timestamp': '2014-11-06T07:57:18Z',
   'uid': '2419471',
   'user': 'Baluhyd',
   'version': '9'},
  'id': '245640535',
  'pos': [17.3949513, 78.4404526],
  'type': 'node'},
 {'created': {'changeset': '26404937',
   'timestamp': '2014-10-29T05:13:05Z',
   'uid': '2419471',
   'user': 'Baluhyd',
   'version': '7'},
  'id': '245640546',
  'pos': [17.3821322, 78.493346],
  'type': 'node'},
 {'created': {'changeset': '40860317',
   'timestamp': '2016-07-19T14:44:10Z',
   'uid': '1016290',
   'user': 'Amaroussi',
   'version': '6'},
  'id': '245640551',
  'pos': [17.4025469, 78.4522029],
  'type': 'node'},
 {'created': {'changeset': '41212863',
   'timestamp': '2016-08-03T11:59:06Z',
   'uid': '1016290',
   'user': 'Amaroussi',
   'ver

## Overview of the Data

In [24]:
# get the database and collection of our data
# and output the total number of records in the collection
from pymongo import MongoClient

client = MongoClient()
db = client.test
db.records.count_documents({})

490507

I had already created the database 'test' and the collection 'records' from the JSON file generated from the data set.
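The notebook does not show how the JSON file was loaded into the collection. One possible way, sketched here under assumptions, is to read the line-delimited file in batches and pass each batch to the collection's `insert_many` method. `iter_batches`, `load_into` and `FakeCollection` are hypothetical names; `FakeCollection` only imitates `insert_many` so that the sketch runs without a MongoDB server.

```python
import json
import os
import tempfile


def iter_batches(path, batch_size=1000):
    """Yield lists of documents read from a file with one JSON object per line."""
    batch = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # last, possibly shorter, batch
        yield batch


def load_into(collection, path, batch_size=1000):
    """Insert every document from path; return the number of documents inserted."""
    total = 0
    for batch in iter_batches(path, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total


class FakeCollection:
    """Stand-in for a pymongo collection (assumption: only insert_many is needed)."""

    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        self.docs.extend(documents)


# Demo on a temporary file with five made-up documents.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    for i in range(5):
        tmp.write(json.dumps({"id": str(i), "type": "node"}) + "\n")
    sample_path = tmp.name

fake = FakeCollection()
inserted = load_into(fake, sample_path, batch_size=2)
os.remove(sample_path)
print(inserted)
```

Against a real server, the call would presumably look like `load_into(MongoClient().test.records, 'Hyderabad_part.osm.json')`, with the file name being the one `process_map` writes.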

In [25]:
# get our collection from database
record_collection = db.records

In [26]:
# records of type 'node'
record_collection.count_documents({"type":"node"})

397179

In [27]:
# records of type 'way'
record_collection.count_documents({"type":"way"})

93328

In [28]:
# relations were not imported into the database, so this count is zero
record_collection.count_documents({"type":"relation"})

0

In [29]:
# to get number of unique users using 'uid', who contributed to this part of the map

import pprint
result = record_collection.aggregate( [{ "$group" : { "_id" : "$created.uid", "count" : { "$sum" : 1 }}},
                              { "$sort" : { "count" : -1} },
                              ])
print(len(list(result)))

393


As we haven't included 'relation' elements in our database, the unique user count is only 393, compared to the 397 we got from the XML data.
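The gap of four users would be explained if four contributors appear only on relations. That can be checked against the XML directly. The following is a sketch on a made-up sample: `SAMPLE_OSM` and `uids_by_type` are hypothetical names, and it was not run against the real extract.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: uid 12 appears on a relation only.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="17.40" lon="78.40" uid="10"/>
  <way id="2" uid="11"><nd ref="1"/></way>
  <relation id="3" uid="11"/>
  <relation id="4" uid="12"/>
</osm>"""


def uids_by_type(source):
    """Collect the set of contributor uids separately for nodes, ways and relations."""
    uids = {"node": set(), "way": set(), "relation": set()}
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in uids:
            uid = elem.get("uid")  # .get tolerates elements without a uid
            if uid is not None:
                uids[elem.tag].add(uid)
            elem.clear()
    return uids


uids = uids_by_type(io.BytesIO(SAMPLE_OSM))
# Users who never touched a node or a way are missing from the database.
relation_only = uids["relation"] - (uids["node"] | uids["way"])
print(sorted(relation_only))
```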

In [40]:
# count the cafes, banks and restaurants in this part of the map

result = record_collection.aggregate([{"$match" : { "$or" : [ {"amenity" : { "$eq" : "cafe" } }, 
                                                              {"amenity" : { "$eq" : "bank"}},
                                                              {"amenity" : { "$eq" : "restaurant"}}]}},
                                      {"$group" : { "_id" : "$amenity", "count" : { "$sum" : 1 }}}
                                      ])
pprint.pprint(list(result))

[{'_id': 'cafe', 'count': 48},
 {'_id': 'bank', 'count': 119},
 {'_id': 'restaurant', 'count': 92}]


# Ideas to Improve the DataSet

I would like to summarise the problems encountered while auditing this part of the data and how I handled them:

1. The tag whose key is 'mahesh' should have had the key 'name', as that node represents the address of a person named 'Mahesh Kumar'. I changed this while importing the data into MongoDB.
2. I have excluded the 'fixme' tags, as they need further improvement in OSM itself, and the OSM website clearly states that no copyrighted content should be used to improve the map.

>Don't use copyrighted data <br>
>Most data that you find on the web is copyrighted, including "free" maps like Google Maps. You may never use copyrighted resources because it can cause a lot of trouble to OSM. As a rule of thumb, use no external resources except those available in the editors. If you think you found a non-copyrighted resource that isn't available in the editors, please discuss it first with local mappers using contact channels.

Source : http://wiki.openstreetmap.org/wiki/Beginners_Guide_1.1



3. One postcode had 7 characters because it contained a space. I stripped the space before converting it to int.

4. Some street names included 'Telangana' (the name of the state). I stripped 'Telangana' and the unwanted commas before importing the data into MongoDB.

# Changes made in OpenStreetMap

I have edited the 'mahesh' tag and renamed it to 'name'. You can find the change in the following OSM changeset.

https://www.openstreetmap.org/changeset/47372134

I have also fixed the street name issue found earlier, changing 'Punjagutta Flyover' to 'Nagarjuna Circle Road'.

https://www.openstreetmap.org/changeset/47372250

My username on osm is 'potter13'