# OpenStreetMap Data Analysis of New Delhi, India

### Data Loading

First we would extract information from OSM file in JSON format and use it to create MongoDb.

In [1]:
# prcessOsm.py module contains functionality of converting OSM data of XML format into JSON format for mongodb import.
# prcess_map takes flag to clean the data
from processOsm import process_map

In [2]:
# This call generates JSON data corresponding to XML format of OSM
process_map("new-delhi_india.osm")

We create mongodb of above JSON report using below command on shell 
mongoimport.exe --db test --collection delhi --file <filepath>/new-delhi_india.json

In [3]:
# we create pymongo instance to query data base
import pprint
import pymongo
from pymongo import MongoClient 
client = MongoClient("mongodb://localhost:27017")
db = client.test

## Problems Encountered in the Map

Exploration of map data showed lots of problems in data. Below are list of some of problems. I have also selected few queries to illustrate problems.

#### Accuracy of Postal Code
For Delhi, postal code is six digit code. But some of postal codes present in data were
    - Non numeric character
    - Contains spaces between digits
    - Contains less or more than six digits  
    
Some of examples of faulty postal codes present in database are:-
* 2242
* Sunpat House Village
* sdf
* 110 001

#### Inconsistency in amenity labels
There seems to be a lack of common glossary for various type of amenties. Some of problem were-
    - Mixture of lower case and upper case properties of same name.
    - Synomous words are being used e.g. bar/pub, doctor/clinic
    - some of amenity were having non-generic labels like "Netaji Nagar Market" and "Bawana Bus depot".

In [4]:
db.delhi.find({"amenity":"bar"}).count()

22

In [5]:
# pub and bar could be merged
db.delhi.find({"amenity":"pub"}).count()

5

#### Incompleteness of data
We found lots of incomplete information in data like-
    - name of amenity missing in lots of node.
    - incomplete addresses like missing postcode.
    - important labels like religion missing from places of worships.

In [6]:
# amenities with name not specified
agg = db.delhi.aggregate([{"$match":{"amenity":{"$exists":"true"}, "name":{"$exists":"false"}}},
                            {"$group":{"_id":"$amenity", "count":{"$sum":1}}},
                            {"$sort":{"count":-1}},
                            {"$limit":10}])
pprint.pprint(list(agg))

[{u'_id': u'school', u'count': 403},
 {u'_id': u'place_of_worship', u'count': 239},
 {u'_id': u'restaurant', u'count': 150},
 {u'_id': u'hospital', u'count': 139},
 {u'_id': u'bank', u'count': 118},
 {u'_id': u'fast_food', u'count': 92},
 {u'_id': u'college', u'count': 91},
 {u'_id': u'fuel', u'count': 88},
 {u'_id': u'atm', u'count': 71},
 {u'_id': u'parking', u'count': 65}]


#### Inconsistency in abbreviation of "Government" in name of amenities
We found that in name of Governement amenities abbreviation of "Government" is non standard e.g. following are some of different abbreviatio used.
- Govt.
- Govt
- govt
- govt.

## Data Cleaning

We solved some of problems with data programatically like
- removed invalid postal code with non-numeric characters and not adhering to specified number of digits.
    * We made sure that our postal code of Delhi is a 6 digit number by regular expression r'^\d{6}$'
- removed white spaces with otherwise valid postal code.
- mapped some of amenities to unified glossary like changing clinic to doctors and pub to bar.
- Used a common abbereviation in "Government" named amenities.

In [7]:
# extracting new cleaned-data.
process_map("new-delhi_india.osm", isClean = True)

With this new extracted JSON, we create another cleaned collection delhi1 in MongoDb.

## Overview of Data

Here MongoDb queries are run for basic statistics of cleaned Delhi's map.

In [8]:
import os
os.path.getsize("new-delhi_india.osm")

719136645L

Size of OSM file for New Delhi is 719MB.

In [9]:
# Number of documents
db.delhi1.find().count()

3932112

In [10]:
# Number of nodes
db.delhi1.find({"type":"node"}).count()

3272109

In [11]:
#number of ways
db.delhi1.find({"type":"way"}).count()

659854

In [12]:
# number of unique users
len(db.delhi1.distinct("created.user"))

946

In [13]:
# Top 5 contributing users
agg = db.delhi1.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, {"$sort":{"count":-1}}, {"$limit":5}])
pprint.pprint(list(agg))

[{u'_id': u'Oberaffe', u'count': 260843},
 {u'_id': u'saikumar', u'count': 158503},
 {u'_id': u'premkumar', u'count': 154670},
 {u'_id': u'sdivya', u'count': 130144},
 {u'_id': u'anushap', u'count': 122224}]


In [14]:
# numbers of users appearing only once
agg = db.delhi1.aggregate([{"$group":{"_id":"$created.user", "count":{"$sum":1}}}, 
                           {"$group":{"_id":"$count", "num_users":{"$sum":1}}},
                           {"$sort":{"_id":1}}, {"$limit":1}])
pprint.pprint(list(agg))

[{u'_id': 1, u'num_users': 184}]


In [15]:
# top 10 appearing amenities
agg = db.delhi1.aggregate([{"$match":{"amenity":{"$exists":1}}}, 
                           {"$group":{"_id":"$amenity", "count":{"$sum":1}}},
                           {"$sort":{"count":-1}},
                           {"$limit":10}])
pprint.pprint(list(agg))

[{u'_id': u'school', u'count': 902},
 {u'_id': u'place_of_worship', u'count': 328},
 {u'_id': u'parking', u'count': 327},
 {u'_id': u'fuel', u'count': 212},
 {u'_id': u'hospital', u'count': 184},
 {u'_id': u'restaurant', u'count': 166},
 {u'_id': u'atm', u'count': 150},
 {u'_id': u'bank', u'count': 133},
 {u'_id': u'college', u'count': 128},
 {u'_id': u'fast_food', u'count': 106}]


In [16]:
# Numbers of historic sites
agg = db.delhi1.aggregate([{"$match":{"historic":{"$exists":"true"}}},
                          {"$group":{"_id": "historic", "count": {"$sum":1}}}])
pprint.pprint(list(agg))

[{u'_id': u'historic', u'count': 189}]


Delhi being a historic city has lots of heritage places which appears in above results also.

In [17]:
# Number of tourism attractions
db.delhi1.find({"tourism":"attraction"}).count()

62

In [18]:
# Numbers of historic tourism attractions
agg = db.delhi1.aggregate([{"$match":{"historic":{"$exists":"true"}, "tourism":{"$exists":"true"}, "tourism":"attraction"}}])
pprint.pprint(len(list(agg)))

21


In [19]:
# Numbers of schools
db.delhi1.find({"amenity":"school"}).count()

902

In [20]:
# Top 10 school names
agg = db.delhi1.aggregate([{"$match":{"amenity":{"$exists":"true"}, "amenity":"school"}},
                        {"$group":{"_id":"$name", "count":{"$sum":1}}},
                        {"$sort":{"count":-1}},
                        {"$limit":10}])
pprint.pprint(list(agg))

[{u'_id': None, u'count': 499},
 {u'_id': u'Delhi Public School', u'count': 7},
 {u'_id': u'Kendriya Vidyalaya', u'count': 5},
 {u'_id': u'Government School', u'count': 5},
 {u'_id': u'Ryan International School', u'count': 3},
 {u'_id': u'Modern School', u'count': 3},
 {u'_id': u'Salwan Public School', u'count': 3},
 {u'_id': u'Manav Sthali School', u'count': 3},
 {u'_id': u'DAV School', u'count': 2},
 {u'_id': u'Blind School', u'count': 2}]


In [21]:
# Number of universities
db.delhi1.find({"amenity":"university"}).count()

35

In [22]:
# Number of colleges
db.delhi1.find({"amenity":"college"}).count()

128

In [23]:
# Number of embassy
db.delhi1.find({"amenity":"embassy"}).count()

41

In [24]:
# Top religions
agg = db.delhi1.aggregate([{"$match":{"amenity":{"$exists":1}, "amenity":"place_of_worship"}},
                    {"$group":{"_id":"$religion", "count": {"$sum":1}}},
                    {"$sort":{"count":-1}}])
pprint.pprint(list(agg))

[{u'_id': u'hindu', u'count': 132},
 {u'_id': None, u'count': 75},
 {u'_id': u'muslim', u'count': 47},
 {u'_id': u'christian', u'count': 34},
 {u'_id': u'sikh', u'count': 29},
 {u'_id': u'jain', u'count': 6},
 {u'_id': u'buddhist', u'count': 3},
 {u'_id': u'bahai', u'count': 1},
 {u'_id': u'zoroastrian', u'count': 1}]


Hindu is predominant religion as expected per demographics of India.

In [25]:
# places of worship that are tourist attractions
agg = db.delhi1.aggregate([{"$match":{"amenity":{"$exists":1}, "amenity":"place_of_worship", "tourism":{"$exists":1}}},
                          {"$project":{"name":1,"_id":0}}])
pprint.pprint(list(agg))

[{u'name': u'Jama Masjid'},
 {u'name': u'Lotus Temple'},
 {u'name': u'Moth Ki Masjid'}]


In [26]:
# Popular cuisine resaurants
agg = db.delhi1.aggregate([{"$match":{"amenity":{"$exists":1}, "amenity":"restaurant"}},
                           {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},
                           {"$sort":{"count":-1}},
                           {"$limit":5}])
pprint.pprint(list(agg))

[{u'_id': None, u'count': 104},
 {u'_id': u'indian', u'count': 13},
 {u'_id': u'regional', u'count': 8},
 {u'_id': u'pizza', u'count': 7},
 {u'_id': u'chinese', u'count': 5}]


## Additional Ideas

#### STANDARDISATION OF GLOSSARY

There seem to be a missing unified terminology for data. We could make standard glossary so that terms across the data could be consistent.
This standardisation would also combat false terms being inserted like amenity of type "Bawana Bus Depot", which is just a particular instance of Bus Depot.

One drawback of standardization is that it could hinder development of addition types and might make addition of new data tedious.

#### IMPUTATION OF MISSING DATA

It is possible to fill lots of missing data using imputation with available data. Some of examples are-
- postcode of a place could be known from its street address.
- religion of a place of worship could be known through keywords like "temple", "mandir" or "mosque" in name.
- If word like fort or tomb appears, then it is likely to be a historic place.

This approach of filling missing data might introduce extra errors in data e.g. there might be duplicate street name with different postcode or it might be that we propgate messy data by using it in imputation.