

### Introduction ###
The aim of this project is to select a map of a region of the world, download its data, audit the data, and fix the problems found. The cleaned data is then imported into a database and several queries are run against it.

### OSM Data ###

* I am using OpenStreetMap data for the city of Mountain View, California, downloaded from [mapzen](https://mapzen.com/data/metro-extracts/). The dataset was downloaded on March 27, 2017 at 10:21 AM.
* The data file is in XML format; a description of the OSM XML format is available [here](http://wiki.openstreetmap.org/wiki/OSM_XML).

### Database ###

* MongoDB

### Issues in Mountain View OSM data ###

* Street names are inconsistent: some are incorrect and some are abbreviated.
* A few zip codes are inconsistent.
* Phone numbers entered by users are inconsistently formatted.


### Overview of Mountain View OSM data ###

The dataset is described below.


#### Size of data file ####
* MountainView.osm (the original downloaded OpenStreetMap file in XML format): 209 MB
* MountainView.osm.json (the processed OpenStreetMap file in JSON format): 346 MB

#### Summary of descriptive statistics of dataset ####

* Number of documents: 5754659
* Number of unique users: 880
* Number of nodes: 5136303
* Number of ways: 618301

### References ###

1. [Udacity Sample Data Wrangling Project](https://docs.google.com/document/d/1F0Vs14oNEs2idFJR3C_OPxwS6L0HPliOii-QpbmrMo4/pub)

2. <https://zelite.github.io/Wrangle-OpenStreetMap-Data/>

3. <https://english.stackexchange.com/questions/29009/standard-format-for-phone-numbers>

### Code and Results ###
Several queries were run to gain deeper insight into the data, followed by a conclusion.





<b>Import Libraries</b>

In [2]:
# load libraries
import os
import xml.etree.cElementTree as cET
from collections import defaultdict
import pprint
import re
import codecs
import json
import string
from pymongo import MongoClient

In [3]:
# set up map file path
filename = "MountainView.osm" # osm filename
# filename = "sample200.osm" # Sample osm filename
path = "/Users/seemamishra/Desktop/Udacity/Data_Wrangling/P3_Data" # directory contain the osm file
MountainViewosm = os.path.join(path, filename)

# MountainViewosm = "MountainView.osm" # osm filename
# path = "d:\GithubRepos\Udacity\P3" # directory contain the osm file
lower = re.compile(r'^([a-z]|_)*$') 
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
# initial version of expected street names
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane",
            "Road", "Trail", "Parkway", "Commons", "highway"]
MountainViewosm

'/Users/seemamishra/Desktop/Udacity/Data_Wrangling/P3_Data/MountainView.osm'

#### Count the number of tags ####

In [4]:
# Iterative parsing
def count_tags(filename):
    
    # make empty defaultdict
#     from collections import defaultdict
    tags_dict = defaultdict(int)
    
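    # use the iterparse method to find all the tags
    # note: with both "start" and "end" events every element is visited twice,
    # so each count below is double the number of elements
    for event, element in cET.iterparse(filename, events=("start", "end")):
#         print event
        tags_dict[element.tag] += 1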
        
    # return your results 
    return tags_dict

if __name__ == "__main__":
    print count_tags(MountainViewosm)

defaultdict(<type 'int'>, {'node': 2048376, 'nd': 2318540, 'bounds': 2, 'member': 10530, 'tag': 835590, 'osm': 2, 'way': 246602, 'relation': 2532})
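The output above reports `'osm': 2` and `'bounds': 2`, which suggests that each element is counted once per event. A minimal sketch of counting each element exactly once, by listening to `"start"` events only, is given below. It is written for Python 3 (unlike the Python 2 cells in this notebook), and the function name `count_tags_once` and the tiny sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def count_tags_once(source):
    """Count every element once by handling only "start" events."""
    counts = defaultdict(int)
    for _, element in ET.iterparse(source, events=("start",)):
        counts[element.tag] += 1
    return dict(counts)


# Tiny hypothetical OSM-like document, used only to exercise the function.
sample = io.BytesIO(b"""<osm>
  <node id="1"><tag k="name" v="A"/></node>
  <node id="2"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>""")

tag_counts = count_tags_once(sample)
print(tag_counts)
```

Applied to the real file, this approach would be expected to give half of each count shown above.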


#### Tag types ####

In [5]:
# Tag types
def key_type(element, keys):
    if element.tag == "tag":
    
        k = element.attrib['k']
#         print k
        # search k to see if it matches each regular expression
        if lower.search(k):
            keys['lower'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif problemchars.search(k):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
           
    return keys



def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in cET.iterparse(filename):
        keys = key_type(element, keys)

    return keys


if __name__ == "__main__":
    print process_map(MountainViewosm)

{'problemchars': 25, 'lower': 226487, 'other': 4962, 'lower_colon': 186321}
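To make the four categories concrete, the sketch below applies the same three regular expressions to a handful of keys. The sample keys and the helper name `classify_key` are illustrative assumptions rather than values taken from the dataset, and the code is written for Python 3.

```python
import re

# Same patterns as defined earlier in this notebook.
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def classify_key(k):
    """Return the category a tag key falls into, in the same order as key_type."""
    if lower.search(k):
        return "lower"
    if lower_colon.search(k):
        return "lower_colon"
    if problemchars.search(k):
        return "problemchars"
    return "other"


# Hypothetical sample keys.
sample_keys = ["highway", "addr:street", "opening hours", "addr:street:name", "FIXME"]
categories = {k: classify_key(k) for k in sample_keys}
print(categories)
```

Keys with upper-case letters, digits, or more than one colon fall into `other`, because they match none of the three patterns.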


#### Audit the street names ####

In [6]:
def audit_street_type(street_types, street_name):
    # add unexpected street name to a list
    m = street_type_re.search(street_name)
#     print m
    if m:
        street_type = m.group()
#         street_type
        if street_type not in expected:
            street_types[street_type].add(street_name)
            
def is_street_name(elem):
    # determine whether an element is a street name
    return (elem.attrib['k'] == "addr:street")

def audit_street(osmfile):
    # iterate through all street name tags under node or way and audit the street name value
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in cET.iterparse(osm_file, events=("start","end")):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types
if __name__ == '__main__':
    st_types = audit_street(MountainViewosm)
    # print out unexpected street names
    pprint.pprint(dict(st_types))




{'114': set(['West Evelyn Avenue Suite #114']),
 '140': set(['Hamilton Ave #140']),
 '2': set(['Showers Drive STE 2']),
 '7': set(['Showers Drive STE 7']),
 '9': set(['East Charleston Road APT 9']),
 'AA': set(['Showers Drive BLDG AA']),
 'Alley': set(['Jackson Alley']),
 'Ave': set(['California Ave',
             'E Duane Ave',
             'El Monte Ave',
             'Hollenbeck Ave',
             'Portage Ave',
             'S California Ave',
             'University Ave',
             'W Maude Ave',
             'W Washington Ave']),
 'Ave.': set(['Menalto Ave.']),
 'B': set(['Leghorn Street #B']),
 'Bruno': set(['Serra San Bruno']),
 'C': set(['Plymouth Street #C']),
 'Calle': set(['La Calle']),
 'Central': set(['Plaza Central']),
 'Circle': set(['Bobolink Circle',
                'Carlson Circle',
                'Comstock Circle',
                'Continental Circle',
                'Distel Circle',
                'Duluth Circle',
                'East Meadow Circle',
      

#### Update the street names ####

In [7]:
# Street name update
# creating a dictionary for correcting street names
mapping = { "AA" :"Aberdeen Athletic Center",
            "Ct": "Court",
            "Ct.": "Court",
            "St.": "Street",
            "St,": "Street",
            "ST": "Street",
            "street": "Street",
            "STE": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "ave": "Avenue",
            "Rd.": "Road",   
            "rd.": "Road",
            "Rd": "Road",    
            "Hwy": "Highway",
            "HIghway": "Highway",
            "BLDG": "Building",
            "APT": "Apartment",
           "West Evelyn Avenue Suite #114":"West Evelyn Avenue Suite",
           "Showers Drive STE 2": "Showers Drive Street",
           "Showers Drive STE 7": "Showers Drive Street",
           "East Charleston Road APT 9": "East Charleston Road Apartment",
           "Leghorn Street #B": "Leghorn Street",
           "Plymouth Street #C": "Plymouth Street",
           "Hamilton Ave #140": "Hamilton Ave",
           "W. El Camino Real": "West El Camino Real",
           "W El Camino Real":"West El Camino Real",
           "E. El Camino Real": "East El Camino Real",
           "E El Camino Real" : "East El Camino Real",
           "West Dana St": "West Dana Street"
           }
           
                     
# function that corrects incorrect street names
def update_name(name, mapping):    
    for key in mapping:
        if key in name:
            name = name.replace(key, mapping[key])
    return name
if __name__ == '__main__':
    for st_type, ways in st_types.iteritems():
        for name in ways:
            better_name = update_name(name, mapping)
            print name, "=>", better_name

Villa Vista => Villa Vista
Roble Ridge => Roble Ridge
Vanderbilt Court West => Vanderbilt Court West
Wolfe Rd => Wolfe Road
Homestead Rd => Homestead Road
E Middlefield Rd => E Middlefield Road
Embarcadero Rd => Embarcadero Road
West Evelyn Avenue Suite #114 => West Evelyn Avenuenue Suite
Serra San Bruno => Serra San Bruno
Devonshire Way => Devonshire Way
Aspen Way => Aspen Way
Flicker Way => Flicker Way
Asbury Way => Asbury Way
Davenport Way => DAvenuenuenport Way
Madera Way => Madera Way
Wintergreen Way => Wintergreen Way
La Jennifer Way => La Jennifer Way
Hansen Way => Hansen Way
Murray Way => Murray Way
Elbridge Way => Elbridge Way
Enderby Way => Enderby Way
Acacia Way => Acacia Way
Alley Way => Alley Way
Brahms Way => Brahms Way
Primrose Way => Primrose Way
Bond Way => Bond Way
Hudson Way => Hudson Way
Dunnock Way => Dunnock Way
Golden Way => Golden Way
Forge Way => Forge Way
Anaconda Way => Anaconda Way
Prince Edward Way => Prince Edward Way
Old Middlefield Way => Old Middlefield
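Two lines of the output above show a side effect of substring replacement: `Davenport Way` becomes `DAvenuenuenport Way` because `ave` matches inside `Davenport`, and `Avenue` becomes `Avenuenue` because `Ave` matches inside a name that is already correct. One possible way to avoid this is to replace only whole words. The sketch below assumes a reduced, hypothetical mapping (`street_mapping`) and a hypothetical helper name (`update_name_by_word`); it is written for Python 3 and is not the function used to produce the results in this notebook.

```python
# Hypothetical subset of the abbreviation mapping, for illustration only.
street_mapping = {
    "Ave": "Avenue",
    "Ave.": "Avenue",
    "Rd": "Road",
    "Rd.": "Road",
    "St": "Street",
    "St.": "Street",
    "Ct": "Court",
}


def update_name_by_word(name, mapping):
    """Replace an abbreviation only when it matches a whole word of the name."""
    words = name.split()
    return " ".join(mapping.get(word, word) for word in words)


examples = ["Davenport Way", "West Evelyn Avenue Suite #114", "Wolfe Rd", "Menalto Ave."]
fixed = [update_name_by_word(n, street_mapping) for n in examples]
for before, after in zip(examples, fixed):
    print(before, "=>", after)
```

Because each word is looked up as a whole, names that merely contain an abbreviation as a substring are left untouched.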

In [8]:
# zip code
def audit_zipcodes(osmfile):
    # iterate through all zip codes and collect those that do not start with 94
    osm_file = open(osmfile, "r")
    zip_codes = {}
    for event, elem in cET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if tag.attrib['k'] == "addr:postcode" and not tag.attrib['v'].startswith('94'):
                    if tag.attrib['v'] not in zip_codes:
                        zip_codes[tag.attrib['v']] = 1
                    else:
                        zip_codes[tag.attrib['v']] += 1
    return zip_codes

zipcodes = audit_zipcodes(MountainViewosm)
for zipcode in zipcodes:
    print zipcode, zipcodes[zipcode]
# zipcodes

30188 1
95014-0137 1
95014-6357 1
95014-030 1
95014-0497 1
95014-0496 1
95014-0495 1
95014-0494 1
95014-0493 1
95014-0492 2
95014-0556 1
95014-0552 4
95014-0499 1
95014-0551 4
95014-6340 1
95014-6342 1
95014-6344 1
95014-6346 1
95014-6348 1
95014-0549 10
95014-0548 10
95014-0545 9
95014-0547 4
95014-0546 5
95014-0540 5
95014-0116 7
95014-0117 3
95014-0114 1
95014-0115 6
95014-0112 3
95014-0113 1
95014-0111 1
95014-0448 6
95051 23
95014-0444 12
95014-0445 8
95014-0446 5
95014-0447 4
95014-0440 3
95014-0607 1
95014-6371 1
95014-0102 2
95014-0103 4
95014-0353 1
95014-0453 5
95014-0451 8
95014-0450 1
95014-0457 10
95014-0456 10
95014-0455 5
95014-0454 5
95014-6363 1
95014-6360 1
95014-0565 5
95014-0564 4
95014-0560 3
CA 94085 1
CA 94086 1
95014-0202 1
95014-0200 1
95014-0568 8
95014-0500 1
95014-0292 1
95014-6318 5
95014-6319 3
95014-6316 8
95014-6317 5
95014-6314 3
95014-6315 7
95014-6561 1
95014-6560 5
95014-6562 5
95014-0535 4
95014-0536 2
95014-0531 4
95014-0532 5
95014-0533 4
95014-05

#### Strategy for updating zip codes ####

The data also covers parts of Santa Clara, Cupertino, San Jose and Sunnyvale. I have only updated the zip codes of Mountain View, which start with '94', using a mapping dictionary.

In [70]:


# named zip_mapping so it does not overwrite the street-name mapping used by shape_element
zip_mapping = { "CA 94085":"94085",
                "CA 94086":"94086"
              }


# function that corrects incorrect zip codes
def update_zipcode(zipcode, mapping):    
    for key in mapping:
        if key in zipcode:
            zipcode = zipcode.replace(key, mapping[key])
    # return only after every key has been checked
    return zipcode


if __name__ == '__main__':
    for zipcode in zipcodes:
        better_zipcode = update_zipcode(zipcode, zip_mapping)
        print zipcode, "=>", better_zipcode
        

30188 => 30188
95014-0137 => 95014-0137
95014-6357 => 95014-6357
95014-030 => 95014-030
95014-0497 => 95014-0497
95014-0496 => 95014-0496
95014-0495 => 95014-0495
95014-0494 => 95014-0494
95014-0493 => 95014-0493
95014-0492 => 95014-0492
95014-0556 => 95014-0556
95014-0552 => 95014-0552
95014-0499 => 95014-0499
95014-0551 => 95014-0551
95014-6340 => 95014-6340
95014-6342 => 95014-6342
95014-6344 => 95014-6344
95014-6346 => 95014-6346
95014-6348 => 95014-6348
95014-0549 => 95014-0549
95014-0548 => 95014-0548
95014-0545 => 95014-0545
95014-0547 => 95014-0547
95014-0546 => 95014-0546
95014-0540 => 95014-0540
95014-0116 => 95014-0116
95014-0117 => 95014-0117
95014-0114 => 95014-0114
95014-0115 => 95014-0115
95014-0112 => 95014-0112
95014-0113 => 95014-0113
95014-0111 => 95014-0111
95014-0448 => 95014-0448
95051 => 95051
95014-0444 => 95014-0444
95014-0445 => 95014-0445
95014-0446 => 95014-0446
95014-0447 => 95014-0447
95014-0440 => 95014-0440
95014-0607 => 95014-0607
95014-6371 => 95014-63
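The audit above also shows many ZIP+4 values such as `95014-0137`, which the mapping leaves unchanged. If five-digit codes were wanted throughout, an alternative to the dictionary approach could be a regular expression that extracts the base code. This is only a sketch of that alternative, not the strategy used in this project; the names `ZIP_RE` and `normalize_zipcode` are hypothetical and the code is written for Python 3.

```python
import re

# Five digits, optionally followed by a hyphen and up to four more digits.
ZIP_RE = re.compile(r'\b(\d{5})(?:-\d{1,4})?\b')


def normalize_zipcode(raw):
    """Return the five-digit ZIP code found in raw, or None if there is none."""
    match = ZIP_RE.search(raw)
    return match.group(1) if match else None


samples = ["CA 94085", "95014-0137", "95014-030", "30188", "unknown"]
normalized = [normalize_zipcode(s) for s in samples]
print(normalized)
```

Note that this only normalizes the format. A value such as `30188` is still returned as is, so codes outside the area would need a separate check.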

In [None]:
# Audit phone number



#### Process OSM XML file to JSON ####
Only the elements of type “node” and “way” will be imported to the database. The data model we’re going to use follows the format of this example:
{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
         "version":"2",
         "changeset":"17206049",
         "timestamp":"2013-08-03T16:43:42Z",
         "user":"linuxUser16",
         "uid":"1219059"
       },
"pos": [41.9757030, -87.6921867],
"address": {
         "housenumber": "5157",
         "postcode": "60625",
         "street": "North Lincoln Ave"
       },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}



In [11]:

CREATED = [ "version", "changeset", "timestamp", "user", "uid"]
def shape_element(element):
    node = {}
    node["created"]={}
    node["address"]={}
    node["pos"]=[]
#     node["amenity"] ={}
#     node["cuisine"] = {}
    refs=[]
    
    # I only process the node and way tags
    if element.tag == "node" or element.tag == "way" :
        if "id" in element.attrib:
            node["id"]=element.attrib["id"]
        node["type"]=element.tag

        if "visible" in element.attrib.keys():
            node["visible"]=element.attrib["visible"]
      
        # the key-value pairs with attributes in the CREATED list are added under key "created"
        for elem in CREATED:
            if elem in element.attrib:
                node["created"][elem]=element.attrib[elem]
                
        # attributes for latitude and longitude are added to a "pos" array
        # include latitude value        
        if "lat" in element.attrib:
            node["pos"].append(float(element.attrib["lat"]))
        # include longitude value    
        if "lon" in element.attrib:
            node["pos"].append(float(element.attrib["lon"]))

        
        for tag in element.iter("tag"):
            if not(problemchars.search(tag.attrib['k'])):
                if tag.attrib['k'] == "addr:housenumber":
                    node["address"]["housenumber"]=tag.attrib['v']
                    
                if tag.attrib['k'] == "addr:postcode":
                    node["address"]["postcode"]=tag.attrib['v']
                
                # handling the street attribute, update incorrect names using the strategy developed before   
                if tag.attrib['k'] == "addr:street":
                    node["address"]["street"]=tag.attrib['v']
                    node["address"]["street"] = update_name(node["address"]["street"], mapping)

                if tag.attrib['k'].find("addr")==-1:
                    node[tag.attrib['k']]=tag.attrib['v']
                    
        for nd in element.iter("nd"):
            refs.append(nd.attrib["ref"])

        if node["address"] =={}:
            node.pop("address", None)

        if refs != []:
            node["node_refs"]=refs
            
        return node
    else:
        return None



def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    with codecs.open(file_out, "w") as fo:
        for _, element in cET.iterparse(file_in):
            el = shape_element(element)
            if el:
                data.append(el)
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    fo.write(json.dumps(el) + "\n")
    return data

In [None]:
# process the file
data = process_map(MountainViewosm, True)
# for d in data:
#     print d


#### Insert the JSON data into MongoDB Database ####

In [None]:
client = MongoClient()
db = client.MountainViewosm
collection = db.MountainViewMAP
collection.insert_many(data)  # insert() is deprecated since PyMongo 3.0
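With several million documents, it may be more comfortable to send the data to MongoDB in batches rather than in one call. The sketch below shows a small batching helper; the name `chunked` and the batch size are assumptions, and the commented usage lines assume PyMongo 3 or later, where `insert_many` is available. It is written for Python 3.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage with the collection defined above:
# for batch in chunked(data, 10000):
#     collection.insert_many(batch)

batches = list(chunked(range(5), 2))
print(batches)
```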

#### Size of original XML file ####

In [22]:

os.path.getsize(os.path.join(path, "MountainView.osm"))/1024/1024

209

#### Size of processed JSON  file ####

In [14]:

os.path.getsize(os.path.join(path, "MountainView.osm.json"))/1024/1024

346

#### Number of documents ####

In [15]:
collection.find().count()


5754659

#### Number of unique users ####

In [19]:
# Number of unique users
len(collection.group(["created.uid"], {}, {"count":0}, "function(o, p){p.count++}"))


880

#### Number of nodes ####

In [20]:
# Number of nodes
collection.find({"type":"node"}).count()

5136303

#### Number of ways ####

In [21]:
collection.find({"type":"way"}).count()

618301

#### Top 10 tools used to create data entries ####

In [38]:

pipeline = [{"$group":{"_id": "$created_by",
                       "count": {"$sum": 1}}},
                     {"$sort": {"count": -1}},
                    {"$limit": 10}]
           
result = collection.aggregate(pipeline)
for r in result:
    print r
# assert len(result['result'])

# print(len(result['result']))
# print result[result]

{u'count': 5746814, u'_id': None}
{u'count': 4363, u'_id': u'JOSM'}
{u'count': 1490, u'_id': u'Potlatch 0.10f'}
{u'count': 896, u'_id': u'Potlatch 0.9c'}
{u'count': 290, u'_id': u'Potlatch 0.10b'}
{u'count': 243, u'_id': u'Potlatch 0.10'}
{u'count': 215, u'_id': u'Potlatch 0.8c'}
{u'count': 105, u'_id': u'Potlatch 0.9a'}
{u'count': 98, u'_id': u'Potlatch 0.10e'}
{u'count': 80, u'_id': u'OSMPointy v0.4 iPhone'}


#### Top 5 users by contributions ####

In [23]:
# top five users with most contributions
pipeline = [{"$group":{"_id": "$created.user",
                       "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 5}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 760464, u'_id': u'RichRico'}
{u'count': 707471, u'_id': u'ediyes'}
{u'count': 601207, u'_id': u'samely'}
{u'count': 582349, u'_id': u'karitotp'}
{u'count': 439029, u'_id': u'calfarome'}


#### Top 10 amenities ####

In [18]:
pipeline = [{'$match': {'amenity': {'$exists': 1}}}, 
                                {'$group': {'_id': '$amenity', 
                                            'count': {'$sum': 1}}}, 
                                {'$sort': {'count': -1}}, 
                                {'$limit': 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 4593458, u'_id': {}}
{u'count': 6007, u'_id': u'parking'}
{u'count': 2905, u'_id': u'restaurant'}
{u'count': 2254, u'_id': u'bicycle_parking'}
{u'count': 1442, u'_id': u'bench'}
{u'count': 1221, u'_id': u'school'}
{u'count': 917, u'_id': u'fast_food'}
{u'count': 871, u'_id': u'post_box'}
{u'count': 819, u'_id': u'cafe'}
{u'count': 756, u'_id': u'place_of_worship'}


#### Most popular restaurant cuisines ####

In [24]:
# Most popular cuisines
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity":"restaurant", "cuisine":{"$exists":1}}}, 
            {"$group":{"_id":"$cuisine", "count":{"$sum":1}}},        
            {"$sort":{"count":-1}}, 
            {"$limit":10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 376, u'_id': {}}
{u'count': 190, u'_id': u'mexican'}
{u'count': 175, u'_id': u'chinese'}
{u'count': 170, u'_id': u'japanese'}
{u'count': 145, u'_id': u'indian'}
{u'count': 130, u'_id': u'pizza'}
{u'count': 80, u'_id': u'thai'}
{u'count': 75, u'_id': u'italian'}
{u'count': 65, u'_id': u'american'}
{u'count': 65, u'_id': u'vietnamese'}


#### Name of Universities ####

In [25]:
# University
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "university", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for r in result:
    print r


{u'count': 5, u'_id': u'Stanford University'}
{u'count': 5, u'_id': u'Carnegie Mellon University Silicon Valley'}
{u'count': 5, u'_id': u'Singularity University Classroom '}
{u'count': 5, u'_id': u'Singularity University'}
{u'count': 5, u'_id': u'20'}
{u'count': 5, u'_id': u'Nine Star University of Health Sciences'}


#### 10 places of worship ####

In [23]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "place_of_worship", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 12, u'_id': u'Mountain View Chinese Christian Church'}
{u'count': 12, u'_id': u'First United Methodist Church'}
{u'count': 12, u'_id': u'First Church of Christ Scientist'}
{u'count': 12, u'_id': u'Peninsula Bible Church'}
{u'count': 12, u'_id': u'Seventh Day Adventist Church'}
{u'count': 12, u'_id': u'Trinity United Methodist Church'}
{u'count': 12, u'_id': u'The Church of Jesus Christ of Latter-day Saints'}
{u'count': 6, u'_id': u'Holy Korean Martyrs Catholic Church'}
{u'count': 6, u'_id': u"Saint Mark's Missionary Baptist Church"}
{u'count': 6, u'_id': u'New Hope International'}


#### Gas stations ####

In [27]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "fuel", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 50, u'_id': u'Shell'}
{u'count': 40, u'_id': u'Valero'}
{u'count': 35, u'_id': u'Chevron'}
{u'count': 30, u'_id': u'76'}
{u'count': 30, u'_id': u'Arco'}
{u'count': 10, u'_id': u'valero'}
{u'count': 10, u'_id': u'ARCO'}
{u'count': 5, u'_id': u'World Oil'}
{u'count': 5, u'_id': u'Fair Oaks 76'}
{u'count': 5, u'_id': u'Willow Cove Gas'}
{u'count': 5, u'_id': u'Alliance Gasoline'}
{u'count': 5, u'_id': u'Westmore Chevron'}
{u'count': 5, u'_id': u'Shell Gas'}
{u'count': 5, u'_id': u'Conoco Phillips 76'}
{u'count': 5, u'_id': u'Union 76'}
{u'count': 5, u'_id': u'SAP Vehicles Network Demo - Gas Station'}
{u'count': 5, u'_id': u"Ranier's Service Station"}
{u'count': 5, u'_id': u'Alliance'}
{u'count': 5, u'_id': u'Rotten Robbie'}


#### 10 most common fast food chains ####

In [24]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "fast_food", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
            {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 84, u'_id': u'Subway'}
{u'count': 42, u'_id': u"McDonald's"}
{u'count': 30, u'_id': u'Taco Bell'}
{u'count': 30, u'_id': u"Togo's"}
{u'count': 18, u'_id': u'KFC'}
{u'count': 18, u'_id': u'Round Table Pizza'}
{u'count': 18, u'_id': u'Burger King'}
{u'count': 12, u'_id': u'Jack in the Box'}
{u'count': 12, u'_id': u'Jamba Juice'}
{u'count': 12, u'_id': u'In-N-Out Burger'}


#### Hospitals ####

In [29]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "hospital", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 5, u'_id': u'Grant Cuesta Sub Acute Rehabilitation Center'}
{u'count': 5, u'_id': u'VA Palo Alto Health Care System'}
{u'count': 5, u'_id': u'VA Medical Center Menlo Park'}
{u'count': 5, u'_id': u'El Camino Hospital'}
{u'count': 5, u'_id': u'PAMF Menlo Park Surgical Hospital'}
{u'count': 5, u'_id': u'Kaiser Permanente'}
{u'count': 5, u'_id': u'Kaiser Permanente Santa Clara Medical Center'}
{u'count': 5, u'_id': u'Camino Medical Group'}
{u'count': 5, u'_id': u'Palo Alto Medical Foundation'}
{u'count': 5, u'_id': u'Health Services'}


#### Beauty Salon ####

In [30]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "beauty", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 8, u'_id': u'Salon Elizabeth'}
{u'count': 5, u'_id': u'Salon 121'}


#### Libraries (public bookcases) ####

In [31]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "public_bookcase", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 13, u'_id': u'Little Free Library'}


#### 10 most popular schools ####

In [26]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "school", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 12, u'_id': u'Union Academy'}
{u'count': 12, u'_id': u'Pinewood School'}
{u'count': 12, u'_id': u'Laurelwood Elementary School'}
{u'count': 12, u'_id': u'Stratford School'}
{u'count': 12, u'_id': u'Palo Verde Elementary School'}
{u'count': 12, u'_id': u'Kumon'}
{u'count': 12, u'_id': u'Athena Academy'}
{u'count': 12, u'_id': u'Ohlone Elementary School'}
{u'count': 12, u'_id': u'Lucille M Nixon Elementary School'}
{u'count': 9, u'_id': u'Jane Lathrop Stanford Middle School'}


#### 10 most popular parking areas ####

In [27]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "parking", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 78, u'_id': u"Visitor's Parking"}
{u'count': 63, u'_id': u'Apartment Visitor Parking'}
{u'count': 18, u'_id': u'Employee Parking'}
{u'count': 18, u'_id': u'Customer Parking'}
{u'count': 12, u'_id': u'Lot 8'}
{u'count': 12, u'_id': u'Lot 6'}
{u'count': 12, u'_id': u'Sunnyvale Caltrain Station'}
{u'count': 12, u'_id': u'Lot 7'}
{u'count': 12, u'_id': u'Lot 4'}
{u'count': 12, u'_id': u'Lot 1'}


#### Car washes ####

In [34]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "car_wash", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 5, u'_id': u'SV Express'}
{u'count': 5, u'_id': u'Clear Water Car Wash'}
{u'count': 5, u'_id': u'Thrifty'}
{u'count': 5, u'_id': u'Lozano Brushless Car Wash'}
{u'count': 5, u'_id': u"Lozano's Car Wash"}
{u'count': 5, u'_id': u'Car Wash'}
{u'count': 5, u'_id': u'Shell'}
{u'count': 5, u'_id': u'Bubbles Hand Wash'}


#### Post boxes ####

In [35]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "post_box", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 5, u'_id': u'Mailbox'}


#### 10 most popular coffee shops ####

In [29]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "amenity": "cafe", "name":{"$exists":1}}},
            {"$group":{"_id":"$name", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 126, u'_id': u'Starbucks'}
{u'count': 18, u'_id': u'Starbucks Coffee'}
{u'count': 18, u'_id': u'Tea Era'}
{u'count': 12, u'_id': u"Peet's Coffee & Tea"}
{u'count': 12, u'_id': u'Peets Coffee'}
{u'count': 12, u'_id': u"Peet's Coffee"}
{u'count': 12, u'_id': u'Cloud Cafe'}
{u'count': 12, u'_id': u'Philz Coffee'}
{u'count': 6, u'_id': u'Threads'}
{u'count': 6, u'_id': u'Dana Street Roasting Company'}




#### Top 10 contributors of data ####

In [31]:
pipeline =[{"$group":{"_id":"$created.user", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r


{u'count': 912096, u'_id': u'RichRico'}
{u'count': 848559, u'_id': u'ediyes'}
{u'count': 721083, u'_id': u'samely'}
{u'count': 698475, u'_id': u'karitotp'}
{u'count': 526563, u'_id': u'calfarome'}
{u'count': 420660, u'_id': u'oldtopos'}
{u'count': 361137, u'_id': u'dannykath'}
{u'count': 316620, u'_id': u'n76'}
{u'count': 297894, u'_id': u'Luis36995'}
{u'count': 225096, u'_id': u'matthieun'}


#### Top 10 versions of contributed data ####

In [39]:
pipeline =[{"$group":{"_id":"$created.version", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 4786041, u'_id': u'1'}
{u'count': 1625298, u'_id': u'2'}
{u'count': 290949, u'_id': u'3'}
{u'count': 96186, u'_id': u'4'}
{u'count': 37587, u'_id': u'5'}
{u'count': 21669, u'_id': u'6'}
{u'count': 13233, u'_id': u'7'}
{u'count': 7656, u'_id': u'8'}
{u'count': 5469, u'_id': u'9'}
{u'count': 3681, u'_id': u'10'}


#### Top 10 timestamps at which data was contributed ####

In [None]:
pipeline =[{"$group":{"_id":"$created.timestamp", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}]
result = collection.aggregate(pipeline)
for r in result:
    print r

#### Highway types in Mountain View ####

In [None]:
pipeline =[{"$group":{"_id":"$highway", "count":{"$sum":1}}},
            {"$sort":{"count":-1}},
           {"$limit": 10}
           ]
result = collection.aggregate(pipeline)
for r in result:
    print r

#### Exit destinations (`exit_to`) in Mountain View ####

In [44]:
pipeline =[{"$group":{"_id":"$exit_to", "count":{"$sum":1}}},
            {"$sort":{"count":-1}}
           ]
result = collection.aggregate(pipeline)
for r in result:
    print r

{u'count': 6901788, u'_id': None}
{u'count': 18, u'_id': u'Mathilda Avenue South'}
{u'count': 12, u'_id': u'El Monte Road;Moody Road'}
{u'count': 12, u'_id': u'US-101 South;Mathilda Avenue'}
{u'count': 12, u'_id': u'Middlefield Road;Maude Avenue'}
{u'count': 12, u'_id': u'Mathilda Avenue North'}
{u'count': 12, u'_id': u'SR-82 South;Sunnyvale;El Camino Real'}
{u'count': 12, u'_id': u'Moffett Boulevard;NASA Parkway'}
{u'count': 12, u'_id': u'Ellis Street;Moffett South Gate'}
{u'count': 12, u'_id': u'SR-85 South;Cupertino;Santa Cruz'}
{u'count': 12, u'_id': u'Evelyn Avenue'}
{u'count': 12, u'_id': u'SR-82 North;Mountain View;El Camino Real'}
{u'count': 12, u'_id': u'US-101 North'}
{u'count': 12, u'_id': u'Fair Oaks Avenue'}
{u'count': 12, u'_id': u'University Avenue'}
{u'count': 12, u'_id': u'Rengstorff Avenue'}
{u'count': 6, u'_id': u'SR-237 East;Alviso;Milpitas'}
{u'count': 6, u'_id': u'Moffett Boulevard'}
{u'count': 6, u'_id': u'US-101 South;San Jose'}
{u'count': 6, u'_id': u'Homestead

### Other ideas about the dataset
The data contains inconsistently formatted phone numbers such as "650-322-2554" and "+16508570333". When data is collected from users, phone numbers should follow the format used in the given country or area, which is generally the ITU E.123 standard:
* "+"
* the national code (1 for the USA)
* space
* the area/regional code
* space
* the local exchange
* space
* the local number
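The format listed above can be sketched as a small normalization function. This is a minimal sketch under stated assumptions: it handles only US numbers with ten national digits, the function name `format_phone_e123` is hypothetical, and the code is written for Python 3. It was not run against the dataset.

```python
import re


def format_phone_e123(raw, country_code="1"):
    """Return raw as '+1 AAA EEE NNNN', or None if it is not a 10-digit US number."""
    digits = re.sub(r'\D', '', raw)  # keep digits only
    # Drop a leading country code if the number includes one.
    if len(digits) == 10 + len(country_code) and digits.startswith(country_code):
        digits = digits[len(country_code):]
    if len(digits) != 10:
        return None
    return "+{0} {1} {2} {3}".format(country_code, digits[:3], digits[3:6], digits[6:])


samples = ["650-322-2554", "+16508570333", "1 (773)-271-5176", "12345"]
formatted = [format_phone_e123(s) for s in samples]
print(formatted)
```

Numbers that cannot be interpreted return `None`, so they can be reviewed by hand instead of being silently rewritten.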

Some node fields, such as county, are missing; data collection should follow a structured format. This is a minor issue, however, because a NoSQL database can handle unstructured data well.

### Conclusion
After reviewing the Mountain View data, a good deal of information about the city has been extracted. The data has been cleaned well enough for this purpose. However, I could not find any information about the companies and startups established in Mountain View. If buildings were tagged with the names of the companies and startups that occupy them, it would be much easier to learn how many companies are established in the city.