### Part 1: Description and dealing with the data
This report is about wrangling street map data for Saint Paul, Minnesota, USA.
I downloaded the data set from [Minneapolis/Saint Paul](https://mapzen.com/data/metro-extracts/metro/minneapolis-saint-paul_minnesota/).

Having obtained the data, the first step is to get a rough understanding of the data set. Its structure is described on the OpenStreetMap wiki [here](https://wiki.openstreetmap.org/wiki/OSM_XML).

There are three main elements in this data set: *node*, *way* and *relation*. A detailed description can be found on this [wiki site](http://wiki.openstreetmap.org/wiki/Elements).
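
To get a first feel for these three element types, one option is to count them while streaming through the file. The sketch below is illustrative only: `SAMPLE_OSM` is a hypothetical miniature extract and `count_elements` is a helper name introduced here, not part of the report's pipeline.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature OSM extract, used only for illustration.
SAMPLE_OSM = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <node id="1" lat="44.95" lon="-93.09"/>
  <node id="2" lat="44.96" lon="-93.10">
    <tag k="highway" v="bus_stop"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
  <relation id="100">
    <member type="way" ref="10" role="outer"/>
  </relation>
</osm>
"""

def count_elements(source):
    """Count how often each main OSM element type occurs in the source."""
    counts = Counter()
    # iterparse streams the document, so the whole file is never parsed at once
    for _, element in ET.iterparse(source, events=("end",)):
        if element.tag in ("node", "way", "relation"):
            counts[element.tag] += 1
    return counts

element_counts = count_elements(io.BytesIO(SAMPLE_OSM))
print(dict(element_counts))
```

The same function should also accept a file name such as `FILENAME` instead of the in-memory sample.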

With this knowledge of the data, the next step is to process it into a format that can be loaded into MongoDB.

In [1]:
# import the required packages
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json

FILENAME = 'minneapolis-saint-paul_minnesota.osm'

lower = re.compile(r'^([a-z]|_)*$')

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

mapping = { "St": "Street",
            "St.": "Street",
            "Rd": "Road",
            "Rd.": "Road",
            "Ave": "Avenue",
            "Ave.": "Avenue"
            }

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

In [2]:
# return True if the value can be converted to a float, False otherwise
def isfloat(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

In [3]:
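# expand abbreviated street types (e.g. "St" -> "Street") so that street names are consistent
def update_name(name, mapping):
    words = name.split(' ')
    # keep the words after an abbreviation (e.g. "Snelling Ave N") instead of dropping them;
    # the first word is never a street type, so a leading "St" (Saint) is left alone
    fixed = [mapping[word] if i > 0 and word in mapping else word
             for i, word in enumerate(words)]
    return " ".join(fixed)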
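
Before deciding which abbreviations belong in `mapping`, it can help to list the street endings that actually occur. The sketch below reuses the `street_type_re` pattern from the first cell. The `EXPECTED` set, the `audit_street_types` helper and the sample names are assumptions made for illustration, not results from the data set.

```python
import re
from collections import defaultdict

# Same pattern as street_type_re in the first cell: the last token of the name.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical set of endings that are already considered correct.
EXPECTED = {"Street", "Avenue", "Road"}

def audit_street_types(street_names):
    """Group street names by their last word, skipping the expected endings."""
    unexpected = defaultdict(set)
    for name in street_names:
        match = street_type_re.search(name)
        if match and match.group() not in EXPECTED:
            unexpected[match.group()].add(name)
    return unexpected

# Illustrative names only.
sample = ["Grand Ave", "Summit Avenue", "Selby Ave.", "Snelling Ave N"]
audit = audit_street_types(sample)
for ending, names in sorted(audit.items()):
    print(ending, sorted(names))
```

A name such as "Snelling Ave N" shows that the abbreviation is not always the last word, which is worth keeping in mind when normalizing.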

In [4]:
# collect the node references, which exist only under 'way' elements
def dealing_node_refs(element):
    node_refs = []
    for nd in element.iter("nd"):
        node_refs.append(nd.attrib["ref"])

    return node_refs

In [5]:
# collect the creation metadata listed in CREATED (version, changeset, timestamp, user, uid)
def dealing_created(element):
    created = {}
    for tag in CREATED:
        if tag in element.attrib:
            created[tag] = element.attrib[tag]

    return created

In [6]:
# check whether a tag key, once lowercased, contains only letters and underscores
def is_correct_tag(tag_name):
    tag_name = tag_name.lower()
    if lower.search(tag_name):
        return True

    return False

In [11]:
# collect the member information (type, ref, role) under a relation element
def dealing_member(element):
    data = []
    for mem in element.iter("member"):
        member = {}
        if "type" in mem.attrib:
            member["type"] = mem.attrib["type"]
        if "ref" in mem.attrib:
            member["ref"] = mem.attrib["ref"]
        if "role" in mem.attrib:
            member["role"] = mem.attrib["role"]
        data.append(member)
    return data

In [8]:
def shape_element(element):
    node = {}
    if element.tag == "node" or element.tag == "way" or element.tag == "relation":  
        # keep the result small: an element without any tag child carries no descriptive information, so skip it
        if element.find("tag") is None:
            return None 
        
        if "id" in element.attrib:
            node["id"] = element.attrib["id"]
          
        node["tag"] = element.tag

        if "visible" in element.attrib:
            node["visible"] = element.attrib["visible"]

        created = dealing_created(element)
        if len(created) > 0:
            node["created"] = created

        pos = []    
        # record pos only when both lat and lon are present and numeric
        if "lat" in element.attrib and "lon" in element.attrib:
            lat = element.attrib["lat"]
            lon = element.attrib["lon"]
            if isfloat(lat) and isfloat(lon):
                pos = [float(lat), float(lon)]
                node["pos"] = pos
 
        address = {}
        for tag in element.iter("tag"):
            if tag.attrib['k'] == "addr:street":
                address["street"] = update_name(tag.attrib['v'], mapping)
            else:
                split_k = tag.attrib['k'].split(":")
                # store the keys in lowercase
                # keys with a second ":" (e.g. one that separates the type or direction of a street) are ignored
                if len(split_k) == 1 and is_correct_tag(split_k[0]):
                    node[split_k[0].lower()] = tag.attrib['v']
                elif len(split_k) == 2 and split_k[0] == 'addr' and is_correct_tag(split_k[1]):
                    address[split_k[1].lower()] = tag.attrib['v']

        if len(address) > 0:
            node["address"] = address

        if element.tag == "way":
            node_refs = dealing_node_refs(element)
            if len(node_refs) > 0:
                node["node_refs"] = node_refs
        
        if element.tag == "relation":
            member = dealing_member(element)
            if len(member) > 0:
                node["member"] = member
          
        return node
    else:
        return None

In [9]:
def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    count = 0
    with codecs.open(file_out, "w") as fo:
        # the output must be one valid JSON array: "[" + comma-separated documents + "]"
        fo.write("[")
        for _, element in ET.iterparse(file_in):
            # when testing in my local environment, I limited the size of the dataset
#             if count > 10000:
#                 break
            el = shape_element(element)
            if el:
                count += 1
                # write the separator before every document except the first,
                # in both modes, so that the pretty output is valid JSON too
                if count > 1:
                    fo.write("," + "\n")
                if pretty:
                    fo.write(json.dumps(el, indent=2))
                else:
                    fo.write(json.dumps(el))
        fo.write("]")

In [12]:
# generate the JSON data that can be inserted into MongoDB
process_map(FILENAME, False)

### Conclusion of Part 1:
The code above is commented to explain each step. I would also like to add a few suggestions:

I noticed that some tags provide bus information. It would be better if they also recorded which buses stop there, because people would then know how to reach a place from elsewhere by bus. This does create extra work for data engineers, who would need to calculate the best route from one place to another.

If more information such as opening and closing times were available, recommendations could be more accurate.

### Part 2: Select interesting result from the dataset
In this part, I first connected to a MongoDB instance running locally on my computer, then built several pipelines to select interesting results from the dataset: the number of unique users, the number of records for each element type, the number of records for each highway type sorted in descending order, and the ways with the most node references.

In [13]:
import json
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.examples

In [14]:
# insert the data into a collection
with open('minneapolis-saint-paul_minnesota.osm.json') as f:
    data = json.loads(f.read())
    for a in data:
        db.saint.insert_one(a)
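
Inserting one document per call means one round trip to the server for each of several hundred thousand records. As a possible alternative, the sketch below groups the documents into batches and passes each batch to PyMongo's `insert_many`. The helper name `insert_in_batches` and the batch size of 1000 are assumptions for illustration, and this variant was not used to produce the results in this report.

```python
def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents with insert_many, batch_size documents at a time."""
    batch = []
    inserted = 0
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted

# Possible usage with the objects defined above (requires a running MongoDB):
# insert_in_batches(db.saint, data)
```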

In [15]:
# count the documents in the collection
db.saint.find().count()

416975

In [18]:
def aggregate(db, pipeline):
    return [doc for doc in db.saint.aggregate(pipeline)]

In [19]:
#calculate the number of unique users
def unique_users():
    pipeline = [{"$match":{"created.uid":{"$ne": None}}},
                {"$group":{"_id":"$created.uid", "count":{"$sum":1}}},
                {"$group":{"_id":None, "count":{"$sum":1}}}]
    return pipeline

In [25]:
pipeline = unique_users()

In [26]:
result = aggregate(db, pipeline)

In [27]:
result[0]

{u'_id': None, u'count': 1399}
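
The pipeline groups twice: the first `$group` creates one bucket per uid, and the second `$group` counts those buckets. A plain Python equivalent on a few hypothetical documents (shaped like the output of `shape_element`, not taken from the data set) may make the logic easier to follow:

```python
from collections import Counter

# Hypothetical documents, for illustration only.
docs = [
    {"id": "1", "created": {"uid": "100"}},
    {"id": "2", "created": {"uid": "100"}},
    {"id": "3", "created": {"uid": "200"}},
    {"id": "4"},  # no "created" field: dropped by the $match stage
]

# $match: keep the documents that have a created.uid value
uids = [d["created"]["uid"] for d in docs
        if d.get("created", {}).get("uid") is not None]

# first $group: one bucket per uid, holding that user's document count
per_user = Counter(uids)

# second $group: count the buckets, i.e. the distinct users
unique_user_count = len(per_user)
print(unique_user_count)
```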

In [20]:
# count the documents for each element type (node, way, relation)
def different_tags():
    pipeline = [{"$group":{"_id":"$tag", "count":{"$sum":1}}}]
    return pipeline

In [28]:
pipeline = different_tags()

In [29]:
result = aggregate(db, pipeline)

In [30]:
result

[{u'_id': None, u'count': 10001},
 {u'_id': u'node', u'count': 71270},
 {u'_id': u'relation', u'count': 2931},
 {u'_id': u'way', u'count': 332773}]

In [21]:
# count the documents for each highway type, sorted in descending order
def different_highways():
    pipeline = [{"$match":{"highway":{"$ne": None}}},
                {"$group":{"_id":"$highway", "count":{"$sum":1}}},
                {"$sort":{"count":-1}}]
    return pipeline

In [31]:
pipeline = different_highways()

In [32]:
result = aggregate(db, pipeline)

In [33]:
result

[{u'_id': u'residential', u'count': 67906},
 {u'_id': u'service', u'count': 51939},
 {u'_id': u'bus_stop', u'count': 14857},
 {u'_id': u'footway', u'count': 14473},
 {u'_id': u'turning_circle', u'count': 13290},
 {u'_id': u'tertiary', u'count': 9087},
 {u'_id': u'path', u'count': 5280},
 {u'_id': u'traffic_signals', u'count': 3957},
 {u'_id': u'motorway_link', u'count': 2877},
 {u'_id': u'secondary', u'count': 2622},
 {u'_id': u'cycleway', u'count': 2182},
 {u'_id': u'crossing', u'count': 2146},
 {u'_id': u'motorway', u'count': 1809},
 {u'_id': u'primary', u'count': 1093},
 {u'_id': u'unclassified', u'count': 782},
 {u'_id': u'motorway_junction', u'count': 691},
 {u'_id': u'track', u'count': 584},
 {u'_id': u'trunk', u'count': 513},
 {u'_id': u'stop', u'count': 483},
 {u'_id': u'steps', u'count': 449},
 {u'_id': u'tertiary_link', u'count': 397},
 {u'_id': u'trunk_link', u'count': 307},
 {u'_id': u'secondary_link', u'count': 301},
 {u'_id': u'primary_link', u'count': 215},
 {u'_id': u'p

In [22]:
# find the ways with the most node references: count the references per way and sort in descending order
def get_node_refs():
    pipeline = [{"$match":{"tag":{"$eq": "way"}, "node_refs":{"$ne": None}}},
                {"$unwind":"$node_refs"},
                {"$group":{"_id":"$id", "count":{"$sum":1}}},
                {"$sort":{"count":-1}},
                {"$limit": 10}]
    return pipeline

In [34]:
pipeline = get_node_refs()

In [35]:
result = aggregate(db, pipeline)

In [36]:
result

[{u'_id': u'130711268', u'count': 1999},
 {u'_id': u'40127978', u'count': 1544},
 {u'_id': u'147571320', u'count': 1516},
 {u'_id': u'130724182', u'count': 1429},
 {u'_id': u'40128078', u'count': 1328},
 {u'_id': u'415163548', u'count': 1230},
 {u'_id': u'47489483', u'count': 1091},
 {u'_id': u'136796601', u'count': 1089},
 {u'_id': u'60497361', u'count': 973},
 {u'_id': u'169168105', u'count': 969}]
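
Since `node_refs` is stored as an array, its length can also be read directly with the `$size` aggregation operator, without unwinding every reference into its own document. The variant below is a sketch under that assumption. `get_node_refs_by_size` is a name introduced here, and its result documents carry the way id in an `id` field rather than in `_id`.

```python
def get_node_refs_by_size(limit=10):
    """Hypothetical variant of get_node_refs that avoids $unwind."""
    return [
        # only ways that actually carry a node_refs array
        {"$match": {"tag": "way", "node_refs": {"$exists": True}}},
        # $size returns the length of the array
        {"$project": {"_id": 0, "id": 1, "count": {"$size": "$node_refs"}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]

# Possible usage with the helper defined above (requires a running MongoDB):
# result = aggregate(db, get_node_refs_by_size())
```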

In [37]:
# find a certain element in the dataset by its id
def search_certain_node(node_id):
    pipeline = [{"$match":{"id":{"$eq": node_id}}}]
    return pipeline

In [41]:
pipeline = search_certain_node("130711268")

In [42]:
result = aggregate(db, pipeline)

In [43]:
result

[{u'_id': ObjectId('5842590d24781100a435f995'),
  u'created': {u'changeset': u'37972026',
   u'timestamp': u'2016-03-21T10:02:35Z',
   u'uid': u'145231',
   u'user': u'woodpeck_repair',
   u'version': u'7'},
  u'id': u'130711268',
  u'natural': u'water',
  u'node_refs': [u'1438972785',
   u'1438758393',
   u'1438154732',
   u'1438154733',
   u'1438154719',
   u'1438154737',
   u'1438154735',
   u'1438154751',
   u'1438154745',
   u'1438154746',
   u'1438154744',
   u'1438758371',
   u'1438154738',
   u'1438154731',
   u'1438154726',
   u'1438154752',
   u'1438154759',
   u'1438154768',
   u'1438154716',
   u'1438154742',
   u'1438154721',
   u'1438154764',
   u'1438154717',
   u'1438154763',
   u'1438154722',
   u'1438154762',
   u'1438154724',
   u'1438154766',
   u'1438154729',
   u'1438154743',
   u'1438758662',
   u'1438758540',
   u'1438758647',
   u'1438758387',
   u'1438758424',
   u'1438758417',
   u'1438758366',
   u'1438758391',
   u'1438758535',
   u'1438758372',
   u'143875

### Conclusion of Part 2:
As the results show, there are 416975 records after processing the data. The reason there are so few records is that I skipped the elements that have no tag child, because I consider those records uninformative.
 
Among the remaining records, I counted 1399 unique uids, which means 1399 users contributed to this data. Grouping by element type shows that way is the most common (332773 records), while there are still 10001 records whose tag field is None. The different_highways pipeline shows that residential is the most common highway type (67906 records), followed by service (51939 records).

