### Part 1: Description and dealing with the data
This report is about wrangling street map data of Saint Paul in Minnesota, USA.
I downloaded the data set from [Minneapolis/Saint Paul](https://mapzen.com/data/metro-extracts/metro/minneapolis-saint-paul_minnesota/).

Now I get the data, what I need to do first is to get the probably understand of this data set, I can see the data structure from the wiki [Here](https://wiki.openstreetmap.org/wiki/OSM_XML).

There are three main elements in this data set, they are *node*, *way* and *relation*. Also, the detailed description can be seen on this [wiki site](http://wiki.openstreetmap.org/wiki/Elements).

With the knowledge of the data, then I should do is to deal with these data in order to get addressable format for MongoDB.

In [1]:
#import the needable packages
import xml.etree.cElementTree as ET
import pprint
import re
import codecs
import json

FILENAME = 'minneapolis-saint-paul_minnesota.osm'

lower = re.compile(r'^([a-z]|_)*$')

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

mapping = { "St": "Street",
            "St.": "Street",
            "Rd": "Road",
            "Rd.": "Road",
            "Ave": "Avenue",
            "Ave.": "Avenue"
            }

CREATED = ["version", "changeset", "timestamp", "user", "uid"]

#  Get the mininum and maxinum latitude and longitude of this map data for checking the input data's validity.
MINLAT = 44.471
MINLON = -94.013
MAXLAT = 45.415
MAXLON = -92.543

In [2]:
#this is a function of judging if a value is float type or not 
def isfloat(value):
    try:
        float(value)    
        return True
    except:
        return False

In [3]:
#this is a function of updating the street name into the same ending.
def update_name(name, mapping):
    split_name = name.split(' ')
    name = ""
    for word in split_name:
        if word in mapping:
            name += mapping[word]
            break
        name += word
        name += " "
    return name

In [4]:
#this is a function of dealing with the node refs which exists only under the 'way' tag
def dealing_node_refs(element):
    node_refs = []
    for nd in element.iter("nd"):
        node_refs.append(nd.attrib["ref"])

    return node_refs

In [5]:
#this is a function of dealing with the created information
def dealing_created(element):
    created = {}
    for tag in CREATED:
        if tag in element.attrib:
            created[tag] = element.attrib[tag]

    return created

In [6]:
#this is a function of judging if the tag under address is just containing alphabet letters
def is_correct_tag(tag_name):
    tag_name = tag_name.lower()
    if lower.search(tag_name):
        return True

    return False

In [None]:
# check the element's location in a right place, if the location is wrong, then I will skip this element.
def is_right_location(element):
    if "lat" in element.attrib and "lon" in element.attrib:
        lat = element.attrib["lat"]
        lon = element.attrib["lon"]
        if isfloat(lat) and isfloat(lon):
            if MINLAT <= float(lat) <= MAXLAT and MINLON <= float(lon) <= MAXLON:
                return True
    
    return False             

In [11]:
#this is a function of dealing with the member information under the relation element
def dealing_member(element):
    data = []
    for mem in element.iter("member"):
        member = {}
        if "type" in mem.attrib:
            member["type"] = mem.attrib["type"]
        if "ref" in mem.attrib:
            member["ref"] = mem.attrib["ref"]
        if "role" in mem.attrib:
            member["role"] = mem.attrib["role"]
        data.append(member)
    return data

In [3]:
def get_multiple_location(txtfile):
    location = []
    with open(txtfile, 'r') as f:
        for line in f.readline():
            pos = line.split(',')
            location.append(pos[0]+"_"+pos[1])
    
    return location

In [None]:
def is_right_postcode(postcode):
    if isfloat(postcode):
        return True
    else:
        return False

In [8]:
def shape_element(element, multilocations):
    node = {}
    if element.tag == "node" or element.tag == "way" or element.tag == "relation":  
        #make the result smaller, if these is no tag information under a element, I think it's useless
        if element.find("tag") is None:
            return None 
        
        pos = []    
        # if the lat or lon information is absence, then the pos information should not be recorded.
        # if the location is multiple, then I skip this element, (or just keep the first element)
        if is_right_location(element):
            lat = element.attrib["lat"]
            lon = element.attrib["lon"]
            if (lat + "_" + lon) in multilocations:
                return None
            pos = [float(lat), float(lon)]
            node["pos"] = pos
        else:
            return None
        
        if "id" in element.attrib:
            node["id"] = element.attrib["id"]
          
        node["tag"] = element.tag

        if "visible" in element.attrib:
            node["visible"] = element.attrib["visible"]

        created = dealing_created(element)
        if len(created) > 0:
            node["created"] = created        
 
        address = {}
        for tag in element.iter("tag"):
            if tag.attrib['k'] == "addr:street":
                address["street"] = update_name(tag.attrib['v'], mapping)
            else:
                split_k = tag.attrib['k'].split(":")
                ###let the key in dictionary be lower letter. 
                ###If there is a second ":" that separates the type/direction of a street, then the tag will be ignored.
                if len(split_k) == 1 and is_correct_tag(split_k[0]):
                    if split_k[0].lower() == "postcode" and is_right_postcode(tag.attrib['v']):
                        node["postcode"] = tag.attrib['v']
                    else:
                        node[split_k[0].lower()] = tag.attrib['v']
                elif len(split_k) == 2 and split_k[0] == 'addr' and is_correct_tag(split_k[1]):
                    address[split_k[1].lower()] = tag.attrib['v']

        if len(address) > 0:
            node["address"] = address

        if element.tag == "way":
            node_refs = dealing_node_refs(element)
            if len(node_refs) > 0:
                node["node_refs"] = node_refs
        
        if element.tag == "relation":
            member = dealing_member(element)
            if len(member) > 0:
                node["member"] = member
          
        return node
    else:
        return None

In [9]:
def process_map(file_in, pretty = False):
    file_out = "{0}.json".format(file_in)
    data = []
    count = 0
    
    #read the multiple places, compare every location if it has been in this string, then just skip this element
    multilocations = get_multiple_location('multiple-location.txt')
    
    with codecs.open(file_out, "w") as fo:
        fo.write("[")
        for _, element in ET.iterparse(file_in):
            # when I am testing data in my local enviroment, I let the dataset be smaller
#             if count > 10000:
#                 break
            el = shape_element(element, multilocations)
            if el:
                #data.append(el) the format of json file is very important
                if pretty:
                    fo.write(json.dumps(el, indent=2)+"\n")
                else:
                    count += 1
                    if count == 1:
                        fo.write(json.dumps(el))
                    else:
                        fo.write("," + "\n" + json.dumps(el))
        fo.write("]")
    ##return data

In [12]:
# generate the josn format data which can be insert into a mongodb example
process_map(FILENAME, False)

### Conclusion of Part 1:
As the code shows, I have added notes. While several tips I want to add are as follows: 

1.I find the part 1 and part 2 should be done alternately. When I clean the data by the first time, then I can insert these data into a MongoDB example, I can select some calculated results. Then I will find some problems with my cleaned data, for example, some wrong format postcode, a single location is multiple for several points, then I went back to clean the dataset by second time.

2.I notice that there is a tag providing BUS information, if there is more information about which bus stopping, that will be better, because people will know how they can go to this place from other place by bus. While it indeed brings work for data engineers, because they need to calculate the best route from one place to another.

If there is more information such as open and close time, it will be more accurate for recommending.

### Part 2: Select interesting result from the dataset
In this part, I connected MongoDB locally on my computer first, then I made several pipeline to select interesting results from the dataset. select the number of unique users, select the number of different tags, select the number of different type of building, then sort it with descend order 

In [2]:
import json
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.examples

In [41]:
# insert the data into a collection
with open('minneapolis-saint-paul_minnesota.osm.json') as f:
    data = json.loads(f.read())
    for a in data:
        db.saint.insert_one(a)

In [3]:
def different_postcode():
    pipeline = [{"$match":{"address.postcode":{"$ne": None}}},
                {"$group":{"_id":"$address.postcode", "count":{"$sum":1}}}]
    return pipeline

In [6]:
pipeline = different_postcode()
result = aggregate(db, pipeline)

#### find some postcode like '55108-1003','MN 55430', while the number of these postcode is small, so what I need to do is to skip these postcode

In [47]:
def is_location_unique():
    pipeline = [{"$group":{"_id":"$pos", "count":{"$sum":1}}},
               {"$match":{"count":{"$gt":1}}},
               {"$group":{"_id":None, "count":{"$sum":1}}}]
    return pipeline

In [49]:
pipeline = is_location_unique()
result = aggregate(db, pipeline)
result

[{u'_id': None, u'count': 49}]

In [44]:
def multiple_user_location():
    pipeline = [ {"$match":{"pos":{"$ne":None}}},
                {"$group":{"_id":{"uid": "$created.uid", "pos": "$pos"}, "count":{"$sum":1}}},
                {"$match":{"count":{"$gt":1}}},
                {"$group":{"_id":"$_id.uid", "count":{"$sum":"$count"}}},
                {"$sort":{"count":-1}},
                {"$limit": 10}]
    return pipeline

In [None]:
pipeline = multiple_user_location()
result = aggregate(db, pipeline)
result

In [50]:
def find_multiple_location():
    pipeline = [{"$group":{"_id":"$pos", "count":{"$sum":1}}},
               {"$match":{"count":{"$gt":1}}},
               {"$project":{"pos":"$pos"}}]
    return pipeline

In [51]:
pipeline = find_multiple_location()
result = aggregate(db, pipeline)

In [52]:
count = 0
file_out = "multiple-location.txt"
with codecs.open(file_out, "w") as fo:
    for pos in result:
        if pos['_id']:
            fo.write(str(pos['_id'][0]) + "," + str(pos['_id'][1]) + "\n")

In [46]:
def unique_users():
    pipeline = [{"$match":{"created.uid":{"$ne": None}}},
                {"$group":{"_id":"$created.uid", "count":{"$sum":1}}},
                {"$group":{"_id":None, "count":{"$sum":1}}}]
    return pipeline

In [42]:
#calculate the number of element
db.saint.find().count()

406974

In [7]:
def aggregate(db, pipeline):
    return [doc for doc in db.saint.aggregate(pipeline)]

In [55]:
#calculate the number of unique users
def users_count():
    pipeline = [{"$match":{"created.uid":{"$ne": None}}},
                {"$group":{"_id":"$created.uid", "count":{"$sum":1}}},
                {"$sort":{"count":-1}},
                {"$limit": 10}]
    return pipeline

In [None]:
pipeline = users_count()
result = aggregate(db, pipeline)
result

In [25]:
pipeline = unique_users()

In [26]:
result = aggregate(db, pipeline)

In [20]:
#calculate the number of different tags
def different_tags():
    pipeline = [{"$group":{"_id":"$tag", "count":{"$sum":1}}}]
    return pipeline

In [28]:
pipeline = different_tags()

In [29]:
result = aggregate(db, pipeline)

In [21]:
#calculate the number of different type of building, then sort it with descend order
def different_highways():
    pipeline = [{"$match":{"highway":{"$ne": None}}},
                {"$group":{"_id":"$highway", "count":{"$sum":1}}},
                {"$sort":{"count":-1}}]
    return pipeline

In [31]:
pipeline = different_highways()

In [32]:
result = aggregate(db, pipeline)

In [22]:
#I want to know which way has the most refence node, then I calculated it and sort it
def get_node_refs():
    pipeline = [{"$match":{"tag":{"$eq": "way"}, "node_refs":{"$ne": None}}},
                {"$unwind":"$node_refs"},
                {"$group":{"_id":"$id", "count":{"$sum":1}}},
                {"$sort":{"count":-1}},
                {"$limit": 10}]
    return pipeline

In [34]:
pipeline = get_node_refs()

In [35]:
result = aggregate(db, pipeline)

In [37]:
#find a certain node from the dataset
def search_certain_node(node_id):
    pipeline = [{"$match":{"id":{"$eq": node_id}}}]
    return pipeline

In [41]:
pipeline = search_certain_node("130711268")

In [42]:
result = aggregate(db, pipeline)

### Conclusion of part 2:
As the results shown, there are 416975 records after dealing with the data. The reason why there are so less records is that I skipped those records whose tag is None, because I think those records are useless. 
 
Among those useful records, I calculated the number of unique uid, 1399, which means there are 1399 users. When I select the numbers of different tags, I find that the way tag is the most, while where still are several ids with None information. From the selection pipeline different_highways, the most building are used for residential, next is for service.

