# Data Wrangle OpenStreetMaps Data
## by Rica Enriquez, July 1, 2015
<p>In this project, the OpenStreetMap data for Cambridge, United Kingdom is explored. It was downloaded from https://mapzen.com/data/metro-extracts on July 1, 2015. It was prepared for MongoDB using nanoproject_2.py and then added into a local 'udacity' database as the 'cambridge' collection.</p>

### Import the database for querying below

In [1]:
from pymongo import MongoClient
import pprint
import os

client = MongoClient()
db = client["udacity"]

# Section 1. Problems Encountered in the Map
<p>There were a few problems with the street names. "chieftain" and "sweetpea" were not capitalized. This was fixed in "nanoproject_2_prep.py" using a mapping scheme similar to the "Improving Street Names" script from Lesson 6.11.</p>
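
<p>A minimal sketch of what such a mapping-based fix could look like. The dictionary name, the function name, and the example street strings are assumptions made for illustration; the actual code in "nanoproject_2_prep.py" is not shown in this report.</p>

```python
# Hypothetical mapping from the lowercase words mentioned above
# to their capitalized forms.
STREET_FIXES = {
    "chieftain": "Chieftain",
    "sweetpea": "Sweetpea",
}


def fix_street_name(name):
    """Replace each word found in STREET_FIXES and leave other words unchanged."""
    words = name.split(" ")
    return " ".join(STREET_FIXES.get(word, word) for word in words)


# Made-up example input, not taken from the dataset.
print(fix_street_name("sweetpea Way"))
```

<p>Only exact, whole-word matches are replaced here, so a name that is already capitalized passes through untouched.</p>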
<p>When importing the data, all labels were included. Many of these labels were only available for a small fraction of the documents. So in the following sections, underutilized upper labels and lower labels are removed. If a subset of the dataset is used in the future, labels may need to be removed more judiciously.</p>

## Removing underutilized labels
<p>Using MongoDB, only upper labels that appeared in at least 1000 documents were kept. The "nanoproject_2_prep.py" file prints the list of upper labels. This list is then looped through in "nanoproject_2_query.py" to check the number of documents each label is used in. If a label is used in fewer than 1000 documents, it is removed and the database is updated.</p>

In [2]:
all_labels = ['real_ale', 'fhrs', 'anglican', 'dance', 'sS052', 'demodified', 'maxspeed', 'smoking', 'openplaques',
                  'is_in', 'max_age', 'created_by', 'fax', 'cctv', 'mph', 'icao', 'automatic_door', 'motor_vehicle',
                  'school', 'level', 'notes', 'bus_stop', 'ncn_1', 'real_cider', 'disused', 'clothes', 'bicycle',
                  'cost',
                  'exit_to', 'leaf_type', 'access', 'fast_food', 'eligibility', 'water', 'tracks', 'address', 'hoops',
                  'used_to_be', 'permit_holders', 'microbrewery', 'survey', 'military', 'amenity', 'alt_name', 'fee',
                  'lwn',
                  'vehicle', 'have_riverbank', 'type', 'start_date', 'entrance', 'drinkable', 'club',
                  'campaigned_for_by',
                  'give_way', 'visibility', 'site', 'phone', 'traffic_calming', 'room', 'tunnel', 'det', 'roof', 'male',
                  'history', 'estate', 'lock', 'currency', 'ncn_ref', 'pond', 'species', 'information', 'monitoring',
                  'gate', 'uk_postcode_centroid', 'FIXME', 'description', 'alt_ref', 'hazard', 'leisure', 'date',
                  'rental',
                  'natural', 'lcn_ref', 'wheelchair', 'outdoor_seating', 'healthcare', 'patio', 'office', 'trade',
                  'postal_code', 'motorcycle', 'int_ref', 'pitch', 'covered', 'derelict', 'old_ref', 'junction', 'food',
                  'material', 'foot', 'tourism', 'smoothness', 'fixme', 'name', 'designation', 'osmarender',
                  'embankment',
                  'crossing', 'kerb', 'name_1', 'frequency', 'naptan', 'access_land', 'loc_name', 'network', 'bus_bay',
                  'highways_agency', 'ref', 'brewery', 'highway', 'barrier', 'post_box', 'cars', 'maxweight',
                  'electrified',
                  'was_called', 'old_amenity', 'accommodation', 'tenant', 'noexit', 'segregated', 'route', 'atm',
                  'box_type', 'turn', 'place', 'high_capacity', 'support', 'note', 'owner', 'horse', 'service',
                  'priority',
                  'motorcar', 'park_ride', 'enforcement', 'noname', 'est_width', 'artist_name', 'old_old_name', 'ncn',
                  'population', 'multi_storey', 'royal_cypher', 'aeroway', 'landuse', 'tracktype', 'builder', 'bridge',
                  'occupier', 'nqa', 'sidewalk', 'hgv', 'lit', 'takeaway', 'overall_site', 'payment', 'old_shop',
                  'aerodrome', 'url', 'medical', 'tactile_paving', 'shop', 'golf', 'indoor', 'social_facility',
                  'last_survey', 'gauge', 'mapillary', 'wood', 'fuel', 'iata', 'abutters', 'bracket_ref', 'tourist_bus',
                  'artist', 'motorboat', 'public_transport', 'power_source', 'automatic', 'int_name', 'locale',
                  'lamp_type',
                  'route_ref', 'parking', 'sport', 'power_supply', 'capacity', 'maxwidth', 'wikipedia', 'state',
                  'boundary',
                  'email', 'screen', 'denomination', 'key', 'substation', 'junction_ref', 'bar_billiards', 'railway',
                  'genus', 'comment', 'maintainer', 'wall', 'loading_gauge', 'outside_seating', 'recycling', 'height',
                  'ele', 'alt_description', 'boat', 'speech_output', 'mkgmap', 'waste', 'bicycle_parking', 'website',
                  'direction', 'lanes', 'building_1', 'craft', 'official_name', 'mail', 'grills', 'replaces', 'busway',
                  'parking_space', 'replaced', 'overtaking', 'layer', 'ons_code', 'backrest', 'telephone', 'surface',
                  'guided_busway', 'beer_garden', 'waterway', 'cuisine', 'education', 'surveillance',
                  'collection_times',
                  'status', 'wires', 'cyclestreets_id', 'fence_type', 'fruit', 'ownership', 'colour', 'contact',
                  'oneway',
                  'landmark', 'left', 'taxi', 'livestock', 'proposed', 'hour_off', 'not', 'voltage', 'seats',
                  'guest_house',
                  'isced', 'toilets', 'generator', 'TODO', 'bench', 'source', 'bollard', 'usage', 'emergency',
                  'historic',
                  'lcn', 'psv', 'furniture', 'vending', 'tower', 'internet_access', 'right', 'twitter', 'platforms',
                  'local_ref', 'man_made', 'religion', 'artwork_type', 'power', 'trees', 'Comment', 'incline',
                  'footway',
                  'industry', 'taxon', 'supervised', 'step_count', 'female', 'operator', 'area', 'unisex',
                  'opening_hours',
                  'museum', 'width', 'occupier3', 'occupier2', 'admin_level', 'bus', 'brand', 'delivery',
                  'construction',
                  'diaper', 'courts', 'old_name', 'real_fire', 'circuits', 'books', 'dispensing', 'display',
                  'crossing_ref',
                  'cinema', 'carriageway_ref', 'maxheight', 'cafe', 'cables', 'recycling_type', 'hour_on', 'locality',
                  'interior_decoration', 'cycleway', 'department', 'denotation', 'shelter', 'latest_survey_date',
                  'diet',
                  'min_age', 'maxstay', 'opened', 'building', 'yelp', 'wifi', 'traffic_signals']

# Number of documents in the collection
N = db.cambridge.find().count()

removed = []
kept = []

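# Remove labels used in fewer than 1000 documents
for label in all_labels:
    # Grouping on a missing field yields _id None, so "count" here is the
    # number of documents that do NOT have this label
    pipeline = [{"$group": {"_id": "$" + label, "count": {"$sum": 1}}}, {"$match": {"_id": None}}]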
    result = list(db.cambridge.aggregate(pipeline))
    if len(result) > 0:
        n = result[0]["count"]
        if n >= N - 1000:
            db.cambridge.update({}, {"$unset": {label: ""}}, multi=True)
            removed.append(label)
        else:
            kept.append(label)
print len(removed), "labels were removed and", len(all_labels) - len(removed), "labels were kept."
print "The labels kept are:", kept

336 labels were removed and 11 labels were kept.
The labels kept are: ['address', 'amenity', 'entrance', 'natural', 'foot', 'highway', 'barrier', 'landuse', 'source', 'operator']
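
<p>The thresholding rule described above can also be sketched on plain Python dictionaries, without a database. This is an illustration of the rule only, not the query used in this report; the function name and the sample documents are made up.</p>

```python
from collections import Counter


def split_labels(documents, all_labels, threshold):
    """Return (kept, removed) based on how many documents contain each label."""
    usage = Counter()
    for doc in documents:
        for label in all_labels:
            if label in doc:
                usage[label] += 1
    kept = [label for label in all_labels if usage[label] >= threshold]
    removed = [label for label in all_labels if usage[label] < threshold]
    return kept, removed


# Made-up documents for illustration only.
docs = [
    {"highway": "residential", "name": "A"},
    {"highway": "footway"},
    {"highway": "service", "fax": "000"},
]
kept_demo, removed_demo = split_labels(docs, ["highway", "name", "fax"], threshold=2)
print(kept_demo)     # ['highway']
print(removed_demo)  # ['name', 'fax']
```

<p>Counting the documents that contain a label directly, rather than those that lack it, means every label ends up in exactly one of the two lists.</p>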


## Removing underutilized sublabels
<p>Similarly, only lower labels that appeared in at least 500 documents were kept. The "kept" list from the MongoDB query above is used in "nanoproject_2_prep.py" to print a dictionary of the kept upper labels and their sublabels. This dictionary is then looped through in "nanoproject_2_query.py" to check, using MongoDB, the number of documents each sublabel is used in. If a sublabel is used in fewer than 500 documents, it is removed and the database is updated. If an upper label no longer contains any sublabels, it is also removed. The final structure of the collection is printed out.</p>

In [3]:
kept_sublabels = {'building': ['name', 'level', 'levels', 'min_level', 'material', 'levels:underground'],
                  'maxspeed': ['type', 'ype'],
                  'name': ['cy', 'eo', 'ru', 'sr', 'uk', 'zh', 'en', 'zh_pinyin', 'he', 'de'],
                  'service': ['bicycle:pump', 'bicycle:chain_tool'], 'access': ['conditional'],
                  'source': ['crossing', 'addr', 'name', 'phone', 'population', 'maxwidth', 'ref', 'detail', 'info',
                             'location', 'start_date', 'fhrs:id', 'opening_hours', 'housenumber', 'postcode', 'access',
                             'ele', 'cost', 'taxon', 'database', 'position', 'geometry', 'maxspeed', 'lit', 'oneway',
                             'traffic_calming', 'designation', 'operator', 'width', 'bus:backward', 'taxi:backward',
                             'bicycle:backward', 'tourist_bus:backward', 'occupier', 'description', 'highway', 'noname',
                             'hgv', 'outline', 'maxspeed:date', 'addr:housenumber', 'ons_code', 'wifi', 'pkey',
                             'bridge', 'addr:postcode', 'tracktype', 'height'],
                  'address': ['street', 'postcode', 'housenumber', 'housename', 'full', 'city', 'country',
                              'interpolation', 'flat', 'flats', 'place', 'town'],
                  'ref': ['university_of_cambridge', 'observado']}

# Remove sublabels used in fewer than 500 documents
removed_sub = {}
kept_sub = {}
for label in kept_sublabels.keys():
    for sublabel in kept_sublabels[label]:
        pipeline = [{"$group": {"_id": "".join(["$", label, ".", sublabel]), "count": {"$sum": 1}}},
                    {"$match": {"_id": None}}]
        result = list(db.cambridge.aggregate(pipeline))
        if len(result) > 0:
            n = result[0]["count"]
            if n >= N - 500:
                db.cambridge.update({}, {"$unset": {"".join([label, '.', sublabel]): ""}}, multi=True)
                removed_sub.setdefault(label, []).append(sublabel)
            else:
                kept_sub.setdefault(label, []).append(sublabel)

# Remove the upper labels that no longer have sublabels
for label in removed_sub.keys():
    if label not in kept_sub.keys():
        db.cambridge.update({}, {"$unset": {label: ""}}, multi=True)

# Print the final structure of the collection
final_labels = {}
for label in kept:
    if label in kept_sub:
        final_labels[label] = kept_sub[label]
    elif label not in removed_sub:
        final_labels[label] = None

print "The final structure of the collection is:"
pprint.pprint(final_labels)

The final structure of the collection is:
{'address': ['street',
             'postcode',
             'housenumber',
             'housename',
             'city',
             'country',
             'interpolation'],
 'amenity': None,
 'barrier': None,
 'entrance': None,
 'foot': None,
 'highway': None,
 'landuse': None,
 'natural': None,
 'operator': None,
 'source': ['name']}
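
<p>As a rough in-memory illustration of the sublabel step, the sketch below counts nested keys and drops any upper label whose sublabels all fall under the threshold. The function names and sample documents are assumptions for this example and do not come from "nanoproject_2_query.py".</p>

```python
def count_sublabel(documents, label, sublabel):
    """Count documents where doc[label] is a dict that contains sublabel."""
    count = 0
    for doc in documents:
        value = doc.get(label)
        if isinstance(value, dict) and sublabel in value:
            count += 1
    return count


def prune_sublabels(documents, sublabels, threshold):
    """Return {label: [kept sublabels]}; labels with nothing left are omitted."""
    kept = {}
    for label, subs in sublabels.items():
        survivors = [s for s in subs
                     if count_sublabel(documents, label, s) >= threshold]
        if survivors:
            kept[label] = survivors
    return kept


# Made-up documents for illustration only.
docs = [
    {"address": {"street": "A Road", "postcode": "CB1 2LJ"}},
    {"address": {"street": "B Road"}, "ref": {"observado": "1"}},
]
structure = prune_sublabels(
    docs, {"address": ["street", "postcode"], "ref": ["observado"]}, threshold=2)
print(structure)  # {'address': ['street']}
```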


## Remove postcodes that do not start with "CB"
<p>All postcodes in Cambridge start with "CB". Listing the postcodes in the collection reveals documents with the postcode "SG8 5TF", which should be removed because it belongs to a place in Royston, in the Stevenage postcode area. Since only two documents in the collection have this postcode, it is treated as an error rather than an attempt to extend the collection to the surrounding area. Additionally, the postcode "CB1" is incomplete: there should be a second group of three characters. Documents with this postcode are also removed.</p>

In [4]:
db.cambridge.remove({"address.postcode": "CB1"})
db.cambridge.remove({"address.postcode": "SG8 5TF"})

{u'n': 0, u'ok': 1}
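
<p>Both checks described above (the "CB" prefix and the presence of a second three-character group) could be expressed as one regular expression. The pattern below is an assumption about what a complete Cambridge postcode looks like, namely "CB", one or two digits, a space, then a digit and two letters. It is not a full validator for UK postcodes.</p>

```python
import re

# Assumed shape of a complete Cambridge postcode, e.g. "CB1 2LJ".
CB_POSTCODE = re.compile(r"^CB\d{1,2} \d[A-Z]{2}$")


def is_complete_cb_postcode(postcode):
    """Return True if the postcode starts with CB and has both groups."""
    return bool(CB_POSTCODE.match(postcode))


for code in ["CB1 2LJ", "CB1", "SG8 5TF"]:
    print("{}: {}".format(code, is_complete_cb_postcode(code)))
```

<p>A check like this would flag any other out-of-area or truncated postcodes without listing them by hand.</p>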

## Update Cities to Cambridge
<p>Some cities were "cambridge" and not "Cambridge", overspecified to "Girton" or "South Cambridgeshire", or listed as "11". However, the other information does show that each entry is in Cambridge. Therefore, the "city" is updated to "Cambridge".</p>

In [5]:
db.cambridge.update({"address.city": "cambridge"}, {"$set": {"address.city": "Cambridge"}}, upsert=False,
                    multi=True)
db.cambridge.update({"address.city": "South Cambridgeshire"}, {"$set": {"address.city": "Cambridge"}}, upsert=False,
                    multi=True)
db.cambridge.update({"address.city": "Girton"}, {"$set": {"address.city": "Cambridge"}}, upsert=False, multi=True)
db.cambridge.update({"address.city": "11"}, {"$set": {"address.city": "Cambridge"}}, upsert=False, multi=True)

{u'n': 0, u'nModified': 0, u'ok': 1, 'updatedExisting': False}
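
<p>The four updates above share the same target value, so they could in principle be folded into one filter using MongoDB's "$in" operator. The sketch below only builds the filter and update documents and applies the same rule to an in-memory dictionary; the variable and function names are made up for this example.</p>

```python
# City values that the text above treats as variants of "Cambridge".
CITY_VARIANTS = ["cambridge", "South Cambridgeshire", "Girton", "11"]

# Filter and update documents in MongoDB query syntax;
# "$in" matches any value in the list.
city_filter = {"address.city": {"$in": CITY_VARIANTS}}
city_update = {"$set": {"address.city": "Cambridge"}}


def normalize_city(doc):
    """In-memory version of the same rule, for checking it without a database."""
    address = doc.get("address", {})
    if address.get("city") in CITY_VARIANTS:
        address["city"] = "Cambridge"
    return doc


print(normalize_city({"address": {"city": "Girton"}}))
```

<p>With the pymongo calls used in this notebook, the two dictionaries would presumably be passed as db.cambridge.update(city_filter, city_update, multi=True).</p>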

## Pare down barrier, entrance, highway, landuse, and operator
<p>Some values for these labels were the same but written in different formats. They were updated to be more consistent.</p>

In [6]:
# Pare down barrier
pipeline = [{"$group": {"_id": "$barrier", "count": {"$sum": 1}}},
            {"$sort": {"_id": -1}}]
result = list(db.cambridge.aggregate(pipeline))
pprint.pprint(result)

db.cambridge.update({"barrier": "fence;wall"}, {"$set": {"barrier": "fence"}}, upsert=False, multi=True)
db.cambridge.update({"barrier": "fedr"}, {"$set": {"barrier": None}}, upsert=False, multi=True)
db.cambridge.update({"barrier": "bollards"}, {"$set": {"barrier": "bollard"}}, upsert=False, multi=True)

# Pare down entrance
pipeline = [{"$group": {"_id": "$entrance", "count": {"$sum": 1}}},
            {"$sort": {"_id": -1}}]
result = list(db.cambridge.aggregate(pipeline))
pprint.pprint(result)

db.cambridge.update({"entrance": "secondary_entrance"}, {"$set": {"entrance": "secondary"}}, upsert=False,
                    multi=True)
db.cambridge.update({"entrance": "main_entrance; porters"}, {"$set": {"entrance": "main_entrance;porters"}},
                    upsert=False, multi=True)
db.cambridge.update({"entrance": "porters;main_entrance"}, {"$set": {"entrance": "main_entrance;porters"}},
                    upsert=False, multi=True)
db.cambridge.update({"entrance": "main"}, {"$set": {"entrance": "main_entrance"}}, upsert=False, multi=True)
db.cambridge.update({"entrance": "emegency"}, {"$set": {"entrance": "emergency"}}, upsert=False, multi=True)
db.cambridge.update({"entrance": "main_entrance;porters;"}, {"$set": {"entrance": "main_entrance;porters"}},
                    upsert=False, multi=True)

# Pare down highway
pipeline = [{"$group": {"_id": "$highway", "count": {"$sum": 1}}},
            {"$sort": {"_id": -1}}]
result = list(db.cambridge.aggregate(pipeline))
pprint.pprint(result)

db.cambridge.update({"highway": "bus_stand"}, {"$set": {"highway": "bus_stop"}}, upsert=False, multi=True)

# Pare down landuse
pipeline = [{"$group": {"_id": "$landuse", "count": {"$sum": 1}}},
            {"$sort": {"_id": -1}}]
result = list(db.cambridge.aggregate(pipeline))
pprint.pprint(result)
db.cambridge.update({"landuse": "institututional"}, {"$set": {"landuse": "institutional"}}, upsert=False,
                    multi=True)

# Pare down operator
pipeline = [{"$group": {"_id": "$operator", "count": {"$sum": 1}}},
            {"$sort": {"_id": -1}}]
result = list(db.cambridge.aggregate(pipeline))
pprint.pprint(result)
db.cambridge.update({"operator": "YourSpace"}, {"$set": {"operator": "Your Space Apartments"}}, upsert=False,
                    multi=True)
db.cambridge.update({"operator": "Your Space"}, {"$set": {"operator": "Your Space Apartments"}}, upsert=False,
                    multi=True)
db.cambridge.update({"operator": "Trinity College"},
                    {"$set": {"operator": "Trinity College (University of Cambridge)"}}, upsert=False, multi=True)
db.cambridge.update({"operator": "St John's College"},
                    {"$set": {"operator": "St John's College (University of Cambridge)"}}, upsert=False, multi=True)
db.cambridge.update({"operator": "Lucy Cavendish College"},
                    {"$set": {"operator": "Lucy Cavendish College (University of Cambridge)"}}, upsert=False,
                    multi=True)
db.cambridge.update({"operator": "Lloyds"}, {"$set": {"operator": "Lloyds TSB"}}, upsert=False, multi=True)
db.cambridge.update({"operator": "King's College"},
                    {"$set": {"operator": "King's College (University of Cambridge)"}}, upsert=False, multi=True)
db.cambridge.update({"operator": "King's College (University Of Cambridge)"},
                    {"$set": {"operator": "King's College (University of Cambridge)"}}, upsert=False, multi=True)
db.cambridge.update({"operator": "Needham Institute"}, {"$set": {"operator": "Needham Research Institute"}},
                    upsert=False, multi=True)
db.cambridge.update({"operator": "Gonville and Caius College (University of Cambridge)"},
                    {"$set": {"operator": "Gonville & Caius College (University of Cambridge)"}}, upsert=False,
                    multi=True)
db.cambridge.update({"operator": "EDF"}, {"$set": {"operator": "EDF Energy"}}, upsert=False, multi=True)
db.cambridge.update({"operator": "Clare College"},
                    {"$set": {"operator": "Clare College (University of Cambridge)"}}, upsert=False, multi=True)
db.cambridge.update({"operator": "Christ's College"},
                    {"$set": {"operator": "Christ's College (University of Cambridge)"}}, upsert=False, multi=True)

[{u'_id': u'yes', u'count': 286},
 {u'_id': u'wall', u'count': 669},
 {u'_id': u'tree', u'count': 1},
 {u'_id': u'swing_gate', u'count': 1},
 {u'_id': u'stile', u'count': 6},
 {u'_id': u'steps', u'count': 1},
 {u'_id': u'rising_kerb', u'count': 19},
 {u'_id': u'ramp', u'count': 1},
 {u'_id': u'railing', u'count': 19},
 {u'_id': u'pram_handles', u'count': 2},
 {u'_id': u'lift_gate', u'count': 59},
 {u'_id': u'kissing_gate', u'count': 12},
 {u'_id': u'kerb', u'count': 34},
 {u'_id': u'height_restrictor', u'count': 1},
 {u'_id': u'hedge', u'count': 490},
 {u'_id': u'gravestone', u'count': 1},
 {u'_id': u'gate', u'count': 807},
 {u'_id': u'full-height_turnstile', u'count': 1},
 {u'_id': u'fence', u'count': 5405},
 {u'_id': u'entrance', u'count': 155},
 {u'_id': u'ditch', u'count': 1},
 {u'_id': u'cycle_barrier', u'count': 39},
 {u'_id': u'chicane', u'count': 2},
 {u'_id': u'cattle_grid', u'count': 36},
 {u'_id': u'car_trap', u'count': 9},
 {u'_id': u'boom', u'count': 1},
 {u'_id': u'bollar

{u'n': 0, u'nModified': 0, u'ok': 1, 'updatedExisting': False}
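
<p>Several of the entrance fixes above differ only in whitespace, ordering, or a trailing semicolon. One possible way to handle such cases with a single rule, sketched here under the assumption that alphabetical order is an acceptable canonical form, is to split on ";", clean each part, and rejoin. The function name is hypothetical; the aliases are taken from the entrance updates above.</p>

```python
# Aliases taken from the entrance updates above.
ENTRANCE_ALIASES = {
    "main": "main_entrance",
    "secondary_entrance": "secondary",
    "emegency": "emergency",
}


def normalize_multivalue(value, aliases):
    """Normalize a semicolon-separated value to a sorted, de-duplicated form."""
    parts = [part.strip() for part in value.split(";")]
    parts = [aliases.get(part, part) for part in parts if part]
    return ";".join(sorted(set(parts)))


for raw in ["main_entrance; porters", "porters;main_entrance",
            "main_entrance;porters;", "main"]:
    print(normalize_multivalue(raw, ENTRANCE_ALIASES))
```

<p>This rule does not cover decisions such as collapsing "fence;wall" to "fence", which choose one value over another and still need an explicit update.</p>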

# Section 2. Overview of the Data
<p>A statistical overview of the dataset, along with the MongoDB queries used to obtain the statistics, is below.</p>

In [7]:
print "The size of 'cambridge_england.osm' is", os.stat("cambridge_england.osm").st_size / 1e6, "MB."
print "The size of 'cambridge_england.osm.json' is", os.stat("cambridge_england.osm.json").st_size / 1e6, "MB."
N2 = db.cambridge.find().count()
print "There are", N, "documents in the original set and", N2, "documents in the cleaned set."
pipeline = [{"$group": {"_id": "$created.user", "count": {"$sum": 1}}}]
print "There are", len(list(db.cambridge.aggregate(pipeline))), "unique users."
print "There are", db.cambridge.find({"type": "node"}).count(), "nodes."
print "There are", db.cambridge.find({"type": "way"}).count(), "ways."
pipeline = [{"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 1}]
# Run the aggregation once and reuse the result
top_user = list(db.cambridge.aggregate(pipeline))[0]
print top_user['_id'], "contributed the most to this collection with", top_user['count'], "documents."
pipeline = [{"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
            {"$group": {"_id": "$count", "num_users": {"$sum": 1}}},
            {"$sort": {"_id": 1}},
            {"$limit": 1}]
print list(db.cambridge.aggregate(pipeline))[0]['num_users'], "users contributed once."
pipeline = [{"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
            {"$match": {"_id": {"$ne": None}}},
            {"$sort": {"count": -1}},
            {"$limit": 5}]
# Run the aggregation once and reuse the result
top_amenities = list(db.cambridge.aggregate(pipeline))
print top_amenities[0]["_id"], top_amenities[1]["_id"], "are the top two amenities."

The size of 'cambridge_england.osm' is 61.828774 MB.
The size of 'cambridge_england.osm.json' is 87.071371 MB.
There are 306428 documents in the original set and 306428 documents in the cleaned set.
There are 453 unique users.
There are 257058 nodes.
There are 49351 ways.
smb1001 contributed the most to this collection with 81443 documents.
120 users contributed once.
university bicycle_parking are the top two amenities.
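
<p>The user statistics above can be cross-checked on a plain list of user names with collections.Counter. This is a sketch with made-up user names, not the query used in this report. It counts users with exactly one document explicitly, whereas the aggregation above takes the group with the smallest count, which matches only if at least one user contributed exactly once.</p>

```python
from collections import Counter


def contributor_stats(users):
    """Return (unique users, top user, top count, users with one document)."""
    per_user = Counter(users)
    top_user, top_count = per_user.most_common(1)[0]
    single = sum(1 for count in per_user.values() if count == 1)
    return len(per_user), top_user, top_count, single


# Made-up user names, one entry per document.
users = ["alice", "bob", "alice", "carol", "alice", "bob"]
print(contributor_stats(users))  # (3, 'alice', 3, 1)
```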


# Section 3. Additional Ideas

<p>Some addresses have house names. It would be interesting to know whether one postcode has the most house names and whether one operator is especially common among them. It would also be interesting to see where the top amenities are located.</p>

In [8]:
# Find the top postcodes with housenames
pipeline = [{"$match": {"address.housename": {"$exists": True}}},
                {"$group": {"_id": "$address.postcode", "count": {"$sum": 1}}},
                {"$sort": {"count": -1}}]
pprint.pprint(list(db.cambridge.aggregate(pipeline)))

# Find the top operators with housenames
pipeline = [{"$match": {"address.housename": {"$exists": True}}},
                {"$group": {"_id": "$operator", "count": {"$sum": 1}}},
                {"$sort": {"count": -1}}]
pprint.pprint(list(db.cambridge.aggregate(pipeline)))

# Find the top amenities with postcodes and sort by postcode
pipeline = [{"$match": {"amenity": {"$ne": None}}},
            {"$match": {"amenity": {"$ne": "university"}}},
            {"$match": {"address.postcode": {"$ne": None}}},
            {"$group": {"_id": {"amenity": "$amenity",
                                "postcode": "$address.postcode"},
                        "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$group": {"_id": "$_id.amenity",
                        "info": {"$push": {
                            "postcode": "$_id.postcode",
                            "count": "$count"},},
                        "count": { "$sum": "$count"}}},
            {"$sort": {"count": -1}},
            {"$limit": 5}]
pprint.pprint(list(db.cambridge.aggregate(pipeline)))

[{u'_id': None, u'count': 200},
 {u'_id': u'CB4 1HG', u'count': 17},
 {u'_id': u'CB3 0EY', u'count': 11},
 {u'_id': u'CB1 2LJ', u'count': 9},
 {u'_id': u'CB5 8HU', u'count': 8},
 {u'_id': u'CB4 1HH', u'count': 8},
 {u'_id': u'CB1 2LG', u'count': 7},
 {u'_id': u'CB1 8RG', u'count': 7},
 {u'_id': u'CB3 9NF', u'count': 5},
 {u'_id': u'CB2 3QZ', u'count': 5},
 {u'_id': u'CB3 0AE', u'count': 5},
 {u'_id': u'CB4 1AJ', u'count': 5},
 {u'_id': u'CB2 8EX', u'count': 5},
 {u'_id': u'CB1 7PH', u'count': 5},
 {u'_id': u'CB2 1NS', u'count': 4},
 {u'_id': u'CB3 0AJ', u'count': 4},
 {u'_id': u'CB5 8HT', u'count': 4},
 {u'_id': u'CB2 3EA', u'count': 4},
 {u'_id': u'CB1 8QL', u'count': 4},
 {u'_id': u'CB4 1ST', u'count': 4},
 {u'_id': u'CB3 9NQ', u'count': 3},
 {u'_id': u'CB1 7RU', u'count': 3},
 {u'_id': u'CB4 3LG', u'count': 3},
 {u'_id': u'CB4 3JD', u'count': 3},
 {u'_id': u'CB2 1QJ', u'count': 3},
 {u'_id': u'CB4 1AL', u'count': 3},
 {u'_id': u'CB5 8AF', u'count': 3},
 {u'_id': u'CB5 8AQ', u'count'
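
<p>The third query groups twice: first by (amenity, postcode) pair, then by amenity with the per-postcode counts pushed into a list. A rough in-memory equivalent, assuming documents shaped like those in the collection, is sketched below; the function name and sample documents are made up for illustration.</p>

```python
from collections import defaultdict


def amenities_by_postcode(documents, skip=("university",), limit=5):
    """Count amenities per postcode, then rank amenities by total count."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        amenity = doc.get("amenity")
        postcode = doc.get("address", {}).get("postcode")
        if amenity is None or amenity in skip or postcode is None:
            continue
        counts[amenity][postcode] += 1

    summary = []
    for amenity, by_postcode in counts.items():
        # Most frequent postcodes first; ties broken alphabetically.
        info = sorted(by_postcode.items(), key=lambda item: (-item[1], item[0]))
        summary.append({"_id": amenity, "info": info,
                        "count": sum(by_postcode.values())})
    summary.sort(key=lambda row: (-row["count"], row["_id"]))
    return summary[:limit]


# Made-up documents for illustration only.
docs = [
    {"amenity": "pub", "address": {"postcode": "CB1 2LJ"}},
    {"amenity": "pub", "address": {"postcode": "CB1 2LJ"}},
    {"amenity": "pub", "address": {"postcode": "CB4 1HG"}},
    {"amenity": "cafe", "address": {"postcode": "CB4 1HG"}},
    {"amenity": "university", "address": {"postcode": "CB3 0EY"}},
    {"amenity": "bench"},
]
ranking = amenities_by_postcode(docs)
print(ranking)
```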

# Section 4. Conclusions

<p>The Cambridge, England OpenStreetMap dataset is full of information, but that richness also makes it cumbersome to analyze. During the cleanup stages, many labels and sublabels were removed. It may be more useful to move the information to an existing label than to remove it. For example, the information in "have_riverbank" and "trees" could be moved to the kept "natural" label. Additionally, some of the removed labels would have been useful for analysis. For example, "cuisine" was removed prematurely, so the top cuisine for this area could not be examined with MongoDB.</p>