#Problems Encountered in the Map
Streets in Atlanta are traditionally divided by *ordinal direction*, so the majority of street names end with one of these directions (Southwest, Southeast, etc.). Although the street names that end with an ordinal direction appear to have been cleaned well already, I thought it would be appropriate to move the direction into a separate key/value pair.  Thus we will add the key **streetSuffix** to the dataset to describe the ordinal direction of the street.  Some street names also contain abbreviations, and some contain numbers. The cleaning plan is as follows:
- Expand street names that end with a direction initial to the full direction name
- Move each street name's direction suffix to a new key/value pair called *streetSuffix*
- Expand all abbreviated street types (Dr., Ave., Blvd.) to their full names

Only a small number of addresses end with a number, so we omit them in this project. Most of them belong to a state highway.
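
The cleaning plan above could be sketched roughly as follows. This is a minimal illustration, not the code used for the audit: the `DIRECTIONS` and `STREET_TYPES` tables and the function name `clean_street` are assumptions made for the example, and only the last word of the street name is treated as a street type so that a leading "St" (as in "Saint") is left alone.

```python
# Assumed lookup tables; the tables used in the actual audit are not shown.
DIRECTIONS = {
    "N": "North", "S": "South", "E": "East", "W": "West",
    "NE": "Northeast", "NW": "Northwest",
    "SE": "Southeast", "SW": "Southwest",
}
STREET_TYPES = {
    "Dr": "Drive", "Ave": "Avenue", "Blvd": "Boulevard",
    "St": "Street", "Rd": "Road",
}
FULL_DIRECTIONS = set(DIRECTIONS.values())


def clean_street(name):
    """Return (street, streetSuffix); streetSuffix is None when absent."""
    words = name.strip().split()
    if not words:
        return "", None
    suffix = None
    # Step 1: expand a trailing direction initial such as "NW" or "N.E.".
    last = words[-1].replace(".", "")
    last = DIRECTIONS.get(last.upper(), last)
    # Step 2: move a trailing direction out of the street name.
    if last in FULL_DIRECTIONS and len(words) > 1:
        suffix = last
        words = words[:-1]
    # Step 3: expand an abbreviated street type, with or without a period.
    words[-1] = STREET_TYPES.get(words[-1].rstrip("."), words[-1])
    return " ".join(words), suffix


print(clean_street("Peachtree St. NE"))
```

One known limitation of this sketch is that a street whose real name ends in a single letter (for example a hypothetical "Avenue E") would be misread as having an East suffix, so such names would need a manual exception list.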
##Auditing Process on Postal Codes
Some of the postal codes have the full 9-digit code, and some have the state code as a prefix.  Thus our cleaning process will involve:
- Removing the state code from the postal codes
- Moving the last 4 digits of the postal code into a new key/value pair called *postExt*

The gold standard for the postal codes was generated by screen scraping **zipcodestogo.com**. After an initial audit, I noticed that a large portion of the zip codes do not belong to the City of Atlanta. Therefore I generated a separate key/value pair called *zipInAtlanta* so that we can analyze how many entries do not belong to the City of Atlanta.
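
A minimal sketch of the postal code cleaning described above, under a few stated assumptions: `ATLANTA_ZIPS` is a one-element placeholder for the scraped list, the function name `clean_postcode` is hypothetical, and the value "F" for codes outside the city is a guess (only "T" appears in the query output later in this report). The sketch keeps the postcode as a string, whereas the stored data appears to hold it as a number.

```python
import re

# Placeholder for the scraped list of City of Atlanta zip codes.
ATLANTA_ZIPS = {"30303"}

# Optional two-letter state prefix, five digits, optional four-digit extension.
POSTCODE_RE = re.compile(r"^(?:[A-Za-z]{2}[\s-]*)?(\d{5})(?:-?(\d{4}))?$")


def clean_postcode(raw, atlanta_zips=ATLANTA_ZIPS):
    """Return the cleaned address keys, or None if the code is unparseable."""
    match = POSTCODE_RE.match(raw.strip())
    if match is None:
        return None
    postcode, ext = match.groups()
    cleaned = {
        "postcode": postcode,
        # "F" is an assumed marker for codes outside the city.
        "zipInAtlanta": "T" if postcode in atlanta_zips else "F",
    }
    if ext:
        cleaned["postExt"] = ext
    return cleaned


print(clean_postcode("GA 30303-1234"))
```

Returning None for anything that does not match lets the audit collect unparseable values for manual review instead of silently storing them.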
##Auditing Process on County field
Looking at the county field, it initially appeared that none of the counties listed are in the City of Atlanta. Also, some of the counties have the state code appended. Thus, as part of the cleaning process, we remove the state code from the end of the county field.  After auditing a sample set, it seems that a good portion of the counties are outside the City of Atlanta but belong to the Atlanta Metropolitan area.
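
Removing the trailing state code could look something like the sketch below. The raw formats it handles ("Fulton, GA", "Fulton GA", "Fulton,GA") are assumptions about how the state code is appended, and `clean_county` is a hypothetical name.

```python
import re

# Assumed raw formats: a separator (comma and/or space) then a 2-letter code.
STATE_SUFFIX_RE = re.compile(r"[\s,]+[A-Z]{2}$")


def clean_county(raw):
    """Strip a trailing state code such as ', GA' from a county name."""
    return STATE_SUFFIX_RE.sub("", raw.strip())


print(clean_county("Fulton, GA"))
```

Requiring a separator before the two letters keeps names that merely end in capitals from being truncated.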

##Data Overview
We first run some queries to find out the number of entries and some other auxiliary data.

###Size of OSM file
atlanta_georgia.osm - **2.27GB**


In [41]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.osm

In [42]:
count = db.atlanta.count()

In [43]:
entries_with_address = {"address" : {"$exists" : 1}}
addr_entries = db.atlanta.count(filter=entries_with_address)

In [44]:
entries_with_pos = {"pos": {"$exists" : 1}}
pos_entries = db.atlanta.count(filter=entries_with_pos)

In [45]:
entries_with_postcodes = {"address.postcode" : {"$exists" : 1}}
postcode_entries = db.atlanta.count(filter=entries_with_postcodes)

In [46]:
print "Number of entries in database ", count
print "Entries with an address field: ",addr_entries
print "Entries with postcodes: ",postcode_entries
print "Entries with position information", pos_entries

Number of entries in database  12167874
Entries with an address field:  100744
Entries with postcodes:  82402
Entries with position information 11390067


From these initial queries, it seems that although most entries have position information, less than 1% have address information.  Most of the entries with address information also have a postal code, which is a good thing.
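
The proportions behind these statements can be checked directly from the four counts printed above. The helper name `share` is introduced only for this example.

```python
def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Counts copied from the query output above.
total = 12167874
with_address = 100744
with_postcode = 82402
with_pos = 11390067

print("address / all entries:   %.2f%%" % share(with_address, total))
print("position / all entries:  %.2f%%" % share(with_pos, total))
print("postcode / with address: %.2f%%" % share(with_postcode, with_address))
```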

## Which street suffix was used the most in the address's street names?
As stated above, street names in the City of Atlanta end with an ordinal direction as a suffix.  Here we find out which street suffix is used most in our address field. This may give us some insight into which quadrant of the city is best documented.

In [47]:
match_pipe = {"$match" : {"address" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : "$address.streetSuffix", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
pipeline = [match_pipe,group_pipe,sort_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print(entry)

{u'count': 88496, u'_id': None}
{u'count': 4261, u'_id': u'Northwest'}
{u'count': 3158, u'_id': u'Northeast'}
{u'count': 2995, u'_id': u'Southwest'}
{u'count': 1602, u'_id': u'Southeast'}
{u'count': 101, u'_id': u'East'}
{u'count': 51, u'_id': u'South'}
{u'count': 50, u'_id': u'West'}
{u'count': 30, u'_id': u'North'}


It seems that Northwest is the most used street suffix.  As a fun fact, I went to high school on this side of town.  About 88% of the entries with an address (88,496 of 100,744) have no suffix at all, which hints that this dataset mostly covers the Atlanta Metro Area outside the city limits.

##What is the ratio of Zipcodes that are actually in the City of Atlanta?
To test my hypothesis that most of the entries are not in the City of Atlanta, I set out to find how many of the zip codes are actually within the city.

In [48]:
match_pipe = {"$match" : {"address.zipInAtlanta" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : 'null', "count" : {"$sum" : 1}}}
pipeline = [match_pipe,group_pipe]
result = db.atlanta.aggregate(pipeline)
total = 0
for entry in result:
    total = entry['count']
    print(entry)

{u'count': 82402, u'_id': u'null'}


In [49]:
match_pipe = {"$match" : {"address.zipInAtlanta" : 'T'}}
group_pipe = {"$group" : {"_id" : "$address.zipInAtlanta", "count" : {"$sum" : 1}}}
pipeline = [match_pipe,group_pipe]
result = db.atlanta.aggregate(pipeline)
in_atl = 0
for entry in result:
    in_atl = entry['count']
    print(entry)

{u'count': 17895, u'_id': u'T'}


In [50]:
print "Ratio of zip codes entries that are in Atlanta City Limits:", float(in_atl)/float(total)

Ratio of zip codes entries that are in Atlanta City Limits: 0.217167059052


The fact that only about 22% of the entries with a zip code fall within the city limits increases my confidence that the majority of this dataset covers the Atlanta Metro Area.  One can also conclude that most of the contributions cover areas outside the City of Atlanta.
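
The two aggregations above could also be folded into one by grouping on *zipInAtlanta* and computing the ratio from the grouped counts. The sketch below is one possible way to do that; the function names are hypothetical, and the "F" value for entries outside the city is an assumption, since only "T" appears in the output.

```python
def zip_ratio_pipeline():
    """Count entries per zipInAtlanta value in a single aggregation."""
    return [
        {"$match": {"address.zipInAtlanta": {"$exists": 1}}},
        {"$group": {"_id": "$address.zipInAtlanta", "count": {"$sum": 1}}},
    ]


def in_atlanta_ratio(groups, flag="T"):
    """Fraction of counted entries whose group _id equals flag."""
    counts = {g["_id"]: g["count"] for g in groups}
    total = sum(counts.values())
    return float(counts.get(flag, 0)) / total if total else 0.0


# With a live connection: groups = db.atlanta.aggregate(zip_ratio_pipeline())
# Here the two counts reported above stand in for the query result.
groups = [{"_id": "T", "count": 17895}, {"_id": "F", "count": 82402 - 17895}]
print("Ratio in Atlanta city limits: %.4f" % in_atlanta_ratio(groups))
```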

##Which Zipcode is the most common?

In [51]:
match_pipe = {"$match" : {"address.postcode" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
limit_pipe = {"$limit" : 1}
pipeline = [match_pipe,group_pipe,sort_pipe,limit_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print entry

{u'count': 7259, u'_id': 30114}


This zip code belongs to Canton, Georgia, which is a part of the Atlanta Metropolitan area.

##What is the most common amenity in Atlanta?

In [52]:
match_pipe = {"$match" : {"amenity" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : "$amenity", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
limit_pipe = {"$limit" : 1}
pipeline = [match_pipe,group_pipe,sort_pipe,limit_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print entry

{u'count': 5609, u'_id': u'place_of_worship'}


It is no surprise that the most common amenity is a place of worship.  The only thing bigger than Texas in the Southern U.S. is religion.

## Which user contributed the most to this dataset?

In [53]:
group_pipe = {"$group" : {"_id" : "$created.user", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
limit_pipe = {"$limit" : 1}
pipeline = [match_pipe,group_pipe,sort_pipe,limit_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print entry

{u'count': 4057, u'_id': u'iandees'}


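This doesn't look like a bot name, so I must commend *iandees* for making so many contributions to the Atlanta OSM map.  Note that the pipeline above still includes *match_pipe* from the amenity query, so this count covers only entries that have an amenity tag.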
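
Because In [53] does not redefine *match_pipe*, the pipeline inherits the amenity filter from In [52]. A version that ranks users over every entry would simply leave the match stage out, as in the sketch below. The function name is hypothetical, and the result of running it against the full collection is not shown here because it was not part of the original analysis.

```python
def top_contributors_pipeline(limit=1):
    """Rank users over all entries, with no $match stage."""
    return [
        {"$group": {"_id": "$created.user", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


# With a live connection:
#     for entry in db.atlanta.aggregate(top_contributors_pipeline()):
#         print(entry)
print(top_contributors_pipeline())
```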

##Additional Ideas for improving the dataset
Since this dataset covers the Atlanta Metro area, it would be nice if the county field were more populated so that one could perform analysis at county-by-county granularity.  There are quite a few counties within the Atlanta Metro area, so it is a worthwhile field within the dataset.  There also needs to be an audit to remove data generated by the **TIGER** project, as most of the entries I saw didn't seem correct (most of them seem to belong to Alabama); however, these were *way* elements, and I did not audit the *way* elements much.  Another useful analysis would be to get the min/max longitude and latitude values to ensure that all of the entries are within the Atlanta Metro boundaries.  Below I give some other suggestions to improve not only the City of Atlanta's dataset but OpenStreetMap data as a whole.
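
The min/max check suggested above might be sketched as follows. It assumes each entry's *pos* field is a `[latitude, longitude]` pair, which is not confirmed by the queries in this report. The sample points and the `METRO_LAT`/`METRO_LON` bounds are illustrative values only, and real limits should come from an authoritative source.

```python
def bounding_box(positions):
    """Return (min_lat, max_lat, min_lon, max_lon) for [lat, lon] pairs."""
    lats = [p[0] for p in positions]
    lons = [p[1] for p in positions]
    return min(lats), max(lats), min(lons), max(lons)


def outside(positions, lat_range, lon_range):
    """List positions that fall outside the given lat/lon ranges."""
    return [
        p for p in positions
        if not (lat_range[0] <= p[0] <= lat_range[1]
                and lon_range[0] <= p[1] <= lon_range[1])
    ]


# Illustrative values: one point in downtown Atlanta, one well to the southwest.
sample = [[33.749, -84.388], [32.4, -86.3]]
# Hypothetical metro bounds, chosen only for this example.
METRO_LAT = (33.0, 34.6)
METRO_LON = (-85.4, -83.3)

print(bounding_box(sample))
print(outside(sample, METRO_LAT, METRO_LON))
```

On the full dataset the same idea would be better expressed as an aggregation with `$min` and `$max`, so that the positions never have to be loaded into memory.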

###University-level involvement
There are two big universities in downtown Atlanta (Georgia State University and Georgia Institute of Technology) that could become involved with the OSM project.  Students can learn many core concepts while cleaning up the dataset in the process.  Some that come to mind are:
- Data Structures course:  Have a project assignment in which students create a data structure that generates a standard document format. For extra credit, students could submit their changes to OSM for review.
- Algorithms course: Have an assignment in which students create an algorithm that takes some OSM data as input and outputs the data in a given format.  The assignment could be tailored so that students try to create the most efficient algorithm and evaluate its run time.

The benefit of this idea is that it gives students hands-on experience applying their knowledge to a real-world problem while also benefiting the OSM community.

###Challenge-based involvement
Another opportunity for improving the dataset is to turn it into a challenge or contest on platforms such as **Kaggle** or **HackerRank**.  Participants could each be given a unique dataset of constant size and similar complexity to see who can clean the data the fastest.  Submissions could be passed to a checker tool that measures how clean the dataset is based on given criteria. The participant with the highest "cleanliness" score wins the challenge. The challenges could be sweetened with prizes such as T-shirts or an award plaque that could be added to one's professional profile.
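
As a rough illustration of what such a checker tool might compute, the sketch below scores a list of address entries against two made-up criteria. The checks, the field names *postcode* and *street*, and the scoring rule are all assumptions for the example, since a real contest would publish its own rules.

```python
import re


def postcode_is_5_digits(entry):
    """Pass when the postcode is exactly five digits."""
    return re.match(r"^\d{5}$", str(entry.get("postcode", ""))) is not None


def street_has_no_abbreviation(entry):
    """Pass when the street name does not end in an abbreviated type."""
    pattern = r"\b(St|Ave|Blvd|Dr|Rd)\.?$"
    return re.search(pattern, entry.get("street", "")) is None


# Hypothetical criteria; a real checker would have many more.
CHECKS = [postcode_is_5_digits, street_has_no_abbreviation]


def cleanliness(entries, checks=CHECKS):
    """Fraction of (entry, check) pairs that pass; 1.0 means fully clean."""
    results = [check(e) for e in entries for check in checks]
    return float(sum(results)) / len(results) if results else 1.0


submission = [
    {"postcode": "30303", "street": "Peachtree Street"},
    {"postcode": "GA 30114", "street": "Main St."},
]
print(cleanliness(submission))
```

Scoring each check separately, rather than marking an entry clean only if every check passes, gives partial credit and makes it easier to report which criteria a submission fails.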

###Global Day of Cleaning OpenStreetMap
Similar to the Global Day of CodeRetreat, the OSM community could hold a Global Day of Cleaning OpenStreetMap.  The event could also be driven by professional organizations such as ACM or IEEE, with each chapter registering for the event.  To avoid duplication, there could be central sites for the participating regions, each assigned data within some perimeter of its location.  Each team could share its progress by posting before/after pictures of the maps it cleaned on a dedicated site or through social media.  One drawback of this method is that areas without any expertise in working with OSM data get left out.  As an alternative strategy, the OSM community could select regions that have no participants or have limited resources due to their economic circumstances and have programmers around the world help clean up the map data in those areas.  This brings a "social good" component to the activity, which benefits the global community.

