# OpenStreetMap project

## Revision
- v1: first submission
- v2: second submission:
    - auditing and cleaning the zip codes
    - auditing and cleaning the city names
    - docstring added to new functions
    - new section "Ideas for database improvement"
    - all the auditing/cleaning details are available through direct links to python notebooks


## Table of Contents
<ul>
<li><a href="#area">OpenStreetMap chosen area</a></li>
<li><a href="#breakdown">Project breakdown</a></li>
<li><a href="#problems">Problems encountered</a></li>
<li><a href="#statistics">Some statistics</a></li>
<li><a href="#analysis">Further analysis</a></li>
<li><a href="#improvement">Ideas for database improvement</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>

***
<a id='area'></a>
### OSM area
Toulouse, FRANCE *(home, sweet home)*

Toulouse is the fourth-largest French city by population, so its extract is big enough to meet the project's OSM file size criterion.
- [OSM link to area](https://www.openstreetmap.org/relation/35738)
![title](img/toulouse.png)
<center>https://www.openstreetmap.org/relation/35738#map=5/45.706/3.516</center>


***
<a id='breakdown'></a>
### Project breakdown

This document summarizes the whole sequence of steps built and run during the project. The work broke down as follows:
- from OSM XML to tabular CSV data
- from tabular CSV data to SQL database
- SQL data check
- data cleaning
- data analysis

Some helper functions were written and gathered in a single user module, `myeasysql.py` (<a href="myeasysql.html" target="_blank">html link</a>), covering:
- tables creation
- tables listing and schemes display (to check database content)
- tables emptying (to allow test iterations)
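
As an illustration of what such helpers can look like, here is a minimal sketch assuming a SQLite database accessed through the standard `sqlite3` module. The function names `list_tables`, `table_schema` and `empty_table` are hypothetical and are not taken from `myeasysql.py`.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
    return [name for (name,) in rows]

def table_schema(conn, table):
    """Return (column name, declared type) pairs for one table."""
    # PRAGMA statements cannot take bound parameters, so the table name
    # is checked against the known tables before being interpolated.
    if table not in list_tables(conn):
        raise ValueError('unknown table: %s' % table)
    info = conn.execute('PRAGMA table_info("%s");' % table)
    return [(row[1], row[2]) for row in info]

def empty_table(conn, table):
    """Delete every row of a table while keeping its schema."""
    if table not in list_tables(conn):
        raise ValueError('unknown table: %s' % table)
    conn.execute('DELETE FROM "%s";' % table)
    conn.commit()

# Hypothetical table, only used to exercise the helpers.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nodes (id INTEGER, lat REAL, lon REAL);')
conn.execute('INSERT INTO nodes VALUES (1, 43.6, 1.44);')
```

Emptying a table rather than dropping it keeps the schema in place, which is convenient when the loading step is replayed several times.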

****
#### ***1. from OSM to CSV***  (<a href="openstreetmap_osm2csv.html" target="_blank">openstreetmap_osm2csv</a>)
The strategy from the Udacity course "Case Study: OpenStreetMap Data [SQL]" has been applied and adapted.
In addition to the case study, a "valid" key has been added to the dictionary of tag attributes, so that the character problems flagged by the regex proposed in the case study can be analyzed later on.

```python
tag_attribs['valid'] = 'no' if problem_chars.match(tag.attrib['k']) else 'yes'
```
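
One point worth checking when flagging keys this way: in Python's `re` module, `match` only tests the beginning of the string, whereas `search` scans the whole string. The sketch below assumes that `problem_chars` is an unanchored character class (the exact regex used in the project is not shown here) and uses a hypothetical helper `key_validity` to flag a problem character anywhere in the key.

```python
import re

# Assumed pattern: a character class of characters unwanted in tag keys.
# The exact regex used in the project is not shown in this document.
problem_chars = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')

def key_validity(key):
    """Return 'no' if the key contains a problem character anywhere, else 'yes'."""
    return 'no' if problem_chars.search(key) else 'yes'

# With this pattern, match() misses a space located in the middle of a key:
# problem_chars.match('bad key') is None, while search() finds it.
```

With such a pattern, `key_validity('addr:street')` gives `'yes'` and `key_validity('bad key')` gives `'no'`.
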
Another improvement was the proper typing of attributes:

```python
NODE_FIELDS_TYPES = [int, float, float, str_encode, int, str_encode, int, str_encode]
WAY_FIELDS_TYPES  = [int, str_encode, int, str_encode, int, str_encode]  

if element.tag == 'node':
    for i, attr in enumerate(node_attr_fields):
        node_attribs[attr] = NODE_FIELDS_TYPES[i](element.attrib[attr])
elif element.tag == 'way':
    for i, attr in enumerate(way_attr_fields):
        way_attribs[attr] = WAY_FIELDS_TYPES[i](element.attrib[attr])
```

****
#### ***2. from CSV to SQL***  (<a href="openstreetmap_csv2sql.html" target="_blank">openstreetmap_csv2sql</a>)
This step was the opportunity to test two ideas for feeding the SQL database:
- a lazy coding technique with the one-liner `pandas.DataFrame.to_sql()` method, which in the end showed some drawbacks
- a lazy loading technique based on `csv.DictReader` and generator functions (`yield`), which kept memory consumption very low while inserting data
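
The lazy loading idea can be sketched as follows. This is a minimal illustration written for Python 3, assuming a SQLite database and a hypothetical three-column `nodes` table; the generator `read_rows` and the sample data are not taken from the project notebooks.

```python
import csv
import io
import sqlite3

def read_rows(csv_file, fields):
    """Yield one tuple per CSV row, in the column order given by fields."""
    for row in csv.DictReader(csv_file):
        yield tuple(row[f] for f in fields)

# Tiny in-memory CSV standing in for a real nodes file (hypothetical data).
sample = io.StringIO("id,lat,lon\n1,43.60,1.44\n2,43.61,1.45\n")

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nodes (id INTEGER, lat REAL, lon REAL);')
# executemany pulls rows from the generator one at a time, so the whole
# file never has to be held in memory.
conn.executemany('INSERT INTO nodes VALUES (?, ?, ?);',
                 read_rows(sample, ['id', 'lat', 'lon']))
conn.commit()
```

Because the rows are produced on demand, memory use does not grow with the size of the CSV file, unlike an approach that first loads the file into a DataFrame.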

****
#### ***3. SQL audit***  (<a href="openstreetmap_audit_sql.html" target="_blank">openstreetmap_audit_sql</a>)
The database was audited with the following Python structure, based on my helper module `myeasysql.py` and standard SQL queries written as Python strings:

```python
QUERY = "SELECT COUNT(*) FROM (SELECT type FROM node_tags WHERE type!='regular' GROUP BY type);"
nb_node_tags_types = db_query(c, QUERY)[0][0]
print 'Number of different tag types for nodes is %d, first types, sorted by frequency, are:' % nb_node_tags_types
QUERY = """
SELECT type, count(*) as num FROM node_tags
WHERE type!='regular'
GROUP BY type
ORDER BY num DESC LIMIT 10;
"""
for (k,v) in db_query(c, QUERY): print '%s:%d' % (k,v)
```
This example returns the number of specific (non-regular) node tag types and the ten most frequent ones:
```
Number of different tag types for nodes is 83, first types, sorted by frequency, are:
addr:100254
source:58016
fire_hydrant:1141
recycling:791
contact:544
post_box:241
name:198
fuel:139
operator:126
surveillance:104
```

The audit targeted the specific tag type `addr`, which is the most used. The related problems are described <a href="#problems">there</a>.

****
#### ***4. CSV cleaning***  (<a href="openstreetmap_clean_csv.html" target="_blank">openstreetmap_clean_csv</a>)
Wrongly formatted street names have been repaired within the CSV files with a simple Python cleaning notebook, `openstreetmap_clean_csv.ipynb`.

The cleaning procedure targeted letter case, which is not standardized at all in our database and leads to redundant street names. It also normalizes zip codes. Cleaning relies on regular expressions, which makes the code more robust:

```python
import re

def capitalize_streetname(name):
    # start from the fully lower-cased name
    s0 = name.lower()
    # 1st step: capitalize the first letter of every word
    s1 = re.sub(r'((^|[\.\'\s-])\w{1})', lambda pattern: pattern.group(1).upper(), s0)
    # 2nd step: lower-case junction words such as 'le', 'la', 'du', 'de'
    s2 = re.sub(r'([DLE][aeut]?[s]?[\'\s-])', lambda pattern: pattern.group(1).lower(), s1)
    return s2

def standardize_zipcode(code):
    # extracts or completes the given string to get the standard 5 digits code
    # searching for 2 to 5 digits
    mysearch = re.search(r'(\d{2,5})',code)
    if mysearch:
        scode = mysearch.group(1)
        for i in range(len(scode), 5): scode += '0'
    else:
        scode = ''
    return scode
```

The procedure was validated with representative fake street names and zip codes:
```
rue du Rendez-vous de l'estrapade ==> Rue du Rendez-Vous de l'Estrapade
rue de la dalbade ==> Rue de la Dalbade
AVENUE JEAN RIEUX du t.o.E.c ==> Avenue Jean Rieux du T.O.E.C
'31000' ==> '31000'
'31' ==> '31000'
' 31200 blabla' ==> '31200'
'blabla3' ==> ''
```
The outcome of the cleaning is two new CSV files with standardized node and way tags, which can be loaded into our SQL database in place of the original CSV files.

****
#### ***5. SQL Analysis***  (<a href="openstreetmap_analyze_sql.html" target="_blank">openstreetmap_analyze_sql</a>)

Our analysis broke down into two steps.

The first step extracted some <a href="#statistics">statistics</a> as required by our project.
The second step covered our <a href="#analysis">ideas for further analysis</a>.


***
<a id='problems'></a>
### Problems encountered in our map database

The audit focused on tags of type `addr`, with keys `street`, `postcode` and `city`:

****
#### Streets naming


After some coding, we get the list of street types, extracted as the first field of the `street` key value:
```
10, 7, ALLEE, allée, Allée, Allées, allées, André, Angle, Av., AVENUE, avenue, Avenue, Barrière, Bd, BIS, Bis, Boulevard, boulevard, bvd, C.c., CC, Centre, Chemin, chemin, Cheminement, Clos, Descente, esplanade, Esplanade, face, Frédéric, Grande, Impasse, impasse, La, la, Lotissement, Mail, Passage, Place, place, Port, Promenade, Quai, R.n., Rond-Point, ROUTE, Route, route, rte, Rue, rue, RUE, Savary, Square, Sur, Voie, voie
```
- the case should be standardized, e.g. "Allée" instead of "ALLEE" or "allée"
- numbers ("10", "7") and number extensions ("Bis") belong in the `housenumber` key, not in `street`
- abbreviations should be expanded, e.g. "rte" replaced by "Route"
- some other values are clearly not street types ("La", "Mail", "Sur") and should be prefixed by a real street type or moved to an additional `street_` tag
- some names use a whitespace where a dash is expected
- some French accents (`éèàîâ`...) are missing
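
As an illustration of the extraction of the first field and of the abbreviation problem, here is a minimal sketch. The mapping `ABBREVIATIONS` and the helper names are hypothetical: they are built from abbreviations visible in the list above, not taken from the project notebooks, and the street names in the usage note are only illustrative.

```python
# Hypothetical mapping, limited to abbreviations seen in the audit list.
ABBREVIATIONS = {
    'av.': 'Avenue',
    'bd': 'Boulevard',
    'bvd': 'Boulevard',
    'rte': 'Route',
}

def street_type(name):
    """Return the first whitespace-separated field of a street name."""
    fields = name.split()
    return fields[0] if fields else ''

def expand_street_type(name):
    """Replace an abbreviated leading street type by its full form."""
    first = street_type(name)
    full = ABBREVIATIONS.get(first.lower())
    if full is None:
        # Not a known abbreviation: leave the name untouched.
        return name
    return full + name.strip()[len(first):]
```

For example, `expand_street_type('rte de Narbonne')` gives `'Route de Narbonne'`, while a name that already starts with a full street type is returned unchanged.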

The first problem was fixed with the notebook `openstreetmap_clean_csv.ipynb`.
After cleaning, the list of street types is:
```
10, 7, Allee, Allée, Allées, André, Angle, Av., Avenue, Barrière, Bd, Bis, Boulevard, Bvd, C.C., Cc, Centre, Chemin, Cheminement, Clos, Descente, Esplanade, Face, Frédéric, Grande, Impasse, la, Lotissement, Mail, Passage, Place, Port, Promenade, Quai, R.N., Rond-Point, Route, Rte, Rue, Savary, Square, Sur, Voie
```

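Conclusion: letter case is now standardized. The other problems listed above (abbreviations such as "Bd" or "Rte", missing accents as in "Allee", misplaced numbers) are still present.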

****
#### Zip codes

Automated cleaning of malformed zip codes was applied to the whole database, forcing every value to the expected 5-digit format:
```
********* Cleaning node_tags ***********
31000;31100;31200;31300;31400;31500 ==> 31000
31000;31100;31200;31300;31400;31500 ==> 31000
31200‎ ==> 31200
31200‎ ==> 31200
31520 Ramonville Saint Agne ==> 31520
********* Cleaning way_tags ***********
31700 BLAGNAC ==> 31700
3140 ==> 31400
31000;31100;31200;31300;31400;31500 ==> 31000
```

The corresponding code, taken from `standardize_zipcode`, did the job as required:
```python
    # searching for 2 to 5 digits
    mysearch = re.search(r'(\d{2,5})',code)
    if mysearch:
        scode = mysearch.group(1)
        for i in range(len(scode), 5): scode += '0'
    else:
        scode = ''
```

****
#### Cities

Some city names are redundant. Cleaning based on simple regexp matching has been implemented:

```python
RE_CITYMAP = {
    'Toulouse': re.compile(r'.*toulouse.*', re.I),
    'Ramonville-Saint-Agne': re.compile(r'.*ramonville.*', re.I),
    'Rouffiac-Tolosan': re.compile(r'.*rouffiac.*', re.I),
    'Saint-Orens-de-Gameville': re.compile(r'.*saint.*orens.*', re.I)
}
def mapping_city(name):
    'replace given name by mapping dict key if given name matches mapping dict corresponding regexp'
    for k,v in RE_CITYMAP.items():
        if v.match(name): return k
    return name
```

Before cleaning:
```
Aucamville
Auzeville-Tolosane
Balma
Beauzelle
Blagnac
Castanet-Tolosan
Colomiers
Cornebarrieu
Cugnaux
Fenouillet
L'Union
Labège
Montrabé
Portet-sur-Garonne
Quint-Fonsegrives
Ramonville
Ramonville ST Agne
Ramonville Saint Agne
Ramonville Saint-Agne
Ramonville-Saint-Agne
Rouffiac
Rouffiac-Tolosan
Saint-Jean
Saint-Orens-de-Gameville
TOULOUSE
TOULOUSE Cedex 5
Toulouse
Toulouse Cedex 1
Tournefeuille
Vieille-Toulouse
Villeneuve-Tolosane
saint orens
toulouse
```

After cleaning:
```
Aucamville
Auzeville-Tolosane
Balma
Beauzelle
Blagnac
Castanet-Tolosan
Colomiers
Cornebarrieu
Cugnaux
Fenouillet
L'Union
Labège
Montrabé
Portet-sur-Garonne
Quint-Fonsegrives
Ramonville-Saint-Agne
Rouffiac-Tolosan
Saint-Jean
Saint-Orens-de-Gameville
Toulouse
Tournefeuille
Villeneuve-Tolosane
```
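
Note that `Vieille-Toulouse` appears in the list before cleaning but not after: the unanchored pattern `.*toulouse.*` also matches it, so this separate commune is merged into `Toulouse`. A possible refinement, given here as a sketch and not as the project's code, is to anchor the pattern. The names `RE_TOULOUSE` and `map_toulouse` are hypothetical.

```python
import re

# Anchored variant (an assumption, not the project's code): the name must
# start with "toulouse", optionally followed by a "Cedex" suffix.
RE_TOULOUSE = re.compile(r'^toulouse(\s+cedex\s*\d*)?\s*$', re.I)

def map_toulouse(name):
    """Return 'Toulouse' for spelling variants of Toulouse, else the name unchanged."""
    return 'Toulouse' if RE_TOULOUSE.match(name.strip()) else name
```

With this pattern, `TOULOUSE Cedex 5` and `toulouse` are still mapped to `Toulouse`, while `Vieille-Toulouse` is left as it is.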




Details can be found in the following notebooks:
- <a href="openstreetmap_audit_sql.html" target="_blank">openstreetmap_audit_sql.html</a>.
- <a href="openstreetmap_clean_csv.html" target="_blank">openstreetmap_clean_csv.html</a>.
- <a href="openstreetmap_cleancsv2sql.html" target="_blank">openstreetmap_cleancsv2sql.html</a>.


***
<a id='statistics'></a>
### Some statistics

- OSM file size, once uncompressed, is about 460 MB
- number of nodes is almost 2 million
- number of ways is above 300,000
- number of unique users is 1360
- number of users with single contribution is 309
- number of users with contribution level above average is 55
- number of restaurants is 773

These statistics were extracted with standard SQL queries, such as this one, which counts unique users:
```sql
SELECT COUNT(DISTINCT(e.uid))          
FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;
```
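
The same pattern extends to the other user statistics. The sketch below counts users with a single contribution, assuming a SQLite database and assuming that a contribution is one row in `nodes` or `ways`. The sample rows are hypothetical and the query is not necessarily the one used in the notebook.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical sample rows: uid 10 contributes three times, 20 and 30 once.
conn.executescript("""
CREATE TABLE nodes (id INTEGER, uid INTEGER);
CREATE TABLE ways  (id INTEGER, uid INTEGER);
INSERT INTO nodes VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO ways  VALUES (100, 30), (101, 10);
""")

QUERY = """
SELECT COUNT(*) FROM (
    SELECT e.uid
    FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e
    GROUP BY e.uid
    HAVING COUNT(*) = 1
);
"""
single_contributors = conn.execute(QUERY).fetchone()[0]
```

`UNION ALL` keeps duplicate rows, which is what allows `COUNT(*)` to measure the number of contributions per user.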

Details can be found <a href="openstreetmap_analyze_sql.html" target="_blank">here</a>.


***
<a id='analysis'></a>
### Further analysis

The idea was to explore one given type of object in detail.
While sorting amenities by frequency, we learnt that buildings are described with ways rather than nodes, in order to specify their physical limits precisely.
We decided to focus on amenities of type 'retirement home' (not very funky!) since we found one of them specified as a node rather than a way. This raised a question: are we sure that this node is not redundant with one of the ways? One way to answer is to compute and compare the amenities' locations. The Python and SQL snippets below illustrate the approach:

- new table listing buildings with their center location:
```sql
CREATE TABLE IF NOT EXISTS buildings AS
SELECT wid, avglat, avglon
FROM (
    SELECT way_nodes.id as wid, avg(lat) as avglat, avg(lon) as avglon
    FROM way_tags
    JOIN way_nodes ON way_tags.id=way_nodes.id
    JOIN nodes ON nodes.id=way_nodes.node_id
    WHERE way_nodes.position > 0 AND way_tags.key='building' AND way_tags.value='yes'
    GROUP BY wid
    )
```

- function computing distance between two sets of objects coordinates
```python
def equi_rect_distance(lat1deg,lon1deg,lat2deg,lon2deg):
    '''
    Computes equirectangular distance between two points expressed in (lat, lon) in degrees
    '''
    lat1rad,lon1rad,lat2rad,lon2rad = tuple(map((lambda x:x*np.pi/180.), (lat1deg,lon1deg,lat2deg,lon2deg)))
    R = 6371000.  #radius of the earth in m
    dx = (lon1rad - lon2rad) * np.cos( 0.5*(lat1rad + lat2rad) )
    dy = lat1rad - lat2rad
    d = np.sqrt( dx*dx + dy*dy )
    return R*d
```
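
To compare a node with the building centers, the distance can be computed against every row of the `buildings` table and the results sorted. The sketch below is self-contained: it restates the distance formula with the standard `math` module, and uses a hypothetical helper `closest_buildings` with made-up sample coordinates.

```python
import math

def equi_rect_distance(lat1, lon1, lat2, lon2):
    """Equirectangular distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = [math.radians(v) for v in (lat1, lon1, lat2, lon2)]
    dx = (lon1 - lon2) * math.cos(0.5 * (lat1 + lat2))
    dy = lat1 - lat2
    return 6371000. * math.sqrt(dx * dx + dy * dy)

def closest_buildings(lat, lon, buildings, n=3):
    """Return the n (distance, wid) pairs closest to the point (lat, lon)."""
    ranked = sorted((equi_rect_distance(lat, lon, blat, blon), wid)
                    for wid, blat, blon in buildings)
    return ranked[:n]

# Hypothetical rows shaped like the buildings table: (wid, avglat, avglon).
buildings = [(1, 43.6045, 1.4440), (2, 43.6100, 1.4500), (3, 43.5800, 1.4000)]
nearest = closest_buildings(43.6046, 1.4441, buildings, n=1)
```

A node lying only a few metres away from a building center is a good candidate for a duplicate of that building.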

This work showed that our 'retirement home' node was in fact redundant with an existing way object.

Another idea, not covered by the current project, would be: "can we build a simple tool able to extract some kind of amenities sorted by proximity from our location?". The functions and tables built during the project are a good starting point...

Details can be found <a href="openstreetmap_analyze_sql.html" target="_blank">here</a>.



***
<a id='improvement'></a>
### Ideas for database improvement

A robust way to improve the content of the database would be to develop a cleaning robot that fixes all `addr` tags by looking up the proper street name on the official French <a href="https://www.cadastre.gouv.fr/scpc/rechercherPlan.do" target="_blank">cadastre</a> website, which references all the streets in France. This could be achieved with the techniques presented in lesson 3, "Data in More Complex Formats", which teaches how to scrape data from an HTTP server: identify the part of the HTTP content that carries our query, then extract the relevant information with the Python module BeautifulSoup.
***
Another idea would be to encourage contributors to reference an existing street way through an OSM `relation` feature, rather than simply declaring an `addr` tag with a sometimes wrong street name.
***
A last idea would be to develop an engine that, for every node and way tag using `addr:street` and/or `addr:housenumber` entries:
- gets its parent node or way
- gets its exact location (node) or average location (way)
- retrieves the closest available street way
- adds the needed relation features if they do not exist yet
***

These ideas were not implemented in this project, but we can foresee the following benefits and difficulties:
- Benefits:
    - Street names would be defined only once, so they would be unique, which is the basis of any robust database
    - Fully exploiting the relational nature of the OSM schema would avoid redundancies
- Possible issues:
    - Scraping the official "cadastre" website may not be as simple as scraping the TranStats aircraft traffic website!
    - Automating the construction of missing features, such as housenumber nodes or way relations, may be tricky...


***
<a id='conclusion'></a>
### Conclusion

This OSM project was a great opportunity to:
- discover OSM as an open-source alternative for my maps
- grasp the limits of open-source contributions
- think about how these contributions could be improved
- think about memory management while feeding the SQL database with big data
- practice SQL queries and database management
- learn Python's `yield` keyword and how to build custom generators
- practice regular expressions for cleaning with the `re.sub` function
- make progress on Markdown layout