# SQLite database audit

In [1]:
import sys, os, sqlite3, pandas, pprint, re
import numpy as np
project_path = "C:\\Users\\TO72078\\Documents\\BIG_DATA\\UDACITY\\projects\\openstreetmap"
if project_path not in sys.path: sys.path.append(project_path)
from myeasysql import db_query
db_name = 'toulouse'
db_conn = sqlite3.connect(os.path.join(project_path, '%s.db' % db_name))
c = db_conn.cursor()

Let's start by checking the consistency of the database between `nodes` and `way_nodes`. For example, let's check that every node referenced in `way_nodes` exists in `nodes`:

In [69]:
way_nodes_ids = db_query(c, "SELECT id, node_id FROM way_nodes GROUP BY node_id;")
for (wid,nid) in way_nodes_ids:
    result = db_query(c, "SELECT id FROM nodes WHERE id = %d;" % nid)
    if len(result) ==  0: print 'no match for node_id %d referenced by way %d!' % (nid,wid)

The loop prints nothing, so the database is consistent: every `node_id` in `way_nodes` has a matching `id` in `nodes`.
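
The check above issues one query per referenced node. As a minimal sketch, the same test could be expressed as a single anti-join. It assumes only the `id` and `node_id` columns already used above; the function name `find_orphan_way_nodes` and the in-memory demo rows are made up for illustration.

```python
import sqlite3

def find_orphan_way_nodes(conn):
    """Return (way_id, node_id) pairs whose node_id is missing from nodes."""
    query = """
        SELECT wn.id, wn.node_id
        FROM way_nodes AS wn
        LEFT JOIN nodes AS n ON n.id = wn.node_id
        WHERE n.id IS NULL;
    """
    return conn.execute(query).fetchall()

# Tiny in-memory database with made-up rows, only to exercise the query
demo = sqlite3.connect(':memory:')
demo.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY);
    CREATE TABLE way_nodes (id INTEGER, node_id INTEGER);
    INSERT INTO nodes VALUES (1), (2);
    INSERT INTO way_nodes VALUES (10, 1), (10, 2), (11, 3);
""")
orphans = find_orphan_way_nodes(demo)
print(orphans)
```

On the real database an empty result would confirm the conclusion above; here the demo deliberately contains one dangling reference (way 11, node 3).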

Now let's look for invalid tags as flagged by the `openstreetmap_osm2csv` step (column `valid` with value `no`):

In [70]:
db_query(c, 'SELECT COUNT(*) FROM (SELECT * FROM node_tags UNION ALL SELECT * FROM way_tags) as u WHERE u.valid="no";')

[(0,)]

None of the node or way tags contains any of the unwanted characters `=\+/&<>;\'"\?%#$@\,\. \t\r\n`.
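
The `openstreetmap_osm2csv` step itself is not shown here. As a hedged sketch, a check of this kind could be a regular expression built from the character class quoted above. The names `PROBLEM_CHARS` and `is_valid_key` are hypothetical, and applying the test to the tag key (rather than the value) is an assumption.

```python
import re

# Character class copied from the list of unwanted characters above
PROBLEM_CHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def is_valid_key(key):
    """Return True when the key contains none of the unwanted characters."""
    return PROBLEM_CHARS.search(key) is None

# Illustrative keys: the colon is allowed, the space and the dot are not
for key in ('addr:street', 'opening hours', 'contact.phone'):
    print('%s -> %s' % (key, is_valid_key(key)))
```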

We may assume that other trainees have already chosen Toulouse for their Udacity OpenStreetMap project and fixed the bad tags ;-). More likely, some active contributors have already done the cleaning job.

Let's have a look at the specific tag types that have been captured ("specific" means "not equal to `regular`"):

In [71]:
QUERY = "SELECT COUNT(*) FROM (SELECT type FROM node_tags WHERE type!='regular' GROUP BY type);"
nb_node_tags_types = db_query(c, QUERY)[0][0]
print 'Number of different tag types for nodes is %d, first types, sorted by frequency, are:' % nb_node_tags_types
QUERY = """
SELECT type, count(*) as num FROM node_tags
WHERE type!='regular'
GROUP BY type
ORDER BY num DESC LIMIT 10;
"""
for (k,v) in db_query(c, QUERY): print '%s:%d' % (k,v)

Number of different tag types for nodes is 83, first types, sorted by frequency, are:
addr:100254
source:58016
fire_hydrant:1141
recycling:791
contact:544
post_box:241
name:198
fuel:139
operator:126
surveillance:104


In [72]:
QUERY = "SELECT COUNT(*) FROM (SELECT type FROM way_tags WHERE type!='regular' GROUP BY type);"
nb_way_tags_types = db_query(c, QUERY)[0][0]
print 'Number of different tag types for ways is %d, first types, sorted by frequency, are:' % nb_way_tags_types
QUERY = """
SELECT type, count(*) as num FROM way_tags
WHERE type!='regular'
GROUP BY type
ORDER BY num DESC LIMIT 10;
"""
for (k,v) in db_query(c, QUERY): print '%s:%d' % (k,v)

Number of different tag types for ways is 79, first types, sorted by frequency, are:
addr:3364
roof:2481
building:1787
source:922
name:692
zone:510
cycleway:389
turn:261
maxspeed:216
parking:191


Since the tag type `addr` is by far the most frequent for both nodes and ways, let's focus on it for our cleaning step.

### Auditing `addr` fields

Let's extract and sort by frequency all the keys used by the tag `addr`:

In [73]:
QUERY = """
SELECT key, count(*) as num FROM (
    SELECT key, type FROM node_tags UNION ALL SELECT key, type FROM way_tags
    ) WHERE type='addr' GROUP BY key ORDER by num DESC;
"""
for (k,v) in db_query(c, QUERY): print '%s:%d' % (k,v)

housenumber:77351
street:14089
city:5195
country:4778
postcode:2008
housename:169
unit:14
street_:7
full:5
interpolation:1
place:1


The key `street_` looks odd: why add an underscore character? Let's display an object using this key:

In [2]:
QUERY = """
SELECT id FROM (
    SELECT id, key, type FROM node_tags UNION ALL SELECT id, key, type FROM way_tags
    ) WHERE type='addr' AND key='street_' LIMIT 1;
"""
wnid = db_query(c, QUERY)[0][0]
print wnid
QUERY = """
SELECT key, value FROM (
    SELECT * FROM node_tags UNION ALL SELECT * FROM way_tags
    ) WHERE id = %d;
""" % wnid
for (k,v) in db_query(c, QUERY): print '%s:%s' % (k,v)

3172251261
city:Portet-sur-Garonne
country:FR
full:ZI de Toulouse Sud - 3 allée Pablo PicassoRoute d'Espagne
housenumber:3
postcode:31120
street:allée Pablo Picasso
street_:ZI de Toulouse Sud - Route d'Espagne
air_conditioning:yes
amenity:restaurant
phone:+33561448019
name:Ristorante Del Arte
opening_hours:Mo-Su 11:30-14:30, 18:30-22:30
credit_cards:yes
website:http://www.delarte.fr/restaurant-italien-pizzeria-toulouse-portet-sur-garonne.html
wheelchair:yes


The key `street_` is in fact a simple extension of the existing `street` key, used to hold additional address fields.

## street tag type
Now let's focus on the key `street`, the most frequent one after `housenumber`:

In [75]:
QUERY = """
SELECT value FROM (
    SELECT * FROM node_tags UNION ALL SELECT * FROM way_tags
    ) WHERE type='addr' AND key='street' GROUP BY value;
"""
street_names = db_query(c, QUERY)

In [76]:
for (v,) in street_names[10:15]: print v

Allée Gabriel Biénès
Allée Gabriel Faure
Allée Henri Sellier
Allée Jacques Chaban-Delmas
Allée Jean Griffon


## street type
Let's extract the street type as first field of the street name:

In [77]:
street_types = set()
for (street,) in street_names:
    street_types.add(street.split()[0])
print street_types

set([u'rte', u'Boulevard', u'Angle', u'La', u'ROUTE', u'esplanade', u'Passage', u'Cheminement', u'Chemin', u'Sur', u'Savary', u'bvd', u'R.n.', u'Mail', u'Port', u'AVENUE', u'all\xe9e', u'Bd', u'Fr\xe9d\xe9ric', u'boulevard', u'Centre', u'C.c.', u'avenue', u'Voie', u'ALLEE', u'Rue', u'Promenade', u'Av.', u'Quai', u'7', u'chemin', u'BIS', u'Barri\xe8re', u'All\xe9es', u'CC', u'Lotissement', u'voie', u'rue', u'Route', u'Impasse', u'Place', u'Grande', u'Rond-Point', u'10', u'Square', u'all\xe9es', u'face', u'la', u'route', u'Descente', u'Esplanade', u'Bis', u'impasse', u'RUE', u'place', u'Clos', u'Avenue', u'Andr\xe9', u'All\xe9e'])


For the sake of readability, let's encode the unicode strings and build a sorted list of street types:

In [78]:
def str_encode(v):
    """Return string object properly encoded if necessary"""
    return v.encode('utf-8') if isinstance(v, unicode) else str(v)
pretty_street_types = sorted([str_encode(stype) for stype in street_types], key=str.lower)
print ', '.join(pretty_street_types)

10, 7, ALLEE, allée, Allée, Allées, allées, André, Angle, Av., AVENUE, avenue, Avenue, Barrière, Bd, BIS, Bis, Boulevard, boulevard, bvd, C.c., CC, Centre, Chemin, chemin, Cheminement, Clos, Descente, esplanade, Esplanade, face, Frédéric, Grande, Impasse, impasse, La, la, Lotissement, Mail, Passage, Place, place, Port, Promenade, Quai, R.n., Rond-Point, ROUTE, Route, route, rte, Rue, rue, RUE, Savary, Square, Sur, Voie, voie


Now it's easy to list the defects or redundancies:
- numbers and "Bis" terms belong in the `housenumber` key
- the case should be standardized, e.g. "Allée" instead of "ALLEE" or "allée"
- abbreviations should be expanded, e.g. "rte" replaced by "Route"
- some other values are clearly not street types ("La", "Sur", "face"...); let's display these odd values:

In [79]:
weird_values = list()
for (v,) in street_names:
    st = v.split()[0]
    st = str_encode(st)
    if st.lower() in ('sur', 'face', 'la', 'grande', 'frédéric'):
        weird_values.append(v)
        print v

Frédéric Petit
Grande Rue Nazareth
Grande Rue Saint-Michel
Grande rue Saint-Michel
La Pyrénéenne
Sur facade du Théâtre face 1 place du Capitole
Sur parking face à la rue Porte Sardane
face 5 place du Capitole
la lauragaise


- Some street names lack the street type: "Frédéric Petit" should be "Rue Frédéric Petit"
- Extracting the first field is not enough for the type "Grande Rue"
- By local usage, some Toulouse street names have no street type: "La Lauragaise", "La Pyrénéenne"
- Other values, such as those starting with "Sur" or "face", need to be displayed in full to be interpreted:

In [80]:
for value in weird_values:
    # looping only on unexplained values:
    if value.split()[0].lower() not in ('sur', 'face'): continue
    print '************************'
    QUERY = """
    SELECT id FROM (
        SELECT id, value, type FROM node_tags UNION ALL SELECT id, value, type FROM way_tags
        ) WHERE type='addr' AND value='%s';
    """ % value
    wnid = db_query(c, QUERY)[0][0]
    QUERY = """
    SELECT key, value FROM (
        SELECT * FROM node_tags UNION ALL SELECT * FROM way_tags
        ) WHERE id = %d;
    """ % wnid
    for (k,v) in db_query(c, QUERY): print '%s:%s' % (k,v)

************************
name:Municipale-20
man_made:surveillance
street:Sur facade du Théâtre face 1 place du Capitole
************************
name:Municipale-13
man_made:surveillance
street:Sur parking face à la rue Porte Sardane
************************
name:Municipale-18
man_made:surveillance
street:face 5 place du Capitole


Got it! These tags belong to `man_made:surveillance` objects: they should have been recorded under the `surveillance` tag type and not under `addr`.
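
The case and abbreviation defects listed earlier could be handled with a small lookup on the first word of the street name. The following is only a sketch, written for Python 3 (the notebook itself runs Python 2). The list of canonical types is a subset picked from the values displayed above, the names `TYPE_LOOKUP` and `normalize_street_type` are hypothetical, and ambiguous abbreviations such as "R.n." or "C.c." are deliberately left untouched.

```python
# Canonical spellings for a subset of the street types seen above
CANONICAL_TYPES = ['Allée', 'Allées', 'Avenue', 'Boulevard', 'Chemin',
                   'Esplanade', 'Impasse', 'Place', 'Route', 'Rue', 'Voie']
# Abbreviated or unaccented forms mapped to their canonical spelling
ABBREVIATIONS = {'allee': 'Allée', 'av.': 'Avenue', 'bd': 'Boulevard',
                 'bvd': 'Boulevard', 'rte': 'Route'}

TYPE_LOOKUP = {t.lower(): t for t in CANONICAL_TYPES}
TYPE_LOOKUP.update(ABBREVIATIONS)

def normalize_street_type(street):
    """Rewrite the first word when it is a known street type, else return the name unchanged."""
    parts = street.split()
    if not parts:
        return street
    fixed = TYPE_LOOKUP.get(parts[0].lower())
    if fixed is None:
        return street  # unknown first word: "La Pyrénéenne", "Grande Rue ..."
    return ' '.join([fixed] + parts[1:])

# Illustrative inputs
for name in ('ALLEE Jean Griffon', 'rte de Revel', 'La Pyrénéenne'):
    print(normalize_street_type(name))
```

Leaving unknown first words unchanged keeps names such as "La Pyrénéenne" intact; multi-word types such as "Grande Rue" would need a separate rule.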

## street names
Now let's hunt for duplicates within the list of street names:

In [81]:
# building street names without street type
names_list = list()
for (value,) in street_names:
    stype = value.split()[0]
    sname = ' '.join(value.split()[1:])
    names_list.append(sname)
print names_list[100:105]

[u'Crampel', u'Debat-Ponsan', u'Didier Daurat', u'Eisenhower', u'Etienne Billi\xe8res']


In [82]:
# some helper function for strings comparison, looking for small discrepancies from one string to another
def diff_strings(a,b):
    ''' counts differing characters, position by position, over the length of the shorter stripped string'''
    a = a.strip()
    b = b.strip()
    diff_len = min(len(a), len(b))
    diff_array = np.array([char for char in a[:diff_len]]) != np.array([char for char in b[:diff_len]])
    return np.where(diff_array==True)[0].size

# testing function, expecting result=1
a="Rue d'Alsace Lorraine"
b="   Rue d'Alsace-Lorraine   "
assert diff_strings(a,b) == 1 # difference is the dash character instead of whitespace
a="Rue d'Alsace Lorraine"
b="Rue d'alsace"
assert diff_strings(a,b) == 1 # case difference

Now we can compare each pair of neighbours in the street names list and detect streets that appear under two spellings:

In [83]:
# comparing Nth vs. (N-1)th, displaying names that differ by exactly 1 character
arr1 = names_list[:-1]
arr2 = names_list[1:]
vect_diff_strings = np.vectorize(diff_strings)
diff_strings_array = vect_diff_strings(arr1, arr2)
arr1_diff = np.array(arr1)[diff_strings_array==1]
arr2_diff = np.array(arr2)[diff_strings_array==1]
pandas.DataFrame({'N-1':arr1_diff, 'N':arr2_diff})

| | N | N-1 |
|---|---|---|
| 0 | Jean Jaurès | Jean Jaures |
| 1 | de Grand-Selve | de Grand Selve |
| 2 | Jean rieux | Jean Rieux |
| 3 | Saint-Exupéry | Saint Exupéry |
| 4 | des États-Unis | des États Unis |
| 5 | du Château d’Eau | du Château d'Eau |
| 6 | Étienne Billières | Étienne Billieres |
| 7 | du professeur Léopold Escande | du Professeur Léopold Escande |
| 8 | Leclerc - Rue du commerce | leclerc |
| 9 | rue Saint-Michel | Rue Saint-Michel |

The same street can be written in different ways for the following reasons:
- whitespace instead of dash
- lower case instead of upper case
- missing French accents (`éèàîâ`...)
- typographic apostrophe (`’`) instead of a straight one (`'`)
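
One way to catch these variants without comparing neighbours is to reduce every name to a comparison key and group the names sharing a key. This is a sketch under assumptions, written for Python 3: the helpers `street_key` and `group_variants` are hypothetical, and the sample names are taken from the table above.

```python
import unicodedata
from collections import defaultdict

def street_key(name):
    """Build a comparison key: unified apostrophes and dashes, no accents, lower case."""
    name = name.replace('\u2019', "'").replace('-', ' ')
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(no_accents.lower().split())

def group_variants(names):
    """Map each key to its sorted spellings, keeping only keys with several spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[street_key(name)].add(name)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

sample = ['Jean Jaurès', 'Jean Jaures', 'Saint-Exupéry', 'Saint Exupéry',
          'du Château d\u2019Eau', "du Château d'Eau", 'Crampel']
variants = group_variants(sample)
for key in sorted(variants):
    print('%s: %s' % (key, ' | '.join(variants[key])))
```

Unlike the position-by-position comparison, the key does not depend on the ordering of the list, and names of different lengths are never matched by accident.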

## zipcodes / cities
Now let's check the zipcodes and the cities:

In [84]:
QUERY = """
SELECT value FROM (
    SELECT * FROM node_tags UNION ALL SELECT * FROM way_tags
    ) WHERE type='addr' AND key='postcode' GROUP BY value;
"""
zipcodes = db_query(c, QUERY)

In [85]:
print zipcodes

[(u'31000',), (u'31000;31100;31200;31300;31400;31500',), (u'31015',), (u'31018',), (u'31020',), (u'31021',), (u'31022',), (u'31024',), (u'31026',), (u'31027',), (u'31028',), (u'31035',), (u'31036',), (u'31037',), (u'31047',), (u'31053',), (u'31055',), (u'31060',), (u'31062',), (u'31065',), (u'31070',), (u'31076',), (u'31079',), (u'31081',), (u'31094',), (u'31100',), (u'31120',), (u'31127',), (u'31130',), (u'31140',), (u'31150',), (u'31170',), (u'31180',), (u'31200',), (u'31200\u200e',), (u'31240',), (u'31242',), (u'31270',), (u'31300',), (u'31320',), (u'3140',), (u'31400',), (u'31432',), (u'31500',), (u'31506',), (u'31520',), (u'31520 Ramonville Saint Agne',), (u'31650',), (u'31670',), (u'31700',), (u'31700 BLAGNAC',), (u'31701',), (u'31750',), (u'31770',), (u'31776',), (u'31840',), (u'31850',), (u'31901',), (u'68199',)]


Let's display the zipcodes that do not match the standard 5-digit format:

In [86]:
for (zc,) in zipcodes:
    if not re.match(r'^\d{5}$', zc):
        print repr(zc)

u'31000;31100;31200;31300;31400;31500'
u'31200\u200e'
u'3140'
u'31520 Ramonville Saint Agne'
u'31700 BLAGNAC'


Some zipcodes are clearly malformed: several codes joined in one value, a city name appended to the code, an invisible trailing character (`\u200e`), or a missing digit (`3140`).
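
A possible cleaning rule, given only as a sketch, is to extract every standalone run of exactly five digits from the raw value. The function name `extract_postcodes` is hypothetical; the inputs are the malformed values displayed above.

```python
import re

# Exactly five digits, not preceded or followed by another digit
FIVE_DIGITS = re.compile(r'(?<!\d)\d{5}(?!\d)')

def extract_postcodes(raw):
    """Return the list of 5-digit codes found in a raw postcode value."""
    return FIVE_DIGITS.findall(raw)

malformed = ['31000;31100;31200;31300;31400;31500', '31200\u200e', '3140',
             '31520 Ramonville Saint Agne', '31700 BLAGNAC']
for raw in malformed:
    print('%r -> %s' % (raw, extract_postcodes(raw)))
```

A value such as `3140` yields an empty list and would still need a manual fix, while a multi-code value yields several candidates that cannot be resolved automatically.
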
Furthermore, it appears that cities other than Toulouse are part of our OSM extract; let's display them:

In [88]:
QUERY = """
    SELECT DISTINCT(value) FROM (
    SELECT * FROM node_tags UNION ALL SELECT * FROM way_tags
    ) WHERE type='addr' AND key='city' AND value NOT LIKE '%toulouse%' ORDER BY value;
"""
# loop variable renamed so that it no longer overwrites the cursor `c`
for (city,) in db_query(c, QUERY): print city

Aucamville
Auzeville-Tolosane
Balma
Beauzelle
Blagnac
Castanet-Tolosan
Colomiers
Cornebarrieu
Cugnaux
Fenouillet
L'Union
Labège
Montrabé
Portet-sur-Garonne
Quint-Fonsegrives
Ramonville
Ramonville ST Agne
Ramonville Saint Agne
Ramonville Saint-Agne
Ramonville-Saint-Agne
Rouffiac
Rouffiac-Tolosan
Saint-Jean
Saint-Orens-de-Gameville
Tournefeuille
Villeneuve-Tolosane
saint orens


First observation: some cities appear under several spellings (e.g. five variants of Ramonville-Saint-Agne) and require some cleaning.
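
A minimal sketch of such a cleaning, written for Python 3, maps each spelling to one canonical name through an alias table. The names `CITY_ALIASES`, `city_key` and `canonical_city` are hypothetical. Treating "Rouffiac" as Rouffiac-Tolosan and "saint orens" as Saint-Orens-de-Gameville is an assumption that should be checked against the postcode of each object.

```python
# Alias table keyed by a lower-case, dash-free form of the city name
CITY_ALIASES = {
    'ramonville': 'Ramonville-Saint-Agne',
    'ramonville st agne': 'Ramonville-Saint-Agne',
    'ramonville saint agne': 'Ramonville-Saint-Agne',
    'rouffiac': 'Rouffiac-Tolosan',            # assumption, to be verified
    'saint orens': 'Saint-Orens-de-Gameville',  # assumption, to be verified
}

def city_key(name):
    """Lower-case the name and replace dashes by single spaces."""
    return ' '.join(name.lower().replace('-', ' ').split())

def canonical_city(name):
    """Return the canonical spelling when an alias is known, else the name unchanged."""
    return CITY_ALIASES.get(city_key(name), name)

for city in ('Ramonville ST Agne', 'Ramonville Saint-Agne', 'saint orens', 'Balma'):
    print('%s -> %s' % (city, canonical_city(city)))
```
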
Second observation: the OpenStreetMap extract of Toulouse references some nodes/ways that lie beyond the city boundaries. This is not surprising, since there is no "dead zone" between Toulouse and the small and mid-sized towns around it.
Let's display an example of tags referencing a neighbouring city:

In [22]:
QUERY = """
    SELECT id, key, value, type FROM way_tags
    WHERE id IN
    (SELECT DISTINCT(ID) FROM way_tags
    WHERE type='addr' AND key='city' AND value='Blagnac' LIMIT 1)
    ;
"""
# loop variables renamed so that the cursor `c` is not overwritten
for (wid, key, value, ttype) in db_query(c, QUERY): print '[%s] %s: %s (%s)' % (wid, key, value, ttype)

[64403963] city: Blagnac (addr)
[64403963] housenumber: 10 bis (addr)
[64403963] postcode: 31700 (addr)
[64403963] street: Avenue du Général Compans (addr)
[64403963] amenity: restaurant (regular)
[64403963] building: yes (regular)
[64403963] cuisine: regional (regular)
[64403963] name: Terre bretonne (regular)
[64403963] opening_hours: 11:45-14:00,19:00-21:30 (regular)
[64403963] takeaway: no (regular)
[64403963] website: http://www.creperie-terre-bretonne.fr/ (regular)


This particular way object (amenity) clearly has nothing to do with the city of Toulouse. We may thus assume that its presence in our OSM extract is due to the extraction method. For example, since some objects, such as bus or metro lines, may cross the city limits, the extraction may avoid splitting them and thereby enlarge the extracted area...