# Data wrangling on OpenStreetMap data
In this project, which is part of the Udacity Data Analysis Nanodegree, I apply data munging techniques to clean a specific area of OpenStreetMap data, assessing its quality in terms of validity, accuracy, completeness, consistency and uniformity. Afterwards, to practice database manipulation in Python, I load the cleaned data into a MongoDB collection (installed locally on my machine) and compute some simple statistics on it.

## Choosing an OpenStreetMap region: Missões!
My region of interest in this project is Santo Ângelo, a small countryside city in the southernmost state of Brazil, where I was born. However, since there is little data for this city and this project requires a dataset larger than 50 MB, I also consider all the neighboring cities. Together they constitute the "Missões" region [1], which represents an important chapter in South American history, since its first settlements were founded during the Spanish colonial missions [2].

Although today there are 46 municipalities composing this region, in the early eighteenth century there were only 7 villages, nowadays known in Portuguese as the "Sete Povos das Missões":
- São Miguel das Missões  
- Santo Ângelo  
- São Borja  
- São Nicolau  
- São Luiz Gonzaga  
- São Lourenço  
- Entre-Ijuís (where the ruins of the town of São João Batista remain)  

## Some basic statements

In [1]:
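# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically.
import xml.etree.ElementTree as ET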
from collections import defaultdict
import pprint
import re

# Dataset file name:
FILENAME = 'Missoes.osm'

In [2]:
#%% Regular expression functions:
def avalia_regex(dataset, regex, nsamples=True, returnList=False):
    r'''Helper that reports the matches of a regular expression (regex) in a text (str).
    Syntax: avalia_regex(dataset, regex, nsamples=True, returnList=False),
        dataset = string to be searched;
        regex = regular expression to look for;
        nsamples = if True, print every match found; otherwise print only the first 3, if any.
        returnList = if True, return a list with the matches found (default False).

    Usage example: avalia_regex(t03, '\.{2,}\s')'''
    filtro = re.findall(regex, dataset)
    count = len(filtro)
    print('Found {} matches for the expression "{}".'.format(count, regex))
    if nsamples:
        # Print every match on a single line
        for item in filtro:
            print(item, end=', ')
    else:
        # Print only a few samples
        if count > 2:
            print('\te.g.: {}, {}, {}.'.format(filtro[0], filtro[1], filtro[2]))
        elif count > 0:
            print('\te.g.: {}.'.format(filtro[0]))
    if returnList:
        return filtro
    else:
        return

In [3]:
def substitui_regex(dataset, regex, subst):
    '''Helper that replaces the matches of a regular expression (regex) in a text (str).
    Syntax: substitui_regex(dataset, regex, subst), where dataset is the string to be processed; regex is the regular expression to look for; subst is the replacement string. The function returns the new string.'''
    filtro = re.findall(regex, dataset)
    count = len(filtro)
    print('Found {} matches for the expression "{}".'.format(count, regex))
    newText = re.sub(regex, subst, dataset)
    return newText

## Getting and reading the data:
The data was obtained from ... 

After downloading the data, the first step, had I not known the data model, would be a simple `less` shell command to figure out what kind of data it contains. Since OpenStreetMap documents its data model, which tells us how the information is organized, we know that the information we are interested in is stored in elements called 'tag'. As a first check, let us count how many of each element we will have to process:

In [4]:
#%% Getting acquainted to the dataset
def count_tags(filename):
    tags = {}
    for event, element in ET.iterparse(filename):
        if element.tag not in tags:
            tags[element.tag] = 1
        else: 
            tags[element.tag] += 1
    return tags

tags = count_tags(FILENAME)
pprint.pprint(tags)

{'bounds': 1,
 'member': 15589,
 'meta': 1,
 'nd': 444388,
 'node': 380322,
 'note': 1,
 'osm': 1,
 'relation': 949,
 'tag': 93511,
 'way': 45320}


There are 93,511 `tag` elements to deal with in the next steps. We can now move forward and start auditing the data.
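
For larger extracts, the same count can be done while releasing each element once it has been processed. The block below is a minimal sketch, not the code used in this notebook: the function name `count_tags_streaming` and the tiny `SAMPLE_OSM` fragment are made up for illustration, and `elem.clear()` only reduces memory use rather than eliminating it.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags_streaming(source):
    """Count element names, clearing each element after it has been counted."""
    counts = Counter()
    # "end" events fire after an element and all of its children were parsed.
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop attributes and children that were already counted
    return counts

# Made-up OSM fragment, used only to illustrate the call.
SAMPLE_OSM = b"""<osm>
  <node id="1"><tag k="addr:city" v="Ijui"/></node>
  <node id="2"/>
  <way id="3"><nd ref="1"/><tag k="highway" v="residential"/></way>
</osm>"""

sample_counts = count_tags_streaming(io.BytesIO(SAMPLE_OSM))
print(dict(sample_counts))
```

`ET.iterparse` accepts either a file name or a file object, so the same function could be called with `FILENAME` directly.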

## Auditing data:  
Auditing questions arise while exploring the data or, when we have prior knowledge of the problem, we already have some issues in mind to investigate. Considering that similar analyses of OpenStreetMap data are available on the Internet [3,4], and also considering my previous knowledge of this region, I intend to audit the following issues:  
- Are the city names correct?
- Are the street names correct?
- Are there abbreviations?
- Are the postal codes consistent?  

Note that the data are first explored iteratively here. Although it is recommended to have one script for each field being audited, the whole process is done in this Jupyter notebook in order to give an overview of the cleaning process. At the end, the code will be transferred to a standalone Python script (.py) to make it easier to automate converting, cleaning and exporting the data to a MongoDB collection, for example.
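
The postal code question in the list above could be approached along the same lines as the other audits. The block below is only a sketch under stated assumptions: it assumes postal codes are stored under the `addr:postcode` key and that a valid Brazilian CEP has 8 digits, usually written `NNNNN-NNN`. The function name `audit_postcodes` and the sample values are hypothetical and do not come from the dataset.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumption: a CEP has 8 digits, optionally with a hyphen after the fifth.
CEP_RE = re.compile(r'^\d{5}-?\d{3}$')

def audit_postcodes(source):
    """Collect addr:postcode values that do not look like a CEP."""
    invalid = set()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == 'tag' and elem.attrib.get('k') == 'addr:postcode':
            value = elem.attrib.get('v', '').strip()
            if not CEP_RE.match(value):
                invalid.add(value)
    return invalid

# Made-up values, used only to exercise the pattern.
SAMPLE = b"""<osm>
  <node id="1"><tag k="addr:postcode" v="98801-010"/></node>
  <node id="2"><tag k="addr:postcode" v="98700000"/></node>
  <node id="3"><tag k="addr:postcode" v="98.801-010"/></node>
  <node id="4"><tag k="addr:postcode" v="9880"/></node>
</osm>"""

bad_postcodes = audit_postcodes(io.BytesIO(SAMPLE))
print(sorted(bad_postcodes))
```

Values flagged this way would still need a manual look before deciding on a cleaning rule.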

### Are the city names correct?
To answer this question, I first need to know where this information lives in the OpenStreetMap (OSM) data model, which is documented in [5]. According to the documentation, we are looking for the *addr:city* key.

In [5]:
#%% Finding the cities in the dataset
def list_cities(filename):
    cities = []
    for _, elem in ET.iterparse(filename):
        if elem.tag == 'tag':
            k = elem.attrib['k']
            v = elem.attrib['v'].lower()  #Lowering the uppercase text
            if k == 'addr:city':
                if v not in cities:
                    cities.append(v)
    print('There are {0} distinct cities in the dataset.'.format(len(cities)))
    return cities

cities = list_cities(FILENAME)
print(cities)

There are 18 distinct cities in the dataset.
['santa rosa', 'condor', 'ijuí', 'panambi', 'santo ângelo', 'três de maio', 'panambi - rs', 'santo cristo', 'eugênio de castro', 'santo augusto', 'santo angelo', 'cruz alta', 'vila sírio', 'cerro largo', 'são josé do mauá', 'são miguel das missões', 'horizontina', 'ijui']


#### (1) The same cities are recorded under distinct names due to hyphenation or accentuation
Even though I chose to compare lowercase text, some city names still appear in more than one form: written with and without accents, or hyphenated with the state abbreviation. A possible way to fix this is to map each variant to the correct name:

In [40]:
#%% Cleaning the city names:
expected_cities = ['santa rosa', 'condor', 'ijuí', 'panambi', 'santo ângelo', 'três de maio',
            'santo cristo', 'eugênio de castro', 'santo augusto', 'cruz alta', 'vila sírio', 
            'cerro largo', 'são josé do mauá', 'são miguel das missões', 'horizontina']

In [47]:
def audit_cities(expected_cities, filename):
    weird = []
    for _, elem in ET.iterparse(filename):
        if elem.tag == 'tag':
            k = elem.attrib['k']
            v = elem.attrib['v'].lower()  #Lowering the uppercase text
            if k == 'addr:city':
                if v not in expected_cities:
                    weird.append(v)
    weird = set(weird)
    print('There are {0} not expected cities in the dataset.'.format(len(weird)))
    return weird

In [48]:
audit_cities(expected_cities, FILENAME)

There are 3 not expected cities in the dataset.


{'ijui', 'panambi - rs', 'santo angelo'}

In [49]:
# After running the whole block of code, I could define the mapping:
mapping_cities = {'ijui': 'ijuí',
                  'santo angelo': 'santo ângelo',
                  'panambi - rs': 'panambi'
                 }
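
The mapping above was written by hand after inspecting the audit output. As a complementary check, accent-only variants can be detected automatically by comparing names with their accents stripped. This is a minimal sketch using the standard `unicodedata` module; the helper names `strip_accents` and `group_accent_variants` are hypothetical and not part of the notebook.

```python
import unicodedata
from collections import defaultdict

def strip_accents(text):
    """Return text without combining accent marks (e.g. 'ijuí' -> 'ijui')."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

def group_accent_variants(names):
    """Group names that differ only by accents; keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_accents(name)].add(name)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}

# A few values taken from the city list printed earlier.
sample_cities = ['ijuí', 'ijui', 'santo ângelo', 'santo angelo', 'panambi', 'panambi - rs']
variant_groups = group_accent_variants(sample_cities)
print(variant_groups)
```

This only catches accent differences; a variant such as 'panambi - rs' still has to be mapped manually.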

In [61]:
## When exporting the data, the city names must be corrected.
def update_city(name, mapping):
    for key in mapping:
        if key in name:
            return name.replace(key, mapping[key])
    return name
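
Note that `update_city` receives the value as stored in the file, while the audit compared lowercased values, so a value such as `Ijui` would not match the lowercase mapping keys. The block below sketches one possible variant, assuming the cleaned data should keep city names in lowercase as in the audit. The name `update_city_exact` is hypothetical, and the exact dictionary lookup is a design choice that avoids rewriting part of a longer, unrelated name.

```python
def update_city_exact(name, mapping):
    """Hypothetical variant: lowercase the value, then do an exact lookup."""
    normalized = name.strip().lower()
    # Fall back to the normalized name when no correction is registered.
    return mapping.get(normalized, normalized)

# Same corrections as in mapping_cities above.
sample_mapping = {'ijui': 'ijuí',
                  'santo angelo': 'santo ângelo',
                  'panambi - rs': 'panambi'}

for raw in ['Ijui', 'Santo Angelo', 'Panambi - RS', 'Cruz Alta']:
    print(raw, '=>', update_city_exact(raw, sample_mapping))
```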

### Are the street names correct? Are there any abbreviations?
We will now iterate over all the records to find wrong street names or abbreviations.

In [16]:
expected = ["Rua", "Avenida", "Praça", "Via", "Estrada", "Travessa", "Linha", "Alameda", "Largo", "Parque", "Rodovia"]

### IMPORTANT: In Brazilian addresses the street type comes at the beginning of the name:
street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)
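
To make the behavior of this pattern concrete, the short sketch below applies it to a few names. The helper `first_token` is hypothetical, and 'Rua Marechal Floriano' is an invented sample; the other names appear in the audit output of this notebook.

```python
import re

# Same pattern as above: the first whitespace-delimited token of the name.
street_type_re = re.compile(r'^\b\S+\.?', re.IGNORECASE)

def first_token(street_name):
    """Return the leading token that the audit treats as the street type."""
    match = street_type_re.search(street_name)
    return match.group() if match else None

for sample in ['Rua Marechal Floriano', 'Av. Santa Bárbara', 'BR-285', '14 de Julho']:
    print(repr(sample), '->', repr(first_token(sample)))
```

Because the token is simply the first word, names that lack a street type (such as '14 de Julho') are reported under their first word.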

In [17]:
def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

In [18]:
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

In [19]:
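def audit(osmfile):
    street_types = defaultdict(set)
    # Binary mode lets the XML parser handle the encoding; "with" closes the file.
    with open(osmfile, "rb") as osm_file:
        # "end" events guarantee that the child <tag> elements were already parsed.
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if is_street_name(tag):
                        audit_street_type(street_types, tag.attrib['v'])

    return street_types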

In [20]:
def test():
    st_types = audit(FILENAME)
    pprint.pprint(dict(st_types))

In [21]:
test()

{'14': {'14 de Julho'},
 '15': {'15 de Novembro'},
 'Av.': {'Av. Santa Bárbara', 'Av. Gustav Kuhlmann'},
 'BR': {'BR 285'},
 'BR-285': {'BR-285'},
 'BR-392': {'BR-392'},
 'BR158': {'BR158'},
 'Dom': {'Dom Pedro II'},
 'ERS-342': {'ERS-342'},
 'Getúlio': {'Getúlio Vargas'},
 'Padre': {'Padre Afonso Rodrigues'},
 'Paulo': {'Paulo Klemann'},
 'RS': {'RS 218'},
 'Santa': {'Santa Lucia'}}


From the unexpected street types found above, I will now write a mapping to clean the data:

In [32]:
# After running the whole block of code, I could define the mapping:
mapping = {'Av.': 'Avenida',
           'BR ': 'BR-',
           'BR158': 'BR-158',
           'ERS-': 'RS-',
           'RS ': 'RS-'
          }

In [33]:
def update_name(name, mapping):
    for key in mapping:
        if key in name:
            return name.replace(key, mapping[key])
    return name

In [34]:
def test():
    st_types = audit(FILENAME)
    pprint.pprint(dict(st_types))

    for st_type, ways in st_types.items():
        for name in ways:
            better_name = update_name(name, mapping)
            print(name, "=>", better_name)

In [35]:
test()

{'14': {'14 de Julho'},
 '15': {'15 de Novembro'},
 'Av.': {'Av. Santa Bárbara', 'Av. Gustav Kuhlmann'},
 'BR': {'BR 285'},
 'BR-285': {'BR-285'},
 'BR-392': {'BR-392'},
 'BR158': {'BR158'},
 'Dom': {'Dom Pedro II'},
 'ERS-342': {'ERS-342'},
 'Getúlio': {'Getúlio Vargas'},
 'Padre': {'Padre Afonso Rodrigues'},
 'Paulo': {'Paulo Klemann'},
 'RS': {'RS 218'},
 'Santa': {'Santa Lucia'}}
Getúlio Vargas => Getúlio Vargas
BR-285 => BR-285
Av. Santa Bárbara => Avenida Santa Bárbara
Av. Gustav Kuhlmann => Avenida Gustav Kuhlmann
BR 285 => BR-285
RS 218 => RS-218
15 de Novembro => 15 de Novembro
14 de Julho => 14 de Julho
BR-392 => BR-392
ERS-342 => RS-342
Padre Afonso Rodrigues => Padre Afonso Rodrigues
Dom Pedro II => Dom Pedro II
BR158 => BR-158
Santa Lucia => Santa Lucia
Paulo Klemann => Paulo Klemann
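
The substring test `key in name` in `update_name` would also rewrite a key that happens to appear in the middle of a name. A stricter alternative is to anchor every rule at the start of the name. The block below is a minimal sketch of that idea; `PREFIX_RULES` and `update_name_prefix` are hypothetical names, and the patterns were written only to reproduce the corrections listed above.

```python
import re

# Hypothetical prefix rules: each pattern is anchored at the start of the name.
PREFIX_RULES = [
    (re.compile(r'^Av\.\s*'), 'Avenida '),
    (re.compile(r'^E?RS[\s-]?(?=\d)'), 'RS-'),   # 'RS 218', 'ERS-342' -> 'RS-...'
    (re.compile(r'^BR[\s-]?(?=\d)'), 'BR-'),     # 'BR 285', 'BR158'   -> 'BR-...'
]

def update_name_prefix(name, rules=PREFIX_RULES):
    """Apply the first rule whose pattern matches the beginning of the name."""
    for pattern, replacement in rules:
        new_name, n_subs = pattern.subn(replacement, name, count=1)
        if n_subs:
            return new_name
    return name

for sample in ['Av. Santa Bárbara', 'BR 285', 'BR158', 'ERS-342', 'RS 218', 'Dom Pedro II']:
    print(sample, '=>', update_name_prefix(sample))
```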


## References
[1] https://en.wikipedia.org/wiki/Miss%C3%B5es  
[2] https://en.wikipedia.org/wiki/Spanish_missions_in_South_America  
[3] https://jasonicarter.github.io/openstreetmap-data-wrangling-mongodb/  
[4] https://eberlitz.github.io/2015/09/18/data-wrangle-openstreetmaps-data/  
[5] https://wiki.openstreetmap.org/wiki/Key:addr  

## BACKUP

A. Application issues:   
- Does the dataset contain more than the 46 current cities of the Missões region?
- Do I have information from the 6 cities that evolved from the ancient villages?

B. Data issues:  