### Step 1 - Define the relocator profile

---

### Step 2 - Create a Neighborhood Dataframe for the _Target Location_

For this project the defined target location is __São Paulo, Brazil__.

The city of São Paulo has 96 neighborhoods. The official city website publishes its division into regions and neighborhoods as an XLSX-style table embedded in an HTML page, which can be viewed at http://www.prefeitura.sp.gov.br/cidade/secretarias/upload/urbanismo/infocidade/htmls/3_regioes_prefeituras_regionais_e_distrito_2017_10895.html.

To create a dataframe for the neighborhoods of São Paulo, the page at the above URL is requested, and its content is parsed and used to build the __Neighborhoods Dataframe__.

First, import the required libraries:

1. __Credentials__: user-defined library that stores the API credentials
1. __Pandas__: manipulate dataframe objects
1. __Numpy__: manipulate arrays and matrices
1. __Requests__: send HTTP requests and receive the responses
1. __BeautifulSoup__: parse HTML content

In [1]:
import credentials
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

Send request to the above URL using _Requests_.

In [140]:
# define a variable for the url
url = 'http://www.prefeitura.sp.gov.br/cidade/secretarias/upload/urbanismo/infocidade/htmls/3_regioes_prefeituras_regionais_e_distrito_2017_10895.html'

# send a request to the URL and store the response
raw = requests.get(url)

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200


Parse the URL content using _BeautifulSoup_ with _html_ parser.

In [141]:
# parse the raw data
par = bs(raw.text, 'html.parser')
print('url content parsed.')

url content parsed.


Check the content returned from the URL, starting with how many tables it contains. The HTML tag _table_ is the reference for counting.

In [142]:
# print number of tables
print('{} table(s) found in the parsed URL.'.format(len(par.find_all('table'))))

1 table(s) found in the parsed URL.


Because the table comes from an XLSX-style export, all of its content is stored in a single table, including the description, headers, data, summary and footnotes. The data has to be extracted from this single table, so first get the table from the URL content, again using the HTML tag _table_ as the reference.

In [143]:
# get the table from the URL content
par_table = par.find_all('table')[0]

Check whether the table has any header cells and how many rows it contains. The HTML tags _th_ and _tr_ are used for counting, respectively.

In [144]:
# print the number of headers and rows
print('{} header(s) and {} row(s) found in the table.'.format(len(par_table.find_all('th')),
                                                              len(par_table.find_all('tr'))))

0 header(s) and 110 row(s) found in the table.


Of the 110 rows, 96 should hold the required neighborhood data; the remaining rows should be titles, summary and footnotes.

Extract the data from the table into a list and check where the required data is located. A nested loop does this: the outer level runs through the rows (HTML tag _tr_) and the inner level runs through the columns of each row (HTML tag _td_).

In [145]:
# create an empty list for the entire table
tabletmp = []

# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # create an empty list for the current row
    celltmp = []

    # run a loop by column for the current row [tag 'td']
    for j, column in enumerate(row.find_all('td')):
        # append the text of current cell to the list
        celltmp.append(column.get_text())

    # append current line to the list
    tabletmp.append(celltmp)
        
# inform the number of rows loaded
print('{} rows loaded.'.format(len(tabletmp)))

110 rows loaded.
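
To illustrate how a grouped table can yield rows of different lengths, here is a stand-alone sketch of the same row-by-row extraction. It deliberately swaps in Python's built-in `html.parser` for _BeautifulSoup_ so that it runs without extra dependencies, and it works on a made-up HTML fragment. The `rowspan` attribute in that fragment is an assumption about one common cause of ragged rows, not something confirmed for the page used here; `TableRows` and `SAMPLE_HTML` are hypothetical names.

```python
from html.parser import HTMLParser

# Made-up fragment: the rowspan cell appears only in the first row of its group
SAMPLE_HTML = """
<table>
  <tr><td rowspan="2">Centro</td><td>Bela Vista</td><td>271,77</td></tr>
  <tr><td>Bom Retiro</td><td>420,54</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read
        self._cell = None  # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # keep text only while inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(SAMPLE_HTML)
sample_rows = parser.rows
print(sample_rows)
```

The second row comes back one cell shorter than the first, which is the same kind of raggedness seen in the list extracted from the real table.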


All 110 rows have been loaded. Check the head and tail of the list to define the range of required data.

In [146]:
# print the first 9 rows
print('Head 9 rows')
tabletmp[:9]

Head 9 rows


[['Regiões, Prefeituras Regionais e Distritos Municipais',
  '\xa0',
  '',
  '',
  '',
  '',
  ''],
 ['Município de São Paulo', '\xa0', '', '', '', ''],
 ['2017', '\xa0', '\xa0', '', '', '', ''],
 ['', '', '', '', ''],
 ['Regiões',
  'Prefeituras\r\n  Regionais',
  'Distritos',
  'Área (ha)',
  'Área (km²)',
  '',
  '',
  ''],
 ['Centro', 'Sé', 'Bela Vista', '271,77', '2,72', '', '', ''],
 ['Bom Retiro', '420,54', '4,21', '', '', ''],
 ['Cambuci', '392,42', '3,92', '', '', ''],
 ['Consolação', '381,51', '3,82', '', '', '']]

The first 4 rows are the table titles and can be ignored; the 5<sup>th</sup> row contains the column headers.

Check the tail of the table.

In [147]:
# print the 10 rows that precede the final row (the slice excludes the last one)
print('Tail 10 rows')
tabletmp[-11:-1]

Tail 10 rows


[['Saúde', '931,12', '9,31', '', '', '', ''],
 ['Vila Mariana', '859,56', '8,60', '', '', '', ''],
 ['Município de São\r\n  Paulo', '', '152.753,58', '1.527,54', '', '', '', ''],
 ['', '', '', '', '', '', '', ''],
 ['Fonte:\r\n  Prefeitura do Município de São Paulo. /\xa0\r\n  Instituto\xa0 Geográfico e\r\n  Cartográfico\xa0 do Estado de São Paulo.',
  '\xa0',
  '',
  '',
  '\xa0'],
 ['Elaboração:\r\n  SMUL/Deinfo', '\xa0', '\xa0', '', '', '\xa0'],
 ['Nota: Distritos Lei\r\n  nº 11.220/1992', '', '', ''],
 ['Subprefeituras\r\n  Lei nº 13.399/2002, alterada pelas Leis nº 13.682/2003 e nº 15.764/2013',
  '',
  '',
  ''],
 ['Base\r\n  de cálculo das áreas: Mapa Digital da Cidade (MDC) - UTM/SAD69-96.',
  '',
  '',
  ''],
 ['', '', '', '', '', '', '', '']]

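The rows after _Vila Mariana_ are the column totals, blank rows and footnotes, and they will also be ignored. The slice above stops one row short of the end, so 9 rows in total follow the data.

Check the column headers, row 5 (index 4).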

In [128]:
# check the columns headers
tabletmp[4]

['Regiões',
 'Prefeituras\r\n  Regionais',
 'Distritos',
 'Área (ha)',
 'Área (km²)',
 '',
 '',
 '']
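
The header cells above still carry line breaks, indentation and non-breaking spaces (`\r\n`, `\xa0`) from the page source. If clean labels were needed, a whitespace normalisation along these lines could be applied to each cell; `clean_cell` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
def clean_cell(text):
    """Collapse any run of whitespace (including \\r\\n and \\xa0) into one space."""
    # str.split() with no argument splits on all Unicode whitespace
    # and discards leading and trailing whitespace
    return ' '.join(text.split())


raw_header = ['Regiões', 'Prefeituras\r\n  Regionais', 'Distritos', '\xa0', '']
clean_header = [clean_cell(cell) for cell in raw_header]
print(clean_header)
```

Cells that contain only whitespace become empty strings, so they can be filtered with the same `!= ''` test used for the empty columns.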

There are 8 columns, but the last three are empty. For the five required ones, English headers are defined for the __Neighborhood Dataframe__ as follows:

* __region__ (Regiões): the first column is the city region to which the neighborhood belongs
* __region_hall__ (Prefeituras Regionais): the second column is the regional hall to which the neighborhood belongs
* __neighborhood__ (Distritos): the third column is the neighborhood name
* __area_ha__ (Área (ha)): the fourth column is the neighborhood land area in hectares
* __area_sqkm__ (Área (km<sup>2</sup>)): the fifth column is the neighborhood land area in square kilometers

In [129]:
# define the columns names
column_names = ['region','region_hall','neighborhood','area_ha','area_sqkm']

With 4 title rows and 1 header row, the first 5 rows at the top are ignored. At the bottom, 9 rows are ignored: 1 summary row and 8 blank or footnote rows. The neighborhood data ranges from row 6 (index 5) to row 101 (index 100), giving the 96 neighborhoods of São Paulo.

Extract only the data rows from the table and count the number of rows left.

In [148]:
tabletmp = tabletmp[5:101]
print('{} rows left.'.format(len(tabletmp)))

96 rows left.
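
The slice bounds can be double-checked with a little arithmetic. The sketch below uses only the counts reported in the outputs above (110 rows in total, 5 leading rows, 96 neighborhoods) and a stand-in list instead of the real table; the variable names are illustrative.

```python
total_rows = 110  # rows found in the HTML table
top_skip = 5      # 4 title rows + 1 header row
data_rows = 96    # neighborhoods expected

start = top_skip                 # index of the first data row
stop = top_skip + data_rows      # slice end (exclusive)
bottom_skip = total_rows - stop  # rows remaining after the data

# stand-in list with the same length as the real table
dummy = list(range(total_rows))
selected = dummy[start:stop]

print(start, stop, bottom_skip, len(selected))
```

With these counts the slice is `[5:101]`, and 9 rows remain after the data.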


Before extracting the data, check the first 9 rows to see how the data is stored.

In [149]:
# check the first 9 rows
tabletmp[:9]

[['Centro', 'Sé', 'Bela Vista', '271,77', '2,72', '', '', ''],
 ['Bom Retiro', '420,54', '4,21', '', '', ''],
 ['Cambuci', '392,42', '3,92', '', '', ''],
 ['Consolação', '381,51', '3,82', '', '', ''],
 ['Liberdade', '365,07', '3,65', '', '', ''],
 ['República', '239,67', '2,40', '', '', ''],
 ['Santa Cecília', '375,92', '3,76', '', '', ''],
 ['Sé', '219,36', '2,19', '', '', ''],
 ['Leste',
  'Aricanduva/Formosa/Carrão',
  'Aricanduva',
  '695,83',
  '6,96',
  '',
  '',
  '']]

The table is structured in a grouped format, which is common in spreadsheet tables because it reads well. A region name, e.g. __Centro__, appears only once, in the row of its first regional hall (__Sé__), and likewise the regional hall __Sé__ appears only in the row of its first neighborhood, __Bela Vista__. As a result, there are rows with 5, 4 and 3 filled columns.

Note also that the columns are filled from left to right: in some rows the first column contains the region name, in others the regional hall, and in many the neighborhood, so the same field does not sit at the same position in every row. The data therefore has to be read from right to left, and for rows missing the region and/or regional hall, the value from the group's parent row has to be reused.

Finally, the last 3 columns are empty; they will be ignored.

Extract the data to a temporary list, taking all the remarks above into consideration. This will be achieved by a nested loop, the first level runs through the rows (which actually are lists), and second level runs through the columns (actually items from the list).

In [150]:
# create an empty list to temporarily store the data
listtmp = []

# run a loop through the rows
# [actually it is a list object filled with lists]
for i, row in enumerate(tabletmp):

    # run a loop through the columns for the current row
    # [actually items of each list inside the big list object]
    for j in range(len(row)):
        
        # read the data from right [index -1] to left [index 0]
        # skip the last 3 empty columns [actually items]
        if row[len(row)-j-1] != '':
            # store the number of features the current row contains (3, 4 or 5) and break the inner loop
            nfeatures = (len(row)-j)
            break

    # check the number of features
    if nfeatures == 5:
        # five features means complete row with region and regional hall, store them in variables
        vregion = row[0]
        vrghall = row[1]

    elif nfeatures == 4:
        # four features means region missing, but with regional hall, store it in a variable
        vrghall = row[0]

    # three features means region and regional hall missing, the variables above will be used

    # append the current row to the temporary list
    listtmp.append([vregion, vrghall, row[nfeatures-3], row[nfeatures-2], row[nfeatures-1]])
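
The same right-to-left, carry-forward logic can also be sketched as a small reusable function, which makes it easy to test on a few rows. This is only an illustrative variant: `fill_grouped_rows` is a hypothetical name, and the sample rows use placeholder values rather than real data from the table.

```python
def fill_grouped_rows(rows):
    """Return rows as [region, hall, neighborhood, area_ha, area_sqkm].

    Trailing empty cells are dropped, and a missing region or regional
    hall is filled with the last value seen in a previous row.
    """
    filled = []
    region = hall = None
    for row in rows:
        cells = list(row)
        # drop the empty cells at the right end of the row
        while cells and cells[-1] == '':
            cells.pop()

        if len(cells) == 5:    # complete row: region and regional hall present
            region, hall = cells[0], cells[1]
        elif len(cells) == 4:  # region missing, regional hall present
            hall = cells[0]
        elif len(cells) != 3:  # anything else is not a data row
            raise ValueError('unexpected row: {!r}'.format(row))

        # the last three filled cells are always neighborhood and the two areas
        filled.append([region, hall] + cells[-3:])
    return filled


# placeholder rows shaped like the ones in the table
sample = [
    ['Region A', 'Hall A1', 'District 1', '10,00', '0,10', '', '', ''],
    ['District 2', '20,00', '0,20', '', '', ''],
    ['Hall A2', 'District 3', '30,00', '0,30', '', '', ''],
    ['Region B', 'Hall B1', 'District 4', '40,00', '0,40', '', '', ''],
]
result = fill_grouped_rows(sample)
for line in result:
    print(line)
```

Raising an error for rows with an unexpected number of filled cells is a design choice of this sketch; it surfaces title or footnote rows that slipped into the slice instead of silently mis-assigning columns.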

Check the resulting list, head and tail.

In [151]:
# check the first 10 rows
listtmp[:10]

[['Centro', 'Sé', 'Bela Vista', '271,77', '2,72'],
 ['Centro', 'Sé', 'Bom Retiro', '420,54', '4,21'],
 ['Centro', 'Sé', 'Cambuci', '392,42', '3,92'],
 ['Centro', 'Sé', 'Consolação', '381,51', '3,82'],
 ['Centro', 'Sé', 'Liberdade', '365,07', '3,65'],
 ['Centro', 'Sé', 'República', '239,67', '2,40'],
 ['Centro', 'Sé', 'Santa Cecília', '375,92', '3,76'],
 ['Centro', 'Sé', 'Sé', '219,36', '2,19'],
 ['Leste', 'Aricanduva/Formosa/Carrão', 'Aricanduva', '695,83', '6,96'],
 ['Leste', 'Aricanduva/Formosa/Carrão', 'Carrão', '790,12', '7,90']]

In [152]:
# check the last 11 rows
listtmp[-11:]

[['Sul', 'Jabaquara', 'Jabaquara', '1.401,09', '14,01'],
 ['Sul', "M'Boi Mirim", 'Jardim Ângela', '3.741,13', '37,41'],
 ['Sul', "M'Boi Mirim", 'Jardim São Luís', '2.604,72', '26,05'],
 ['Sul', 'Parelheiros', 'Marsilac', '20.818,52', '208,19'],
 ['Sul', 'Parelheiros', 'Parelheiros', '15.260,75', '152,61'],
 ['Sul', 'Santo Amaro', 'Campo Belo', '876,98', '8,77'],
 ['Sul', 'Santo Amaro', 'Campo Grande', '1.295,08', '12,95'],
 ['Sul', 'Santo Amaro', 'Santo Amaro', '1.603,53', '16,04'],
 ['Sul', 'Vila Mariana', 'Moema', '907,87', '9,08'],
 ['Sul', 'Vila Mariana', 'Saúde', '931,12', '9,31'],
 ['Sul', 'Vila Mariana', 'Vila Mariana', '859,56', '8,60']]

The list is ready to be stored in a dataframe: the column names are defined and the data is cleaned and organized. Create the __Neighborhoods Dataframe__ using _Pandas_.

In [153]:
# create the dataframe
df_neighborhoods = pd.DataFrame(data=listtmp, columns=column_names)

Check the dataframe head, tail and shape.

In [154]:
# check the head, tail and shape
df_neighborhoods

| | region | region_hall | neighborhood | area_ha | area_sqkm |
|---|---|---|---|---|---|
| 0 | Centro | Sé | Bela Vista | 271,77 | 2,72 |
| 1 | Centro | Sé | Bom Retiro | 420,54 | 4,21 |
| 2 | Centro | Sé | Cambuci | 392,42 | 3,92 |
| 3 | Centro | Sé | Consolação | 381,51 | 3,82 |
| 4 | Centro | Sé | Liberdade | 365,07 | 3,65 |
| ... | ... | ... | ... | ... | ... |
| 91 | Sul | Santo Amaro | Campo Grande | 1.295,08 | 12,95 |
| 92 | Sul | Santo Amaro | Santo Amaro | 1.603,53 | 16,04 |
| 93 | Sul | Vila Mariana | Moema | 907,87 | 9,08 |
| 94 | Sul | Vila Mariana | Saúde | 931,12 | 9,31 |
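
The two area columns are still text in Brazilian number format, with a period as the thousands separator and a comma as the decimal mark (for example `'1.401,09'`). Should numeric values be needed later, a conversion along these lines could be used; `parse_br_number` is a hypothetical helper, and the sample strings are taken from the list output above.

```python
def parse_br_number(text):
    """Convert a Brazilian-formatted number such as '1.401,09' to a float."""
    # remove the thousands separator first, then turn the decimal comma into a point
    return float(text.replace('.', '').replace(',', '.'))


samples = ['271,77', '1.401,09', '20.818,52']
values = [parse_br_number(s) for s in samples]
print(values)
```

In the dataframe, this could be applied per column with, for instance, `df_neighborhoods['area_ha'].map(parse_br_number)`.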


---