# Capstone Project
# Relocation Support
### Sidclay da Silva
### June 2020
---

### Introduction

This notebook contains the Capstone Project, the Week 5 peer-graded assignment for the Coursera course IBM Applied Data Science Capstone. The assignment requires developing a solution for the problem stated in the Project Proposal, and the established condition is that the solution must make use of the __Foursquare API__.

The problem proposed for this project is to support people who are relocating, by ranking the neighborhoods of the target location according to the relocator's profile.

---

### Step 1 - Define the _Relocator Profile_ and _Target Location_

For this project the __Relocator Profile__ is arbitrarily defined to simulate a family relocating to the __Target Location__, also arbitrarily defined as __São Paulo, Brazil__.

__Family components__;
* 2 adults
* 2 kids of primary school age
* 1 dog as pet

__Family priorities__;
* Primary school for the kids
* Outdoor park to visit with the kids, walk the dog and go jogging
* Supermarket for the daily life
* Pharmacy in case of emergencies, especially with kids
* Subway/metro station to avoid traffic

__Housing wishes__;
* Apartment
* 2 or 3 bedrooms
* 80 m<sup>2</sup> approximately
* 1 garage spot

__Rental budget__;
* BRL 2,000.00 monthly

Based on the Relocator Profile, define the objects used to rank the neighborhoods. The family priorities are translated into a list object containing venue categories as named by the Foursquare API, and the housing wishes and rental budget are converted into a variable containing the rental budget per square meter, in BRL/m<sup>2</sup>.

In [168]:
# create a priorities list according to the Foursquare API categories
prio_list = ['Elementary School',
             'Park',
             'Supermarket',
             'Pharmacy',
             'Metro Station']
target_area = 80
target_budget = 2000

budget_sqm = target_budget / target_area

print('Family priorities are:', prio_list, '\n')
print('Monthly rental budget is BRL {:,.2f} for a {} sqm apartment = BRL {:,.2f}/sqm/month'. \
      format(target_budget, target_area, budget_sqm))

Family priorities are: ['Elementary School', 'Park', 'Supermarket', 'Pharmacy', 'Metro Station'] 

Monthly rental budget is BRL 2,000.00 for a 80 sqm apartment = BRL 25.00/sqm/month


---

### Step 2 - Create a Neighborhood Dataframe for the _Target Location_

The city of São Paulo has 5 regions and 96 neighborhoods. The official city website publishes this division as a table exported from XLSX format into an HTML page.
The current table is from 2017 and can be viewed at the following link;

[Prefeitura de São Paulo (SP) Regiões, Prefeituras Regionais e Distritos](http://www.prefeitura.sp.gov.br/cidade/secretarias/upload/urbanismo/infocidade/htmls/3_regioes_prefeituras_regionais_e_distrito_2017_10895.html)

To create a dataframe for the neighborhoods of São Paulo, a request is sent to the above URL, and its content is parsed, cleaned and organized before being stored in the __Neighborhoods Dataframe__.

First, import the required libraries;

1. __Pandas__: manipulate dataframe objects
1. __Requests__: send HTTP requests and receive the responses
1. __BeautifulSoup__: parse HTML content

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

Send request to the above URL using _Requests_.

In [2]:
# define a variable for the url
url = 'http://www.prefeitura.sp.gov.br/cidade/secretarias/upload/urbanismo/infocidade/htmls/3_regioes_prefeituras_regionais_e_distrito_2017_10895.html'

# send a request to the URL and store the response
raw = requests.get(url)

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200
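
A note on the check above: a _Requests_ response is truthy whenever its status code is below 400, so `if raw:` accepts more than status 200. The sketch below illustrates this offline with hand-built `Response` objects, so no network call is made. The helper name `load_status` is hypothetical and not part of the notebook.

```python
import requests


def load_status(response):
    """Return a short message describing whether the request succeeded."""
    # response.ok is the same check that `if response:` performs:
    # it is True for any status code below 400
    if response.ok:
        return 'Data loaded, status {}'.format(response.status_code)
    return 'Error loading data {}'.format(response.status_code)


# hand-built responses for illustration only (assumption: no real request is sent)
ok_resp = requests.Response()
ok_resp.status_code = 200

bad_resp = requests.Response()
bad_resp.status_code = 404

print(load_status(ok_resp))
print(load_status(bad_resp))
```

If only status 200 should count as success, comparing `response.status_code == 200` explicitly would be the stricter choice.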


Parse the URL content using _BeautifulSoup_ with _html_ parser.

In [3]:
# parse the raw data
par = bs(raw.text, 'html.parser')
print('URL content parsed.')

URL content parsed.


Check the content returned from the URL, starting with how many tables it contains. The HTML tag _table_ is the reference for counting.

In [4]:
# print number of tables
print('{} table(s) found in the parsed URL.'.format(len(par.find_all('table'))))

1 table(s) found in the parsed URL.


Because the page is an export of an XLSX sheet, all of its content sits in a single table, including the table description, headers, data, summary and footnotes. The data must be extracted from this single table, so first get the table from the URL content, again using the HTML tag _table_ as the reference.

In [5]:
# get the table from the URL content
par_table = par.find_all('table')[0]

Check if there is any table header and how many rows the table contains. HTML tags _th_ and _tr_ will be respectively used for counting.

In [6]:
# print the number of headers and rows
print('{} header(s) and {} row(s) found in the table.'.format(len(par_table.find_all('th')),
                                                              len(par_table.find_all('tr'))))

0 header(s) and 110 row(s) found in the table.


Of the 110 rows, 96 should hold the required neighborhood data; the remaining rows should be titles, the header, the summary and footnotes.

Extract the data from the table to a list and check where the required data is located. This is done by running a nested loop through the rows, on the first level, and the columns, on the second level. The HTML tag _tr_ is used to extract the rows and _td_ to extract the columns.

In [7]:
# create an empty list for the entire table
tabletmp = []

# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # create an empty list for the current row
    celltmp = []

    # run a loop by column for the current row [tag 'td']
    for j, column in enumerate(row.find_all('td')):
        # append the text of current cell to the list
        celltmp.append(column.get_text())

    # append current line to the list
    tabletmp.append(celltmp)
        
# inform the number of rows loaded
print('{} rows loaded.'.format(len(tabletmp)))

110 rows loaded.


All the 110 rows have been loaded. Check the head and tail of the list to define the range of required data.

In [8]:
# print the first 10 rows
print('Head 10 rows')
tabletmp[:10]

Head 10 rows


[['Regiões, Prefeituras Regionais e Distritos Municipais',
  '\xa0',
  '',
  '',
  '',
  '',
  ''],
 ['Município de São Paulo', '\xa0', '', '', '', ''],
 ['2017', '\xa0', '\xa0', '', '', '', ''],
 ['', '', '', '', ''],
 ['Regiões',
  'Prefeituras\r\n  Regionais',
  'Distritos',
  'Área (ha)',
  'Área (km²)',
  '',
  '',
  ''],
 ['Centro', 'Sé', 'Bela Vista', '271,77', '2,72', '', '', ''],
 ['Bom Retiro', '420,54', '4,21', '', '', ''],
 ['Cambuci', '392,42', '3,92', '', '', ''],
 ['Consolação', '381,51', '3,82', '', '', ''],
 ['Liberdade', '365,07', '3,65', '', '', '']]

The first 4 rows are the table titles and can be ignored; the 5<sup>th</sup> row contains the column headers.

Check the tail of the table.

In [9]:
#print the last 10 rows
print('Tail 10 rows')
tabletmp[-10:]

Tail 10 rows


[['Vila Mariana', '859,56', '8,60', '', '', '', ''],
 ['Município de São\r\n  Paulo', '', '152.753,58', '1.527,54', '', '', '', ''],
 ['', '', '', '', '', '', '', ''],
 ['Fonte:\r\n  Prefeitura do Município de São Paulo. /\xa0\r\n  Instituto\xa0 Geográfico e\r\n  Cartográfico\xa0 do Estado de São Paulo.',
  '\xa0',
  '',
  '',
  '\xa0'],
 ['Elaboração:\r\n  SMUL/Deinfo', '\xa0', '\xa0', '', '', '\xa0'],
 ['Nota: Distritos Lei\r\n  nº 11.220/1992', '', '', ''],
 ['Subprefeituras\r\n  Lei nº 13.399/2002, alterada pelas Leis nº 13.682/2003 e nº 15.764/2013',
  '',
  '',
  ''],
 ['Base\r\n  de cálculo das áreas: Mapa Digital da Cidade (MDC) - UTM/SAD69-96.',
  '',
  '',
  ''],
 ['', '', '', '', '', '', '', ''],
 ['', '', '', '', '', '', '', '', '']]

The last 9 rows hold the column summary, the footnotes and empty padding rows, so they will also be ignored.

Check the column headers in row 5 (index 4).

In [10]:
# check the columns headers
tabletmp[4]

['Regiões',
 'Prefeituras\r\n  Regionais',
 'Distritos',
 'Área (ha)',
 'Área (km²)',
 '',
 '',
 '']

There are 8 columns, but the last three are empty. For the five required ones, English headers will be defined for the __Neighborhood Dataframe__ as follows:

* __region__ (Regiões), the first column is the city region to which the neighborhood belongs
* __region_hall__ (Prefeituras Regionais), the second column is the location of the Regional Hall
* __neighborhood__ (Distritos), the third column is the Neighborhood name
* __area_ha__ (Área (ha)), the fourth column is the neighborhood land area in hectares
* __area_sqkm__ (Área (km<sup>2</sup>)), the fifth column is the neighborhood land area in square kilometers

In [11]:
# define the columns names
column_names = ['region','region_hall','neighborhood','area_ha','area_sqkm']

With 4 title rows and 1 header row, the first 5 rows at the top will be ignored. At the bottom, 9 rows will be ignored: 1 summary row and 8 footnote or empty rows. The neighborhood data ranges from row 6 (index 5) to row 101 (index 100), for a total of 96 neighborhoods in São Paulo.

Extract only the data rows from the table and count the number of rows left.

In [12]:
tabletmp = tabletmp[5:101]
print('{} rows left.'.format(len(tabletmp)))

96 rows left.
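
The slice above relies on row positions found by inspection. As a hedged alternative, the data rows could be recognized by their content instead: after dropping the empty padding cells on the right, a data row ends with a name followed by two Brazilian-formatted numbers. The function name `is_data_row` and the pattern `BR_NUMBER` are hypothetical; the sample rows are copied from the outputs above.

```python
import re

# Brazilian-formatted number: '.' between thousands groups, ',' before decimals
BR_NUMBER = re.compile(r'^\d{1,3}(\.\d{3})*,\d+$')


def is_data_row(row):
    """Return True when a scraped row looks like a neighborhood data row."""
    # strip whitespace (including non-breaking spaces) and drop trailing empty cells
    cells = [cell.strip() for cell in row]
    while cells and cells[-1] == '':
        cells.pop()
    # a data row ends with: neighborhood name, area in ha, area in square km
    if len(cells) < 3 or cells[-3] == '':
        return False
    return all(BR_NUMBER.match(cell) for cell in cells[-2:])


sample_rows = [
    ['2017', '\xa0', '\xa0', '', '', '', ''],
    ['Regiões', 'Prefeituras\r\n  Regionais', 'Distritos', 'Área (ha)', 'Área (km²)', '', '', ''],
    ['Centro', 'Sé', 'Bela Vista', '271,77', '2,72', '', '', ''],
    ['Bom Retiro', '420,54', '4,21', '', '', ''],
    ['Município de São\r\n  Paulo', '', '152.753,58', '1.527,54', '', '', '', ''],
    ['', '', '', '', '', '', '', ''],
]

data_rows = [row for row in sample_rows if is_data_row(row)]
print(len(data_rows))
```

The summary row is rejected because the cell before its two numbers is empty, which is what separates it from a neighborhood row in this table.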


Before extracting the data, check the head and tail to see how the data is stored.

In [13]:
# print the first 10 rows
print('Head 10 rows')
tabletmp[:10]

Head 10 rows


[['Centro', 'Sé', 'Bela Vista', '271,77', '2,72', '', '', ''],
 ['Bom Retiro', '420,54', '4,21', '', '', ''],
 ['Cambuci', '392,42', '3,92', '', '', ''],
 ['Consolação', '381,51', '3,82', '', '', ''],
 ['Liberdade', '365,07', '3,65', '', '', ''],
 ['República', '239,67', '2,40', '', '', ''],
 ['Santa Cecília', '375,92', '3,76', '', '', ''],
 ['Sé', '219,36', '2,19', '', '', ''],
 ['Leste',
  'Aricanduva/Formosa/Carrão',
  'Aricanduva',
  '695,83',
  '6,96',
  '',
  '',
  ''],
 ['Carrão', '790,12', '7,90', '', '', '']]

Check the tail.

In [14]:
#print the last 10 rows
print('Tail 10 rows')
tabletmp[-10:]

Tail 10 rows


[["M'Boi Mirim", 'Jardim Ângela', '3.741,13', '37,41', '', '', ''],
 ['Jardim São Luís', '2.604,72', '26,05', '', '', ''],
 ['Parelheiros', 'Marsilac', '20.818,52', '208,19', '', '', ''],
 ['Parelheiros', '15.260,75', '152,61', '', '', ''],
 ['Santo Amaro', 'Campo Belo', '876,98', '8,77', '', '', ''],
 ['Campo Grande', '1.295,08', '12,95', '', '', ''],
 ['Santo Amaro', '1.603,53', '16,04', '', '', '', ''],
 ['Vila Mariana', 'Moema', '907,87', '9,08', '', '', '', ''],
 ['Saúde', '931,12', '9,31', '', '', '', ''],
 ['Vila Mariana', '859,56', '8,60', '', '', '', '']]

The table is structured in a grouped layout, which is common in XLSX tables because it reads well on screen. A region name, e.g. __Centro__, appears only once, in the row of its first regional hall __Sé__; likewise the regional hall __Sé__ appears only in the row of its first neighborhood __Bela Vista__. As a result there are rows with 5, 4 and 3 filled columns, which makes things interesting.

Something else to notice is that the columns are filled from left to right: in some rows the first column contains the region name, in some the regional hall, and in many the neighborhood, so the same position does not hold the same feature in every row. To extract the data, each row should be read from right to left, and for rows missing the region and/or the regional hall, the value from the group parent should be used, which is even more interesting.

Finally, the last columns are empty, 3 of them in some rows and 4 in others; they will be ignored.

Extract the data to a temporary list, taking all the remarks above into consideration. This is achieved with a nested loop: the first level runs through the rows and the second level runs through the columns. Strictly speaking, the source table is one list object that contains one list per table row, and each element of a row list is a cell of the table.

In [15]:
# create an empty list to temporarily store the data
listtmp = []

# run a loop through the rows
# [actually it is a list object filled with lists]
for i, row in enumerate(tabletmp):

    # run a loop through the columns for the current row
    # [actually items of each list inside the big list object]
    for j in range(len(row)):
        
        # read the data from right [index -1] to left [index 0]
        # skip the trailing empty columns [actually items]
        if row[len(row)-j-1] != '':
            # store the number of features the current row contains (3,4,5) and break the inner loop
            nfeatures = (len(row)-j)
            break

    # check the number of features
    if nfeatures == 5:
        # five features means complete row with region and regional hall, store them in variables
        vregion = row[0]
        vrghall = row[1]

    elif nfeatures == 4:
        # four features means region missing, but with regional hall, store it in a variable
        vrghall = row[0]

    # three features means region and regional hall missing, the variables above will be used

    # append the current row to the temporary list
    listtmp.append([vregion, vrghall, row[nfeatures-3], row[nfeatures-2], row[nfeatures-1]])

Check the resulting list, head and tail.

In [16]:
# check the first 10 rows
print('Head 10 rows')
listtmp[:10]

Head 10 rows


[['Centro', 'Sé', 'Bela Vista', '271,77', '2,72'],
 ['Centro', 'Sé', 'Bom Retiro', '420,54', '4,21'],
 ['Centro', 'Sé', 'Cambuci', '392,42', '3,92'],
 ['Centro', 'Sé', 'Consolação', '381,51', '3,82'],
 ['Centro', 'Sé', 'Liberdade', '365,07', '3,65'],
 ['Centro', 'Sé', 'República', '239,67', '2,40'],
 ['Centro', 'Sé', 'Santa Cecília', '375,92', '3,76'],
 ['Centro', 'Sé', 'Sé', '219,36', '2,19'],
 ['Leste', 'Aricanduva/Formosa/Carrão', 'Aricanduva', '695,83', '6,96'],
 ['Leste', 'Aricanduva/Formosa/Carrão', 'Carrão', '790,12', '7,90']]

In [17]:
#print the last 10 rows
print('Tail 10 rows')
listtmp[-10:]

Tail 10 rows


[['Sul', "M'Boi Mirim", 'Jardim Ângela', '3.741,13', '37,41'],
 ['Sul', "M'Boi Mirim", 'Jardim São Luís', '2.604,72', '26,05'],
 ['Sul', 'Parelheiros', 'Marsilac', '20.818,52', '208,19'],
 ['Sul', 'Parelheiros', 'Parelheiros', '15.260,75', '152,61'],
 ['Sul', 'Santo Amaro', 'Campo Belo', '876,98', '8,77'],
 ['Sul', 'Santo Amaro', 'Campo Grande', '1.295,08', '12,95'],
 ['Sul', 'Santo Amaro', 'Santo Amaro', '1.603,53', '16,04'],
 ['Sul', 'Vila Mariana', 'Moema', '907,87', '9,08'],
 ['Sul', 'Vila Mariana', 'Saúde', '931,12', '9,31'],
 ['Sul', 'Vila Mariana', 'Vila Mariana', '859,56', '8,60']]
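
The right-to-left reading and the carry-over of group parents can also be packaged as a small function. This is a minimal sketch under the same assumptions as the loop above (rows hold 3, 4 or 5 features followed by empty cells); the name `fill_grouped_rows` is hypothetical, and unlike the loop above it raises an error on any other layout instead of continuing.

```python
def fill_grouped_rows(rows):
    """Expand grouped rows so each one carries its region and regional hall."""
    filled = []
    region = hall = None  # last group parents seen so far
    for row in rows:
        # keep only the cells up to the last non-empty one
        cells = list(row)
        while cells and cells[-1].strip() == '':
            cells.pop()
        if len(cells) == 5:      # region, regional hall, neighborhood, two areas
            region, hall = cells[0], cells[1]
        elif len(cells) == 4:    # regional hall, neighborhood, two areas
            hall = cells[0]
        elif len(cells) != 3:    # anything else is not a data row
            raise ValueError('unexpected row layout: {!r}'.format(row))
        # the last three cells are always neighborhood, area_ha, area_sqkm
        filled.append([region, hall] + cells[-3:])
    return filled


sample = [
    ['Centro', 'Sé', 'Bela Vista', '271,77', '2,72', '', '', ''],
    ['Bom Retiro', '420,54', '4,21', '', '', ''],
    ['Leste', 'Aricanduva/Formosa/Carrão', 'Aricanduva', '695,83', '6,96', '', '', ''],
    ['Carrão', '790,12', '7,90', '', '', ''],
]

filled_sample = fill_grouped_rows(sample)
for row in filled_sample:
    print(row)
```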

The list is ready to be stored in a dataframe. Create the __Neighborhoods Dataframe__, using _Pandas_.

In [18]:
# create the dataframe
df_neighborhoods = pd.DataFrame(data=listtmp, columns=column_names)

Check the dataframe head, tail and shape.

In [19]:
# check the head, tail and shape
df_neighborhoods

Unnamed: 0,region,region_hall,neighborhood,area_ha,area_sqkm
0,Centro,Sé,Bela Vista,"271,77","2,72"
1,Centro,Sé,Bom Retiro,"420,54","4,21"
2,Centro,Sé,Cambuci,"392,42","3,92"
3,Centro,Sé,Consolação,"381,51","3,82"
4,Centro,Sé,Liberdade,"365,07","3,65"
...,...,...,...,...,...
91,Sul,Santo Amaro,Campo Grande,"1.295,08","12,95"
92,Sul,Santo Amaro,Santo Amaro,"1.603,53","16,04"
93,Sul,Vila Mariana,Moema,"907,87","9,08"
94,Sul,Vila Mariana,Saúde,"931,12","9,31"


Data in the __area_ha__ and __area_sqkm__ columns is not in a usable format: the decimal separator is a comma and the thousands separator is a point, following the Brazilian standard, so the values are stored as strings when they should be numbers, _float_ in this case.

Convert the area _string_ data to _float_ format. First remove the thousands separator and replace the decimal separator, both using the _Pandas apply_ method, and then convert _string_ to _float_ in the dataframe using the _Pandas astype_ method.

In [20]:
# remove thousand separator
df_neighborhoods['area_ha'] = df_neighborhoods['area_ha'].apply(lambda x : x.replace('.', ''))
df_neighborhoods['area_sqkm'] = df_neighborhoods['area_sqkm'].apply(lambda x : x.replace('.', ''))

# change decimal separator
df_neighborhoods['area_ha'] = df_neighborhoods['area_ha'].apply(lambda x : x.replace(',', '.'))
df_neighborhoods['area_sqkm'] = df_neighborhoods['area_sqkm'].apply(lambda x : x.replace(',', '.'))

# check the head
df_neighborhoods

Unnamed: 0,region,region_hall,neighborhood,area_ha,area_sqkm
0,Centro,Sé,Bela Vista,271.77,2.72
1,Centro,Sé,Bom Retiro,420.54,4.21
2,Centro,Sé,Cambuci,392.42,3.92
3,Centro,Sé,Consolação,381.51,3.82
4,Centro,Sé,Liberdade,365.07,3.65
...,...,...,...,...,...
91,Sul,Santo Amaro,Campo Grande,1295.08,12.95
92,Sul,Santo Amaro,Santo Amaro,1603.53,16.04
93,Sul,Vila Mariana,Moema,907.87,9.08
94,Sul,Vila Mariana,Saúde,931.12,9.31


Check the dataframe columns data types.

In [21]:
# print the columns data types
df_neighborhoods.dtypes

region          object
region_hall     object
neighborhood    object
area_ha         object
area_sqkm       object
dtype: object

Change the type of __area_ha__ and __area_sqkm__ columns from _object_ to _float_.

In [22]:
# convert the area columns to float
df_neighborhoods = df_neighborhoods.astype({'area_ha': 'float64', 'area_sqkm': 'float64'})

# print the columns data types
df_neighborhoods.dtypes

region           object
region_hall      object
neighborhood     object
area_ha         float64
area_sqkm       float64
dtype: object

In [23]:
# check results
df_neighborhoods

Unnamed: 0,region,region_hall,neighborhood,area_ha,area_sqkm
0,Centro,Sé,Bela Vista,271.77,2.72
1,Centro,Sé,Bom Retiro,420.54,4.21
2,Centro,Sé,Cambuci,392.42,3.92
3,Centro,Sé,Consolação,381.51,3.82
4,Centro,Sé,Liberdade,365.07,3.65
...,...,...,...,...,...
91,Sul,Santo Amaro,Campo Grande,1295.08,12.95
92,Sul,Santo Amaro,Santo Amaro,1603.53,16.04
93,Sul,Vila Mariana,Moema,907.87,9.08
94,Sul,Vila Mariana,Saúde,931.12,9.31
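
The separator replacement and the type conversion shown above can also be chained on the _Pandas_ string accessor. This is only a sketch of an equivalent route, not the method used in this notebook; the helper name `br_to_float` is hypothetical and the sample values are taken from the table above.

```python
import pandas as pd


def br_to_float(series):
    """Convert Brazilian-formatted number strings (1.295,08) to float64."""
    return (series.str.replace('.', '', regex=False)    # drop thousands separator
                  .str.replace(',', '.', regex=False)   # comma becomes decimal point
                  .astype('float64'))


sample = pd.Series(['271,77', '1.295,08', '20.818,52'])
converted = br_to_float(sample)
print(converted.tolist())
```

Passing `regex=False` matters here, since a bare `.` would otherwise be read as a regular expression that matches any character.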


---

### Step 3 - Create a Rental Prices Dataframe for the _Target Location_

Searching the internet turns up several real estate websites for São Paulo, but a list of rental prices by neighborhood is quite hard to find. There are also economic and financial websites that publish monthly analyses of price variation, but they do not provide a list of prices per neighborhood either.

The rental price list per neighborhood was found on the website of a real estate agent called __Blog SP Imóvel__ (www.spimovel.com.br), which provides real estate services all around the city. It hosts four additional websites, one for each of the following city regions;

* __Blog ZN Imóvel__ (www.znimovel.com.br) for region _Norte_
* __Blog ZS Imóvel__ (www.zsimovel.com.br) for region _Sul_
* __Blog ZL Imóvel__ (www.zlimovel.com.br) for region _Leste_
* __Blog ZO Imóvel__ (www.zoimovel.com.br) for region _Oeste_

To create the __Rental Prices Dataframe__, a request will be sent to each of the mentioned websites and their content parsed, cleaned and organized, to be combined and stored in a single dataframe.

The lists can be viewed at the following links;

[Blog ZN Imóvel list from 2020.02.26](https://www.znimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-norte-de-sao-paulo/2834/)

[Blog ZS Imóvel list from 2020.03.06](https://www.zsimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-sul-de-sao-paulo/2874/)

[Blog ZL Imóvel list from 2020.02.28](https://www.zlimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-leste-de-sao-paulo/2824/)

[Blog ZO Imóvel list from 2020.02.28](https://www.zoimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-oeste-de-sao-paulo/2851/)


Unfortunately, they do not list prices for _Centro_. To fill this gap, a monthly survey per region by __O Sindicato da Habitação de São Paulo (SECOVI-SP)__ (https://secovi.com.br), a housing union, will be used. The survey is available for download in PDF format. To keep the prices comparable, the survey results from February 2020 will be used.

The mentioned survey results can be viewed at the following link;

[SECOVI-SP Pesquisa Mensal de Valores de Locação Residencial Fevereiro 2020](https://secovi.com.br/downloads/pesquisas-e-indices/pml/2020/arquivos/locacao-2020_02_versao-1.pdf)

Import the additional required libraries
1. __Numpy__: handle array and matrix calculations.

In [24]:
import numpy as np

__Region__ ___Norte___ - Send request to the URL using _Requests_.

In [25]:
# define a variable for the url
url = 'https://www.znimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-norte-de-sao-paulo/2834/'

# send a request to the URL and store the response
raw = requests.get(url)

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200


Parse the URL content using _BeautifulSoup_ with _html_ parser.

In [26]:
# parse the raw data
par = bs(raw.text, 'html.parser')
print('URL content parsed.')

URL content parsed.


Check the content returned from the URL, starting with how many tables it contains. The HTML tag _table_ is the reference for counting.

In [27]:
# print number of tables
print('{} table(s) found in the parsed URL.'.format(len(par.find_all('table'))))

3 table(s) found in the parsed URL.


Check the title of each table to know which of them is the relevant one to be used. The HTML tag _table_ is used to select the tables, and the tag _tr_ to select the first row from each table.

In [28]:
# run a loop through the tables [tag table]
for i, title in enumerate(par.find_all('table')):
    # print the first row [tag tr]
    print('Title of Table', i)
    print(title.find_all('tr')[0].get_text())

Title of Table 0

Valor médio do metro quadrado do Aluguel
			Apartamentos 1, 2 e 3 dormitórios 
1 Vaga de Garagem
			Zona Norte - São Paulo

Title of Table 1

Valor médio do metro quadrado do Aluguel
			Apartamentos 2 e 3 dormitórios 
2 Vagas de Garagem
			Zona Norte - São Paulo

Title of Table 2

Valor médio do metro quadrado do Aluguel
			Apartamentos Alto Padrão com 3 SUÌTES ou 4 dormitórios
3 ou mais  Vagas de Garagem
			Zona Norte - São Paulo



Translating results.

__Table 0__ contains the mean rental prices per square meter (BRL/m<sup>2</sup>) for apartments with 1, 2 or 3 bedrooms and 1 garage spot

__Table 1__ contains the mean rental prices per square meter (BRL/m<sup>2</sup>) for apartments with 2 or 3 bedrooms and 2 garage spots

__Table 2__ contains the mean rental prices per square meter (BRL/m<sup>2</sup>) for high standard apartments with 3 bedrooms with private suites or 4 bedrooms and 3 or more garage spots

For this project __Table 0__ will be used, as it fits the __Relocator Profile__.
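
The table was chosen here by reading the titles. As a hedged sketch, the same choice could be made by searching the title text for the garage requirement of the Relocator Profile. The function name `find_table_index` is hypothetical, and the titles below are typed from the output above with simplified whitespace.

```python
def find_table_index(titles, phrase):
    """Return the index of the first title containing `phrase`, or None."""
    wanted = ' '.join(phrase.split()).lower()
    for index, title in enumerate(titles):
        # collapse tabs, line breaks and non-breaking spaces into single spaces
        normalized = ' '.join(title.split()).lower()
        if wanted in normalized:
            return index
    return None


titles = [
    'Valor médio do metro quadrado do Aluguel\n\t\t\tApartamentos 1, 2 e 3 dormitórios\xa0\n1 Vaga de Garagem\n\t\t\tZona Norte - São Paulo',
    'Valor médio do metro quadrado do Aluguel\n\t\t\tApartamentos 2 e 3 dormitórios\n2 Vagas de Garagem\n\t\t\tZona Norte - São Paulo',
    'Valor médio do metro quadrado do Aluguel\n\t\t\tApartamentos Alto Padrão com 3 SUÌTES ou 4 dormitórios\n3 ou mais  Vagas de Garagem\n\t\t\tZona Norte - São Paulo',
]

table_index = find_table_index(titles, '1 Vaga de Garagem')
print(table_index)
```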

Extract the data from the table to a list and check where the required data is located. This will be done running a nested loop through the rows, on the first level, and columns, on the second level. HTML tag _tr_ will be used to extract rows and _td_ to extract the columns.

In [29]:
# get the table from the URL content [tag 'table']
par_table = par.find_all('table')[0]

# create an empty list for the entire table
tabletmp = []

# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # create an empty list for the current row
    celltmp = []

    # run a loop by column for the current row [tag 'td']
    for j, column in enumerate(row.find_all('td')):
        # append the text of current cell to the list
        celltmp.append(column.get_text())

    # append current line to the list
    tabletmp.append(celltmp)
        
# inform the number of rows loaded
print('{} rows loaded.'.format(len(tabletmp)))

28 rows loaded.


Check the head and tail of the list to define the range of required data.

In [30]:
# print the first 10 rows
print('Head 10 rows')
tabletmp[:10]

Head 10 rows


[['Valor médio do metro quadrado do Aluguel\r\n\t\t\tApartamentos 1, 2 e 3 dormitórios\xa0\n1 Vaga de Garagem\r\n\t\t\tZona Norte - São Paulo'],
 ['Bairros', 'Valor médio do m² Aluguel'],
 ['Santana', 'R$ 23,80'],
 ['Lauzane Paulista', 'R$ 22,10'],
 ['Mandaqui', 'R$ 21,70'],
 ['Tucuruvi', 'R$ 23,00'],
 ['Parada Inglesa', 'R$ 27,00'],
 ['Vila Guilherme', 'R$ 25,80'],
 ['Jardim São Paulo', 'R$ 22,40'],
 ['Vila Mazzei', 'R$ 23,60']]

The first row contains the table title, the second row the columns headers.

Check the tail of the table.

In [31]:
#print the last 10 rows
print('Tail 10 rows')
tabletmp[-10:]

Tail 10 rows


[['Cachoeirinha', 'R$ 21,00'],
 ['Vila Amália', 'R$ 22,30'],
 ['Vila Gustavo', 'R$ 23,70'],
 ['Limão', 'R$ 23,30'],
 ['Vila Medeiros', 'R$ 22,20'],
 ['Vila Nova Cachoeirinha', 'R$ 21,80'],
 ['Tremembé', 'R$ 20,10'],
 ['Horto Florestal', 'R$ 20,40'],
 ['Alto de Santana', 'R$ 23,80'],
 ['Dados Fevereiro 2020\r\n\t\t\tPortal ZN Imóvel']]

Row 28, the last one, contains the footnote.

Check the columns headers in row 2.

In [32]:
tabletmp[1]

['Bairros', 'Valor médio do m² Aluguel']

There are 2 columns. English headers will be defined for the __Rental Prices Dataframe__ as follows:

* __neighborhood__ (Bairros), the first column is the Neighborhood name
* __mean_price_sqm__ (Valor médio do m<sup>2</sup> Aluguel), the second column is the mean rental price per m<sup>2</sup> in BRL (BRL/m<sup>2</sup>)

In [33]:
# define the columns names
column_names = ['neighborhood','mean_price_sqm']

With 1 title row and 1 header row, the first 2 rows at the top will be ignored. At the bottom, 1 footnote row will be ignored. The data ranges from row 3 (index 2) to row 27 (index 26), for a total of 25 neighborhoods in region _Norte_.

Extract only the data rows from the table and count the number of rows left.

In [34]:
# extract only data rows
tabletmp = tabletmp[2:27]

# print quantity of rows left
print('{} rows left.'.format(len(tabletmp)))

25 rows left.


The list is ready to be stored in a dataframe. Create the __Rental Prices Norte Dataframe__, using _Pandas_.

In [35]:
# create the dataframe
df_rentalpricesN = pd.DataFrame(data=tabletmp, columns=column_names)

Check the dataframe head.

In [36]:
# check the head
df_rentalpricesN.head()

Unnamed: 0,neighborhood,mean_price_sqm
0,Santana,"R$ 23,80"
1,Lauzane Paulista,"R$ 22,10"
2,Mandaqui,"R$ 21,70"
3,Tucuruvi,"R$ 23,00"
4,Parada Inglesa,"R$ 27,00"


Data in the __mean_price_sqm__ column is not in a usable format, as it contains the currency symbol and uses a comma as the decimal separator, following the Brazilian standard.

Convert the price _string_ to _float_ format. First remove the currency symbol, then replace the decimal separator, both using the _Pandas apply_ method, and finally convert _string_ to _float_ in the dataframe using the _Pandas astype_ method.

In [37]:
# remove currency symbol
df_rentalpricesN['mean_price_sqm'] = df_rentalpricesN['mean_price_sqm'].apply(lambda x : x.replace('R$ ', ''))

# change decimal separator
df_rentalpricesN['mean_price_sqm'] = df_rentalpricesN['mean_price_sqm'].apply(lambda x : x.replace(',', '.'))

# check the head
df_rentalpricesN.head()

Unnamed: 0,neighborhood,mean_price_sqm
0,Santana,23.8
1,Lauzane Paulista,22.1
2,Mandaqui,21.7
3,Tucuruvi,23.0
4,Parada Inglesa,27.0


Check the dataframe columns data types.

In [38]:
# print the columns data types
df_rentalpricesN.dtypes

neighborhood      object
mean_price_sqm    object
dtype: object

Change the type of __mean_price_sqm__ column from _object_ to _float_.

In [39]:
# convert the price column to float
df_rentalpricesN = df_rentalpricesN.astype({'mean_price_sqm': 'float64'})

# print the columns data types
df_rentalpricesN.dtypes

neighborhood       object
mean_price_sqm    float64
dtype: object

In [40]:
# check results
df_rentalpricesN.head()

Unnamed: 0,neighborhood,mean_price_sqm
0,Santana,23.8
1,Lauzane Paulista,22.1
2,Mandaqui,21.7
3,Tucuruvi,23.0
4,Parada Inglesa,27.0


For regions _Sul_, _Leste_ and _Oeste_ the task is the same, so it will be performed with less step-by-step explanation, as it is only repetition.
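
Because the steps repeat for every region, the row filtering and price conversion could be gathered into one helper. The sketch below is an assumption about how that might look, not the code used in this notebook: `extract_prices` is a hypothetical name, it keeps only two-cell rows whose second cell starts with `R$`, and the removal of a thousands separator is a precaution for prices that do not appear in these tables.

```python
def extract_prices(rows):
    """Return (neighborhood, price) pairs from scraped table rows."""
    prices = []
    for row in rows:
        # title and footnote rows have a single cell; the header row has no 'R$'
        if len(row) != 2 or not row[1].strip().startswith('R$'):
            continue
        name, raw_price = row
        # 'R$ 23,80' -> '23,80' -> '23.80'
        number = raw_price.replace('R$', '').strip()
        number = number.replace('.', '').replace(',', '.')
        prices.append((name.strip(), float(number)))
    return prices


sample = [
    ['Valor médio do metro quadrado do Aluguel\r\n\t\t\tApartamentos 1, 2 e 3 dormitórios\xa0\n1 Vaga de Garagem\r\n\t\t\tZona Norte - São Paulo'],
    ['Bairros', 'Valor médio do m² Aluguel'],
    ['Santana', 'R$ 23,80'],
    ['Lauzane Paulista', 'R$ 22,10'],
    ['Dados Fevereiro 2020\r\n\t\t\tPortal ZN Imóvel'],
]

norte_prices = extract_prices(sample)
print(norte_prices)
```

Filtering by content rather than by fixed slice positions means the same helper would work for tables with a different number of neighborhoods.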

__Region__ ___Sul___ - Send request to the URL using _Requests_.

In [41]:
# define a variable for the url
url = 'https://www.zsimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-sul-de-sao-paulo/2874/'

# send a request to the URL and store the response
raw = requests.get(url)

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200


Parse the URL content, check how many tables it contains and check the titles of each table to know which of them is the relevant one to be used.

In [42]:
# parse the raw data
par = bs(raw.text, 'html.parser')

# print number of tables
print('{} table(s) found in the parsed URL.\n'.format(len(par.find_all('table'))))

# run a loop through the tables [tag table]
for i, title in enumerate(par.find_all('table')):
    # print the first row [tag tr]
    print('Title of Table', i)
    print(title.find_all('tr')[0].get_text())

4 table(s) found in the parsed URL.

Title of Table 0

Valor médio do metro quadrado do Aluguel
			Apartamentos 1 ou 2 dormitórios
1 Vaga de Garagem
			Zona Sul - São Paulo

Title of Table 1

Valor médio do metro quadrado do Aluguel
			Apartamentos 2 ou 3 dormitórios
2 Vagas de Garagem
			Zona Sul - São Paulo

Title of Table 2

Valor médio do metro quadrado do Aluguel
			Apartamentos 3 Suítes ou 4 dormitórios
3 Vagas de Garagem
			Zona Sul - São Paulo

Title of Table 3

Valor médio do metro quadrado do Aluguel
			Apartamentos 1 dormitório
SEM VAGA de Garagem
			Zona Sul - São Paulo



Translating results.

__Table 0__ contains prices for apartments with 1 or 2 bedrooms and 1 garage spot

__Table 1__ contains prices for apartments with 2 or 3 bedrooms and 2 garage spots

__Table 2__ contains prices for apartments with 3 bedrooms with private suites or 4 bedrooms and 3 garage spots

__Table 3__ contains prices for apartments with 1 bedroom and no garage spot

For this project __Table 0__ will be used, as it fits the __Relocator Profile__.

Extract the data from the table to a list and confirm where the required data is located.

In [43]:
# get the table from the URL content [tag 'table']
par_table = par.find_all('table')[0]

# create an empty list for the entire table
tabletmp = []

# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # create an empty list for the current row
    celltmp = []

    # run a loop by column for the current row [tag 'td']
    for j, column in enumerate(row.find_all('td')):
        # append the text of current cell to the list
        celltmp.append(column.get_text())

    # append current line to the list
    tabletmp.append(celltmp)
        
# inform the number of rows loaded
print('{} rows loaded.'.format(len(tabletmp)))

# print content of the first three rows
print('Content of three first rows:\n', tabletmp[:3], '\n')

# print content of the last two rows
print('Content of 2 last rows:\n', tabletmp[-2:])

20 rows loaded.
Content of three first rows:
 [['Valor médio do metro quadrado do Aluguel\r\n\t\t\tApartamentos 1 ou 2 dormitórios\n1 Vaga de Garagem\r\n\t\t\tZona Sul - São Paulo'], ['Bairros', 'Valor médio m² Aluguel'], ['Vila Mariana', 'R$ 29,90']] 

Content of 2 last rows:
 [['Campo Limpo', 'R$ 24,50'], ['Dados Março 2020\r\n\t\t\tPortal ZS Imóvel']]


The table structure is the same: 1 title row, 1 header row and 1 footnote row. The first 2 rows and the last row will be ignored.

Create the __Rental Prices Sul Dataframe__ using only the data rows.

In [44]:
# create the dataframe
df_rentalpricesS = pd.DataFrame(data=tabletmp[2:19], columns=column_names)

# print results
print('{} observations'.format(df_rentalpricesS.shape[0]))
df_rentalpricesS.head()

17 observations


Unnamed: 0,neighborhood,mean_price_sqm
0,Vila Mariana,"R$ 29,90"
1,Moema,"R$ 32,60"
2,Morumbi,"R$ 27,40"
3,Saúde,"R$ 28,20"
4,Ipiranga,"R$ 28,70"


Convert the price string to _float_ format.

In [45]:
# remove currency symbol and change decimal separator
df_rentalpricesS['mean_price_sqm'] = df_rentalpricesS['mean_price_sqm'].apply(lambda x : x.replace('R$ ', ''))
df_rentalpricesS['mean_price_sqm'] = df_rentalpricesS['mean_price_sqm'].apply(lambda x : x.replace(',', '.'))

# convert the price column to float
df_rentalpricesS = df_rentalpricesS.astype({'mean_price_sqm': 'float64'})

# print the columns data types
print('Dataframe data types:\n', df_rentalpricesS.dtypes, '\n')

Dataframe data types:
 neighborhood       object
mean_price_sqm    float64
dtype: object 
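The two `replace` steps and the float conversion can also be wrapped in one small helper. A minimal sketch in plain Python, assuming prices always look like `R$ 29,90`; the name `parse_brl_price` is hypothetical and not part of the notebook:

```python
def parse_brl_price(text):
    """Convert a price string such as 'R$ 29,90' to a float."""
    # drop the currency symbol and surrounding whitespace
    cleaned = text.replace('R$', '').strip()
    # assumption: '.' only appears as a thousands separator, so remove it
    cleaned = cleaned.replace('.', '')
    # Brazilian format uses a comma as the decimal separator
    cleaned = cleaned.replace(',', '.')
    return float(cleaned)


prices = ['R$ 29,90', 'R$ 32,60', 'R$ 27,40']
print([parse_brl_price(p) for p in prices])
```

In the notebook it could be applied to the price column with `Series.map(parse_brl_price)`.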



Check results

In [46]:
# check results
df_rentalpricesS.head()

Unnamed: 0,neighborhood,mean_price_sqm
0,Vila Mariana,29.9
1,Moema,32.6
2,Morumbi,27.4
3,Saúde,28.2
4,Ipiranga,28.7


__Region__ ___Leste___ - Send a request to the URL using _Requests_.

In [47]:
# define a variable for the url
url = 'https://www.zlimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-leste-de-sao-paulo/2824/'

# send a request to the URL and store the response
raw = requests.get(url)

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200


Parse the page content, count the tables it contains and print the title of each table to identify which one is relevant.

In [48]:
# parse the raw data
par = bs(raw.text, 'html.parser')

# print number of tables
print('{} table(s) found in the parsed URL.\n'.format(len(par.find_all('table'))))

# run a loop through the tables [tag table]
for i, title in enumerate(par.find_all('table')):
    # print the first row [tag tr]
    print('Title of Table', i)
    print(title.find_all('tr')[0].get_text())

4 table(s) found in the parsed URL.

Title of Table 0

Valor médio do metro quadrado do Aluguel
			Apartamentos 1, 2 e 3 dormitórios
1 Vaga de Garagem
			Zona Leste - São Paulo

Title of Table 1

Valor médio do metro quadrado do Aluguel
Apartamentos 2 e 3 dormitórios
2 Vagas de Garagem
			Zona Leste - São Paulo

Title of Table 2

Valor médio do metro quadrado do Aluguel
Apartamentos 3 Suítes ou 4 Dormitórios
3 ou mais Vagas de Garagem
			Zona Leste - São Paulo

Title of Table 3

Valor médio do metro quadrado do Aluguel
Apartamentos 1, 2 e 3 dormitórios
1 Vaga de Garagem
			Cohab, Zona Leste - São Paulo



Translating the results:

__Table 0__ contains prices for apartments with 1, 2 or 3 bedrooms and 1 garage spot

__Table 1__ contains prices for apartments with 2 or 3 bedrooms and 2 garage spots

__Table 2__ contains prices for apartments with 3 bedrooms with private suites or 4 bedrooms and 3 or more garage spots

__Table 3__ contains prices for apartments with 1, 2 or 3 bedrooms and 1 garage spot from _Cohab_, which is a government housing support program

For this project, __Table 0__ will be used because it is the best fit for the __Relocator Profile__.
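The manual choice can be cross-checked in code by searching the titles for the wanted garage count. A rough sketch, assuming the titles keep the wording shown in the output above; `pick_table` and its arguments are hypothetical, and the titles are shortened copies of that output:

```python
def pick_table(titles, must_have, must_not_have=()):
    """Return the index of the first title containing every wanted
    phrase and none of the unwanted ones, or None if nothing matches."""
    for i, title in enumerate(titles):
        # collapse tabs and line breaks so phrases match reliably
        flat = ' '.join(title.split())
        wanted = all(p in flat for p in must_have)
        unwanted = any(p in flat for p in must_not_have)
        if wanted and not unwanted:
            return i
    return None


# titles abbreviated from the Zona Leste page output
titles = [
    'Apartamentos 1, 2 e 3 dormitórios\n1 Vaga de Garagem\r\n\t\t\tZona Leste - São Paulo',
    'Apartamentos 2 e 3 dormitórios\n2 Vagas de Garagem\r\n\t\t\tZona Leste - São Paulo',
    'Apartamentos 3 Suítes ou 4 Dormitórios\n3 ou mais Vagas de Garagem\r\n\t\t\tZona Leste - São Paulo',
    'Apartamentos 1, 2 e 3 dormitórios\n1 Vaga de Garagem\r\n\t\t\tCohab, Zona Leste - São Paulo',
]

chosen = pick_table(titles, must_have=['1 Vaga de Garagem'], must_not_have=['Cohab'])
print(chosen)
```

Excluding _Cohab_ matters here because Table 0 and Table 3 share the same bedroom and garage wording.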

Extract the data from the table to a list and confirm where the required data is located.

In [49]:
# get the table from the URL content [tag 'table']
par_table = par.find_all('table')[0]

# create an empty list for the entire table
tabletmp = []

# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # create an empty list for the current row
    celltmp = []

    # run a loop by column for the current row [tag 'td']
    for j, column in enumerate(row.find_all('td')):
        # append the text of current cell to the list
        celltmp.append(column.get_text())

    # append current line to the list
    tabletmp.append(celltmp)
        
# inform the number of rows loaded
print('{} rows loaded.'.format(len(tabletmp)))

# print content of the first three rows
print('Content of three first rows:\n', tabletmp[:3], '\n')

# print content of the last two rows
print('Content of 2 last rows:\n', tabletmp[-2:])

28 rows loaded.
Content of three first rows:
 [['Valor médio do metro quadrado do Aluguel\r\n\t\t\tApartamentos 1, 2 e 3 dormitórios\n1 Vaga de Garagem\r\n\t\t\tZona Leste - São Paulo'], ['Bairros', '\xa0Valor médio m² Aluguel'], ['Tatuapé', 'R$ 25,70']] 

Content of 2 last rows:
 [['Jd. Nove de Julho', 'R$ 21,90'], ['Dados Fevereiro 2020\r\n\t\t\tPortal ZL Imóvel']]


The table structure is the same: 1 title row, 1 header row and 1 footnote row. The first 2 rows and the last row will be ignored.

Create the __Rental Prices Leste Dataframe__ using only the data rows.

In [50]:
# create the dataframe
df_rentalpricesL = pd.DataFrame(data=tabletmp[2:27], columns=column_names)

# print results
print('{} observations'.format(df_rentalpricesL.shape[0]))
df_rentalpricesL.head()

25 observations


Unnamed: 0,neighborhood,mean_price_sqm
0,Tatuapé,"R$ 25,70"
1,Jd. Anália Franco,"R$ 29,80"
2,Vila Gomes Cardim,"R$ 27,20"
3,Vila Carrão,"R$ 24,90"
4,Mooca,"R$ 26,20"


Convert the price string to _float_ format.

In [51]:
# remove currency symbol and change decimal separator
df_rentalpricesL['mean_price_sqm'] = df_rentalpricesL['mean_price_sqm'].apply(lambda x : x.replace('R$ ', ''))
df_rentalpricesL['mean_price_sqm'] = df_rentalpricesL['mean_price_sqm'].apply(lambda x : x.replace(',', '.'))

# convert the price column to float
df_rentalpricesL = df_rentalpricesL.astype({'mean_price_sqm': 'float64'})

# print the columns data types
print('Dataframe data types:\n', df_rentalpricesL.dtypes, '\n')

Dataframe data types:
 neighborhood       object
mean_price_sqm    float64
dtype: object 



Check results

In [52]:
# check results
df_rentalpricesL.head()

Unnamed: 0,neighborhood,mean_price_sqm
0,Tatuapé,25.7
1,Jd. Anália Franco,29.8
2,Vila Gomes Cardim,27.2
3,Vila Carrão,24.9
4,Mooca,26.2


__Region__ ___Oeste___ - Send a request to the URL using _Requests_.

In [53]:
# define a variable for the url
url = 'https://www.zoimovel.com.br/blog/qual-o-valor-do-metro-quadrado-do-aluguel-dos-apartamentos-na-zona-oeste-de-sao-paulo/2851/'

# send a request to the URL and store the response
raw = requests.get(url)

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200


Parse the page content, count the tables it contains and print the title of each table to identify which one is relevant.

In [54]:
# parse the raw data
par = bs(raw.text, 'html.parser')

# print number of tables
print('{} table(s) found in the parsed URL.\n'.format(len(par.find_all('table'))))

# run a loop through the tables [tag table]
for i, title in enumerate(par.find_all('table')):
    # print the first row [tag tr]
    print('Title of Table', i)
    print(title.find_all('tr')[0].get_text())

4 table(s) found in the parsed URL.

Title of Table 0

Valor médio do metro quadrado do Aluguel
			Apartamentos 1 e 2 dormitórios
1 Vaga de Garagem
			Zona Oeste - São Paulo

Title of Table 1

Valor médio do metro quadrado do Aluguel
Apartamentos 2 e 3 dormitórios
2 Vagas de Garagem
			Zona Oeste - São Paulo

Title of Table 2

Valor médio do metro quadrado do Aluguel
Apartamentos 3 e 4 dormitórios
3 Vagas de Garagem
			Zona Oeste - São Paulo

Title of Table 3

Valor médio do metro quadrado do Aluguel
Apartamentos ou Kitnets 1 dormitório
SEM VAGA de Garagem
			Zona Oeste - São Paulo



Translating the results:

__Table 0__ contains prices for apartments with 1 or 2 bedrooms and 1 garage spot

__Table 1__ contains prices for apartments with 2 or 3 bedrooms and 2 garage spots

__Table 2__ contains prices for apartments with 3 or 4 bedrooms and 3 garage spots

__Table 3__ contains prices for apartments with 1 bedroom and no garage spot

For this project, __Table 0__ will be used because it is the best fit for the __Relocator Profile__.

Extract the data from the table to a list and confirm where the required data is located.

In [55]:
# get the table from the URL content [tag 'table']
par_table = par.find_all('table')[0]

# create an empty list for the entire table
tabletmp = []

# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # create an empty list for the current row
    celltmp = []

    # run a loop by column for the current row [tag 'td']
    for j, column in enumerate(row.find_all('td')):
        # append the text of current cell to the list
        celltmp.append(column.get_text())

    # append current line to the list
    tabletmp.append(celltmp)
        
# inform the number of rows loaded
print('{} rows loaded.'.format(len(tabletmp)))

# print content of the first three rows
print('Content of three first rows:\n', tabletmp[:3], '\n')

# print content of the last two rows
print('Content of 2 last rows:\n', tabletmp[-2:])

24 rows loaded.
Content of three first rows:
 [['Valor médio do metro quadrado do Aluguel\r\n\t\t\tApartamentos 1 e 2 dormitórios\n1 Vaga de Garagem\r\n\t\t\tZona Oeste - São Paulo'], ['Bairros', 'Valor médio de m²'], ['Perdizes', 'R$ 27,80']] 

Content of 2 last rows:
 [['Vila Leopoldina', 'R$ 33,50'], ['Dados Fevereiro 2020\r\n\t\t\tPortal ZO Imóvel']]


The table structure is the same: 1 title row, 1 header row and 1 footnote row. The first 2 rows and the last row will be ignored.

Create the __Rental Prices Oeste Dataframe__ using only the data rows.

In [56]:
# create the dataframe
df_rentalpricesO = pd.DataFrame(data=tabletmp[2:23], columns=column_names)

# print results
print('{} observations'.format(df_rentalpricesO.shape[0]))
df_rentalpricesO.head()

21 observations


Unnamed: 0,neighborhood,mean_price_sqm
0,Perdizes,"R$ 27,80"
1,Pinheiros,"R$ 31,90"
2,Butantã,"R$ 25,00"
3,Jaguaré,"R$ 26,50"
4,Vila Madalena,"R$ 31,20"


For the region _Oeste_, two of the neighborhoods did not have enough samples to compute a mean price. For those neighborhoods the source table shows the text __'*Prejudicado'__ instead of the mean price.

In [57]:
# check entries with text in the mean price column
df_rentalpricesO[df_rentalpricesO['mean_price_sqm']=='*Prejudicado']

Unnamed: 0,neighborhood,mean_price_sqm
13,República,*Prejudicado
18,Vila São Francisco (ZO),*Prejudicado


Convert the price string to _float_ format. This time, before converting, replace the text in the mean price column with NaN for the observations at __indexes 13 and 18__, using _Pandas iloc_ and _Numpy NaN_.

In [58]:
# remove currency symbol and change decimal separator
df_rentalpricesO['mean_price_sqm'] = df_rentalpricesO['mean_price_sqm'].apply(lambda x : x.replace('R$ ', ''))
df_rentalpricesO['mean_price_sqm'] = df_rentalpricesO['mean_price_sqm'].apply(lambda x : x.replace(',', '.'))

# replace text by NaN in mean price
df_rentalpricesO.iloc[[13,18],[1]] = np.nan

# convert the price column to float
df_rentalpricesO = df_rentalpricesO.astype({'mean_price_sqm': 'float64'})

# print the columns data types
print('Dataframe data types:\n', df_rentalpricesO.dtypes, '\n')

Dataframe data types:
 neighborhood       object
mean_price_sqm    float64
dtype: object 
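Hard-coded row indexes stop working if the table order changes. As an alternative sketch (not the notebook's method), `pd.to_numeric` with `errors='coerce'` turns any non-numeric text such as `*Prejudicado` into NaN on its own. The sample frame `df_demo` is illustrative, built from a few rows shown above:

```python
import pandas as pd

# small illustrative frame; values copied from the Oeste table
df_demo = pd.DataFrame({
    'neighborhood': ['Perdizes', 'República', 'Pinheiros'],
    'mean_price_sqm': ['R$ 27,80', '*Prejudicado', 'R$ 31,90'],
})

# strip the currency symbol and switch to a dot decimal separator
cleaned = (df_demo['mean_price_sqm']
           .str.replace('R$ ', '', regex=False)
           .str.replace(',', '.', regex=False))

# anything that is not a number becomes NaN
df_demo['mean_price_sqm'] = pd.to_numeric(cleaned, errors='coerce')

print(df_demo)
```

The trade-off is that `errors='coerce'` silently hides every unparsable value, so counting the resulting NaN entries afterwards is a sensible check.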



Check results

In [59]:
# check results
df_rentalpricesO.head()

Unnamed: 0,neighborhood,mean_price_sqm
0,Perdizes,27.8
1,Pinheiros,31.9
2,Butantã,25.0
3,Jaguaré,26.5
4,Vila Madalena,31.2


Handle the missing data. In this case the missing mean prices will be filled with the __region mean price per sqm__.

Check the observations with a missing mean price per sqm.

In [60]:
# check observations with missing data
df_rentalpricesO[df_rentalpricesO['mean_price_sqm'].isna()]

Unnamed: 0,neighborhood,mean_price_sqm
13,República,
18,Vila São Francisco (ZO),


Replace the missing mean prices with the region mean price for the observations at __indexes 13 and 18__.

In [61]:
# Replace the missing mean prices with the region mean price
df_rentalpricesO.iloc[[13,18],[1]] = round(df_rentalpricesO['mean_price_sqm'].mean(),2)

# check results
df_rentalpricesO.iloc[[13,18]]

Unnamed: 0,neighborhood,mean_price_sqm
13,República,29.62
18,Vila São Francisco (ZO),29.62
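The same imputation can be written without row indexes by using `fillna`. A minimal sketch with made-up sample values; `df_demo` and its neighborhoods `A`, `B`, `C` are illustrative only:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'neighborhood': ['A', 'B', 'C'],
    'mean_price_sqm': [20.0, np.nan, 30.0],
})

# Series.mean skips NaN by default, so only known prices are averaged
region_mean = round(df_demo['mean_price_sqm'].mean(), 2)

# fill every missing price with the region mean
df_demo['mean_price_sqm'] = df_demo['mean_price_sqm'].fillna(region_mean)

print(df_demo)
```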


__Region__ ___Centro___ - As already stated, a list of prices per neighborhood is not available for this region, so the monthly survey per region by __O Sindicato da Habitação de São Paulo (SECOVI-SP)__ (https://secovi.com.br), as of February 2020, will be used. Since the report is in PDF format and only one line of its 10 pages is required, the __Rental Prices Centro Dataframe__ will be created manually using the information from that survey.

For apartments in good condition, the report gives the following minimum and maximum prices per square meter.
* 2-bedroom apartment
    * min = BRL 23.29/m<sup>2</sup>
    * max = BRL 28.25/m<sup>2</sup>
* 3-bedroom apartment
    * min = BRL 24.76/m<sup>2</sup>
    * max = BRL 24.90/m<sup>2</sup>

The mean of the minimum and maximum prices will be calculated using the _Numpy Mean_ function. The __Rental Prices Centro Dataframe__ will be created from the __Neighborhoods Dataframe__, keeping only the observations where the feature __region__ equals _Centro_. Once the dataframe has been created, the calculated mean price will be added to it.
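As a quick check of the arithmetic, using only the four values listed above: the mean of the four prices is 25.30, and because each apartment type contributes two values, it equals the mean of the two midpoints (25.77 and 24.83). A small sketch with the standard library `statistics` module instead of Numpy:

```python
from statistics import mean

two_bed = [23.29, 28.25]    # min and max, 2 bedrooms
three_bed = [24.76, 24.90]  # min and max, 3 bedrooms

# mean of all four reported values
overall = round(mean(two_bed + three_bed), 2)

# midpoint per apartment type
mid_two = round(mean(two_bed), 2)
mid_three = round(mean(three_bed), 2)

print(overall, mid_two, mid_three)
```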

In [62]:
# set a list with the values
val = [23.29, 28.25, 24.76, 24.90]

# create the dataframe using the Neighborhoods Dataframe filtered by region == Centro
# [copy() avoids modifying a view of the original dataframe]
df_rentalpricesC = df_neighborhoods.loc[df_neighborhoods['region'] == 'Centro', ['neighborhood']].copy()

# add the calculated mean price to dataframe [the scalar is broadcast to all rows]
df_rentalpricesC['mean_price_sqm'] = round(np.mean(val), 2)

# print results
df_rentalpricesC

Unnamed: 0,neighborhood,mean_price_sqm
0,Bela Vista,25.3
1,Bom Retiro,25.3
2,Cambuci,25.3
3,Consolação,25.3
4,Liberdade,25.3
5,República,25.3
6,Santa Cecília,25.3
7,Sé,25.3


__Rental Prices Dataframe__ - With the mean rental prices available for all neighborhoods in all regions, combine them into a single dataframe.

Check the number of observations each regional dataframe contains.

In [67]:
# print the quantity of observations for each region
print('Centro = {}\n Norte = {}\n Sul = {}\n Leste = {}\n Oeste = {}\n'.format(df_rentalpricesC.shape[0],
                                                                           df_rentalpricesN.shape[0],
                                                                           df_rentalpricesS.shape[0],
                                                                           df_rentalpricesL.shape[0],
                                                                           df_rentalpricesO.shape[0]))

# print the sum of the observations
print('Sum = {}'.format(df_rentalpricesC.shape[0]+df_rentalpricesN.shape[0]+df_rentalpricesS.shape[0]+df_rentalpricesL.shape[0]+df_rentalpricesO.shape[0]))

Centro = 8
 Norte = 25
 Sul = 17
 Leste = 25
 Oeste = 21

Sum = 96


Combine the 5 dataframes into a single dataframe using the _Pandas Concat_ function.

In [71]:
# concatenate the dataframes into one
df_rentalprices = pd.concat([df_rentalpricesC, 
                             df_rentalpricesN, 
                             df_rentalpricesS, 
                             df_rentalpricesL,
                             df_rentalpricesO], ignore_index=True, axis=0)

# check results
df_rentalprices

Unnamed: 0,neighborhood,mean_price_sqm
0,Bela Vista,25.30
1,Bom Retiro,25.30
2,Cambuci,25.30
3,Consolação,25.30
4,Liberdade,25.30
...,...,...
91,Pirituba,23.70
92,Vila Sônia,25.90
93,Vila São Francisco (ZO),29.62
94,Jardins,32.20
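Because _República_ is listed in both the _Centro_ and the _Oeste_ data above, the combined dataframe can hold the same neighborhood more than once. A hedged sketch of how such repeats could be spotted with `duplicated`, using two tiny illustrative frames (`df_a`, `df_b`) rather than the real data:

```python
import pandas as pd

df_a = pd.DataFrame({'neighborhood': ['República', 'Sé'],
                     'mean_price_sqm': [25.30, 25.30]})
df_b = pd.DataFrame({'neighborhood': ['Perdizes', 'República'],
                     'mean_price_sqm': [27.80, 29.62]})

# stack the frames and renumber the index
df_all = pd.concat([df_a, df_b], ignore_index=True)

# keep=False flags every occurrence of a repeated neighborhood
repeats = df_all[df_all.duplicated(subset='neighborhood', keep=False)]

print(repeats)
```

How to resolve a repeat, for example by keeping one source or averaging both, is a modelling choice that is not stated here.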


---

### Step 4 - Data analysis on _Neighborhoods_ and _Rental Prices_ dataframes

Work in progress