# Part 2

## Real estate pricing

* How to price a property from a set of variables that describe the residence (**covariates**).

### Business strategy

* Use the estimated property price to set the price offered to the end consumer.
* Prioritize which properties to sell based on the estimated price.

**Imports required to run the study**

In [1]:
#Core libraries
import pandas as pd     #Data manipulation
import numpy as np      #Multidimensional and mathematical operations
import matplotlib.pyplot as plt    #Plots
import matplotlib.ticker as ticker #Remove scientific notation from plot axes
import seaborn as sns              #Plots
##Machine learning
from pycaret.regression import *
##Suppress warnings
import warnings
warnings.filterwarnings("ignore")
##Show all data frame columns
pd.set_option('display.max_columns', None)
##Used to extract the feature importance values
import sklearn as sk
#Set the float display format to avoid scientific notation
pd.options.display.float_format = '{:.2f}'.format
#MAPE (evaluation metric)
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

**Installed Python and PyCaret versions**

In [2]:
#Check the Python version
import sys
print(f"Python version: {sys.version}") #Python version: 3.11.8

Python version: 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]


In [3]:
#Check the PyCaret version
import pycaret
print(f"PyCaret version: {pycaret.__version__}") #PyCaret version: 3.3.0

PyCaret version: 3.3.0


# 1 - Data overview

In [4]:
#Load the dataset
df = pd.read_csv('base_dados.csv', sep = ';')
#Preview
df.head(3) #The first three rows of the table

Unnamed: 0,ID_RESIDENCIA,PRECO,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA
0,A1,13300000,7420,4.0,2.0,3,SIM,NAO,NAO,,SIM,2.0,SIM,SIM
1,A2,12250000,8960,4.0,4.0,4,SIM,NAO,,NAO,SIM,3.0,NAO,SIM
2,A3,12250000,9960,,2.0,2,SIM,NAO,SIM,NAO,NAO,2.0,SIM,PARCIALMENTE


* NaN - **not a number** (a null / *missing* value).

In [5]:
#Distinct count of IDs
##Each ID identifies one specific residence
df['ID_RESIDENCIA'].drop_duplicates().shape[0]
#The count without duplicates is 545 observations.

545

In [6]:
#Basic characteristics of the data frame - Part 1
df.shape #545 observations and 14 variables: each row is one observation and each column is one variable.

(545, 14)

In [7]:
#Basic characteristics of the data frame - Part 2
df.info() #Variable name, count of non-null observations, and variable type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID_RESIDENCIA           545 non-null    object 
 1   PRECO                   545 non-null    int64  
 2   AREA                    545 non-null    int64  
 3   QUARTOS                 481 non-null    float64
 4   BANHEIROS               465 non-null    float64
 5   ANDARES                 545 non-null    int64  
 6   FLAG_CENTRO             533 non-null    object 
 7   FLAG_QUARTO_HOSPEDE     523 non-null    object 
 8   FLAG_PORAO              510 non-null    object 
 9   FLAG_AGUA_MORNA         469 non-null    object 
 10  FLAG_AR_CONDICIONADO    532 non-null    object 
 11  VAGAS_ESTACIONAMENTO    485 non-null    float64
 12  FLAG_AREA_PREFERENCIAL  527 non-null    object 
 13  MOBILIADA               493 non-null    object 
dtypes: float64(3), int64(3), object(8)
memory 
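
The non-null counts above can be turned into missing-value counts by subtracting them from the 545 rows. The sketch below does that arithmetic in plain Python, using only numbers copied from the `df.info()` output; the dictionary and variable names are illustrative.

```python
# Non-null counts copied from the df.info() output above (545 rows in total).
N_ROWS = 545
non_null = {
    "QUARTOS": 481,
    "BANHEIROS": 465,
    "FLAG_CENTRO": 533,
    "FLAG_QUARTO_HOSPEDE": 523,
    "FLAG_PORAO": 510,
    "FLAG_AGUA_MORNA": 469,
    "FLAG_AR_CONDICIONADO": 532,
    "VAGAS_ESTACIONAMENTO": 485,
    "FLAG_AREA_PREFERENCIAL": 527,
    "MOBILIADA": 493,
}

# Missing values per column = total rows - non-null count.
missing = {col: N_ROWS - count for col, count in non_null.items()}

# Print columns from most to least incomplete, with the share of missing rows.
for col, n_missing in sorted(missing.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<24} {n_missing:>3} missing ({n_missing / N_ROWS:.1%})")
```

With the data frame loaded, `df.isna().sum()` should give the same counts directly.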

In [8]:
#What are the classes of the MOBILIADA variable?
df['MOBILIADA'].drop_duplicates() #This variable can be treated as ordinal:
                                  #a furnished house can be seen as better than a partially furnished or unfurnished one

0              SIM
2     PARCIALMENTE
7              NAO
12             NaN
Name: MOBILIADA, dtype: object
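
Treating MOBILIADA as ordinal means replacing each label with its rank. A minimal sketch of that idea follows, assuming the order NAO < PARCIALMENTE < SIM; the helper `encode_mobiliada` is hypothetical and is not part of PyCaret.

```python
# Assumed category order: NAO < PARCIALMENTE < SIM.
ORDER = ["NAO", "PARCIALMENTE", "SIM"]
ORDINAL_MAP = {label: rank for rank, label in enumerate(ORDER)}

def encode_mobiliada(value):
    """Return the ordinal rank of a MOBILIADA label, or None when it is missing."""
    return ORDINAL_MAP.get(value)

sample = ["SIM", "PARCIALMENTE", "NAO", None]
encoded = [encode_mobiliada(v) for v in sample]
print(encoded)  # [2, 1, 0, None]
```

This sketch leaves missing labels as `None`; a modeling pipeline would normally impute them before encoding.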

# 2 - PyCaret

## 2.1 - Setup

* This function takes many parameters and prepares the modeling environment: it preprocesses the data (**imputation**, **normalization**, and **encoding**) and splits it into training and test sets.

In [9]:
exp = setup(df, target = 'PRECO', session_id = 1935, train_size = 0.65, ignore_features=['ID_RESIDENCIA'],
            normalize = True, normalize_method = 'minmax', numeric_imputation = 'median', categorical_imputation = 'mode',
            ordinal_features = {'MOBILIADA': ['NAO','PARCIALMENTE', 'SIM']}, experiment_name= "EXP_REGRESSAO")

Unnamed: 0,Description,Value
0,Session id,1935
1,Target,PRECO
2,Target type,Regression
3,Original data shape,"(545, 14)"
4,Transformed data shape,"(545, 15)"
5,Transformed train set shape,"(354, 15)"
6,Transformed test set shape,"(191, 15)"
7,Ignore features,1
8,Ordinal features,1
9,Numeric features,5
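
The 354 and 191 row counts in the summary follow from `train_size = 0.65` applied to 545 rows. The sketch below reproduces only those sizes with a plain NumPy shuffle; `shuffle_split` is a hypothetical helper, and the rows it selects will not match the ones PyCaret picks.

```python
import numpy as np

def shuffle_split(n_rows, train_size, seed):
    """Shuffle row positions and split them into train and test index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(n_rows * train_size)  # 545 * 0.65 = 354.25, truncated to 354
    return shuffled[:n_train], shuffled[n_train:]

train_idx, test_idx = shuffle_split(n_rows=545, train_size=0.65, seed=1935)
print(len(train_idx), len(test_idx))  # 354 191
```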


In [10]:
#Training set
df_treino = get_config('train')
#Preview
df_treino.head(3)

Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA,PRECO
229,9667,4.0,2.0,2,SIM,SIM,SIM,NAO,NAO,1.0,NAO,PARCIALMENTE,4690000
114,6800,,1.0,1,SIM,SIM,SIM,NAO,NAO,,NAO,SIM,6020000
405,3060,3.0,1.0,1,SIM,NAO,NAO,NAO,NAO,0.0,,NAO,3465000


In [11]:
#Training-set variables after preprocessing (transformed)
X = get_config('X_train_transformed')
X.head(3)

Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA_1.0,MOBILIADA_2.0,MOBILIADA_0.0
229,0.55,0.6,0.33,0.33,1.0,1.0,1.0,0.0,0.0,0.33,0.0,1.0,0.0,0.0
114,0.35,0.4,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
405,0.1,0.4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [12]:
#BANHEIROS variable (transformed)
X['BANHEIROS'].describe()

count   354.00
mean      0.08
std       0.15
min       0.00
25%       0.00
50%       0.00
75%       0.00
max       1.00
Name: BANHEIROS, dtype: float64

In [13]:
#BANHEIROS variable (raw)
df_treino['BANHEIROS'].describe()

count   299.00
mean      1.27
std       0.49
min       1.00
25%       1.00
50%       1.00
75%       2.00
max       4.00
Name: BANHEIROS, dtype: float64
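
The two summaries differ because the raw column still has missing values and its original scale, while the transformed one has been imputed and rescaled. A rough sketch of median imputation followed by min-max scaling is shown below on toy values; `impute_median_then_minmax` is a hypothetical helper, not the code PyCaret runs internally.

```python
import numpy as np

def impute_median_then_minmax(values):
    """Fill NaNs with the median of the observed values, then rescale to [0, 1]."""
    x = np.asarray(values, dtype=float)
    median = np.nanmedian(x)                  # median ignoring missing entries
    filled = np.where(np.isnan(x), median, x)
    lo, hi = filled.min(), filled.max()
    return (filled - lo) / (hi - lo)          # assumes hi > lo

# Toy bathroom counts (hypothetical values, not rows from the dataset).
raw = [1.0, 1.0, np.nan, 2.0, 4.0, 1.0]
scaled = impute_median_then_minmax(raw)
print(np.round(scaled, 2))
```

With a minimum of 1 and a maximum of 4, a value of 2 maps to (2 - 1) / (4 - 1) ≈ 0.33, which is the value displayed for BANHEIROS = 2 in the transformed table. In a real pipeline the median, minimum, and maximum are learned on the training set and reused on the test set.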

In [14]:
#Test set - used for the final evaluation of the models
df_teste = get_config('test')
#Preview
df_teste.head(3)

Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA,PRECO
483,6615,3.0,,2,SIM,NAO,NAO,,NAO,0.0,NAO,PARCIALMENTE,2940000
172,8400,,1.0,2,SIM,SIM,SIM,NAO,SIM,2.0,SIM,NAO,5250000
144,4700,4.0,1.0,2,SIM,SIM,SIM,NAO,SIM,1.0,NAO,SIM,5600000


## 2.2 - Compare models

* This function trains every available algorithm and evaluates each one with cross-validation (10 folds by default). It returns a summary of the evaluation metrics for each model.
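
To make the fold mechanics concrete, here is a minimal sketch of k-fold index splitting with NumPy. It assumes contiguous, unshuffled folds; `kfold_indices` is a hypothetical helper and is not how PyCaret is implemented.

```python
import numpy as np

def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; each fold is the validation set once."""
    folds = np.array_split(np.arange(n_rows), n_folds)
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train_idx, valid_idx

# 354 is the training-set size reported by setup(); 10 is the default fold count.
splits = list(kfold_indices(n_rows=354, n_folds=10))
print([len(valid) for _, valid in splits])
```

Each model is fitted 10 times, every time on nine folds and scored on the remaining one, and the reported metric is the average of the 10 scores.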

In [15]:
#Which algorithms are available
models() #Models with Turbo = True are included in compare_models by default; those with False are slower estimators that are only included when turbo = False is passed.

Unnamed: 0_level_0,Name,Reference,Turbo
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
lr,Linear Regression,sklearn.linear_model._base.LinearRegression,True
lasso,Lasso Regression,sklearn.linear_model._coordinate_descent.Lasso,True
ridge,Ridge Regression,sklearn.linear_model._ridge.Ridge,True
en,Elastic Net,sklearn.linear_model._coordinate_descent.Elast...,True
lar,Least Angle Regression,sklearn.linear_model._least_angle.Lars,True
llar,Lasso Least Angle Regression,sklearn.linear_model._least_angle.LassoLars,True
omp,Orthogonal Matching Pursuit,sklearn.linear_model._omp.OrthogonalMatchingPu...,True
br,Bayesian Ridge,sklearn.linear_model._bayes.BayesianRidge,True
ard,Automatic Relevance Determination,sklearn.linear_model._bayes.ARDRegression,False
par,Passive Aggressive Regressor,sklearn.linear_model._passive_aggressive.Passi...,True


In [16]:
#Rank the models by MAPE, the evaluation metric adopted here (lowest first)
compare_models(sort = 'MAPE')

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
ridge,Ridge Regression,798444.9154,1170970875337.2876,1064002.9883,0.6434,0.2146,0.1773,0.077
lasso,Lasso Regression,801914.9289,1174788535375.571,1066425.4273,0.6411,0.2155,0.1777,0.079
lar,Least Angle Regression,801915.5597,1174787963312.415,1066425.3492,0.6411,0.2155,0.1777,0.074
llar,Lasso Least Angle Regression,801914.9124,1174788451600.96,1066425.374,0.6411,0.2155,0.1777,0.083
lr,Linear Regression,801439.1203,1171845642690.0366,1065771.5351,0.6415,0.2157,0.1778,1.018
rf,Random Forest Regressor,825826.9176,1309991595265.3264,1119201.9368,0.6056,0.2228,0.1817,0.213
catboost,CatBoost Regressor,832400.1586,1354656084645.7449,1133500.688,0.5983,0.2238,0.182,1.153
lightgbm,Light Gradient Boosting Machine,838711.6148,1393826424273.3423,1149161.9948,0.5745,0.2237,0.1825,0.15
gbr,Gradient Boosting Regressor,872147.4795,1466555121410.4265,1178926.1253,0.5597,0.2315,0.1911,0.114
xgboost,Extreme Gradient Boosting,888222.2829,1577265282402.2122,1226937.4154,0.531,0.2422,0.1924,0.11


### 2.2.1 - Evaluation metric: MAPE (Mean Absolute Percentage Error)

#### Explanation

* MAPE measures how far, on average, the predictions are from the actual values, in percentage terms. **The lower the MAPE, the more accurate the predictions.** For example, a MAPE of 5% means the predictions deviate from the actual values by 5% on average (above or below).


$$
\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{|A_i - F_i|}{A_i} \right) \times 100
$$

##### Notation

* $A_i$ is the actual value of observation $i$;
* $F_i$ is the predicted value of observation $i$;
* $n$ is the total number of observations in the sample;
* $\Sigma$ denotes the sum over all observations;
* $|\cdot|$ denotes the absolute value.
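
The formula translates almost line by line into NumPy. The function below is a minimal sketch (the name `mape_percent` is illustrative) and assumes no actual value is zero.

```python
import numpy as np

def mape_percent(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) / np.abs(actual)) * 100)

# Worked example: errors of 10% and 10% average to a MAPE of 10%.
print(round(mape_percent([100, 200], [110, 180]), 2))  # 10.0
```

Note that the MAPE column in the PyCaret tables is reported as a fraction (for example 0.1773), that is, without the multiplication by 100.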


#### Strategy

* Select the models with the lowest cross-validation MAPE and evaluate them on the test set (four are evaluated in section 2.3, since three of them tie at 0.1777). Then pick the model with the lowest test-set MAPE.

## 2.3 - Create model

* This function trains a given estimator and evaluates its performance with cross-validation.

**Ridge regression**

In [17]:
#Ridge model
model_1 = create_model('ridge') #Cross-validation is applied on the training set
model_1

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,756037.4394,998458272750.5084,999228.839,0.6927,0.2025,0.1629
1,846974.6177,1386433224129.9543,1177468.9907,0.474,0.2571,0.2085
2,792471.3935,1098912921885.7528,1048290.4759,0.6009,0.2319,0.1975
3,766584.6035,1069095800891.7437,1033970.8898,0.7175,0.2003,0.1632
4,800426.5574,1152688897098.671,1073633.5022,0.7677,0.2231,0.1861
5,910730.5534,1172717630443.984,1082920.8791,0.572,0.2457,0.2241
6,658126.0667,729652095867.0696,854196.7548,0.6871,0.171,0.1361
7,1024008.5204,2414841560050.436,1553976.0487,0.5937,0.2369,0.1838
8,785501.9918,1091690895918.2769,1044840.1294,0.5857,0.2229,0.1732
9,643587.4102,595217454336.4785,771503.3729,0.7426,0.1551,0.1373
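
The single MAPE figure shown for Ridge in the `compare_models` ranking (0.1773) is the average of the ten per-fold values in this table. The short check below uses only numbers copied from the table; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold MAPE values copied from the Ridge table above (folds 0 to 9).
fold_mape = [0.1629, 0.2085, 0.1975, 0.1632, 0.1861,
             0.2241, 0.1361, 0.1838, 0.1732, 0.1373]

cv_mean = mean(fold_mape)  # the single figure used to rank the model
cv_std = stdev(fold_mape)  # sample standard deviation: spread across folds

print(f"mean MAPE = {cv_mean:.4f}, std = {cv_std:.4f}")
```

The spread across folds is worth reading alongside the mean: the fold values here range from 0.1361 to 0.2241, which is much wider than the gaps between the top models in the ranking.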


In [18]:
#Evaluate the model on the test set
##Create a data frame
df_teste_1 = predict_model(model_1, data = df_teste)
#Preview
df_teste_1.head(3)
#MAPE = 0.1876

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Ridge Regression,853068.9396,1468199608915.9065,1211692.8691,0.594,0.2337,0.1876


Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA,PRECO,prediction_label
483,6615,3.0,,2,SIM,NAO,NAO,,NAO,0.0,NAO,PARCIALMENTE,2940000,4175094.22
172,8400,,1.0,2,SIM,SIM,SIM,NAO,SIM,2.0,SIM,NAO,5250000,7533131.5
144,4700,4.0,1.0,2,SIM,SIM,SIM,NAO,SIM,1.0,NAO,SIM,5600000,6310951.16


In [19]:
#Another way to obtain the test-set MAPE
mape_1 = mean_absolute_percentage_error(df_teste_1.PRECO, df_teste_1.prediction_label)
print("MAPE:", round(100 * mape_1,2)) #MAPE: 18.76

MAPE: 18.76


**Lasso regression**

In [20]:
#Lasso model
model_2 = create_model('lasso') 
model_2

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,746875.2295,1003645173350.6852,1001820.9288,0.6911,0.2029,0.1606
1,852273.6396,1393906822228.9568,1180638.3114,0.4712,0.2564,0.2086
2,779050.8849,1036320509688.4958,1017998.2857,0.6237,0.2303,0.1953
3,776477.5736,1087058033370.2798,1042620.7524,0.7127,0.2041,0.1667
4,783074.3386,1099782044382.8044,1048704.9368,0.7784,0.2205,0.1829
5,925121.1638,1204001495635.1775,1097270.0195,0.5606,0.2475,0.2262
6,685828.9596,794739325023.9889,891481.5338,0.6592,0.1757,0.1405
7,1038668.536,2426961411238.439,1557870.7941,0.5917,0.239,0.1868
8,780571.2805,1090918645619.2836,1044470.5097,0.586,0.223,0.1717
9,651207.6828,610551893217.6002,781378.2011,0.736,0.1559,0.138


In [21]:
#Evaluate the model on the test set
df_teste_2 = predict_model(model_2, data = df_teste)
#Preview
df_teste_2.head(3) #MAPE = 0.1858

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Lasso Regression,847688.5884,1451529252864.4482,1204794.2782,0.5986,0.232,0.1858


Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA,PRECO,prediction_label
483,6615,3.0,,2,SIM,NAO,NAO,,NAO,0.0,NAO,PARCIALMENTE,2940000,4212043.96
172,8400,,1.0,2,SIM,SIM,SIM,NAO,SIM,2.0,SIM,NAO,5250000,7570575.09
144,4700,4.0,1.0,2,SIM,SIM,SIM,NAO,SIM,1.0,NAO,SIM,5600000,6217405.7


In [22]:
#Lasso regression has a single regularization hyperparameter (alpha). Its default value is 1.
print(model_2.alpha)

1.0


**Least angle regression**

In [23]:
model_3 = create_model('lar') 
model_3

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,746874.0525,1003645103758.2592,1001820.8941,0.6911,0.2029,0.1606
1,852274.9078,1393911727665.5637,1180640.3888,0.4712,0.2564,0.2086
2,779050.3833,1036314426912.3384,1017995.2981,0.6237,0.2303,0.1953
3,776478.1821,1087060505541.4998,1042621.938,0.7127,0.2041,0.1667
4,783072.9369,1099772947686.5862,1048700.5996,0.7784,0.2205,0.1829
5,925124.5279,1204010566769.4966,1097274.153,0.5606,0.2475,0.2263
6,685832.0653,794744643573.4049,891484.5167,0.6592,0.1757,0.1405
7,1038669.2774,2426953138138.2407,1557868.1389,0.5917,0.239,0.1868
8,780569.4876,1090910010222.5502,1044466.3758,0.586,0.223,0.1717
9,651209.776,610556562856.2081,781381.1892,0.736,0.1559,0.138


In [24]:
#Evaluate the model on the test set
df_teste_3 = predict_model(model_3, data = df_teste)
#Preview
df_teste_3.head(3) #MAPE = 0.1858

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Least Angle Regression,847688.6721,1451526443442.6218,1204793.1123,0.5986,0.232,0.1858


Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA,PRECO,prediction_label
483,6615,3.0,,2,SIM,NAO,NAO,,NAO,0.0,NAO,PARCIALMENTE,2940000,4212043.93
172,8400,,1.0,2,SIM,SIM,SIM,NAO,SIM,2.0,SIM,NAO,5250000,7570582.17
144,4700,4.0,1.0,2,SIM,SIM,SIM,NAO,SIM,1.0,NAO,SIM,5600000,6217409.74


**Lasso least angle regression**

In [25]:
model_4 = create_model('llar') 
model_4

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,746874.8574,1003644216412.2888,1001820.4512,0.6911,0.2029,0.1606
1,852273.5449,1393906172381.0862,1180638.0361,0.4712,0.2564,0.2086
2,779050.9804,1036320925911.691,1017998.4901,0.6237,0.2303,0.1953
3,776477.8145,1087058339927.6304,1042620.8994,0.7127,0.2041,0.1667
4,783074.4236,1099782201110.7217,1048705.0115,0.7784,0.2205,0.1829
5,925121.0476,1204001060271.1672,1097269.8211,0.5606,0.2475,0.2262
6,685828.9047,794739212341.4412,891481.4706,0.6592,0.1757,0.1405
7,1038668.653,2426962266356.987,1557871.0686,0.5917,0.239,0.1868
8,780571.2216,1090918350569.3154,1044470.3684,0.586,0.223,0.1717
9,651207.6759,610551770727.2715,781378.1228,0.736,0.1559,0.138


In [26]:
#Evaluate the model on the test set
df_teste_4 = predict_model(model_4, data = df_teste)
#Preview
df_teste_4.head(3) #MAPE = 0.1858

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Lasso Least Angle Regression,847688.5519,1451529177797.327,1204794.2471,0.5986,0.232,0.1858


Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA,PRECO,prediction_label
483,6615,3.0,,2,SIM,NAO,NAO,,NAO,0.0,NAO,PARCIALMENTE,2940000,4212043.85
172,8400,,1.0,2,SIM,SIM,SIM,NAO,SIM,2.0,SIM,NAO,5250000,7570575.15
144,4700,4.0,1.0,2,SIM,SIM,SIM,NAO,SIM,1.0,NAO,SIM,5600000,6217404.55


## 2.4 - Tune model (optional)

* This function searches for a new combination of hyperparameters that may improve the model's performance.
* The model chosen was **lasso**.
    * It tied for the lowest test-set MAPE (0.1858, the same as the two least angle variants), and it is the best known of these methods.
    * It is also easier to explain: a high alpha value means the less important variables are dropped (their coefficients shrink to zero).
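
The effect of alpha can be illustrated with the soft-thresholding operator. In the special case of orthonormal features, and up to how the penalty is scaled, the lasso coefficients are the least-squares coefficients passed through this operator. The sketch below is a simplification under that assumption, with made-up coefficients; it is not what scikit-learn computes for this dataset.

```python
import numpy as np

def soft_threshold(coefs, alpha):
    """Shrink each coefficient toward zero by alpha; anything within alpha of zero becomes 0."""
    coefs = np.asarray(coefs, dtype=float)
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)

# Hypothetical unpenalized coefficients, chosen only for illustration.
unpenalized = np.array([3.0, -0.5, 1.2])

for alpha in (0.1, 1.0, 2.0):
    shrunk = soft_threshold(unpenalized, alpha)
    kept = int(np.count_nonzero(shrunk))
    print(f"alpha={alpha}: {kept} of {len(shrunk)} coefficients kept")
```

As alpha grows, the smallest coefficients hit zero first, which is why lasso doubles as a variable selection method.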

In [27]:
#Try other alpha values to look for a better result
param = {'alpha': [0.0001, 0.01, 0.1, 1, 10, 100, 1000]}

In [29]:
tuned_model = tune_model(model_2, optimize = 'MAPE', fold = 10, 
                        custom_grid = param, n_iter=10)

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,747686.5287,1002880068820.1022,1001438.9991,0.6914,0.2027,0.1607
1,850911.4433,1388446577572.328,1178323.6302,0.4733,0.2559,0.2084
2,779660.6948,1043003392627.1448,1021275.3755,0.6212,0.2306,0.1954
3,776109.2657,1084985042278.451,1041626.1528,0.7133,0.2034,0.1663
4,784557.2435,1109170836199.2737,1053171.798,0.7765,0.2204,0.1828
5,921645.9807,1194621060488.5764,1092987.2188,0.564,0.2466,0.2256
6,682672.7896,789479274042.2103,888526.4622,0.6614,0.1755,0.14
7,1038044.2802,2436186598110.706,1560828.8177,0.5902,0.2387,0.1865
8,782298.6902,1099395228433.9376,1048520.495,0.5828,0.2234,0.1719
9,649112.9791,605865706492.3942,778373.7576,0.738,0.1556,0.1377


Fitting 10 folds for each of 7 candidates, totalling 70 fits


In [31]:
#Evaluate the tuned model on the test set
df_teste_tune = predict_model(tuned_model, data = df_teste)
#Preview
df_teste_tune.head(3) #MAPE: 0.1858

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Lasso Regression,847567.4062,1454375032442.483,1205974.723,0.5979,0.2322,0.1858


Unnamed: 0,AREA,QUARTOS,BANHEIROS,ANDARES,FLAG_CENTRO,FLAG_QUARTO_HOSPEDE,FLAG_PORAO,FLAG_AGUA_MORNA,FLAG_AR_CONDICIONADO,VAGAS_ESTACIONAMENTO,FLAG_AREA_PREFERENCIAL,MOBILIADA,PRECO,prediction_label
483,6615,3.0,,2,SIM,NAO,NAO,,NAO,0.0,NAO,PARCIALMENTE,2940000,4211984.58
172,8400,,1.0,2,SIM,SIM,SIM,NAO,SIM,2.0,SIM,NAO,5250000,7563547.1
144,4700,4.0,1.0,2,SIM,SIM,SIM,NAO,SIM,1.0,NAO,SIM,5600000,6212244.07


In [32]:
#Alpha value = 1000
print(tuned_model.alpha)

1000


In [33]:
#Hyperparameters
print(tuned_model)

Lasso(alpha=1000, random_state=1935)


## 2.5 - Performance evaluation

In [34]:
#Overview of the results
evaluate_model(tuned_model)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

In [35]:
#Top 10 variables
plot_model(tuned_model, plot = 'feature', save=True)

'Feature Importance.png'

In [36]:
#Importance ranking of all variables
plot_model(tuned_model, plot = 'feature_all', save=True)

'Feature Importance (All).png'

In [37]:
#Build a data frame of the feature importance (the estimated coefficients)
feature_importance = tuned_model.coef_
feature_names = X.columns
feature_importance_dict = dict(zip(feature_names, feature_importance))
#Sort by coefficient value, largest first
sorted_feature_importance = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)
df_importance = pd.DataFrame(sorted_feature_importance, columns=['Feature', 'Importance'])
df_importance.to_csv("feature_importance.csv", sep=";")
df_importance #MOBILIADA_1.0 was dropped by the model

Unnamed: 0,Feature,Importance
0,AREA,3332346.19
1,BANHEIROS,2373115.98
2,ANDARES,1815510.06
3,FLAG_AGUA_MORNA,929640.15
4,VAGAS_ESTACIONAMENTO,860098.18
5,FLAG_AR_CONDICIONADO,807485.15
6,FLAG_AREA_PREFERENCIAL,680126.45
7,FLAG_QUARTO_HOSPEDE,588123.01
8,FLAG_PORAO,495449.27
9,FLAG_CENTRO,324976.86


In [39]:
#Estimated intercept
tuned_model.intercept_ #2022610.616775285

2022610.616775285
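
A linear model such as lasso predicts the intercept plus the sum of each coefficient times its (transformed) feature value. The sketch below shows that arithmetic for three features only, using the coefficients and intercept printed above and the rounded scaled values displayed earlier for training row 229. Because the remaining features are left out, the result is a partial sum for illustration, not the model's prediction.

```python
# Coefficients and intercept copied from the outputs above; only three of the
# features are used here, so the result is a partial sum, not a real prediction.
intercept = 2022610.62
coefs = {"AREA": 3332346.19, "BANHEIROS": 2373115.98, "ANDARES": 1815510.06}

# Min-max scaled values of training row 229 as displayed (rounded) earlier.
scaled_row = {"AREA": 0.55, "BANHEIROS": 0.33, "ANDARES": 0.33}

contributions = {name: coefs[name] * scaled_row[name] for name in coefs}
partial_sum = intercept + sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:<10} adds {value:>14,.2f}")
print(f"intercept plus these three terms: {partial_sum:,.2f}")
```

Since the features were scaled to the [0, 1] range, each coefficient can be read as the price change between the smallest and largest value of that feature in the training set, holding the others fixed.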

In [42]:
#Save the prediction error plot
plot_model(tuned_model, plot = 'error', save=True)

'Prediction Error.png'

In [43]:
#Save the residuals plot
plot_model(tuned_model, plot = 'residuals', save=True)

'Residuals.png'

In [41]:
#Modeling pipeline
plot_model(tuned_model, plot = 'pipeline', save=True)

'Pipeline Plot.png'

## 2.6 - Finalizing the experiment

In [46]:
#Finalize the model
##The model is retrained on the entire dataset (train + test), using lasso regression
final_model = finalize_model(tuned_model)
final_model

In [47]:
#Save the model (the model's "ID card")
save_model(final_model, 'lasso_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['AREA', 'QUARTOS', 'BANHEIROS',
                                              'ANDARES',
                                              'VAGAS_ESTACIONAMENTO'],
                                     transformer=SimpleImputer(strategy='median'))),
                 ('categorical_imputer',
                  TransformerWrapper(include=['FLAG_CENTRO',
                                              'FLAG_QUARTO_HOSPEDE',
                                              'FLAG_PORAO', 'FLAG_AGUA_MORNA',
                                              'FLAG_AR_CONDICIONADO',
                                              'FLAG...
                                                                          'data_type': dtype('O'),
                                                                          'mapping': NAO             0
 PARCIALMENTE    1
 SIM             2
 NaN           