---
**Author**: Prof. Dino Magri

**Contact**: `professor.dinomagri@gmail.com`

**License for this notebook**:
<br>
<img align="left" width="80" src="https://licensebuttons.net/l/by/3.0/88x31.png" />

<br>
<br>

[Click here to learn more about the CC BY v4.0 license](https://creativecommons.org/licenses/by/4.0/legalcode.pt)


---

# Regression Exercises - Answer Key

Use the used-car price dataset (`car_price_train.csv`).

The goal is to build a model that can suggest a selling price for used cars at the Supimpa dealership.

The dataset contains the following information about each car:

- `carID` - identifier of the car
- `brand` - make of the car
- `model` - model of the car
- `year` - year
- `transmission` - transmission type
- `mileage` - mileage
- `fuelType` - fuel type
- `tax` - tax
- `mpg` - miles per gallon
- `engineSize` - engine size
- `target` - selling price of the car, in dollars



## Loading the data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/datasets/car_price_train.csv')
df.head(3)

Unnamed: 0,carID,brand,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,target
0,14266,vauxhall,GTC,2018,Manual,4646,Petrol,145.0,36.2,2.0,19998
1,15238,ford,Grand Tourneo Connect,2015,Manual,94362,Diesel,125.0,58.9,1.6,9490
2,13982,vw,Tiguan Allspace,2019,Automatic,6580,Diesel,145.0,39.2,2.0,31990


In [3]:
df.shape

(3968, 11)

In [4]:
df.describe()

Unnamed: 0,carID,year,mileage,tax,mpg,engineSize,target
count,3968.0,3968.0,3968.0,3968.0,3968.0,3968.0,3968.0
mean,15830.915827,2016.691784,25347.024446,152.483619,50.334955,2.125126,23309.007812
std,2200.751077,2.908401,24620.999153,83.738882,35.373234,0.788259,16288.551889
min,12002.0,1997.0,1.0,0.0,2.8,0.0,450.0
25%,13953.5,2016.0,5930.0,145.0,38.7,1.6,11994.0
50%,15807.5,2017.0,19550.0,145.0,47.1,2.0,18995.0
75%,17750.75,2019.0,37317.75,150.0,54.3,2.5,29998.0
max,19629.0,2020.0,259000.0,580.0,470.8,6.6,145000.0


## Defining the variables

In [5]:
df.columns

Index(['carID', 'brand', 'model', 'year', 'transmission', 'mileage',
       'fuelType', 'tax', 'mpg', 'engineSize', 'target'],
      dtype='object')

In [6]:
key_vars = ['carID']
num_vars = ['mileage', 'tax', 'mpg', 'engineSize']
cat_vars = ['brand', 'model', 'transmission', 'fuelType']

features = cat_vars + num_vars
target = 'target'

X = df[features]
y = df[target]

## Creating the training and test sets

Create the training and test sets with the `train_test_split` function, using the following parameters:

- `test_size=0.3`
- `random_state=42`
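As a rough sketch of how the split fraction maps to row counts, the snippet below reproduces the arithmetic in plain Python. It assumes that scikit-learn rounds the test-set size up (`ceil(test_size * n_samples)`) and gives the remaining rows to the training set. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives the remaining rows, as train_test_split does by default.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# The training file in this notebook has 3968 rows.
for fraction in (0.2, 0.3):
    n_train, n_test = split_sizes(3968, fraction)
    print(f"test_size={fraction}: {n_train} train rows, {n_test} test rows")
```

With 3968 rows, a fraction of 0.3 yields 2777 training rows and 1191 test rows, which are the shapes printed by the cells below. A fraction of 0.2 would yield 3174 and 794 rows.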

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [9]:
X_train.shape

(2777, 8)

In [10]:
X_test.shape

(1191, 8)

## Data pipeline

Create the data transformation pipelines for linear and nonlinear models.

Select the feature engineering techniques you consider most appropriate for the variables in the dataset.

To make this easier, create lists of tuples that define the steps for linear and nonlinear models.
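One reason to keep two step lists is that linear models are usually sensitive to feature scale, while tree-based models are not. The sketch below shows, in plain Python, the idea behind a scaling step: the mean and standard deviation are learned from the training values only and then reused on new data. `fit_scaler` and `apply_scaler` are hypothetical helpers, the mileage values are made up for illustration, and the population standard deviation is assumed (which is what `StandardScaler` uses).

```python
import statistics


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Standardize values with statistics learned elsewhere (the training set)."""
    return [(v - mean) / std for v in values]


# Made-up mileage values, for illustration only.
train_mileage = [10_000, 20_000, 30_000]
test_mileage = [20_000, 50_000]

mean, std = fit_scaler(train_mileage)
train_scaled = apply_scaler(train_mileage, mean, std)
test_scaled = apply_scaler(test_mileage, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test values is what a fitted pipeline does when `predict` or `transform` is called on new data, so nothing about the test set influences the transformation.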

In [11]:
!pip install feature-engine

Collecting feature-engine
  Downloading feature_engine-1.6.2-py2.py3-none-any.whl (328 kB)
Installing collected packages: feature-engine
Successfully installed feature-engine-1.6.2


In [12]:
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper
from feature_engine.encoding import OneHotEncoder

In [13]:
steps_modelos_lineares = [
    ('numeric_imputer', MeanMedianImputer(variables=num_vars, imputation_method='median')),
    ('numeric_scaler', SklearnTransformerWrapper(variables=num_vars, transformer=StandardScaler())),
    ('categoric_imputer', CategoricalImputer(variables=cat_vars, fill_value='NotAv')),
    ('one_hot_encoder', OneHotEncoder(variables=cat_vars)),
]

steps_modelos_nao_lineares = [
    ('numeric_imputer', MeanMedianImputer(variables=num_vars, imputation_method='median')),
    ('categoric_imputer', CategoricalImputer(variables=cat_vars, fill_value='NotAv')),
    ('one_hot_encoder', OneHotEncoder(variables=cat_vars)),

]

## Training several models

Create a comparison DataFrame with the following algorithms:

- LinearRegression
- SGDRegressor
- DecisionTreeRegressor
- RandomForestRegressor
- XGBRegressor
- LGBMRegressor
- CatBoostRegressor

Compute the following metrics for both the training and test sets:
- RMSE
- MAE
- MAPE
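For reference, the three metrics can be written out in a few lines of plain Python. This is a minimal sketch with made-up prices, not the scikit-learn implementation; the function names are hypothetical, and MAPE is returned as a fraction (0.10 means 10%), which is also the convention of `mean_absolute_percentage_error`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: large errors weigh more than small ones."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction.

    Assumes that no true value is zero.
    """
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices, for illustration only.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two suggests that a few predictions are far off.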

The DataFrame below is an example of the expected result.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>algoritmo</th>
      <th>base</th>
      <th>rmse</th>
      <th>mae</th>
      <th>mape</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>16</th>
      <td>catboost</td>
      <td>treino</td>
      <td>2423.2371</td>
      <td>1742.7349</td>
      <td>0.1021</td>
    </tr>
    <tr>
      <th>17</th>
      <td>catboost</td>
      <td>teste</td>
      <td>4007.3208</td>
      <td>2385.0553</td>
      <td>0.1318</td>
    </tr>
    <tr>
      <th>6</th>
      <td>decision_tree</td>
      <td>treino</td>
      <td>249.1378</td>
      <td>24.5117</td>
      <td>0.0009</td>
    </tr>
    <tr>
      <th>7</th>
      <td>decision_tree</td>
      <td>teste</td>
      <td>6608.6693</td>
      <td>3460.4962</td>
      <td>0.1770</td>
    </tr>
    <tr>
      <th>10</th>
      <td>gb</td>
      <td>treino</td>
      <td>4260.8895</td>
      <td>3067.5421</td>
      <td>0.1932</td>
    </tr>
    <tr>
      <th>11</th>
      <td>gb</td>
      <td>teste</td>
      <td>4899.7093</td>
      <td>3312.6936</td>
      <td>0.1914</td>
    </tr>
  </tbody>
</table>

In [14]:
!pip install lightgbm xgboost catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [15]:
random_state = 42

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor

modelos_lineares = [
    ('linear_regression', LinearRegression()),
    ('sgdr', SGDRegressor(random_state=random_state)),
]

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

modelos_nao_lineares = [
    ('decision_tree', DecisionTreeRegressor(random_state=random_state)),
    ('random_forest', RandomForestRegressor(random_state=random_state)),
    ('xgb', XGBRegressor(random_state=random_state)),
    ('lgbm', LGBMRegressor(random_state=random_state)),
    ('catboost', CatBoostRegressor(random_state=random_state, verbose=0))
]


In [16]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

df_resultados = pd.DataFrame(columns=['algoritmo', 'base', 'rmse', 'mae', 'mape'])
df_resultados

Unnamed: 0,algoritmo,base,rmse,mae,mape


In [17]:
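def treinar_modelo(model_name, model, steps, X_train, y_train, X_test, y_test, random_state):
    # Append the estimator to the preprocessing steps and fit everything on the training data.
    # `random_state` is kept in the signature for the callers but is not used here.
    pipeline = Pipeline(steps=steps + [(model_name, model)])
    pipeline.fit(X_train, y_train)

    # Predict once per split and reuse the predictions for all three metrics.
    pred_treino = pipeline.predict(X_train)
    pred_teste = pipeline.predict(X_test)

    # RMSE is computed as the square root of the MSE, which avoids the
    # `squared=False` argument deprecated in recent scikit-learn releases.
    res_treino = [
        mean_squared_error(y_train, pred_treino) ** 0.5,
        mean_absolute_error(y_train, pred_treino),
        mean_absolute_percentage_error(y_train, pred_treino),
    ]
    res_teste = [
        mean_squared_error(y_test, pred_teste) ** 0.5,
        mean_absolute_error(y_test, pred_teste),
        mean_absolute_percentage_error(y_test, pred_teste),
    ]

    return res_treino, res_teste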

In [18]:
pd.options.display.float_format = '{:.4f}'.format
import warnings
warnings.filterwarnings('ignore')

In [19]:
for model_name, model in modelos_lineares:
    print(f'Treinando {model_name} ...', end=' ')
    res_treino, res_teste = treinar_modelo(model_name, model, steps_modelos_lineares, X_train, y_train, X_test, y_test, random_state)
    df_resultados.loc[len(df_resultados)] = [model_name, 'treino'] + res_treino
    df_resultados.loc[len(df_resultados)] = [model_name, 'teste'] + res_teste
    print('OK')

for model_name, model in modelos_nao_lineares:
    print(f'Treinando {model_name} ...', end=' ')
    res_treino, res_teste = treinar_modelo(model_name, model, steps_modelos_nao_lineares, X_train, y_train, X_test, y_test, random_state)
    df_resultados.loc[len(df_resultados)] = [model_name, 'treino'] + res_treino
    df_resultados.loc[len(df_resultados)] = [model_name, 'teste'] + res_teste
    print('OK')

Treinando linear_regression ... OK
Treinando sgdr ... OK
Treinando decision_tree ... OK
Treinando random_forest ... OK
Treinando xgb ... OK
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000618 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 537
[LightGBM] [Info] Number of data points in the train set: 2777, number of used features: 69
[LightGBM] [Info] Start training from score 23035.427080
OK
Treinando catboost ... OK


In [20]:
df_resultados[df_resultados.base == 'treino'].sort_values(['mae'])

Unnamed: 0,algoritmo,base,rmse,mae,mape
4,decision_tree,treino,249.1378,24.5117,0.0009
6,random_forest,treino,1725.1742,942.6826,0.0521
8,xgb,treino,1779.8733,1237.4746,0.0695
12,catboost,treino,2423.2371,1742.7349,0.1021
10,lgbm,treino,3398.2517,2075.9294,0.1216
0,linear_regression,treino,5896.4355,3829.3563,0.2425
2,sgdr,treino,6130.7345,3979.4935,0.2574


In [21]:
df_resultados[df_resultados.base == 'teste'].sort_values(['mae'])

Unnamed: 0,algoritmo,base,rmse,mae,mape
13,catboost,teste,4007.3208,2385.0553,0.1318
9,xgb,teste,4551.29,2470.0958,0.1323
7,random_forest,teste,5104.3762,2584.0442,0.1371
11,lgbm,teste,5130.5404,2878.3697,0.1554
5,decision_tree,teste,6608.6693,3460.4962,0.177
3,sgdr,teste,6796.3102,4371.9905,0.2795
1,linear_regression,teste,47101003811686.73,1364816560251.962,78437733.3733


## Cross-validation

Select the two algorithms with the best MAE and apply K-fold cross-validation to them.

Use the `cross_val_score` function with the following parameters:

- `scoring='neg_mean_absolute_error'`
- `X=X_train`
- `y=y_train`
- `n_jobs=-1`

Use the `KFold` class with the following parameters:

- `n_splits=5`
- `shuffle=True`
- `random_state=42`
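To make the K-fold mechanics concrete, here is a minimal plain-Python sketch of a shuffled 5-fold partition. `kfold_indices` is a hypothetical helper; it assumes that the first `n_samples % n_splits` folds receive one extra row, and it shuffles with Python's `random` module, so the row order will not match scikit-learn's `KFold` even with the same seed. The 2777 rows correspond to the size of `X_train` in this notebook.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs for a shuffled K-fold split.

    Assumption: the first n_samples % n_splits folds get one extra row.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


folds = list(kfold_indices(n_samples=2777, n_splits=5, seed=42))
print([len(valid) for _, valid in folds])
```

Every row is used for validation exactly once and for training in the other four folds, which is why the averaged score is less dependent on a single lucky or unlucky split.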

Remember to compute the mean and standard deviation of the scores for each cross-validation run.
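Because `neg_mean_absolute_error` returns the MAE with its sign flipped (so that larger is better), the fold scores are negative. The sketch below shows how to turn them back into MAE values and summarize them. The fold scores are hypothetical numbers chosen for illustration, not results from this notebook, and the population standard deviation is used to mirror the default of NumPy's `.std()`.

```python
import statistics

# Hypothetical per-fold scores, as returned with scoring='neg_mean_absolute_error'.
neg_mae_scores = [-2500.0, -2700.0, -2600.0, -2800.0, -2400.0]

# Flip the sign to recover the MAE of each fold.
mae_per_fold = [-score for score in neg_mae_scores]

mean_mae = statistics.fmean(mae_per_fold)
std_mae = statistics.pstdev(mae_per_fold)  # population std, like NumPy's default .std()

print(f"MAE: {mean_mae:.1f} +/- {std_mae:.1f}")
```

A lower mean indicates a better model on average, while a lower standard deviation indicates results that are more stable across folds.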


In [22]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score


kf = KFold(n_splits=5, shuffle=True, random_state=42)

rf_pipeline = Pipeline(steps=steps_modelos_nao_lineares + [('rf', RandomForestRegressor(random_state=42))])
cv_results_rf = cross_val_score(estimator=rf_pipeline, scoring='neg_mean_absolute_error', X=X_train, y=y_train, cv=kf, n_jobs=-1)
print('Mean CV RF', cv_results_rf.mean())
print('Std CV RF', cv_results_rf.std())

print()

catboost_pipeline = Pipeline(steps=steps_modelos_nao_lineares + [('catboost', CatBoostRegressor(random_state=42))])
cv_results_catboost = cross_val_score(estimator=catboost_pipeline, scoring='neg_mean_absolute_error', X=X_train, y=y_train, cv=kf, n_jobs=-1)
print('Mean CV Catboost', cv_results_catboost.mean())
print('Std CV Catboost', cv_results_catboost.std())

Mean CV RF -2679.6206701902693
Std CV RF 162.3727624533136

Mean CV Catboost -2500.9422279431105
Std CV Catboost 114.2279106986258


## Hyperparameter optimization

For the best model found during cross-validation, run a search over a grid of hyperparameters, each with several candidate values.

The grid must cover at least 3 hyperparameters, with 3 different values for each one.

Use `GridSearchCV` with the following parameters:
- `scoring='neg_mean_absolute_error'`
- `cv=3`
- `n_jobs=-1`
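Before launching a grid search it helps to estimate how many models will be trained. The sketch below enumerates the candidates with `itertools.product`, using the grid values from the cell that follows; the variable names are illustrative only.

```python
from itertools import product

# Grid values taken from the search cell in this notebook.
param_grid = {
    'catboost__learning_rate': [0.01, 0.05, 0.1],
    'catboost__iterations': [100, 1000, 5000],
    'catboost__max_depth': [3, 7, 10],
}
cv = 3

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per cross-validation fold.
n_fits = len(candidates) * cv

print("candidates:", len(candidates))
print("fits during the search:", n_fits)
```

Three values for each of three hyperparameters give 27 candidates and 81 fits with `cv=3`, plus one final refit of the best candidate on the full training set when `refit` is left at its default. The cost grows multiplicatively with every value added to the grid.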

In [23]:
# This cell takes about 12 minutes to run
from sklearn.model_selection import GridSearchCV

catboost_pipeline = Pipeline(steps=steps_modelos_nao_lineares + [('catboost', CatBoostRegressor(random_state=42, verbose=0))])

parametros = {
    'catboost__learning_rate': [0.01, 0.05, 0.1],
    'catboost__iterations': [100, 1000, 5000],
    'catboost__max_depth': [3, 7, 10],
}

grid_search = GridSearchCV(catboost_pipeline, parametros, scoring='neg_mean_absolute_error', cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

In [24]:
grid_search.best_params_

{'catboost__iterations': 5000,
 'catboost__learning_rate': 0.05,
 'catboost__max_depth': 7}

In [25]:
best_model = grid_search.best_estimator_

In [26]:
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

mae_train = mean_absolute_error(y_train, y_train_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

print('Best model train:', mae_train)
print('Best model test:', mae_test)

Best model train: 638.072562702918
Best model test: 2189.7185208744127


## Feature Selection

Apply an automatic feature selection technique.

Remember to transform the training and test data with the pipeline of the best model.

After the automatic feature selection, train a new model using only the selected features. Use the hyperparameter values found by GridSearchCV for the algorithm that produced the best model.
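One common automatic technique is importance-based selection: fit a model, read its feature importances, and keep the features at or above a threshold. The sketch below shows only the thresholding step in plain Python. `select_by_importance` is a hypothetical helper, the importance values are made up, and `model_ Hypothetical` is an invented column name used to show a feature being dropped.

```python
def select_by_importance(names, importances, threshold):
    """Keep the features whose importance is at least the threshold."""
    return [name for name, imp in zip(names, importances) if imp >= threshold]


# Made-up importances, for illustration only.
names = ['mileage', 'mpg', 'engineSize', 'model_ KA', 'model_ Hypothetical']
importances = [30.0, 25.0, 40.0, 4.95, 0.05]

selected = select_by_importance(names, importances, threshold=0.1)
print(selected)
```

The right threshold depends on the scale of the importances reported by the model, so it is worth checking how many features survive before training the reduced model.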

In [32]:
len(best_model[-1].feature_importances_)

110

In [34]:
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(best_model[-1], threshold=0.1)
sfm.fit(best_model[:-1].transform(X_train), y_train)


variaveis_selecionadas = list(best_model[:-1].transform(X_train).columns[sfm.get_support()])
variaveis_selecionadas

['mileage',
 'tax',
 'mpg',
 'engineSize',
 'brand_skoda',
 'brand_hyundi',
 'brand_vw',
 'brand_ford',
 'brand_audi',
 'brand_merc',
 'brand_bmw',
 'model_ Santa Fe',
 'model_ I800',
 'model_ CLS Class',
 'model_ Caddy Maxi Life',
 'model_ V Class',
 'model_ Q8',
 'model_ KA',
 'model_ R8',
 'model_ 6 Series',
 'model_ S Class',
 'model_ X6',
 'model_ 7 Series',
 'model_ GLS Class',
 'model_ X4',
 'model_ Galaxy',
 'model_ M4',
 'model_ RS6',
 'model_ SLK',
 'model_ 8 Series',
 'model_ Caravelle',
 'model_ i8',
 'model_ X7',
 'model_ G Class',
 'model_ Mustang',
 'model_ California',
 'transmission_Manual',
 'transmission_Semi-Auto',
 'transmission_Automatic',
 'fuelType_Diesel',
 'fuelType_Petrol',
 'fuelType_Hybrid']

In [35]:
X_train_fs = best_model[:-1].transform(X_train)[variaveis_selecionadas]
X_test_fs = best_model[:-1].transform(X_test)[variaveis_selecionadas]

In [36]:
best_model[:-1].transform(X_train).shape

(2777, 110)

In [37]:
X_train_fs.shape

(2777, 42)

In [38]:
best_params = grid_search.best_params_
best_params

{'catboost__iterations': 5000,
 'catboost__learning_rate': 0.05,
 'catboost__max_depth': 7}

In [39]:
best_model_with_fs = CatBoostRegressor(
    random_state=42,
    learning_rate=best_params['catboost__learning_rate'],
    iterations=best_params['catboost__iterations'],
    max_depth=best_params['catboost__max_depth']
)
best_model_with_fs.fit(X_train_fs, y_train)

0:	learn: 15548.8153809	total: 2.05ms	remaining: 10.3s
1:	learn: 15026.3829126	total: 4.13ms	remaining: 10.3s
2:	learn: 14526.7517292	total: 6.27ms	remaining: 10.4s
3:	learn: 14078.3647595	total: 8.32ms	remaining: 10.4s
4:	learn: 13660.1188698	total: 10.2ms	remaining: 10.2s
5:	learn: 13232.0562416	total: 12.2ms	remaining: 10.1s
6:	learn: 12863.1049218	total: 14.1ms	remaining: 10.1s
7:	learn: 12489.3648507	total: 16.1ms	remaining: 10.1s
8:	learn: 12135.4367413	total: 18.1ms	remaining: 10s
9:	learn: 11792.1060235	total: 20.1ms	remaining: 10s
10:	learn: 11473.8822787	total: 22.1ms	remaining: 10s
11:	learn: 11176.1741769	total: 24.1ms	remaining: 10s
12:	learn: 10878.9691710	total: 26.1ms	remaining: 10s
13:	learn: 10609.7294177	total: 28ms	remaining: 9.96s
14:	learn: 10359.5125218	total: 30.3ms	remaining: 10.1s
15:	learn: 10124.9910203	total: 32.2ms	remaining: 10s
16:	learn: 9898.8662744	total: 34.3ms	remaining: 10s
17:	learn: 9693.3229144	total: 36.2ms	remaining: 10s
18:	learn: 9465.261641

<catboost.core.CatBoostRegressor at 0x789cf3e48070>

In [40]:
y_train_pred = best_model_with_fs.predict(X_train_fs)
y_test_pred = best_model_with_fs.predict(X_test_fs)

mae_train = mean_absolute_error(y_train, y_train_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

print('Best model with FS train:', mae_train)
print('Best model with FS test:', mae_test)

Best model with FS train: 618.6473043346813
Best model with FS test: 2348.531405889027


## Evaluate the best model on the production dataset

Evaluate the best model on the production dataset.

Remember to take the pipelines created earlier into account: every variable that existed during training must also be created and used when the model makes predictions.
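A simple guard before predicting is to check that the production data contains every feature used in training. The sketch below does this with plain lists; `missing_features` is a hypothetical helper, and the column names are the ones listed for the datasets in this notebook.

```python
def missing_features(expected, available):
    """Return the expected feature names that are absent from `available`."""
    available = set(available)
    return [name for name in expected if name not in available]


# Features used for training (categorical followed by numerical).
features = ['brand', 'model', 'transmission', 'fuelType',
            'mileage', 'tax', 'mpg', 'engineSize']

# Columns of the production file.
prod_columns = ['carID', 'brand', 'model', 'year', 'transmission', 'mileage',
                'fuelType', 'tax', 'mpg', 'engineSize', 'target']

missing = missing_features(features, prod_columns)
if missing:
    raise ValueError(f"Production data is missing features: {missing}")
print("All training features are present.")
```

With a pandas DataFrame, the same check can be run by passing `df_prod.columns` as the second argument. Because the fitted pipeline already contains the imputation and encoding steps, passing the raw feature columns to `predict` is enough once this check passes.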

In [41]:
df_prod = pd.read_csv('/content/drive/MyDrive/datasets/car_price_prod.csv')
df_prod.shape

(992, 11)

In [42]:
df_prod

Unnamed: 0,carID,brand,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,target
0,13209,ford,Edge,2017,Automatic,30255,Diesel,165.0000,48.7000,2.0000,20500
1,15965,bmw,X4,2015,Automatic,95408,Diesel,145.0000,54.3000,2.0000,15199
2,14659,bmw,6 Series,2016,Automatic,32449,Diesel,160.0000,51.4000,3.0000,18450
3,12756,merc,S Class,2012,Automatic,72700,Diesel,235.0000,41.5000,3.0000,12995
4,13289,merc,V Class,2018,Automatic,15232,Diesel,145.0000,48.7000,2.1000,19750
...,...,...,...,...,...,...,...,...,...,...,...
987,17108,merc,X-CLASS,2018,Automatic,9500,Diesel,260.0000,31.4000,3.0000,36790
988,14859,merc,CLS Class,2017,Semi-Auto,27354,Diesel,160.0000,49.6000,3.0000,23998
989,12676,bmw,X6,2016,Semi-Auto,31585,Diesel,200.0000,47.1000,3.0000,28950
990,16387,toyota,Verso,2017,Semi-Auto,16084,Petrol,160.0000,43.5000,1.8000,15498


In [43]:
X_prod = df_prod[features]
y_prod = df_prod[target]

In [44]:
y_prod_pred = best_model.predict(X_prod)
mae_prod = mean_absolute_error(y_prod, y_prod_pred)
mape_prod = mean_absolute_percentage_error(y_prod, y_prod_pred)

print('Best model MAE prod:', mae_prod)
print('Best model MAPE prod:', mape_prod)

Best model MAPE prod: 0.0708727288772974
