El servicio de venta de autos usados Rusty Bargain está desarrollando una aplicación para atraer nuevos clientes. Gracias a esa app, puedes averiguar rápidamente el valor de mercado de tu coche. Tienes acceso al historial: especificaciones técnicas, versiones de equipamiento y precios. Tienes que crear un modelo que determine el valor de mercado.
A Rusty Bargain le interesa:
- la calidad de la predicción;
- la velocidad de la predicción;
- el tiempo requerido para el entrenamiento

## Preparación de datos

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from time import process_time
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from time import process_time

In [2]:
# Cargar datos

df = pd.read_csv('/datasets/car_data.csv')

In [3]:
df.sample(5)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
293056,05/04/2016 21:36,1100,wagon,2000,manual,90,focus,150000,6,gasoline,ford,no,05/04/2016 00:00,0,58099,05/04/2016 21:36
62453,07/03/2016 09:53,0,small,1994,manual,70,cooper,5000,6,petrol,mini,no,07/03/2016 00:00,0,54329,08/03/2016 16:20
137427,01/04/2016 13:44,1800,convertible,2002,manual,90,2_reihe,100000,9,petrol,peugeot,no,01/04/2016 00:00,0,9456,01/04/2016 13:44
34623,05/04/2016 11:52,790,small,1997,manual,75,other,150000,3,petrol,toyota,no,05/04/2016 00:00,0,83257,05/04/2016 12:42
276965,23/03/2016 15:57,1500,bus,1991,manual,0,transporter,150000,11,,volkswagen,yes,23/03/2016 00:00,0,31867,26/03/2016 09:45


In [4]:
# Valores ausentes

df.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

In [5]:
# Filas duplicadas

df.duplicated().sum()

262

In [6]:
# Tipo de datos de cada columna

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [7]:
# Reemplazar valores ausentes con 'unknown'

df.fillna('Unknown', inplace=True)

In [8]:
df.isna().sum()

DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64

In [9]:
# Remover filas duplicadas

df.drop_duplicates(inplace=True)

In [10]:
df.duplicated().sum()

0

In [11]:
# Remover columnas irrelevantes

irrelevant_fields = ['DateCrawled', 'RegistrationMonth', 'DateCreated', 'NumberOfPictures', 'LastSeen']

df.drop(irrelevant_fields, axis=1, inplace=True)

In [12]:
df.columns

Index(['Price', 'VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Model',
       'Mileage', 'FuelType', 'Brand', 'NotRepaired', 'PostalCode'],
      dtype='object')

In [13]:
# Convertir campos categóricos a campos numéricos

categorical_fields = ['VehicleType','Gearbox','Model','FuelType','Brand','NotRepaired']

df_ohe = pd.get_dummies(df, columns=categorical_fields, drop_first=True)

In [14]:
df_ohe.sample(5)

Unnamed: 0,Price,RegistrationYear,Power,Mileage,PostalCode,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,...,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_no,NotRepaired_yes
41030,1850,1996,0,150000,31303,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
350333,8950,2006,209,100000,86637,0,0,0,0,1,...,0,0,0,0,0,0,0,1,1,0
304922,14599,2012,170,150000,58332,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
269609,3399,2006,101,150000,23879,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
8819,900,1998,60,150000,7985,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [15]:
# Separar data

X = df_ohe.drop('Price', axis=1)
y = df_ohe['Price']

In [16]:
X_train, X_test_valid, y_train, y_test_valid = train_test_split(X, y, test_size=0.40, random_state=12345)
X_test, X_valid, y_test, y_valid = train_test_split(X_test_valid, y_test_valid, test_size=0.50, random_state=12345)

In [17]:
print(X_train.shape)
print(y_train.shape)

(212464, 312)
(212464,)


In [18]:
print(X_test.shape)
print(y_test.shape)

(70821, 312)
(70821,)


In [19]:
print(X_valid.shape)
print(y_valid.shape)

(70822, 312)
(70822,)


In [20]:
# Escalamiento

features_to_scale = ['RegistrationYear', 'Power', 'Mileage']

transformer = StandardScaler().fit(X_train[features_to_scale].to_numpy())

In [21]:
X_train_scaled = X_train.copy()
X_train_scaled.loc[:,features_to_scale] = transformer.transform(X_train[features_to_scale].to_numpy())

In [22]:
X_valid_scaled = X_valid.copy()
X_valid_scaled.loc[:,features_to_scale] = transformer.transform(X_valid[features_to_scale].to_numpy())

In [23]:
X_test_scaled = X_test.copy()
X_test_scaled.loc[:,features_to_scale] = transformer.transform(X_test[features_to_scale].to_numpy())

## Entrenamiento del modelo 

In [24]:
%%time
# Regresion lineal

model = LinearRegression()
model.fit(X_train, y_train)

predicted_values = model.predict(X_valid)

RMSE = np.sqrt(mean_squared_error(y_valid, predicted_values))

print('Error caudarado medio de la regresión linear es ', round(RMSE,2))
print('Tiempo para la regresión lineal:')

Error caudarado medio de la regresión linear es  3162.68
Tiempo para la regresión lineal:
CPU times: user 9.39 s, sys: 3.97 s, total: 13.4 s
Wall time: 7.05 s


In [25]:
#Funcion para evaluar diferentes modelos con distintos hiperparámetros

def model_eval(model, model_name):
    
    parameters = {'n_estimators':[5, 10, 20],
                 'max_depth': [5, 10, 20]}

    model = model

    model_grid = GridSearchCV(model, param_grid=parameters, cv=3, scoring='neg_root_mean_squared_error', verbose=0)

    model_grid.fit(X_train, y_train)

    best_parameters = model_grid.best_params_

    best_score = model_grid.best_score_

    best_grid = model_grid.best_estimator_

    predictes_values = best_grid.predict(X_valid)

    RMSE = np.sqrt(mean_squared_error(y_valid, predicted_values))

    print('Mejores parametros para', model_name, ":", best_parameters)
    print('RMSE con datos de entrenamiento', round(best_score,2))
    print('RMSE con datos de validación', round(RMSE,2))
    print('Tiempo de entrenamiento para', model_name, ':')

%%time

# RandomForest

model = RandomForestRegressor()
model_name = 'RandomForestRegressor'

model_eval(model, model_name)

In [26]:
%%time

# LightGBM

model = LGBMRegressor(random_state=12345)
model_name = 'LightGBMRegressor'

model_eval(model, model_name)

Mejores parametros para LightGBMRegressor : {'max_depth': 20, 'n_estimators': 20}
RMSE con datos de entrenamiento -2217.07
RMSE con datos de validación 3162.68
Tiempo de entrenamiento para LightGBMRegressor :
CPU times: user 55.1 s, sys: 4.07 s, total: 59.2 s
Wall time: 36.6 s


In [27]:
%%time

# CatBoost

model = CatBoostRegressor(random_state=12345)
model_name = 'CatBootRegressor'

model_eval(model, model_name)

Learning rate set to 0.5
0:	learn: 3385.5863119	total: 52ms	remaining: 208ms
1:	learn: 2856.0259970	total: 56.9ms	remaining: 85.4ms
2:	learn: 2562.1711323	total: 61.5ms	remaining: 41ms
3:	learn: 2433.2242960	total: 65.8ms	remaining: 16.5ms
4:	learn: 2349.1954520	total: 70.1ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 3372.0967743	total: 5.22ms	remaining: 20.9ms
1:	learn: 2850.0901721	total: 55.5ms	remaining: 83.2ms
2:	learn: 2598.0591684	total: 60ms	remaining: 40ms
3:	learn: 2436.4944153	total: 64.2ms	remaining: 16.1ms
4:	learn: 2353.6532274	total: 68.3ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 3382.2333646	total: 5.31ms	remaining: 21.2ms
1:	learn: 2853.6776925	total: 10ms	remaining: 15ms
2:	learn: 2563.1154861	total: 14.5ms	remaining: 9.64ms
3:	learn: 2439.5803922	total: 19.1ms	remaining: 4.78ms
4:	learn: 2358.3116835	total: 63.6ms	remaining: 0us
Learning rate set to 0.5
0:	learn: 3385.5863119	total: 5.08ms	remaining: 45.7ms
1:	learn: 2856.0259970	total: 10.3ms	remai

Traceback (most recent call last):
  File "/.venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/.venv/lib/python3.9/site-packages/catboost/core.py", line 5299, in fit
    return self._fit(X, y, cat_features, None, None, None, sample_weight, None, None, None, None, baseline,
  File "/.venv/lib/python3.9/site-packages/catboost/core.py", line 2021, in _fit
    train_params = self._prepare_train_params(
  File "/.venv/lib/python3.9/site-packages/catboost/core.py", line 1953, in _prepare_train_params
    _check_train_params(params)
  File "_catboost.pyx", line 5839, in _catboost._check_train_params
  File "_catboost.pyx", line 5858, in _catboost._check_train_params
_catboost.CatBoostError: catboost/private/libs/options/oblivious_tree_options.cpp:122: Maximum tree depth is 16

Traceback (most recent call last):
  File "/.venv/lib/python3.9/site-packages/sklearn/model_selection/_valid

Learning rate set to 0.5
0:	learn: 3338.8891297	total: 6.77ms	remaining: 27.1ms
1:	learn: 2785.9118420	total: 13.5ms	remaining: 20.2ms
2:	learn: 2567.9217148	total: 19.9ms	remaining: 13.3ms
3:	learn: 2433.5332605	total: 54ms	remaining: 13.5ms
4:	learn: 2347.7255090	total: 59.5ms	remaining: 0us
Mejores parametros para CatBootRegressor : {'max_depth': 5, 'n_estimators': 5}
RMSE con datos de entrenamiento -2355.67
RMSE con datos de validación 3162.68
Tiempo de entrenamiento para CatBootRegressor :
CPU times: user 41 s, sys: 1.78 s, total: 42.8 s
Wall time: 24.5 s


%% time

# XGBRegressor

model = XGBRegressor(random_state=12345)
model_name = 'XGBRegressor'

model_eval(model, model_name)

## Análisis del modelo

Model	RMSE (Predicting Targets on Validation Set)	Time Required for Training + Tuning (s)	Best Hyperparameters
LinearRegression	3162.68	15s (solo entrenamiento)	N/A
RandomForest	3162.68	625s	max_depth=20, n_estimators=20
LightGBM	3162.68	93s	max_depth=20, n_estimators=20
CatBoost	3162.68	66	max_depth=15, n_estimators=15
XGBoost	1783.36	2973	max_depth=10, n_estimators=20

In [28]:
# Regresion lineal

import time

start_cell = time.time()

start_train = time.time()
model = LinearRegression()
model.fit(X_train, y_train)
end_train = time.time()

elapsed_train = end_train - start_train

start_predict = time.time()
predicted_values = model.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, predicted_values))
print('RMSE del modelo lineal', round(RMSE,2))
end_predict = time.time()

elapsed_predict = end_predict - start_predict

print('Tiempo de entrenamiento', round(elapsed_train,2))
print('Tiempo de predicción', round(elapsed_predict,2))

end_cell = time.time()
print('Tiempo de ejecución de la celda:', round(end_cell - start_cell,2), 'seg')

RMSE del modelo lineal 3178.07
Tiempo de entrenamiento 6.64
Tiempo de predicción 0.15
Tiempo de ejecución de la celda: 6.79 seg


In [29]:
# Random Forest

import time

start_cell = time.time()

start_train = time.time()
model = RandomForestRegressor(random_state=12345, n_estimators=20, max_depth=20)
model.fit(X_train, y_train)
end_train = time.time()

elapsed_train = end_train - start_train

start_predict = time.time()
predicted_values = model.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, predicted_values))
print('RMSE del modelo lineal', round(RMSE,2))
end_predict = time.time()

elapsed_predict = end_predict - start_predict

print('Tiempo de entrenamiento', round(elapsed_train,2))
print('Tiempo de predicción', round(elapsed_predict,2))

end_cell = time.time()
print('Tiempo de ejecución de la celda:', round(end_cell - start_cell,2), 'seg')

RMSE del modelo lineal 1799.24
Tiempo de entrenamiento 51.4
Tiempo de predicción 0.35
Tiempo de ejecución de la celda: 51.75 seg


In [30]:
# CatBoost

import time

start_cell = time.time()

start_train = time.time()
model = CatBoostRegressor(random_state=12345, n_estimators=5, max_depth=5, verbose=0)
model.fit(X_train, y_train)
end_train = time.time()

elapsed_train = end_train - start_train

start_predict = time.time()
predicted_values = model.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test, predicted_values))
print('RMSE del modelo lineal', round(RMSE,2))
end_predict = time.time()

elapsed_predict = end_predict - start_predict

print('Tiempo de entrenamiento', round(elapsed_train,2))
print('Tiempo de predicción', round(elapsed_predict,2))

end_cell = time.time()
print('Tiempo de ejecución de la celda:', round(end_cell - start_cell,2), 'seg')

RMSE del modelo lineal 2355.07
Tiempo de entrenamiento 1.03
Tiempo de predicción 0.01
Tiempo de ejecución de la celda: 1.05 seg


# Conclusion

Basandonos en los resultados, Rusty Bargain deberían utilizar un modelo RandomForest para sus predicciones dado a que es el que tiene menor RMSE. Sin embargo, si la preferencia es la velocidad de ejecución, el modelo a elegir debería ser CatBoost.

# Lista de control

Escribe 'x' para verificar. Luego presiona Shift+Enter

- [x]  Jupyter Notebook está abierto
- [ ]  El código no tiene errores- [ ]  Las celdas con el código han sido colocadas en orden de ejecución- [ ]  Los datos han sido descargados y preparados- [ ]  Los modelos han sido entrenados
- [ ]  Se realizó el análisis de velocidad y calidad de los modelos