# Determining Car Prices

The used-car sales service «Не бит, не крашен» is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. Historical data are available: technical specifications, trim levels, and car prices. The task is to build a model that predicts the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

Data description
  
Features
  
- DateCrawled: date the listing was downloaded from the database
- VehicleType: body type
- RegistrationYear: year the car was registered
- Gearbox: gearbox type
- Power: engine power (hp)
- Model: car model
- Kilometer: mileage (km)
- RegistrationMonth: month the car was registered
- FuelType: fuel type
- Brand: car make
- Repaired: whether the car has been repaired
- DateCreated: date the listing was created
- NumberOfPictures: number of photos of the car
- PostalCode: postal code of the listing owner (user)
- LastSeen: date of the user's last activity
  
Target
  
- Price: price (euros)


In [1]:
!pip install scikit-learn==1.1.3



## Loading libraries

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from catboost import Pool, CatBoostRegressor, cv
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
warnings.filterwarnings('ignore')

## Data preparation

In [3]:
data = pd.read_csv('/datasets/autos.csv')
display(data.info())
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

None

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [4]:
data.drop_duplicates(inplace=True)

In [5]:
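# CamelCase -> snake_case; regex=True is required in pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default
data.columns = data.columns.str.replace(r"([A-Z])", r" \1", regex=True).str.lower().str.replace(' ', '_').str[1:]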
data.columns

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'kilometer', 'registration_month', 'fuel_type',
       'brand', 'repaired', 'date_created', 'number_of_pictures',
       'postal_code', 'last_seen'],
      dtype='object')
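
The renaming step above can also be written with the standard `re` module. The following is a minimal sketch, not the notebook's own method; the helper name `to_snake_case` is hypothetical.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a CamelCase name such as 'RegistrationYear' to 'registration_year'."""
    # Insert an underscore before every capital letter that is not at the
    # start of the string, then lower-case the result.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# A few of the original column names from the dataset
columns = ["DateCrawled", "Price", "RegistrationYear", "NumberOfPictures"]
snake_columns = [to_snake_case(c) for c in columns]
print(snake_columns)
```

With a DataFrame, the same helper could be applied as `data.columns = [to_snake_case(c) for c in data.columns]`.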

In [6]:
data.isna().mean()

date_crawled          0.000000
price                 0.000000
vehicle_type          0.105795
registration_year     0.000000
gearbox               0.055968
power                 0.000000
model                 0.055607
kilometer             0.000000
registration_month    0.000000
fuel_type             0.092828
brand                 0.000000
repaired              0.200793
date_created          0.000000
number_of_pictures    0.000000
postal_code           0.000000
last_seen             0.000000
dtype: float64

In [7]:
def show_rows(data_frame):
    for column in data_frame.columns:
        print('Unique values in column', column)
        print(data_frame[column].unique())

In [8]:
show_rows(data)

Unique values in column date_crawled
['2016-03-24 11:52:17' '2016-03-24 10:58:45' '2016-03-14 12:52:21' ...
 '2016-03-21 09:50:58' '2016-03-14 17:48:27' '2016-03-19 18:57:12']
Unique values in column price
[  480 18300  9800 ... 12395 18429 10985]
Unique values in column vehicle_type
[nan 'coupe' 'suv' 'small' 'sedan' 'convertible' 'bus' 'wagon' 'other']
Unique values in column registration_year
[1993 2011 2004 2001 2008 1995 1980 2014 1998 2005 1910 2016 2007 2009
 2002 2018 1997 1990 2017 1981 2003 1994 1991 1984 2006 1999 2012 2010
 2000 1992 2013 1996 1985 1989 2015 1982 1976 1983 1973 1111 1969 1971
 1987 1986 1988 1970 1965 1945 1925 1974 1979 1955 1978 1972 1968 1977
 1961 1960 1966 1975 1963 1964 5000 1954 1958 1967 1959 9999 1956 3200
 1000 1941 8888 1500 2200 4100 1962 1929 1957 1940 3000 2066 1949 2019
 1937 1951 1800 1953 1234 8000 5300 9000 2900 6000 5900 5911 1933 1400
 1950 4000 1948 1952 1200 8500 1932 1255 3700 3800 4800 1942 7000 1935
 1936 6500 1923 2

The registration_year column contains invalid registration years (for example 1000, 5000, 9999).
The registration_month column contains 0, which means the registration month is missing.
The power column contains invalid power values (for example 0).
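
To see how many rows such problems affect, boolean masks can be counted. This is a small sketch on made-up rows; the bounds 1900 and 2016 are assumptions chosen for illustration, and the column names follow the renamed dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    "registration_year": [1993, 1111, 2011, 9999, 2004],
    "registration_month": [0, 5, 8, 0, 7],
    "power": [0, 190, 163, 75, 20000],
})

# Assumed plausibility bounds for the registration year
bad_year = ~sample["registration_year"].between(1900, 2016)
# Month 0 is treated as "month not specified"
missing_month = sample["registration_month"] == 0
# Zero horsepower cannot be a real value
zero_power = sample["power"] == 0

summary = pd.Series({
    "bad_year": bad_year.sum(),
    "missing_month": missing_month.sum(),
    "zero_power": zero_power.sum(),
})
print(summary)
```

On the real data the same masks would be built from `data` instead of `sample`.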

Let us identify the features that matter for the model.

The following columns are treated as significant:

vehicle_type. The body type affects the price and determines how the car can be used.  
gearbox. An automatic gearbox raises the price because its design is more complex.  
power. Engine power in horsepower directly affects the price.  
kilometer. Mileage is inversely related to price: the higher the mileage, the lower the price.  
fuel_type. The fuel type determines the design of the internal combustion engine, which also affects the price.  
brand. The manufacturer can affect the price (premium brands cost more).  
repaired. Cars that have been repaired are less reliable and therefore cheaper.  
registration_year. The registration year shows how long the car has been on the road. The older the car, the lower the price.  
model. Together with the body type and the brand, the model can affect popularity among buyers.  
  
The remaining columns are not needed for the model because they are assumed not to affect the price.

In [9]:
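# .copy() makes an independent frame, so later assignments
# do not trigger SettingWithCopyWarning
filtred_data = data[['vehicle_type', 
                     'gearbox', 
                     'power',
                     'kilometer',
                     'fuel_type',
                     'brand',
                     'repaired',
                     'registration_year',
                     'model',
                     'price']].copy()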
filtred_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354365 entries, 0 to 354368
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   vehicle_type       316875 non-null  object
 1   gearbox            334532 non-null  object
 2   power              354365 non-null  int64 
 3   kilometer          354365 non-null  int64 
 4   fuel_type          321470 non-null  object
 5   brand              354365 non-null  object
 6   repaired           283211 non-null  object
 7   registration_year  354365 non-null  int64 
 8   model              334660 non-null  object
 9   price              354365 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 29.7+ MB


- VehicleType. Replace all NaN with 'other'.
- Gearbox. Replace all NaN with 'manual'.
- FuelType. Replace all NaN with 'other'.
- Repaired. Replace all NaN with 'yes'.
- Model. Replace all NaN with 'unknown'.
  
- Power. Get rid of outliers by clipping values to a fixed range.
- RegistrationYear. Get rid of outliers by clipping values to a fixed range.

In [10]:
filtred_data["vehicle_type"] = filtred_data["vehicle_type"].fillna("other")
filtred_data["gearbox"] = filtred_data["gearbox"].fillna("manual")
filtred_data["fuel_type"] = filtred_data["fuel_type"].fillna("other")
filtred_data["repaired"] = filtred_data["repaired"].fillna("yes")
filtred_data["model"] = filtred_data["model"].fillna("unknown")

In [11]:
filtred_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354365 entries, 0 to 354368
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   vehicle_type       354365 non-null  object
 1   gearbox            354365 non-null  object
 2   power              354365 non-null  int64 
 3   kilometer          354365 non-null  int64 
 4   fuel_type          354365 non-null  object
 5   brand              354365 non-null  object
 6   repaired           354365 non-null  object
 7   registration_year  354365 non-null  int64 
 8   model              354365 non-null  object
 9   price              354365 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 29.7+ MB


In [12]:
filtred_data.describe()

Unnamed: 0,power,kilometer,registration_year,price
count,354365.0,354365.0,354365.0,354365.0
mean,110.093816,128211.363989,2004.234481,4416.67983
std,189.85133,37905.083858,90.228466,4514.176349
min,0.0,5000.0,1000.0,0.0
25%,69.0,125000.0,1999.0,1050.0
50%,105.0,150000.0,2003.0,2700.0
75%,143.0,150000.0,2008.0,6400.0
max,20000.0,150000.0,9999.0,20000.0


In [13]:
# RegistrationYear 
def Balance_RegistrationYear(value):
    if value > 2015:
        return 2015
    elif value < 1900:
        return 1900
    else:
        return value

filtred_data["registration_year"] = filtred_data["registration_year"].apply(Balance_RegistrationYear)

In [14]:
# Power
def Balance_Power(value):
    if value > 3500:
        return 3500
    elif value < 80:
        return 80
    else:
        return value

filtred_data["power"] = filtred_data["power"].apply(Balance_Power)
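
The two `Balance_*` helpers clamp each value to a fixed range. A vectorized equivalent is `Series.clip`; the sketch below checks on a few made-up years that both approaches agree. The lower-case function name is only for this illustration.

```python
import pandas as pd

def balance_registration_year(value):
    # Same logic as Balance_RegistrationYear in the notebook
    if value > 2015:
        return 2015
    elif value < 1900:
        return 1900
    else:
        return value

# Made-up values covering both out-of-range cases and the boundaries
years = pd.Series([1000, 1899, 1900, 1993, 2015, 2016, 9999])

via_apply = years.apply(balance_registration_year)
via_clip = years.clip(lower=1900, upper=2015)

print(via_apply.equals(via_clip))
```

`clip` avoids a Python-level call per row, which can matter on a column with several hundred thousand values.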

In [15]:
filtred_data.describe()

Unnamed: 0,power,kilometer,registration_year,price
count,354365.0,354365.0,354365.0,354365.0
mean,121.263485,128211.363989,2002.949326,4416.67983
std,82.474071,37905.083858,7.461354,4514.176349
min,80.0,5000.0,1900.0,0.0
25%,80.0,125000.0,1999.0,1050.0
50%,105.0,150000.0,2003.0,2700.0
75%,143.0,150000.0,2008.0,6400.0
max,3500.0,150000.0,2015.0,20000.0


In [16]:
filtred_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354365 entries, 0 to 354368
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   vehicle_type       354365 non-null  object
 1   gearbox            354365 non-null  object
 2   power              354365 non-null  int64 
 3   kilometer          354365 non-null  int64 
 4   fuel_type          354365 non-null  object
 5   brand              354365 non-null  object
 6   repaired           354365 non-null  object
 7   registration_year  354365 non-null  int64 
 8   model              354365 non-null  object
 9   price              354365 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 29.7+ MB


Preprocessing conclusions:
  
Missing values have been filled and outliers clipped. Before training, it remains to check the target column.

In [17]:
filtred_data.price.value_counts()

0        10772
500       5670
1500      5394
1000      4648
1200      4594
         ...  
13180        1
10879        1
2683         1
634          1
8188         1
Name: price, Length: 3731, dtype: int64

In [18]:
filtred_data = filtred_data.query('price > 0')

In [19]:
filtred_data.price.value_counts()

500      5670
1500     5394
1000     4648
1200     4594
2500     4438
         ... 
5240        1
13180       1
10879       1
2683        1
8188        1
Name: price, Length: 3730, dtype: int64

In [20]:
filtred_data.shape

(343593, 10)

## Обучение моделей

In [21]:
# features and target:
features_orig = filtred_data.drop('price', axis=1)
target = filtred_data.price

In [22]:
# train/test split:
features_train, features_test, target_train, target_test = train_test_split(features_orig,
                                                                            target, 
                                                                            test_size=.25,
                                                                            random_state=12345)

In [23]:
# categorical features for OHE
ohe_features = features_train.select_dtypes(include='object').columns.to_list()
print(ohe_features)

['vehicle_type', 'gearbox', 'fuel_type', 'brand', 'repaired', 'model']


In [24]:
# numeric features
num_features = features_train.select_dtypes(exclude='object').columns.to_list()
num_features

['power', 'kilometer', 'registration_year']

In [25]:
# one-hot encode the categorical features
encoder_ohe = OneHotEncoder(drop='first', handle_unknown='ignore')

encoder_ohe.fit(features_train[ohe_features])

# transform returns a sparse matrix, which takes less memory
sparse_matrix = encoder_ohe.transform(features_train[ohe_features])

# names of the OHE features
ohe_columns = encoder_ohe.get_feature_names_out().tolist()

# drop the unencoded categorical features (the original columns)
# features_train = features_train.drop(ohe_features, axis=1)

# create the scaler
scaler = StandardScaler()

# fit it on the numeric features of the training set and transform that same set
X_train_scaled = scaler.fit_transform(features_train[num_features])

# assemble everything into a DataFrame (toarray() converts the sparse matrix to a dense one)
features_train_arr = np.column_stack((sparse_matrix.toarray(), X_train_scaled))
features_train_ohe = pd.DataFrame(features_train_arr, columns = ohe_columns + num_features)

In [26]:
# inspect the result
features_train_ohe.head()

Unnamed: 0,vehicle_type_convertible,vehicle_type_coupe,vehicle_type_other,vehicle_type_sedan,vehicle_type_small,vehicle_type_suv,vehicle_type_wagon,gearbox_manual,fuel_type_electric,fuel_type_gasoline,...,model_x_type,model_xc_reihe,model_yaris,model_yeti,model_ypsilon,model_z_reihe,model_zafira,power,kilometer,registration_year
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.373943,0.577495,0.816029
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.506446,0.577495,-0.694671
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.506446,0.577495,-1.381353
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.506446,-1.555553,0.816029
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.506446,0.577495,-0.832007


In [28]:
sparse_matrix_test = encoder_ohe.transform(features_test[ohe_features])
X_test_scaled = scaler.transform(features_test[num_features])
features_test_arr = np.column_stack((sparse_matrix_test.toarray(), X_test_scaled))
features_test_ohe = pd.DataFrame(features_test_arr, columns = ohe_columns + num_features)

In [29]:
features_test_ohe.head()

Unnamed: 0,vehicle_type_convertible,vehicle_type_coupe,vehicle_type_other,vehicle_type_sedan,vehicle_type_small,vehicle_type_suv,vehicle_type_wagon,gearbox_manual,fuel_type_electric,fuel_type_gasoline,...,model_x_type,model_xc_reihe,model_yaris,model_yeti,model_ypsilon,model_z_reihe,model_zafira,power,kilometer,registration_year
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.755234,-0.75566,1.090701
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.506446,0.577495,0.129347
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.203158,0.577495,-0.419998
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.506446,0.577495,-0.145326
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.227421,0.577495,0.129347


In [30]:
del features_test_arr, X_test_scaled, sparse_matrix
del features_train_arr, X_train_scaled, sparse_matrix_test

In [30]:
# ordinal encoding (OE):
# if the test data contain a category that was not seen during training,
# it is encoded as -1 and the code does not fail
encoder = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value = -1)

# note: the encoder is fitted on all columns of features_train,
# so the numeric columns are ordinal-encoded as well
features_train_oe = pd.DataFrame(encoder.fit_transform(features_train),
                                 columns=features_train.columns,
                                 index=features_train.index)

features_test_oe = pd.DataFrame(encoder.transform(features_test),
                                columns=features_test.columns,
                                index=features_test.index)

target_train_oe = target_train.copy()

target_test_oe = target_test.copy()
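
The cell above passes the whole feature frame to `OrdinalEncoder`, so numeric columns receive ordinal codes too, and a numeric value that appears only in the test set becomes -1. A possible variant, shown on made-up rows, encodes only the categorical columns and leaves the numeric ones untouched. The frames and the `cat_cols` list are hypothetical, and results obtained this way would differ from the numbers reported in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (illustrative values only)
train = pd.DataFrame({
    "brand": ["audi", "skoda", "audi"],
    "gearbox": ["manual", "auto", "manual"],
    "power": [190, 69, 163],
})
test = pd.DataFrame({
    "brand": ["jeep", "skoda"],      # 'jeep' does not occur in train
    "gearbox": ["auto", "manual"],
    "power": [75, 300],              # numeric values not seen in train
})

cat_cols = ["brand", "gearbox"]

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Encode the categorical columns only; numeric columns are copied as they are
train_oe = train.copy()
test_oe = test.copy()
train_oe[cat_cols] = encoder.fit_transform(train[cat_cols])
test_oe[cat_cols] = encoder.transform(test[cat_cols])

print(test_oe)
```

Here the unseen brand is mapped to -1, while the unseen power values keep their original magnitudes.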

In [31]:
# sample sizes:
for i in [features_train_ohe, features_test_ohe, target_train, target_test]:
    print(i.shape)   
    
print()

for i in [features_train_oe, features_test_oe, target_train_oe, target_test_oe]:
    print(i.shape)

print()
    
for i in [features_train, features_test, target_train, target_test]:
    print(i.shape)

(257694, 307)
(85899, 307)
(257694,)
(85899,)

(257694, 9)
(85899, 9)
(257694,)
(85899,)

(257694, 9)
(85899, 9)
(257694,)
(85899,)


### LinearRegression

#### OHE

In [32]:
%%time

model_lr = LinearRegression()
model_lr.fit(features_train_ohe, target_train)

CPU times: user 18 s, sys: 6.37 s, total: 24.4 s
Wall time: 24.6 s


In [33]:
%%time

target_predict = model_lr.predict(features_train_ohe)

CPU times: user 158 ms, sys: 34 ms, total: 192 ms
Wall time: 184 ms


In [34]:
#rmse_lr_ohe = mean_squared_error(target_train_ohe, target_predict) ** .5
#rmse_lr_ohe

model_lr = LinearRegression()

# the 'neg_root_mean_squared_error' scorer returns negated RMSE values (higher is better),
# so the mean over folds is multiplied by -1:
rmse_lr_ohe = (cross_val_score(model_lr,
                               features_train_ohe,
                               target_train,
                               cv=3,
                               scoring='neg_root_mean_squared_error')).mean() * -1 
rmse_lr_ohe

2847.3450941260653

### CatBoostRegressor

#### OHE

In [40]:
%%time

model_cbr = CatBoostRegressor() 
parameters = [{'learning_rate':[.1, .5, .8], 'random_state':[12345], 'verbose':[False]}]

gscv = GridSearchCV(model_cbr, parameters, scoring='neg_mean_squared_error')
gscv.fit(features_train_ohe, target_train)

print(gscv.best_params_)

mts = gscv.cv_results_['mean_test_score']

print(gscv.refit_time_)

gscv_rsme_cbr_ohe = (max(mts) * -1) ** .5
gscv_rsme_cbr_ohe

{'learning_rate': 0.5, 'random_state': 12345, 'verbose': False}
32.43224501609802
CPU times: user 6min 39s, sys: 6.83 s, total: 6min 46s
Wall time: 7min 1s


1654.9661276684203
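
The cross-validated scores in this notebook are aggregated in two different ways: the linear-regression cell averages per-fold RMSE values, while the grid-search cells take the square root of the averaged MSE. The two numbers are close but not identical. The sketch below illustrates this with hypothetical per-fold MSE values that are not taken from the notebook.

```python
import math

# Hypothetical per-fold MSE values (for illustration only)
fold_mse = [2_500_000.0, 2_890_000.0, 3_240_000.0]

# scoring='neg_root_mean_squared_error': RMSE of every fold, then the mean
mean_of_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

# scoring='neg_mean_squared_error' followed by ** .5:
# mean MSE over the folds, then a single square root
rmse_of_mean = math.sqrt(sum(fold_mse) / len(fold_mse))

print(round(mean_of_rmse, 2), round(rmse_of_mean, 2))
```

The square root of the mean is never smaller than the mean of the square roots, so scores computed the second way are slightly pessimistic compared with the first.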

In [41]:
%%time

model_cbr = CatBoostRegressor(learning_rate=.5, random_state=12345, verbose=False)  # best parameters from the grid search
model_cbr.fit(features_train_ohe, target_train)

CPU times: user 30.5 s, sys: 269 ms, total: 30.8 s
Wall time: 34.7 s


<catboost.core.CatBoostRegressor at 0x7f26911494c0>

In [42]:
%%time

target_predict = model_cbr.predict(features_test_ohe)

CPU times: user 343 ms, sys: 36.1 ms, total: 380 ms
Wall time: 378 ms


In [43]:
rmse_cbr_ohe = mean_squared_error(target_test, target_predict) ** .5
rmse_cbr_ohe

1621.1637741955888

#### OE

In [44]:
%%time

model_cbr = CatBoostRegressor() 
parameters = [{'learning_rate':[.1, .5, .8], 'random_state':[12345], 'verbose':[False]}]

gscv = GridSearchCV(model_cbr, parameters, scoring='neg_mean_squared_error')
gscv.fit(features_train_oe, target_train_oe)

print(gscv.best_params_)

mts = gscv.cv_results_['mean_test_score']
    
gscv_rsme_cbr_oe = (max(mts) * -1) ** .5
gscv_rsme_cbr_oe

{'learning_rate': 0.5, 'random_state': 12345, 'verbose': False}
CPU times: user 6min 28s, sys: 1.77 s, total: 6min 30s
Wall time: 6min 47s


1673.080196493667

In [45]:
%%time

model_cbr = CatBoostRegressor(learning_rate=.5, random_state=12345, verbose=False) 
model_cbr.fit(features_train_oe, target_train_oe)

CPU times: user 30.1 s, sys: 136 ms, total: 30.2 s
Wall time: 31.1 s


<catboost.core.CatBoostRegressor at 0x7f2691149160>

In [46]:
%%time

target_predict = model_cbr.predict(features_test_oe)

CPU times: user 80.5 ms, sys: 7.74 ms, total: 88.3 ms
Wall time: 86.8 ms


### LightGBMRegressor

#### OHE

In [31]:
%%capture

model_lgbmr = LGBMRegressor(num_threads = 10) 
parameters = [{'num_leaves':[25, 50, 100, 200], 'learning_rate':[.1, .3, .5], 'random_state':[12345]}]

clf = GridSearchCV(model_lgbmr, parameters, scoring='neg_mean_squared_error')
clf.fit(features_train_ohe, target_train)

print(clf.best_params_)
print()

mts = clf.cv_results_['mean_test_score']

rsme_lgbmr = (max(mts) * -1) ** .5
rsme_lgbmr

In [32]:
rsme_lgbmr

1649.8337567663073

In [33]:
%%time

model_lgbmr = LGBMRegressor(learning_rate=.3, num_leaves=100, random_state=12345)
model_lgbmr.fit(features_train_ohe, target_train)

CPU times: user 9.66 s, sys: 8.04 s, total: 17.7 s
Wall time: 17.7 s


In [34]:
%%time

target_predict = model_lgbmr.predict(features_test_ohe)

CPU times: user 1.13 s, sys: 0 ns, total: 1.13 s
Wall time: 1.11 s


#### OE

In [None]:
%%time

model_lgbmr = LGBMRegressor() 
parameters = [{'num_leaves':[25, 50, 100, 200], 'learning_rate':[.1, .3, .5], 'random_state':[12345]}]


clf = GridSearchCV(model_lgbmr, parameters, scoring='neg_mean_squared_error')
clf.fit(features_train_oe, target_train_oe)

print(clf.best_params_)
print()

mts = clf.cv_results_['mean_test_score']

rsme_lgbmr_1 = (max(mts) * -1) ** .5
rsme_lgbmr_1

In [None]:
%%time

model = LGBMRegressor(learning_rate=.3, num_leaves=100, random_state=12345)
model.fit(features_train_oe, target_train_oe)

In [None]:
%%time

target_predict = model.predict(features_test_oe)

## Model analysis, comparison of results, conclusions

In [None]:
# table of RMSE, training time and prediction time for each model:
index = ['LinearRegression with OHE',
         'CatBoostRegressor with OHE',
         'CatBoostRegressor with OE',
         'LGBMRegressor with OHE',
         'LGBMRegressor with OE']

data = {'RMSE':[rmse_lr_ohe,
                gscv_rsme_cbr_ohe,
                gscv_rsme_cbr_oe,
                rsme_lgbmr,
                rsme_lgbmr_1],
        
        'Training time, s':[25.4,
                            33.6,
                            33.2,
                            10.8,
                            5.85],
        
        'Prediction time, s':[0.192,
                              1.16,
                              0.107,
                              2.73,
                              0.788]
       }

kpi_data = pd.DataFrame(data=data, index=index)

# weighted rating (lower is better):
kpi_data['Rating'] = (kpi_data['RMSE'] * .34
                      + kpi_data['Training time, s'] * .33
                      + kpi_data['Prediction time, s'] * .33)

kpi_data.sort_values(by = 'Rating', ascending=True)
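
The rating above adds RMSE values (in the thousands) to times measured in seconds, so the RMSE term dominates the sum regardless of the weights. One way to make the weights meaningful is to min-max scale each metric first. The sketch below uses RMSE values rounded from the cross-validation outputs shown earlier and the times from the table above; LightGBM with OE is left out because its RMSE is not shown. The column names are hypothetical.

```python
import pandas as pd

kpi = pd.DataFrame(
    {
        "rmse": [2847.35, 1654.97, 1673.08, 1649.83],
        "fit_time_s": [25.4, 33.6, 33.2, 10.8],
        "predict_time_s": [0.192, 1.16, 0.107, 2.73],
    },
    index=["LinearRegression OHE", "CatBoost OHE", "CatBoost OE", "LightGBM OHE"],
)

# Min-max scale every metric to [0, 1] so that no column dominates by magnitude
scaled = (kpi - kpi.min()) / (kpi.max() - kpi.min())

# Same weights as in the cell above; lower rating is better
weights = pd.Series({"rmse": 0.34, "fit_time_s": 0.33, "predict_time_s": 0.33})
kpi["rating"] = scaled.mul(weights, axis=1).sum(axis=1)

print(kpi.sort_values("rating"))
```

The ranking from such a scheme depends on the set of models compared, because the minimum and maximum of each column change when a model is added or removed.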

In [35]:
%%time

target_predict = model_lgbmr.predict(features_test_ohe)

CPU times: user 1.12 s, sys: 0 ns, total: 1.12 s
Wall time: 1.1 s


In [36]:
rsme_lgbmr_ohe = mean_squared_error(target_test, target_predict) ** .5
rsme_lgbmr_ohe

1638.9826145432692

On the test data, CatBoostRegressor with OHE reached RMSE ≈ 1621 (output of cell In [43]), and LGBMRegressor with OHE reached RMSE ≈ 1639 (output of cell In [36]).

By RMSE, the best model is therefore CatBoostRegressor with OHE, although its training time was not the lowest. Taking all three metrics together, we still consider this model the best choice.