# Determining Car Prices

We need to build a model that estimates the market value of a car.\
Steps:\
1) Load the data.\
2) Explore the data. Fill in missing values and handle anomalies in the columns.\
3) Prepare the samples for model training.\
4) Train several models.\
5) Analyze the quality of the models.\
6) Choose the best model and check its quality on the test set.

## 1. Loading the data

Import all the libraries needed for the work.

In [2]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import re


Load the data and take a first look at it.

In [3]:
data = pd.read_csv('autos.csv')
display(data.head(5))

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


# 2. Exploring the data

In [4]:
# Convert the column names to snake_case
data.columns = [''.join(['_'+ c.lower() if c.isupper() else c for c in col]).lstrip('_') for col in data.columns]
data.columns

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'kilometer', 'registration_month', 'fuel_type',
       'brand', 'repaired', 'date_created', 'number_of_pictures',
       'postal_code', 'last_seen'],
      dtype='object')
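The same renaming can also be written with a regular expression. The sketch below is only an illustration, not the notebook's own code: the helper name `to_snake_case` and the sample column list are assumptions made for the example.

```python
import re


def to_snake_case(name: str) -> str:
    """Insert '_' before every uppercase letter except the first character, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()


# A few of the original column names, used here as sample input
columns = ['DateCrawled', 'VehicleType', 'NumberOfPictures', 'Price']
snake_columns = [to_snake_case(c) for c in columns]
print(snake_columns)
```

Like the list comprehension above, this treats every uppercase letter as the start of a new word, so it would split acronyms such as `ID` into separate letters.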

The target variable in our data is price (the price of the car). As we can see, there are missing values in vehicle_type (the body type), model (the car model), and repaired (whether the car has been damaged).
In addition, the power feature (engine power) contains the suspicious value 0: a car cannot have zero power.
Let us examine the data more closely for missing values, duplicates, and outliers (anomalies).

In [5]:
# Missing values per feature, in percent of the non-missing count
# (the share among all rows would be data.isna().mean() * 100)

data.isna().sum() / data.count() * 100

date_crawled           0.000000
price                  0.000000
vehicle_type          11.831014
registration_year      0.000000
gearbox                5.928510
power                  0.000000
model                  5.887995
kilometer              0.000000
registration_month     0.000000
fuel_type             10.232554
brand                  0.000000
repaired              25.123669
date_created           0.000000
number_of_pictures     0.000000
postal_code            0.000000
last_seen              0.000000
dtype: float64
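The expression `isna().sum() / count()` relates missing values to non-missing ones, which is slightly larger than the share of missing values among all rows. A small sketch of the difference on a toy frame (the frame `toy` and its values are made up for the illustration):

```python
import numpy as np
import pandas as pd

# Toy data: one of four vehicle_type values is missing
toy = pd.DataFrame({
    'vehicle_type': ['sedan', np.nan, 'suv', 'small'],
    'price': [480, 18300, 9800, 1500],
})

# Missing values relative to non-missing values: 1 / 3
ratio_pct = toy.isna().sum() / toy.count() * 100

# Missing values relative to all rows: 1 / 4
share_pct = toy.isna().mean() * 100

print(ratio_pct)
print(share_pct)
```

Both views rank the features in the same order, so the choice only matters when the percentages themselves are reported.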

Several features have many missing values (vehicle_type, gearbox, model, fuel_type, repaired). To avoid losing information, we fill them in rather than drop the rows. Let us look at each one in more detail.

In [6]:
# Look at the vehicle_type and model features

display(data['vehicle_type'].value_counts())
display(data['model'].value_counts())

vehicle_type
sedan          91457
small          79831
wagon          65166
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: count, dtype: int64

model
golf                  29232
other                 24421
3er                   19761
polo                  13066
corsa                 12570
                      ...  
i3                        8
serie_3                   4
rangerover                4
range_rover_evoque        2
serie_1                   2
Name: count, Length: 250, dtype: int64

As we can see, the counts of the top three values do not differ much, so filling the gaps with the most frequent value would be a poor choice. For both features there is no obvious way to infer the missing value, so we fill the gaps with the placeholder value "unknown".

In [7]:
data[['model', 'vehicle_type']] = data[['model', 'vehicle_type']].fillna('unknown')

The missing values in the remaining features most likely stand for a value so ordinary that the user did not bother to enter it.\
For gearbox, this is a manual transmission (manual): most of the cars have a manual gearbox.\
For fuel_type, this is petrol: most cars are originally produced with petrol engines.\
For repaired, this is no: a missing value most likely means that the car has not been in an accident.

In [8]:
data['gearbox'] = data['gearbox'].fillna('manual')
data['fuel_type'] = data['fuel_type'].fillna('petrol')
data['repaired'] = data['repaired'].fillna('no')

In [9]:
print(data['registration_year'].apply(['max','min']), end='\n\n')
print(data['power'].apply(['max','min']))
print(data['price'].apply(['max','min']))

max    9999
min    1000
Name: registration_year, dtype: int64

max    20000
min        0
Name: power, dtype: int64
max    20000
min        0
Name: price, dtype: int64


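As we can see, the power and registration_year columns contain anomalies; let us get rid of them. We keep registration years strictly between 1960 and 2017 (the listings were crawled in 2016, so later years cannot be valid) and power strictly between 50 and 2028 (the highest power of any car at the time of writing).\
We also bound the price from below: the price is in euros, and cars cheaper than about 300 euros (roughly 30,000 after conversion) can be treated as scrap metal. The filter below uses a slightly stricter threshold of 350.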

In [10]:
# Remove the outliers
data = data.query('registration_year < 2017 and registration_year > 1960')
data = data.query('power > 50 and power < 2028')
data = data.query('price > 350')
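Before dropping rows it can be useful to see how many rows each condition rejects on its own. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions); the thresholds are the ones used in the filters above.

```python
import pandas as pd

# Toy rows, including the kinds of anomalies seen in the real data
toy = pd.DataFrame({
    'registration_year': [1993, 2011, 9999, 1000, 2004],
    'power': [0, 190, 163, 75, 20000],
    'price': [480, 18300, 9800, 0, 3600],
})

# One boolean mask per filter: True means the row passes
filters = {
    'registration_year': (toy['registration_year'] > 1960) & (toy['registration_year'] < 2017),
    'power': (toy['power'] > 50) & (toy['power'] < 2028),
    'price': toy['price'] > 350,
}

# Report how many rows each filter would reject on its own
rejected = {name: int((~mask).sum()) for name, mask in filters.items()}
print(rejected)

# Keep only the rows that pass every filter
keep = filters['registration_year'] & filters['power'] & filters['price']
clean = toy[keep]
print(clean)
```

Counting rejections per filter shows which threshold is responsible for most of the lost rows, which helps when deciding whether a bound is too strict.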

In [11]:
# Check how much data is left
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 276742 entries, 1 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        276742 non-null  object
 1   price               276742 non-null  int64 
 2   vehicle_type        276742 non-null  object
 3   registration_year   276742 non-null  int64 
 4   gearbox             276742 non-null  object
 5   power               276742 non-null  int64 
 6   model               276742 non-null  object
 7   kilometer           276742 non-null  int64 
 8   registration_month  276742 non-null  int64 
 9   fuel_type           276742 non-null  object
 10  brand               276742 non-null  object
 11  repaired            276742 non-null  object
 12  date_created        276742 non-null  object
 13  number_of_pictures  276742 non-null  int64 
 14  postal_code         276742 non-null  int64 
 15  last_seen           276742 non-null  object
dtypes: int6

Let us drop the uninformative features date_crawled, date_created, last_seen, postal_code, number_of_pictures, and registration_month; the price most likely does not depend on them.\
Split the data into features and target.

In [12]:
features = data.drop(['price','date_crawled','date_created','last_seen','postal_code','number_of_pictures','registration_month'],axis=1)
target = data['price']

# 3. Preparing the samples for training

Before training, we split the data into training, validation, and test sets in a 60/20/20 ratio.

In [13]:
features_train, features_valid,target_train,target_valid = train_test_split(features,target,test_size=0.4,random_state=12345)
features_valid,features_test,target_valid,target_test = train_test_split(features_valid,target_valid,test_size = 0.5,random_state=12345)
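The two-step split first sets aside 40% of the rows and then halves that part, which gives roughly 60/20/20. A quick check of the resulting sizes on synthetic arrays (the arrays `X` and `y` are placeholders, not the car data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 1000 rows
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Step 1: 60% train, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)

# Step 2: split the held-out part in half into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

sizes = (len(X_train), len(X_valid), len(X_test))
print(sizes)
```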

In [14]:
# Look at the data
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 166045 entries, 273345 to 278467
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   vehicle_type       166045 non-null  object
 1   registration_year  166045 non-null  int64 
 2   gearbox            166045 non-null  object
 3   power              166045 non-null  int64 
 4   model              166045 non-null  object
 5   kilometer          166045 non-null  int64 
 6   fuel_type          166045 non-null  object
 7   brand              166045 non-null  object
 8   repaired           166045 non-null  object
dtypes: int64(3), object(6)
memory usage: 12.7+ MB


Since we are going to use linear regression, we need to encode the categorical features and scale the numeric ones.

In [15]:
# Encode the categorical features with one-hot encoding

features_train_ohe = pd.get_dummies(features_train, drop_first=True).astype('int')
features_valid_ohe = pd.get_dummies(features_valid, drop_first=True).astype('int')
features_test_ohe = pd.get_dummies(features_test, drop_first=True).astype('int')

Now we need to check that the features in each sample match.

In [16]:
# Look at the number of features
features_train_ohe.shape, features_valid_ohe.shape, features_test_ohe.shape

((166045, 306), (55348, 302), (55349, 300))

As we can see, the sets of features differ between the samples, so let us remove the mismatched ones. We write a function that performs the check: if a feature is missing from the other sample, we delete it.

In [17]:
def same_features(features_one, features_second):
    # Drop, in place, every column of features_one that is absent from features_second
    second_columns = set(features_second.columns)
    for col in features_one.columns.to_list():
        if col not in second_columns:
            del features_one[col]

Apply the function to the samples.

In [18]:
features_list = [features_train_ohe, features_valid_ohe, features_test_ohe]
for el in features_list:
    for el_sec in features_list:
        same_features(el,el_sec)

Now let us check the features again.

In [19]:
features_train_ohe.shape, features_valid_ohe.shape, features_test_ohe.shape

((166045, 298), (55348, 298), (55349, 298))

As we can see, the number of features matches. We can move on to standardization.
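An alternative to deleting mismatched columns is to treat the training columns as the reference and reindex the other samples to them, so that no training feature is lost. This is a sketch under that assumption, with made-up frames `train` and `valid`; it is not what the notebook does.

```python
import pandas as pd

# Toy samples: 'jeep' appears only in validation, 'bmw' and 'skoda' only in training
train = pd.DataFrame({'brand': ['audi', 'bmw', 'skoda'], 'power': [190, 150, 69]})
valid = pd.DataFrame({'brand': ['audi', 'jeep'], 'power': [163, 75]})

train_ohe = pd.get_dummies(train, drop_first=True).astype('int')
valid_ohe = pd.get_dummies(valid, drop_first=True).astype('int')

# Use the training columns as the reference: columns missing from valid are
# added and filled with 0, columns unknown to train are dropped
valid_aligned = valid_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(train_ohe.columns.tolist())
print(valid_aligned)
```

One caveat applies to both approaches: with `drop_first=True` applied to each sample separately, the dropped category can differ between samples, so encoding all samples against one fixed category list is generally safer.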

In [20]:
scaler = StandardScaler()
numeric = ['registration_year', 'power', 'kilometer']

scaler.fit(features_train_ohe[numeric])

features_train_ohe[numeric] = scaler.transform(features_train_ohe[numeric])
features_valid_ohe[numeric] = scaler.transform(features_valid_ohe[numeric])
features_test_ohe[numeric] = scaler.transform(features_test_ohe[numeric])
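Fitting the scaler on the training sample only means that validation and test values are standardized with the training mean and standard deviation. The sketch below checks this by hand on made-up power values (`train_power` and `valid_power` are assumptions for the example).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_power = np.array([[75.0], [150.0], [225.0]])
valid_power = np.array([[150.0], [300.0]])

# Learn mean and standard deviation from the training values only
scaler = StandardScaler().fit(train_power)
valid_scaled = scaler.transform(valid_power)

# The same transformation written out manually with the training statistics
manual = (valid_power - train_power.mean(axis=0)) / train_power.std(axis=0)

print(valid_scaled.ravel())
```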

Encode the categorical features

In [21]:
# Encode the data with ordinal encoding (OE)
encoder = OrdinalEncoder()
cat_col = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'repaired']

features_train_oe = features_train.copy()
features_valid_oe = features_valid.copy()
features_test_oe = features_test.copy()

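# Note: fit_transform refits the encoder on every sample, so the same category
# can receive different codes in train, valid and test; fitting on the training
# sample only and calling transform on the others keeps the codes consistent
features_train_oe[cat_col] = encoder.fit_transform(features_train_oe[cat_col])
features_valid_oe[cat_col] = encoder.fit_transform(features_valid_oe[cat_col])
features_test_oe[cat_col] = encoder.fit_transform(features_test_oe[cat_col])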
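For reference, a minimal sketch of ordinal encoding where the category-to-code mapping is learned from the training sample only. The frames `train` and `valid` are made up, and mapping unseen categories to -1 is a choice made for this example rather than something the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'brand': ['audi', 'bmw', 'skoda']})
valid = pd.DataFrame({'brand': ['skoda', 'jeep']})

# Categories that were not seen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

train_codes = encoder.fit_transform(train[['brand']])   # learns audi=0, bmw=1, skoda=2
valid_codes = encoder.transform(valid[['brand']])       # reuses the same mapping

print(train_codes.ravel())
print(valid_codes.ravel())
```

With a shared mapping, a given brand or model always has the same code, which is what tree-based models rely on when they split on an encoded column.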

In [22]:
features_train_oe.head()

Unnamed: 0,vehicle_type,registration_year,gearbox,power,model,kilometer,fuel_type,brand,repaired
273345,8.0,2001,1.0,145,153.0,150000,6.0,10.0,0.0
126348,4.0,2005,0.0,150,94.0,150000,2.0,20.0,0.0
121408,4.0,1999,1.0,75,115.0,150000,6.0,38.0,0.0
229056,5.0,1999,1.0,75,102.0,150000,6.0,10.0,0.0
37663,0.0,2011,0.0,140,115.0,100000,2.0,38.0,0.0


The data are ready for training:

features_train_ohe, features_valid_ohe, features_test_ohe for linear regression

features_train_oe, features_valid_oe, features_test_oe for boosting models and decision trees

# 4. Training the models

## Linear regression

First, we train a linear regression without regularization and measure the training and prediction time.

In [23]:
# Linear regression

model = LinearRegression()

Measure the training time

In [24]:
%%time

# Training time

model.fit(features_train_ohe, target_train)

CPU times: total: 16.3 s
Wall time: 8.33 s


In [24]:
%%time

# Prediction time

pred = model.predict(features_valid_ohe)

CPU times: total: 484 ms
Wall time: 146 ms


In [25]:
print('Linear Regression RMSE:', mean_squared_error(target_valid, pred, squared=False))

Linear Regression RMSE: 2643.1365861302643
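RMSE is the square root of the mean squared error, so it can also be computed explicitly. This matters because, in recent scikit-learn releases, the `squared` argument of `mean_squared_error` is deprecated. A small sketch on made-up prices (`y_true` and `y_pred` are illustrative values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([480, 18300, 9800, 1500])
y_pred = np.array([500, 18000, 10000, 1400])

# RMSE = sqrt(mean((y_true - y_pred) ** 2))
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

# The same value computed directly with NumPy
rmse_manual = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse)
```

Because RMSE is expressed in the units of the target, the values in this notebook can be read as a typical error in euros.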


Train a model with L2 regularization.

In [26]:
# Ridge

ridge = Ridge()

In [27]:
%%time

# Training time

ridge.fit(features_train_ohe, target_train)

CPU times: total: 2.84 s
Wall time: 1.25 s


In [28]:
%%time

# Prediction time

predictions = ridge.predict(features_valid_ohe)

CPU times: total: 422 ms
Wall time: 152 ms


In [29]:
print('RMSE for Ridge: ',mean_squared_error(target_valid, predictions, squared=False))

RMSE for Ridge:  2642.931213478334


Train a model with L1 regularization.

In [30]:
# Lasso

lasso = Lasso()

In [31]:
%%time

# Model training time

lasso.fit(features_train_ohe, target_train)

CPU times: total: 25.3 s
Wall time: 7.53 s


In [32]:
%%time

# Prediction time

pred = lasso.predict(features_valid_ohe)

CPU times: total: 594 ms
Wall time: 142 ms


In [33]:
print('RMSE for Lasso: ', mean_squared_error(target_valid, pred, squared=False))

RMSE for Lasso:  2674.9577814358404


Thus, among the linear models Ridge performed best, with an RMSE of 2642.93.

## LightGBM

Train a LightGBM model. We select its hyperparameters with GridSearchCV.

In [90]:
%%time

# Model initialization

model = LGBMRegressor(random_state = 12345)
param_grid = {
    'n_estimators' : [100,500,1000],
    'learning_rate' : [0.1,0.01],
    'num_leaves' : [20,25,30]
}
grid_search = GridSearchCV(
    estimator = model,
    param_grid = param_grid,
    cv=5,
    scoring = 'neg_mean_squared_error',
    verbose = 1,
    n_jobs=-1
)

CPU times: total: 0 ns
Wall time: 0 ns


In [91]:
%%time

# Training

grid_search.fit(features_train_oe, target_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003932 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 628
[LightGBM] [Info] Number of data points in the train set: 166045, number of used features: 9
[LightGBM] [Info] Start training from score 5154.928013
CPU times: total: 18.8 s
Wall time: 5min 54s


In [94]:
# Model with the lowest cross-validation error
model = grid_search.best_estimator_

In [95]:
%%time
model.fit(features_train_oe,target_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005188 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 628
[LightGBM] [Info] Number of data points in the train set: 166045, number of used features: 9
[LightGBM] [Info] Start training from score 5154.928013
CPU times: total: 16.5 s
Wall time: 4.63 s


In [96]:
%%time

# Prediction time

predictions = model.predict(features_valid_oe)

CPU times: total: 4.17 s
Wall time: 1.04 s


In [97]:
print('LGBMRegressor RMSE :', mean_squared_error(target_valid,predictions,squared=False))

LGBMRegressor RMSE : 1669.8511630061125


Thus, the LightGBM model achieved an RMSE of 1669.85 on the validation set, with a training time of 4.63 s and a prediction time of 1.04 s.

## CatBoost

Next we use a CatBoost model. We select its hyperparameters with GridSearchCV.

In [129]:
# Hyperparameters

grid_params = {
    'iterations' : [500,1000,1500],
}

In [145]:
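# Initialize GridSearchCV for a CatBoost model
# (the cell that created the estimator is not shown; default settings are assumed here)

grid_search = GridSearchCV(
    estimator = CatBoostRegressor(),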
    param_grid = grid_params,
    cv=5,
    scoring = 'neg_mean_squared_error',
    verbose = 1,
    n_jobs=-1
)

In [146]:
# Fit the grid search

grid_search.fit(features_train_oe,target_train) 


In [141]:
# Take the best model found by the grid search

model = grid_search.best_estimator_

In [142]:
%%time

# Train the model

model.fit(features_train_oe, target_train)

Learning rate set to 0.066041
0:	learn: 4430.9579301	total: 15.3ms	remaining: 22.9s
1:	learn: 4245.8573824	total: 29.4ms	remaining: 22s
2:	learn: 4076.8284024	total: 43.8ms	remaining: 21.8s
3:	learn: 3915.7415444	total: 58.4ms	remaining: 21.8s
4:	learn: 3770.2019266	total: 74.2ms	remaining: 22.2s
5:	learn: 3641.0237100	total: 89.6ms	remaining: 22.3s
6:	learn: 3518.9500599	total: 105ms	remaining: 22.3s
7:	learn: 3402.4749910	total: 119ms	remaining: 22.1s
8:	learn: 3297.2322063	total: 133ms	remaining: 22s
9:	learn: 3199.7632082	total: 147ms	remaining: 21.9s
10:	learn: 3110.9382649	total: 161ms	remaining: 21.8s
11:	learn: 3032.1845191	total: 175ms	remaining: 21.7s
12:	learn: 2959.4147265	total: 192ms	remaining: 22s
13:	learn: 2889.3680344	total: 209ms	remaining: 22.1s
14:	learn: 2828.6451432	total: 226ms	remaining: 22.3s
15:	learn: 2769.6609368	total: 241ms	remaining: 22.4s
16:	learn: 2716.3941166	total: 255ms	remaining: 22.2s
17:	learn: 2667.2604357	total: 269ms	remaining: 22.1s
18:	lear

163:	learn: 1780.3200572	total: 2.54s	remaining: 20.7s
164:	learn: 1779.3495083	total: 2.58s	remaining: 20.9s
165:	learn: 1778.3425196	total: 2.61s	remaining: 21s
166:	learn: 1777.2016921	total: 2.63s	remaining: 21s
167:	learn: 1776.5186799	total: 2.65s	remaining: 21s
168:	learn: 1775.6632020	total: 2.66s	remaining: 21s
169:	learn: 1774.6330444	total: 2.68s	remaining: 21s
170:	learn: 1773.7176281	total: 2.69s	remaining: 21s
171:	learn: 1773.0182339	total: 2.71s	remaining: 20.9s
172:	learn: 1772.4108355	total: 2.73s	remaining: 20.9s
173:	learn: 1771.7974164	total: 2.75s	remaining: 20.9s
174:	learn: 1770.6041458	total: 2.77s	remaining: 20.9s
175:	learn: 1769.8430578	total: 2.78s	remaining: 20.9s
176:	learn: 1768.9305956	total: 2.8s	remaining: 20.9s
177:	learn: 1768.0759213	total: 2.82s	remaining: 20.9s
178:	learn: 1766.9815179	total: 2.83s	remaining: 20.9s
179:	learn: 1766.0428690	total: 2.85s	remaining: 20.9s
180:	learn: 1765.6945854	total: 2.87s	remaining: 20.9s
181:	learn: 1765.206644

317:	learn: 1690.7612548	total: 5.07s	remaining: 18.9s
318:	learn: 1690.3957887	total: 5.09s	remaining: 18.9s
319:	learn: 1689.9391130	total: 5.11s	remaining: 18.8s
320:	learn: 1689.6135245	total: 5.12s	remaining: 18.8s
321:	learn: 1689.2054369	total: 5.14s	remaining: 18.8s
322:	learn: 1688.7651796	total: 5.16s	remaining: 18.8s
323:	learn: 1688.6026717	total: 5.17s	remaining: 18.8s
324:	learn: 1688.2329584	total: 5.19s	remaining: 18.8s
325:	learn: 1687.8140836	total: 5.2s	remaining: 18.7s
326:	learn: 1687.4901201	total: 5.22s	remaining: 18.7s
327:	learn: 1687.0701195	total: 5.23s	remaining: 18.7s
328:	learn: 1686.5476712	total: 5.25s	remaining: 18.7s
329:	learn: 1686.2670655	total: 5.26s	remaining: 18.7s
330:	learn: 1685.7878588	total: 5.28s	remaining: 18.7s
331:	learn: 1685.4458159	total: 5.3s	remaining: 18.6s
332:	learn: 1684.9482170	total: 5.31s	remaining: 18.6s
333:	learn: 1684.6198032	total: 5.33s	remaining: 18.6s
334:	learn: 1684.1592731	total: 5.34s	remaining: 18.6s
335:	learn: 

469:	learn: 1642.6247701	total: 7.43s	remaining: 16.3s
470:	learn: 1642.3067287	total: 7.45s	remaining: 16.3s
471:	learn: 1642.0031917	total: 7.46s	remaining: 16.3s
472:	learn: 1641.9420508	total: 7.48s	remaining: 16.2s
473:	learn: 1641.7306563	total: 7.49s	remaining: 16.2s
474:	learn: 1641.3886913	total: 7.51s	remaining: 16.2s
475:	learn: 1641.0425902	total: 7.52s	remaining: 16.2s
476:	learn: 1640.7807154	total: 7.54s	remaining: 16.2s
477:	learn: 1640.4334689	total: 7.55s	remaining: 16.1s
478:	learn: 1640.1037298	total: 7.57s	remaining: 16.1s
479:	learn: 1639.7724031	total: 7.58s	remaining: 16.1s
480:	learn: 1639.5225963	total: 7.59s	remaining: 16.1s
481:	learn: 1639.3505098	total: 7.61s	remaining: 16.1s
482:	learn: 1639.1382026	total: 7.63s	remaining: 16.1s
483:	learn: 1638.8729797	total: 7.64s	remaining: 16s
484:	learn: 1638.4225928	total: 7.66s	remaining: 16s
485:	learn: 1638.0536278	total: 7.68s	remaining: 16s
486:	learn: 1637.8840677	total: 7.7s	remaining: 16s
487:	learn: 1637.62

630:	learn: 1607.6389704	total: 9.96s	remaining: 13.7s
631:	learn: 1607.3840739	total: 9.97s	remaining: 13.7s
632:	learn: 1607.2535811	total: 9.99s	remaining: 13.7s
633:	learn: 1606.9759831	total: 10s	remaining: 13.7s
634:	learn: 1606.7940944	total: 10s	remaining: 13.7s
635:	learn: 1606.5737296	total: 10s	remaining: 13.6s
636:	learn: 1606.4710477	total: 10.1s	remaining: 13.6s
637:	learn: 1606.3035036	total: 10.1s	remaining: 13.6s
638:	learn: 1606.2000504	total: 10.1s	remaining: 13.6s
639:	learn: 1606.0713470	total: 10.1s	remaining: 13.6s
640:	learn: 1605.8552223	total: 10.1s	remaining: 13.6s
641:	learn: 1605.6368799	total: 10.1s	remaining: 13.5s
642:	learn: 1605.4329708	total: 10.1s	remaining: 13.5s
643:	learn: 1605.2339579	total: 10.2s	remaining: 13.5s
644:	learn: 1605.0540974	total: 10.2s	remaining: 13.5s
645:	learn: 1604.9309387	total: 10.2s	remaining: 13.5s
646:	learn: 1604.7472276	total: 10.2s	remaining: 13.5s
647:	learn: 1604.6283744	total: 10.2s	remaining: 13.5s
648:	learn: 1604

789:	learn: 1584.4371351	total: 12.5s	remaining: 11.2s
790:	learn: 1584.3688273	total: 12.5s	remaining: 11.2s
791:	learn: 1584.1683546	total: 12.5s	remaining: 11.2s
792:	learn: 1583.9424826	total: 12.6s	remaining: 11.2s
793:	learn: 1583.8236880	total: 12.6s	remaining: 11.2s
794:	learn: 1583.6923009	total: 12.6s	remaining: 11.2s
795:	learn: 1583.6250605	total: 12.6s	remaining: 11.2s
796:	learn: 1583.4140426	total: 12.6s	remaining: 11.1s
797:	learn: 1583.3485273	total: 12.7s	remaining: 11.1s
798:	learn: 1583.2362487	total: 12.7s	remaining: 11.1s
799:	learn: 1583.1174565	total: 12.7s	remaining: 11.1s
800:	learn: 1583.0143115	total: 12.7s	remaining: 11.1s
801:	learn: 1582.9260333	total: 12.7s	remaining: 11.1s
802:	learn: 1582.8009504	total: 12.7s	remaining: 11s
803:	learn: 1582.7374429	total: 12.7s	remaining: 11s
804:	learn: 1582.6760966	total: 12.8s	remaining: 11s
805:	learn: 1582.4556288	total: 12.8s	remaining: 11s
806:	learn: 1582.3121101	total: 12.8s	remaining: 11s
807:	learn: 1582.147

942:	learn: 1564.9480416	total: 14.9s	remaining: 8.77s
943:	learn: 1564.8323403	total: 14.9s	remaining: 8.76s
944:	learn: 1564.6886596	total: 14.9s	remaining: 8.74s
945:	learn: 1564.5715836	total: 14.9s	remaining: 8.73s
946:	learn: 1564.5027503	total: 14.9s	remaining: 8.72s
947:	learn: 1564.2787586	total: 14.9s	remaining: 8.7s
948:	learn: 1564.0959989	total: 15s	remaining: 8.69s
949:	learn: 1563.9697949	total: 15s	remaining: 8.68s
950:	learn: 1563.9091093	total: 15s	remaining: 8.66s
951:	learn: 1563.8641150	total: 15s	remaining: 8.65s
952:	learn: 1563.7739836	total: 15s	remaining: 8.63s
953:	learn: 1563.6225385	total: 15.1s	remaining: 8.62s
954:	learn: 1563.5424321	total: 15.1s	remaining: 8.6s
955:	learn: 1563.4589352	total: 15.1s	remaining: 8.59s
956:	learn: 1563.3902784	total: 15.1s	remaining: 8.57s
957:	learn: 1563.1476204	total: 15.1s	remaining: 8.55s
958:	learn: 1563.0598190	total: 15.1s	remaining: 8.54s
959:	learn: 1562.8651312	total: 15.2s	remaining: 8.52s
960:	learn: 1562.73657

1103:	learn: 1547.9499472	total: 17.4s	remaining: 6.25s
1104:	learn: 1547.8492048	total: 17.4s	remaining: 6.24s
1105:	learn: 1547.7422777	total: 17.5s	remaining: 6.22s
1106:	learn: 1547.6786904	total: 17.5s	remaining: 6.21s
1107:	learn: 1547.5872425	total: 17.5s	remaining: 6.19s
1108:	learn: 1547.5505568	total: 17.5s	remaining: 6.17s
1109:	learn: 1547.4526865	total: 17.5s	remaining: 6.16s
1110:	learn: 1547.3945242	total: 17.5s	remaining: 6.14s
1111:	learn: 1547.2695520	total: 17.6s	remaining: 6.13s
1112:	learn: 1547.1258253	total: 17.6s	remaining: 6.11s
1113:	learn: 1547.0634333	total: 17.6s	remaining: 6.09s
1114:	learn: 1546.9787873	total: 17.6s	remaining: 6.08s
1115:	learn: 1546.9069532	total: 17.6s	remaining: 6.06s
1116:	learn: 1546.8336044	total: 17.6s	remaining: 6.05s
1117:	learn: 1546.7793229	total: 17.6s	remaining: 6.03s
1118:	learn: 1546.6936398	total: 17.7s	remaining: 6.01s
1119:	learn: 1546.5216229	total: 17.7s	remaining: 6s
1120:	learn: 1546.4456114	total: 17.7s	remaining: 5

1263:	learn: 1533.9936118	total: 20s	remaining: 3.73s
1264:	learn: 1533.8966061	total: 20s	remaining: 3.71s
1265:	learn: 1533.8467804	total: 20s	remaining: 3.69s
1266:	learn: 1533.7709012	total: 20s	remaining: 3.68s
1267:	learn: 1533.6691238	total: 20s	remaining: 3.66s
1268:	learn: 1533.5857232	total: 20s	remaining: 3.65s
1269:	learn: 1533.3970911	total: 20s	remaining: 3.63s
1270:	learn: 1533.3144319	total: 20.1s	remaining: 3.61s
1271:	learn: 1533.2289282	total: 20.1s	remaining: 3.6s
1272:	learn: 1533.1614672	total: 20.1s	remaining: 3.58s
1273:	learn: 1533.0876611	total: 20.1s	remaining: 3.57s
1274:	learn: 1533.0090123	total: 20.1s	remaining: 3.55s
1275:	learn: 1532.9054487	total: 20.1s	remaining: 3.53s
1276:	learn: 1532.8094021	total: 20.1s	remaining: 3.52s
1277:	learn: 1532.7182305	total: 20.2s	remaining: 3.5s
1278:	learn: 1532.6596460	total: 20.2s	remaining: 3.49s
1279:	learn: 1532.5901341	total: 20.2s	remaining: 3.47s
1280:	learn: 1532.4979697	total: 20.2s	remaining: 3.45s
1281:	le

1411:	learn: 1521.8258616	total: 22.5s	remaining: 1.4s
1412:	learn: 1521.7224402	total: 22.5s	remaining: 1.39s
1413:	learn: 1521.6262800	total: 22.5s	remaining: 1.37s
1414:	learn: 1521.5558553	total: 22.6s	remaining: 1.36s
1415:	learn: 1521.4777307	total: 22.6s	remaining: 1.34s
1416:	learn: 1521.3586314	total: 22.6s	remaining: 1.32s
1417:	learn: 1521.3184960	total: 22.6s	remaining: 1.31s
1418:	learn: 1521.2540584	total: 22.6s	remaining: 1.29s
1419:	learn: 1521.2051227	total: 22.7s	remaining: 1.28s
1420:	learn: 1521.1439338	total: 22.7s	remaining: 1.26s
1421:	learn: 1521.0629421	total: 22.7s	remaining: 1.24s
1422:	learn: 1520.9767508	total: 22.7s	remaining: 1.23s
1423:	learn: 1520.9325340	total: 22.7s	remaining: 1.21s
1424:	learn: 1520.8698412	total: 22.8s	remaining: 1.2s
1425:	learn: 1520.8498031	total: 22.8s	remaining: 1.18s
1426:	learn: 1520.7880039	total: 22.8s	remaining: 1.17s
1427:	learn: 1520.6877156	total: 22.8s	remaining: 1.15s
1428:	learn: 1520.6209334	total: 22.8s	remaining: 

<catboost.core.CatBoostRegressor at 0x139df109390>

In [143]:
%%time

# Prediction

predictions = model.predict(features_valid_oe)

CPU times: total: 172 ms
Wall time: 74 ms


In [144]:
# Quality (RMSE)
mean_squared_error(target_valid, predictions, squared=False)

1682.404093038962

The best model achieved an RMSE of 1682.40, with a training time of 24.5 s and a prediction time of 74 ms.

## RandomForestRegressor

We will not tune hyperparameters for the random forest, because it takes too long to train. Let us look at the metric for a model with n_estimators=100 and max_depth=15.

In [82]:
# Model initialization
model = RandomForestRegressor(n_estimators=100,max_depth=15,random_state=12345)


In [83]:
%%time

# Training time

model.fit(features_train_oe,target_train)

CPU times: total: 1min 33s
Wall time: 1min 35s


In [84]:
%%time

# Prediction time

predictions = model.predict(features_valid_oe)

CPU times: total: 3.03 s
Wall time: 3.16 s


In [85]:
print('RMSE RandomForestRegressor:', mean_squared_error(target_valid,predictions,squared=False))

RMSE RandomForestRegressor: 1695.767292404521


The resulting model has an RMSE of 1695.77, a training time of 95 s, and a prediction time of 3.16 s.

# Model analysis

1. Among the linear models, Ridge performed best, with an RMSE of 2642.93.

2. LightGBM achieved an RMSE of 1669.85, with a training time of 4.63 s and a prediction time of 1.04 s.

3. CatBoost achieved an RMSE of 1682.40, with a training time of 24.5 s and a prediction time of 74 ms.

4. The random forest achieved an RMSE of 1695.77, with a training time of 95 s and a prediction time of 3.16 s.

Thus, the best model, with the lowest RMSE, is LGBMRegressor.
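The comparison can be collected into one table, which makes the trade-off between accuracy and speed easier to see. This is a sketch that re-enters the rounded validation results and wall times reported above by hand; the frame name `results` and its column names are assumptions.

```python
import pandas as pd

# Validation RMSE and wall times taken from the cells above (rounded)
results = pd.DataFrame(
    {
        'rmse_valid': [2642.93, 1669.85, 1682.40, 1695.77],
        'fit_time_s': [1.25, 4.63, 24.5, 95.0],
        'predict_time_s': [0.152, 1.04, 0.074, 3.16],
    },
    index=['Ridge', 'LightGBM', 'CatBoost', 'RandomForest'],
)

# Model with the lowest validation RMSE
best_model = results['rmse_valid'].idxmin()

print(results.sort_values('rmse_valid'))
print('Best by RMSE:', best_model)
```

Sorted this way, LightGBM leads on RMSE, while CatBoost is close behind and has the fastest prediction time of the four.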

# Testing the best model

Let us test the best model.

In [147]:
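# n_estimators and num_leaves are the scikit-learn API names for the number of trees and the leaf limit
model = LGBMRegressor(n_estimators=1000, num_leaves=30, random_state=12345)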
model.fit(features_train_oe, target_train)
predictions = model.predict(features_test_oe)
print(mean_squared_error(target_test,predictions,squared = False))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005798 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 628
[LightGBM] [Info] Number of data points in the train set: 166045, number of used features: 9
[LightGBM] [Info] Start training from score 5154.928013
1744.0722474091358


On the test set, the RMSE is 1744.