# Determining Car Prices

«Не бит, не крашен», a used-car sales service, is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. Historical data are available: technical specifications, trim levels, and car prices. The task is to build a model that predicts the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

**Imports**

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score, make_scorer
)

from lightgbm import LGBMRegressor

## Data Preparation

In [3]:
df = pd.read_csv('/datasets/autos.csv')
df_lgbm = pd.read_csv('/datasets/autos.csv')

In [4]:
# LightGBM handles pandas 'category' columns natively, so convert them all at once
cat_columns = ['VehicleType', 'FuelType', 'Brand', 'Gearbox', 'NotRepaired']
df_lgbm = df_lgbm.astype({col: 'category' for col in cat_columns})

In [5]:
df.head(5)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Kilometer            354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB


Columns with missing values:

- VehicleType: body type

- Gearbox: gearbox type

- Model: car model

- FuelType: fuel type

- NotRepaired: whether the car has been repaired or not

Each of these features affects the final price in one way or another.

Since the customer cares about prediction quality, prediction speed, and training time, it makes sense to drop every row that has a NaN in these columns. Most of them are categorical, so the gaps cannot be filled with a mean or a median.

In [7]:
df = df.loc[df.notnull().all(axis=1)]
df_lgbm = df_lgbm.loc[df_lgbm.notnull().all(axis=1)]
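
The row filter above is equivalent to `DataFrame.dropna()`. Below is a minimal sketch on a toy frame (the values are made up for illustration and are not the real `autos.csv`) that also reports the share of rows removed, which is worth checking before discarding data. In this notebook the same step reduces 354369 rows to 245814.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; not the real autos.csv.
toy = pd.DataFrame({
    'Price': [480, 18300, 9800, 1500],
    'VehicleType': [np.nan, 'coupe', 'suv', 'small'],
    'Gearbox': ['manual', 'manual', 'auto', 'manual'],
    'NotRepaired': [np.nan, 'yes', np.nan, 'no'],
})

# Same result as toy.loc[toy.notnull().all(axis=1)]
clean = toy.dropna()

# Share of rows lost by dropping incomplete records
lost_share = 1 - len(clean) / len(toy)
print(f'Rows kept: {len(clean)} of {len(toy)} ({lost_share:.0%} dropped)')
```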

Next...

### "Is it really needed?"

In [8]:
df.head(5)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
5,2016-04-04 17:36:23,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes,2016-04-04 00:00:00,0,33775,2016-04-06 19:17:07
6,2016-04-01 20:48:51,2200,convertible,2004,manual,109,2_reihe,150000,8,petrol,peugeot,no,2016-04-01 00:00:00,0,67112,2016-04-05 18:18:39
7,2016-03-21 18:54:38,0,sedan,1980,manual,50,other,40000,7,petrol,volkswagen,no,2016-03-21 00:00:00,0,19348,2016-03-25 16:47:58


In [9]:
df['Gearbox'].value_counts()

manual    194736
auto       51078
Name: Gearbox, dtype: int64

***Features and their effect on price***

Features that logically have no effect on the final price can be dropped, which shrinks the dataset and shortens model training time.

- DateCrawled: date the listing was downloaded from the database (no effect)

- VehicleType: body type (affects price; categorical feature)

- RegistrationYear: year of registration (affects price)

- Gearbox: gearbox type (affects price)

- Power: power in hp (affects price)

- Model: car model (affects price, but not strongly: the brand plays a much bigger role in determining a car's price)

- Kilometer: mileage in km (affects price)

- RegistrationMonth: month of registration (I would rely on the registration year; the month can "confuse" the model, which would treat xx.12.1995 as equal to xx.01.2010. Scaling is an option, but listings always state the year of manufacture, so I consider this feature redundant)

- FuelType: fuel type (affects price)

- Brand: car make (affects price)

- NotRepaired: whether the car has been repaired or not (affects price)

- DateCreated: date the listing was created (no effect)

- NumberOfPictures: number of photos of the car (does not affect price, but affects demand)

- PostalCode: postal code of the listing owner (affects price, since prices differ between regions; but because it is a number, the model would read a larger code as a higher price, so this feature can also mislead the model)

- LastSeen: date of the user's last activity (no effect)


Summary

In [10]:
df['Model'].value_counts()

golf                  20200
other                 18480
3er                   14894
polo                   8806
corsa                  8265
                      ...  
i3                        4
serie_3                   3
samara                    3
rangerover                2
range_rover_evoque        2
Name: Model, Length: 249, dtype: int64

In [11]:
# Columns judged above to have no (or a misleading) effect on price
cols_to_drop = ['DateCrawled', 'Model', 'RegistrationMonth', 'DateCreated', 'NumberOfPictures', 'LastSeen', 'PostalCode']
df = df.drop(columns=cols_to_drop)
df_lgbm = df_lgbm.drop(columns=cols_to_drop)

In [12]:
df.head(5)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,NotRepaired
3,1500,small,2001,manual,75,150000,petrol,volkswagen,no
4,3600,small,2008,manual,69,90000,gasoline,skoda,no
5,650,sedan,1995,manual,102,150000,petrol,bmw,yes
6,2200,convertible,2004,manual,109,150000,petrol,peugeot,no
7,0,sedan,1980,manual,50,40000,petrol,volkswagen,no


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245814 entries, 3 to 354367
Data columns (total 9 columns):
Price               245814 non-null int64
VehicleType         245814 non-null object
RegistrationYear    245814 non-null int64
Gearbox             245814 non-null object
Power               245814 non-null int64
Kilometer           245814 non-null int64
FuelType            245814 non-null object
Brand               245814 non-null object
NotRepaired         245814 non-null object
dtypes: int64(4), object(5)
memory usage: 18.8+ MB


In [14]:
df_lgbm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 245814 entries, 3 to 354367
Data columns (total 9 columns):
Price               245814 non-null int64
VehicleType         245814 non-null category
RegistrationYear    245814 non-null int64
Gearbox             245814 non-null category
Power               245814 non-null int64
Kilometer           245814 non-null int64
FuelType            245814 non-null category
Brand               245814 non-null category
NotRepaired         245814 non-null category
dtypes: category(5), int64(4)
memory usage: 10.6 MB


### Preprocessing

***Price*** - numerical

Some values are 0. These are probably errors, so I believe they should be removed.

In [15]:
df['Price'].value_counts()

1500     3481
0        3386
500      3322
2500     2912
1200     2894
         ... 
11333       1
1344        1
3393        1
7491        1
11195       1
Name: Price, Length: 3405, dtype: int64

In [16]:
df = df.loc[df['Price'] != 0]
df_lgbm = df_lgbm.loc[df_lgbm['Price'] != 0]

***Power*** - numerical

Some values are 0. These are probably errors, so I believe they should be removed.

In [17]:
df['Power'].value_counts().head(10)

75     16062
150    10853
60     10470
140    10336
101     9638
116     9547
0       9153
90      8556
105     8358
170     8117
Name: Power, dtype: int64

In [18]:
df = df.loc[df['Power'] != 0]
df_lgbm = df_lgbm.loc[df_lgbm['Power'] != 0]

In [19]:
df = df.reset_index(drop=True)
df_lgbm = df_lgbm.reset_index(drop=True)
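
The two zero filters can be combined into one boolean mask. The sketch below uses toy data and also shows an optional plausibility range for `Power`. The bounds `MIN_POWER` and `MAX_POWER` are assumptions chosen for illustration, not values from this notebook; they are included because the filtered table still contains a row with `Power` equal to 3.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Price': [1500, 0, 3999, 7900],
    'Power': [75, 102, 3, 0],
})

# Combined equivalent of the two separate "!= 0" filters
mask = (toy['Price'] != 0) & (toy['Power'] != 0)
filtered = toy.loc[mask].reset_index(drop=True)

# Hypothetical plausibility bounds (assumption, not from the notebook)
MIN_POWER, MAX_POWER = 20, 1000
plausible = filtered.loc[filtered['Power'].between(MIN_POWER, MAX_POWER)]

print(len(toy), len(filtered), len(plausible))
```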

In [20]:
df.head(10)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,NotRepaired
0,1500,small,2001,manual,75,150000,petrol,volkswagen,no
1,3600,small,2008,manual,69,90000,gasoline,skoda,no
2,650,sedan,1995,manual,102,150000,petrol,bmw,yes
3,2200,convertible,2004,manual,109,150000,petrol,peugeot,no
4,2000,sedan,2004,manual,105,150000,petrol,mazda,no
5,2799,wagon,2005,manual,140,150000,gasoline,volkswagen,yes
6,17999,suv,2011,manual,190,70000,gasoline,nissan,no
7,1750,small,2004,auto,75,150000,petrol,renault,no
8,7550,bus,2007,manual,136,150000,gasoline,ford,no
9,1850,bus,2004,manual,102,150000,petrol,mercedes_benz,no


***Body type*** - categorical

In [21]:
df['VehicleType'].value_counts()

sedan          68961
small          55004
wagon          48783
bus            22563
convertible    15681
coupe          11585
suv             9216
other           1482
Name: VehicleType, dtype: int64

In [22]:
ohe = OneHotEncoder(sparse=False)
ohe_ftrs = ohe.fit_transform(df['VehicleType'].values.reshape(-1,1))
tmp = pd.DataFrame(ohe_ftrs, columns = ['VehicleType ' + str(i) for i in range(ohe_ftrs.shape[1])])
df = pd.concat([df, tmp], axis=1)
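
The generated column names (`VehicleType 0`, `VehicleType 1`, ...) hide which category each column stands for. As an alternative sketch, not the approach used in this notebook, `pd.get_dummies` encodes several categorical columns at once and names each new column after its category. The toy frame is made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Price': [1500, 650, 2200],
    'VehicleType': ['small', 'sedan', 'convertible'],
    'FuelType': ['petrol', 'petrol', 'gasoline'],
})

# One call encodes both columns and drops the originals;
# new columns are named '<column>_<category>'
encoded = pd.get_dummies(toy, columns=['VehicleType', 'FuelType'], dtype=float)
print(encoded.columns.tolist())
```

With readable names, feature importances or regression coefficients can be traced back to a concrete body type or fuel type.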

In [23]:
df

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,NotRepaired,VehicleType 0,VehicleType 1,VehicleType 2,VehicleType 3,VehicleType 4,VehicleType 5,VehicleType 6,VehicleType 7
0,1500,small,2001,manual,75,150000,petrol,volkswagen,no,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,3600,small,2008,manual,69,90000,gasoline,skoda,no,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,650,sedan,1995,manual,102,150000,petrol,bmw,yes,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,2200,convertible,2004,manual,109,150000,petrol,peugeot,no,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2000,sedan,2004,manual,105,150000,petrol,mazda,no,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233270,7900,sedan,2010,manual,140,150000,gasoline,volkswagen,no,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
233271,3999,wagon,2005,manual,3,150000,gasoline,bmw,no,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
233272,3200,sedan,2004,manual,225,150000,petrol,seat,yes,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
233273,1199,convertible,2000,auto,101,125000,petrol,smart,no,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


***Gearbox type*** - categorical

In [24]:
df['Gearbox'].value_counts()

manual    184548
auto       48727
Name: Gearbox, dtype: int64

Encode it as a binary feature: since auto is usually more expensive than manual, auto becomes 1 and manual becomes 0.

In [25]:
# Vectorized encoding: 1 for 'auto', 0 for everything else
gearbox = (df['Gearbox'] == 'auto').astype(int)

In [26]:
df['Gearbox'] = gearbox

***Fuel type*** - categorical

In [27]:
df['FuelType'].value_counts()

petrol      153352
gasoline     75542
lpg           3687
cng            424
hybrid         167
other           54
electric        49
Name: FuelType, dtype: int64

In [28]:
ohe = OneHotEncoder(sparse=False)
ohe_ftrs = ohe.fit_transform(df['FuelType'].values.reshape(-1,1))
tmp = pd.DataFrame(ohe_ftrs, columns = ['FuelType ' + str(i) for i in range(ohe_ftrs.shape[1])])
df = pd.concat([df, tmp], axis=1)

In [29]:
df

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,NotRepaired,VehicleType 0,...,VehicleType 5,VehicleType 6,VehicleType 7,FuelType 0,FuelType 1,FuelType 2,FuelType 3,FuelType 4,FuelType 5,FuelType 6
0,1500,small,2001,0,75,150000,petrol,volkswagen,no,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,3600,small,2008,0,69,90000,gasoline,skoda,no,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,650,sedan,1995,0,102,150000,petrol,bmw,yes,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,2200,convertible,2004,0,109,150000,petrol,peugeot,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,2000,sedan,2004,0,105,150000,petrol,mazda,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233270,7900,sedan,2010,0,140,150000,gasoline,volkswagen,no,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
233271,3999,wagon,2005,0,3,150000,gasoline,bmw,no,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
233272,3200,sedan,2004,0,225,150000,petrol,seat,yes,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
233273,1199,convertible,2000,1,101,125000,petrol,smart,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


***Brand*** - categorical

In [30]:
df['Brand'].value_counts()

volkswagen       49484
bmw              26224
opel             24514
mercedes_benz    22673
audi             20779
ford             16035
renault          10665
peugeot           7440
fiat              5925
seat              4696
skoda             4262
mazda             3771
toyota            3461
citroen           3457
nissan            3247
smart             3199
mini              2643
hyundai           2619
volvo             2370
mitsubishi        1911
honda             1832
kia               1762
alfa_romeo        1602
suzuki            1597
chevrolet         1198
chrysler           936
dacia              690
porsche            527
subaru             506
jeep               478
daihatsu           473
saab               403
land_rover         395
jaguar             370
daewoo             303
lancia             292
rover              241
trabant            167
lada               128
Name: Brand, dtype: int64

In [31]:
ohe = OneHotEncoder(sparse=False)
ohe_ftrs = ohe.fit_transform(df['Brand'].values.reshape(-1,1))
tmp = pd.DataFrame(ohe_ftrs, columns = ['Brand ' + str(i) for i in range(ohe_ftrs.shape[1])])
df = pd.concat([df, tmp], axis=1)

In [32]:
df

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,NotRepaired,VehicleType 0,...,Brand 29,Brand 30,Brand 31,Brand 32,Brand 33,Brand 34,Brand 35,Brand 36,Brand 37,Brand 38
0,1500,small,2001,0,75,150000,petrol,volkswagen,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,3600,small,2008,0,69,90000,gasoline,skoda,no,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,650,sedan,1995,0,102,150000,petrol,bmw,yes,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2200,convertible,2004,0,109,150000,petrol,peugeot,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2000,sedan,2004,0,105,150000,petrol,mazda,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233270,7900,sedan,2010,0,140,150000,gasoline,volkswagen,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
233271,3999,wagon,2005,0,3,150000,gasoline,bmw,no,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
233272,3200,sedan,2004,0,225,150000,petrol,seat,yes,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
233273,1199,convertible,2000,1,101,125000,petrol,smart,no,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


***Whether the car has been repaired*** - categorical

In [33]:
df['NotRepaired'].value_counts()

no     208819
yes     24456
Name: NotRepaired, dtype: int64

Encode it as a binary feature. We treat this field as an indicator of the car's technical condition: repairs were needed by the cars whose condition required them.

- no: 1
- yes: 0

In [34]:
# Vectorized encoding: 0 for 'yes', 1 for everything else
notrepaired = (df['NotRepaired'] != 'yes').astype(int)

In [35]:
df['NotRepaired'] = notrepaired

### Final DataFrame after preprocessing

**For LGBM**

In [36]:
df_lgbm

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,NotRepaired
0,1500,small,2001,manual,75,150000,petrol,volkswagen,no
1,3600,small,2008,manual,69,90000,gasoline,skoda,no
2,650,sedan,1995,manual,102,150000,petrol,bmw,yes
3,2200,convertible,2004,manual,109,150000,petrol,peugeot,no
4,2000,sedan,2004,manual,105,150000,petrol,mazda,no
...,...,...,...,...,...,...,...,...,...
233270,7900,sedan,2010,manual,140,150000,gasoline,volkswagen,no
233271,3999,wagon,2005,manual,3,150000,gasoline,bmw,no
233272,3200,sedan,2004,manual,225,150000,petrol,seat,yes
233273,1199,convertible,2000,auto,101,125000,petrol,smart,no


**For the other models**

In [37]:
df = df.drop(['VehicleType', 'FuelType', 'Brand'], axis = 1)

In [38]:
df.head(10)

Unnamed: 0,Price,RegistrationYear,Gearbox,Power,Kilometer,NotRepaired,VehicleType 0,VehicleType 1,VehicleType 2,VehicleType 3,...,Brand 29,Brand 30,Brand 31,Brand 32,Brand 33,Brand 34,Brand 35,Brand 36,Brand 37,Brand 38
0,1500,2001,0,75,150000,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,3600,2008,0,69,90000,1,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,650,1995,0,102,150000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2200,2004,0,109,150000,1,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2000,2004,0,105,150000,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2799,2005,0,140,150000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,17999,2011,0,190,70000,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1750,2004,1,75,150000,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,7550,2007,0,136,150000,1,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1850,2004,0,102,150000,1,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Model Training

In [40]:
features = df.drop('Price', axis = 1)
target = df['Price']

In [41]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size = 0.25, random_state = 12345)

In [42]:
print('Training set shapes:')
print(features_train.shape)
print(target_train.shape)

print('Validation set shapes:')
print(features_valid.shape)
print(target_valid.shape)

Training set shapes:
(174956, 59)
(174956,)
Validation set shapes:
(58319, 59)
(58319,)


In [43]:
print('Scaling: standardizing the features...')

scaler = StandardScaler()

features_train = scaler.fit_transform(features_train)
features_valid = scaler.transform(features_valid)

Scaling: standardizing the features...


**Several models need to be trained**

### LinearRegression

In [44]:
%%time

model1 = LinearRegression()
model1.fit(features_train, target_train)
# predictions_valid = model1.predict(features_valid)

CPU times: user 2.14 s, sys: 700 ms, total: 2.84 s
Wall time: 2.83 s


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [45]:
%%time

predictions1_train = model1.predict(features_train)

CPU times: user 40.8 ms, sys: 560 µs, total: 41.4 ms
Wall time: 15.7 ms


In [46]:
print('Model RMSE on the training set:', mean_squared_error(target_train, predictions1_train)**0.5)
print('Model MAE on the training set:', mean_absolute_error(target_train, predictions1_train))
print('Coefficient of determination on the training set:', r2_score(target_train, predictions1_train))
print()
# print('Model RMSE on the validation set:', mean_squared_error(target_valid, predictions_valid)**0.5)
# print('Coefficient of determination on the validation set:', r2_score(target_valid, predictions_valid))

Model RMSE on the training set: 2913.283642440045
Model MAE on the training set: 2057.239489208692
Coefficient of determination on the training set: 0.619835583714386
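
The metrics above are computed on the training set only, so they cannot reveal overfitting. Below is a minimal sketch of reporting the same metrics on both splits. The helper name `report_metrics` and the synthetic data are assumptions for illustration; in the notebook the helper would be called with `features_train`/`target_train` and `features_valid`/`target_valid`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def report_metrics(model, features, target):
    """Return RMSE, MAE and R2 of a fitted model on one data split."""
    predictions = model.predict(features)
    return {
        'rmse': mean_squared_error(target, predictions) ** 0.5,
        'mae': mean_absolute_error(target, predictions),
        'r2': r2_score(target, predictions),
    }


# Synthetic regression data standing in for the car features (assumption)
rng = np.random.RandomState(12345)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

model = LinearRegression().fit(X_train, y_train)
train_metrics = report_metrics(model, X_train, y_train)
valid_metrics = report_metrics(model, X_valid, y_valid)
print('train:', train_metrics)
print('valid:', valid_metrics)
```

A large gap between the two rows would suggest that a model memorizes the training data rather than generalizing.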



### RandomForestRegressor

In [47]:
parametrs = {'n_estimators': range(1, 51, 5),
              'max_depth': range(1, 21, 2)}

In [48]:
%%time

model2 = RandomForestRegressor(random_state=12345)
model2.fit(features_train, target_train)
predictions2_train = model2.predict(features_train)

CPU times: user 12.9 s, sys: 0 ns, total: 12.9 s
Wall time: 13.6 s


In [49]:
print('Model RMSE on the training set:', mean_squared_error(target_train, predictions2_train)**0.5)
print('Model MAE on the training set:', mean_absolute_error(target_train, predictions2_train))

Model RMSE on the training set: 1109.81241565519
Model MAE on the training set: 686.5214186658596


In [50]:
scoring = make_scorer(mean_absolute_error)

In [51]:
# %%time

# grid = RandomizedSearchCV(model2, parametrs, n_iter = 10, cv=3, scoring = scoring, verbose = 0, random_state = 12345)
# grid.fit(features_train, target_train)
# grid.best_params_

OUT[]: {'n_estimators': 1, 'max_depth': 5}

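Why is that? The resulting RMSE and MAE are even worse than without hyperparameter tuning! A likely cause is the scorer: `make_scorer(mean_absolute_error)` defaults to `greater_is_better=True`, so the search selects the parameters with the largest MAE rather than the smallest.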
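
The sign convention of `make_scorer` can be checked directly. The sketch below uses a `DummyRegressor` on made-up data to compare the scorer as defined in this notebook with one built with `greater_is_better=False`; the built-in string `scoring='neg_mean_absolute_error'` behaves like the second variant.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error

# Made-up data: the dummy model always predicts the mean (2.5)
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])
model = DummyRegressor(strategy='mean').fit(X, y)

# Scorer as defined in the notebook: returns +MAE, so a search
# that picks the highest score prefers larger errors
maximized = make_scorer(mean_absolute_error)

# Scorer for an error metric: returns -MAE, so the highest score
# corresponds to the smallest error
minimized = make_scorer(mean_absolute_error, greater_is_better=False)

score_max = maximized(model, X, y)
score_min = minimized(model, X, y)
print(score_max, score_min)
```

Passing the second scorer to `RandomizedSearchCV` would make `best_params_` correspond to the lowest cross-validated MAE.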

In [52]:
%%time

model3 = RandomForestRegressor(random_state=12345, max_depth = 5, n_estimators = 1)
model3.fit(features_train, target_train)

CPU times: user 391 ms, sys: 0 ns, total: 391 ms
Wall time: 418 ms


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=1, n_jobs=None,
                      oob_score=False, random_state=12345, verbose=0,
                      warm_start=False)

In [53]:
%%time

predictions3_train = model3.predict(features_train)

CPU times: user 44.8 ms, sys: 0 ns, total: 44.8 ms
Wall time: 46.9 ms


In [54]:
print('Model RMSE on the training set:', mean_squared_error(target_train, predictions3_train)**0.5)
print('Model MAE on the training set:', mean_absolute_error(target_train, predictions3_train))

Model RMSE on the training set: 2391.805970823164
Model MAE on the training set: 1660.4678706006714


### LGBMRegressor

Feature preparation

In [55]:
df_lgbm

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Kilometer,FuelType,Brand,NotRepaired
0,1500,small,2001,manual,75,150000,petrol,volkswagen,no
1,3600,small,2008,manual,69,90000,gasoline,skoda,no
2,650,sedan,1995,manual,102,150000,petrol,bmw,yes
3,2200,convertible,2004,manual,109,150000,petrol,peugeot,no
4,2000,sedan,2004,manual,105,150000,petrol,mazda,no
...,...,...,...,...,...,...,...,...,...
233270,7900,sedan,2010,manual,140,150000,gasoline,volkswagen,no
233271,3999,wagon,2005,manual,3,150000,gasoline,bmw,no
233272,3200,sedan,2004,manual,225,150000,petrol,seat,yes
233273,1199,convertible,2000,auto,101,125000,petrol,smart,no


In [56]:
features = df_lgbm.drop('Price', axis = 1)
target = df_lgbm['Price']

In [57]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size = 0.25, random_state = 12345)

In [58]:
print('Training set shapes:')
print(features_train.shape)
print(target_train.shape)

print('Validation set shapes:')
print(features_valid.shape)
print(target_valid.shape)

Training set shapes:
(174956, 8)
(174956,)
Validation set shapes:
(58319, 8)
(58319,)


In [59]:
df_lgbm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233275 entries, 0 to 233274
Data columns (total 9 columns):
Price               233275 non-null int64
VehicleType         233275 non-null category
RegistrationYear    233275 non-null int64
Gearbox             233275 non-null category
Power               233275 non-null int64
Kilometer           233275 non-null int64
FuelType            233275 non-null category
Brand               233275 non-null category
NotRepaired         233275 non-null category
dtypes: category(5), int64(4)
memory usage: 8.2 MB


Training

In [60]:
parametrs = {'learning_rate': [0.01, 0.05, 0.1, 1],
                'n_estimators': range(20, 100, 2)}

In [61]:
%%time

model4 = LGBMRegressor(random_state=12345)
model4.fit(features_train, target_train)
predictions4_train = model4.predict(features_train)

CPU times: user 2min 58s, sys: 723 ms, total: 2min 58s
Wall time: 3min


In [62]:
print('Model RMSE on the training set:', mean_squared_error(target_train, predictions4_train)**0.5)
print('Model MAE on the training set:', mean_absolute_error(target_train, predictions4_train))

Model RMSE on the training set: 1627.9018596625062
Model MAE on the training set: 1058.5468958316721


In [63]:
%%time

grid = RandomizedSearchCV(model4, parametrs, n_iter = 10, cv=3, scoring = scoring, verbose = 0, random_state = 12345)
grid.fit(features_train, target_train)
grid.best_params_

CPU times: user 1h 28min 59s, sys: 16.7 s, total: 1h 29min 15s
Wall time: 1h 29min 56s


{'n_estimators': 90, 'learning_rate': 0.01}

OUT[]: {'n_estimators': 90, 'learning_rate': 0.01} - an hour and a half))

In [64]:
%%time

model5 = LGBMRegressor(random_state=12345, learning_rate = 0.01, n_estimators = 90)
model5.fit(features_train, target_train)

CPU times: user 2min 40s, sys: 495 ms, total: 2min 41s
Wall time: 2min 42s


LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.01, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=90, n_jobs=-1, num_leaves=31, objective=None,
              random_state=12345, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [65]:
%%time

predictions5_train = model5.predict(features_train)

CPU times: user 1.51 s, sys: 0 ns, total: 1.51 s
Wall time: 1.52 s


In [66]:
print('Model RMSE on the training set:', mean_squared_error(target_train, predictions5_train)**0.5)
print('Model MAE on the training set:', mean_absolute_error(target_train, predictions5_train))

Model RMSE on the training set: 2745.1796975764805
Model MAE on the training set: 2077.479542378132


## Model Analysis

**RMSE, MAE, training time, and prediction time for each model**

In [67]:
# Timings are copied by hand from the %%time outputs above (seconds)
display(pd.DataFrame(
    [[mean_squared_error(target_train, predictions1_train)**0.5, mean_absolute_error(target_train, predictions1_train), 2.84, 0.0442],
     [mean_squared_error(target_train, predictions3_train)**0.5, mean_absolute_error(target_train, predictions3_train), 0.418, 0.047],
     [mean_squared_error(target_train, predictions5_train)**0.5, mean_absolute_error(target_train, predictions5_train), 162, 1.6]],
    columns=['RMSE', 'MAE', 'Training time (s)', 'Prediction time (s)'],
    index=['LinearRegression', 'RandomForestRegressor', 'LGBMRegressor']))

Unnamed: 0,RMSE,MAE,Training time (s),Prediction time (s)
LinearRegression,2913.283642440045,2057.239489208692,2.84,0.0442
RandomForestRegressor,2391.805970823164,1660.4678706006714,0.418,0.047
LGBMRegressor,2745.1796975764805,2077.479542378132,162.0,1.6
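
The timings in the table are typed in by hand from the `%%time` outputs. As a sketch of measuring them programmatically instead, the hypothetical helper `timed` below wraps a call with `time.perf_counter`; the synthetic data and the helper name are assumptions for illustration.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression


def timed(func, *args):
    """Call func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Synthetic data standing in for the car features (assumption)
rng = np.random.RandomState(12345)
X = rng.normal(size=(1000, 5))
y = X.sum(axis=1)

model = LinearRegression()
_, fit_seconds = timed(model.fit, X, y)
predictions, predict_seconds = timed(model.predict, X)
print(f'fit: {fit_seconds:.4f} s, predict: {predict_seconds:.4f} s')
```

Storing `fit_seconds` and `predict_seconds` per model keeps the summary table in sync with the actual run.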


**Conclusion**

1) Hyperparameters were tuned with RandomizedSearchCV and GridSearchCV

2) RandomForestRegressor showed the best results: the shortest training time, a prediction time close to that of LinearRegression, and the lowest RMSE and MAE on the training set

3) In this comparison, quality and training speed went together: the lower a model's MAE, the faster it trained!
