# Determining Car Prices

# Project description
The used-car sales service «Не бит, не крашен» is developing an app to attract new customers. In it, users can quickly find out the market value of their car. You have historical data at your disposal: technical specifications, trim levels, and car prices. You need to build a model that estimates the price.

The customer cares about:

- prediction quality;
- prediction speed;
- training time.

**Data description**

Features:
- DateCrawled: date the listing was downloaded from the database
- VehicleType: body type
- RegistrationYear: year the car was registered
- Gearbox: transmission type
- Power: engine power (hp)
- Model: car model
- Kilometer: mileage (km)
- RegistrationMonth: month the car was registered
- FuelType: fuel type
- Brand: car make
- NotRepaired: whether the car has been repaired
- DateCreated: date the listing was created
- NumberOfPictures: number of photos of the car
- PostalCode: postal code of the listing owner (user)
- LastSeen: date of the user's last activity

Target feature
- Price: price (EUR)

## Data preparation

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import  train_test_split, KFold, GridSearchCV, RandomizedSearchCV
from lightgbm import LGBMRegressor
from catboost import Pool, CatBoostRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('/datasets/autos.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Kilometer            354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB


In [3]:
data.sample(10)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
58586,2016-04-05 09:54:43,800,coupe,1994,manual,115,6_reihe,150000,0,,mazda,no,2016-04-05 00:00:00,0,89250,2016-04-07 13:16:21
306879,2016-04-01 17:39:37,2630,sedan,1998,manual,170,5er,150000,6,petrol,bmw,no,2016-04-01 00:00:00,0,38444,2016-04-05 12:46:16
273150,2016-04-02 21:36:34,5700,sedan,2004,auto,150,3er,150000,10,gasoline,bmw,yes,2016-04-02 00:00:00,0,47443,2016-04-04 21:16:37
315717,2016-03-29 11:50:05,15900,other,1978,auto,185,s_klasse,5000,3,petrol,mercedes_benz,,2016-03-29 00:00:00,0,24113,2016-04-05 21:15:26
317250,2016-03-14 17:43:47,7990,wagon,2011,manual,105,golf,150000,1,gasoline,volkswagen,no,2016-03-14 00:00:00,0,38312,2016-03-14 17:43:47
116403,2016-04-02 11:48:37,4200,sedan,2009,,0,a3,125000,11,gasoline,audi,no,2016-04-02 00:00:00,0,10247,2016-04-02 11:48:37
308477,2016-03-11 09:55:33,750,wagon,1997,manual,60,polo,150000,11,petrol,volkswagen,,2016-03-11 00:00:00,0,96247,2016-03-15 19:46:18
147299,2016-03-15 08:55:26,600,sedan,1995,auto,90,golf,125000,3,petrol,volkswagen,yes,2016-03-15 00:00:00,0,10709,2016-03-18 01:46:12
193481,2016-03-30 14:49:21,1300,sedan,2000,manual,82,a_klasse,150000,10,petrol,mercedes_benz,no,2016-03-30 00:00:00,0,64291,2016-04-07 05:44:33
120226,2016-03-23 11:55:19,2050,coupe,1992,manual,141,3er,150000,10,petrol,bmw,no,2016-03-23 00:00:00,0,89077,2016-04-07 13:17:16


In [4]:
data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [5]:
data['NumberOfPictures'].unique()

array([0])

In [6]:
data['RegistrationYear'].unique()

array([1993, 2011, 2004, 2001, 2008, 1995, 1980, 2014, 1998, 2005, 1910,
       2016, 2007, 2009, 2002, 2018, 1997, 1990, 2017, 1981, 2003, 1994,
       1991, 1984, 2006, 1999, 2012, 2010, 2000, 1992, 2013, 1996, 1985,
       1989, 2015, 1982, 1976, 1983, 1973, 1111, 1969, 1971, 1987, 1986,
       1988, 1970, 1965, 1945, 1925, 1974, 1979, 1955, 1978, 1972, 1968,
       1977, 1961, 1960, 1966, 1975, 1963, 1964, 5000, 1954, 1958, 1967,
       1959, 9999, 1956, 3200, 1000, 1941, 8888, 1500, 2200, 4100, 1962,
       1929, 1957, 1940, 3000, 2066, 1949, 2019, 1937, 1951, 1800, 1953,
       1234, 8000, 5300, 9000, 2900, 6000, 5900, 5911, 1933, 1400, 1950,
       4000, 1948, 1952, 1200, 8500, 1932, 1255, 3700, 3800, 4800, 1942,
       7000, 1935, 1936, 6500, 1923, 2290, 2500, 1930, 1001, 9450, 1944,
       1943, 1934, 1938, 1688, 2800, 1253, 1928, 1919, 5555, 5600, 1600,
       2222, 1039, 9996, 1300, 8455, 1931, 1915, 4500, 1920, 1602, 7800,
       9229, 1947, 1927, 7100, 8200, 1946, 7500, 35

In [7]:
data['Power'].value_counts()

0        40225
75       24023
60       15897
150      14590
101      13298
         ...  
16311        1
1360         1
1968         1
6226         1
6006         1
Name: Power, Length: 712, dtype: int64

**Conclusion**

The initial look at the data shows that:
- none of the listings have photos (NumberOfPictures is always 0);
- the postal code is not needed for studying the car's price;
- the RegistrationYear and Power columns contain anomalous values;
- some cars have no price specified (the minimum Price is 0);
- listing dates and user activity data carry no value for this work;
- the registration month matters little, since usually only the year is taken into account;
- there are many missing values.
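These observations can be checked numerically. Below is a minimal sketch on a small made-up frame (the rows are illustrative, not taken from autos.csv) that reports the share of missing values per column and counts rows outside plausible ranges. The bounds themselves are assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the real notebook reads autos.csv.
toy = pd.DataFrame({
    'Price': [800, 0, 5700, 15900],
    'RegistrationYear': [1994, 9999, 2004, 1000],
    'Power': [115, 0, 150, 20000],
    'VehicleType': ['coupe', np.nan, 'sedan', np.nan],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Count rows that fall outside the assumed plausible ranges.
bad_year = (~toy['RegistrationYear'].between(1886, 2016)).sum()
zero_price = (toy['Price'] == 0).sum()
zero_power = (toy['Power'] == 0).sum()
print(bad_year, zero_price, zero_power)
```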

### Filling missing values

In [8]:
# drop the columns we do not need
data = data.drop(columns=['NumberOfPictures', 'PostalCode', 'DateCrawled', 'DateCreated', 'LastSeen', 'RegistrationMonth'])
data.sample(5)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,FuelType,Brand,NotRepaired
104804,5950,small,2010,manual,60,polo,125000,petrol,volkswagen,no
348447,7200,small,2009,manual,95,one,100000,petrol,mini,
48216,9950,convertible,1996,manual,90,mx_reihe,70000,petrol,mazda,no
338111,2900,sedan,2000,manual,150,a6,150000,gasoline,audi,no
251226,1950,,2005,,0,,150000,,renault,


In [9]:
# bound RegistrationYear (the first automobile was patented in 1886)
data = data.query('1886 <= RegistrationYear <= 2021')
data['RegistrationYear'].describe()

count    354198.000000
mean       2003.084789
std           7.536418
min        1910.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        2019.000000
Name: RegistrationYear, dtype: float64
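The upper bound of 2021 still lets through years later than the listings themselves: the sampled rows show crawl dates in 2016, yet the filtered maximum is 2019. One possible refinement, sketched here on toy rows and under the assumption that a car cannot be registered after its listing was crawled, derives the bound from DateCrawled (which would have to be kept until this step).

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'DateCrawled': ['2016-04-05 09:54:43', '2016-03-14 17:43:47', '2016-04-01 17:39:37'],
    'RegistrationYear': [1994, 2018, 1500],
})

crawl_year = pd.to_datetime(toy['DateCrawled']).dt.year

# Keep rows registered no earlier than 1886 and no later than the crawl year.
mask = (toy['RegistrationYear'] >= 1886) & (toy['RegistrationYear'] <= crawl_year)
filtered = toy[mask]
print(filtered)
```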

In [10]:
# keep only rows with Power between 100 and 2000 hp (this also drops every car below 100 hp)
data = data.query('100 <= Power <= 2000')
data['Power'].describe()

count    199956.000000
mean        149.807928
std          63.370673
min         100.000000
25%         115.000000
50%         140.000000
75%         170.000000
max        2000.000000
Name: Power, dtype: float64

In [11]:
# drop rows with a missing Price
data.dropna(subset=['Price'], inplace=True)

In [12]:
data.isna().sum()

Price                   0
VehicleType         12011
RegistrationYear        0
Gearbox              2880
Power                   0
Model                7945
Kilometer               0
FuelType            11790
Brand                   0
NotRepaired         27972
dtype: int64

- In VehicleType, replace missing values with the most common type.
- In Gearbox, set missing values to 'manual'.
- In Model, set NaN to 'unknown'.
- In FuelType, fill missing values with the most common type.
- In NotRepaired, set missing values to 'yes'.
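These fill rules can be collected in one place. Below is a minimal sketch on toy rows; the names `constant_fill` and `mode_fill` are hypothetical helpers, not part of the notebook. It looks up the most frequent value with `mode()` instead of hard-coding it.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'VehicleType': ['sedan', np.nan, 'sedan', 'wagon'],
    'Gearbox': ['manual', 'auto', np.nan, 'manual'],
    'Model': ['golf', np.nan, '3er', 'a3'],
    'FuelType': [np.nan, 'petrol', 'petrol', 'gasoline'],
    'NotRepaired': ['no', np.nan, 'yes', 'no'],
})

# Columns filled with a fixed value, and columns filled with their mode.
constant_fill = {'Gearbox': 'manual', 'Model': 'unknown', 'NotRepaired': 'yes'}
mode_fill = ['VehicleType', 'FuelType']

filled = toy.fillna(constant_fill)
for column in mode_fill:
    # mode() ignores NaN; take the first of the most frequent values.
    filled[column] = filled[column].fillna(filled[column].mode()[0])

print(filled)
```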

In [13]:
# replace NaN in Gearbox
data['Gearbox'] = data['Gearbox'].fillna('manual')

In [14]:
# replace NaN in Model
data['Model'] = data['Model'].fillna('unknown')

In [15]:
data['NotRepaired'].value_counts()

no     153640
yes     18344
Name: NotRepaired, dtype: int64

In [16]:
# replace NaN in NotRepaired
data['NotRepaired'] = data['NotRepaired'].fillna('yes')

In [17]:
data['VehicleType'].value_counts()

sedan          66120
wagon          53130
bus            20736
convertible    14691
coupe          13226
suv            10065
small           8603
other           1374
Name: VehicleType, dtype: int64

In [18]:
# replace NaN in VehicleType with the most common value
data['VehicleType'] = data['VehicleType'].fillna('sedan')

In [19]:
data['FuelType'].value_counts()

petrol      107881
gasoline     75759
lpg           4132
cng            265
hybrid          77
other           38
electric        14
Name: FuelType, dtype: int64

In [20]:
# replace NaN in FuelType with the most common value
data['FuelType'] = data['FuelType'].fillna('petrol')

In [21]:
data.isna().sum()

Price               0
VehicleType         0
RegistrationYear    0
Gearbox             0
Power               0
Model               0
Kilometer           0
FuelType            0
Brand               0
NotRepaired         0
dtype: int64

## Model training

The following models will be trained:
- LinearRegression
- RandomForestRegressor
- CatBoostRegressor
- LightGBM

RMSE is the metric for all models. We prepare the features and apply one-hot encoding (OHE).
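For reference, RMSE is the square root of the mean squared difference between true and predicted prices, so it is expressed in euros like the target. Below is a minimal pure-Python sketch with made-up numbers. The helper name `rmse` is an assumption; the notebook computes the same quantity as `mean_squared_error(...) ** 0.5`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices: the errors are 100, -200 and 0 euros.
example = rmse([1000, 2000, 3000], [900, 2200, 3000])
print(example)
```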

In [22]:
data[['VehicleType', 'Gearbox', 'FuelType', 'Brand', 'NotRepaired', 'Model']]= data[['VehicleType', 'Gearbox', 'FuelType', 'Brand', 'NotRepaired', 'Model']].astype('category')

In [23]:
data.info()
data.sample(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199956 entries, 1 to 354368
Data columns (total 10 columns):
Price               199956 non-null int64
VehicleType         199956 non-null category
RegistrationYear    199956 non-null int64
Gearbox             199956 non-null category
Power               199956 non-null int64
Model               199956 non-null category
Kilometer           199956 non-null int64
FuelType            199956 non-null category
Brand               199956 non-null category
NotRepaired         199956 non-null category
dtypes: category(6), int64(4)
memory usage: 9.0 MB


Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,FuelType,Brand,NotRepaired
262293,7000,sedan,2017,manual,170,unknown,150000,gasoline,hyundai,yes
46995,2100,wagon,2002,manual,101,focus,150000,petrol,ford,no
179355,13800,sedan,2010,auto,177,1er,125000,gasoline,bmw,no
14632,2250,wagon,2008,manual,121,other,150000,lpg,chevrolet,no
347208,13900,coupe,1994,auto,286,other,150000,petrol,bmw,no


In [24]:
# one-hot encoding (OHE)
data_ohe = pd.get_dummies(data, drop_first=True, prefix_sep='_')
data_ohe.sample(5)

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,VehicleType_suv,...,Brand_skoda,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_yes
104327,5900,2005,116,125000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
210860,5200,2002,180,150000,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
193434,2300,2005,107,150000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
105170,9750,2003,340,150000,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
308824,550,1997,101,150000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
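To see what `drop_first` does, here is a sketch on a toy frame: a categorical column with k levels becomes k-1 indicator columns, and the dropped level is encoded by all zeros. This avoids perfectly collinear columns in linear regression.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({'Gearbox': ['manual', 'auto', 'manual'],
                    'Power': [115, 150, 105]})

# 'auto' sorts first, so it is the dropped level; only Gearbox_manual remains.
encoded = pd.get_dummies(toy, drop_first=True, prefix_sep='_')
print(encoded.columns.tolist())
```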


In [25]:
features = data_ohe.drop('Price', axis=1)
target = data_ohe['Price']

In [26]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

### Linear regression

In [27]:
# time.clock() was removed in Python 3.8; time.process_time() also reports CPU time
start_time = time.process_time()

model = LinearRegression()
model.fit(features_train, target_train)
lr_time_train = time.process_time() - start_time
print(lr_time_train)

18.875647999999998


In [28]:
start_time = time.process_time()
predicted_valid = model.predict(features_valid)
lr_time_pred = time.process_time() - start_time
print(lr_time_pred)

0.2543380000000006
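
Because training time and prediction time are both requirements, the same timing pattern repeats for every model. It could be factored into a small helper. This is only a sketch: `timed` is a hypothetical name, and `time.perf_counter()` measures wall-clock time, which can differ from CPU time when a library uses several threads.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be model.fit or model.predict.
total, elapsed = timed(sum, range(1_000_000))
print(total, elapsed)
```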


In [29]:
mse_lr = mean_squared_error(target_valid, predicted_valid)
rmse_lr = mse_lr ** 0.5
print("Linear Regression")
print("RMSE =", rmse_lr)

Linear Regression
RMSE = 3146.6851751508048


### Random forest

In [30]:
start_time = time.process_time()

model = RandomForestRegressor(n_estimators=80, max_depth=11, random_state=12345)
model.fit(features_train, target_train)
rfr_time_train = time.process_time() - start_time
print(rfr_time_train)

173.268091


In [31]:
start_time = time.process_time()
predicted_valid = model.predict(features_valid)
rfr_time_pred = time.process_time() - start_time
print(rfr_time_pred)

0.6338830000000257


In [32]:
mse_rfr = mean_squared_error(target_valid, predicted_valid)
rmse_rfr = mse_rfr ** 0.5
print("Random Forest")
print("RMSE =", rmse_rfr)

Random Forest
RMSE = 2251.7328740075204


In [33]:
regressor = RandomForestRegressor() 
max_depth_list = [x for x in range(1,10)]
n_estimators_list = [x for x in range(1,10)]
hyperparams = [{'n_estimators': n_estimators_list, 
                'max_depth':max_depth_list, 
                'random_state':[12345]}]

print('# Tuning hyper-parameters for root_mean_squared_error')
print()
clf = GridSearchCV(regressor, hyperparams, scoring='neg_mean_squared_error')
clf.fit(features_train, target_train)
print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.6f for %r"% ((mean*-1)** 0.5, params))
print()

cv_RMSE_DTR_ordinal = (max(means)*-1) ** 0.5

# Tuning hyper-parameters for root_mean_squared_error



KeyboardInterrupt: 
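
The search was interrupted before it finished. A rough way to judge the cost up front is to count the model fits: every parameter combination is trained once per cross-validation fold. The sketch below assumes 5 folds (the scikit-learn default since version 0.22; earlier versions used 3).

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'n_estimators': list(range(1, 10)),
    'max_depth': list(range(1, 10)),
    'random_state': [12345],
}
cv_folds = 5  # assumption; depends on the scikit-learn version and the cv argument

# One fit per parameter combination per fold (the final refit is not counted).
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)
```

Passing `n_jobs=-1` to `GridSearchCV`, or switching to `RandomizedSearchCV` with a small `n_iter`, are common ways to cut this cost.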

### CatBoostRegressor

In [None]:
regressor = CatBoostRegressor() 
hyperparams = [{'learning_rate':[0.1, 0.5, 0.8],
                'random_state':[12345],
                'verbose':[False]}]

print('# Tuning hyper-parameters for root_mean_squared_error')
print()
clf = GridSearchCV(regressor, hyperparams, scoring='neg_mean_squared_error')
clf.fit(features_train, target_train)
print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.6f for %r"% ((mean*-1)** 0.5, params))
print()

cv_RMSE_CBR_ordinal = (max(means)*-1) ** 0.5

In [34]:
start_time = time.process_time()
model = CatBoostRegressor(iterations=2, learning_rate=1, depth=2)
model.fit(features_train, target_train)
cbr_time_train = time.process_time() - start_time
print(cbr_time_train)

0:	learn: 3676.3805287	total: 57.8ms	remaining: 57.8ms
1:	learn: 3293.6095006	total: 73.5ms	remaining: 0us
3.726341000000019


In [35]:
start_time = time.process_time()
preds = model.predict(features_valid)
cbr_time_pred = time.process_time() - start_time
print(cbr_time_pred)

0.04186999999998875


In [36]:
mse_cbr = mean_squared_error(target_valid, preds)
rmse_cbr = mse_cbr ** 0.5

print("CatBoostRegressor")
print("RMSE =", rmse_cbr)

CatBoostRegressor
RMSE = 3309.700127007489


### LightGBM Regressor

In [37]:
features_lgbm = data.drop('Price', axis=1)
target_lgbm = data['Price']

features_train_lgbm, features_valid_lgbm, target_train_lgbm, target_valid_lgbm = train_test_split(
    features_lgbm, target_lgbm, test_size=0.25, random_state=12345)

In [None]:
regressor = LGBMRegressor() 
hyperparams = [{'num_leaves':[31, 100, 200], 
                'learning_rate':[0.1, 0.3, 0.5],
                'random_state':[12345]}]

print('# Tuning hyper-parameters for root_mean_squared_error')
print()
clf = GridSearchCV(regressor, hyperparams, scoring='neg_mean_squared_error')
clf.fit(features_train_lgbm, target_train_lgbm)
print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.6f for %r"% ((mean*-1)** 0.5, params))
print()

cv_RMSE_LGBMR = (max(means)*-1) ** 0.5

In [38]:
start_time = time.process_time()
model = LGBMRegressor(learning_rate=0.1, num_leaves=200, random_state=12345)
model.fit(features_train_lgbm, target_train_lgbm)
lgbmr_time_train = time.process_time() - start_time
print(lgbmr_time_train)

60.88244900000001


In [39]:
start_time = time.process_time()
predicted = model.predict(features_valid_lgbm)
lgbmr_time_pred = time.process_time() - start_time
print(lgbmr_time_pred)

1.5533829999999966


In [40]:
mse_lgbmr = mean_squared_error(target_valid_lgbm, predicted)
rmse_lgbmr = mse_lgbmr ** 0.5

print("LightGBM Regressor")
print("RMSE =", rmse_lgbmr)

LightGBM Regressor
RMSE = 1941.856030426121


## Model analysis

In [42]:
index = ['LinearRegression',
         'RandomForestRegressor',
         'CatBoostRegressor',
         'LGBMRegressor']
# named `scores` so the `data` DataFrame from the earlier cells is not overwritten
scores = {'RMSE':[rmse_lr,
                  rmse_rfr,
                  rmse_cbr,
                  rmse_lgbmr],

          'Training time, s':[lr_time_train,
                              rfr_time_train,
                              cbr_time_train,
                              lgbmr_time_train],
          'Prediction time, s':[lr_time_pred,
                                rfr_time_pred,
                                cbr_time_pred,
                                lgbmr_time_pred]}

scores_data = pd.DataFrame(data=scores, index=index)
scores_data['RMSE rating'] = (scores_data['RMSE'].min() /
                              scores_data['RMSE'])
scores_data['Training time rating'] = (scores_data['Training time, s'].min() /
                                       scores_data['Training time, s'])
scores_data['Prediction time rating'] = (scores_data['Prediction time, s'].min() /
                                         scores_data['Prediction time, s'])
scores_data['Total rating'] = (scores_data['RMSE rating'] +
                               scores_data['Training time rating'] +
                               scores_data['Prediction time rating'])
scores_data

Unnamed: 0,RMSE,"Training time, s","Prediction time, s",RMSE rating,Training time rating,Prediction time rating,Total rating
LinearRegression,3146.685175,18.875648,0.254338,0.617112,0.197415,0.164623,0.97915
RandomForestRegressor,2251.732874,173.268091,0.633883,0.862383,0.021506,0.066053,0.949942
CatBoostRegressor,3309.700127,3.726341,0.04187,0.586717,1.0,1.0,2.586717
LGBMRegressor,1941.85603,60.882449,1.553383,1.0,0.061206,0.026954,1.08816


**Conclusion**

In the rating columns, the model with the lowest time or RMSE gets the highest value, 1; every other model is rated as the ratio of the best value in that column to its own value.
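
As a worked example of this rule, the RMSE rating can be recomputed from the RMSE values in the table above. This is a pure-Python sketch; the dictionary name `rmse_by_model` is hypothetical.

```python
# RMSE values copied from the results table.
rmse_by_model = {
    'LinearRegression': 3146.685175,
    'RandomForestRegressor': 2251.732874,
    'CatBoostRegressor': 3309.700127,
    'LGBMRegressor': 1941.85603,
}

best = min(rmse_by_model.values())
# Rating = best value in the column / the model's own value, so the best model gets 1.0.
rating = {name: best / value for name, value in rmse_by_model.items()}
for name, score in rating.items():
    print(f'{name}: {score:.6f}')
```

Because the three ratings are summed with equal weight, a model that is much faster than the rest can outrank a more accurate one.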

By total rating, CatBoostRegressor comes out on top. Keep in mind that it was trained with only 2 iterations at depth 2, so its lead comes entirely from speed, while its RMSE is the worst of the four.

By individual criterion:
- On accuracy, LGBMRegressor leads.
- On speed, CatBoostRegressor wins.

The choice of regressor depends on the specifics of the task.