**Review**
	  
Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! The project was a pleasure to review. Keep up the good work on the next sprint! :)

# Used car price prediction

#### Estimating the market value of a used car from the following features: 
 VehicleType, RegistrationYear, Gearbox, Power, Model, Mileage, RegistrationMonth, FuelType, Brand, NotRepaired, NumberOfPictures and PostalCode. 

## 1. Data preprocessing

In [1]:
import numpy as np 
import pandas as pd 
import sklearn.metrics
import time as t
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.impute import SimpleImputer, KNNImputer 
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, OrdinalEncoder, StandardScaler 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error 
from sklearn.experimental import enable_halving_search_cv 
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, HalvingGridSearchCV
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, ElasticNet, SGDRegressor 
from sklearn import set_config
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Try the local copy first, then fall back to the platform path.
try:
    df = pd.read_csv('car_data.csv')
except FileNotFoundError:
    df = pd.read_csv('/datasets/car_data.csv')

In [3]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


- the date columns (`DateCrawled`, `DateCreated`, `LastSeen`) describe the listing rather than the car and are largely irrelevant to its price, so let's remove them:

In [4]:
df= df.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1)
df.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,NumberOfPictures,PostalCode
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,0,70435
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,0,66954
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,0,90480
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,0,91074
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,0,60437


In [5]:
df.shape

(354369, 13)

In [6]:
df.isna().sum()

Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
NumberOfPictures         0
PostalCode               0
dtype: int64

<div class="alert alert-success">
<b>Reviewer's comment</b>
	  
Alright, the data was loaded and inspected!
	  
</div>

- although rows with at least one missing value make up about 31% of the data (dropping them takes us from 354,369 to 245,814 rows), ~245,000 samples are quite enough for a regression problem, so we can drop missing values:

<div class="alert alert-success">
<b>Reviewer's comment</b>
	  
Okay, that's reasonable!
	  
</div>
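
The per-column counts from `df.isna().sum()` cannot simply be added up, because one row may have gaps in several columns. A minimal sketch of how the share of incomplete rows could be checked before dropping them; the `toy` frame and the `share_of_incomplete_rows` helper are made up for illustration and are not part of the project:

```python
import numpy as np
import pandas as pd


def share_of_incomplete_rows(frame: pd.DataFrame) -> float:
    """Return the fraction of rows that contain at least one missing value."""
    return float(frame.isna().any(axis=1).mean())


# Toy data (hypothetical), standing in for the real car listings.
toy = pd.DataFrame({
    "VehicleType": ["coupe", np.nan, "suv", "small"],
    "Gearbox": ["manual", "manual", np.nan, "auto"],
    "Price": [18300, 480, 9800, 1500],
})

incomplete_share = share_of_incomplete_rows(toy)
print(f"{incomplete_share:.0%} of rows have at least one missing value")
```

On the real data the same helper would be called as `share_of_incomplete_rows(df)` right before `df.dropna()`.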

In [7]:
df= df.dropna().reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245814 entries, 0 to 245813
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              245814 non-null  int64 
 1   VehicleType        245814 non-null  object
 2   RegistrationYear   245814 non-null  int64 
 3   Gearbox            245814 non-null  object
 4   Power              245814 non-null  int64 
 5   Model              245814 non-null  object
 6   Mileage            245814 non-null  int64 
 7   RegistrationMonth  245814 non-null  int64 
 8   FuelType           245814 non-null  object
 9   Brand              245814 non-null  object
 10  NotRepaired        245814 non-null  object
 11  NumberOfPictures   245814 non-null  int64 
 12  PostalCode         245814 non-null  int64 
dtypes: int64(7), object(6)
memory usage: 24.4+ MB


In [8]:
numerical_data= df.select_dtypes(include='number')
numerical_data_statistics = numerical_data.describe(percentiles=[.25, .75]).T
numerical_data_statistics['low_outliers']= numerical_data_statistics['25%'] - 1.5*(numerical_data_statistics['75%']- numerical_data_statistics['25%'])
numerical_data_statistics['high_outliers']= numerical_data_statistics['75%'] + 1.5*(numerical_data_statistics['75%']- numerical_data_statistics['25%'])
numerical_data_statistics

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,low_outliers,high_outliers
Price,245814.0,5125.346717,4717.948673,0.0,1499.0,3500.0,7500.0,20000.0,-7502.5,16501.5
RegistrationYear,245814.0,2002.918699,6.163765,1910.0,1999.0,2003.0,2007.0,2018.0,1987.0,2019.0
Power,245814.0,119.970884,139.387116,0.0,75.0,110.0,150.0,20000.0,-37.5,262.5
Mileage,245814.0,127296.716216,37078.820368,5000.0,125000.0,150000.0,150000.0,150000.0,87500.0,187500.0
RegistrationMonth,245814.0,6.179701,3.479519,0.0,3.0,6.0,9.0,12.0,-6.0,18.0
NumberOfPictures,245814.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PostalCode,245814.0,51463.186002,25838.058847,1067.0,30966.0,50769.0,72379.0,99998.0,-31153.5,134498.5


- as can be seen, we can drop the 'NumberOfPictures' feature: it takes a single value (0) in every row, so it adds no information.
- we can also drop the 'PostalCode' feature: it has little to do with the car itself, and although it is stored as a number it is categorical in nature, so treating it as numeric might throw the estimators off.
- 'RegistrationMonth' also adds little when determining a car's price.
- we can remove outliers as well, using the 1.5×IQR bounds computed in the last two columns of the table.

<div class="alert alert-success">
<b>Reviewer's comment</b>
	  
Excellent points!
	  
</div>
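
The `low_outliers` and `high_outliers` columns in the table above follow the usual 1.5×IQR rule. A small sketch of that arithmetic, first on the Price quartiles reported in the table and then on a toy series; the `tukey_fences` helper and the `power` values are hypothetical, chosen only to illustrate the rule:

```python
import pandas as pd


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds located k * IQR outside the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of Price taken from the describe() table above.
price_low, price_high = tukey_fences(1499.0, 7500.0)
print(price_low, price_high)  # -7502.5 16501.5

# Toy engine-power values (hypothetical) with one implausible entry.
power = pd.Series([60, 75, 90, 110, 130, 150, 170, 20000])
low, high = tukey_fences(power.quantile(0.25), power.quantile(0.75))
inside = power[power.between(low, high)]
print(len(power) - len(inside), "value(s) fall outside the fences")
```

When the lower fence is negative, as it is for Price and Power, only the upper fence can exclude anything, which is why the filter in the next cells only cuts those two features from above.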

In [9]:
df= df.drop(['NumberOfPictures', 'PostalCode', 'RegistrationMonth'], axis=1)

In [10]:
outliers= df[(df['Price']>16501.5) | (df['RegistrationYear']<1987) | (df['Power']> 262.5) | (df['Mileage']<87500)]
df= df.drop(outliers.index).reset_index(drop=True)
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196583 entries, 0 to 196582
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Price             196583 non-null  int64 
 1   VehicleType       196583 non-null  object
 2   RegistrationYear  196583 non-null  int64 
 3   Gearbox           196583 non-null  object
 4   Power             196583 non-null  int64 
 5   Model             196583 non-null  object
 6   Mileage           196583 non-null  int64 
 7   FuelType          196583 non-null  object
 8   Brand             196583 non-null  object
 9   NotRepaired       196583 non-null  object
dtypes: int64(4), object(6)
memory usage: 15.0+ MB


<div class="alert alert-success">
<b>Reviewer's comment</b>
	  
Great, outliers were detected and removed!
	  
</div>

In [11]:
X= df.drop('Price', axis=1)
y= df['Price']

In [12]:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.3, random_state=42)

<div class="alert alert-success">
<b>Reviewer's comment</b>
	  
The data was split into train and test sets. The proportion is reasonable.
	  
</div>

In [13]:
print(X_train.shape, y_train.shape)

(137608, 9) (137608,)


Setting up the preprocessing Pipeline:

In [14]:
numerical_features= X_train.select_dtypes(include='number').columns.to_list()
numerical_features

['RegistrationYear', 'Power', 'Mileage']

In [15]:
categorical_features= X_train.select_dtypes(exclude='number').columns.to_list()
categorical_features

['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

In [16]:
numeric_pipeline = Pipeline(steps=[
    ('scale', StandardScaler())
])


categorical_pipeline = Pipeline(steps=[
    ('one-hot', OneHotEncoder(handle_unknown='ignore', drop='if_binary'))
])

In [17]:
preprocessor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])

<div class="alert alert-success">
<b>Reviewer's comment</b>
	  
Great use of pipelines and ColumnTransformer!
    
</div>
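
The `handle_unknown='ignore'` option matters as soon as the test set contains a category the encoder never saw while fitting. A minimal sketch of that behaviour with made-up fuel types (the arrays are hypothetical, and `drop='if_binary'` is left out to keep the example small):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during fitting (hypothetical values).
train_fuel = np.array([["petrol"], ["gasoline"], ["lpg"]])
# "hydrogen" never appears in the training data.
new_fuel = np.array([["petrol"], ["hydrogen"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_fuel)

# Columns follow the sorted categories: gasoline, lpg, petrol.
encoded = encoder.transform(new_fuel).toarray()
print(encoder.categories_[0])
print(encoded)
```

The unseen category is encoded as a row of zeros instead of raising an error, so prediction on new listings does not fail when a rare model or brand shows up.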

## 2. Model training and comparison

#### Linear Regression

In [18]:
linear_model= LinearRegression()
linear_pipeline= Pipeline(steps=[('preprocessing', preprocessor), ('linear_regression', linear_model)])

In [19]:
set_config(display='diagram')
linear_pipeline

In [20]:
start = t.time()
linear_scores= cross_validate(linear_pipeline , X_train, y_train, cv=5, scoring='neg_mean_squared_error')
end= t.time()
print('runtime:', round((end - start),1))

runtime: 22.3


In [21]:
rmse_linear= np.sqrt(-linear_scores['test_score'].mean()).round(1)
rmse_linear

1920.1
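
With `scoring='neg_mean_squared_error'` the fold scores are negative MSE values, so the cell above flips the sign, averages the folds and only then takes the square root. A short sketch with made-up fold scores shows that this is not the same number as the average of per-fold RMSE values (the scores below are hypothetical, not taken from the run above):

```python
import numpy as np

# Hypothetical per-fold scores, as returned for 'neg_mean_squared_error'.
neg_mse_per_fold = np.array([-4.0, -9.0, -16.0])

# What the notebook computes: square root of the mean MSE.
rmse_of_mean_mse = float(np.sqrt(-neg_mse_per_fold.mean()))

# Alternative: compute RMSE per fold, then average.
mean_of_fold_rmse = float(np.sqrt(-neg_mse_per_fold).mean())

print(round(rmse_of_mean_mse, 3), round(mean_of_fold_rmse, 3))
```

The first value is never smaller than the second. Either convention works for comparing models as long as the same one is used for all of them, as is the case in this notebook.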

In [22]:
linear_pipeline.fit(X_train, y_train)
linearRegression_prediction= linear_pipeline.predict(X_test)
test_score_linearRegression= np.sqrt(mean_squared_error(y_test, linearRegression_prediction)).round(1)
test_score_linearRegression

1921.9

#### ElasticNet

In [23]:
elastic_model= ElasticNet(random_state=42)
elastic_pipeline= Pipeline(steps=[('preprocessing', preprocessor), ('elastic_model', elastic_model)])

In [24]:
#elastic_pipeline.get_params().keys()

In [25]:
params= dict(elastic_model__l1_ratio= [.1, .5, .7, .9, .95, .99, 1], elastic_model__alpha= [0.5, 1, 1.5], elastic_model__max_iter= [10000])

In [26]:
elastic_pipeline

In [27]:
start = t.time()
elasticNet= HalvingGridSearchCV(elastic_pipeline, param_grid= params, cv=3, scoring='neg_mean_squared_error', random_state=42)
elasticNet.fit(X_train, y_train)
end= t.time()
print('runtime:', round((end - start), 1))

runtime: 255.0


In [28]:
elasticNet_score= np.sqrt(-elasticNet.best_score_).round(1)
elasticNet_score

1939.5

In [29]:
elasticNet.best_params_

{'elastic_model__alpha': 0.5,
 'elastic_model__l1_ratio': 1,
 'elastic_model__max_iter': 10000}

In [30]:
elasticNet_prediction= elasticNet.predict(X_test)
test_score_elasticNet= np.sqrt(mean_squared_error(y_test, elasticNet_prediction)).round(1)
test_score_elasticNet

1938.5

#### RandomForestRegressor

In [31]:
randomForest_model= RandomForestRegressor(random_state=42)

In [32]:
randomForest_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('randomForest_model', randomForest_model)
])

In [33]:
#randomForest_pipeline.get_params().keys()

In [34]:
params= dict(randomForest_model__n_estimators=[50, 100], randomForest_model__max_depth= [3,5])

In [35]:
randomForest_pipeline

In [36]:
start = t.time()
randomForest= HalvingGridSearchCV(randomForest_pipeline, param_grid= params, cv=3, scoring='neg_mean_squared_error', random_state=42)
randomForest.fit(X_train, y_train)
end= t.time()
print('runtime:', round((end - start), 1))

runtime: 150.3


In [37]:
#sklearn.metrics.get_scorer_names()  # replaces sklearn.metrics.SCORERS, which was removed in scikit-learn 1.3

In [38]:
randomForest_score= np.sqrt(-randomForest.best_score_).round(1)
randomForest_score

1884.5

In [39]:
randomForest.best_params_

{'randomForest_model__max_depth': 5, 'randomForest_model__n_estimators': 100}

In [40]:
randomForest_prediction= randomForest.predict(X_test)
test_score_randomForest= np.sqrt(mean_squared_error(y_test, randomForest_prediction)).round(1)
test_score_randomForest

1870.0

#### AdaBoostRegressor

In [41]:
adaBoost_model= AdaBoostRegressor(random_state=42)

In [42]:
adaBoost_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('adaBoost_model', adaBoost_model)
])

In [43]:
#adaBoost_pipeline.get_params().keys()

In [44]:
params= dict(adaBoost_model__n_estimators=[50, 100], adaBoost_model__learning_rate= [ 0.01, 0.2])

In [45]:
adaBoost_pipeline

In [46]:
start = t.time()
adaBoost= HalvingGridSearchCV(adaBoost_pipeline, param_grid= params, cv=3, scoring='neg_mean_squared_error', random_state=42)
adaBoost.fit(X_train, y_train)
end= t.time()
print('runtime:', round((end - start), 1))

runtime: 132.0


In [47]:
adaBoost_score= np.sqrt(-adaBoost.best_score_).round(1)
adaBoost_score

2125.8

In [48]:
adaBoost.best_params_

{'adaBoost_model__learning_rate': 0.2, 'adaBoost_model__n_estimators': 50}

In [49]:
adaBoost_prediction= adaBoost.predict(X_test)
test_score_adaBoost= np.sqrt(mean_squared_error(y_test, adaBoost_prediction)).round(1)
test_score_adaBoost

2139.0

#### GradientBoostingRegressor

In [50]:
gradB_model= GradientBoostingRegressor(random_state=42)

In [51]:
gradB_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('gradB_model', gradB_model)
])

In [52]:
#gradB_pipeline.get_params().keys()

In [53]:
params= dict(gradB_model__n_estimators= [100, 200] )

In [54]:
gradB_pipeline

In [55]:
start = t.time()
gradientBoosting= HalvingGridSearchCV(gradB_pipeline, param_grid= params, cv=3, scoring='neg_mean_squared_error', random_state=42)
gradientBoosting.fit(X_train, y_train)
end= t.time()
print('runtime:', round((end - start), 1))

runtime: 96.4


In [56]:
gradientBoosting_score= np.sqrt(-gradientBoosting.best_score_).round(1)
gradientBoosting_score

1491.4

In [57]:
gradientBoosting.best_params_

{'gradB_model__n_estimators': 200}

In [58]:
gradientBoosting_prediction= gradientBoosting.predict(X_test)
test_score_gradientBoosting= np.sqrt(mean_squared_error(y_test, gradientBoosting_prediction)).round(1)
test_score_gradientBoosting

1495.8

## 3. Conclusion

Score performance:

- several estimators and hyperparameters were tested for predicting used-car prices: Linear regression, ElasticNet regression, RandomForest regression, AdaBoost regression and GradientBoosting regression. 
- Linear regression was chosen as the baseline estimator; its test rmse of 1921.9 is about 48% of the mean price (4002.5).
- 2 models improved on this value: RandomForest regression (rmse= 1870.0) and GradientBoosting regression (rmse= 1495.8). 
- gradient boosting yielded the best predictor, with the lowest rmse value.  
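
The comparison can be made explicit by expressing each test rmse relative to the mean price and to the linear baseline. The sketch below only reuses the numbers reported in this notebook; the variable names are illustrative:

```python
mean_price = 4002.5  # mean price quoted in the conclusion

# Test-set rmse values reported in section 2.
test_rmse = {
    "LinearRegression": 1921.9,
    "ElasticNet": 1938.5,
    "RandomForest": 1870.0,
    "AdaBoost": 2139.0,
    "GradientBoosting": 1495.8,
}

baseline = test_rmse["LinearRegression"]
summary = {
    name: (rmse / mean_price, (baseline - rmse) / baseline)
    for name, rmse in test_rmse.items()
}

for name, (share_of_mean, gain) in summary.items():
    print(f"{name:<18} rmse/mean = {share_of_mean:.1%}   vs baseline = {gain:+.1%}")

best_model = min(test_rmse, key=test_rmse.get)
print("lowest rmse:", best_model)
```

A positive value in the last column means the model beats the linear baseline; a negative one means it does worse.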

Runtime performance: 
- in terms of runtime, linear regression was the fastest (22.3 seconds for 5-fold cross-validation) because it involved no hyperparameter tuning; among the tuned models, gradient boosting regression had the shortest search time (96.4 seconds). 

overall: GradientBoosting regression gave the lowest rmse and, among the tuned models, the shortest runtime, so it offers the best trade-off between the two.

<div class="alert alert-warning">
<b>Reviewer's comment</b>
	  
Excellent work! You tried several different models, did some hyperparameter tuning with cross-validation and analyzed the models' performance both in terms of their error and runtime.
    
One thing I would suggest is to measure the time it takes to train only the model with fixed hyperparameters, because otherwise the runtime of a grid search depends on how many combinations we're considering. Also, it would be nice to separately measure the time it takes to make predictions for each model (this is probably an even more important metric, because the model is used to make predictions much more often than it is trained).
	  
</div>
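
One possible way to follow this suggestion is to time `fit` and `predict` separately for a single model with fixed hyperparameters. The sketch below uses synthetic data and a plain `LinearRegression`; the `time_fit_and_predict` helper and the data are assumptions made for illustration, not part of the project:

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression


def time_fit_and_predict(model, X_train, y_train, X_test):
    """Return (fit_seconds, predict_seconds, predictions) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds, predictions


# Synthetic regression data (hypothetical), split 70/30 like the project.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

fit_s, predict_s, preds = time_fit_and_predict(
    LinearRegression(), X[:700], y[:700], X[700:]
)
print(f"fit: {fit_s:.4f} s, predict: {predict_s:.4f} s")
```

In the notebook, the same helper could be given each pipeline with its best hyperparameters already set, for example a clone of `gradientBoosting.best_estimator_`. The search objects also expose `refit_time_`, the time spent refitting the best candidate on the whole training set, which gives a quick estimate of the training time alone.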