Hello, my name is Artem. I'm going to review your project!

You can find my comments in <font color='green'>green</font>, <font color='blue'>blue</font> or <font color='red'>red</font> boxes like this:

<div class="alert alert-block alert-success">
<b>Success:</b> if everything is done succesfully
</div>

<div class="alert alert-block alert-info">
<b>Improve: </b> "Improve" comments mean that there are tiny corrections that could help you to make your project better.
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> if the block requires some corrections. Work can't be accepted with the red comments.
</div>

### <font color='orange'>General feedback</font>
* You've worked really hard and submitted a solid project.
* Thank you for structuring the project. It's a pleasure to check such notebooks.
* There are a couple of things that need to be done before your project is complete, but they're pretty straightforward.
* There are few things I'd like you to check. They're not mistakes, but your project could be improved if you correct them.
* I believe you can easily fix it! Good luck!

### <font color='orange'>General feedback (review 2)</font>
* I really appreciate the corrections you sent in! Thanks for taking the time to do so.
* You've fixed a lot of bugs. Good job!
* All new comments are marked with "review 2" keyword.
* Keep working on it, you are improving!

### <font color='orange'>General feedback (review 3)</font>
* Your corrections look great, you've improved your work significantly!
* Your project has passed code review. Congratulations!
* Keep up the good work, and good luck on the next sprint!

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

# 1. Data preparation

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor


<div class="alert alert-block alert-info">
<b>Improve: </b> Please collect all imports in the first cell.
</div>

Let's take a quick view of data columns

In [2]:
df = pd.read_csv('/datasets/car_data.csv')
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [3]:
print('Rows: {} | Columns: {}'.format(df.shape[0], df.shape[1]))
df.info()

Rows: 354369 | Columns: 16
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Mileage              354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB


Note that we got date features, so we should adecuate them to a datetime type. But, the DateCrawled it is not covenient to work with it, cause it is not essential for our study

So, we decided to delete columns related to dates, because 
they do not contribute anything

<div class="alert alert-block alert-success">
<b>Success:</b> Data loading and initial analysis were done well.
</div>

In [4]:
del df['DateCreated']
del df['LastSeen']
del df['DateCrawled']
del df['PostalCode']

Now, we check which of our features are categorical

<div class="alert alert-block alert-info">
<b>Improve: </b> We can just drop these columns because they are useless.
</div>

In [5]:
cat_cols = df.select_dtypes(include=['object']).columns
num_cols = df.select_dtypes(exclude=['object']).columns

print(df[cat_cols].columns)
categorical_columns = df[cat_cols]

Index(['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired'], dtype='object')


We need to be sure of he amoun of NaN or missing values in our dataset

In [6]:
missing_values = df.isnull().sum()
total = missing_values.sort_values(ascending=True)
# Percentage  
percent = (missing_values / len(df.index)*100).round(2).sort_values(ascending=True)
table_missing = pd.concat([total, percent], axis=1, keys=['Number of Nulls', 'Percentage of Nulls'])
table_missing

Unnamed: 0,Number of Nulls,Percentage of Nulls
Price,0,0.0
RegistrationYear,0,0.0
Power,0,0.0
Mileage,0,0.0
RegistrationMonth,0,0.0
Brand,0,0.0
NumberOfPictures,0,0.0
Model,19705,5.56
Gearbox,19833,5.6
FuelType,32895,9.28


A few of categorical features requires a special attention. this is the case of VehicleType, NotRepaired

In [7]:
df[['VehicleType', 'NotRepaired']].describe()

Unnamed: 0,VehicleType,NotRepaired
count,316879,283215
unique,8,2
top,sedan,no
freq,91457,247161


In [8]:
df['VehicleType'].value_counts()

sedan          91457
small          79831
wagon          65166
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: VehicleType, dtype: int64

In [9]:
df['NotRepaired'].value_counts()

no     247161
yes     36054
Name: NotRepaired, dtype: int64

we will include in another group others to NaN values in catehorical variable VehicleType and NotRepaired

In [10]:
df['VehicleType']= df['VehicleType'].replace(np.nan, 'other')
df['VehicleType'].isna().sum()

0

In [11]:
df['NotRepaired']= df['NotRepaired'].fillna('unknown')
df['NotRepaired'].isna().sum()

0

Casually our variables that contain NaN values are only the categorical features. There are cases were we dont care about them, and we cam simply undo those values (percentaje lower than 10%).

In [12]:
df.dropna(inplace=True)

<div class="alert alert-block alert-success">
<b>Success:</b> NaN values processing was done correctly.
</div>

<div class="alert alert-block alert-info">
<b>Improve: </b> There is one constant column. Please find and drop it. Also, some columns contain some strange values. It would be better if you've made a small EDA and drop them.
</div>

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [14]:
target = df['Price']
features = df.drop(['Price'], axis=1)
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=12345)
print(features_train.shape)
print(features_test.shape)


(224421, 11)
(74807, 11)


<div class="alert alert-block alert-info">
<b>Improve: </b> It's more correct to name features_test instead of features_valid.
</div>

It is time to prepare data for our models. Adapts our features for our linear regression model

In [15]:
numercial_feat = [col for col in df.columns if col not in df[cat_cols].columns and col!= 'Price' ]
scaler = StandardScaler()
scaler.fit(features_train[numercial_feat])
features_train[numercial_feat] = scaler.transform(features_train[numercial_feat])
features_test[numercial_feat] = scaler.transform(features_test[numercial_feat])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the 

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 299228 entries, 0 to 354368
Data columns (total 12 columns):
Price                299228 non-null int64
VehicleType          299228 non-null object
RegistrationYear     299228 non-null int64
Gearbox              299228 non-null object
Power                299228 non-null int64
Model                299228 non-null object
Mileage              299228 non-null int64
RegistrationMonth    299228 non-null int64
FuelType             299228 non-null object
Brand                299228 non-null object
NotRepaired          299228 non-null object
NumberOfPictures     299228 non-null int64
dtypes: int64(6), object(6)
memory usage: 29.7+ MB


<div class="alert alert-block alert-success">
<b>Success:</b> Glad to see that scaler was fitted only on train part of the data.
</div>

# 2. Model training

In [30]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import lightgbm as lgb
import xgboost as xgb
import time

We should tunning hyperparameters of our models. The model we have choosen are RandomForest, XGBoost, CatBoost and LightGBM. In case of lienar regression we only use it as a sanity check for the rest of models and check if are good enought. remember one of our objetives is to obtain a good quality of the model.

<div class="alert alert-block alert-danger">
<b>RandomForest<b>
</div>

Before training the model, we should convert categories to integer for our model regression

In [18]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for feature in cat_cols:
    features_train[feature] = le.fit_transform(features_train[feature])
    features_test[feature] = le.transform(features_test[feature])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


For RandomForest Classifier we try different numbers of n_estimators 

In [25]:
# Establish model
model = RandomForestRegressor(n_jobs=-1)
# Try different numbers of n_estimators 
estimators = np.arange(10, 80, 10)
rmse = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(features_train, target_train)
    predictions = model.predict(features_test)
    rmse.append(np.sqrt(mean_squared_error(target_test,predictions)))
    
rmse

[1744.9388140597164,
 1706.1614827587496,
 1693.9975598036142,
 1689.8570490709985,
 1687.7482926096066,
 1680.8697001150522,
 1679.438743075741]

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> Please don't forget that parameters tuning and testing should be done on different datasets. Another way to solve this problem is to use cross-validation-based methods.
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing (review 2):</b> You're using classifier but it's regression task. This is what caused the error.
</div>

### XGBoost

We tuning hyperparameters with number boost round

In [35]:
train_data = xgb.DMatrix(features_train, label=target_train)
test_data = xgb.DMatrix(features_test, label=target_test)
params = {
    # Parameters that we are going to tune.
    'max_depth':6,
    'min_child_weight': 1,
    'eta':.3,
    'subsample': 1,
    'colsample_bytree': 1,
    # Other parameters
    'objective':'reg:linear',
}
params['eval_metric'] = "rmse"
num_boost_round = 30
model = xgb.train(
    params,
    train_data,
    num_boost_round=num_boost_round,
    evals=[(test_data, "Test")],
    early_stopping_rounds=5
)

print("Best MAE: {:.2f} with {} rounds".format(
                 model.best_score,
                 model.best_iteration+1))

Exception ignored in: <function Booster.__del__ at 0x7f62bf5ad0e0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/xgboost/core.py", line 957, in __del__
    if self.handle is not None:
AttributeError: 'Booster' object has no attribute 'handle'
  if getattr(data, 'base', None) is not None and \


[0]	Test-rmse:4938.11
Will train until Test-rmse hasn't improved in 5 rounds.
[1]	Test-rmse:3825.67
[2]	Test-rmse:3107.84
[3]	Test-rmse:2668.63
[4]	Test-rmse:2401.23
[5]	Test-rmse:2236.01
[6]	Test-rmse:2130.63
[7]	Test-rmse:2058.04
[8]	Test-rmse:2001.86
[9]	Test-rmse:1972.23
[10]	Test-rmse:1953.79
[11]	Test-rmse:1928.61
[12]	Test-rmse:1911.95
[13]	Test-rmse:1887.94
[14]	Test-rmse:1877.51
[15]	Test-rmse:1871.6
[16]	Test-rmse:1865.42
[17]	Test-rmse:1860.26
[18]	Test-rmse:1854.84
[19]	Test-rmse:1848.33
[20]	Test-rmse:1842.27
[21]	Test-rmse:1838.74
[22]	Test-rmse:1830.82
[23]	Test-rmse:1827.37
[24]	Test-rmse:1824.11
[25]	Test-rmse:1817
[26]	Test-rmse:1814.51
[27]	Test-rmse:1812.52
[28]	Test-rmse:1809
[29]	Test-rmse:1808.58
Best MAE: 1808.58 with 30 rounds


### LightGBM

Previocsly we transform the categorical features into a type 'category

In [28]:
for c in df.columns:
    col_type = df[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        df[c] = df[c].astype('category')
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 224421 entries, 309829 to 257767
Data columns (total 11 columns):
VehicleType          224421 non-null int64
RegistrationYear     224421 non-null float64
Gearbox              224421 non-null int64
Power                224421 non-null float64
Model                224421 non-null int64
Mileage              224421 non-null float64
RegistrationMonth    224421 non-null float64
FuelType             224421 non-null int64
Brand                224421 non-null int64
NotRepaired          224421 non-null int64
NumberOfPictures     224421 non-null float64
dtypes: float64(5), int64(6)
memory usage: 20.5 MB


In [37]:
train_data = lgb.Dataset(features_train, label=target_train)
test_data = lgb.Dataset(features_test, label=target_test)
parameters = {

            'metric': 'rmse',
            'is_unbalance': 'true',
            'boosting': 'gbdt',
            'num_leaves': 30,
            'feature_fraction': 0.5,
            'bagging_fraction': 0.5,
            'bagging_freq': 20,
            'learning_rate': .05,
            'verbose': 0
        }
model = lgb.train(parameters,
               train_data,
               valid_sets=test_data,
               num_boost_round=50,
               early_stopping_rounds=50)

print( "Best MAE: {} ".format(model.best_score))

[1]	valid_0's rmse: 4492.4
Training until validation scores don't improve for 50 rounds
[2]	valid_0's rmse: 4410.68
[3]	valid_0's rmse: 4333.53
[4]	valid_0's rmse: 4183.23
[5]	valid_0's rmse: 4044.05
[6]	valid_0's rmse: 3977.73
[7]	valid_0's rmse: 3916.04
[8]	valid_0's rmse: 3806.38
[9]	valid_0's rmse: 3751.19
[10]	valid_0's rmse: 3642.82
[11]	valid_0's rmse: 3534.8
[12]	valid_0's rmse: 3433.14
[13]	valid_0's rmse: 3389.95
[14]	valid_0's rmse: 3297.92
[15]	valid_0's rmse: 3215.57
[16]	valid_0's rmse: 3142.35
[17]	valid_0's rmse: 3066.33
[18]	valid_0's rmse: 3011.81
[19]	valid_0's rmse: 2967.5
[20]	valid_0's rmse: 2908.32
[21]	valid_0's rmse: 2861.5
[22]	valid_0's rmse: 2801.75
[23]	valid_0's rmse: 2745.88
[24]	valid_0's rmse: 2700.36
[25]	valid_0's rmse: 2656.57
[26]	valid_0's rmse: 2618.07
[27]	valid_0's rmse: 2594.88
[28]	valid_0's rmse: 2559.83
[29]	valid_0's rmse: 2520.19
[30]	valid_0's rmse: 2483.15
[31]	valid_0's rmse: 2448.8
[32]	valid_0's rmse: 2416.26
[33]	valid_0's rmse: 2386

<div class="alert alert-block alert-danger">

<b>Needs fixing:</b> LightGBM model have internal method of categorical features encoding. All you need is to change type of categorical features like this: `.astype('category')`.

</div>

<div class="alert alert-block alert-success">
<b>Success (review 2):</b> Well done!
</div>

# 3. Model analysis

After tuning our models, we should evaluate our models qith RMSE metric

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> Please compare the models by the time required for training and predicting. Which one is more important from your point of view?
</div>

<div class="alert alert-block alert-info">
<b>Improve (review 2): </b> It would be better if you've measure time 2 times for each algorithm. Separately time for fitting the model and time for predicting.
</div>

## Linear regression

In [26]:
linear_model = LinearRegression()
linear_model.fit(features_train, target_train)
predicted_test = linear_model.predict(features_test)
rmse= np.sqrt(mean_squared_error(target_test,predicted_test))
print("RMSE: %f" % (rmse))

RMSE: 3629.631833


In [32]:
start_time=time.time()
forest = RandomForestRegressor(n_estimators= 80, random_state=12345)
forest.fit(features_train, target_train)
predictions = model.predict(features_test)
rmse= np.sqrt(mean_squared_error(target_test, predictions))
print("RMSE: %f" % (rmse))
print(time.time()-start_time)

RMSE: 2108.881847
69.20774817466736


## XGBoost

In [36]:
start_time=time.time()
num_boost_round = model.best_iteration + 1
xg_reg = xgb.train(params, train_data, num_boost_round)
preds = xg_reg.predict(test_data)
rmse = np.sqrt(mean_squared_error(target_test,preds))
print("RMSE: %f" % (rmse))
print(time.time()-start_time)

RMSE: 1808.585640
19.394006490707397


## LightGBM

In [38]:
start_time=time.time()
pred = model.predict(features_test)
rmse = np.sqrt(mean_squared_error(target_test,pred))
print("RMSE: %f" % (rmse))
print(time.time()-start_time)

RMSE: 2108.881847
0.33984804153442383


## Conclusion

As a final resutl, RandomForest, XGboost And LightGBM perfom well as our sanity check specify (RMSE lower than Linear Regression). but if we take a look of the ebst one we would choose the XGBoost model, but efficiently is not the fastest. the one which mmets this conditions is the LightGBM (a few seconds). The worst one would be the randomForest, because it takes a lot of time to finish and its metric is not the best one. 

## Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed