Hello!

My name is Dmitry. I'm glad to review your work today.
I will mark your mistakes and give you some hints on how to fix them. We are getting ready for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and enjoy studying!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.
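
The three criteria above can all be measured directly. Below is a minimal sketch, assuming a hypothetical `timed` helper and a toy `MeanRegressor` stand-in (neither is part of the project); any model with `fit` and `predict` methods could be passed instead.

```python
import time


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


class MeanRegressor:
    """Hypothetical stand-in model: always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# Toy data, for illustration only (not taken from car_data)
X_demo = [[1], [2], [3], [4]]
y_demo = [100.0, 200.0, 300.0, 400.0]

model = MeanRegressor()
_, fit_seconds = timed(model.fit, X_demo, y_demo)            # training time
predictions, predict_seconds = timed(model.predict, X_demo)  # prediction speed

# Prediction quality: RMSE computed by hand for the toy predictions
rmse = (sum((p - t) ** 2 for p, t in zip(predictions, y_demo)) / len(y_demo)) ** 0.5

print(f"fit: {fit_seconds:.6f}s, predict: {predict_seconds:.6f}s, RMSE: {rmse:.2f}")
```

Wrapping each model's `fit` and `predict` calls the same way would give comparable timings, provided every model is trained on the same amount of data.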

## Data preparation

In [32]:
import pandas as pd
import numpy as np
import sklearn

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer


In [33]:
car_data = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/car_data.csv')

In [34]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [35]:
print(car_data.head())

        DateCrawled  Price VehicleType  RegistrationYear Gearbox  Power  \
0  24/03/2016 11:52    480         NaN              1993  manual      0   
1  24/03/2016 10:58  18300       coupe              2011  manual    190   
2  14/03/2016 12:52   9800         suv              2004    auto    163   
3  17/03/2016 16:54   1500       small              2001  manual     75   
4  31/03/2016 17:25   3600       small              2008  manual     69   

   Model  Mileage  RegistrationMonth  FuelType       Brand NotRepaired  \
0   golf   150000                  0    petrol  volkswagen         NaN   
1    NaN   125000                  5  gasoline        audi         yes   
2  grand   125000                  8  gasoline        jeep         NaN   
3   golf   150000                  6    petrol  volkswagen          no   
4  fabia    90000                  7  gasoline       skoda          no   

        DateCreated  NumberOfPictures  PostalCode          LastSeen  
0  24/03/2016 00:00               

In [36]:
# Fill missing values in categorical columns with their respective modes
categorical_columns = car_data.select_dtypes(include=['object']).columns

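for column in categorical_columns:
    # Assign the filled column back; fillna(inplace=True) on a column selection
    # is chained assignment and may not update car_data in recent pandas versions
    car_data[column] = car_data[column].fillna(car_data[column].mode()[0])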

# Check if there are any missing values left
missing_values_after_fill = car_data.isnull().sum()

missing_values_after_fill


DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

OK, but in this case the best option is to fill NaNs with a dummy value.
</div>

<div class="alert alert-block alert-info">

<b>Student answer.</b> <a class="tocSkip"></a>

Is a dummy value a placeholder like "Unknown", or is it encoded as 1 and 0? </div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

For categorical columns, it's better to use placeholders like "no_data" or "unknown".
</div>
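
The placeholder approach described in the comment above could look roughly like this. It is a minimal sketch on a toy frame; the `demo` data and the `"unknown"` label are illustrative assumptions, not values from the project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for car_data (values are illustrative only)
demo = pd.DataFrame({
    "VehicleType": ["sedan", np.nan, "suv"],
    "NotRepaired": [np.nan, "yes", "no"],
    "Power": [105, 190, 163],
})

# Categorical columns listed explicitly to keep the sketch simple
categorical_columns = ["VehicleType", "NotRepaired"]

# Replace missing categories with an explicit placeholder instead of the mode,
# so the model can treat "value not provided" as its own category
demo[categorical_columns] = demo[categorical_columns].fillna("unknown")

print(demo)
```

Unlike mode imputation, this keeps the information that a value was missing, which may itself correlate with price.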

In [38]:
# Handle outliers for RegistrationYear
car_data.loc[(car_data['RegistrationYear'] < 1950) | (car_data['RegistrationYear'] > 2022), 'RegistrationYear'] = car_data['RegistrationYear'].median()

# Handle outliers for Power
car_data.loc[(car_data['Power'] < 1) | (car_data['Power'] > 1000), 'Power'] = car_data['Power'].median()

# Check the updated statistics for RegistrationYear and Power
updated_stats = car_data[['RegistrationYear', 'Power']].describe()

updated_stats


       RegistrationYear          Power
count     354369.000000  354369.000000
mean        2003.126176     118.552749
std            7.299733      52.004019
min         1950.000000       1.000000
25%         1999.000000      84.000000
50%         2003.000000     105.000000
75%         2008.000000     141.000000
max         2019.000000    1000.000000


In [39]:
car_data.head()

        DateCrawled  Price VehicleType  RegistrationYear Gearbox  Power  \
0  24/03/2016 11:52    480       sedan            1993.0  manual  105.0
1  24/03/2016 10:58  18300       coupe            2011.0  manual  190.0
2  14/03/2016 12:52   9800         suv            2004.0    auto  163.0
3  17/03/2016 16:54   1500       small            2001.0  manual   75.0
4  31/03/2016 17:25   3600       small            2008.0  manual   69.0

   Model  Mileage  RegistrationMonth  FuelType       Brand NotRepaired  \
0   golf   150000                  0    petrol  volkswagen          no
1   golf   125000                  5  gasoline        audi         yes
2  grand   125000                  8  gasoline        jeep          no
3   golf   150000                  6    petrol  volkswagen          no
4  fabia    90000                  7  gasoline       skoda          no

        DateCreated  NumberOfPictures  PostalCode          LastSeen
0  24/03/2016 00:00                 0       70435  07/04/2016 03:16
1  24/03/2016 00:00                 0       66954  07/04/2016 01:46
2  14/03/2016 00:00                 0       90480  05/04/2016 12:47
3  17/03/2016 00:00                 0       91074  17/03/2016 17:40
4  31/03/2016 00:00                 0       60437  06/04/2016 10:17


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good job.
</div>

In [40]:
# Check the number of unique values in each categorical column
unique_values_count = car_data[categorical_columns].nunique().sort_values(ascending=False)

unique_values_count


LastSeen       18592
DateCrawled    15470
Model            250
DateCreated      109
Brand             40
VehicleType        8
FuelType           7
Gearbox            2
NotRepaired        2
dtype: int64

In [41]:
threshold = 10

# Identify high cardinality columns based on the threshold
high_cardinality_cols = unique_values_count[unique_values_count > threshold].index.tolist()
categorical_columns = car_data.select_dtypes(include=['object']).columns
high_cardinality_cols

['LastSeen', 'DateCrawled', 'Model', 'DateCreated', 'Brand']

In [42]:
label_encoders = {}
for column in high_cardinality_cols:
    le = LabelEncoder()
    car_data[column] = le.fit_transform(car_data[column])
    label_encoders[column] = le


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Pro tip: to avoid data leakage, it's better to fit the encoder only on the training data.
</div>
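
One way to follow the tip above is sketched below. It swaps `LabelEncoder` for scikit-learn's `OrdinalEncoder`, which is designed for feature columns and can map categories unseen during fitting to a reserved value. The toy data, the manual split, and the `-1` code are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for car_data (values are illustrative only)
demo = pd.DataFrame({
    "Brand": ["audi", "jeep", "skoda", "audi", "volkswagen", "jeep"],
    "Price": [18300, 9800, 3600, 15000, 1500, 8700],
})

# Manual split to keep the example deterministic; in the project this would
# be the output of train_test_split
X_train, X_test = demo.iloc[:4][["Brand"]].copy(), demo.iloc[4:][["Brand"]].copy()

# Categories not seen in training are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training rows only, then reuse the fitted mapping on the test rows
X_train["Brand"] = encoder.fit_transform(X_train[["Brand"]]).ravel()
X_test["Brand"] = encoder.transform(X_test[["Brand"]]).ravel()

print(X_train["Brand"].tolist())  # codes learned from training data
print(X_test["Brand"].tolist())   # "volkswagen" was never seen in training
```

Here `volkswagen` appears only in the test rows, so it receives the reserved code rather than influencing the mapping learned from the training data.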

In [43]:
columns_to_encode = [col for col in categorical_columns if col not in high_cardinality_cols]

# One-Hot Encoding for the remaining categorical columns
car_data_ohe = pd.get_dummies(car_data, columns=columns_to_encode, drop_first=True)

# Display the shape and first few rows of the encoded data
car_data_ohe_shape = car_data_ohe.shape
car_data_ohe_head = car_data_ohe.head()

car_data_ohe_shape, car_data_ohe_head

((354369, 27),
    DateCrawled  Price  RegistrationYear  Power  Model  Mileage  \
 0        11681    480            1993.0  105.0    116   150000   
 1        11659  18300            2011.0  190.0    116   125000   
 2         6819   9800            2004.0  163.0    117   125000   
 3         8380   1500            2001.0   75.0    116   150000   
 4        15281   3600            2008.0   69.0    101    90000   
 
    RegistrationMonth  Brand  DateCreated  NumberOfPictures  ...  \
 0                  0     38           86                 0  ...   
 1                  5      1           86                 0  ...   
 2                  8     14           52                 0  ...   
 3                  6     38           61                 0  ...   
 4                  7     31          108                 0  ...   
 
    VehicleType_suv  VehicleType_wagon  Gearbox_manual  FuelType_electric  \
 0                0                  0               1                  0   
 1               

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

I cannot reproduce the results because of a kernel crash... Could you please rerun the project and **save** all outputs before submitting it?
    
P.S. My guess is that there are too many categories in the categorical columns.
</div>
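
If the guess above is right, checking cardinality before one-hot encoding helps keep the encoded frame small. The sketch below uses toy data and a hypothetical `threshold`. Dropping the high-cardinality column is only one option; label encoding or parsing the timestamps into numeric features are alternatives.

```python
import pandas as pd

# Toy frame standing in for car_data (values are illustrative only)
demo = pd.DataFrame({
    "Gearbox": ["manual", "auto", "manual", "manual"],
    "DateCrawled": ["24/03/2016 11:52", "24/03/2016 10:58",
                    "14/03/2016 12:52", "17/03/2016 16:54"],
    "Price": [480, 18300, 9800, 1500],
})
categorical = ["Gearbox", "DateCrawled"]
threshold = 3  # hypothetical cut-off for this toy example

# Count distinct values per categorical column
cardinality = demo[categorical].nunique()
low_card = cardinality[cardinality <= threshold].index.tolist()
high_card = cardinality[cardinality > threshold].index.tolist()

# One-hot encode only the low-cardinality columns; the rest are dropped here
encoded = pd.get_dummies(demo.drop(columns=high_card), columns=low_card, drop_first=True)

print(high_card)        # columns that would explode under one-hot encoding
print(encoded.columns)  # resulting feature columns
```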

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Thank you!
</div>

In [44]:
X = car_data_ohe.drop(columns=['Price', 'DateCrawled', 'Model', 'LastSeen'])
y = car_data_ohe['Price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((283495, 23), (70874, 23), (283495,), (70874,))

In [45]:
# Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on the test set
lr_predictions = lr_model.predict(X_test)

# Evaluate using RMSE
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_predictions))

lr_rmse

3132.137253347973

## Random Forest model

In [46]:
# Initialize the Random Forest model with some initial hyperparameters
rf_model = RandomForestRegressor(n_estimators=10, max_depth=10, random_state=42)

# Train the model on a subset of the data for quicker results
sample_size = int(0.1 * len(X_train))
X_train_sample = X_train.sample(sample_size, random_state=42)
y_train_sample = y_train.loc[X_train_sample.index]  # select targets by index label

rf_model.fit(X_train_sample, y_train_sample)

# Predict on the test set
rf_predictions = rf_model.predict(X_test)

# Evaluate using RMSE
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))

rf_rmse


2185.053958676757

## LightGBM

In [47]:
# Columns to be label encoded
columns_to_label_encode = ['Model', 'Brand', 'VehicleType', 'Gearbox', 'FuelType', 'NotRepaired']

# Label encoding
for column in columns_to_label_encode:
    le = LabelEncoder()
    car_data[column + "_encoded"] = le.fit_transform(car_data[column])


In [48]:
# Split the data
X = car_data.drop(columns="Price")
y = car_data["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sample from the training set
sample_size = int(0.2 * len(X_train))
X_train_sample = X_train.sample(sample_size, random_state=42)
y_train_sample = y_train.loc[X_train_sample.index]  # select targets by index label


In [49]:
# Feature columns: numeric features plus the label-encoded categorical ones
lgb_categorical = ['Model_encoded', 'Brand_encoded', 'VehicleType_encoded', 'Gearbox_encoded', 'FuelType_encoded', 'NotRepaired_encoded']
lgb_features = ['RegistrationYear', 'Power', 'Mileage', 'RegistrationMonth'] + lgb_categorical

# Prepare the dataset
lgb_train = lgb.Dataset(X_train_sample[lgb_features],
                        label=y_train_sample,
                        categorical_feature=lgb_categorical,
                        free_raw_data=False)

# Hyperparameters and Training
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
num_round = 100
lgb_model = lgb.train(params, lgb_train, num_round)

# Prediction and Evaluation
lgb_predictions = lgb_model.predict(X_test[lgb_features])
lgb_rmse = np.sqrt(mean_squared_error(y_test, lgb_predictions))




You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 628
[LightGBM] [Info] Number of data points in the train set: 56699, number of used features: 10
[LightGBM] [Info] Start training from score 4374.720718


## XGBoost

In [52]:
features = [
    'RegistrationYear', 'Power', 'Mileage', 'RegistrationMonth',
    'Model_encoded', 'Brand_encoded', 'VehicleType_encoded', 
    'Gearbox_encoded', 'FuelType_encoded', 'NotRepaired_encoded'
]

# Initialize the XGBoost model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.05, max_depth=6, random_state=42)

# Train the model on the subset of data
xgb_model.fit(X_train_sample[features], y_train_sample)

# Predict on the test set
xgb_predictions = xgb_model.predict(X_test[features])

# Evaluate using RMSE
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_predictions))

xgb_rmse


1990.9509243245896

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!
</div>

## Model analysis

Random Forest demonstrated promising results by significantly reducing the RMSE compared to the baseline Linear Regression model.

- Linear Regression: RMSE 3132.137253347973. Serves as a baseline model for comparison.
- Random Forest: RMSE 2185.053958676757. Outperformed the baseline Linear Regression model.
- LightGBM: RMSE 1635.81. Lowest RMSE.
- XGBoost: RMSE 1990.9509243245896.


<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good conclusion, but I believe it's possible to make it more readable =) For example, round the RMSE values, put them in a DataFrame, etc.
</div>
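
The suggestion above could be implemented roughly as follows. This is a minimal sketch that copies the RMSE values reported in the analysis into a DataFrame; the `summary` name and column labels are illustrative choices.

```python
import pandas as pd

# RMSE values as reported in the model analysis
rmse_by_model = {
    "Linear Regression": 3132.137253347973,
    "Random Forest": 2185.053958676757,
    "LightGBM": 1635.81,
    "XGBoost": 1990.9509243245896,
}

# Build a table, round to two decimals, and sort from best to worst
summary = (
    pd.DataFrame(list(rmse_by_model.items()), columns=["Model", "RMSE"])
    .round({"RMSE": 2})
    .sort_values("RMSE")
    .reset_index(drop=True)
)

print(summary)
```

Extra columns, such as training and prediction time, could be added to the same table once they are measured.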

<div class="alert alert-block alert-success">
<b>Overall reviewer's comment</b> <a class="tocSkip"></a>
    
Thank you for sending your project. You've done a really good job on it!

While there's room for improvement, on the whole, your project is impressively good. I like your code style - very high level!
    
Remember: every issue with our code is a chance for us to learn something new =)

Your project has been accepted and you can go to the next sprint!    
    
</div>

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed