## **Model Building**

In [1]:
import pandas as pd
import numpy as np
import sklearn.metrics as metrics

cars_data = pd.read_csv("cars_data_updated.csv")

In [2]:
# Independent variables in X (predictors)
# We drop 'Name' since we split that column into Brand and Model
# We drop 'Price' and 'price_log' since they are dependent variables
# We drop 'Kilometers_Driven' since we took the log of that column and the original is not needed
X = cars_data.drop(['Name','Price','price_log','Kilometers_Driven'], axis = 1)

# Dependent variable in y (target)
y = cars_data[["price_log", "Price"]]

In [3]:
# Uses one-hot encoding to convert categorical features into numerical values 
X = pd.get_dummies(X, drop_first = True)
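
To make the effect of `drop_first = True` concrete, here is a minimal sketch on a made-up three-row frame. The values are illustrative and are not taken from `cars_data`. A categorical column with k levels becomes k - 1 indicator columns, and the dropped level acts as the implicit baseline.

```python
import pandas as pd

# Toy frame for illustration only; these rows are not from cars_data
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Year": [2012, 2015, 2018],
})

# drop_first=True drops the first level of each categorical column
# (levels are sorted, so "CNG" is the one dropped here)
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(toy_encoded.columns.tolist())
```

Numeric columns such as `Year` pass through unchanged.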

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
# Splitting data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
print(X_train.shape, X_test.shape)

(5076, 275) (2176, 275)
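
The split does not divide evenly: 30% of 7,252 rows is 2,175.6. A quick check with placeholder arrays of the same row count (stand-ins, not the real data) suggests that scikit-learn rounds the test set up to 2,176 rows and leaves 5,076 for training, which matches the shapes above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5076 + 2176  # total rows implied by the shapes printed above
dummy_X = np.zeros((n_rows, 1))  # placeholder features; only the row count matters
dummy_y = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    dummy_X, dummy_y, test_size=0.3, random_state=1
)
print(X_tr.shape[0], X_te.shape[0])
```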


In [6]:
# Function for calculating R-squared and RMSE on the train and test data
# Relies on the global X_train, X_test, y_train and y_test defined above

def get_model_score(model, flag = True):
    '''
    model : fitted regressor that predicts price_log
    flag  : if True, print the scores

    Returns [train_r2, test_r2, train_rmse, test_rmse], all computed on the
    original Price scale.
    '''
    # The models are trained on price_log, so exponentiate to get back to Price
    pred_train_ = np.exp(model.predict(X_train))
    pred_test_ = np.exp(model.predict(X_test))

    train_r2 = metrics.r2_score(y_train['Price'], pred_train_)
    test_r2 = metrics.r2_score(y_test['Price'], pred_test_)

    # RMSE = sqrt(MSE); this avoids the squared = False argument,
    # which is deprecated in recent scikit-learn releases
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train['Price'], pred_train_))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test['Price'], pred_test_))

    score_list = [train_r2, test_r2, train_rmse, test_rmse]

    if flag:
        print("R-square on training set : ", train_r2)
        print("R-square on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)

    return score_list
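
Because `get_model_score` applies `np.exp` before scoring, its numbers describe errors in Price, not in `price_log`. The sketch below uses synthetic prices and synthetic "predictions" (both invented for illustration) to show that the same predictions can have a different R-squared on the two scales. This is one reason why the scores from this function need not match metrics that a library reports on the log scale.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic, log-normally distributed prices (illustrative only)
price = np.exp(rng.normal(loc=1.5, scale=0.8, size=1000))
log_price = np.log(price)

# Pretend predictions on the log scale: the truth plus some noise
pred_log = log_price + rng.normal(scale=0.25, size=1000)
pred_price = np.exp(pred_log)  # back-transform, as get_model_score does

r2_log = r2_score(log_price, pred_log)
r2_price = r2_score(price, pred_price)
rmse_price = np.sqrt(mean_squared_error(price, pred_price))

print(f"R2 on log scale:   {r2_log:.3f}")
print(f"R2 on price scale: {r2_price:.3f}")
print(f"RMSE in price units: {rmse_price:.3f}")
```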

<hr>

I will use the following models to predict the Price:

**1) Linear Regression** <br>
**2) Ridge / Lasso Regression** <br>
**3) Decision Trees** <br>
**4) Random Forest** <br>

### **Linear Regression**

In [7]:
from sklearn.linear_model import LinearRegression

In [8]:
lr = LinearRegression()
lr.fit(X_train, y_train['price_log'])

In [9]:
LR_score = get_model_score(lr)

R-square on training set :  0.9193878014791699
R-square on test set :  0.8642631469804376
RMSE on training set :  3.050342028500473
RMSE on test set :  3.94290688639171


**Observations from results:**

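The R-squared value is about 0.92 on the training set and drops to about 0.86 on the test set. The model fits the training data well, but the gap suggests some overfitting, which is not surprising with 275 one-hot encoded predictors.

The RMSE is 3.05 on the training set and 3.94 on the test set. Because the predictions are converted back with `np.exp`, these errors are in the same units as Price.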

**Important variables of Linear Regression**

In [10]:
# Convert boolean columns to integers
X_train = X_train.astype({col: 'int' for col in X_train.select_dtypes('bool').columns})
X_train = X_train.apply(pd.to_numeric, errors='coerce')

In [11]:
import statsmodels.api as sm

x_train = sm.add_constant(X_train)
x_test = sm.add_constant(X_test)

def build_ols_model(train):
    olsmodel = sm.OLS(y_train["price_log"], train)
    return olsmodel.fit()

olsmodel1 = build_ols_model(x_train)

print(olsmodel1.summary())

                            OLS Regression Results                            
Dep. Variable:              price_log   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.939
Method:                 Least Squares   F-statistic:                     329.6
Date:                Tue, 28 Jan 2025   Prob (F-statistic):               0.00
Time:                        15:47:34   Log-Likelihood:                 795.45
No. Observations:                5076   AIC:                            -1117.
Df Residuals:                    4839   BIC:                             431.3
Df Model:                         236                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                 

In [12]:
# Retrieve coefficient values and p-values
olsmod = pd.DataFrame(olsmodel1.params, columns = ['coef'])
olsmod['pval'] = olsmodel1.pvalues

In [13]:
# Filter by significant p-value (pval <= 0.05)

olsmod = olsmod.sort_values(by = "pval", ascending = False)

pval_filter = olsmod['pval']<= 0.05

olsmod[pval_filter]

                            coef      pval
Model_Gallardo      3.703743e-16  0.041134
Model_CR-V         -1.116858e-01  0.039933
Model_Camry        -1.743258e-01  0.038092
Model_SLK-Class     3.317574e-01  0.032207
Model_Swift        -1.699596e-01  0.027335
...                          ...       ...
Brand_Chevrolet    -5.193682e+00  0.000000
Year                8.945477e-02  0.000000
Brand_Lamborghini   1.464234e-13  0.000000
Fuel_Type_Electric  3.387654e-13  0.000000


In [14]:
# Most significant variables

pval_filter = olsmod['pval']<= 0.05
mp_vars = olsmod[pval_filter].index.tolist()

# Obtain overall variables (undo one-hot encoding)
# The text before the first underscore identifies the source column,
# e.g. 'Brand_Chevrolet' -> 'Brand'; names without an underscore ('Year') map to themselves
sig_var = []
for col in mp_vars:
    first_part = col.split('_')[0]
    for c in cars_data.columns:
        if first_part in c and c not in sig_var :
            sig_var.append(c)

print('Most overall significant variables of LINEAR REGRESSION are:\n', sig_var)

Most overall significant variables of LINEAR REGRESSION are:
 ['Model', 'Location', 'Engine', 'New_price', 'Owner_Type', 'Transmission', 'Power', 'kilometers_driven_log', 'Brand', 'Year', 'Fuel_Type']
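
The loop above matches on the text before the first underscore, which works here but relies on substring matching. A slightly stricter alternative is sketched below: it maps each dummy column to the longest original column name that prefixes it. The helper `source_column` and the two input lists are hypothetical and only illustrate the idea.

```python
def source_column(dummy_name, original_columns):
    """Return the longest original column that dummy_name derives from, or None."""
    matches = [
        c for c in original_columns
        if dummy_name == c or dummy_name.startswith(c + "_")
    ]
    return max(matches, key=len) if matches else None


# Hypothetical inputs for illustration
original_columns = ["Fuel_Type", "Owner_Type", "Brand", "Model", "Year", "New_price"]
significant = ["Fuel_Type_Electric", "Brand_Chevrolet", "Year", "Model_CR-V", "const"]

recovered = []
for name in significant:
    col = source_column(name, original_columns)
    # Skip names with no source column (such as the intercept) and duplicates
    if col is not None and col not in recovered:
        recovered.append(col)

print(recovered)
```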


### **Ridge Regression**

In [15]:
from sklearn.linear_model import Ridge

In [16]:
clf = Ridge()
clf.fit(X_train, y_train['price_log'])

In [17]:
ridge_score = get_model_score(clf)

R-square on training set :  0.901984351072462
R-square on test set :  0.8847544041062825
RMSE on training set :  3.3635343365734474
RMSE on test set :  3.633120496703515


**Observations from results:**

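The R-squared value is 0.90 on the training set and 0.88 on the test set. The training fit is slightly lower than for plain linear regression (0.92), but the test fit is higher (0.88 versus 0.86), so the L2 penalty reduces overfitting.

The RMSE is 3.36 on the training set and 3.63 on the test set. The test RMSE is lower than the 3.94 from linear regression. This model uses the default `alpha = 1.0`, so tuning `alpha` may improve it further.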
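
To illustrate what the Ridge penalty does, the sketch below fits ordinary least squares and Ridge on synthetic data (invented for this example) and compares the size of the coefficient vectors. Larger `alpha` values shrink the coefficients more, which is how Ridge trades a little training fit for stability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)

# Synthetic data: 60 rows, 20 features, only the first feature matters
X_toy = rng.normal(size=(60, 20))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=1.0, size=60)

coef_norms = {}
coef_norms["LinearRegression"] = np.linalg.norm(LinearRegression().fit(X_toy, y_toy).coef_)
for alpha in (1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X_toy, y_toy)
    coef_norms[f"Ridge(alpha={alpha})"] = np.linalg.norm(ridge.coef_)

for name, norm in coef_norms.items():
    print(f"{name}: coefficient norm = {norm:.3f}")
```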

### **Decision Tree** 

In [18]:
from sklearn.tree import DecisionTreeRegressor

In [19]:
dtree = DecisionTreeRegressor(random_state = 1)
dtree.fit(X_train, y_train['price_log'])

In [20]:
Dtree_model = get_model_score(dtree)

R-square on training set :  0.999908205603226
R-square on test set :  0.8232778973335595
RMSE on training set :  0.10293338189909358
RMSE on test set :  4.4989696991075645


**Observations from results:**

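The R-squared value on the training set is almost 1 and the training RMSE is only 0.10, so the unpruned decision tree reproduces the training data almost perfectly.

On the test set, however, the R-squared drops to 0.82 and the RMSE rises to 4.50. This large gap means the tree is overfitting: it has memorized the training data rather than learned patterns that generalize.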
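
The same pattern can be reproduced on synthetic data (invented for this example): a tree with no depth or leaf-size limit scores close to 1 on noisy training data but noticeably lower on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic data: a smooth signal plus noise that cannot be predicted
X_toy = rng.uniform(0, 10, size=(500, 3))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# No max_depth or min_samples_leaf: the tree grows until every leaf is pure
full_tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

train_r2 = full_tree.score(X_tr, y_tr)
test_r2 = full_tree.score(X_te, y_te)
print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```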

**Feature Importance**

Importance of features in the tree building (impurity-based importance, which scikit-learn also calls Gini importance):

In [40]:
print(pd.DataFrame(dtree.feature_importances_, columns = ["Importance Score"], index = X_train.columns).sort_values(by = 'Importance Score', ascending = False))

                       Importance Score
Power                          0.650185
Year                           0.191520
New_price                      0.059473
kilometers_driven_log          0.019872
Engine                         0.009325
...                                 ...
Model_MUX                      0.000000
Model_MU                       0.000000
Model_M-Class                  0.000000
Model_Logan                    0.000000
Model_Fluence                  0.000000

[275 rows x 1 columns]


### **Random Forest**

In [22]:
from sklearn.ensemble import RandomForestRegressor

In [23]:
# No random_state is set here, so the scores change slightly between runs
regr = RandomForestRegressor()
regr.fit(X_train, y_train['price_log'])

In [24]:
forest_score = get_model_score(regr)

R-square on training set :  0.9742585367810555
R-square on test set :  0.864274426808014
RMSE on training set :  1.7237122602766253
RMSE on test set :  3.942743053825756


**Observations and insights:**

The R-squared value is 0.97 on the training set and 0.86 on the test set. The forest fits the training data very well, but it still overfits, although less than the single decision tree.

The RMSE is 1.72 on the training set and 3.94 on the test set, more than twice as high. The test performance is about the same as that of linear regression.

**Feature Importance**

In [25]:
print(pd.DataFrame(regr.feature_importances_, columns = ["Importance Score"], index = X_train.columns).sort_values(by = 'Importance Score', ascending = False))

                       Importance Score
Power                          0.641714
Year                           0.192946
New_price                      0.064183
kilometers_driven_log          0.019399
Engine                         0.013275
...                                 ...
Model_370Z                     0.000000
Model_F                        0.000000
Model_E                        0.000000
Model_WR-V                     0.000000
Model_Gallardo                 0.000000

[275 rows x 1 columns]


**Observations and insights:**

Power has by far the highest feature importance score (0.64), so it is the feature the random forest relies on most when predicting the price.

Other notable variables are Year and New_price, with importance scores of 0.19 and 0.064, respectively. All remaining features score below 0.02.
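
For reading importance scores, a `pd.Series` is a little easier to work with than a printed DataFrame. The sketch below uses synthetic data (invented for this example) in which feature `a` drives most of the target. It shows that the scores sum to 1, so each one can be read as a share of the total impurity reduction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic data: "a" matters most, "b" a little, "c" and "d" not at all
X_toy = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_toy = 5.0 * X_toy["a"] + 1.0 * X_toy["b"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_toy, y_toy)

importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
top = importances.nlargest(3)
print(top)
print("Sum of all importances:", importances.sum())
```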

### **Hyperparameter Tuning: Decision Tree**

In [36]:
from sklearn.model_selection import GridSearchCV

dtree_tuned = DecisionTreeRegressor(random_state = 1)

# Dictionary of parameters to choose from
parameters = {"max_depth": [None, 1, 2, 5, 10, 20], "min_samples_leaf": [1, 3, 5, 10, 20]}

# Run the grid search
grid_obj = GridSearchCV(estimator = dtree_tuned, param_grid = parameters, scoring="neg_mean_squared_error", cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train['price_log'])

# Set the model to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_
dtree_tuned.fit(X_train, y_train['price_log'])

In [37]:
get_model_score(dtree_tuned)

R-square on training set :  0.8809744608578453
R-square on test set :  0.8072076290269816
RMSE on training set :  3.706536100875415
RMSE on test set :  4.699076932168065


[0.8809744608578453, 0.8072076290269816, 3.706536100875415, 4.699076932168065]

**Observations and insights:**

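The R-squared is 0.88 on the training set and 0.81 on the test set. The gap is much smaller than for the untuned tree (1.00 versus 0.82), so tuning has reduced overfitting.

However, the test R-squared is slightly lower than the untuned tree's 0.82, and the test RMSE is higher (4.70 versus 4.50). The grid search minimizes the squared error on `price_log`, while these scores are computed on Price, so the selected parameters are not guaranteed to improve them.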
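
When a tuned model behaves unexpectedly, it helps to look at what the grid search actually selected. The sketch below runs a small search on synthetic data (invented for this example) and reads `best_params_` and `best_score_`. With the default `refit=True`, `best_estimator_` is already fitted on the full training data, so it can be used for prediction directly.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

# Synthetic data for illustration
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=300)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": [None, 2, 5], "min_samples_leaf": [1, 5, 20]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_toy, y_toy)

best_tree = search.best_estimator_  # already refit on all of X_toy
print("Selected parameters:", search.best_params_)
print("Mean cross-validated MSE:", -search.best_score_)

preds = best_tree.predict(X_toy[:5])
```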

**Feature Importance**

In [28]:
print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Importance Score"], index = X_train.columns).sort_values(by = 'Importance Score', ascending = False))

                       Importance Score
Power                          0.695023
Year                           0.200072
New_price                      0.063771
kilometers_driven_log          0.007814
Brand_Honda                    0.007163
...                                 ...
Model_D-MAX                    0.000000
Model_Duster                   0.000000
Model_Dzire                    0.000000
Model_E                        0.000000
Model_redi-GO                  0.000000

[275 rows x 1 columns]


**Observations and insights:**

The most important feature by far in this model is Power, with an importance score of 0.695. The next two most important features are Year and New_price, with importance scores of 0.20 and 0.06, respectively.

### **Hyperparameter Tuning: Random Forest**

In [38]:
forest_tuned = RandomForestRegressor(random_state = 1)

# Dictionary of parameters to choose from
parameters = {"max_depth": [None, 10, 20], "min_samples_leaf": [1, 5, 10], "max_leaf_nodes": [10, 15]}

# Run the grid search
grid_obj = GridSearchCV(estimator = forest_tuned, param_grid = parameters, scoring="neg_mean_squared_error", cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train['price_log'])

# Set the model to the best combination of parameters
forest_tuned = grid_obj.best_estimator_
forest_tuned.fit(X_train, y_train['price_log'])

In [39]:
# Get score of the model
get_model_score(forest_tuned)

R-square on training set :  0.7651397680904627
R-square on test set :  0.7191151895889665
RMSE on training set :  5.206584227723524
RMSE on test set :  5.671941725460772


[0.7651397680904627, 0.7191151895889665, 5.206584227723524, 5.671941725460772]

**Observations and insights:**

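The R-squared value is 0.77 on the training set and 0.72 on the test set. The two values are close, so the model is not overfitting, but both are the lowest of all the models, which points to underfitting.

The RMSE is also the highest of all the models, at 5.21 on the training set and 5.67 on the test set. A likely cause is the parameter grid: `max_leaf_nodes` is limited to 10 or 15, so every tree in the forest is very small. Adding `None` to that list would let the search consider unrestricted trees.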
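
The effect of a `max_leaf_nodes` cap can be checked by counting the leaves of each tree. The sketch below uses synthetic data (invented for this example) to compare a forest capped at 15 leaves per tree with an uncapped one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Synthetic data for illustration
X_toy = rng.uniform(0, 10, size=(400, 3))
y_toy = X_toy[:, 0] ** 2 + rng.normal(scale=1.0, size=400)

capped = RandomForestRegressor(n_estimators=20, max_leaf_nodes=15, random_state=1).fit(X_toy, y_toy)
free = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_toy, y_toy)

# Each fitted tree in the forest reports its own number of leaves
capped_leaves = [tree.get_n_leaves() for tree in capped.estimators_]
free_leaves = [tree.get_n_leaves() for tree in free.estimators_]

print("Most leaves in a capped tree:", max(capped_leaves))
print("Fewest leaves in an uncapped tree:", min(free_leaves))
print("Training R2 (capped vs uncapped):", capped.score(X_toy, y_toy), free.score(X_toy, y_toy))
```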

**Feature Importance**

In [31]:
print(pd.DataFrame(forest_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

                    Imp
Power          0.749373
Year           0.191893
New_price      0.044147
Engine         0.011125
Brand_Honda    0.002417
...                 ...
Model_Celerio  0.000000
Model_Ciaz     0.000000
Model_City     0.000000
Model_Civic    0.000000
Model_redi-GO  0.000000

[275 rows x 1 columns]


**Observations:**

The importance scores are very similar to the tuned decision tree, as the most important feature is Power, with Year and New_price behind.

In [32]:
# List of models
models = [lr, clf, dtree, regr, dtree_tuned, forest_tuned]

r2_train = []
r2_test = []
rmse_train = []
rmse_test = []

for model in models:
    
    j = get_model_score(model, False)
    
    r2_train.append(j[0])
    r2_test.append(j[1])
    rmse_train.append(j[2])
    rmse_test.append(j[3])

In [33]:
comparison_frame = pd.DataFrame({'Model':['Linear Regression', 'Ridge', 'Decision Tree', 'Random Forest', 'dtree_tuned', 'forest_tuned'], 
                                          'Train_r2': r2_train,'Test_r2': r2_test,
                                          'Train_RMSE': rmse_train,'Test_RMSE': rmse_test}) 
comparison_frame

               Model  Train_r2   Test_r2  Train_RMSE  Test_RMSE
0  Linear Regression  0.919388  0.864263    3.050342   3.942907
1              Ridge  0.901984  0.884754    3.363534   3.633120
2      Decision Tree  0.999908  0.823278    0.102933   4.498970
3      Random Forest  0.974259  0.864274    1.723712   3.942743
4        dtree_tuned  0.880974  0.807208    3.706536   4.699077
5       forest_tuned  0.765140  0.719115    5.206584   5.671942


**Observations:**

From this table, the most reasonable model to use is Ridge Regression. Its R-squared on the test set is 0.88, the highest of all the models, and its test RMSE of 3.63 is the lowest. It also has the smallest gap between training and test R-squared (0.90 versus 0.88), so it generalizes best to unseen data.
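
The ranking can also be read off programmatically instead of by eye. The sketch below rebuilds part of the comparison table from the scores reported above and picks the best model by each criterion. The `r2_gap` column is an addition made here for illustration.

```python
import pandas as pd

# Scores copied from the comparison table in this notebook
comparison = pd.DataFrame({
    "Model": ["Linear Regression", "Ridge", "Decision Tree",
              "Random Forest", "dtree_tuned", "forest_tuned"],
    "Train_r2": [0.919388, 0.901984, 0.999908, 0.974259, 0.880974, 0.765140],
    "Test_r2": [0.864263, 0.884754, 0.823278, 0.864274, 0.807208, 0.719115],
    "Test_RMSE": [3.942907, 3.633120, 4.498970, 3.942743, 4.699077, 5.671942],
}).set_index("Model")

# Difference between training and test fit: a rough indicator of overfitting
comparison["r2_gap"] = comparison["Train_r2"] - comparison["Test_r2"]

best_by_r2 = comparison["Test_r2"].idxmax()
best_by_rmse = comparison["Test_RMSE"].idxmin()
smallest_gap = comparison["r2_gap"].idxmin()

print("Highest test R2:", best_by_r2)
print("Lowest test RMSE:", best_by_rmse)
print("Smallest train/test R2 gap:", smallest_gap)
```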