## *Import Libraries*

In [1]:
import pandas as pd
import numpy as np
import warnings

In [2]:
warnings.filterwarnings("ignore")

## *Load Dataset*

In [3]:
df = pd.read_csv("cars_sales_ohe.csv")

In [4]:
def load_inputs_outputs():

    X = df.drop(columns=['price'])
    y = df.price
    
    return X,y

In [5]:
X,y = load_inputs_outputs()

## *Split Data*

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)

### *Array Transform*

In [8]:
def array(x):
    return np.asarray(x)

In [9]:
X_train,Y_train = array(X_train),array(Y_train)
X_test,Y_test = array(X_test),array(Y_test)

We convert the data to NumPy arrays, which can speed up model training.

In [10]:
X_train.shape,X_test.shape

((69185, 175), (17297, 175))
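As a quick sanity check, the shapes above are consistent with an 80/20 split of 86,482 rows. The sketch below reproduces the arithmetic, assuming the test set size is rounded up and the training set receives the remainder, which is how I understand train_test_split to behave when only test_size is given.

```python
import math

# Row counts taken from the shapes printed above
n_train_rows, n_test_rows = 69185, 17297
n_samples = n_train_rows + n_test_rows
test_size = 0.2

# Assumed rule: test rows are rounded up, training rows get the rest
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

print(n_samples, expected_train, expected_test)
```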

## *Model Creation*

In [11]:
from xgboost import XGBRegressor

### *Explanation of Parameters*

* max_depth: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 



* learning_rate (alias eta): Step size shrinkage used in each update to prevent overfitting. After each boosting step, eta shrinks the weights of the newly added tree, which makes the boosting process more conservative.



* subsample: Subsample ratio of the training instances. 



* colsample_bytree: Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.



* colsample_bynode: Subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.



* gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
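The two column-sampling parameters stack: colsample_bynode draws from the columns that colsample_bytree already kept. The following is only an illustration with NumPy, not XGBoost's internal code. The rounding rule (truncate, keep at least one column) and the variable names are assumptions, and colsample_bylevel is taken to be left at its default of 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 175            # number of columns in X_train
colsample_bytree = 0.7
colsample_bynode = 0.8

all_cols = np.arange(n_features)

# Once per tree: keep a random subset of all columns
n_tree = max(1, int(colsample_bytree * n_features))
tree_cols = rng.choice(all_cols, size=n_tree, replace=False)

# Once per split: subsample again, only from the columns kept for this tree
n_node = max(1, int(colsample_bynode * len(tree_cols)))
node_cols = rng.choice(tree_cols, size=n_node, replace=False)

print(len(tree_cols), len(node_cols))
```

Under these assumptions, each split considers a little over half of the 175 features (roughly 0.7 x 0.8 = 0.56 of them).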


In [12]:
params1 = {"max_depth":8,
           "learning_rate":0.01,
           "subsample":0.9,
           "n_jobs":-1,
           "n_estimators":900}

params2 = {"max_depth":9,
           "learning_rate":0.01,
           "subsample":0.8,
           "n_jobs":-1,
           "n_estimators":800}

params3 = {"max_depth":10,
           "learning_rate":0.01,
           "subsample":0.7,
           "colsample_bytree":0.7,
           "colsample_bynode":0.8,
           "n_jobs":-1,
           "gamma":20,
           "n_estimators":700}

In the last model, max_depth is set to 10, which increases the risk of overfitting. To limit that effect, I added further hyperparameters (a lower subsample, column subsampling, and gamma) and reduced the number of estimators.
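To see why a small learning_rate goes together with a large n_estimators, consider an idealized case: squared-error loss and trees that fit the current residual exactly. This is my simplification, not how XGBoost behaves on real data. Each round then removes a fraction learning_rate of the remaining residual, so after n rounds the fraction left is (1 - learning_rate)^n. The helper name residual_left is hypothetical.

```python
learning_rate = 0.01

def residual_left(n_estimators, eta=learning_rate):
    """Fraction of the initial residual remaining after n idealized rounds."""
    return (1.0 - eta) ** n_estimators

# Compare a short run with the tree counts used in this notebook
for n in (100, 700, 800, 900):
    print(n, f"{residual_left(n):.6f}")
```

In this idealized picture, 100 rounds at a learning rate of 0.01 would leave more than a third of the residual unfitted, while 700 to 900 rounds leave well under 1%. That is why the models above use several hundred trees.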

In [13]:
models = {'XGB 1':XGBRegressor(**params1),
          'XGB 2':XGBRegressor(**params2),
          'XGB 3':XGBRegressor(**params3)}

Each parameter dictionary is unpacked with ** and passed to its model as keyword arguments.

### *Train Models*

In [14]:
def train_model(model):
    return model.fit(X_train,Y_train)

In [15]:
for model in models.values():
    train_model(model)

In [16]:
from sklearn.metrics import mean_squared_error

In [17]:
def MSE(model):
    
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    
    mse_train = mean_squared_error(Y_train,pred_train)
    mse_test = mean_squared_error(Y_test,pred_test)
    
    return mse_train,mse_test

In [18]:
for name,model in models.items():
    mse_train,mse_test = MSE(model)
    print(f"{name} Train MSE: {mse_train}")
    print(f"{name} Test MSE:  {mse_test}")

XGB 1 Train MSE: 3994753.4117338764
XGB 1 Test MSE:  4608877.266877357
XGB 2 Train MSE: 3597746.7250362164
XGB 2 Test MSE:  4351160.770324405
XGB 3 Train MSE: 3498283.265723869
XGB 3 Test MSE:  4105622.523214639


The last model gives the best result so far, but we can still tune additional parameters to make the model more robust.
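MSE is expressed in squared price units, which makes the numbers hard to read. Taking the square root gives the RMSE, in the same units as price (the source does not state the currency). The sketch below does this for the values printed above and adds the test/train MSE ratio as a rough, informal indicator of overfitting.

```python
import math

# MSE values copied from the output above: (train, test)
results = {
    "XGB 1": (3994753.4117338764, 4608877.266877357),
    "XGB 2": (3597746.7250362164, 4351160.770324405),
    "XGB 3": (3498283.265723869, 4105622.523214639),
}

rmse_test = {}
for name, (mse_train, mse_test) in results.items():
    rmse_train = math.sqrt(mse_train)
    rmse_test[name] = math.sqrt(mse_test)
    ratio = mse_test / mse_train   # above 1 means worse on unseen data
    print(f"{name}: train RMSE {rmse_train:.0f}, "
          f"test RMSE {rmse_test[name]:.0f}, test/train MSE ratio {ratio:.2f}")
```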

## *GridSearch CV*

In [19]:
from sklearn.model_selection import GridSearchCV

* n_estimators: Number of trees.
* reg_alpha: L1 regularization term on weights. Increasing this value makes the model more conservative.

In [20]:
params = {"n_estimators":[600,700,800],"reg_alpha":[0.1,0.5,0.8]}

In [21]:
base_model = XGBRegressor(max_depth = 10,
                          learning_rate = 0.01,
                          colsample_bytree = 0.7,
                          subsample = 0.7,
                          colsample_bynode = 0.8,
                          gamma = 20,
                          random_state = 0,
                          n_jobs = -1)

In [22]:
grid = GridSearchCV(base_model,params,cv = 3,n_jobs = -1)

In [23]:
grid.fit(X_train,Y_train)

In [24]:
grid.best_params_

{'n_estimators': 800, 'reg_alpha': 0.1}
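Note that when no scoring argument is given, GridSearchCV ranks candidates with the estimator's own score method, which for regressors is R^2 rather than MSE. To make the search select on the metric reported in this notebook, you can pass scoring="neg_mean_squared_error". The sketch below shows this on small synthetic data. The data, the tiny grid, and the use of scikit-learn's GradientBoostingRegressor as a lightweight stand-in for XGBRegressor are assumptions made only to keep the example fast and self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption, not the cars dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
coefs = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ coefs + rng.normal(scale=0.1, size=200)

demo_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [20, 40], "max_depth": [2, 3]},
    cv=3,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
)
demo_grid.fit(X_demo, y_demo)

# Flip the sign back to get the mean cross-validated MSE of the best candidate
best_cv_mse = -demo_grid.best_score_
print(demo_grid.best_params_, best_cv_mse)
```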

## *Best Model*

In [25]:
best_model = grid.best_estimator_

In [26]:
mse_train,mse_test = MSE(best_model)

In [27]:
print(f" Train MSE: {mse_train}")
print(f" Test MSE:  {mse_test}")

 Train MSE: 3386712.2032006355
 Test MSE:  4023474.084994841


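With the tuned parameters, the test MSE drops from about 4.11 million (XGB 3) to about 4.02 million, which is a positive result. Note that the training MSE falls by a similar amount, so the gap between training and test error does not narrow.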