<a href="https://colab.research.google.com/github/sarnavadatta/Predictive-Modelling/blob/main/XGBoost_Regressor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [None]:
! pip install --user xgboost



In [None]:
import xgboost as xgb

In [None]:
df = sns.load_dataset("diamonds")
df.head()

index,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [None]:
df.shape

(53940, 10)

In [None]:
df.describe()

statistic,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


**Check Null values**

In [None]:
# Check features with nan value
df.isnull().sum()

column,null_count
carat,0
cut,0
color,0
clarity,0
depth,0
table,0
price,0
x,0
y,0
z,0


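**Creating feature and target arrays**

Here we try to predict diamond prices from their physical measurements (carat, depth, table, x, y, z) and quality grades (cut, color, clarity). The target is therefore the price column, and every other column is a feature.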

In [None]:
from sklearn.model_selection import train_test_split

# Extract feature and target arrays
X, y = df.drop('price', axis=1), df[['price']]

The dataset contains three categorical columns. Typically, these columns would be encoded using either ordinal encoding or one-hot encoding before training a model. However, XGBoost has built-in support for handling categorical variables internally.

To enable this, the categorical columns must have the Pandas *category* data type. Columns stored as plain text (object dtype) cannot use XGBoost's native categorical handling, so they are converted explicitly below; if the loader already returns them as *category*, the conversion is harmless.

In [None]:
# Collect the names of all non-numeric columns
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert each of them to the Pandas category dtype
for col in cats:
    X[col] = X[col].astype('category')

X.dtypes

column,dtype
carat,float64
cut,category
color,category
clarity,category
depth,float64
table,float64
x,float64
y,float64
z,float64
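To make the *category* dtype less abstract, here is a minimal sketch on a hypothetical toy column (the values are borrowed from the cut grades, but the Series itself is an assumption for illustration). It shows that a category column stores each distinct level once and represents the rows as integer codes, and contrasts that with one-hot encoding. It does not show how XGBoost consumes these codes internally.

```python
import pandas as pd

# Hypothetical toy column (illustrative, not a slice of the diamonds data)
s = pd.Series(["Ideal", "Good", "Premium", "Good"], dtype="category")

# Distinct levels, sorted lexically by default
print(list(s.cat.categories))  # ['Good', 'Ideal', 'Premium']

# Each row is stored as the integer position of its level
print(s.cat.codes.tolist())    # [1, 0, 2, 0]

# One-hot encoding, for contrast: one indicator column per level
one_hot = pd.get_dummies(s)
print(one_hot.shape)           # (4, 3)
```

With one-hot encoding the column count grows with the number of levels, whereas the category dtype keeps a single column.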


**Train-Test split**

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Next, convert the data into a **DMatrix**, XGBoost's optimized internal data structure, which is designed for memory efficiency and training speed.

The class accepts both the training features and the labels. To enable automatic encoding of Pandas category columns, we also set *enable_categorical* to True.

In [None]:
# Create regression matrices
dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)

**Training**

The chosen objective function and any other hyperparameters of XGBoost should be specified in a dictionary.

In [None]:
# Define hyperparameters
params = {"objective": "reg:squarederror", "tree_method": "hist"}

n = 100  # number of boosting rounds
model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
)
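Other hyperparameters go into the same dictionary. The sketch below lists a few commonly tuned ones; the parameter names are standard XGBoost parameters, but the values and the name `params_example` are illustrative assumptions, not settings used or tuned in this notebook.

```python
# Illustrative values only; not tuned for the diamonds data
params_example = {
    "objective": "reg:squarederror",  # squared-error regression loss
    "tree_method": "hist",            # histogram-based tree construction
    "eta": 0.1,                       # learning rate (step-size shrinkage)
    "max_depth": 6,                   # maximum depth of each tree
    "subsample": 0.8,                 # fraction of rows sampled each round
    "colsample_bytree": 0.8,          # fraction of columns sampled per tree
}
```

Such a dictionary would be passed to `xgb.train` through the same `params` argument as above.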

**Evaluation**

In [None]:
# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import mean_squared_error, root_mean_squared_error

# Predict on the held-out test set and score the predictions
preds = model.predict(dtest_reg)
mse = mean_squared_error(y_test, preds)
print(f"MSE of the base model: {mse:.3f}")

rmse = root_mean_squared_error(y_test, preds)
print(f"RMSE of the base model: {rmse:.3f}")

MSE of the base model: 301458.438
RMSE of the base model: 549.052
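RMSE is the square root of MSE, so it is expressed in the same units as the target (here, price). A minimal sketch of both metrics using only the standard library; the helper names and the four toy values are assumptions for illustration and are not taken from the dataset.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE; same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical prices and predictions (illustrative values only)
y_true = [300.0, 500.0, 1000.0, 2000.0]
y_pred = [320.0, 480.0, 1100.0, 1900.0]

print(mse(y_true, y_pred))   # 5200.0
print(rmse(y_true, y_pred))  # square root of 5200.0
```

Applied to the figures above, the square root of 301458.438 is about 549.05, which matches the printed RMSE. Relative to the mean price of roughly 3933 reported by `df.describe()`, that is an error of about 14%.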


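The next run adds three things to the base model:

1. Validation sets are evaluated during training, passed through the *evals* argument (here the test set doubles as the validation set).
2. *verbose_eval* is set to 50 so that metrics are printed only every 50 rounds, which keeps the log short.
3. *Early stopping* watches the validation loss and automatically stops training once it has not improved for a specified number of rounds (here, 50).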
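The early stopping rule in the last item can be sketched in plain Python. This is a simplified illustration of the idea, not XGBoost's implementation; the function name `best_round_with_patience` and the loss values are hypothetical.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_round, stop_round) for a sequence of validation losses.

    Training stops once `patience` rounds have passed since the last new
    best (lowest) loss; if that never happens, stop_round is the last round.
    """
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best score: remember where it happened
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, rnd
    return best_round, len(val_losses) - 1

# Hypothetical validation RMSE per round (illustrative values only)
losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3, 6.9]
print(best_round_with_patience(losses, patience=3))  # (2, 5)
```

Note that the later, better score of 6.9 is never reached because training has already stopped. The same pattern appears in the training log that follows: the best iteration is 58 and the last logged round is 108, that is, 58 + 50.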

In [None]:
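params = {"objective": "reg:squarederror",
          "tree_method": "hist"}
# Datasets evaluated after each round; early stopping monitors the last one
evals = [(dtrain_reg, "train"),
         (dtest_reg, "validation")]

n = 1000  # upper bound on boosting rounds; early stopping may end sooner

model = xgb.train(
    params=params,
    dtrain=dtrain_reg,
    num_boost_round=n,
    evals=evals,
    verbose_eval=50,           # log metrics every 50 rounds
    early_stopping_rounds=50,  # stop after 50 rounds without improvement
)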

# Get the best iteration where RMSE was the lowest
print("Best iteration:", model.best_iteration)

[0]	train-rmse:2865.76483	validation-rmse:2831.67876
[50]	train-rmse:428.91275	validation-rmse:548.08216
[100]	train-rmse:369.22408	validation-rmse:548.83870
[108]	train-rmse:363.54939	validation-rmse:548.24673
Best iteration: 58


In [None]:
# Score the early-stopped model on the test set
preds = model.predict(dtest_reg)
mse = mean_squared_error(y_test, preds)
print(f"MSE of the early-stopped model: {mse:.3f}")

rmse = root_mean_squared_error(y_test, preds)
print(f"RMSE of the early-stopped model: {rmse:.3f}")

MSE of the early-stopped model: 300574.500
RMSE of the early-stopped model: 548.247
