## Learning Objectives:

At the end of the experiment, you will be able to:

*  perform preprocessing for different types of features
*  build a pipeline for preprocessing features
*  implement feature selection manually and automatically
*  build an XGBoost regressor model and check its performance

## Introduction

Predicting house prices helps to identify profitable investments and to judge whether an advertised price is too high or too low. Here, we will build an ML model that predicts the sale price of a home from explanatory variables describing aspects of residential houses.

## Dataset

The dataset chosen is a [Housing dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) containing 79 features, one target feature (`SalePrice`), and 1460 samples. Visit the data source to understand each feature/column. Download the 'data_description.txt' file, which gives a full description.

### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

### Install XG-Boost and feature_engine library

In [None]:
%%capture
!pip -q install xgboost
!pip -q install feature_engine

Ignore any warnings shown during installation.

### Import required packages

In [None]:
# Mathematical functions (e.g., square root)
from math import sqrt

# Data manipulation and analysis library, often used for working with DataFrames
import pandas as pd

# Numerical operations, especially arrays and matrix computations
import numpy as np

# Plotting library for creating static, interactive, and animated visualizations
import matplotlib.pyplot as plt

# Statistical data visualization library based on Matplotlib
import seaborn as sns

# Implementation of gradient boosting algorithms (used for regression and classification tasks)
import xgboost as xgb

# Utility for splitting datasets into training and testing sets
from sklearn.model_selection import train_test_split

# Pipeline for chaining multiple steps (like data preprocessing and modeling) together
from sklearn.pipeline import Pipeline

# Handles missing data by filling with specified values or statistical measures
from sklearn.impute import SimpleImputer

# Encodes categorical features with an ordinal encoding (labels are assigned as ordered integers)
from sklearn.preprocessing import OrdinalEncoder as OrdinalEncoder_Sk

# Selects features based on importance scores from a trained model
from sklearn.feature_selection import SelectFromModel

# Evaluation metrics for regression models (Mean Squared Error and R² score)
from sklearn.metrics import mean_squared_error, r2_score

# Imputes missing values with arbitrary numbers or based on specific rules (Feature-engine library)
from feature_engine.imputation import ArbitraryNumberImputer, CategoricalImputer

# Encodes rare labels (those with few occurrences) and performs ordinal encoding (Feature-engine library)
from feature_engine.encoding import RareLabelEncoder, OrdinalEncoder

# Drops specified features from the dataset (Feature-engine library)
from feature_engine.selection import DropFeatures

# to display all the columns and up to 100 rows of the dataframe
pd.set_option('display.max_columns', None)
pd.set_option("display.max_rows", 100)

# for suppressing warnings
import warnings
warnings.filterwarnings('ignore')

### Load the data

In [None]:
# Read the 'Demo2_housing_dataset.csv' file into a DataFrame using pandas.
# This loads the dataset into a tabular format, making it easy to analyze and manipulate.
data = pd.read_csv('Demo2_housing_dataset.csv')

# Print the dimensions of the dataset (rows, columns).
# Useful to check the size of the dataset and ensure it was loaded correctly.
print(data.shape)

# Display the first 5 rows of the dataset to get an overview of the data.
# Helps to quickly inspect the structure and contents (e.g., column names and sample values).
data.head()

## Exploratory Data Analysis

Check the information of the dataframe regarding number of rows and columns, any null values, data types, etc.

In [None]:
data.info()

### Summarising the Data

The following cell displays the types of features, the number of unique entries, and the percentage of Null entries in each feature column.

In [None]:
# Create a DataFrame that summarizes the data types of each column in the dataset.
summary = pd.DataFrame(data.dtypes, columns=['dtype'])

# Reset the index to turn the column names into a regular column called 'Name'.
summary = summary.reset_index()

# Rename the 'index' column to 'Name' for better readability.
summary = summary.rename(columns={'index': 'Name'})

# Add a new column 'Null_Counts' that shows the total number of missing (null) values in each column.
summary['Null_Counts'] = data.isnull().sum().values

# Add a column 'Uniques' that counts the number of unique values for each column.
summary['Uniques'] = data.nunique().values

# Calculate the percentage of missing values for each column and store it in 'Null_Percent'.
summary['Null_Percent'] = (summary['Null_Counts'] * 100) / len(data)

# Sort the summary DataFrame by 'Null_Percent' in descending order to prioritize columns with the most missing data.
summary.sort_values(by='Null_Percent', ascending=False, inplace=True)

# Display the summary DataFrame to review the structure, null values, and uniqueness of each column.
summary

### Split dataset into train and test

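Separate the data into training and testing sets before any feature engineering. This avoids data leakage: imputation values, rare-label thresholds, and encodings are then learned from the training set only, so the test set gives an honest estimate of performance.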

In [None]:
# Split the dataset into training and testing sets using train_test_split.
# The target variable 'SalePrice' is separated from the feature variables.

# 'data.drop('SalePrice', axis=1)' removes the target column, keeping only feature columns for X (input).
# 'data.SalePrice' extracts the target variable (output) as y.

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('SalePrice', axis=1),  # Feature set (X)
    data.SalePrice,                  # Target variable (y)
    test_size=0.1,                   # 10% of the data used for testing, 90% for training.
    random_state=0                   # Ensures reproducibility by setting a random seed.
)

# Display the shape (dimensions) of the training and testing sets.
# Useful to confirm correct data splitting and understand the number of samples and features.
X_train.shape, X_test.shape, y_train.shape, y_test.shape
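
To illustrate why the split comes first, here is a minimal sketch on toy data (the column name `x` and the values are made up for illustration). The fill value is learned from the training rows only and then reused on the test rows; a statistic computed on the full data would already contain information from the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): one numeric column with a few missing entries
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0]})

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

# Learn the fill value from the training rows only
imp = SimpleImputer(strategy="median")
imp.fit(toy_train)

train_median = toy_train["x"].median()
full_median = toy["x"].median()   # would leak test information if used instead
print("fill value learned from train:", imp.statistics_[0])
print("median of the full data:      ", full_median)

# The fitted imputer is reused, unchanged, on the test rows
toy_test_filled = imp.transform(toy_test)
print(toy_test_filled.ravel())
```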

### Missing values
* Separating Date, Numerical, and Categorical variables
* Checking missing entries in each data type


In [None]:
# Define a list of columns that represent dates or years.
# These variables often need special treatment (e.g., extracting age).
vars_dates = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']

# Create a list of categorical variables (those with data type 'object').
# These typically represent categories or text data (e.g., 'Neighborhood', 'HouseStyle').
vars_cat = [var for var in X_train.columns if X_train[var].dtypes == 'O']

# Create a list of numerical variables (those that are not of data type 'object').
# Exclude the 'Id' column, as it is an identifier and not a feature to be used for modeling.
vars_num = [var for var in X_train.columns if X_train[var].dtypes != 'O' and var not in ['Id']]

In [None]:
# Calculate the proportion of missing values for each date-related variable in the training set.
# 'X_train[vars_dates]' selects only the date-related columns.
# 'isnull().mean()' calculates the fraction of missing values for each column.
# 'sort_values(ascending=False)' sorts the columns in descending order,
# prioritizing those with the most missing values at the top.
Date_V = X_train[vars_dates].isnull().mean().sort_values(ascending=False)

# Display the sorted proportions of missing values.
Date_V

In [None]:
# Visualize missing values in our date variables
plt.figure(figsize=(7, 5))
sns.barplot(x=Date_V.index, y=Date_V.values, hue=Date_V.index)
plt.xlabel("Date features")
plt.ylabel("Missing values")
plt.show()

In [None]:
# Missing values in our numerical variables
Num_V = X_train[vars_num].isnull().mean().sort_values(ascending=False)
print(Num_V)
len(Num_V)

In [None]:
# Visualize missing values in our numerical variables
plt.figure(figsize=(20, 5))

# 'x=Num_V.index' sets the x-axis as the variable names (column names).
# 'y=Num_V.values' sets the y-axis as the corresponding values (e.g., mean, missing value proportion, etc.).
# 'hue=Num_V.index' colors each bar differently based on the variable name, adding a legend.
sns.barplot(x=Num_V.index, y=Num_V.values, hue=Num_V.index)
plt.xlabel("Numerical features")
plt.ylabel("Missing values")
plt.xticks(rotation=80)
plt.show()

In [None]:
# Missing values in our categorical variables
Cat_V = X_train[vars_cat].isnull().mean().sort_values(ascending=False)
print(Cat_V)
len(Cat_V)

In [None]:
# Visualize missing values in our categorical variables
plt.figure(figsize=(20, 5))
sns.barplot(x=Cat_V.index, y=Cat_V.values, hue=Cat_V.index)
plt.xlabel("Categorical features")
plt.ylabel("Missing values")
plt.xticks(rotation=80)
plt.show()

### Handling missing data through imputation

In [None]:
# Impute numerical variables
# Create a SimpleImputer to fill missing values in 'LotFrontage' with a constant value of -1.
# This ensures missing values are handled consistently across both training and testing sets.
imputer = SimpleImputer(strategy='constant', fill_value=-1)

# Apply the imputer to the 'LotFrontage' column in the training set.
# 'to_frame()' converts the column into a DataFrame to match the input format expected by the imputer.
X_train['LotFrontage'] = imputer.fit_transform(X_train['LotFrontage'].to_frame())

# Use the same imputer (already fit on the training data) to transform the 'LotFrontage' column in the test set.
X_test['LotFrontage'] = imputer.transform(X_test['LotFrontage'].to_frame())

# Create another SimpleImputer to fill missing values in numerical variables with the most frequent value (mode).
# This method is useful for variables with repeated values where mode is a meaningful replacement.
imputer = SimpleImputer(strategy='most_frequent')

# Fit the imputer on the numerical variables in the training set and transform them.
X_train[vars_num] = imputer.fit_transform(X_train[vars_num])

# Apply the trained imputer to the test set to ensure consistent transformations.
X_test[vars_num] = imputer.transform(X_test[vars_num])

In [None]:
# Impute categorical variables
imputer = SimpleImputer(strategy='constant', fill_value='missing')
X_train[vars_cat] = imputer.fit_transform(X_train[vars_cat])
X_test[vars_cat] = imputer.transform(X_test[vars_cat])
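
The two numerical strategies used above behave differently, which the following sketch shows on a toy column (the name `rooms` and its values are hypothetical). The constant strategy marks gaps with a sentinel, while the most-frequent strategy fills them with the mode of the training data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical column name and values)
train = pd.DataFrame({"rooms": [3.0, 3.0, np.nan, 4.0]})
test = pd.DataFrame({"rooms": [np.nan, 5.0]})

# Constant strategy: every gap becomes the sentinel value -1
const_imp = SimpleImputer(strategy="constant", fill_value=-1)
# Most-frequent strategy: every gap becomes the mode of the training column
mode_imp = SimpleImputer(strategy="most_frequent")

const_test = const_imp.fit(train).transform(test)
mode_test = mode_imp.fit(train).transform(test)

print(const_test.ravel())
print(mode_test.ravel())
```

A sentinel such as -1 keeps "was missing" visible to a tree-based model, whereas the mode blends the gap into the most common value.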

### Temporal features

Extracting information from the data to capture the difference in years between the year in which the house was built, and the year in which the house was sold.

In [None]:
# Create new temporal features from date variables
def elapsed_years(df, var):
    # capture difference between year variable and year the house was sold
    df[var] = df['YrSold'] - df[var]
    return df

In [None]:
# Apply it to both train and test set
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)

In [None]:
# Check that test set does not contain null values in the engineered variables
[var for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'] if X_test[var].isnull().sum() > 0]

### Checking whether any null values remain in the train or test set

In [None]:
# Train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

In [None]:
# Test set
[var for var in X_train.columns if X_test[var].isnull().sum() > 0]

### Replacing all rarely appearing categories with 'Rare'

The `RareLabelEncoder()` groups rare or infrequent categories in a new category called “`Rare`”, or any other name entered by the user.

In [None]:
# Encode rare categories
# Create a RareLabelEncoder to group infrequent categories in categorical variables.
# 'tol=0.01' sets the tolerance, meaning categories that appear in less than 1% of the data will be grouped as 'Rare'.
# 'n_categories=5' ensures that only variables with more than 5 unique categories will undergo rare label encoding.
# 'variables=vars_cat' specifies the categorical variables to apply this transformation.

rare_enc = RareLabelEncoder(tol=0.01, n_categories=5, variables=vars_cat)

# Fit the encoder on the training data to learn the rare categories.
rare_enc.fit(X_train)

# Transform the training set by grouping rare categories under a common label ('Rare').
X_train = rare_enc.transform(X_train)

# Apply the same transformation to the test set to ensure consistency.
X_test = rare_enc.transform(X_test)
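
As a rough picture of what rare-label grouping does, the sketch below reimplements the idea for a single column in plain pandas. The helper `group_rare_labels` is hypothetical and only approximates the encoder; it ignores missing values and other details handled by the library.

```python
import pandas as pd

def group_rare_labels(train_col, other_col, tol=0.01, n_categories=5, label="Rare"):
    """Approximate rare-label grouping for one column (illustrative helper)."""
    # Columns with few distinct categories are left untouched
    if train_col.nunique() <= n_categories:
        return train_col.copy(), other_col.copy()
    # Categories whose training frequency reaches the tolerance are kept
    freq = train_col.value_counts(normalize=True)
    frequent = freq[freq >= tol].index
    # Everything else, including categories unseen in training, gets the common label
    return (train_col.where(train_col.isin(frequent), label),
            other_col.where(other_col.isin(frequent), label))

# Toy column: 100 rows, 6 categories; 'E' and 'F' each cover 1% of the rows
train_col = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 3 + ["E"] + ["F"])
test_col = pd.Series(["A", "E", "Z"])   # 'Z' never appears in training

train_out, test_out = group_rare_labels(train_col, test_col, tol=0.02, n_categories=5)
print(train_out.value_counts())
print(test_out.tolist())
```

The frequent categories are learned from the training column only, so the test column is mapped with the same list.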

### Checking for rare categories

In [None]:
# Loop through each categorical variable in 'vars_cat'.
for var in vars_cat:
    # Calculate the frequency (proportion) of each category in the current variable.
    # 'value_counts(normalize=True)' divides each count by the number of non-null rows
    # in X_train (90% of the 1460 samples after the split), so the proportions sum to 1.
    freq_df = pd.DataFrame(X_train[var].value_counts(normalize=True))

    # Print the frequency distribution of the current categorical variable.
    print(freq_df)

### Encoding of Categorical variables

Transform the string values of categorical variables into numerical values.

In [None]:
# Encode with labels
# Create an instance of OrdinalEncoder from scikit-learn to encode categorical variables as ordinal integers.
ordinal_enc = OrdinalEncoder_Sk()

# Fit the ordinal encoder on the categorical variables of the training set and transform them.
# Categories are sorted (alphabetically for strings) and numbered 0, 1, 2, ... in that order.
X_train[vars_cat] = ordinal_enc.fit_transform(X_train[vars_cat])

# Transform the categorical variables in the test set using the same encoder.
# This ensures that the same encoding scheme is applied to both training and testing datasets.
X_test[vars_cat] = ordinal_enc.transform(X_test[vars_cat])

In [None]:
# Check any null values in test set
[var for var in X_train.columns if X_test[var].isnull().sum() > 0]
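
The following sketch on toy data (the column name and values are illustrative) shows how scikit-learn's `OrdinalEncoder` numbers categories, and one way to handle a test-set category that was never seen during fitting. By default the encoder raises an error in that case; `handle_unknown="use_encoded_value"` maps unseen categories to a chosen value instead.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): 'Fa' appears only in the test set
train = pd.DataFrame({"quality": ["TA", "Gd", "Ex", "Gd", "TA"]})
test = pd.DataFrame({"quality": ["Gd", "Fa"]})

# Map categories unseen during fit to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)

# Categories are stored in sorted order; the position is the integer code
print(enc.categories_[0])

encoded_test = enc.transform(test)
print(encoded_test.ravel())
```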

### Building Pipeline for Pre-Processing

All the pre-processing steps above can be implemented inside a pre-processing pipeline. A pipeline removes the need to hand-code the same operation separately for the train and test sets. It also makes testing and deployment easier to automate.

#### Creating a class for the temporal transformation that is compatible with scikit-learn pipelines

In the pre-processing steps above, a function was created to calculate the years elapsed. Now we convert that function into a class that can be inserted into the pipeline.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class TemporalVariableTransformer(BaseEstimator, TransformerMixin):
    # Transformer for calculating the elapsed time between a reference variable and specified temporal variables.

    def __init__(self, variables, reference_variable):
        # Initialize the transformer with the list of temporal variables and a reference variable.

        # Check that 'variables' is a list; if not, raise a ValueError.
        if not isinstance(variables, list):
            raise ValueError('variables should be a list')

        self.variables = variables  # Store the list of temporal variables.
        self.reference_variable = reference_variable  # Store the reference variable.

    def fit(self, X, y=None):
        # Fit method required for scikit-learn pipeline compatibility.
        # This method does not need to perform any operations, so it just returns self.
        return self

    def transform(self, X):
        # Transform the input DataFrame to calculate elapsed time.

        # Create a copy of the DataFrame to avoid modifying the original data.
        X = X.copy()

        # Loop through each temporal variable and calculate the difference between the reference variable and the temporal variable.
        for feature in self.variables:
            X[feature] = X[self.reference_variable] - X[feature]  # Calculate elapsed time.

        return X  # Return the modified DataFrame with updated temporal variables.
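
A quick way to check a custom transformer is to run it on a couple of toy rows inside a `Pipeline`. The sketch below uses a compact stand-in class, `ElapsedYears` (a hypothetical name), with the same logic as the transformer above so that the example runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ElapsedYears(BaseEstimator, TransformerMixin):
    """Compact stand-in with the same logic as the transformer above."""

    def __init__(self, variables, reference_variable):
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        return self   # nothing to learn from the data

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        for feature in self.variables:
            X[feature] = X[self.reference_variable] - X[feature]
        return X

# Toy rows (illustrative values)
toy = pd.DataFrame({"YearBuilt": [2000, 1995], "YrSold": [2008, 2010]})

pipe = Pipeline([("elapsed", ElapsedYears(["YearBuilt"], "YrSold"))])
out = pipe.fit_transform(toy)
print(out)
```

Because `transform` works on a copy, the original `toy` frame still holds the raw years afterwards.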

#### Building the Pre-Processing pipeline

In [None]:
price_pipe = Pipeline([

    # ===== IMPUTATION =====
    # impute 'LotFrontage' with the arbitrary number -1
    ('ArbitraryNumber_imputation', ArbitraryNumberImputer(arbitrary_number=-1, variables='LotFrontage')),

    # impute numerical variables with the most frequent value
    ('frequentNumber_imputation', CategoricalImputer(imputation_method='frequent', variables=vars_num, ignore_format=True)),

    # impute categorical variables with the string 'Missing'
    ('missing_imputation', CategoricalImputer(imputation_method='missing', variables=vars_cat)),

    # ===== TEMPORAL VARIABLES =====
    ('elapsed_time', TemporalVariableTransformer(
        variables=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'], reference_variable='YrSold')),

    # Note: this removes the elapsed-time columns computed in the previous step,
    # so the model does not see them.
    ('drop_features', DropFeatures(features_to_drop=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'])),

    # ===== CATEGORICAL ENCODING =====
    ('rare_label_encoder', RareLabelEncoder(tol=0.01, n_categories=5, variables=vars_cat)),

    # encode categories as integers ordered by the mean of the target per category
    ('categorical_encoder', OrdinalEncoder(encoding_method='ordered', variables=vars_cat)),

])
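
The `encoding_method='ordered'` step in the pipeline numbers each category by the rank of its mean target value. A minimal pandas sketch of that idea, on toy data with made-up values, could look like this:

```python
import pandas as pd

# Toy data (illustrative categories and prices)
X = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Dirt", "Grvl", "Pave"]})
y = pd.Series([200, 100, 220, 50, 120, 210])

# Mean target per category, sorted in ascending order
ordered = y.groupby(X["Street"]).mean().sort_values().index

# The category with the lowest mean target gets 0, the next one 1, and so on
mapping = {cat: i for i, cat in enumerate(ordered)}
encoded = X["Street"].map(mapping)

print(mapping)
print(encoded.tolist())
```

This is why the pipeline must be fitted with `y_train`: the mapping depends on the target, and it is learned from the training set only.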

Since the train and test sets were already pre-processed step by step above, the pipeline cannot be applied to them as they are. To apply the pipeline, we repeat the same train-test split here to obtain unprocessed train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('SalePrice', axis=1), # predictors
                                                    data.SalePrice, # target
                                                    test_size=0.1,
                                                    random_state=0)  # for reproducibility

X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Applying the pipeline to the train and test sets

In [None]:
price_pipe.fit(X_train,y_train) # Fitting

In [None]:
X_train_tfr = price_pipe.transform(X_train)        # Transformation for train set

In [None]:
X_test_tfr = price_pipe.transform(X_test)          # Transformation for test set

In [None]:
X_train_tfr

## XG-Boost Regressor

In [None]:
# Create an xgboost regression model
# Create an instance of the XGBoost regressor for a regression task.
# This model will be used for predicting continuous target variables.

model = xgb.XGBRegressor(
    n_estimators=100,            # Number of boosting rounds (trees) to be created. More trees can lead to better performance.
    max_depth=7,                 # Maximum depth of each tree. Controls the complexity of the model; deeper trees can capture more information but may overfit.
    eta=0.1,                     # Learning rate (alias of 'learning_rate'). A smaller value makes the model more robust but requires more boosting rounds.
    subsample=0.7,               # Fraction of samples used for fitting individual trees. Reduces overfitting by randomly sampling training data.
    colsample_bytree=0.8,        # Fraction of features used for each tree. Helps prevent overfitting by introducing randomness in feature selection.
    objective='reg:squarederror',# The learning task objective. Here, it indicates a regression task using squared error loss.
    random_state=0               # Random seed for reproducibility. Ensures that the results can be replicated across runs.
)

**Note :**  Good hyperparameter values can be found by trial and error for a given dataset, or systematic experimentation such as using a grid search across a range of values.

**The most commonly configured hyperparameters are the following:**

**n_estimators:** The number of trees in the ensemble, often increased until no further improvements are seen.

**max_depth:** The maximum depth of each tree, often values are between 1 and 10.

**eta:** The learning rate, which scales the contribution of each new tree; often set to small values such as 0.3, 0.1, 0.01, or smaller.

**subsample:** The fraction of samples (rows) used to build each tree, set to a value between 0 and 1, often 1.0 to use all samples.

**colsample_bytree:** The fraction of features (columns) used to build each tree, set to a value between 0 and 1, often 1.0 to use all features.

**XGBoost Parameters Detail** - [Ref.](https://xgboost.readthedocs.io/en/stable/parameter.html)
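
A grid search over a few of these hyperparameters might be sketched as follows. To keep the example self-contained, it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in; this is an assumption made for illustration, not the model used in this notebook. Since `xgb.XGBRegressor` follows the scikit-learn estimator interface, it could be passed to `GridSearchCV` in the same way. The grid values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, only so the example runs on its own
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# A deliberately small, illustrative grid
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4],
    "learning_rate": [0.1, 0.3],
}

search = GridSearchCV(
    GradientBoostingRegressor(subsample=0.7, random_state=0),  # stand-in estimator
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```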

In [None]:
# Train on training set
model.fit(X_train_tfr, y_train)

In [None]:
# Evaluate the model:
# Evaluate performance using the mean squared error and the root of the mean squared error
pred = model.predict(X_train_tfr)
print('XGBoost train mse: {}'.format(mean_squared_error(y_train, pred)))
print('XGBoost train rmse: {}'.format(sqrt(mean_squared_error(y_train, pred))))
print()

pred = model.predict(X_test_tfr)
print('XGBoost test mse: {}'.format(mean_squared_error(y_test, pred)))
print('XGBoost test rmse: {}'.format(sqrt(mean_squared_error(y_test, pred))))
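
The notebook imports `r2_score` but reports only MSE and RMSE. As a small worked example on made-up numbers, the sketch below computes all three metrics and checks R² against its definition, 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values, chosen so the arithmetic is easy to follow
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared residuals
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # share of variance explained

# R² from its definition
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse, rmse, r2, r2_manual)
```

RMSE is in the same units as `SalePrice`, while R² is unitless, so reporting both makes results easier to interpret.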

In [None]:
# Evaluating predictions with respect to the original price
plt.scatter(y_test, model.predict(X_test_tfr))
plt.xlabel('True House Price')
plt.ylabel('Predicted House Price')
plt.title('Evaluation of XGBoost Predictions')
plt.show()

### Displaying the feature importance value given by the XG-Boost model

In [None]:
# List features
print(X_train_tfr.columns.to_list())

In [None]:
# Feature importance given by XGB
print(model.feature_importances_)

In [None]:
# Feature Importance in dataframe
# Create a DataFrame to store the feature names and their corresponding importance scores.
dfeature = pd.DataFrame({
    'Var': X_train_tfr.columns.to_list(),  # Extract feature names from the transformed training set.
    'Importance': model.feature_importances_  # Get the importance scores from the trained XGBoost model.
}).sort_values(by='Importance', ascending=False)  # Sort the DataFrame by importance scores in descending order.

# Display the DataFrame containing features and their importance scores.
dfeature

In [None]:
# Plot bar plot showing feature importances
plt.figure(figsize=(24, 6))
sns.barplot(x=dfeature['Var'], y=dfeature['Importance'], hue=dfeature['Var'])
plt.xticks(rotation=80)
plt.show()

We can pick the features with the highest importance values, for example the top 15:

In [None]:
# Top 15 features
dfeature[:15]['Var'].to_list()

## Feature Selection

The manual selection of the best features above can be automated with scikit-learn's `SelectFromModel` class. It takes a model that exposes a `feature_importances_` or `coef_` attribute after fitting, trains it, and keeps the features whose importance reaches a threshold (the mean importance by default for a model such as this one).

In [None]:
# Feature selection using SelectFromModel, with XGBoost Regressor

sel_ = SelectFromModel(xgb.XGBRegressor(n_estimators=150, objective='reg:squarederror', random_state=0))
sel_.fit(X_train_tfr, y_train)

In [None]:
# Show the number of total features and selected features
selected_feat = X_train_tfr.columns[(sel_.get_support())]
print('total features: {}'.format((X_train_tfr.shape[1])))
print('selected features: {}'.format(len(selected_feat)))

In [None]:
selected_feat
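
To make the selection rule concrete, the sketch below reproduces the `SelectFromModel` mask by hand on synthetic data. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in estimator, an assumption made so the example runs without extra dependencies; with no `threshold` given, features whose importance is at least the mean importance are kept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 of which carry signal
X_toy, y_toy = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

selector = SelectFromModel(GradientBoostingRegressor(random_state=0))  # stand-in estimator
selector.fit(X_toy, y_toy)

# Reproduce the selection by hand: keep importances >= their mean
importances = selector.estimator_.feature_importances_
manual_mask = importances >= importances.mean()

print("threshold used:", selector.threshold_)
print("selected columns:", np.flatnonzero(selector.get_support()))
```

A stricter or looser cut can be requested through the `threshold` argument, for example `threshold="median"`.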

## Re-build model with selected features

In [None]:
model.fit(X_train_tfr[selected_feat], y_train)

In [None]:
# Evaluate performance using the mean squared error and the root of the mean squared error
pred = model.predict(X_train_tfr[selected_feat])
print('XGBoost train mse: {}'.format(mean_squared_error(y_train, pred)))
print('XGBoost train rmse: {}'.format(sqrt(mean_squared_error(y_train, pred))))
print()
pred = model.predict(X_test_tfr[selected_feat])
print('XGBoost test mse: {}'.format(mean_squared_error(y_test, pred)))
print('XGBoost test rmse: {}'.format(sqrt(mean_squared_error(y_test, pred))))

### Evaluating predictions with respect to the original price

In [None]:
# Evaluating predictions with respect to the original price
plt.scatter(y_test, model.predict(X_test_tfr[selected_feat]))
plt.xlabel('True House Price')
plt.ylabel('Predicted House Price')
plt.title('Evaluation of XGBoost Predictions')
plt.show()

## Training XGBoost without Pre-Processing

XGBoost can handle categorical variables ([Ref.](https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html)) and supports missing values by default ([Ref.](https://xgboost.readthedocs.io/en/stable/faq.html)). In tree algorithms, branch directions for missing values are learned during training. Note that the gblinear booster treats missing values as zeros. When the `missing` parameter is specified, input values equal to it are treated as missing; by default it is set to NaN. With this in mind, we train a model without pre-processing and compare the results.

In [None]:
data_no_pro = pd.read_csv('housing_dataset.csv')
print(data_no_pro.shape)
data_no_pro.head()

In [None]:
# Create a list of categorical variables (those with data type 'object') from the training DataFrame.
# This is done by iterating over all columns in X_train and checking their data types.

vars_cat = [var for var in X_train.columns if X_train[var].dtypes == 'O']

# 'vars_cat' will contain the names of all categorical variables, which can be useful for further preprocessing steps.

In [None]:
# Convert all categorical variables in the DataFrame 'data_no_pro' to the 'category' data type.
# This can help reduce memory usage and improve performance when dealing with categorical data.

data_no_pro[vars_cat] = data_no_pro[vars_cat].apply(lambda x: x.astype('category'))

# The 'apply' function is used to apply the lambda function to each column specified in 'vars_cat'.
# The lambda function converts each column to the 'category' type, which is more efficient for categorical data.
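
The conversion to the `category` dtype keeps missing entries as missing, which is what lets the model handle them on its own later. A small sketch on toy rows (illustrative values) shows this:

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values) with one missing category
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Treat every non-numeric column as categorical
cat_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
for c in cat_cols:
    toy[c] = toy[c].astype("category")

print(toy.dtypes)

# Each category gets an integer code; missing entries get the code -1
codes = toy["GarageType"].cat.codes.tolist()
print(codes)
```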

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_no_pro.drop('SalePrice', axis=1), # predictors
                                                    data_no_pro.SalePrice, # target
                                                    test_size=0.1,
                                                    random_state=0)  # for reproducibility

X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Temporal features
Extracting information from the data to capture the difference in years between the year in which the house was built, and the year in which the house was sold.

In [None]:
# Create new temporal features from date variables
def elapsed_years(df, var):
    # capture difference between year variable and year the house was sold
    df[var] = df['YrSold'] - df[var]
    return df

In [None]:
# Apply it to both train and test set
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)

In [None]:
# Drop the three year columns (this also removes the elapsed-time values just computed).
X_train = X_train.drop(columns=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'])
X_test = X_test.drop(columns=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt'])

### Modelling with unprocessed features

In [None]:
# Create an xgboost regression model
model_no_pro = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    eta=0.1,
    subsample=0.7,
    colsample_bytree=0.8,
    objective='reg:squarederror',
    random_state=0,
    enable_categorical=True,   # use pandas 'category' columns directly
    tree_method='approx'
)

In [None]:
# Train on training set
model_no_pro.fit(X_train, y_train)

**Note - Hyperparameters:**   **enable_categorical** - [Ref.](https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html), **tree_method** - [Ref.](https://xgboost.readthedocs.io/en/stable/parameter.html)

In [None]:
# Evaluate the model:
# Evaluate performance using the mean squared error and the root of the mean squared error
pred = model_no_pro.predict(X_train)
print('XGBoost (no pre-processing) train mse: {}'.format(mean_squared_error(y_train, pred)))
print('XGBoost (no pre-processing) train rmse: {}'.format(sqrt(mean_squared_error(y_train, pred))))
print()
pred = model_no_pro.predict(X_test)
print('XGBoost (no pre-processing) test mse: {}'.format(mean_squared_error(y_test, pred)))
print('XGBoost (no pre-processing) test rmse: {}'.format(sqrt(mean_squared_error(y_test, pred))))