In [8]:
# =============================================================
# Copyright © 2020 Intel Corporation
# 
# SPDX-License-Identifier: MIT
# =============================================================

# XGBoost Getting Started Example on Linear Regression
## Importing and Organizing Data
In this example we will be predicting prices of houses in California based on the features of each house using Intel optimizations for XGBoost shipped as a part of the oneAPI AI Analytics Toolkit.
Let's start by **importing** all necessary data and packages.

In [8]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np

Now let's **load** in the dataset and **organize** it as necessary to work with our model.

In [9]:
#loading the data
california = fetch_california_housing()

#converting data into a pandas dataframe
data = pd.DataFrame(california.data)
data.columns = california.feature_names

#adding the target (median house value, in units of $100,000) as the PRICE column; this single column is what the model predicts
data['PRICE'] = california.target

#separating the features (every column except the last) from the target (the last column)
X, y = data.iloc[:,:-1],data.iloc[:,-1]

#building a DMatrix, the internal data structure used by XGBoost,
#which is optimized for both memory efficiency and training speed.
#A DMatrix can be constructed from several different sources of data.
#It is shown here for reference only: the XGBRegressor used below is fit on the DataFrames directly.
#XGBoost builds its models from CART (Classification and Regression Trees) decision trees.
data_dmatrix = xgb.DMatrix(data=X,label=y)

#splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1693)
#The random state hyperparameter in the train_test_split() function controls the shuffling process.
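
The effect of fixing `random_state` can be sketched without scikit-learn as a seeded shuffle followed by a cut. This is a simplified stand-in and not how `train_test_split` is implemented; `toy_split` is a hypothetical helper and the eight integer "rows" are made up for illustration.

```python
import random

def toy_split(rows, test_size, seed):
    """Shuffle rows with a seeded generator, then cut off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)         # same seed -> same order
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

rows = list(range(8))                             # made-up row ids
train_a, test_a = toy_split(rows, test_size=0.25, seed=1693)
train_b, test_b = toy_split(rows, test_size=0.25, seed=1693)

# Identical seeds give identical splits, which is what makes a run repeatable.
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

Changing the seed will generally produce a different split, so reported metrics are only comparable between runs that use the same value.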


**Instantiate and define the XGBoost regression object** by calling the `XGBRegressor()` class from the library. Use hyperparameters to define the object. Intel optimizations for XGBoost training can be used by setting the tree method to `hist` in the parameters, as shown below.

In [10]:
'''
Learning rate (learning_rate):
Controls how much each new tree contributes, i.e. how fast the model learns.
If the learning rate is too large, the model can overshoot the optimal solution.
If it is too small, many more iterations are needed to converge to good values.
So choosing a good learning rate is crucial.

Maximum depth of a tree (max_depth):
Increasing this value makes the model more complex and more likely to overfit.
XGBoost aggressively consumes memory when training a deep tree!

colsample_bytree:
Part of a family of parameters for subsampling columns.
All colsample_by* parameters have a range of (0, 1], a default value of 1, and specify the fraction of columns to be subsampled.
colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

alpha:
L1 regularization term on weights. Increasing this value makes the model more conservative, which combats overfitting. The default is 0.

n_estimators:
The number of boosting rounds, i.e. the number of trees built one after another.
Each tree is fit to correct the errors of the trees before it, and the final prediction is the sum of all trees' outputs (not a vote or an average, as in a random forest).
More trees usually fit the training data better, but training and prediction become slower and overfitting becomes more likely.

tree_method (string, default = auto):
The tree construction algorithm used in XGBoost.
hist: faster histogram-optimized approximate greedy algorithm.

'''
   

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10, tree_method='hist')
#the model is defined here but has not been trained yet
print(xg_reg)

XGBRegressor(alpha=10, base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=0.3, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=0.1, max_delta_step=None, max_depth=5,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=10, n_jobs=None, num_parallel_tree=None,
             random_state=None, reg_alpha=None, reg_lambda=None,
             scale_pos_weight=None, subsample=None, tree_method='hist',
             validate_parameters=None, verbosity=None)
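
To make `colsample_bytree = 0.3` concrete, the sketch below draws a fresh subset of the eight housing features for each of three trees. It is only an illustration: `sample_columns` is a hypothetical helper, and the rounding rule (truncate, but keep at least one column) is an assumption rather than a statement about XGBoost's internals.

```python
import random

feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

def sample_columns(names, colsample_bytree, rng):
    """Pick the columns one tree may split on (rounding rule is an assumption)."""
    n_keep = max(1, int(colsample_bytree * len(names)))
    return sorted(rng.sample(names, n_keep))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for tree_index in range(3):
    # a new subset is drawn once per tree
    print(tree_index, sample_columns(feature_names, 0.3, rng))
```

Under that assumption each tree sees only two of the eight features, which decorrelates the trees but also limits what any single tree can learn.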


## Training and Saving the model

**Fit (train) the model** on the training set, then predict values for the test set. Note that Intel optimizations for XGBoost inference are enabled by default.
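
What fitting does with `n_estimators` and `learning_rate` can be sketched in plain Python. This is a deliberately simplified version of gradient boosting for squared error, using one-split trees on a single feature and starting from the mean of the target. It is not XGBoost's algorithm, which also uses second-order gradients and regularization. `fit_stump`, `boosted_rmse`, and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (threshold, left value, right value) with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:          # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boosted_rmse(xs, ys, n_estimators, learning_rate):
    """Training RMSE after adding n_estimators shrunken stumps to a constant start."""
    preds = [sum(ys) / len(ys)] * len(ys)   # start from the mean (a simplification)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        # each new tree corrects the current errors, scaled by the learning rate
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5

xs = list(range(10))
ys = [x * x for x in xs]                    # toy target, not the housing data

for rounds in (0, 10, 100):
    print(rounds, boosted_rmse(xs, ys, n_estimators=rounds, learning_rate=0.1))
```

In this sketch the training error keeps falling as rounds are added, and with a learning rate of 0.1 ten rounds close only part of the gap. That suggests a small `n_estimators` combined with a small `learning_rate` can leave a model underfit.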

In [11]:
#train the model on the training data
xg_reg.fit(X_train,y_train)
#use the trained model to predict prices for the held-out test set
preds = xg_reg.predict(X_test)
print(X_test)
print(preds) #predicted prices

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
8111   2.7183      48.0  3.905380   1.098330      1807.0  3.352505     33.79   
379    2.4962      37.0  5.324324   1.108108       835.0  3.223938     37.75   
18071  9.1569      22.0  7.252669   0.925267       773.0  2.750890     37.28   
4462   4.9231       8.0  4.748936   1.136170       435.0  1.851064     34.10   
6077   4.5100      22.0  5.858597   1.067873      2315.0  2.618778     34.10   
...       ...       ...       ...        ...         ...       ...       ...   
6354   2.5275      27.0  4.246654   1.036329      1328.0  2.539197     34.14   
14797  2.3253      14.0  4.239732   1.088852      3662.0  3.069573     32.57   
9718   3.3472      37.0  6.625000   1.015625       266.0  4.156250     36.87   
14789  2.3603      27.0  4.071839   1.071839       814.0  2.339080     32.58   
1595   6.8686      35.0  6.666667   1.053030       352.0  2.666667     37.89   

       Longitude  
8111     -118.20  
3

**Finding root mean squared error** of predicted values.

In [12]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE:",rmse)
# Here we evaluate the trained model on the test set.
# RMSE measures the typical size of the prediction error, in the same units as the target.
# Lower values of RMSE indicate a better fit.

RMSE: 1.0823382872176526
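
As a worked illustration of the metric, the sketch below computes RMSE by hand on four made-up values and compares it with a baseline that always predicts the mean. The numbers are invented for the example and are unrelated to the housing results above.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [3.0, 5.0, 2.0, 6.0]   # made-up targets
y_pred = [2.0, 5.0, 4.0, 6.0]   # made-up predictions
mean_guess = [sum(y_true) / len(y_true)] * len(y_true)  # baseline: always predict the mean

print("model RMSE:   ", rmse(y_true, y_pred))      # sqrt(1.25)
print("baseline RMSE:", rmse(y_true, mean_guess))  # sqrt(2.5)
```

Comparing against a simple baseline like this gives an RMSE value some context, since the number alone does not say whether the model learned anything useful.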


## Saving the Results

Now let's **export the predicted values to a CSV file**.

In [13]:
pd.DataFrame(preds).to_csv('foo.csv',index=False)

In [14]:
print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]")

[CODE_SAMPLE_COMPLETED_SUCCESFULLY]
