# XGBoost on SQLFlow Tutorial

This tutorial shows how to train an XGBoost model and use it for prediction in SQLFlow. For more on SQLFlow usage, see the [User Guide](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/user_guide.md). In this tutorial you will learn how to:
- Train an XGBoost model to fit the Boston housing dataset; and
- Predict the housing price using the trained model.


## The Dataset

This tutorial uses the [Boston Housing](https://www.kaggle.com/c/boston-housing) dataset for demonstration.
The dataset contains 506 rows and 14 columns. The meaning of each column is as follows:

Column | Description
-- | -- 
crim|per capita crime rate by town.
zn|proportion of residential land zoned for lots over 25,000 sq.ft.
indus|proportion of non-retail business acres per town.
chas|Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox|nitrogen oxides concentration (parts per 10 million).
rm|average number of rooms per dwelling.
age|proportion of owner-occupied units built prior to 1940.
dis|weighted mean of distances to five Boston employment centres.
rad|index of accessibility to radial highways.
tax|full-value property-tax rate per \$10,000.
ptratio|pupil-teacher ratio by town.
b|1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat|lower status of the population (percent).
medv|median value of owner-occupied homes in \$1000s.

We split the dataset into a training table (`boston.train`) and a test table (`boston.test`), used for training and prediction respectively. During training, SQLFlow automatically splits the training data further into training and validation sets.
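
How SQLFlow performs the train/validation split is not described here, so the following Python sketch only illustrates the general idea of a random holdout split. The helper name `split_train_validation`, the 20% validation ratio, and the fixed seed are assumptions made for illustration, not SQLFlow settings.

```python
import random


def split_train_validation(rows, validation_ratio=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for validation."""
    if not 0.0 < validation_ratio < 1.0:
        raise ValueError("validation_ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_ratio)
    return shuffled[n_validation:], shuffled[:n_validation]


# Stand-in row ids; a real run would use the rows of the training table.
rows = list(range(100))
train_rows, validation_rows = split_train_validation(rows)
print(len(train_rows), len(validation_rows))
```

The model is fitted on the training part only, while the validation part is used to monitor how well it generalizes during training.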

In [1]:
%%sqlflow
describe boston.train;

+---------+---------+------+-----+---------+-------+
|  Field  |   Type  | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
|   crim  |  float  | YES  |     |   None  |       |
|    zn   |  float  | YES  |     |   None  |       |
|  indus  |  float  | YES  |     |   None  |       |
|   chas  | int(11) | YES  |     |   None  |       |
|   nox   |  float  | YES  |     |   None  |       |
|    rm   |  float  | YES  |     |   None  |       |
|   age   |  float  | YES  |     |   None  |       |
|   dis   |  float  | YES  |     |   None  |       |
|   rad   | int(11) | YES  |     |   None  |       |
|   tax   | int(11) | YES  |     |   None  |       |
| ptratio |  float  | YES  |     |   None  |       |
|    b    |  float  | YES  |     |   None  |       |
|  lstat  |  float  | YES  |     |   None  |       |
|   medv  |  float  | YES  |     |   None  |       |
+---------+---------+------+-----+---------+-------+

In [2]:
%%sqlflow
describe boston.test;

+---------+---------+------+-----+---------+-------+
|  Field  |   Type  | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
|   crim  |  float  | YES  |     |   None  |       |
|    zn   |  float  | YES  |     |   None  |       |
|  indus  |  float  | YES  |     |   None  |       |
|   chas  | int(11) | YES  |     |   None  |       |
|   nox   |  float  | YES  |     |   None  |       |
|    rm   |  float  | YES  |     |   None  |       |
|   age   |  float  | YES  |     |   None  |       |
|   dis   |  float  | YES  |     |   None  |       |
|   rad   | int(11) | YES  |     |   None  |       |
|   tax   | int(11) | YES  |     |   None  |       |
| ptratio |  float  | YES  |     |   None  |       |
|    b    |  float  | YES  |     |   None  |       |
|  lstat  |  float  | YES  |     |   None  |       |
|   medv  |  float  | YES  |     |   None  |       |
+---------+---------+------+-----+---------+-------+

## Fit Boston Housing Dataset

First, let's train an XGBoost regression model to fit the Boston housing dataset. We train the model for 30 rounds
with the `squarederror` loss function, so the SQLFlow extended SQL looks like this:

``` sql
TRAIN xgboost.gbtree
WITH
    train.num_boost_round=30,
    objective="reg:squarederror"
```

`xgboost.gbtree` is the estimator name, and `gbtree` is one of the XGBoost boosters. You can find more information in the [XGBoost parameter documentation](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters).
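
The `reg:squarederror` objective asks XGBoost to minimize squared error, and the training log reports the related root-mean-square error (RMSE). As a rough illustration of what that metric measures, here is a minimal Python sketch. The `rmse` helper and the sample numbers are made up for this example and are not taken from the Boston dataset.

```python
import math


def rmse(predictions, labels):
    """Root-mean-square error between predicted and true values."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    total = sum((p - y) ** 2 for p, y in zip(predictions, labels))
    return math.sqrt(total / len(labels))


# Illustrative values only: two predictions are off by 2.0, one is exact.
labels = [24.0, 21.6, 34.7]
predictions = [22.0, 21.6, 36.7]
print(rmse(predictions, labels))
```

A lower RMSE means the predicted prices are, on average, closer to the true `medv` values, expressed in the same unit as the label.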

We specify the training data columns in the `COLUMN` clause and the label with the `LABEL` keyword:

``` sql
COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
LABEL medv
```

To save the trained model, we use the `INTO` clause to specify a model name:

``` sql
INTO sqlflow_models.my_xgb_regression_model
```

Second, let's use a standard SQL statement to fetch the training data from the table `boston.train`:

``` sql
SELECT * FROM boston.train
```

Finally, here is the complete SQLFlow training statement for this regression task. You can run it in a cell:

In [5]:
%%sqlflow
SELECT * FROM boston.train
TRAIN xgboost.gbtree
WITH
    objective="reg:squarederror",
    train.num_boost_round = 30
COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
LABEL medv
INTO sqlflow_models.my_xgb_regression_model;

[03:44:56] 387x13 matrix with 5031 entries loaded from train.txt

[03:44:56] 109x13 matrix with 1417 entries loaded from test.txt

[0]	train-rmse:17.0286	validation-rmse:17.8089

[1]	train-rmse:12.285	validation-rmse:13.2787

[2]	train-rmse:8.93071	validation-rmse:9.87677

[3]	train-rmse:6.60757	validation-rmse:7.64013

[4]	train-rmse:4.96022	validation-rmse:6.0181

[5]	train-rmse:3.80725	validation-rmse:4.95013

[6]	train-rmse:2.94382	validation-rmse:4.2357

[7]	train-rmse:2.36361	validation-rmse:3.74683

[8]	train-rmse:1.95236	validation-rmse:3.43284

[9]	train-rmse:1.66604	validation-rmse:3.20455

[10]	train-rmse:1.4738	validation-rmse:3.08947

[11]	train-rmse:1.35336	validation-rmse:3.0492

[12]	train-rmse:1.22835	validation-rmse:2.99508

[13]	train-rmse:1.15615	validation-rmse:2.98604

[14]	train-rmse:1.11082	validation-rmse:2.96433

[15]	train-rmse:1.01666	validation-rmse:2.96584

[16]	train-rmse:0.953761	validation-rmse:2.94013

[17]	train-rmse:0.905753	validation-rmse:2.91569
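
The log above reports the training and validation RMSE for each boosting round. To find the round with the lowest validation RMSE, the log text can be parsed with a few lines of Python. This is a minimal sketch that assumes the whitespace-separated line format shown above; `parse_log` is a hypothetical helper, not part of SQLFlow or XGBoost.

```python
import re

# Matches lines such as "[17]  train-rmse:0.905753  validation-rmse:2.91569".
LOG_LINE = re.compile(r"^\[(\d+)\]\s+train-rmse:(\S+)\s+validation-rmse:(\S+)$")


def parse_log(text):
    """Return (round, train_rmse, validation_rmse) tuples found in the log text."""
    rounds = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            rounds.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return rounds


# The last three rounds copied from the training log above.
sample_log = "\n".join([
    "[15]\ttrain-rmse:1.01666\tvalidation-rmse:2.96584",
    "[16]\ttrain-rmse:0.953761\tvalidation-rmse:2.94013",
    "[17]\ttrain-rmse:0.905753\tvalidation-rmse:2.91569",
])
rounds = parse_log(sample_log)
best_round, _, best_validation_rmse = min(rounds, key=lambda r: r[2])
print(best_round, best_validation_rmse)  # prints: 17 2.91569
```

Watching the gap between the two columns is also useful: a training RMSE that keeps falling while the validation RMSE flattens out suggests the model is starting to overfit.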



## Predict the Housing Price
After training the regression model, let's predict the house price using the trained model.

First, we specify the trained model with the `USING` clause:

```sql
USING sqlflow_models.my_xgb_regression_model
```

Then, we specify the prediction result table with the `PREDICT` clause:

``` sql
PREDICT boston.predict.medv
```
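
Judging from the prediction output in this tutorial, the target `boston.predict.medv` names both the result table (`boston.predict`) and the column that receives the predicted values (`medv`). The Python sketch below illustrates that naming convention; `split_predict_target` is a hypothetical helper written for this example and is not part of SQLFlow.

```python
def split_predict_target(target):
    """Split 'database.table.column' into ('database.table', 'column')."""
    table, separator, column = target.rpartition(".")
    if not separator or not table or not column:
        raise ValueError("expected a target of the form 'table.column'")
    return table, column


table, column = split_predict_target("boston.predict.medv")
print(table, column)
```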

Next, we use a standard SQL statement to fetch the prediction data:

``` sql
SELECT * FROM boston.test
```

Finally, here is the complete SQLFlow prediction statement:

In [8]:
%%sqlflow
SELECT * FROM boston.test
PREDICT boston.predict.medv
USING sqlflow_models.my_xgb_regression_model;

[03:45:18] 10x13 matrix with 130 entries loaded from predict.txt

Done predicting. Predict table : boston.predict



Let's take a look at the prediction results.

In [10]:
%%sqlflow
SELECT * FROM boston.predict;

+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+
|   crim  |  zn | indus | chas |  nox  |   rm  | age  |  dis   | rad | tax | ptratio |   b    | lstat |   medv  |
+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+
|  0.2896 | 0.0 |  9.69 |  0   | 0.585 |  5.39 | 72.9 | 2.7986 |  6  | 391 |   19.2  | 396.9  | 21.14 | 21.9436 |
| 0.26838 | 0.0 |  9.69 |  0   | 0.585 | 5.794 | 70.6 | 2.8927 |  6  | 391 |   19.2  | 396.9  |  14.1 | 21.9667 |
| 0.23912 | 0.0 |  9.69 |  0   | 0.585 | 6.019 | 65.3 | 2.4091 |  6  | 391 |   19.2  | 396.9  | 12.92 | 22.9708 |
| 0.17783 | 0.0 |  9.69 |  0   | 0.585 | 5.569 | 73.5 | 2.3999 |  6  | 391 |   19.2  | 395.77 |  15.1 | 22.6373 |
| 0.22438 | 0.0 |  9.69 |  0   | 0.585 | 6.027 | 79.7 | 2.4982 |  6  | 391 |   19.2  | 396.9  | 14.33 | 21.9439 |
| 0.06263 | 0.0 | 11.93 |  0   | 0.573 | 6.593 | 69.1 | 2.4786 |  1  | 273 |   21.0  | 3