# Ant-XGBoost on SQLFlow Tutorial
This tutorial demonstrates how to
1. train an XGBoost model for Iris flower classification
2. auto-train an XGBoost model to fit Boston housing prices

## The Dataset
#### Iris
The Iris data set contains four features and one label. The four features describe the botanical characteristics of an individual Iris flower, and each is stored as a single floating-point number. The label indicates the class of the flower. It is stored as an integer with possible values 0, 1, and 2.
#### Boston housing price
The Boston data frame has 506 rows and 14 columns. It contains the following columns:
- crim
  - per capita crime rate by town.
- zn
  - proportion of residential land zoned for lots over 25,000 sq.ft.
- indus
  - proportion of non-retail business acres per town.
- chas
  - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox
  - nitrogen oxides concentration (parts per 10 million).
- rm
  - average number of rooms per dwelling.
- age
  - proportion of owner-occupied units built prior to 1940.
- dis
  - weighted mean of distances to five Boston employment centres.
- rad
  - index of accessibility to radial highways.
- tax
  - full-value property-tax rate per $10,000.
- ptratio
  - pupil-teacher ratio by town.
- b
  - 1000 * (Bk - 0.63) ^ 2 where Bk is the proportion of Black residents by town.
- lstat
  - lower status of the population (percent).
- medv (label)
  - median value of owner-occupied homes in units of $1000.
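
The transformed column defined by `1000 * (Bk - 0.63) ^ 2` is the least intuitive feature in the list, so a small worked example may help. The sketch below is plain Python, independent of SQLFlow; the function name `b_feature` and the sample proportions are illustrative assumptions, not part of the dataset.

```python
def b_feature(bk):
    """Apply 1000 * (Bk - 0.63) ^ 2 to a proportion Bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2

# Hypothetical proportions for three towns (not taken from the dataset).
for bk in (0.0, 0.63, 1.0):
    print(bk, round(b_feature(bk), 1))
```

The value is 0 at Bk = 0.63 and grows as Bk moves away from 0.63 in either direction, so the column is not monotonic in the underlying proportion.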


Each dataset has been split into a training table and a test table: `iris.train`, `iris.test`, `boston.train`, and `boston.test`. We will use them as training data and test data respectively.

We can take a quick look at the data by running the following standard SQL statements.

In [None]:
%%sqlflow
SELECT * FROM iris.train LIMIT 5;

In [1]:
%%sqlflow
SELECT * FROM boston.train LIMIT 5;

## Iris Classification
First, let's train an XGBoost model to classify Iris flowers. Since there are three kinds of Iris flowers, we use the `multi:softprob` objective and set `num_class` to 3. We also configure the tree depth (`max_depth`), the learning rate (`eta`), and the number of boosting iterations (`num_round`). All of this is done in the training clause of SQLFlow's extended syntax.

```
TRAIN xgboost.Estimator
WITH
    train.objective = "multi:softprob",
    train.num_class = 3,
    train.max_depth = 4,
    train.eta = 0.5,
    train.num_round = 10
```
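
With `multi:softprob`, the model produces one probability per class for each row rather than a single class id. A minimal sketch of the underlying softmax step, in plain Python and independent of SQLFlow and XGBoost, is shown below; the raw scores are made-up values for illustration only.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one flower, one score per class (0, 1, 2).
raw_scores = [2.0, 0.5, -1.0]
probabilities = softmax(raw_scores)
print([round(p, 3) for p in probabilities])
```

The class with the largest raw score also receives the largest probability, which is why `num_class` must match the number of distinct label values.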

To specify the training data, we use standard SQL statements like `SELECT * FROM iris.train`.

We explicitly specify which columns are used as features and which column is used as the label by writing

```
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
```
At the end of the training process, we save the trained XGBoost model into table `sqlflow_models.my_iris_xgboost_model` by writing
```
INTO sqlflow_models.my_iris_xgboost_model
```

Putting it all together, we get the complete SQLFlow training statement for the Iris task.

In [2]:
%%sqlflow
SELECT *
FROM iris.train
TRAIN xgboost.Estimator
WITH
    train.objective = "multi:softprob",
    train.num_class = 3,
    train.max_depth = 4,
    train.eta = 0.5,
    train.num_round = 10
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_iris_xgboost_model;

Next, let's run prediction on `iris.test`.

To specify the prediction data, we use standard SQL statements like `SELECT * FROM iris.test`.

Say we want the model previously stored at `sqlflow_models.my_iris_xgboost_model` to read the prediction data and write the predicted result into the `result` column of table `iris.predict`.

We can add supplementary outputs by setting `pred.*` attributes.
In this case, we append the ground truth of the prediction data with `pred.append_columns = [class]`.
We also want to inspect probability information,
so we request the probability of the chosen class with `pred.prob_column = p` and the full probability distribution with `pred.detail_column = dist`.
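
To make the relationship between the predicted class, its probability, and the distribution concrete, here is a small Python sketch. It assumes the distribution is available as a list of per-class probabilities; how SQLFlow actually encodes the `dist` column is not shown in this tutorial, and the helper name `summarize_prediction` is hypothetical.

```python
def summarize_prediction(distribution):
    """Return (chosen_class, chosen_probability) for one distribution."""
    # The chosen class is the index with the highest probability.
    chosen_class = max(range(len(distribution)), key=lambda i: distribution[i])
    return chosen_class, distribution[chosen_class]

# Hypothetical distribution over classes 0, 1, 2 for one row.
dist = [0.05, 0.90, 0.05]
result, p = summarize_prediction(dist)
print(result, p)  # prints: 1 0.9
```

Comparing the chosen class with the appended `class` column then shows whether the row was predicted correctly.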

We can write the following SQLFlow prediction statement.

In [4]:
%%sqlflow
SELECT *
FROM iris.test
PREDICT iris.predict.result
WITH
    pred.append_columns = [class],
    pred.prob_column = p,
    pred.detail_column = dist
USING sqlflow_models.my_iris_xgboost_model;

After the prediction, we can check the result by running

In [5]:
%%sqlflow
SELECT *
FROM iris.predict
LIMIT 5;

## Fitting Boston Housing Price 
Now that the Iris demo has covered the essential concepts of SQLFlow, let's try another case that uses auto-train, an additional feature of Ant-XGBoost.

Since `medv` is continuous, we use the `reg:squarederror` objective to fit it. SQLFlow offers an alternative way to define the XGBoost objective: naming a specialized estimator. In this case, we specify `TRAIN xgboost.Regressor` instead of writing the objective explicitly.
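
For intuition about what a squared-error objective optimizes, the following Python sketch computes the loss for a single example together with its first and second derivatives, which are the quantities a gradient boosting library works with. The 1/2 scaling is a common convention assumed here, and the prediction and label values are made up.

```python
def squared_error(prediction, label):
    """Loss for one example: half the squared difference."""
    return 0.5 * (prediction - label) ** 2

def gradient(prediction, label):
    """First derivative of the loss with respect to the prediction."""
    return prediction - label

def hessian(prediction, label):
    """Second derivative of the loss; constant for squared error."""
    return 1.0

# Hypothetical prediction and label, both in units of $1000 like medv.
pred, label = 22.5, 24.0
print(squared_error(pred, label), gradient(pred, label), hessian(pred, label))
# prints: 1.125 -1.5 1.0
```

A negative gradient means the prediction is below the label, so the next boosting round is pushed to raise it.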

Putting this together gives a concise training statement.

In [None]:
%%sqlflow
SELECT *
FROM boston.train
TRAIN xgboost.Regressor
WITH
    train.auto_train = true,
    train.num_round = 50
COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
LABEL medv
INTO sqlflow_models.my_boston_xgboost_model; 

Below is the corresponding prediction statement, in which we append all columns of the prediction data to the result table.

In [24]:
%%sqlflow
SELECT *
FROM boston.test
PREDICT boston.predict.score
WITH
    pred.append_columns = [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv]
USING sqlflow_models.my_boston_xgboost_model;

Let's take a look at the prediction results.

In [None]:
%%sqlflow
SELECT * FROM boston.predict;
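
Because the result table keeps the true `medv` next to the predicted `score`, one simple way to judge the fit is the root mean squared error (RMSE). The sketch below is plain Python and assumes the rows have already been fetched into dictionaries; the three rows are made-up stand-ins, not real output from `boston.predict`.

```python
import math

def rmse(rows):
    """Root mean squared error between predicted score and true medv."""
    squared = [(row["score"] - row["medv"]) ** 2 for row in rows]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical rows standing in for the contents of boston.predict.
rows = [
    {"medv": 24.0, "score": 23.0},
    {"medv": 21.0, "score": 23.0},
    {"medv": 30.0, "score": 28.0},
]
print(rmse(rows))
```

The RMSE is expressed in the same unit as `medv`, so a value near 1.7 here would mean a typical error of about $1,700.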