[design doc] XGBoost on SQLFlow #753
Yancey0623 merged 25 commits into sql-machine-learning:develop
Need to merge #754 first to keep the commit history of the ant-xgboost design.
```sql
SELECT * FROM train_table
TRAIN XGBoost
```
TRAIN XGBoost.someModel?
XGBoost uses the `objective` parameter to specify the training objective, for example
`objective=binary:logistic` (ref: https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters).
Maybe we can use `TRAIN XGBoost` in the train statement and specify the objective in the WITH clause: `model.objective=binary:logistic`?
I still prefer to put the `objective` in the `TRAIN` clause; it seems quite similar to `tf.estimator.*`.
Updated the doc. It now uses `TRAIN xgboost.multi.softmax` to fill the `objective` parameter.
> I still prefer to put the `objective` in the `TRAIN` clause, it seems quite similar to `tf.estimator.*`.

@Yancey1989 @typhoonzero `objective` corresponds to the loss function of a model, so it shouldn't be in the model name. The different model types are `gbtree`, `gblinear`, and `dart`, as listed here.
@tonyyang-svail Thanks, you are right. I'll update the doc to put `objective` in the attributes.
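
One way to picture the outcome of this thread: the name after `TRAIN` selects the model type (the XGBoost `booster`), while `objective` arrives as an ordinary attribute. The following is only a rough Python sketch. The helper `build_params` and the `xgboost.<booster>` naming are assumptions made for illustration, not the syntax the design doc settled on.

```python
# Booster types named in the thread; these are the values XGBoost
# accepts for its `booster` parameter.
BOOSTERS = {"gbtree", "gblinear", "dart"}


def build_params(estimator, attrs):
    """Derive the booster from the TRAIN name; keep objective as a plain attribute.

    `estimator` is a hypothetical name such as "xgboost.gbtree".
    """
    prefix, _, booster = estimator.partition(".")
    if prefix.lower() != "xgboost":
        raise ValueError(f"not an XGBoost estimator: {estimator}")
    booster = booster or "gbtree"  # gbtree is XGBoost's default booster
    if booster not in BOOSTERS:
        raise ValueError(f"unknown booster: {booster}")
    params = dict(attrs)  # objective stays wherever the WITH clause put it
    params["booster"] = booster
    return params


params = build_params("xgboost.gbtree", {"objective": "binary:logistic"})
```

With this split, changing the loss only touches the attributes, and changing the model type only touches the name after `TRAIN`.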
```
0.77 4.0 2.6 2 3
```

`codegen_xgboost.go` would write down the `train.txt.group` file like:
Why do we need to write a file?
XGBoost uses `DMatrix` as the input dataset, and the text file format seems to be popular in XGBoost:
> XGBoost currently supports two text formats for ingesting data: LibSVM and CSV

ref: https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html
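
To make the file-based approach concrete, here is a minimal Python sketch of writing rows in LibSVM text format plus a companion `.group` file. The helper names `write_libsvm` and `write_group`, the sample rows, and the zero-based feature indices are all assumptions of this illustration. It is not what `codegen_xgboost.go` actually emits.

```python
import os
import tempfile


def write_libsvm(rows, path):
    """Write (label, features) pairs as LibSVM lines: '<label> <index>:<value> ...'."""
    with open(path, "w") as f:
        for label, features in rows:
            # Feature indices start at 0 here; that is a choice of this sketch.
            cols = " ".join(f"{i}:{v}" for i, v in enumerate(features))
            f.write(f"{label} {cols}\n")


def write_group(group_sizes, path):
    """Write one group size per line (the group file used for ranking tasks)."""
    with open(path, "w") as f:
        for size in group_sizes:
            f.write(f"{size}\n")


# Hypothetical rows, standing in for the result of the SELECT statement.
rows = [(1, [0.5, 4.0]), (0, [0.77, 2.6])]
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "train.txt")
write_libsvm(rows, data_path)
write_group([2], data_path + ".group")  # both rows belong to a single group
```

The resulting `train.txt` could then be handed to `xgboost.DMatrix`. Depending on the XGBoost version, the path may need a `?format=libsvm` suffix, so check the documentation for the version in use.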
…flow into xgboost_design
typhoonzero
left a comment
LGTM, just one line needs to be fixed
SELECT * FROM train_table
TRAIN xgboost.multi.softmax
WITH
    train.objective="multi:softmax",
Should we remove `train.objective` now?
Hi @wangkuiyi, thanks for correcting the grammar. I updated the design and removed some detailed paragraphs:
The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features:
1. It tells the SQL engine to run the SELECT statement and retrieve the training/test data. It saves the data into a text file, which could be loaded by XGBoost using the DMatrix interface.
1. Parse and resolve the WITH clause to fill the `xgboost.train` arguments and the XGBoost Parameters.
I think parsing the WITH clause is the parser's job, not the submitter's. Am I right?
The parser parses the WITH clause into a general `attrs` struct, which is a Go `map[string]*expr`, and each generator resolves the `attrs` into program parameters. For example, the XGBoost generator would convert the `attrs` as follows:
- keys with the `train.` prefix become `xgboost.train` arguments.
- keys without any prefix become XGBoost Parameters, which are in JSON format.
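
The mapping described above could be sketched roughly as follows. This is an illustrative Python sketch, not the actual `codegen_xgboost.go` logic. The function name `resolve_attrs` and the sample keys are hypothetical, and the attrs are modeled as a plain dict of already-evaluated values rather than `map[string]*expr`.

```python
import json

TRAIN_PREFIX = "train."


def resolve_attrs(attrs):
    """Split WITH-clause attributes into train() arguments and booster parameters."""
    train_args = {}
    params = {}
    for key, value in attrs.items():
        if key.startswith(TRAIN_PREFIX):
            # "train.num_boost_round" -> keyword argument "num_boost_round"
            train_args[key[len(TRAIN_PREFIX):]] = value
        else:
            # Unprefixed keys are passed through as XGBoost parameters.
            params[key] = value
    return train_args, params


# Hypothetical attributes, as they might come out of a WITH clause.
attrs = {
    "train.num_boost_round": 30,
    "objective": "multi:softmax",
    "num_class": 3,
    "eta": 0.1,
}
train_args, params = resolve_attrs(attrs)
params_json = json.dumps(params, sort_keys=True)  # parameters in JSON form
```

Keeping the split in the generator rather than the parser means the parser stays model-agnostic, and each generator decides what its own prefixes mean.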
Part of the work for #749.