[Design] Intermediate Representation#785
Conversation
| ExtraSelect map[string]string // e.g. {"validation": "select * from iris.val;"} | ||
| Estimator string // e.g. "DNNClassifier" | ||
| Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]} | ||
| Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} |
There was a problem hiding this comment.
Feature only have information about how to parse the column data, we still need information about how to construct a feature column on it.
There was a problem hiding this comment.
Thanks. I added FeatureColumn field in FieldMeta.
| Select string // e.g. "select * from iris.train" | ||
| ExtraSelect map[string]string // e.g. {"validation": "select * from iris.val;"} | ||
| Estimator string // e.g. "DNNClassifier" | ||
| Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]} |
There was a problem hiding this comment.
#774 here's some of my thoughts about refactoring attribute resolution.
There was a problem hiding this comment.
@typhoonzero Thanks for the link. I think Attribute map[string]interface{} here is fine.
I believe AttrMeta belong to the code generation package since different machine learning toolkits accept different attributes.
There was a problem hiding this comment.
I believe AttrMeta belong to the code generation package since different machine learning toolkits accept different attributes
It is. Each submitter's code generation package define it's AttrMeta and call a function to get an Attribute map[string]interface{}.
| ExtraConfig string // Extra configuration in JSON format. e.g. OSS credential | ||
| Select string // e.g. "select * from iris.train" | ||
| Estimator string // e.g. "DNNClassifier" | ||
| Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} |
There was a problem hiding this comment.
I think singular is fine. The map type already indicates there are multiple attributes. The same reasoning applies to Feature.
| IsSparse bool // e.g. false | ||
| } | ||
|
|
||
| // TrainIR is the intermediate representation for code generation of a training job |
There was a problem hiding this comment.
Maybe we don't need to implement an IR for each job, how about simplifying like:
type FeatureMeta struct {
DType string
Delimiter string
...
}
type DBConn struct {
Driver string
User string
....
}
type ClauseIR struct {
Estimator string
SelectClause string
Attributes map[string]interface{}
DBConn DBConn
Features map[string]FeatureMeta
...
}Each generator can extend the ClauseIR as needed.
There was a problem hiding this comment.
@Yancey1989 Thanks for the suggestion. Combining all three IRs to a single ClauseIR does save some code. However, I still advocate using separate IRs for different job types. Here is my reasoning.
- Avoid confusion. The developer of
xgboost.Predictwould be confused by theValidationSelectfield inClauseIR. Also, as we adding more features to SQLFlow, more fields would be added toCluaseIR, and the confusion will increase. - Less work. We either distinguish the job type in
sqlor incodegen. However, there are manycodegens and only onesql. Distinguishing the job type insqlsaves works in allcodegens.
No description provided.