For an eventual recipes 1.0.0 release, what would we like to change that would have major implications?
non-step steps
So far, steps are defined as data transformation operations that
- add, remove, or modify variables in the data using statistical/mathematical transformations of the data
- do not modify the number of rows
- are executed both on the training set (during `prep`) and on any new data sets (via `bake`)
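The two-stage contract above can be sketched in base R. This is only an illustration, not the recipes implementation: `prep_center` and `bake_center` are hypothetical helpers that mimic the estimate-then-apply split for a simple centering operation.

```r
# Hypothetical helpers (not part of recipes): a centering operation that
# follows the same estimate/apply split as prep and bake.

# prep-like stage: estimate the statistics from the training set only.
prep_center <- function(training, cols) {
  list(cols = cols, means = colMeans(training[, cols, drop = FALSE]))
}

# bake-like stage: apply the stored statistics to any data set.
# Columns are modified, but the number of rows is left unchanged.
bake_center <- function(object, new_data) {
  for (col in object$cols) {
    new_data[[col]] <- new_data[[col]] - object$means[[col]]
  }
  new_data
}

train <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
test  <- data.frame(x = c(4, 5),    y = c(40, 50))

centered    <- prep_center(train, c("x", "y"))
baked_train <- bake_center(centered, train)
baked_test  <- bake_center(centered, test)
```

The new data are centered with the training-set means, which is the behavior the third bullet describes.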
There are a few scenarios where recipes could benefit from operations that are not steps per se:
- checks on data characteristics. You might want a step that will stop operation if certain conditions are (or are not) met. Otherwise, the check can return the data unaltered.
- `dplyr` operations: it would be helpful to be able to mutate, filter, or possibly summarize the data. `step_rm` is basically `dplyr::select`.
- another class of operations that might affect the rows of the training data. For example, down-sampling for class imbalances might remove rows during `prep` but should only be baked on the training set. It should not affect the new data processed by `bake`. Another procedure for imbalances, called SMOTE, both down-samples the data and creates new instances from the existing data set.
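A rough base R sketch of two of these non-step operations follows. Both helpers are hypothetical and are not part of recipes; in particular, the `training` flag is just one assumed way to keep a row-altering operation from touching new data.

```r
# Hypothetical check: stop if a condition fails, otherwise return the
# data unaltered.
check_no_missing <- function(data, cols) {
  has_na <- vapply(data[cols], anyNA, logical(1))
  if (any(has_na)) {
    stop("Missing values found in: ", paste(cols[has_na], collapse = ", "))
  }
  data
}

# Hypothetical row operation: down-sample every class to the size of the
# smallest class. The `training` flag (an assumption of this sketch) skips
# the operation for new data.
down_sample <- function(data, class_col, training = TRUE) {
  if (!training) {
    return(data)
  }
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  n_min <- min(lengths(rows_by_class))
  keep <- lapply(rows_by_class, function(idx) {
    idx[sample.int(length(idx), n_min)]
  })
  data[sort(unname(unlist(keep))), , drop = FALSE]
}

set.seed(1)
train    <- data.frame(x = 1:10,  class = rep(c("a", "b"), times = c(8, 2)))
new_data <- data.frame(x = 11:15, class = rep(c("a", "b"), times = c(4, 1)))

train_checked <- check_no_missing(train, "x")
train_down    <- down_sample(train_checked, "class", training = TRUE)
new_down      <- down_sample(new_data, "class", training = FALSE)
```

Here the training set loses rows while the new data pass through unchanged, which is the asymmetry a regular step cannot express.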
rewrite fun_calls and terms replacement
fun_calls throws issues when the formula is long.
`terms`... It would be good to have something else that can take a standard R formula with `.` and minus signs and
- returns the expanded list of individual terms (as calls)
- returns which ones were subtracted
- works with simple formulas on the lhs (e.g. `y1 + y2 + y2 ~ x` without `cbind`)
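One possible shape for such a helper, sketched in base R. The names `label`, `split_terms`, and `expand_formula` are hypothetical, and the sketch only handles `+`, `-`, parentheses, and `.`; interactions such as `a:b` or `a*b` are treated as single opaque terms.

```r
# Deparse to a single string, even for long expressions.
label <- function(x) paste(deparse(x, width.cutoff = 500L), collapse = " ")

# Walk an expression and split it on `+` and `-`, tracking the sign.
split_terms <- function(expr, sign = 1) {
  if (is.call(expr) && identical(expr[[1]], as.name("("))) {
    return(split_terms(expr[[2]], sign))
  }
  if (is.call(expr) && identical(expr[[1]], as.name("+"))) {
    if (length(expr) == 2) return(split_terms(expr[[2]], sign))  # unary +
    return(c(split_terms(expr[[2]], sign), split_terms(expr[[3]], sign)))
  }
  if (is.call(expr) && identical(expr[[1]], as.name("-"))) {
    if (length(expr) == 2) return(split_terms(expr[[2]], -sign))  # unary -
    return(c(split_terms(expr[[2]], sign), split_terms(expr[[3]], -sign)))
  }
  list(list(term = expr, sign = sign))
}

expand_formula <- function(formula, data) {
  two_sided <- length(formula) == 3
  lhs <- if (two_sided) split_terms(formula[[2]]) else list()
  rhs <- split_terms(formula[[length(formula)]])

  outcomes   <- lapply(lhs, function(p) p$term)
  is_added   <- vapply(rhs, function(p) p$sign > 0, logical(1))
  added      <- lapply(rhs[is_added], function(p) p$term)
  subtracted <- lapply(rhs[!is_added], function(p) p$term)

  # Replace `.` with every column that is not an outcome. Expanded
  # columns are appended after the explicitly written terms.
  is_dot <- vapply(added, identical, logical(1), as.name("."))
  if (any(is_dot)) {
    dot_cols <- setdiff(names(data), vapply(outcomes, label, character(1)))
    added <- c(added[!is_dot], lapply(dot_cols, as.name))
  }

  added_labels <- vapply(added, label, character(1))
  dropped      <- vapply(subtracted, label, character(1))
  keep <- !(added_labels %in% dropped) & !duplicated(added_labels)

  list(outcomes = outcomes, predictors = added[keep], subtracted = subtracted)
}

dat <- data.frame(y1 = 1, y2 = 2, x1 = 3, x2 = 4, x3 = 5)
res <- expand_formula(y1 + y2 ~ . - x3, dat)
```

The result holds the terms as language objects (symbols or calls), lists the subtracted terms separately, and accepts a multi-outcome left-hand side without `cbind`.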