
Proposal for new R interface (discussion thread) #9734

@david-cortes

ref #9475
ref #9810 (roadmap)

I'm opening this issue in order to propose a high-level overview of what an idiomatic R interface for xgboost should ideally look like, along with some thoughts on how it might be implemented (note: I'm not very familiar with xgboost internals, so I have several questions). It would be ideal to hear comments from other people regarding these ideas. I'm not aware of everything that xgboost can do, so perhaps I'm missing something important here.

(Perhaps a GitHub issue is not the most convenient format for this though, but I don't know where else to put it.)

Looking at the python and C interfaces, I see that there's a low-level interface with objects like Booster and associated functions/methods like xgb.train that more or less reflects how the C interface works, and then there's a higher-level scikit-learn interface that wraps the lower-level functions into a friendlier one working more or less the same way as scikit-learn objects, scikit-learn being the most common ML framework for python.

In the current R interface, there's to some degree a division into lower/higher-level interfaces with xgb.train() and xgboost(), but there's less of a separation - e.g. both return objects of the same class, and the high-level interface is not very high-level, as it doesn't take the same kinds of arguments as base R's stats::glm() or other popular R packages - e.g. it takes neither formulas nor data frames.
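To illustrate the gap, compare a base-R-style modeling call with the current high-level xgboost() call (the xgboost call below uses the package's current matrix-plus-label interface):

```r
# Base R idiom: formula + data.frame
fit_glm <- stats::glm(mpg ~ ., data = mtcars)

# Current xgboost "high-level" call: numeric matrix + label vector,
# with no formula or data.frame support
fit_xgb <- xgboost::xgboost(
  data    = as.matrix(mtcars[, -1]),
  label   = mtcars$mpg,
  nrounds = 10,
  verbose = 0
)
```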


In my opinion, a more ideal R interface could consist of a mixture of: a low-level interface reflecting the way the underlying C API works (which ideally most reverse dependencies and very latency-sensitive applications should be using), just like the current python interface; plus a high-level interface that would make it ergonomic to use, at the cost of some memory and performance overhead, behaving similarly to core and popular R packages for statistical modeling. Especially important in this higher-level interface would be making use of all the rich metadata that R objects have (e.g. column names, levels of factor/categorical variables, specialized classes like survival::Surv, etc.), for both inputs and outputs.

In the python scikit-learn interface, there are different classes depending on the task (regression / classification / ranking) and the algorithm mode (boosting / random forest). This is rather uncommon in R packages - for example, randomForest() will switch between regression and classification according to the type of the response variable - and I think it could be better to keep everything in the high-level interface under a single function xgboost() and single class returned from that function, at the expense of more arguments and more internal metadata in the returned objects.

In my view, a good separation between the two would involve:
  • Having the low-level interface work only with DMatrix (which could be created from different sources like files, in-memory objects, etc.), and the high-level interface work only with common R objects (e.g. data.frame, matrix, dgCMatrix, dgRMatrix, etc.), but not with DMatrix (IMO this would not only make the codebase and docs simpler, but also, given that the two have different metadata, make it easier to guess what to expect as outputs from other functions).
  • Having separate classes for the outputs returned from both interfaces, with different associated methods and method signatures, so that e.g. the predict function for the object returned by xgb.train() would mimic the prediction function from the C interface and be mindful of details like not automatically transposing row-major outputs, while the predict function from the object returned from xgboost() would mimic base R and popular R packages and be able to e.g. return named factors for classification objectives.
    • The plot functions perhaps could be shared, with some extra arguments being required for the low-level interface but forbidden for the high-level interface.
  • Having separate associated cv functions for the low-level and the high-level interfaces (e.g. xgboost.cv() or cv.xgboost(), like there is cv.glmnet(); and another xgb.cv() that would work with DMatrix).
  • Keeping extra metadata inside the object returned by the high-level interface, such as column names, types, factor levels, the actual string of the function call (the way base R does), etc.
  • Exporting from the package only what's necessary to support these two modes of operation alongside with their methods - e.g. not exporting internal functions like xgboost::normalize or internal classes/methods like Booster.handle.
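A minimal sketch of how the two proposed entry points might look side by side (all function names, signatures, and return semantics here are hypothetical, for illustration of the proposed separation only):

```r
## Low-level: mirrors the C API, accepts only DMatrix
dm  <- xgb.DMatrix(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dm, nrounds = 10)
p_raw <- predict(bst, dm)  # raw numeric output, C-interface semantics

## High-level: accepts only common R classes, returns a richer object
model <- xgboost(x = mtcars[, -1], y = mtcars$mpg, nrounds = 10)
p <- predict(model, newdata = mtcars[, -1], type = "response")  # base-R-style predict
```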
A couple of extra details **about the high-level interface** that in my opinion would make it nicer to use. I think many people might disagree with some of the points here though, so I'd like to hear opinions on these ideas:
  • It should have an x/y interface, and perhaps also a formula interface (details on the latter to follow).
  • The x/y interface should accept x in the format of common R classes like data.frame, matrix, dgCMatrix, perhaps arrow (although I wouldn't consider it a priority), and perhaps other classes like float32 (from the float package). I don't think it'd be a good idea to support more obscure potential classes like dist that aren't typically used in data pipelines.
    • If I'm not mistaken, CSR is supported for prediction but not for training, so predict should additionally take dgRMatrix and sparseVector like it does currently.
    • I am not into GPU computing, nor into distributed computing, so perhaps I might be missing important classes here.
    • If supporting Spark in R, I guess the best way would be to have a separate package dedicated to it.
  • (Controversial) It should not only accept but also require different classes for y depending on the objective, and by default, if not passed, should select an objective based on the type of y:
    • factor with 2 levels -> binary:logistic.
    • factor with >2 levels -> multi:softmax.
    • survival::Surv -> survival:aft (with normal link), or maybe survival:cox if only right-censored.
    • others -> reg:squarederror.
    • And consequently, require factor types for classification, Surv for survival, and numeric/integer for regression (taking logical (boolean) as 0/1 numeric).
    • If multi-column response/target variables are to be allowed, they should be required to be 2-dimensional (i.e. no vector recycling), as either data.frame or matrix, and keep their column names as metadata if they had any.
  • If x is a data.frame, it should automatically recognize factor columns as categorical types, and (controversial) also take character (string) class as categorical, converting them to factor on the fly, just like R's formula interface for GLMs does. Note that not all popular R packages doing the former would do the latter.
  • (Controversial) It should only support categorical columns in the form of an input x being a data.frame with factor/character types. Note: see question below about sparse types and categorical types.
  • If x is either a data.frame or otherwise has column names, then arguments that reference columns in the data should be able to reference them by name.
    • For example, if x has 4 columns [c1, c2, c3, c4] it should allow passing monotone_constraints as either c(-1, 0, 0, 1), or as c("c1" = -1, "c4" = 1), or as list(c1 = -1, c4 = 1); but not as e.g. c("1" = -1, "4" = 1).
    • OTOH if x is a matrix/dgCMatrix without column names, it should accept them as c(-1, 0, 0, 1), or as c("1" = -1, "4" = 1), erroring out on non-integer names, not allowing negative numbers, and not allowing a list with length < ncols, as the matching would be ambiguous.
    • If there are duplicated column names, I think the most logical scenario here would be to error out with an informative message.
  • Regardless of whether x has column names, column-vector inputs like qid should be taken from what's passed on the arguments, and not guessed from the column names of x like the python scikit-learn interface does.
  • It should have function arguments (for IDE autocomplete) and in-package documentation for all of the accepted training parameters. Given the amount of possible parameters, perhaps the most reasonable way would be to put them into a separate generator function xgb.train_control() or similar (like C50::C5.0Control) that would return a list, or perhaps it should have them all as top-level arguments. I don't have a particular preference here, but:
    • If parameters are moved to a separate train control function, I think at least the following should remain top-level arguments: objective, nthread, verbosity, seed (if not using R's - see next point below), booster; plus arguments that reference columns names or indices: monotone_constraints, interaction_constraints; and data inputs like base_score or qid. nrounds I guess is most logical to put in the list of additional parameters, but given that it's a required piece of metadata for e.g. determining output dimensions, perhaps it should also be top-level.
    • If using a separate train control function, given that R allows inspecting the function call, it might be better to create a list with only the arguments that were explicitly supplied - e.g. a call like xgb.train_control(eta = 0.01) could return just list(eta = 0.01) instead of a full list of parameters. In this case, it might not be strictly required to know what the default values for every parameter are for the purposes of developing this interface, but it would be nice if they were easily findable from the signature anyway.
  • (Controversial) Not a high-priority thing, but would be ideal if one could pass seed and only rely on R's PRNG if seed is not passed.
  • It should have more informative print/show and summary methods that would display info such as the objective, booster type, rounds, dimensionality of the data to which it was fitted, whether it had categorical columns, etc.
    • I think the summary and the print method could both print the exact same information. Not sure if there's anything additional worth putting into the summary method that wouldn't be suitable for print. Perhaps evaluation metric or objective function per round could be shown (head + tail) for summary, but maybe it's not useful.
  • It should stick to common R idioms and conventions in terms of inputs and outputs - for example, print should not only print but also return the model object as invisible, methods like predict should take the same named arguments as others like predict(object, newdata, type, ...), return an output with one row per input in the data, keep row names in the output if there were any in the input, use name (Intercept) instead of BIAS, etc.
    • A tricky convention in this regard is base1 indexing - there are many places where it's used, like interaction_constraints, iterationrange for prediction, node indices from predict, etc. Perhaps the high-level interface could do the conversion internally, but I'm not sure if I'm missing more potential places where indices might be getting used, and probably more such parameters and outputs will be added in the future.
  • Following up on the point above, the predict function should:
    • Accept a type argument with potential values like "response", "class", "raw", "leaf", "contrib", "interaction", "approx.contrib", "approx.interaction" (not sure what to name the last 2 though).
    • Output factor types for class type in classification objectives.
    • Use column names of x for outputs of contrib/interaction, and factor levels of y as column names from e.g. response in multi-class classification and class-specific contrib/interaction (named as level:column, like base R).
  • Xgboost functions should avoid doing in-place modifications to R objects other than internal classes like DMatrix or handles - e.g. calling the plotting function should not leave the input with an additional Importance column after the call is done.
  • The plotting and feature importance functions should not require passing additional metadata, using instead the names that the input x had.
    • The function for plotting feature importances should in turn not require passing the calculated feature importances first, calculating them internally on-the-fly if not passed, given that it will have all the required inputs already.
  • There should be a function to convert objects between the low-level and the high-level R interface, which in the absence of metadata, could be done by autofilling column names with "V1", "V2", etc. and factor levels with "1", "2", "3", etc.
  • (Controversial) The predict function for the x/y interface should have a parameter to determine whether it should subset/re-order columns of newdata according to the column names that were used in training, and in this case, it should also recode factor levels. Note that this subsetting and reordering is always done for base R's predict.glm, but factors are not re-coded there (so e.g. if the levels in factor columns of newdata differ from the training data, predict.glm would instead output something that might not make sense).
  • Custom objectives and custom metrics for the high-level interface should be passed the same kinds of response variables that were used for input - e.g. a factor instead of the binarized numeric vector that xgboost will use internally.
    • Perhaps custom objectives and metrics could be required to be passed as a list with necessary fields instead of as separate arguments, and required to have extra metadata like whether they are regression/classification, a potential inverse link function for predict that would be kept inside the output, etc. Don't have a strong preference here though, and I expect most users of custom objectives would be using the low-level interface anyway.
  • (Very low priority) Perhaps the high-level interface should allow using base R's "families" (as returned by e.g. stats::poisson(link = "log")) and the structure they involve as custom objectives (like glmnet does), so that one could pass a family-compliant object from other packages as objective. I see an issue here though in that these families are meant for Fisher scoring, which means that in the case of non-canonical link functions like binomial(link = "probit"), they wouldn't calculate the true Hessian; but I guess Fisher scoring should work just as well with gradient boosting. Not an expert in this topic though.
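The type-based objective selection described above could be sketched in base R roughly as follows (the function name and the exact mapping are illustrative, not a settled design):

```r
# Sketch: pick a default objective from the class of the response `y`
infer_objective <- function(y) {
  if (inherits(y, "Surv")) {
    # right-censored-only data could instead map to survival:cox
    "survival:aft"
  } else if (is.factor(y)) {
    if (nlevels(y) == 2L) "binary:logistic" else "multi:softmax"
  } else if (is.numeric(y) || is.logical(y)) {
    # logical (boolean) responses are taken as 0/1 numeric
    "reg:squarederror"
  } else {
    stop("Unsupported response type: ", class(y)[1L])
  }
}

infer_objective(factor(c("spam", "ham", "ham")))  # "binary:logistic"
infer_objective(factor(letters[1:3]))             # "multi:softmax"
infer_objective(c(TRUE, FALSE))                   # "reg:squarederror"
```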

The low-level interface in turn should support everything that the C interface offers - e.g. creating DMatrix from libsvm files, specifying categorical columns from a non-data-frame array, etc.


As for the formula interface, this is quite tricky to implement in a good way for xgboost.

In the case of linear models, it's quite handy to create these features on-the-fly to find out good transformations and e.g. be able to call stepwise feature selectors on the result, but:

  • Stepwise selection from base R doesn't work with xgboost.
  • Transformations like log(x1) have no effect on decision trees; terms like x1:x2 don't have the same practical effect, since decision trees implicitly create interactions; and a term like x^2, the way R formulas interpret it, doesn't actually create a squared feature, unlike the more explicit I(x^2) + x, for example.
  • Other outputs from formulas like contrasts are not used either way.
  • Transformations like log(y) aren't any harder to do with an x/y interface than with a formula interface.
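The x^2 point is easy to demonstrate with base R's formula semantics, where ^ means term crossing rather than arithmetic, so x^2 collapses to just x and an explicit I() is needed to get a squared column:

```r
d <- data.frame(y = 1:5, x = c(0.5, 1, 1.5, 2, 2.5))

# ^ is formula crossing, so x^2 = x * x = x: no squared term is created
colnames(model.matrix(y ~ x^2, d))
# "(Intercept)" "x"

# I() protects the arithmetic, producing an actual squared column
colnames(model.matrix(y ~ I(x^2) + x, d))
# "(Intercept)" "I(x^2)" "x"
```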

Nevertheless, a formula interface can still be handy for calls like y ~ x1 + x2 + x3 when there are many columns, and packages like ranger offer such an interface, so perhaps it might be worth having one, even if it's not the recommended way to use the package.

Some nuances in terms of formulas though:

  • Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept, and has no use for a column named (Intercept) that would have the same value for every row.
  • Base R's formulas will either do one-hot or dummy encoding for factor/character columns according to whether there is an intercept or not, while xgboost now has native support for categorical columns.
  • Using a formula for processing the training data also implies using it for predict - so for example, formulas do not recode levels of factor variables when calling predict, which the x/y interface could potentially do, leading to differences in behaviors between both interfaces.
  • Some packages have their own formula parsers which allow additional interpretations of something like y ~ x | z beyond what base R would do (for example, lme4 would interpret z here as something that controls mixed effects for x, while base R would interpret this as a feature "x or z"), and in some cases xgboost() would also need a different interpretation of formulas (e.g. for the parameter qid, which doesn't fit on either side of the formula).
  • Packages like randomForest don't use the base R formula parser, taking it instead (by copying the code) from a different library, e1071, which is GPLv2-licensed - a license incompatible with xgboost's Apache license.
  • Referring to columns that are the result of a formula by index in e.g. monotone_constraints could be tricky - e.g. if we remove the auto-added (Intercept) column, should the numbers re-adjust?

Hence, supporting a formula interface for xgboost() would be tricky:

  • It could be possible to use base R's formula parser, but:
    • In this case it would not be possible to use categorical features (or their interactions) as such, as they will be either dummy- or one-hot encoded.
    • It could theoretically be possible to forcibly add -1 at the end of the formula (which means "don't add an intercept") by converting it to a string and back, in order to get one-hot encoding of factors and avoid adding (Intercept), but I can foresee cases in which this might not work depending on how the formula is input.
    • It would not be possible to specify additional columns like qid in these formulas.
  • It could be possible to develop a custom formula parser that would not one-hot-encode categorical features and e.g. interpret | differently, but what should happen in this case if the user requests something like xnum*xcat or f(xcat) (with f being some arbitrary base function) ?

Unless we can find some other package that would better handle formula parsing and that could be reasonable to use as dependency (I'm not aware of any), I think the best way forward here would be to:

  • Use base R's formula parser.
  • Don't support parameters like monotone_constraints or qid in the formula interface.
  • Don't support native categorical columns, relying instead on the one-hot or dummy-encoding from the formula.
  • Try to remove the intercept by trying to create a new formula <curr formula> - 1 and error out if this doesn't succeed.
  • Let the formula handle all the predict post-processing, regardless of how the x/y interface does it.
  • Encourage users and reverse dependencies to avoid the formula interface for serious usage by being clear about all these limitations in the docs and having a warning message in the "description" section (which comes before the signatures in the doc). We're not in control of how reverse dependencies use it, but if any of the larger ones were tempted to use this formula interface anyway, it could be followed up in their GitHub repositories.
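The intercept-removal step can likely be done with update() rather than by converting the formula to a string and back; with the intercept gone, base R one-hot-encodes the first factor instead of dummy-encoding it. A small base R sketch of the idea:

```r
f <- y ~ x1 + grp
f_noint <- update(f, . ~ . - 1)  # y ~ x1 + grp - 1

d <- data.frame(
  y   = 1:4,
  x1  = c(1.5, 2.0, 2.5, 3.0),
  grp = factor(c("a", "b", "a", "b"))
)

# With an intercept: "(Intercept)" plus dummy encoding (level "a" is dropped)
colnames(model.matrix(f, d))        # "(Intercept)" "x1" "grpb"

# Without: no "(Intercept)", and one column per factor level
colnames(model.matrix(f_noint, d))  # "x1" "grpa" "grpb"
```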

A couple questions for xgboost developers (@hcho3 @trivialfis ?) I have here:
  • Does xgboost support passing a mixture of dense and sparse data together? pandas supports sparse types, but if I understand it correctly from a brief look at the source code of the python interface, it will cast them to dense before creating the DMatrix object. If it supports mixtures of both, I think it'd again be ideal if the R interface could also have a way to create such a DMatrix for the low-level interface. Not sure if there'd be an idiomatic way to incorporate this in the high-level interface though, unless allowing passing something like a list.
    • Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.
  • Can sparse columns be taken as categorical without being densified? If so, I guess specifying categorical columns of a sparse matrix should also be supported in the high-level R interface.
  • In other interfaces, does passing a data frame of different types always involve putting everything into a contiguous array of the same type? I see there's a C function XGDMatrixCreateFromDT, but if I understand it correctly from a look at the pandas processing functions in the scikit-learn interface, if the input involves types like int64, these get cast to a floating-point type. In R, especially when using data.table, it's oftentimes common to have 64-bit integers from the package bit64, which, if cast to float64, would lose integral precision beyond 2^53 (likewise, int32 could lose precision when cast to float32) - I am wondering if there should be a way to support these without loss of precision, and whether it's possible to efficiently create a DMatrix from a structure like an R data.frame, which is a list of arrays that aren't contiguous in memory and might have different dtypes.
  • Should the R interface keep track of the objectives and their types, or should for example the C interface have functionality to determine whether an objective is e.g. binary classification? As it is right now, it'd be quite easy to do this in R, but I can foresee a situation in which someone in the future submits a PR adding a new objective like binary:cauchit, adds it to the python XGBClassifier class, but overlooks the R code as it might be unknown to the contributor, and the R interface then won't act properly on receiving this objective.
  • How does the python interface keep the list of parameters up-to-date with the C interface? Is it done manually by maintainers? Is there some way in which the R interface could automatically keep in sync with the C one every time a new parameter is added? Or will it need to manually track every possible allowed argument and its default value?
  • How do other interfaces deal with idiosyncrasies like base0 vs. base1 indexing? I see there's also a Julia interface for example, which, just like R, uses base1 indexing, but from a brief look at the docs it looks like it sticks with base0 indexing for xgboost regardless.
  • Since the R object from the high-level interface in this case would have more metadata than what's kept in the serialized C Booster, this in theory could lead to issues when updating package versions and loading objects from an earlier version - for example, if a new field is added to the object class returned from xgboost(), an earlier object saved with saveRDS will not have such a field, which might lead to issues if it is assumed to exist. It'd be theoretically possible to add some function to auto-fill newly added fields when e.g. restoring the booster handle, but this could potentially translate into a lot of maintenance work and be quite hard to test and easy to miss when adding new features.
    • How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster? How does it deal with converting between Booster and scikit-learn classes?
    • I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.
    • If one were to load a C Booster into the high-level interface and there's no metadata to take, I guess the most logical thing would be to fill with generic names "V1..N", "1..N", etc. like base R does, but this would not lead to a nice inter-op with the python scikit-learn interface. Does the python interface or other interfaces keep extra metadata that the Booster wouldn't? Is it somehow standardized?
  • How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it?
  • Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix? If so, I guess support for arrow in R could be delayed until such a route is implemented. I guess the python interface would also benefit from such a functionality being available at the C level.

Some other questions for R users (@mayer79 ?):
  • Are there any popular data.frame-like objects from packages related to GPU computing (like cuDF for python) or distributed computing (other than spark) that I might perhaps be missing in this post?
  • Is there any popular R type for learning-to-rank tasks that would involve the qid and position information that xgboost uses? I am not aware of any but I'm no expert in this area.
  • Is there perhaps some object class from Bioconductor or from some other domain (like, don't know, geopositioning from e.g. gdal) that should logically map to some specific xgboost objective without undergoing transformation by the user to e.g. factor / numeric / etc. ?
  • Is there some package offering a custom formula parser meant for decision tree models?
