
Proposal for new R interface (discussion thread) #9734

@david-cortes

ref #9475
ref #9810 (roadmap)

I'm opening this issue in order to propose a high-level overview of what an idiomatic R interface for xgboost should ideally look like, along with some thoughts on how it might be implemented (note: I'm not very familiar with xgboost internals, so I have several questions). It would be ideal to hear comments from other people regarding these ideas. I'm not aware of everything that xgboost can do, so perhaps I'm missing something important here.

(Perhaps a GitHub issue is not the most convenient format for this though, but I don't know where else to put it.)

Looking at the python and C interfaces, I see that there's a low-level interface with objects like Booster and associated functions/methods like xgb.train that more or less reflects how the C interface works, and then there's a higher-level scikit-learn interface that wraps the lower-level functions into a friendlier one working more or less the same way as scikit-learn objects, scikit-learn being the most common ML framework for python.

In the current R interface, there's to some degree a division into lower/higher-level interfaces with xgb.train() and xgboost(), but there's less of a separation - e.g. both return objects of the same class, and the high-level interface is not very high-level, as it doesn't take the same kinds of arguments as base R's stats::glm() or other popular R packages - e.g. it takes neither formulas nor data frames.
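To illustrate the gap, compare a base-R-style modeling call with the current high-level xgboost() call (the xgboost call below uses the package's current matrix-plus-label interface):

```r
# Base R idiom: formula + data.frame
fit_glm <- stats::glm(mpg ~ ., data = mtcars)

# Current xgboost "high-level" call: numeric matrix + label vector,
# with no formula or data.frame support
fit_xgb <- xgboost::xgboost(
  data    = as.matrix(mtcars[, -1]),
  label   = mtcars$mpg,
  nrounds = 10,
  verbose = 0
)
```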


In my opinion, a more ideal R interface could consist of a mixture of: a low-level interface reflecting the way the underlying C API works (which ideally most reverse dependencies and very latency-sensitive applications should be using), just like the current python interface; plus a high-level interface that would make it ergonomic to use, at the cost of some memory and performance overhead, behaving similarly to core and popular R packages for statistical modeling. Especially important in this higher-level interface would be making use of all the rich metadata that R objects have (e.g. column names, levels of factor/categorical variables, specialized classes like survival::Surv, etc.), for both inputs and outputs.

In the python scikit-learn interface, there are different classes depending on the task (regression / classification / ranking) and the algorithm mode (boosting / random forest). This is rather uncommon in R packages - for example, randomForest() will switch between regression and classification according to the type of the response variable - and I think it could be better to keep everything in the high-level interface under a single function xgboost() and single class returned from that function, at the expense of more arguments and more internal metadata in the returned objects.

In my view, a good separation between the two would involve:
  • Having the low-level interface work only with DMatrix (which could be created from different sources like files, in-memory objects, etc.), and the high-level interface work only with common R objects (e.g. data.frame, matrix, dgCMatrix, dgRMatrix, etc.), but not with DMatrix (IMO this would not only make the codebase and docs simpler, but also, given that the two have different metadata, make it easier to guess what to expect as outputs from other functions).
  • Having separate classes for the outputs returned from both interfaces, with different associated methods and method signatures, so that e.g. the predict function for the object returned by xgb.train() would mimic the prediction function from the C interface and be mindful of details like not automatically transposing row-major outputs, while the predict function from the object returned from xgboost() would mimic base R and popular R packages and be able to e.g. return named factors for classification objectives.
    • The plot functions perhaps could be shared, with some extra arguments being required for the low-level interface but forbidden for the high-level interface.
  • Having separate associated cv functions for the low-level and the high-level interfaces (e.g. xgboost.cv() or cv.xgboost(), like there is cv.glmnet(); and another xgb.cv() that would work with DMatrix).
  • Keeping extra metadata inside the object returned by the high-level interface, such as column names, types, factor levels, the actual string of the function call (the way base R does), etc.
  • Exporting from the package only what's necessary to support these two modes of operation alongside with their methods - e.g. not exporting internal functions like xgboost::normalize or internal classes/methods like Booster.handle.
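A minimal sketch of how the two proposed entry points might look side by side (all function names, signatures, and return semantics here are hypothetical, for illustration of the proposed separation only):

```r
## Low-level: mirrors the C API, accepts only DMatrix
dm  <- xgb.DMatrix(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
bst <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dm, nrounds = 10)
p_raw <- predict(bst, dm)  # raw numeric output, C-interface semantics

## High-level: accepts only common R classes, returns a richer object
model <- xgboost(x = mtcars[, -1], y = mtcars$mpg, nrounds = 10)
p <- predict(model, newdata = mtcars[, -1], type = "response")  # base-R-style predict
```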
A couple of extra details **about the high-level interface** that in my opinion would make it nicer to use. I think many people might disagree with some of the points here though, so I'd like to hear opinions on these ideas:
  • It should have an x/y interface, and perhaps also a formula interface (details on the latter to follow).
  • The x/y interface should accept x in the format of common R classes like data.frame, matrix, dgCMatrix, perhaps arrow (although I wouldn't consider it a priority), and perhaps other classes like float32 (from the float package). I don't think it'd be a good idea to support more obscure potential classes like dist that aren't typically used in data pipelines.
    • If I'm not mistaken, CSR is supported for prediction but not for training, so predict should additionally take dgRMatrix and sparseVector like it does currently.
    • I am not into GPU computing, nor into distributed computing, so perhaps I might be missing important classes here.
    • If supporting Spark in R, I guess the best way would be to have a separate package dedicated to it.
  • (Controversial) It should not only accept but also require different classes for y depending on the objective, and by default, if not passed, should select an objective based on the type of y:
    • factor with 2 levels -> binary:logistic.
    • factor with >2 levels -> multi:softmax.
    • survival::Surv -> survival:aft (with normal link), or maybe survival:cox if only right-censored.
    • others -> reg:squarederror.
    • And consequently, require factor types for classification, Surv for survival, and numeric/integer for regression (taking logical (boolean) as 0/1 numeric).
    • If multi-column response/target variables are to be allowed, they should be required to be 2-dimensional (i.e. no vector recycling), as either data.frame or matrix, and keep their column names as metadata if they had any.
  • If x is a data.frame, it should automatically recognize factor columns as categorical types, and (controversial) also take character (string) class as categorical, converting them to factor on the fly, just like R's formula interface for GLMs does. Note that not all popular R packages doing the former would do the latter.
  • (Controversial) It should only support categorical columns in the form of an input x being a data.frame with factor/character types. Note: see question below about sparse types and categorical types.
  • If x is either a data.frame or otherwise has column names, then arguments that reference columns in the data should be able to reference them by name.
    • For example, if x has 4 columns [c1, c2, c3, c4] it should allow passing monotone_constraints as either c(-1, 0, 0, 1), or as c("c1" = -1, "c4" = 1), or as list(c1 = -1, c4 = 1); but not as e.g. c("1" = -1, "4" = 1).
    • OTOH if x is a matrix/dgCMatrix without column names, it should accept them as c(-1, 0, 0, 1), or as c("1" = -1, "4" = 1), erroring out on non-integer names, not allowing negative numbers, and not allowing a list with length < ncols, as the matching would be ambiguous.
    • If there are duplicated column names, I think the most logical scenario here would be to error out with an informative message.
  • Regardless of whether x has column names, column-vector inputs like qid should be taken from what's passed on the arguments, and not guessed from the column names of x like the python scikit-learn interface does.
  • It should have function arguments (for IDE autocomplete) and in-package documentation for all of the accepted training parameters. Given the amount of possible parameters, perhaps the most reasonable way would be to put them into a separate generator function xgb.train_control() or similar (like C50::C5.0Control) that would return a list, or perhaps it should have them all as top-level arguments. I don't have a particular preference here, but:
    • If parameters are moved to a separate train control function, I think at least the following should remain top-level arguments: objective, nthread, verbosity, seed (if not using R's - see next point below), booster; plus arguments that reference columns names or indices: monotone_constraints, interaction_constraints; and data inputs like base_score or qid. nrounds I guess is most logical to put in the list of additional parameters, but given that it's a required piece of metadata for e.g. determining output dimensions, perhaps it should also be top-level.
    • If using a separate train control function, given that R allows inspecting the function call, it might be better to create a list with only the arguments that were explicitly supplied - e.g. a call like xgb.train_control(eta = 0.01) could return just list(eta = 0.01) instead of a full list of parameters. In this case, it might not be strictly required to know what the default values for every parameter are for the purposes of developing this interface, but it would be nice if they were easily findable from the signature anyway.
  • (Controversial) Not a high-priority thing, but would be ideal if one could pass seed and only rely on R's PRNG if seed is not passed.
  • It should have more informative print/show and summary methods that would display info such as the objective, booster type, rounds, dimensionality of the data to which it was fitted, whether it had categorical columns, etc.
    • I think the summary and the print method could both print the exact same information. Not sure if there's anything additional worth putting into the summary method that wouldn't be suitable for print. Perhaps evaluation metric or objective function per round could be shown (head + tail) for summary, but maybe it's not useful.
  • It should stick to common R idioms and conventions in terms of inputs and outputs - for example, print should not only print but also return the model object as invisible, methods like predict should take the same named arguments as others like predict(object, newdata, type, ...), return an output with one row per input in the data, keep row names in the output if there were any in the input, use name (Intercept) instead of BIAS, etc.
    • A tricky convention in this regard is base1 indexing - there are many places where it's used, like interaction_constraints, iterationrange for prediction, node indices from predict, etc. Perhaps the high-level interface could do the conversion internally, but I'm not sure if I'm missing more potential places where indices might be getting used, and probably more such parameters and outputs will be added in the future.
  • Following up on the point above, the predict function should:
    • Accept a type argument with potential values like "response", "class", "raw", "leaf", "contrib", "interaction", "approx.contrib", "approx.interaction" (not sure what to name the last 2 though).
    • Output factor types for class type in classification objectives.
    • Use column names of x for outputs of contrib/interaction, and factor levels of y as column names from e.g. response in multi-class classification and class-specific contrib/interaction (named as level:column, like base R).
  • Xgboost functions should avoid doing in-place modifications to R objects other than internal classes like DMatrix or handles - e.g. calling the plotting function should not leave the input with an additional Importance column after the call is done.
  • The plotting and feature importance functions should not require passing additional metadata, using instead the names that the input x had.
    • The function for plotting feature importances should in turn not require passing the calculated feature importances first, calculating them internally on-the-fly if not passed, given that it will have all the required inputs already.
  • There should be a function to convert objects between the low-level and the high-level R interface, which in the absence of metadata, could be done by autofilling column names with "V1", "V2", etc. and factor levels with "1", "2", "3", etc.
  • (Controversial) The predict function for the x/y interface should have a parameter to determine whether it should subset/re-order columns of newdata according to the column names that were used in training, and in this case, it should also recode factor levels. Note that this subsetting and reordering is always done for base R's predict.glm, but factors are not re-coded there (so e.g. if the levels in factor columns of newdata differ from the training data, predict.glm would instead output something that might not make sense).
  • Custom objectives and custom metrics for the high-level interface should be passed the same kinds of response variables that were used for input - e.g. a factor instead of the binarized numeric vector that xgboost will use internally.
    • Perhaps custom objectives and metrics could be required to be passed as a list with necessary fields instead of as separate arguments, and required to have extra metadata like whether they are regression/classification, a potential inverse link function for predict that would be kept inside the output, etc. Don't have a strong preference here though, and I expect most users of custom objectives would be using the low-level interface anyway.
  • (Very low priority) Perhaps the high-level interface should allow using base R's "families" (as returned by e.g. stats::poisson(link = "log")) and the structure they involve as custom objectives (like glmnet does), so that one could pass a family-compliant object from other packages as objective. I see an issue here though in that these families are meant for Fisher scoring, which means that in the case of non-canonical link functions like binomial(link = "probit"), they wouldn't calculate the true Hessian; but I guess Fisher scoring should work just as well with gradient boosting. Not an expert in this topic though.
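The type-based objective selection described above could be sketched in base R roughly as follows (the function name and the exact mapping are illustrative, not a settled design):

```r
# Sketch: pick a default objective from the class of the response `y`
infer_objective <- function(y) {
  if (inherits(y, "Surv")) {
    # right-censored-only data could instead map to survival:cox
    "survival:aft"
  } else if (is.factor(y)) {
    if (nlevels(y) == 2L) "binary:logistic" else "multi:softmax"
  } else if (is.numeric(y) || is.logical(y)) {
    # logical (boolean) responses are taken as 0/1 numeric
    "reg:squarederror"
  } else {
    stop("Unsupported response type: ", class(y)[1L])
  }
}

infer_objective(factor(c("spam", "ham", "ham")))  # "binary:logistic"
infer_objective(factor(letters[1:3]))             # "multi:softmax"
infer_objective(c(TRUE, FALSE))                   # "reg:squarederror"
```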

The low-level interface in turn should support everything that the C interface offers - e.g. creating DMatrix from libsvm files, specifying categorical columns from a non-data-frame array, etc.


As for the formula interface, this is quite tricky to implement in a good way for xgboost.

In the case of linear models, it's quite handy to create these features on-the-fly to find out good transformations and e.g. be able to call stepwise feature selectors on the result, but:

  • Stepwise selection from base R doesn't work with xgboost.
  • Transformations like log(x1) have no effect on decision trees; terms like x1:x2 don't have the same practical effect, since decision trees implicitly create interactions; and a term like x^2, the way R formulas interpret it, doesn't actually create a squared feature, unlike the more explicit I(x^2) + x, for example.
  • Other outputs from formulas like contrasts are not used either way.
  • Transformations like log(y) aren't any harder to do with an x/y interface than with a formula interface.
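The x^2 point is easy to demonstrate with base R's formula semantics, where ^ means term crossing rather than arithmetic, so x^2 collapses to just x and an explicit I() is needed to get a squared column:

```r
d <- data.frame(y = 1:5, x = c(0.5, 1, 1.5, 2, 2.5))

# ^ is formula crossing, so x^2 = x * x = x: no squared term is created
colnames(model.matrix(y ~ x^2, d))
# "(Intercept)" "x"

# I() protects the arithmetic, producing an actual squared column
colnames(model.matrix(y ~ I(x^2) + x, d))
# "(Intercept)" "I(x^2)" "x"
```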

Nevertheless, a formula interface can still be handy for calls like y ~ x1 + x2 + x3 when there are many columns, and packages like ranger offer such an interface, so perhaps it might be worth having one, even if it's not the recommended way to use the package.

Some nuances in terms of formulas though:

  • Formulas can determine whether to add an intercept or not, but xgboost AFAIK doesn't allow fitting without an intercept, and has no use for a column named (Intercept) that would have the same value for every row.
  • Base R's formulas will either do one-hot or dummy encoding for factor/character columns according to whether there is an intercept or not, while xgboost now has native support for categorical columns.
  • Using a formula for processing the training data also implies using it for predict - so for example, formulas do not recode levels of factor variables when calling predict, which the x/y interface could potentially do, leading to differences in behaviors between both interfaces.
  • Some packages have their own formula parsers which allow additional interpretations of something like y ~ x | z beyond what base R would do (for example, lme4 would interpret z here as something that controls mixed effects for x, while base R would interpret this as a feature "x or z"), and in some cases xgboost() would also need a different interpretation of formulas (e.g. for the parameter qid, which doesn't fit on either side of the formula).
  • Packages like randomForest don't use the base R formula parser, taking it instead (by copying the code) from a different library, e1071, which is GPLv2-licensed - a license incompatible with xgboost's Apache license.
  • Referring to columns that are the result of a formula by index in e.g. monotone_constraints could be tricky - e.g. if we remove the auto-added (Intercept) column, should the numbers re-adjust?

Hence, supporting a formula interface for xgboost() would be tricky:

  • It could be possible to use base R's formula parser, but:
    • In this case it would not be possible to use categorical features (or their interactions) as such, as they will be either dummy- or one-hot encoded.
    • It could theoretically be possible to forcibly add -1 at the end of the formula (which means "don't add an intercept") by converting it to a string and back, in order to get one-hot encoding of factors and avoid adding (Intercept), but I can foresee cases in which this might not work depending on how the formula is input.
    • It would not be possible to specify additional columns like qid in these formulas.
  • It could be possible to develop a custom formula parser that would not one-hot-encode categorical features and e.g. interpret | differently, but what should happen in this case if the user requests something like xnum*xcat or f(xcat) (with f being some arbitrary base function) ?

Unless we can find some other package that would better handle formula parsing and that could be reasonable to use as dependency (I'm not aware of any), I think the best way forward here would be to:

  • Use base R's formula parser.
  • Don't support parameters like monotone_constraints or qid in the formula interface.
  • Don't support native categorical columns, relying instead on the one-hot or dummy-encoding from the formula.
  • Try to remove the intercept by trying to create a new formula <curr formula> - 1 and error out if this doesn't succeed.
  • Let the formula handle all the predict post-processing, regardless of how the x/y interface does it.
  • Encourage users and reverse dependencies to avoid the formula interface for serious usage by being clear about all these limitations in the docs and having a warning message in the "description" section (which comes before the signatures in the doc). We're not in control of how reverse dependencies use it, but if any of the larger ones were tempted to use this formula interface anyway, it could be followed up in their GitHub repositories.
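The intercept-removal step can likely be done with update() rather than by converting the formula to a string and back; with the intercept gone, base R one-hot-encodes the first factor instead of dummy-encoding it. A small base R sketch of the idea:

```r
f <- y ~ x1 + grp
f_noint <- update(f, . ~ . - 1)  # y ~ x1 + grp - 1

d <- data.frame(
  y   = 1:4,
  x1  = c(1.5, 2.0, 2.5, 3.0),
  grp = factor(c("a", "b", "a", "b"))
)

# With an intercept: "(Intercept)" plus dummy encoding (level "a" is dropped)
colnames(model.matrix(f, d))        # "(Intercept)" "x1" "grpb"

# Without: no "(Intercept)", and one column per factor level
colnames(model.matrix(f_noint, d))  # "x1" "grpa" "grpb"
```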

A couple questions for xgboost developers (@hcho3 @trivialfis ?) I have here:
  • Does xgboost support passing a mixture of dense and sparse data together? pandas supports sparse types, but if I understand it correctly from a brief look at the source code of the python interface, it will cast them to dense before creating the DMatrix object. If it supports mixtures of both, I think it'd again be ideal if the R interface could also have a way to create such a DMatrix for the low-level interface. Not sure if there'd be an idiomatic way to incorporate this in the high-level interface though, unless allowing passing something like a list.
    • Is there a C-level DMatrix equivalent of cbind / np.c_ ? I don't see any but this would make things easier.
  • Can sparse columns be taken as categorical without being densified? If so, I guess specifying categorical columns of a sparse matrix should also be supported in the high-level R interface.
  • In other interfaces, does passing a data frame of different types always involve putting everything into a contiguous array of the same type? I see there's a C function XGDMatrixCreateFromDT, but if I understand it correctly from a look at the pandas processing functions in the scikit-learn interface, if the input involves types like int64, these get cast to a floating-point type. In R, especially when using data.table, it's oftentimes common to have 64-bit integers from the package bit64, which, if cast to float64, would lose integral precision beyond 2^53 (likewise, int32 could lose precision when cast to float32) - I am wondering if there should be a way to support these without loss of precision, and whether it's possible to efficiently create a DMatrix from a structure like an R data.frame, which is a list of arrays that aren't contiguous in memory and might have different dtypes.
  • Should the R interface keep track of the objectives and their types, or should for example the C interface have functionality to determine whether an objective is e.g. binary classification? As it is right now, it'd be quite easy to do this in R, but I can foresee a situation in which someone in the future submits a PR adding a new objective like binary:cauchit, adds it to the python XGBClassifier class, but overlooks the R code as it might be unknown to the contributor, and the R interface then won't act properly on receiving this objective.
  • How does the python interface keep the list of parameters up-to-date with the C interface? Is it done manually by maintainers? Is there some way in which the R interface could automatically keep in sync with the C one every time a new parameter is added? Or will it need to manually track every possible allowed argument and its default value?
  • How do other interfaces deal with idiosyncrasies like base0 vs. base1 indexing? I see there's also a Julia interface for example, which, just like R, uses base1 indexing, but from a brief look at the docs it looks like it sticks with base0 indexing for xgboost regardless.
  • Since the R object from the high-level interface in this case would have more metadata than what's kept in the serialized C Booster, this in theory could lead to issues when updating package versions and loading objects from an earlier version - for example, if a new field is added to the object class returned from xgboost(), an earlier object saved with saveRDS will not have such a field, which might lead to issues if it is assumed to exist. It'd be theoretically possible to add some function to auto-fill newly added fields when e.g. restoring the booster handle, but this could potentially translate into a lot of maintenance work and be quite hard to test and easy to miss when adding new features.
    • How does the python scikit-learn interface deal with pickle and object attributes that aren't part of the C Booster? How does it deal with converting between Booster and scikit-learn classes?
    • I see the R interface currently has a big notice about objects serialized with non-xgboost functions like saveRDS not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C Booster, but there could be cases in which the R interface changes independently of the C struct.
    • If one were to load a C Booster into the high-level interface and there's no metadata to take, I guess the most logical thing would be to fill with generic names "V1..N", "1..N", etc. like base R does, but this would not lead to a nice inter-op with the python scikit-learn interface. Does the python interface or other interfaces keep extra metadata that the Booster wouldn't? Is it somehow standardized?
  • How does xgboost determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it?
  • Do I understand it correctly from the python interface that there's no direct interop with arrow C structs when creating a DMatrix? If so, I guess support for arrow in R could be delayed until such a route is implemented. I guess the python interface would also benefit from such a functionality being available at the C level.

Some other questions for R users (@mayer79 ?):
  • Are there any popular data.frame-like objects from packages related to GPU computing (like cuDF for python) or distributed computing (other than spark) that I might perhaps be missing in this post?
  • Is there any popular R type for learning-to-rank tasks that would involve the qid and position information that xgboost uses? I am not aware of any but I'm no expert in this area.
  • Is there perhaps some object class from Bioconductor or from some other domain (like, don't know, geopositioning from e.g. gdal) that should logically map to some specific xgboost objective without undergoing transformation by the user to e.g. factor / numeric / etc. ?
  • Is there some package offering a custom formula parser meant for decision tree models?
