Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Allow for valid data to be a slice of the training data #6844

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
9 changes: 7 additions & 2 deletions R-package/R/lgb.train.R
Original file line number Diff line number Diff line change
@@ -184,9 +184,14 @@ lgb.train <- function(params = list(),
next
}

# Update parameters, data
# Update parameters
valid_data$update_params(params)
valid_data$set_reference(data)

# No need to set reference if one exists (e.g., a slice)
if (is.null(valid_data$.__enclos_env__$private$reference)) {
valid_data$set_reference(data)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is quite right.

This would result in not attempting the update if the validation Dataset has any reference. That's a problem, because it means that if the training data and validation data have different bin mappers, we might not find that out until much later in the program (and probably with hard-to-understand errors or maybe even more serious problems like segfaults).

(if you're not sure what the terms "reference" or "bin mapper" mean in the context of LightGBM, please tell me and I'll explain)

I think the root cause of this bug is actually right here:

set_reference = function(reference) {
# setting reference to this same Dataset object doesn't require any changes
if (identical(private$reference, reference)) {
return(invisible(self))
}

When private$reference is not NULL, it's an lgb.Dataset (an {R6} class instance).

library(lightgbm)

data("iris")

ds <- lgb.Dataset(
    as.matrix(iris[, seq_len(3L)])
    , label = iris[, 4L]
)

dtrain <- lgb.slice.Dataset(ds, seq_len(100L))
dtrain$.__enclos_env__$private$reference
# <lgb.Dataset>
#   Public:
#     construct: function () 
#     create_valid: function (data, label = NULL, weight = NULL, group = NULL, init_score = NULL, 

Using identical() on 2 {R6} class instances is not reliable... it can be affected by implementation details of how {R6} works:

But for that check, it does not matter whether they're the same exact R object (in fact, we can be confident that they aren't)... it's just important that their handles point to the same Dataset on the C++ side.

I don't really want to take on maintaining or implementing an ==.lgb.Dataset() generic here (as suggested in r-lib/R6#211)... for this purpose, I think we should try an internal function that only checks the equality of the characteristics we care about.

Maybe something like this in lgb.Dataset.R (you may have to experiment with this a bit):

.datasets_are_equal <- function(ds1, ds2) {
    if (is.null(ds1) && is.null(ds2)) {
        return(TRUE)
    }
    if (is.null(ds1) || is.null(ds2)) {
        return(FALSE)
    }
    return(
        identical(
            ds1$.__enclos_env__$private$get_handle()
            , ds1$.__enclos_env__$private$get_handle()
        )
    )
}

And then replacing identical() with a call to that function in the spot I linked to above.

Are you interested in attempting that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate you taking the time to break that down! This makes sense so far. I’d like to make an attempt at those changes so I can understand the codebase better. Thanks!


reduced_valid_sets[[key]] <- valid_data

}
Loading
Oops, something went wrong.