na.action default ignored in train #461
Here is the current code fragment:
> train.formula
function (form, data, ..., weights, subset, na.action = na.fail,
contrasts = NULL) {
m <- match.call(expand.dots = FALSE)
if (is.matrix(eval.parent(m$data))) m$data <- as.data.frame(data)
m$... <- m$contrasts <- NULL
m[[1]] <- as.name("model.frame") ### <- point of interest #1
m <- eval.parent(m) ### <- point of interest #2
if (nrow(m) < 1) stop("Every row has at least one missing value were found")
Terms <- attr(m, "terms")
x <- model.matrix(Terms, m, contrasts, na.action = na.action) ### <- point of interest #3
At point of interest 1, the object
Browse[2]> m
model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)
Browse[2]> str(m)
language model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)
Note that
At point of interest 2, this is evaluated and the actual data set (
Once we get to the
These internals are almost identical to what is found in most R formula methods (e.g.
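The mechanism discussed in this comment can be reproduced in base R without caret. The sketch below is only an illustration: `count_rows` and the data frame `d` are hypothetical names, not part of caret. It relies on two documented behaviors: `match.call()` records only the arguments the caller actually supplied, and `model.frame` falls back on `getOption("na.action")` when no `na.action` is passed to it.

```r
# Hypothetical helper (not part of caret) that mimics the match.call() idiom
# used in train.formula. Its own default, na.action = na.fail, is never used.
count_rows <- function(formula, data, na.action = na.fail) {
  m <- match.call()
  # Defaults are not recorded by match.call(), only supplied arguments.
  has_na_action <- "na.action" %in% names(m)
  m[[1]] <- quote(stats::model.frame)
  mf <- eval.parent(m)  # evaluates model.frame(formula = ..., data = ...)
  list(has_na_action = has_na_action, n = nrow(mf))
}

d <- data.frame(y = c(1, 2, 3, 4), x = c(NA, 1, 2, 3))

# Set R's usual default explicitly so the example does not depend on the session.
old <- options(na.action = "na.omit")
res <- count_rows(y ~ x, data = d)
options(old)

res$has_na_action  # FALSE: the default never made it into the call
res$n              # 3: the row with NA was silently dropped instead of failing
```

Because the rebuilt call carries no `na.action`, the declared default of `na.fail` has no effect and the session-level option decides what happens to incomplete rows.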
Here is the solution to problem of
train.formula <- function (form, data, ..., weights, subset, na.action = na.fail, contrasts = NULL) {
m <- match.call(expand.dots = FALSE)
if (is.matrix(eval.parent(m$data))) m$data <- as.data.frame(data)
m$... <- m$contrasts <- NULL
## Look for missing `na.action` in call. To make the default (`na.fail`)
## recognizable by `eval.parent(m)`, we need to add it to the call
## object `m`
if(!("na.action" %in% names(m))) m$na.action <- quote(na.fail)
m[[1]] <- quote(stats::model.frame)
m <- eval.parent(m)
if(nrow(m) < 1) stop("Every row has at least one missing value were found")
Terms <- attr(m, "terms")
x <- model.matrix(Terms, m, contrasts)
cons <- attr(x, "contrast")
int_flag <- grepl("(Intercept)", colnames(x))
if (any(int_flag)) x <- x[, !int_flag, drop = FALSE]
w <- as.vector(model.weights(m))
y <- model.response(m)
In this case, we get
> library(caret)
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
>
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm")
Error in na.fail.default(list(Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, :
missing values in object
and
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression
150 samples
4 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ...
Resampling results:
RMSE Rsquared
0.4467646 0.02218261
At least the behavior is correct, although the number of samples is still wrong. I've added another check for cases where someone wants to do imputation but accidentally uses
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit, preProc = "knnImpute")
Warning message:
In check_na_conflict(match.call(expand.dots = TRUE)) :
`preProcess` includes an imputation method but missing data will be eliminated by the formula method using `na.action=na.omit`. Consider using `na.actin=na.pass` instead.
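The `na.action` patch shown earlier in this comment can be tried in isolation with base R. This is a minimal sketch, not the caret implementation: `count_rows_fixed` and `d` are hypothetical names, and only the line that injects `na.fail` into the call is taken from the patched `train.formula`.

```r
# Hypothetical helper (not part of caret) applying the same patch:
# if the caller did not supply na.action, add the default to the call object.
count_rows_fixed <- function(formula, data, na.action = na.fail) {
  m <- match.call()
  if (!("na.action" %in% names(m))) m$na.action <- quote(na.fail)
  m[[1]] <- quote(stats::model.frame)
  nrow(eval.parent(m))
}

d <- data.frame(y = c(1, 2, 3, 4), x = c(NA, 1, 2, 3))

# With no na.action supplied, the injected na.fail now stops on missing data.
res_default <- tryCatch(
  count_rows_fixed(y ~ x, data = d),
  error = function(e) "error"
)

# An explicit na.action is still passed through unchanged.
res_omit <- count_rows_fixed(y ~ x, data = d, na.action = na.omit)

res_default  # "error"
res_omit     # 3
```

Using `quote(na.fail)` stores the symbol rather than the function body in the call, so the call still prints compactly and is resolved when `eval.parent(m)` runs.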
As for the erroneous number of samples, the issue occurs at the end of
res$trainingData <- data
Even though
cc <- complete.cases(data[, all.vars(form), drop = FALSE])
res$trainingData <- data[cc, all.vars(form), drop = FALSE]
Now:
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression
145 samples
1 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ...
Resampling results:
RMSE Rsquared
0.4467646 0.02218261
Correction to the last comment:
cc <- complete.cases(data[, all.vars(Terms), drop = FALSE])
res$trainingData <- data[cc, all.vars(Terms), drop = FALSE]
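The subsetting step above can be checked with base R alone. The sketch below reuses the `iris2` example from this thread; the variable names `vars`, `training`, `vars_raw`, and `vars_expanded` are illustrative. The last part shows one plausible reason for preferring `Terms` over `form`. That reason is an assumption, not something stated in the thread: a formula containing `.` only yields real column names once it has been expanded against the data.

```r
iris2 <- iris
iris2[1:5, "Sepal.Length"] <- NA

form  <- Sepal.Width ~ Sepal.Length
Terms <- terms(form, data = iris2)

# Keep only the model variables and only the rows with no missing values.
vars <- all.vars(Terms)
cc <- complete.cases(iris2[, vars, drop = FALSE])
training <- iris2[cc, vars, drop = FALSE]
dim(training)  # 145 rows, 2 columns

# Assumed motivation for using Terms: with a dot formula, all.vars() on the
# raw formula returns "." itself, which is not a column of the data.
vars_raw      <- all.vars(Sepal.Width ~ .)
vars_expanded <- all.vars(terms(Sepal.Width ~ ., data = iris2))
```

Counting `training` this way gives the 145 samples and single predictor reported in the corrected `train` output.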
Regression tests have passed for these changes.
First, the model should fail since the underlying code has:
Second, it says there were 150 samples when it should be 145 since missing data are omitted.