na.action default ignored in train #461
|
Here is the current code fragment:
> train.formula
function (form, data, ..., weights, subset, na.action = na.fail,
contrasts = NULL) {
m <- match.call(expand.dots = FALSE)
if (is.matrix(eval.parent(m$data))) m$data <- as.data.frame(data)
m$... <- m$contrasts <- NULL
m[[1]] <- as.name("model.frame") ### <- point of interest #1
m <- eval.parent(m) ### <- point of interest #2
if (nrow(m) < 1) stop("Every row has at least one missing value were found")
Terms <- attr(m, "terms")
x <- model.matrix(Terms, m, contrasts, na.action = na.action) ### <- point of interest #3
At point of interest 1, the object [...]
Browse[2]> m
model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)
Browse[2]> str(m)
language model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)
Note that [...]
At point of interest 2, this is evaluated and the actual data set ([...]
Once we get to the [...]
These internals are almost identical to what is found in most R formula methods (e.g. [...] |
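To make the mechanism above concrete, here is a minimal base-R sketch. `toy_formula_method` is a hypothetical stand-in for the idiom, not caret code. It assumes the session's `na.action` option is at its usual setting of `"na.omit"` (set explicitly below so the result is reproducible). The point is that a default declared in the wrapper's signature never makes it into the object returned by `match.call()`, so `model.frame()` falls back to its own handling.

```r
# Hypothetical wrapper mimicking the match.call() idiom; not caret's code.
toy_formula_method <- function(form, data, ..., na.action = na.fail) {
  m <- match.call(expand.dots = FALSE)
  m$... <- NULL
  call_seen <- m                       # the call as the user wrote it
  m[[1]] <- quote(stats::model.frame)  # re-target the call at model.frame
  mf <- eval.parent(m)                 # evaluated WITHOUT the wrapper's default
  list(call = call_seen, frame = mf)
}

df <- data.frame(y = c(1, 2, 3, 4), x = c(NA, 1, 2, 3))

old <- options(na.action = "na.omit")  # assumption: the usual session setting
res_default  <- toy_formula_method(y ~ x, data = df)
res_explicit <- toy_formula_method(y ~ x, data = df, na.action = na.omit)
options(old)

"na.action" %in% names(res_default$call)   # default is absent from the call
"na.action" %in% names(res_explicit$call)  # present only when the user supplies it
nrow(res_default$frame)                    # the NA row is dropped, not rejected
```

Because `na.action` is missing from the re-targeted call, `model.frame()` consults `getOption("na.action")` rather than the wrapper's `na.fail`, which is why no error is raised.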
|
Here is the solution to the problem:
train.formula <- function (form, data, ..., weights, subset, na.action = na.fail, contrasts = NULL) {
m <- match.call(expand.dots = FALSE)
if (is.matrix(eval.parent(m$data))) m$data <- as.data.frame(data)
m$... <- m$contrasts <- NULL
## Look for missing `na.action` in call. To make the default (`na.fail`)
## recognizable by `eval.parent(m)`, we need to add it to the call
## object `m`
if(!("na.action" %in% names(m))) m$na.action <- quote(na.fail)
m[[1]] <- quote(stats::model.frame)
m <- eval.parent(m)
if(nrow(m) < 1) stop("Every row has at least one missing value were found")
Terms <- attr(m, "terms")
x <- model.matrix(Terms, m, contrasts)
cons <- attr(x, "contrast")
int_flag <- grepl("(Intercept)", colnames(x))
if (any(int_flag)) x <- x[, !int_flag, drop = FALSE]
w <- as.vector(model.weights(m))
y <- model.response(m)
In this case, we get
> library(caret)
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
>
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm")
Error in na.fail.default(list(Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, :
missing values in object
and
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression
150 samples
4 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ...
Resampling results:
RMSE Rsquared
0.4467646 0.02218261
At least the behavior is correct, although the number of samples is still wrong. I've added another check for cases where someone wants to do imputation but accidentally uses [...]
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit, preProc = "knnImpute")
Warning message:
In check_na_conflict(match.call(expand.dots = TRUE)) :
`preProcess` includes an imputation method but missing data will be eliminated by the formula method using `na.action=na.omit`. Consider using `na.actin=na.pass` instead. |
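For anyone who wants to try the `na.action` fix in isolation, the sketch below reproduces just the patched step in base R. `patched_method` is a hypothetical name used for illustration. It only covers the part where the default is written into the call before `model.frame()` is evaluated, and leaves out everything else `train.formula` does.

```r
# Hypothetical reduction of the patched idiom; not the full train.formula.
patched_method <- function(form, data, ..., na.action = na.fail) {
  m <- match.call(expand.dots = FALSE)
  m$... <- NULL
  # If the user did not supply na.action, add the default to the call
  # so that model.frame() actually sees it.
  if (!("na.action" %in% names(m))) m$na.action <- quote(na.fail)
  m[[1]] <- quote(stats::model.frame)
  eval.parent(m)
}

df <- data.frame(y = c(1, 2, 3, 4), x = c(NA, 1, 2, 3))

# Default path: na.fail is now honored, so missing data is an error.
msg <- tryCatch(
  {
    patched_method(y ~ x, data = df)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Explicit path: the user's choice still wins.
kept <- patched_method(y ~ x, data = df, na.action = na.omit)
nrow(kept)
```

Using `quote(na.fail)` rather than the function object keeps the stored call readable when it is printed.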
|
To solve the issue of the erroneous number of samples: the issue occurs at the end of [...]
res$trainingData <- data
Even though [...]
cc <- complete.cases(data[, all.vars(form), drop = FALSE])
res$trainingData <- data[cc,all.vars(form), drop = FALSE]
Now:
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression
145 samples
1 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ...
Resampling results:
RMSE Rsquared
0.4467646 0.02218261 |
|
Correction to the last comment:
cc <- complete.cases(data[, all.vars(Terms), drop = FALSE])
res$trainingData <- data[cc,all.vars(Terms), drop = FALSE] |
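The subsetting step can be checked on its own with base R and the same `iris2` setup as above. This is a sketch, not the caret internals: `Terms` is built here with `terms()` directly rather than taken from a model frame. The last part illustrates one plausible reason to prefer `all.vars(Terms)` over `all.vars(form)`; this is my assumption, since the thread does not state the motivation. With a `.` on the right-hand side, the raw formula yields a literal `"."`, which is not a column name.

```r
iris2 <- iris
iris2[1:5, "Sepal.Length"] <- NA

form  <- Sepal.Width ~ Sepal.Length
Terms <- terms(form, data = iris2)   # stand-in for attr(m, "terms")

# Keep only the variables in the model, and only the complete rows.
vars <- all.vars(Terms)
cc <- complete.cases(iris2[, vars, drop = FALSE])
training <- iris2[cc, vars, drop = FALSE]

nrow(training)        # rows after dropping the 5 incomplete cases
ncol(training) - 1L   # predictors (all columns minus the outcome)

# Dot formulas: the raw formula keeps ".", the terms object expands it.
dot_form  <- Sepal.Width ~ .
raw_vars  <- all.vars(dot_form)
term_vars <- all.vars(terms(dot_form, data = iris2))
raw_vars
term_vars
```

Only the columns named in the model are checked for missing values, so an `NA` in an unused column would not remove a row.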
|
Regression tests have passed for these changes. |
First, the model should fail since the underlying code has:
Second, it says there were 150 samples when it should be 145 since missing data are omitted.