na.action default ignored in train #461

Closed
topepo opened this issue Aug 2, 2016 · 5 comments

topepo commented Aug 2, 2016

> library(caret)
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> 
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm")
> mod
Linear Regression 

150 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ... 
Resampling results:

  RMSE       Rsquared  
  0.4467646  0.02218261

First, the model should fail because the default for na.action in train.formula is na.fail:

> formals(train.formula)
$form

$data

$...

$weights

$subset

$na.action
na.fail

$contrasts
NULL

Second, the printed output reports 150 samples when it should report 145, since the five rows with missing data are omitted.

@topepo topepo added the bug label Aug 2, 2016

topepo commented Aug 2, 2016

Here is the current code fragment:

> train.formula
function (form, data, ..., weights, subset, na.action = na.fail, 
    contrasts = NULL) {
    m <- match.call(expand.dots = FALSE)
    if (is.matrix(eval.parent(m$data))) m$data <- as.data.frame(data)
    m$... <- m$contrasts <- NULL
    m[[1]] <- as.name("model.frame")      ### <- point of interest #1
    m <- eval.parent(m)                   ### <- point of interest #2
    if (nrow(m) < 1) stop("Every row has at least one missing value were found")
    Terms <- attr(m, "terms")
    x <- model.matrix(Terms, m, contrasts, na.action = na.action)  ### <- point of interest #3

At point of interest 1, the object m is a modified call object:

Browse[2]> m
model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)
Browse[2]> str(m)
 language model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)

Note that na.action is not there since it was not specified in the actual function call.

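At point of interest 2, the call is evaluated. Because na.action is absent from the call, model.frame falls back on its own missing-value handling (getOption("na.action"), which is na.omit in a default R session), so m becomes a model frame built from iris2 with the rows containing missing data already dropped.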

By the time we reach the model.matrix call at point of interest 3, there are no missing data left, so nothing fails.

These internals are almost identical to what is found in most R formula methods (e.g. lm). The main difference here is that I'm using a different default.
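
To make the mechanism concrete, here is a minimal base R sketch of the same idiom. The function demo_frame and the data frame d are hypothetical stand-ins, not caret code; the only thing kept from train.formula is the match.call/eval.parent pattern. It assumes the session option na.action is "na.omit", which the sketch sets explicitly.

```r
# Hypothetical stand-in for train.formula; only the call-rewriting idiom is kept.
demo_frame <- function(form, data, na.action = na.fail) {
  m <- match.call(expand.dots = FALSE)   # records only the arguments actually supplied
  m[[1]] <- quote(stats::model.frame)    # turn the call into a model.frame call
  frame <- eval.parent(m)                # evaluate it in the caller's frame
  list(call = m, frame = frame)
}

d <- data.frame(y = c(1, 2, 3, 4), x = c(NA, 1, 2, 3))

old <- options(na.action = "na.omit")    # R's factory default, set explicitly here
res <- demo_frame(y ~ x, data = d)
options(old)

print(res$call)                          # the rewritten call has no na.action argument
has_na_action <- "na.action" %in% names(res$call)  # FALSE
rows_kept <- nrow(res$frame)                       # 3: the NA row was silently dropped
```

The default na.fail in the function signature never reaches model.frame, because match.call only captures arguments that the caller typed.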

topepo commented Aug 2, 2016

Here is the solution to the problem of na.action being ignored when the default is used:

train.formula <- function (form, data, ..., weights, subset, na.action = na.fail, contrasts = NULL)  {
  m <- match.call(expand.dots = FALSE)
  if (is.matrix(eval.parent(m$data)))  m$data <- as.data.frame(data)
  m$... <- m$contrasts <- NULL

  ## Look for missing `na.action` in call. To make the default (`na.fail`) 
  ## recognizable by `eval.parent(m)`, we need to add it to the call
  ## object `m`

  if(!("na.action" %in% names(m))) m$na.action <- quote(na.fail)

  m[[1]] <- quote(stats::model.frame)
  m <- eval.parent(m)
  if(nrow(m) < 1) stop("Every row has at least one missing value were found")
  Terms <- attr(m, "terms")
  x <- model.matrix(Terms, m, contrasts)
  cons <- attr(x, "contrast")
  int_flag <- grepl("(Intercept)", colnames(x))
  if (any(int_flag)) x <- x[, !int_flag, drop = FALSE]
  w <- as.vector(model.weights(m))
  y <- model.response(m)  

In this case, we get

> library(caret)
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> 
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm")
Error in na.fail.default(list(Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9,  : 
  missing values in object

and

> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression 

150 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ... 
Resampling results:

  RMSE       Rsquared  
  0.4467646  0.02218261

At least the failure behavior is now correct, although the reported number of samples is still wrong.
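
The effect of adding na.action to the call object can be reproduced without caret. This is a sketch only: demo_frame_fixed and d are hypothetical names, and the function keeps just the lines of the fix that deal with the call object.

```r
# Hypothetical stand-in for the patched train.formula (call handling only).
demo_frame_fixed <- function(form, data, na.action = na.fail) {
  m <- match.call(expand.dots = FALSE)
  # If the caller did not supply na.action, write the default into the call
  # so that model.frame actually sees it.
  if (!("na.action" %in% names(m))) m$na.action <- quote(na.fail)
  m[[1]] <- quote(stats::model.frame)
  eval.parent(m)
}

d <- data.frame(y = c(1, 2, 3, 4), x = c(NA, 1, 2, 3))

# Default: na.fail is now honored, so the call errors on the missing value.
attempt <- try(demo_frame_fixed(y ~ x, data = d), silent = TRUE)
failed <- inherits(attempt, "try-error")           # TRUE

# An explicit na.action still takes precedence over the injected default.
kept <- nrow(demo_frame_fixed(y ~ x, data = d, na.action = na.omit))  # 3
```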

I've added another check for cases where someone wants to do imputation but accidentally uses na.omit:

> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit, preProc = "knnImpute")
Warning message:
In check_na_conflict(match.call(expand.dots = TRUE)) :
  `preProcess` includes an imputation method but missing data will be eliminated by the formula method using `na.action=na.omit`. Consider using `na.actin=na.pass` instead.
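
The reason for the warning can be seen directly with model.frame in base R. The sketch below uses a hypothetical data frame d and does not call caret: with na.omit the incomplete rows are gone before any pre-processing could see them, while na.pass keeps them so that an imputation step would have something to fill in.

```r
d <- data.frame(y = c(1, 2, 3, 4), x = c(NA, 1, 2, 3))

# na.omit drops the incomplete row, leaving nothing to impute.
n_omit <- nrow(model.frame(y ~ x, data = d, na.action = na.omit))  # 3

# na.pass keeps every row, including the one with the missing value.
mf_pass <- model.frame(y ~ x, data = d, na.action = na.pass)
n_pass <- nrow(mf_pass)                                            # 4
still_missing <- sum(is.na(mf_pass$x))                             # 1
```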

@topepo changed the title from "na.fail default in train ignored" to "na.action default ignored in train" Aug 3, 2016

topepo commented Aug 3, 2016

The erroneous number of samples comes from an assignment at the end of train.formula:

res$trainingData <- data

Even though train.default saves the training data, train.formula resaves it to avoid the conversion to dummy variables. However, the assignment above ignores na.action, so it was changed to

cc <- complete.cases(data[, all.vars(form), drop = FALSE])
res$trainingData <- data[cc,all.vars(form), drop = FALSE]

Now:

> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression 

145 samples
  1 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ... 
Resampling results:

  RMSE       Rsquared  
  0.4467646  0.02218261

topepo commented Aug 3, 2016

Correction to the last comment: all.vars(form) fails with a formula such as y ~ . so the better solution appears to be

cc <- complete.cases(data[, all.vars(Terms), drop = FALSE])
res$trainingData <- data[cc,all.vars(Terms), drop = FALSE]
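
A small base R illustration of the difference, using a hypothetical data frame d: all.vars on the raw formula returns the literal dot, whereas the terms object attached to the model frame has the dot expanded into real column names.

```r
d <- data.frame(y  = c(1, 2, 3, 4),
                x1 = c(NA, 1, 2, 3),
                x2 = c(5, 6, 7, 8))
form <- y ~ .

raw_vars <- all.vars(form)              # "y" ".": the dot is not a column of d

# The terms attribute of the model frame has the dot expanded.
mf <- model.frame(form, data = d, na.action = na.omit)
Terms <- attr(mf, "terms")
term_vars <- all.vars(Terms)            # "y" "x1" "x2"

# Same two lines as in the fix, applied to the toy data.
cc <- complete.cases(d[, term_vars, drop = FALSE])
training <- d[cc, term_vars, drop = FALSE]
n_training <- nrow(training)            # 3
```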

topepo commented Aug 3, 2016

Regression tests have passed for these changes.
