na.action default ignored in train #461

topepo · 2016-08-02T18:46:50Z

> library(caret)
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> 
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm")
> mod
Linear Regression 

150 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ... 
Resampling results:

  RMSE       Rsquared  
  0.4467646  0.02218261

First, the model should fail, since train.formula defaults to na.action = na.fail:

> formals(train.formula)
$form

$data

$...

$weights

$subset

$na.action
na.fail

$contrasts
NULL

Second, it says there were 150 samples when it should be 145 since missing data are omitted.

topepo · 2016-08-02T19:04:34Z

Here is the current code fragment:

> train.formula
function (form, data, ..., weights, subset, na.action = na.fail, 
    contrasts = NULL) {
    m <- match.call(expand.dots = FALSE)
    if (is.matrix(eval.parent(m$data))) m$data <- as.data.frame(data)
    m$... <- m$contrasts <- NULL
    m[[1]] <- as.name("model.frame")      ### <- point of interest #1
    m <- eval.parent(m)                   ### <- point of interest #2
    if (nrow(m) < 1) stop("Every row has at least one missing value were found")
    Terms <- attr(m, "terms")
    x <- model.matrix(Terms, m, contrasts, na.action = na.action)  ### <- point of interest #3

At point of interest 1, the object m is a modified call object:

Browse[2]> m
model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)
Browse[2]> str(m)
 language model.frame(form = Sepal.Width ~ Sepal.Length, data = iris2)

Note that na.action is not there since it was not specified in the actual function call.
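
A minimal base-R sketch of why the default goes missing: match.call() records only the arguments the caller actually typed, so a default that lives in the formals never reaches the call object. The function name show_call_names is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: mimics the first lines of train.formula and
# returns the argument names recorded in the matched call.
show_call_names <- function(form, data, ..., na.action = na.fail) {
  m <- match.call(expand.dots = FALSE)
  m$... <- NULL
  names(m)
}

# The arguments are never evaluated here, so `df` does not need to exist.
names_default  <- show_call_names(y ~ x, data = df)
names_explicit <- show_call_names(y ~ x, data = df, na.action = na.omit)

print(names_default)   # "" "form" "data"
print(names_explicit)  # "" "form" "data" "na.action"
```

Only the second call carries na.action, which is the situation train.formula was in whenever the user relied on the default.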

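At point of interest 2, this call is evaluated and m becomes the model frame built from iris2. Because na.action is absent from the call, model.frame falls back to getOption("na.action"), which is na.omit by default, so the rows with missing data are silently dropped.

Once we get to the model.matrix call at point of interest 3, there are no missing data left, so na.fail has nothing to fail on.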

These internals are almost identical to what is found in most R formula methods (e.g. lm). The main difference here is that I'm using a different default.
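
The effect can be reproduced with stats::model.frame alone, without caret. This sketch assumes the session option na.action is set to "na.omit" (the usual R default) and sets it explicitly so the result does not depend on local settings.

```r
iris2 <- iris
iris2[1:5, "Sepal.Length"] <- NA

# Pin the session-wide fallback that model.frame uses when na.action is absent.
old_opts <- options(na.action = "na.omit")

# No na.action in the call: rows with NA are dropped without any message.
mf_default <- model.frame(Sepal.Width ~ Sepal.Length, data = iris2)
n_default <- nrow(mf_default)

# na.action supplied explicitly: the call stops with an error instead.
fail_msg <- tryCatch(
  {
    model.frame(Sepal.Width ~ Sepal.Length, data = iris2, na.action = na.fail)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

options(old_opts)

print(n_default)  # 145
print(fail_msg)
```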

topepo · 2016-08-02T22:51:56Z

Here is the solution to the problem of na.action being ignored when the default is used:

train.formula <- function (form, data, ..., weights, subset, na.action = na.fail, contrasts = NULL)  {
  m <- match.call(expand.dots = FALSE)
  if (is.matrix(eval.parent(m$data)))  m$data <- as.data.frame(data)
  m$... <- m$contrasts <- NULL

  ## Look for missing `na.action` in call. To make the default (`na.fail`) 
  ## recognizable by `eval.parent(m)`, we need to add it to the call
  ## object `m`

  if(!("na.action" %in% names(m))) m$na.action <- quote(na.fail)

  m[[1]] <- quote(stats::model.frame)
  m <- eval.parent(m)
  if(nrow(m) < 1) stop("Every row has at least one missing value were found")
  Terms <- attr(m, "terms")
  x <- model.matrix(Terms, m, contrasts)
  cons <- attr(x, "contrast")
  int_flag <- grepl("(Intercept)", colnames(x))
  if (any(int_flag)) x <- x[, !int_flag, drop = FALSE]
  w <- as.vector(model.weights(m))
  y <- model.response(m)

In this case, we get

> library(caret)
> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> 
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm")
Error in na.fail.default(list(Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9,  : 
  missing values in object

and

> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression 

150 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ... 
Resampling results:

  RMSE       Rsquared  
  0.4467646  0.02218261

At least the behavior is now correct, although the reported number of samples is still wrong.
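
The call-patching step can be tried in isolation. The following is a stripped-down stand-in, not caret's train.formula: build_frame is a hypothetical name, and it only builds the model frame so the two na.action paths can be compared.

```r
# Hypothetical stand-in for the front half of train.formula.
build_frame <- function(form, data, ..., na.action = na.fail) {
  m <- match.call(expand.dots = FALSE)
  m$... <- NULL
  # If the caller did not type na.action, write the default into the call
  # so that model.frame sees it when the call is evaluated.
  if (!("na.action" %in% names(m))) m$na.action <- quote(na.fail)
  m[[1]] <- quote(stats::model.frame)
  eval.parent(m)
}

iris2 <- iris
iris2[1:5, "Sepal.Length"] <- NA

# Default path: na.fail is now honored, so this raises an error.
default_result <- tryCatch(
  build_frame(Sepal.Width ~ Sepal.Length, data = iris2),
  error = function(e) e
)

# Explicit na.omit: incomplete rows are dropped on request.
omit_result <- build_frame(Sepal.Width ~ Sepal.Length, data = iris2,
                           na.action = na.omit)

print(inherits(default_result, "error"))  # TRUE
print(nrow(omit_result))                  # 145
```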

I've added another check for cases where someone wants to do imputation but accidentally uses na.omit:

> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit, preProc = "knnImpute")
Warning message:
In check_na_conflict(match.call(expand.dots = TRUE)) :
  `preProcess` includes an imputation method but missing data will be eliminated by the formula method using `na.action=na.omit`. Consider using `na.actin=na.pass` instead.
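
For readers who want the gist of that check, here is a rough sketch of the idea. It is not caret's check_na_conflict: the function name warn_na_conflict, its arguments, and the warning text are assumptions made for illustration, and the real implementation inspects the matched call rather than plain strings.

```r
# Hypothetical re-creation of the idea: warn when rows with missing data
# would be removed before an imputation step could ever see them.
warn_na_conflict <- function(na_action_name, pre_proc) {
  imputers <- c("knnImpute", "bagImpute", "medianImpute")
  conflict <- identical(na_action_name, "na.omit") &&
    any(pre_proc %in% imputers)
  if (conflict) {
    warning("imputation requested, but na.omit removes incomplete rows first; ",
            "consider na.pass instead")
  }
  invisible(conflict)
}

# Capture the warning text rather than letting it print.
conflict_msg <- tryCatch(
  warn_na_conflict("na.omit", "knnImpute"),
  warning = function(w) conditionMessage(w)
)
no_conflict <- warn_na_conflict("na.pass", "knnImpute")

print(conflict_msg)
print(no_conflict)  # FALSE
```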

topepo · 2016-08-03T00:30:45Z

The erroneous number of samples originates at the end of train.formula:

res$trainingData <- data

Even though train.default saves the training data, train.formula resaves it to avoid the conversion to dummy variables. However, the assignment above ignores na.action, so it was changed to

cc <- complete.cases(data[, all.vars(form), drop = FALSE])
res$trainingData <- data[cc,all.vars(form), drop = FALSE]

Now:

> iris2 <- iris
> iris2[1:5, "Sepal.Length"] <- NA
> set.seed(35)
> mod <- train(Sepal.Width ~ Sepal.Length, data = iris2, method = "lm", na.action = na.omit)
> mod
Linear Regression 

145 samples
  1 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 145, 145, 145, 145, 145, 145, ... 
Resampling results:

  RMSE       Rsquared  
  0.4467646  0.02218261

topepo · 2016-08-03T00:38:22Z

Correction to the last comment: all.vars(form) fails with a formula such as y ~ ., so the better solution appears to be

cc <- complete.cases(data[, all.vars(Terms), drop = FALSE])
res$trainingData <- data[cc,all.vars(Terms), drop = FALSE]
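
A small base-R check of the difference, using the same iris2 data. The terms object is taken from model.frame here, as it is inside train.formula; the variable names bad_subset and training are only for this sketch.

```r
iris2 <- iris
iris2[1:5, "Sepal.Length"] <- NA

form <- Sepal.Width ~ .

# The raw formula still contains the literal dot, which is not a column.
print(all.vars(form))  # "Sepal.Width" "."
bad_subset <- tryCatch(
  iris2[, all.vars(form), drop = FALSE],
  error = function(e) e
)

# The terms object has the dot expanded against the data.
Terms <- attr(model.frame(form, data = iris2, na.action = na.omit), "terms")
print(all.vars(Terms))

# Keep only complete rows of the variables the model actually uses.
cc <- complete.cases(iris2[, all.vars(Terms), drop = FALSE])
training <- iris2[cc, all.vars(Terms), drop = FALSE]
print(dim(training))  # 145 5
```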

topepo · 2016-08-03T11:32:07Z

Regression tests have passed for these changes.

topepo added the bug label Aug 2, 2016

topepo changed the title ~~na.fail default in train ignored~~ na.action default ignored in train Aug 3, 2016

topepo added a commit that referenced this issue Aug 3, 2016

changes for na.fail in issue #461

3e65af5

topepo closed this as completed Aug 3, 2016

farbodr mentioned this issue Aug 28, 2016

Error in na.fail.default and #461 fix #479

Closed

topepo mentioned this issue Jan 12, 2017

the missing value in xgbTree training with caret #573

Open
