Error when using the sampling parameter (SMOTE-ROSE) and few predictors #612

Closed
diugalde opened this issue Mar 8, 2017 · 7 comments

@diugalde diugalde commented Mar 8, 2017

I'm getting an error when I use SMOTE or ROSE sampling to train a model on an unbalanced dataset with few columns. It happens with different algorithms (glm, avNNet, parRF, ...), so it is not specific to one model implementation. I suspect it has to do with the way SMOTE and ROSE behave with few columns.

Example:

library(caret)
set.seed(1)

# Generate an example dataset.
training <- twoClassSim(342, intercept = -10, linearVars = 1)

# Unbalanced.
table(training[, "Class"])

# Reduce the dataset to use only 1 predictor.
training <- training[, c("Nonlinear2", "Class")]

ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote") # With rose I get the same error. 

m <- train(Class ~ ., data = training,
           method = "glm",
           metric = "ROC",
           trControl = ctrl)

Error:

Error in { : 
  task 1 failed - "arguments imply differing number of rows: 315, 34"

If I try to do the sampling AFTER the preProcessing, I get the same error.

Notably, I don't get the error with the "up" and "down" sampling methods.

Author

@diugalde diugalde commented Mar 8, 2017

If a preProcess option is added to train, I get a different error:

Code:

library(caret)
set.seed(1)

# Generate an example dataset.
training <- twoClassSim(342, intercept = -10, linearVars = 1)

# Reduce the dataset to use only 1 predictor.
training <- training[, c("Nonlinear2", "Class")]

# Unbalanced.
table(training[, "Class"])

ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "rose")


preProcess <- c("nzv", "scale", "center", "YeoJohnson")

m <- train(Class ~ ., data = training,
           method = "glm",
           metric = "ROC",
           trControl = ctrl,
           preProcess = preProcess)

Error:

Error in get_types(x) : `x` must have column names
Author

@diugalde diugalde commented Mar 8, 2017

I think the error is in these functions:

> getSamplingInfo()$smote

$smote
function (x, y) 
{
    checkInstall("DMwR")
    library(DMwR)
    dat <- if (is.data.frame(x)) 
        x
    else as.data.frame(x)
    dat$.y <- y
    dat <- SMOTE(.y ~ ., data = dat)
    list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)], 
        y = dat$.y)
}


> getSamplingInfo()$rose

$rose
function (x, y) 
{
    checkInstall("ROSE")
    library(ROSE)
    dat <- if (is.data.frame(x)) 
        x
    else as.data.frame(x)
    dat$.y <- y
    dat <- ROSE(.y ~ ., data = dat)$data
    list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)], 
        y = dat$.y)
}

If you call one of those functions with x as a one-column data.frame, the returned list contains x as a plain numeric vector, when it should still be a one-column data.frame. That is why x has no column names in later phases such as preProcessing: it is no longer a data.frame.
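The behavior described above can be reproduced in base R without caret, DMwR, or ROSE. The following is a minimal sketch on made-up toy data (the values are illustrative, not from the issue); it uses the same column selection as the sampling functions, with and without drop = FALSE.

```r
# Toy data: one predictor plus the outcome column `.y`, mirroring the `dat`
# object built inside the sampling functions. The values are made up.
dat <- data.frame(Nonlinear2 = c(0.1, 0.5, 0.9),
                  .y = factor(c("a", "b", "a")))

# Same selection as in the sampling functions: every column except `.y`.
keep <- !grepl(".y", colnames(dat), fixed = TRUE)

dropped <- dat[, keep]                # one column selected: simplified to a vector
kept    <- dat[, keep, drop = FALSE]  # drop = FALSE keeps the data.frame

is.data.frame(dropped)  # FALSE
colnames(dropped)       # NULL
is.data.frame(kept)     # TRUE
colnames(kept)          # "Nonlinear2"
```

With two or more predictors the selection already returns a data.frame, which is why the problem only appears when a single predictor is left.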

Contributor

@m-dz m-dz commented Mar 13, 2017

It looks like both rose() and smote() are missing the drop = FALSE argument when subsetting the columns, see here:

caret::getSamplingInfo()$rose(training[, c("Nonlinear1", "Nonlinear2")], training[, "Class"])
caret::getSamplingInfo()$rose(training[, c("Nonlinear2")], training[, "Class"])

rose_fix <- function (x, y) 
{
  checkInstall("ROSE")
  library(ROSE)
  dat <- if (is.data.frame(x)) 
    x
  else as.data.frame(x)
  dat$.y <- y
  dat <- ROSE(.y ~ ., data = dat)$data
  list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE), drop = FALSE], 
       y = dat$.y)
}

rose_fix(training[, c("Nonlinear1", "Nonlinear2")], training[, "Class"]) %>% str()
rose_fix(training[, c("Nonlinear2")], training[, "Class"]) %>% str()

This can be "hotfixed" by overwriting caret's stored sampling methods with:

sampling_methods <- list(down = function(x, y) downSample(x, y, list = TRUE),
                         up = function(x, y) upSample(x, y, list = TRUE),
                         smote = function(x, y) {
                           checkInstall("DMwR")
                           library(DMwR)
                           dat <- if(is.data.frame(x)) x else as.data.frame(x)
                           dat$.y <- y
                           dat <- SMOTE(.y ~ ., data = dat)
                           list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE), drop = FALSE], 
                                y = dat$.y)
                         },
                         rose = function(x, y) {
                           checkInstall("ROSE")
                           library(ROSE)
                           dat <- if(is.data.frame(x)) x else as.data.frame(x)
                           dat$.y <- y
                           dat <- ROSE(.y ~ ., data = dat)$data
                           list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE), drop = FALSE], 
                                y = dat$.y)
                         })

save(sampling_methods, file = "**YOUR_R_LIB_PATH**/caret/models/sampling.RData")

I would try to prepare a pull request, but I am scared to break something...
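As an alternative to overwriting sampling.RData, the same idea can be sketched as a wrapper that repairs the output of an existing sampler. The helper name as_safe_sampler and the toy dropping_sampler below are hypothetical and not part of caret; the sketch only assumes that a sampler takes (x, y) and returns a list with elements x and y, as the functions above do.

```r
# Hypothetical helper (not part of caret): wrap a sampling function so that
# the returned `x` is always a data.frame carrying the input's column names.
as_safe_sampler <- function(sampler) {
  function(x, y) {
    x <- if (is.data.frame(x)) x else as.data.frame(x)
    out <- sampler(x, y)
    if (!is.data.frame(out$x)) {
      # A single predictor was simplified to a vector; rebuild the data.frame.
      out$x <- setNames(data.frame(out$x), colnames(x))
    }
    out
  }
}

# Toy sampler that reproduces the bug: it keeps every row but drops dimensions.
dropping_sampler <- function(x, y) list(x = x[, 1], y = y)

x <- data.frame(Nonlinear2 = c(0.1, 0.5, 0.9))
y <- factor(c("Class1", "Class2", "Class1"))

fixed <- as_safe_sampler(dropping_sampler)(x, y)
str(fixed$x)
```

Assuming caret's list form for custom sampling (elements name, func, and first), the wrapped sampler could then be passed as trainControl(sampling = list(name = "rose_safe", func = as_safe_sampler(getSamplingInfo()$rose), first = TRUE)) without touching the installed package files.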

Owner

@topepo topepo commented Mar 13, 2017

Your diagnosis is correct. I'll create fixes in the next day or two, and a new CRAN release should happen within a week or so.

Thanks

topepo added a commit that referenced this issue Mar 15, 2017
Owner

@topepo topepo commented Mar 15, 2017

Please test...

Contributor

@hadjipantelis hadjipantelis commented Mar 15, 2017

It works on some simple examples I tried. Should a relevant unit test be added? :D
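One possible shape for such a regression check is sketched below in base R. It is not the test that was added to caret; check_sampler_output and the toy good and bad samplers are hypothetical names used only for illustration.

```r
# Hypothetical check (not caret's actual test): a sampler must return a list
# whose `x` is a data.frame with the input's column names and one row per
# element of `y`.
check_sampler_output <- function(sampler, x, y) {
  out <- sampler(x, y)
  stopifnot(
    is.list(out),
    is.data.frame(out$x),
    identical(colnames(out$x), colnames(x)),
    nrow(out$x) == length(out$y)
  )
  invisible(TRUE)
}

# Single-predictor toy input, the case that triggered this issue.
x <- data.frame(Nonlinear2 = c(0.1, 0.5, 0.9, 1.3))
y <- factor(c("Class1", "Class1", "Class1", "Class2"))

good <- function(x, y) list(x = x[, 1, drop = FALSE], y = y)  # keeps the data.frame
bad  <- function(x, y) list(x = x[, 1], y = y)                # drops to a vector

check_sampler_output(good, x, y)
result_bad <- tryCatch(check_sampler_output(bad, x, y),
                       error = function(e) "failed")
```

With caret, DMwR, and ROSE installed, the same check could presumably be pointed at getSamplingInfo()$smote and getSamplingInfo()$rose on a single-predictor dataset.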

topepo added a commit that referenced this issue Apr 7, 2017
Owner

@topepo topepo commented Apr 7, 2017

Yep!

Thanks. I'm aiming to get coverage about 20% =]

@topepo topepo closed this Apr 7, 2017