
Mega dump of issues #172

Open
alexpghayes opened this Issue Apr 10, 2019 · 2 comments

@alexpghayes

commented Apr 10, 2019

I'm sorry this is awful, but I don't have time right now to turn this into atomic, nicely scoped issues, so I'm just dumping everything here: there are so many bugs, and gah, it was frustrating.

library(tidymodels)
#> -- Attaching packages --------- tidymodels 0.0.2 --
#> v broom     0.5.2.9001     v purrr     0.3.2     
#> v dials     0.0.2          v recipes   0.1.5     
#> v dplyr     0.8.0.1        v rsample   0.0.4     
#> v ggplot2   3.1.0          v tibble    2.1.1     
#> v infer     0.4.0          v yardstick 0.0.3     
#> v parsnip   0.0.2
#> -- Conflicts ------------ tidymodels_conflicts() --
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()

# you need the dev version of tidyr for the following
# devtools::install_github("tidyverse/tidyr")
library(tidyr)
library(furrr)
#> Loading required package: future

set.seed(27)  # the one true seed

fit_on_fold <- function(spec, prepped) {

  x <- juice(prepped, all_predictors())
  y <- juice(prepped, all_outcomes())

  fit_xy(spec, x, y)
}

predict_helper <- function(fit, new_data, recipe) {

  if (inherits(new_data, "rsplit")) {
    obs <- as.integer(new_data, data = "assessment")
    new_data <- bake(recipe, assessment(new_data))
  } else {
    obs <- 1:nrow(new_data)
    new_data <- bake(recipe, new_data)
  }

  predict(fit, new_data, type = "prob") %>%
    mutate(obs = obs)
}

spread_nested_predictions <- function(data) {
  data %>%
    unnest(preds) %>%
    pivot_wider(
      id_cols = obs,
      names_from = model_id,
      values_from = contains(".pred")
    )
}

super_learner <- function(library, recipe, meta_spec, data) {

  folds <- vfold_cv(data, v = 10)

  cv_fits <- crossing(folds, library) %>%
    mutate(
      prepped = future_map(splits, prepper, recipe),
      fit = future_pmap(list(spec, prepped), fit_on_fold)
    )

  prepped <- prep(recipe, training = data)

  x <- juice(prepped, all_predictors())
  y <- juice(prepped, all_outcomes())

  full_fits <- library %>%
    mutate(fit = future_map(spec, fit_xy, x, y))

  holdout_preds <- cv_fits %>%
    mutate(
      preds = future_pmap(list(fit, splits, prepped), predict_helper)
    ) %>%
    spread_nested_predictions() %>%
    select(-obs)

  metalearner <- fit_xy(meta_spec, holdout_preds, y)

  sl <- list(full_fits = full_fits, metalearner = metalearner, recipe = prepped)
  class(sl) <- "super_learner"
  sl
}

predict.super_learner <- function(x, new_data, type = c("class", "prob")) {

  type <- rlang::arg_match(type)

  new_preds <- x$full_fits %>%
    mutate(preds = future_map(fit, predict_helper, new_data, x$recipe)) %>%
    spread_nested_predictions() %>%
    select(-obs)

  predict(x$metalearner, new_preds, type = type)
}

data_split <- credit_data %>%
  na.omit() %>%
  sample_n(1000) %>%
  initial_split(strata = "Status", prop = 0.75)

train <- training(data_split)
test  <- testing(data_split)

rec <- recipe(Status ~ ., data = train) %>%
  step_dummy(all_nominal(), -Status) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

meta <- multinom_reg(penalty = 1, mixture = 1) %>%
  set_engine("glmnet")

create_library <- function(model, grid) {
  tibble(spec = merge(model, grid)) %>%
    mutate(model_id = row_number())
}



lib <- create_library(
  svm_poly(mode = "classification", degree = 2, scale_factor = 0.1),
  grid_random(cost, size = 5)
)

# known error: https://stackoverflow.com/questions/18988956/predict-with-kernlab-package-error-error-in-localobject-test-vector-do
sl <- super_learner(lib, rec, meta, train)
#> Error in .local(object, ...): test vector does not match model !
predict(sl, test)
#> Error in predict(sl, test): object 'sl' not found

# technically this works, but you can't tune over cost_complexity
# (cost_complexity doesn't exist as a dials object yet; there's an
# open dials issue for this), so it's a major pain to use in practice
lib2 <- create_library(
  decision_tree(mode = "classification", cost_complexity = 0.01, min_n = 3),
  grid_random(tree_depth, size = 5)
)

sl2 <- super_learner(lib2, rec, meta, train)
predict(sl2, test)
#> # A tibble: 250 x 1
#>    .pred_class
#>    <fct>      
#>  1 good       
#>  2 good       
#>  3 good       
#>  4 good       
#>  5 good       
#>  6 good       
#>  7 good       
#>  8 good       
#>  9 good       
#> 10 good       
#> # ... with 240 more rows


lib4 <- create_library(
  multinom_reg(mode = "classification", mixture = 0.5),
  grid_random(penalty, size = 5)
)

sl4 <- super_learner(lib4, rec, meta, train)
#> Error in cbind2(1, newx) %*% (nbeta[[i]]): invalid class 'NA' to dup_mMatrix_as_dgeMatrix
predict(sl4, test)
#> Error in predict(sl4, test): object 'sl4' not found


lib5 <- create_library(
  nearest_neighbor(mode = "classification", weight_func = "rectangular", dist_power = 2),
  grid_random(neighbors %>% range_set(c(1, 20)), size = 5)
)

sl5 <- super_learner(lib5, rec, meta, train)
predict(sl5, test)
#> # A tibble: 250 x 1
#>    .pred_class
#>    <fct>      
#>  1 good       
#>  2 good       
#>  3 good       
#>  4 good       
#>  5 good       
#>  6 good       
#>  7 good       
#>  8 good       
#>  9 good       
#> 10 good       
#> # ... with 240 more rows



lib6 <- create_library(
  rand_forest(mode = "classification", trees = 100, min_n = 3),
  grid_random(mtry %>% range_set(c(1, 10)), size = 5)
)


sl6 <- super_learner(lib6, rec, meta, train)
predict(sl6, test)
#> # A tibble: 250 x 1
#>    .pred_class
#>    <fct>      
#>  1 good       
#>  2 good       
#>  3 good       
#>  4 good       
#>  5 good       
#>  6 good       
#>  7 good       
#>  8 good       
#>  9 good       
#> 10 good       
#> # ... with 240 more rows


lib7 <- create_library(
  svm_rbf(mode = "classification"),
  grid_random(cost, rbf_sigma, size = 5)
)

sl7 <- super_learner(lib7, rec, meta, train)
#> maximum number of iterations reached 0.000241564 -0.0002408284maximum number of iterations reached 0.006938929 -0.006721008maximum number of iterations reached 8.101725e-05 -8.095252e-05maximum number of iterations reached 0.005553964 -0.005492931maximum number of iterations reached 9.975335e-05 -9.965653e-05maximum number of iterations reached 0.005516504 -0.005424263maximum number of iterations reached 0.0003008345 -0.0002997647maximum number of iterations reached 0.007930973 -0.007639496maximum number of iterations reached 1.001517e-05 -1.000712e-05maximum number of iterations reached 0.0070771 -0.006885782maximum number of iterations reached 0.0001561512 -0.0001559803maximum number of iterations reached 0.007414135 -0.007110222maximum number of iterations reached 0.0002859747 -0.0002854247maximum number of iterations reached 0.005070323 -0.004995554maximum number of iterations reached 5.594107e-05 -5.588102e-05maximum number of iterations reached 0.006705364 -0.006483062maximum number of iterations reached 0.0004279861 -0.0004260297maximum number of iterations reached 0.006561554 -0.006379719maximum number of iterations reached 0.0005021773 -0.0004991793maximum number of iterations reached 0.007009218 -0.006734963
#> maximum number of iterations reached 9.188721e-05 -9.180038e-05maximum number of iterations reached 0.003940392 -0.003870358
#> Error in .local(object, ...): test vector does not match model !
predict(sl7, test)
#> Error in predict(sl7, test): object 'sl7' not found



# why doesn't this work?

lib8 <- tibble(
  spec = null_model(mode = "classification"),
  model_id = 1
)

sl8 <- super_learner(lib8, rec, meta, train)
#> `x` must be a vector, not a `null_model/model_spec` object
predict(sl8, test)
#> Error in predict(sl8, test): object 'sl8' not found

# Error: Both weight decay and dropout should not be specified.
# I didn't specify both though?

lib9 <- create_library(
  mlp(mode = "classification", hidden_units = 5, epochs = 10),
  grid_random(dropout, penalty, activation, size = 5)
)

sl9 <- super_learner(lib9, rec, meta, train)
#> Error: Both weight decay and dropout should not be specified.
predict(sl9, test)
#> Error in predict(sl9, test): object 'sl9' not found

lib10 <- create_library(
  boost_tree(mode = "classification", mtry = 3, trees = 100,
             min_n = 3, learn_rate = 0.1, loss_reduction = 0.1,
             sample_size = 1),
  grid_random(tree_depth, size = 5)
)

sl10 <- super_learner(lib10, rec, meta, train)
#> Error in xgboost::xgb.DMatrix(data = newdata, missing = NA): 'data' has class 'character' and length 1725.
#>   'data' accepts either a numeric matrix or a single filename.
predict(sl10, test)
#> Error in predict(sl10, test): object 'sl10' not found

Created on 2019-04-09 by the reprex package (v0.2.1)

@alejandroschuler

commented Apr 10, 2019

The second-to-last error (from xgboost::xgb.DMatrix()) is something I just hit as well. I believe it happens because the data include character columns when as.matrix() gets called:

newdata <- as.matrix(newdata)

I solved the problem by doing two things:

  1. transform all predictors to numeric values
  2. exclude the outcome column from what gets passed to new_data

However, the same code is used at training time, so I'm not sure why the error doesn't occur until prediction.

As is noted elsewhere in the code, as.matrix() should probably be model.matrix(). This may introduce some issues with new factor levels at prediction time, but the convenience is well worth it. The alternative is to force preprocessing of the data into numeric matrices, which completely defeats the purpose of allowing a formula interface.
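To illustrate the difference, here is a minimal sketch using only base R (the toy data frame and its column names are made up): as.matrix() on a data frame with any non-numeric column silently produces a character matrix, which is exactly what xgb.DMatrix() then rejects, while model.matrix() expands factors into numeric indicator columns.

```r
# A toy data frame with one numeric and one factor column
df <- data.frame(
  age    = c(25, 40, 31),
  status = factor(c("good", "bad", "good"))
)

# as.matrix() falls back to a character matrix as soon as any
# column is non-numeric -- this is what trips up xgb.DMatrix()
m1 <- as.matrix(df)
is.character(m1)  # TRUE

# model.matrix() expands the factor into numeric dummy columns
# (no intercept, so the factor gets full indicator coding)
m2 <- model.matrix(~ . - 1, data = df)
is.numeric(m2)    # TRUE
colnames(m2)      # "age" "statusbad" "statusgood"
```

The new-factor-level caveat above is the usual trade-off: model.matrix() built from training data will error (or mis-code) when prediction data contains levels it has never seen.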

@alexpghayes


commented Apr 11, 2019

Interesting. There should be no character data in my case, though, since there's a step_dummy(all_nominal()) in the recipe.
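For reference, a minimal sketch of that expectation on a toy data frame (the data and column names here are made up; only the recipes package is assumed): after step_dummy(all_nominal(), -Status) and prep(), every predictor that reaches the engine should already be numeric.

```r
library(recipes)

# Toy data: one nominal outcome, one nominal predictor, one numeric predictor
toy <- data.frame(
  Status = factor(c("good", "bad", "good", "bad")),
  Job    = factor(c("fixed", "free", "fixed", "partime")),
  Income = c(100, 80, 120, 90)
)

# Same idiom as the recipe in the reprex: dummy out every nominal
# column except the outcome
rec_toy <- recipe(Status ~ ., data = toy) %>%
  step_dummy(all_nominal(), -Status) %>%
  prep(training = toy)

juiced <- juice(rec_toy)
sapply(juiced, is.numeric)
# Every predictor (Income plus the Job_* dummy columns) should be
# numeric; only the outcome Status remains a factor
```

If the baked data coming out of the real recipe looks like this, the character-matrix explanation shouldn't apply, which suggests the as.matrix() call is hitting something else at prediction time.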
