Predictions cause error when model was fit using x/y instead of formula #28

Closed
jaredlander opened this issue Aug 2, 2020 · 1 comment

@jaredlander

With the formula interface for C5.0(), everything works as expected. But when the model is fit with the x and y arguments, predict() fails. A Stack Overflow question from earlier this year came to the same conclusion.

Here is some code to illustrate.

library(dplyr)
library(C50)
library(rsample)
library(recipes)

data(credit_data,package='modeldata')
credit <- tibble::as_tibble(credit_data) %>% mutate(across(where(is.factor), as.character))

set.seed(28676)
data_split <- initial_split(credit, prop=.9, strata='Status')
train <- training(data_split)
test <- testing(data_split)

rec_C50 <- recipe(Status ~ ., data=train) %>% 
    themis::step_upsample(Status) %>% 
    step_other(all_nominal(), -Status, other='misc')
prep_c50 <- rec_C50 %>% prep()

train_data <- prep_c50 %>% juice()
test_data <- prep_c50 %>% bake(new_data=test)

# this works as expected
c5_formula <- C5.0(Status ~ ., data=prep_c50 %>% juice())
preds_formula <- predict(c5_formula, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))
head(preds_formula)

# this causes an error
c5_xy <- C5.0(x=prep_c50 %>% juice(all_predictors()), y=prep_c50 %>% juice(Status) %>% pull(Status))
preds_xy <- predict(c5_xy, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))
Error: 
*** line 1 of `undefined.cases': bad value of `c(3, 2, 2, 2, 5, 1, 2, 5, 2, 3, 1, 1, 2, 2, 1, 3, 2, 5, 5, 4, 2, 4, 2, 2, 2, 3, 2, 2, 2, 5, 2, 1, 2, 4, 3, 3, 2, 2, 2, 5, 5, 1, 2, 3, 5, 5, 4, 1, 4, 3, 2, 2, 3, 2, 1, 2, 5, 5, 2, 2, 2, 2, 2, 5, 2, 5, 2, 3, 2, 2, 2, 5, 2, 2, 1, 2, 2, 2, 2, 3, 2, 2, 3, 5, 2, 5, 5, 2, NA, 2, 3, 2, 2, 2, 5, 2, 3, 1, 3, 6, 2, 2, 3, 5, 5, 5, 2, 2, 2, 2, 4, 3, 3, 5, 2, 2, 3, 6, 2, 2, 4, 1, 3, 2, 2, 3, 3, 5, 2, 2, 1, 2, 2, 2, 3, 2, 1, 1, 4, 2, 2, 4, 4, 3, 2, 5, 2, 2, 2, 3, 2, 5, 2, 1, 2, 2, 5, 2, 2, 1, 2, 3, 5, 2, 5, 3,' for attribute `Home'

Error limit exceeded

Interestingly, when fitting using {workflows}, predictions work for an untuned boost_tree() model and for a tuned or untuned decision_tree() model. But this error occurs when trying to tune a boost_tree() model.

To make matters worse, the {C50} package website shows this same error in the documentation for the predict() function, as seen in the image below.

image

@topepo
Owner

topepo commented May 7, 2021

I think that this works if you install the current dev version of Cubist. It was an issue with how lapply() works with tibbles.
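For readers wondering how a whole column can end up as one `c(3, 2, 2, ...)` value, here is a minimal base R sketch of the kind of mismatch that can produce it. It is an illustration under assumptions, not the actual Cubist code path: the toy data frame and the `drop = FALSE` call are hypothetical stand-ins for a tibble, which keeps single-column `[` results as a one-column table instead of dropping to a vector.

```r
# Toy data (hypothetical); the column name mirrors the `Home` attribute in the error.
df <- data.frame(Home = c(3, 2, 2), Age = c(30, 41, 25))

# A plain data frame drops a single column to a vector by default.
col_vec <- df[, 1]

# drop = FALSE keeps a one-column data frame, which is how tibbles
# behave with `[` (used here as a stand-in so the sketch needs only base R).
col_tbl_like <- df[, 1, drop = FALSE]

# Converting the vector gives one string per row.
as.character(col_vec)

# Converting the one-column table deparses the entire column into a single string.
as.character(col_tbl_like)

# Extracting with [[ returns a plain vector for data frames and tibbles alike,
# so per-column conversion inside lapply() stays row-wise.
safe_cols <- lapply(seq_along(df), function(i) as.character(df[[i]]))
```

If updating Cubist is not an option, converting the predictors with `as.data.frame()` before calling `C5.0()` and `predict()` may sidestep the problem, though that has not been verified against this reprex.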

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(C50)
library(rsample)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(credit_data,package='modeldata')
credit <- tibble::as_tibble(credit_data) %>% mutate(across(where(is.factor), as.character))

set.seed(28676)
data_split <- initial_split(credit, prop=.9, strata='Status')
train <- training(data_split)
test <- testing(data_split)

rec_C50 <- recipe(Status ~ ., data=train) %>% 
    themis::step_upsample(Status) %>% 
    step_other(all_nominal(), -Status, other='misc')
#> Warning: replacing previous import 'data.table:::=' by 'ggplot2:::=' when
#> loading 'mlr'
#> Registered S3 methods overwritten by 'themis':
#>   method                  from   
#>   bake.step_downsample    recipes
#>   bake.step_upsample      recipes
#>   prep.step_downsample    recipes
#>   prep.step_upsample      recipes
#>   tidy.step_downsample    recipes
#>   tidy.step_upsample      recipes
#>   tunable.step_downsample recipes
#>   tunable.step_upsample   recipes
prep_c50 <- rec_C50 %>% prep()

train_data <- prep_c50 %>% juice()
test_data <- prep_c50 %>% bake(new_data=test)

# this works as expected
c5_formula <- C5.0(Status ~ ., data=prep_c50 %>% juice())
preds_formula <- predict(c5_formula, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))
head(preds_formula)
#> [1] good good good good bad  good
#> Levels: bad good

# this previously caused an error; it runs cleanly with the dev version of Cubist
c5_xy <- C5.0(x=prep_c50 %>% juice(all_predictors()), y=prep_c50 %>% juice(Status) %>% pull(Status))
preds_xy <- predict(c5_xy, newdata=prep_c50 %>% bake(new_data=test, all_predictors()))

Created on 2021-05-06 by the reprex package (v1.0.0.9000)

topepo closed this as completed May 7, 2021