predict(..., type = 'prob') with glmnet logistic_reg #206

Closed
skinnider opened this issue Aug 21, 2019 · 4 comments

@skinnider commented Aug 21, 2019

When trying to predict probabilities from a multinomial logistic regression with the 'glmnet' engine, I get the following error:

options(stringsAsFactors = F)
library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels 0.0.2 ──
#> ✔ broom     0.5.2       ✔ recipes   0.1.6  
#> ✔ dials     0.0.2       ✔ rsample   0.0.5  
#> ✔ infer     0.4.0.1     ✔ yardstick 0.0.3  
#> ✔ parsnip   0.0.3.1
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard()     masks purrr::discard()
#> ✖ magrittr::extract()   masks tidyr::extract()
#> ✖ dplyr::filter()       masks stats::filter()
#> ✖ recipes::fixed()      masks stringr::fixed()
#> ✖ dplyr::lag()          masks stats::lag()
#> ✖ magrittr::set_names() masks purrr::set_names()
#> ✖ yardstick::spec()     masks readr::spec()
#> ✖ recipes::step()       masks stats::step()
library(glmnet)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:tidyr':
#> 
#>     expand
#> Loading required package: foreach
#> 
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#> 
#>     accumulate, when
#> Loaded glmnet 2.0-18

# create a toy dataset
n_obs = 100
n_feats = 200
mat = matrix(NA, nrow = n_obs, ncol = n_feats,
             dimnames = list(paste0("observation_", seq_len(n_obs)),
                             paste0("feature_", seq_len(n_feats))))
mat[] = rnorm(length(c(mat)))

# create labels
labels = runif(n = n_obs, min = 0, max = 3) %>% floor() %>% factor()

# get optimized penalty with cv.glmnet
penalty = cv.glmnet(mat, labels, nfolds = 3, family = 'multinomial') %>%
  extract2("lambda.1se")

# create classifier
clf = logistic_reg(mixture = 1,
                   penalty = penalty,
                   mode = 'classification') %>%
  set_engine('glmnet', family = 'multinomial')

# fit models in cross-validation
x = as.data.frame(mat) %>% mutate(label = labels)
cv = vfold_cv(x, v = 3, strata = 'label')
folded = cv %>%
  mutate(
    recipes = splits %>%
      map(~ prepper(., recipe = recipe(.$data, label ~ .))),
    test_data = splits %>% map(analysis),
    fits = map2(
      recipes,
      test_data,
      ~ fit(
        clf,
        label ~ .,
        data = bake(object = .x, new_data = .y)
      )
    )
  )

# predict on the left-out data
retrieve_predictions = function(split, recipe, model) {
  test = bake(recipe, assessment(split))
  tbl = tibble(
    true = test$label,
    pred = predict(model, test)$.pred_class,
    prob = predict(model, test, type = 'prob')) %>%
    # convert prob from nested df to columns
    cbind(.$prob) %>%
    select(-prob)
  return(tbl)
}
predictions = folded %>%
  mutate(
    pred = list(
      splits,
      recipes,
      fits
    ) %>%
      pmap(retrieve_predictions)
  )
#> Error in attr(x, "names") <- as.character(value): 'names' attribute [3] must be the same length as the vector [2]

Created on 2019-08-21 by the reprex package (v0.3.0)

The issue is caused by this line:

predict(model, test, type = 'prob')

Running traceback() on that line alone gives the following:

10: `names<-.tbl_df`(`*tmp*`, value = value)
9: `names<-`(`*tmp*`, value = value)
8: `colnames<-`(`*tmp*`, value = object$lvl)
7: object$spec$method$pred$prob$post(res, object)
6: predict_classprob.model_fit(object, new_data = new_data, ...)
5: predict_classprob._multnet(object = object, new_data = new_data, 
       ...)
4: predict_classprob(object = object, new_data = new_data, ...)
3: predict.model_fit(object = object, new_data = new_data, type = type, 
       opts = opts)
2: predict._multnet(model, test, type = "prob")
1: predict(model, test, type = "prob")

(As a side note, I am having trouble interpreting the output returned by predict(model, test)$.pred_class. In most tidymodels predict functions, this is an object with length equal to that of test$label, but in this example, it's three times as long - why is this?)

@riccardopinosio commented Sep 23, 2019

I have the same problem with predicting probabilities from a multinomial logistic regression using glm. If I do, e.g.,

mtcars_2 <- mtcars %>% as_tibble() %>% mutate(carb = as.factor(carb))
model <- parsnip::logistic_reg() %>%
  parsnip::set_engine("glm") %>%
  parsnip::fit(carb ~ ., data = mtcars_2)
predict(model, mtcars_2, type = "prob")

Then I get:

Error in attr(x, "names") <- as.character(value) : 
  'names' attribute [6] must be the same length as the vector [2]

This seems consistent with what is reported above.
Following the flow of computation, the problem seems to reside in the method predict_classprob.model_fit. At line 19 the value of the variable res is the expected vector of 32 probabilities, but then lines 20-22 run:

  if (!is.null(object$spec$method$pred$prob$post)) {
    res <- object$spec$method$pred$prob$post(res, object)
  }

Now the value of object$spec$method$pred$prob$post is not NULL, and it is:

function (x, object) 
{
    x <- tibble(v1 = 1 - x, v2 = x)
    colnames(x) <- object$lvl
    x
}

I assume from the naming that the function above is a post-processing function applied to the vector of predictions res before it is returned. But in the case of predict_classprob.model_fit, shouldn't this post function just return the vector of probabilities? Here it's trying to build a data frame with two columns and assign it column names corresponding to the levels of the response, and the number of levels doesn't match the number of columns.
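
For illustration, here is a minimal sketch of that mismatch with made-up values (not the package's own code): assigning three outcome levels as column names to a two-column frame, as the hook does, fails in the same way.

probs <- data.frame(v1 = c(0.3, 0.8), v2 = c(0.7, 0.2))  # two columns, like the post hook builds
lvls <- c("0", "1", "2")                                  # stand-in for object$lvl with three classes
colnames(probs) <- lvls
# errors with roughly: 'names' attribute [3] must be the same length as the vector [2]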

@patr1ckm (Contributor) commented Oct 30, 2019

It's better to use multinom_reg here, which supports glmnet as an engine; glm is not currently a supported engine for multinom_reg.
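
For reference, a minimal sketch of that specification, reusing the penalty value computed above (with the glmnet engine, multinom_reg sets the multinomial family automatically):

clf <- multinom_reg(penalty = penalty, mixture = 1) %>%
  set_engine('glmnet')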

@topepo (Collaborator) commented Dec 1, 2019

@patr1ckm is right about the model type (and using multinom_reg will automatically set family) but there is a bug here related specifically to glmnet. It is falsely generating the error:

Error: predict() doesn't work with multiple penalties (i.e. lambdas). Please specify a single value using penalty = some_value or use multi_predict() to get multiple predictions per row of data.

In the meantime, the tune package will do a lot of this for you:

options(stringsAsFactors = F)
library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(tidymodels)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidymodels 0.0.3 ──
#> ✔ broom     0.5.2          ✔ recipes   0.1.7.9001
#> ✔ dials     0.0.3.9001     ✔ rsample   0.0.5     
#> ✔ infer     0.5.0          ✔ yardstick 0.0.4     
#> ✔ parsnip   0.0.4
#> ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard()     masks purrr::discard()
#> ✖ magrittr::extract()   masks tidyr::extract()
#> ✖ dplyr::filter()       masks stats::filter()
#> ✖ recipes::fixed()      masks stringr::fixed()
#> ✖ dplyr::lag()          masks stats::lag()
#> ✖ dials::margin()       masks ggplot2::margin()
#> ✖ dials::offset()       masks stats::offset()
#> ✖ magrittr::set_names() masks purrr::set_names()
#> ✖ yardstick::spec()     masks readr::spec()
#> ✖ recipes::step()       masks stats::step()
library(glmnet)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loaded glmnet 3.0

set.seed(363)
n_obs = 100
n_feats = 200
mat = matrix(NA, nrow = n_obs, ncol = n_feats,
             dimnames = list(paste0("observation_", seq_len(n_obs)),
                             paste0("feature_", seq_len(n_feats))))
mat[] = rnorm(length(c(mat)))

labels = runif(n = n_obs, min = 0, max = 3) %>% floor() %>% factor()

# get optimized penalty with cv.glmnet
penalty = cv.glmnet(mat, labels, nfolds = 3, family = 'multinomial') %>%
  extract2("lambda.1se")

clf = multinom_reg(mixture = 1,
                   penalty = penalty,
                   mode = 'classification') %>%
  set_engine('glmnet')


x = as.data.frame(mat) %>% mutate(label = labels)
cv = vfold_cv(x, v = 3, strata = 'label')

# devtools::install_github("tidymodels/tune")
library(tune)

res <- fit_resamples(label ~ ., model = clf, resamples = cv, 
                     control = control_resamples(save_pred = TRUE))

collect_predictions(res)
#> # A tibble: 100 x 4
#>    id    .pred_class  .row label
#>    <chr> <fct>       <int> <fct>
#>  1 Fold1 1               5 2    
#>  2 Fold1 1               6 1    
#>  3 Fold1 1               7 2    
#>  4 Fold1 1              13 0    
#>  5 Fold1 1              21 0    
#>  6 Fold1 1              23 1    
#>  7 Fold1 1              27 0    
#>  8 Fold1 1              28 2    
#>  9 Fold1 1              31 0    
#> 10 Fold1 1              37 0    
#> # … with 90 more rows

Created on 2019-12-01 by the reprex package (v0.3.0)

@topepo (Collaborator) commented Dec 2, 2019

Fixed now:

options(stringsAsFactors = F)
library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(tidymodels)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────── tidymodels 0.0.3 ──
#> ✔ broom     0.5.2          ✔ recipes   0.1.7.9001
#> ✔ dials     0.0.3.9002     ✔ rsample   0.0.5     
#> ✔ infer     0.5.0          ✔ yardstick 0.0.4     
#> ✔ parsnip   0.0.4.9000
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard()     masks purrr::discard()
#> ✖ magrittr::extract()   masks tidyr::extract()
#> ✖ dplyr::filter()       masks stats::filter()
#> ✖ recipes::fixed()      masks stringr::fixed()
#> ✖ dplyr::lag()          masks stats::lag()
#> ✖ dials::margin()       masks ggplot2::margin()
#> ✖ dials::offset()       masks stats::offset()
#> ✖ magrittr::set_names() masks purrr::set_names()
#> ✖ yardstick::spec()     masks readr::spec()
#> ✖ recipes::step()       masks stats::step()
library(glmnet)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loaded glmnet 3.0

set.seed(363)
n_obs = 100
n_feats = 200
mat = matrix(NA, nrow = n_obs, ncol = n_feats,
             dimnames = list(paste0("observation_", seq_len(n_obs)),
                             paste0("feature_", seq_len(n_feats))))
mat[] = rnorm(length(c(mat)))

labels = runif(n = n_obs, min = 0, max = 3) %>% floor() %>% factor()

# get optimized penalty with cv.glmnet
penalty = cv.glmnet(mat, labels, nfolds = 3, family = 'multinomial') %>%
  extract2("lambda.1se")

clf = multinom_reg(mixture = 1,
                   penalty = penalty,
                   mode = 'classification') %>%
  set_engine('glmnet')


x = as.data.frame(mat) %>% mutate(label = labels)
cv = vfold_cv(x, v = 3, strata = 'label')

folded = cv %>%
  mutate(
    recipes = splits %>%
      map(~ prepper(., recipe = recipe(.$data, label ~ .))),
    test_data = splits %>% map(analysis),
    fits = map2(
      recipes,
      test_data,
      ~ fit(
        clf,
        label ~ .,
        data = bake(object = .x, new_data = .y)
      )
    )
  )

# predict on the left-out data
retrieve_predictions = function(split, recipe, model) {
  test = bake(recipe, assessment(split))
  tbl = tibble(
    true = test$label,
    pred = predict(model, test)$.pred_class,
    prob = predict(model, test, type = 'prob')) %>%
    # convert prob from nested df to columns
    cbind(.$prob) %>%
    select(-prob)
  return(tbl)
}
predictions = folded %>%
  mutate(
    pred = list(
      splits,
      recipes,
      fits
    ) %>%
      pmap(retrieve_predictions)
  )

head(predictions)
#> # A tibble: 3 x 6
#>   splits         id    recipes     test_data           fits       pred          
#> * <named list>   <chr> <named lis> <named list>        <named li> <named list>  
#> 1 <split [66/34… Fold1 <recipe>    <df[,201] [66 × 20… <fit[+]>   <df[,5] [34 ×…
#> 2 <split [67/33… Fold2 <recipe>    <df[,201] [67 × 20… <fit[+]>   <df[,5] [33 ×…
#> 3 <split [67/33… Fold3 <recipe>    <df[,201] [67 × 20… <fit[+]>   <df[,5] [33 ×…

Created on 2019-12-01 by the reprex package (v0.3.0)

@topepo closed this Dec 2, 2019