Have steps return integers when appropriate #766

Closed
EmilHvitfeldt opened this issue Aug 11, 2021 · 2 comments · Fixed by #1039
Labels
feature (a feature request or enhancement), long term

Comments

@EmilHvitfeldt
Member

Some steps return values that are essentially integers as doubles. Converting them to integers where appropriate could help reduce object sizes.

library(recipes)
library(modeldata)

data("Chicago")

recipe(ridership ~ date, data = Chicago) %>%
  step_date(date, label = FALSE, keep_original_cols = FALSE) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 5,698 × 4
#>    ridership date_dow date_month date_year
#>        <dbl>    <dbl>      <dbl>     <dbl>
#>  1     15.7         2          1      2001
#>  2     15.8         3          1      2001
#>  3     15.9         4          1      2001
#>  4     15.9         5          1      2001
#>  5     15.4         6          1      2001
#>  6      2.42        7          1      2001
#>  7      1.47        1          1      2001
#>  8     15.5         2          1      2001
#>  9     15.9         3          1      2001
#> 10     15.9         4          1      2001
#> # … with 5,688 more rows

recipe(ridership ~ date, data = Chicago) %>%
  step_date(date, keep_original_cols = FALSE) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 5,698 × 19
#>    ridership date_year date_dow_Mon date_dow_Tue date_dow_Wed date_dow_Thu
#>        <dbl>     <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#>  1     15.7       2001            1            0            0            0
#>  2     15.8       2001            0            1            0            0
#>  3     15.9       2001            0            0            1            0
#>  4     15.9       2001            0            0            0            1
#>  5     15.4       2001            0            0            0            0
#>  6      2.42      2001            0            0            0            0
#>  7      1.47      2001            0            0            0            0
#>  8     15.5       2001            1            0            0            0
#>  9     15.9       2001            0            1            0            0
#> 10     15.9       2001            0            0            1            0
#> # … with 5,688 more rows, and 13 more variables: date_dow_Fri <dbl>,
#> #   date_dow_Sat <dbl>, date_month_Feb <dbl>, date_month_Mar <dbl>,
#> #   date_month_Apr <dbl>, date_month_May <dbl>, date_month_Jun <dbl>,
#> #   date_month_Jul <dbl>, date_month_Aug <dbl>, date_month_Sep <dbl>,
#> #   date_month_Oct <dbl>, date_month_Nov <dbl>, date_month_Dec <dbl>

Created on 2021-08-10 by the reprex package (v2.0.1)
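
To make the object-size point concrete, here is a minimal base R sketch that compares the same whole-number values stored as double and as integer. The vector is synthetic (day-of-week style values, sized to match the 5,698 rows above) and the variable names are made up for illustration; it does not use recipes itself.

```r
# Same whole-number values, stored two ways (base R only).
n <- 5698L  # row count taken from the tibble output above

# Synthetic day-of-week style values, 1 through 7, stored as double.
dow_double <- as.double(rep(1:7, length.out = n))

# The same values converted to integer storage.
dow_integer <- as.integer(dow_double)

# object.size() reports an estimate of the memory used, in bytes.
size_double  <- as.numeric(object.size(dow_double))
size_integer <- as.numeric(object.size(dow_integer))

# The conversion loses nothing because every value is a whole number.
stopifnot(all(dow_double == dow_integer))

print(c(double = size_double, integer = size_integer))
```

Doubles take 8 bytes per element and integers take 4, so the integer vector should come out at roughly half the size for columns like these.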

@juliasilge
Member

FWIW the current behavior mirrors what happens in model.matrix() (in fact, that is why they turn out double):

library(tidyverse)
data(penguins, package = "palmerpenguins")
model.matrix(bill_length_mm ~ species + sex + bill_depth_mm, 
             data = penguins) %>%
  as_tibble()
#> # A tibble: 333 × 5
#>    `(Intercept)` speciesChinstrap speciesGentoo sexmale bill_depth_mm
#>            <dbl>            <dbl>         <dbl>   <dbl>         <dbl>
#>  1             1                0             0       1          18.7
#>  2             1                0             0       0          17.4
#>  3             1                0             0       0          18  
#>  4             1                0             0       0          19.3
#>  5             1                0             0       1          20.6
#>  6             1                0             0       0          17.8
#>  7             1                0             0       1          19.6
#>  8             1                0             0       0          17.6
#>  9             1                0             0       1          21.2
#> 10             1                0             0       1          21.1
#> # … with 323 more rows

Created on 2021-08-16 by the reprex package (v2.0.1)

It might be worthwhile to think about how often a model that needs dummy variables can really use integers rather than real numbers.
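
As a small sketch of the same point using only base R, the storage type of a model.matrix() result can be checked directly and then converted by hand. The built-in iris data stands in for penguins here so the example has no package dependencies, and the manual conversion is only an illustration, not how recipes handles it.

```r
# Dummy variables from model.matrix() are stored as double.
mm <- model.matrix(~ Species, data = iris)
print(typeof(mm))

# Work on a copy and switch its storage mode to integer.
# `storage.mode<-` keeps the dim and dimnames attributes.
mm_int <- mm
storage.mode(mm_int) <- "integer"
print(typeof(mm_int))

# The 0/1 indicator values are unchanged by the conversion.
stopifnot(all(mm == mm_int))
```

This conversion is only lossless when every column is a whole-number indicator; a matrix that also contains continuous predictors (such as bill_depth_mm above) would be truncated.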

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 13, 2022