Investigate using vctrs for coercion #59
Comments
suppressPackageStartupMessages({
library(hardhat)
library(vctrs)
library(gapminder)
library(rsample)
library(dplyr)
})
#> Warning: package 'rsample' was built under R version 3.5.2
#> Warning: package 'dplyr' was built under R version 3.5.2
options(rlang__backtrace_on_error = "none")
split <- initial_split(gapminder)
gap_train <- training(split)
gap_test <- testing(split)
# 0 row slice (this is `info`!)
gap_train_0 <- vec_slice(gap_train, 0L)
# all of the df - df casts are due to df_col_cast()
# which is called from vec_cast()
# ///////////////////////
# continent is character not factor
gap_test2 <- mutate(gap_test, continent = as.character(continent))
# silently fixes that
common2 <- vec_cast(gap_test2, gap_train_0)
# recovered levels and type
levels(common2$continent)
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
# Takeaway) This is good
# ///////////////////////
# continent is character not factor, AND has too many implicit levels internally
gap_test2b <- mutate(gap_test, continent = as.character(continent))
gap_test2b$continent[1] <- "jupyter"
# noisily fixes that
common2b <- vec_cast(gap_test2b, gap_train_0)
#> Warning: Lossy cast from <character> to <factor<69262>>
#> Locations: 1
# recovered levels and type
levels(common2b$continent)
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
# Takeaway) This is good, but a better warning would be
# good. This is a TOO MANY LEVELS problem.
# ///////////////////////
# continent is character not factor, AND doesn't have enough levels internally
gap_test2c <- mutate(gap_test, continent = as.character(continent))
gap_test2c$continent <- gsub("Asia", NA_character_, gap_test2c$continent)
# silently fixes that
common2c <- vec_cast(gap_test2c, gap_train_0)
# recovered levels and type
levels(common2c$continent)
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
# Takeaway) This is good
# ///////////////////////
# continent is numeric
gap_test3 <- mutate(gap_test, continent = 1)
# not the best error message
common3 <- vec_cast(gap_test3, gap_train_0)
#> Error: Can't cast <double> to <factor<69262>>
# Takeaway) This is good, but want a better error message
# would really like this better message to be vctrs' job
# ///////////////////////
# too many columns
gap_test4 <- mutate(gap_test, x = 4)
# this actually throws a decent error message,
# but I think we would still rather have our own
vec_cast(gap_test4, gap_train)
#> Warning: Lossy cast from <tbl_df<
#>   country  : factor<bf6dc>
#>   continent: factor<69262>
#>   year     : integer
#>   lifeExp  : double
#>   pop      : integer
#>   gdpPercap: double
#>   x        : double
#> >> to <tbl_df<
#>   country  : factor<bf6dc>
#>   continent: factor<69262>
#>   year     : integer
#>   lifeExp  : double
#>   pop      : integer
#>   gdpPercap: double
#> >>
#> Locations:
#> Dropped variables: `x`
#> # A tibble: 426 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1987    40.8 13867957      852.
#>  3 Albania     Europe     1987    72    3075321     3739.
#>  4 Albania     Europe     1992    71.6  3326498     2497.
#>  5 Albania     Europe     2002    75.7  3508512     4604.
#>  6 Algeria     Africa     1957    45.7 10270856     3014.
#>  7 Algeria     Africa     1977    58.0 17152804     4910.
#>  8 Algeria     Africa     1987    65.8 23254956     5681.
#>  9 Algeria     Africa     2007    72.3 33333216     6223.
#> 10 Angola      Africa     1962    34    4826015     4269.
#> # … with 416 more rows
# Takeaway) This is good, but probably want a custom error message
# because vctrs "types" its warnings, I think I can do this
# ///////////////////////
# not enough columns
gap_test5 <- select(gap_test, -pop)
# silently adds column of NA values
vec_cast(gap_test5, gap_train_0)
#> # A tibble: 426 x 6
#>    country     continent  year lifeExp   pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl> <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8    NA      779.
#>  2 Afghanistan Asia       1987    40.8    NA      852.
#>  3 Albania     Europe     1987    72      NA     3739.
#>  4 Albania     Europe     1992    71.6    NA     2497.
#>  5 Albania     Europe     2002    75.7    NA     4604.
#>  6 Algeria     Africa     1957    45.7    NA     3014.
#>  7 Algeria     Africa     1977    58.0    NA     4910.
#>  8 Algeria     Africa     1987    65.8    NA     5681.
#>  9 Algeria     Africa     2007    72.3    NA     6223.
#> 10 Angola      Africa     1962    34      NA     4269.
#> # … with 416 more rows
# Takeaway) We should let shrink() be noisy here and error
# ///////////////////////
# Too many levels, but the factor's actual values never use that level
gap_test6 <- mutate(gap_test, continent = factor(continent, c(levels(continent), "extra_level")))
# silently fixes that
common6 <- vec_cast(gap_test6, gap_train_0)
levels(common6$continent)
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
# Takeaway) Silence is okay here as the values aren't actually affected
# ///////////////////////
# Too many levels (in test), AND the factor's actual values use that level
# (here we drop Africa from the train data to demonstrate)
gap_train_0_6b <- mutate(gap_train_0, continent = factor(continent, levels(continent)[-1]))
# noisily drops the level and coerces problematic
# positions to NA
common6b <- vec_cast(gap_test, gap_train_0_6b)
#> Warning: Lossy cast from <factor<69262>> to <factor<e5252>>
#> Locations: 6, 7, 8, 9, 10, 11, 12, 13, 33, 34, 35, 42, 43, 47, 48, 49, 5...
# no Africa level
levels(common6b$continent)
#> [1] "Americas" "Asia" "Europe" "Oceania"
# Takeaway) Noisy is good, but I think I want a different warning.
# Again, capture the typed warning. This could check if `x` is a factor
# then you'd know the lossy cast is specific to having too many factor levels
# ///////////////////////
# not enough levels (in test)
gap_test7 <- mutate(gap_test, continent = factor(continent, levels(continent)[-1]))
# silently fixes that
common7 <- vec_cast(gap_test7, gap_train_0)
levels(common7$continent)
#> [1] "Africa" "Americas" "Asia" "Europe" "Oceania"

Created on 2019-02-25 by the reprex package (v0.2.1.9000)
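The two column-count takeaways in the reprex (extra columns deserve a custom message, and missing columns should error rather than be filled with `NA`) could be handled by comparing column names before any cast. What follows is only a minimal base R sketch: `check_columns()` is a hypothetical helper, not a hardhat or vctrs function, and the choice to warn on extra columns but error on missing ones is an assumption taken from the takeaways.

```r
# Hypothetical helper (not part of hardhat or vctrs): compare the column
# names of `new_data` against a 0-row prototype before casting.
check_columns <- function(new_data, ptype) {
  missing_cols <- setdiff(names(ptype), names(new_data))
  extra_cols <- setdiff(names(new_data), names(ptype))

  # Missing columns are an error: a cast would silently fill them with NA.
  if (length(missing_cols) > 0) {
    stop(
      "`new_data` is missing required columns: ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }

  # Extra columns are only reported; dropping them loses no required data.
  if (length(extra_cols) > 0) {
    warning(
      "Dropping columns not seen at training time: ",
      paste(extra_cols, collapse = ", "),
      call. = FALSE
    )
  }

  # Keep only the prototype's columns, in the prototype's order.
  new_data[names(ptype)]
}

# Toy prototype and data, standing in for gap_train_0 and gap_test4.
ptype <- data.frame(year = integer(), pop = integer())
new_data <- data.frame(year = 1952L, pop = 8425333L, x = 4)
checked <- suppressWarnings(check_columns(new_data, ptype))
names(checked)
```

Running the name check first keeps the messages specific to the column problem, and the per-column type cast can still be left to `vec_cast()` afterwards.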
Closed by #71
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Coercion of `new_data` columns into their correct type (maybe characters can be coerced to factors automatically).
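As a rough illustration of the character-to-factor idea, the base R sketch below restores the training levels and reports any values that were never seen in training. `coerce_to_training_factor()` is a hypothetical name used only for this example, and warning (rather than erroring) on novel values is an assumption, not a decision recorded in this issue.

```r
# Hypothetical helper (not part of hardhat or vctrs): coerce a character
# (or factor) vector to a factor with the training levels, warning about
# values that were not present at training time.
coerce_to_training_factor <- function(x, training_levels) {
  x <- as.character(x)
  novel <- setdiff(unique(x[!is.na(x)]), training_levels)

  if (length(novel) > 0) {
    warning(
      "Novel levels converted to NA: ",
      paste(novel, collapse = ", "),
      call. = FALSE
    )
  }

  # Values outside `training_levels` become NA, as with any call to factor().
  factor(x, levels = training_levels)
}

training_levels <- c("Africa", "Americas", "Asia", "Europe", "Oceania")
continent <- coerce_to_training_factor(c("Asia", "Europe"), training_levels)
levels(continent)
```

Naming the offending values in the warning addresses the "too many levels" case more directly than a generic lossy-cast message would.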