cut argument right= FALSE #51
The reason for using cut(..., right = FALSE) by default in the woebin function is business sense. Take age, for example: people in their 20s and 30s are usually split into different groups. I have added this feature in response to your issue. Please update the package to the latest version on GitHub. You can make the bins closed on the right via options(bin_close_right = TRUE); see the last example of the woebin function.
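To illustrate the boundary behaviour described above, here is a minimal sketch using only base::cut on made-up ages (the values and the single break at 30 are assumptions chosen for illustration, not taken from the package). With right = FALSE a 30-year-old falls into the upper group, while with right = TRUE they stay in the lower one.

```r
# Toy ages around the boundary between the 20s and the 30s (illustrative values)
ages <- c(29, 30, 31)
breaks <- c(-Inf, 30, Inf)

# Left-closed bins: [-Inf,30) and [30,Inf), so age 30 joins the 30s group
codes_left <- as.integer(cut(ages, breaks = breaks, right = FALSE))
print(codes_left)   # 1 2 2

# Right-closed bins: (-Inf,30] and (30,Inf], so age 30 stays with the 20s group
codes_right <- as.integer(cut(ages, breaks = breaks, right = TRUE))
print(codes_right)  # 1 1 2
```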
Thanks for your response. I understand the argument for the use About the feature. I tested it but I have some problems:
Tell me if I can help. Thanks.
# devtools::install_github("shichenxie/scorecard")
.Platform$OS.type
#> [1] "windows"
library(scorecard)
packageVersion("scorecard")
#> [1] '0.3.2.999'
options(bin_close_right = TRUE)
data("germancredit")
# example 1
bins <- woebin(germancredit, y = "creditability", x = c("age.in.years", "duration.in.month"))
#> [INFO] creating woe binning ...
bins$age.in.years
#> variable bin count count_distr neg pos posprob woe
#> 1: age.in.years missing 16 0.016 10 6 0.3750000 0.3364722
#> 2: age.in.years (-Inf,24] 174 0.174 100 74 0.4252874 0.5461928
#> 3: age.in.years (24,26] 101 0.101 74 27 0.2673267 -0.1609304
#> 4: age.in.years (26,33] 257 0.257 172 85 0.3307393 0.1424546
#> 5: age.in.years (33,35] 79 0.079 67 12 0.1518987 -0.8724881
#> 6: age.in.years (35, Inf] 373 0.373 277 96 0.2573727 -0.2123715
#> bin_iv total_iv breaks is_special_values
#> 1: 0.001922698 0.1312002 missing FALSE
#> 2: 0.056700011 0.1312002 24 FALSE
#> 3: 0.002528906 0.1312002 26 FALSE
#> 4: 0.005359008 0.1312002 33 FALSE
#> 5: 0.048610052 0.1312002 35 FALSE
#> 6: 0.016079553 0.1312002 Inf FALSE
setNames(bins$age.in.years$count, bins$age.in.years$bin)
#> missing (-Inf,24] (24,26] (26,33] (33,35] (35, Inf]
#> 16 174 101 257 79 373
table(
cut(
germancredit$age.in.years,
breaks = c(-Inf, na.omit(as.numeric(bins$age.in.years$breaks)))
)
)
#> Warning in na.omit(as.numeric(bins$age.in.years$breaks)): NAs introduced by
#> coercion
#>
#> (-Inf,24] (24,26] (26,33] (33,35] (35, Inf]
#> 149 91 276 72 412
# example 2
bins <- woebin(germancredit, y = "creditability")
#> [INFO] creating woe binning ...
bins$age.in.years$bin
#> [1] "[-Inf,26)" "[26,28)" "[28,35)" "[35,37)" "[37, Inf)"
Created on 2021-05-23 by the reprex package (v2.0.0)
It should be fixed. Please upgrade to the latest version and try again.
library(scorecard)
data("germancredit")
options(bin_close_right = TRUE)
binsR <- woebin(germancredit, y = "creditability", x = c("age.in.years"))
#> [INFO] creating woe binning ...
binsR
#> $age.in.years
#> variable bin count count_distr neg pos posprob woe
#> 1: age.in.years (-Inf,25] 190 0.190 110 80 0.4210526 0.5288441
#> 2: age.in.years (25,27] 101 0.101 74 27 0.2673267 -0.1609304
#> 3: age.in.years (27,34] 257 0.257 172 85 0.3307393 0.1424546
#> 4: age.in.years (34,36] 79 0.079 67 12 0.1518987 -0.8724881
#> 5: age.in.years (36, Inf] 373 0.373 277 96 0.2573727 -0.2123715
#> bin_iv total_iv breaks is_special_values
#> 1: 0.057921024 0.1304985 25 FALSE
#> 2: 0.002528906 0.1304985 27 FALSE
#> 3: 0.005359008 0.1304985 34 FALSE
#> 4: 0.048610052 0.1304985 36 FALSE
#> 5: 0.016079553 0.1304985 Inf FALSE
options(bin_close_right = FALSE)
binsL <- woebin(germancredit, y = "creditability", x = c("age.in.years"))
#> [INFO] creating woe binning ...
binsL
#> $age.in.years
#> variable bin count count_distr neg pos posprob woe
#> 1: age.in.years [-Inf,26) 190 0.190 110 80 0.4210526 0.5288441
#> 2: age.in.years [26,28) 101 0.101 74 27 0.2673267 -0.1609304
#> 3: age.in.years [28,35) 257 0.257 172 85 0.3307393 0.1424546
#> 4: age.in.years [35,37) 79 0.079 67 12 0.1518987 -0.8724881
#> 5: age.in.years [37, Inf) 373 0.373 277 96 0.2573727 -0.2123715
#> bin_iv total_iv breaks is_special_values
#> 1: 0.057921024 0.1304985 26 FALSE
#> 2: 0.002528906 0.1304985 28 FALSE
#> 3: 0.005359008 0.1304985 35 FALSE
#> 4: 0.048610052 0.1304985 37 FALSE
#> 5: 0.016079553 0.1304985 Inf FALSE
Created on 2021-05-24 by the reprex package (v1.0.0)
Thanks so much @ShichenXie, the example works perfectly, but when I use
Lastly, what do you think about prefixing all the options? For example, the dplyr and data.table packages use this pattern/nomenclature.
options(scorecard.bin_close_right = TRUE)
# like:
options(datatable.auto.index = TRUE)
options(dplyr.show_progress = FALSE)
Thanks again!
# devtools::install_github("shichenxie/scorecard")
library(scorecard)
packageVersion("scorecard")
#> [1] '0.3.2.999'
options(bin_close_right = TRUE)
data("germancredit")
# example 1: only age in years and other variable
bins1 <- woebin(germancredit, y = "creditability", x = c("age.in.years", "duration.in.month"))
#> [INFO] creating woe binning ...
bins1$age.in.years[, c(1, 2, 3)]
#> variable bin count
#> 1: age.in.years (-Inf,25] 190
#> 2: age.in.years (25,27] 101
#> 3: age.in.years (27,34] 257
#> 4: age.in.years (34,36] 79
#> 5: age.in.years (36, Inf] 373
# example 2: x = NULL
# the bins come out closed on the left
bins2 <- woebin(germancredit, y = "creditability")
#> [INFO] creating woe binning ...
bins2$age.in.years[, c(1, 2, 3)]
#> variable bin count
#> 1: age.in.years [-Inf,26) 190
#> 2: age.in.years [26,28) 101
#> 3: age.in.years [28,35) 257
#> 4: age.in.years [35,37) 79
#> 5: age.in.years [37, Inf) 373
Created on 2021-05-24 by the reprex package (v2.0.0)
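The option-prefix suggestion in the comment above could look roughly like the following sketch. The option name scorecard.bin_close_right is only the name proposed in this thread, and get_bin_close_right is a hypothetical helper; neither is confirmed to be part of the package. Only base R's options() and getOption() are used.

```r
# Hypothetical helper: read a package-prefixed option, falling back to
# FALSE (left-closed bins) when the option has not been set.
get_bin_close_right <- function() {
  isTRUE(getOption("scorecard.bin_close_right", default = FALSE))
}

# Make sure the option starts out unset, remembering any previous value
old <- options(scorecard.bin_close_right = NULL)
unset_value <- get_bin_close_right()   # FALSE: option not set

options(scorecard.bin_close_right = TRUE)
set_value <- get_bin_close_right()     # TRUE: option set by the user

# Restore whatever was there before
options(old)
```

The prefix keeps the option from colliding with same-named options set by other packages in the global options list.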
Good point. I have changed the argument to
This issue should be resolved. I am closing it now.
Hi @ShichenXie
First, thanks so much for your work. This package helps me a lot!
What is the reason for using cut(..., right = FALSE) in woebin, given that the default value in base::cut is TRUE? For example:
scorecard/R/woebin.R
Line 344 in 65b9ca1
Why do I ask? Because I'm trying to create an interface, woebin_ctree, that produces scorecard::woebin output using the breaks given by the partykit::ctree function. This tree algorithm makes its splits using <=, so I can't replicate the counts in each node. For example:
The tree:
But (obviously) when I use woebin with those breaks I don't get the same counts.
I tried to make similar breaks by adding a small value, 0.000001, but this is not very elegant 😅:
Do you think it is possible to add an optional argument, woebin(..., right = FALSE), to modify this behaviour if necessary?
Thanks in advance,
Kind regards,
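The mismatch described in this post can be reproduced with base R alone. The sketch below uses made-up values and a made-up split point of 24 (both are assumptions, not the germancredit data or an actual ctree result). It compares the count from a <= split with the counts from cut() under both closure conventions, plus the small-offset workaround mentioned above.

```r
# Toy data with two values sitting exactly on the split point (illustrative)
x <- c(18, 24, 24, 25, 26, 30)
split_point <- 24

# A tree that splits on `x <= split_point` sends 3 values to the left node
n_left_tree <- sum(x <= split_point)

# Right-closed bins (-Inf,24], (24,Inf] reproduce the tree's node counts
counts_right <- as.integer(table(cut(x, c(-Inf, split_point, Inf), right = TRUE)))
print(counts_right)    # 3 3

# Left-closed bins [-Inf,24), [24,Inf) move the boundary values to the right bin
counts_left <- as.integer(table(cut(x, c(-Inf, split_point, Inf), right = FALSE)))
print(counts_left)     # 1 5

# Workaround: nudge the break upward so left-closed bins capture x == 24
eps <- 1e-6
counts_shifted <- as.integer(table(cut(x, c(-Inf, split_point + eps, Inf), right = FALSE)))
print(counts_shifted)  # 3 3
```

The offset only works while no observation lies strictly between the split point and the shifted break, which is why a right-closed option is the cleaner fix.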