cut argument right= FALSE #51

Closed
jbkunst opened this issue May 22, 2021 · 6 comments

Comments

@jbkunst
Contributor

jbkunst commented May 22, 2021

Hi @ShichenXie

First, thank you so much for your work. This package helps me a lot!

What is the reason for using cut(..., right = FALSE) in woebin, given that the default in base::cut is right = TRUE?

For example

, bstbin := cut(brkp, c(-Inf, bestbreaks, Inf), right = FALSE, dig.lab = 10, ordered_result = FALSE)

Why do I ask? I'm trying to create an interface, woebin_ctree, that produces scorecard::woebin output using the breaks given by the partykit::ctree function. This tree algorithm makes its splits using <=, so I can't replicate the counts in each node. For example:

The tree:
image

> ctree_breaks
[1] 11 15 33

But (obviously) when I use woebin with those breaks I don't get the same counts.

image

I tried to make similar breaks by adding a small value, 0.000001, but this is not very elegant 😅:
image
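A minimal base-R sketch of the mismatch and of the workaround described above. The vector `x` is made up for illustration (only the breaks 11, 15, 33 come from this thread), and `cut` with `right = TRUE` is used here as a stand-in for ctree's `<=` rule rather than calling partykit itself:

```r
# Toy data (hypothetical) with several values sitting exactly on a break
x <- c(5, 11, 11, 12, 15, 15, 20, 33, 34)
ctree_breaks <- c(11, 15, 33)

# Splits of the form x <= break correspond to intervals closed on the right
n_right <- as.integer(table(cut(x, c(-Inf, ctree_breaks, Inf), right = TRUE)))

# Intervals closed on the left, as woebin builds them by default
n_left <- as.integer(table(cut(x, c(-Inf, ctree_breaks, Inf), right = FALSE)))

# Workaround: nudge each break up by a small amount and keep right = FALSE.
# This only works when eps is smaller than the gap between distinct values.
eps <- 1e-6
n_shift <- as.integer(table(cut(x, c(-Inf, ctree_breaks + eps, Inf), right = FALSE)))

n_right  # 3 3 2 1
n_left   # 1 3 3 2
n_shift  # 3 3 2 1
```

The shifted breaks reproduce the right-closed counts, but the result depends on choosing `eps` correctly for the data, which is why an explicit switch is cleaner.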

Do you think it is possible to add an optional argument, woebin(..., right = FALSE), to modify this behaviour when necessary?

Thanks in advance,
Kind regards,

@ShichenXie
Owner

ShichenXie commented May 23, 2021

The reason woebin uses cut(..., right = FALSE) by default is business sense. Take age, for example: people in their 20s and 30s are usually split into different groups.
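To illustrate the age argument, here is a small base-R sketch. The ages and the decade breaks are invented for the example and do not come from the package:

```r
ages <- c(19, 20, 29, 30, 39)   # hypothetical ages on and around decade boundaries
decades <- c(-Inf, 20, 30, Inf)

# Closed on the left: 20 and 29 share a bin, 30 starts the next one
by_left <- as.integer(cut(ages, decades, right = FALSE))

# Closed on the right: 20 is grouped with 19, and 30 with 29
by_right <- as.integer(cut(ages, decades, right = TRUE))

by_left   # 1 2 2 3 3
by_right  # 1 1 2 2 3
```

With left-closed intervals each bin lines up with a decade ("20s", "30s"), which matches the grouping described above.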

I have added this feature in response to your issue. Please update the package to the latest version on GitHub. You can make the bins closed on the right via options(bin_close_right = TRUE); see the last example of the woebin function.

@jbkunst
Contributor Author

jbkunst commented May 23, 2021

Thanks for your response. I understand the argument for using right = FALSE.

About the feature: I tested it, but I ran into some problems:

  1. In example 1, options(bin_close_right = TRUE) with woebin(germancredit, y = 'creditability', x = 'age.in.years') creates a missing category, and the counts are not the same as those from cut.

  2. Using x = NULL in woebin produces bins with right = FALSE even after setting options(bin_close_right = TRUE).

Tell me if I can help. Thanks.

# devtools::install_github("shichenxie/scorecard")

.Platform$OS.type
#> [1] "windows"

library(scorecard)


packageVersion("scorecard")
#> [1] '0.3.2.999'

options(bin_close_right = TRUE)

data("germancredit")

# example 1
bins <- woebin(germancredit, y = "creditability", x = c("age.in.years", "duration.in.month"))
#> [INFO] creating woe binning ...
bins$age.in.years
#>        variable       bin count count_distr neg pos   posprob        woe
#> 1: age.in.years   missing    16       0.016  10   6 0.3750000  0.3364722
#> 2: age.in.years (-Inf,24]   174       0.174 100  74 0.4252874  0.5461928
#> 3: age.in.years   (24,26]   101       0.101  74  27 0.2673267 -0.1609304
#> 4: age.in.years   (26,33]   257       0.257 172  85 0.3307393  0.1424546
#> 5: age.in.years   (33,35]    79       0.079  67  12 0.1518987 -0.8724881
#> 6: age.in.years (35, Inf]   373       0.373 277  96 0.2573727 -0.2123715
#>         bin_iv  total_iv  breaks is_special_values
#> 1: 0.001922698 0.1312002 missing             FALSE
#> 2: 0.056700011 0.1312002      24             FALSE
#> 3: 0.002528906 0.1312002      26             FALSE
#> 4: 0.005359008 0.1312002      33             FALSE
#> 5: 0.048610052 0.1312002      35             FALSE
#> 6: 0.016079553 0.1312002     Inf             FALSE

setNames(bins$age.in.years$count, bins$age.in.years$bin)
#>   missing (-Inf,24]   (24,26]   (26,33]   (33,35] (35, Inf] 
#>        16       174       101       257        79       373

table(
  cut(
    germancredit$age.in.years,
    breaks = c(-Inf, na.omit(as.numeric(bins$age.in.years$breaks)))
  )
)
#> Warning in na.omit(as.numeric(bins$age.in.years$breaks)): NAs introduced by
#> coercion
#> 
#> (-Inf,24]   (24,26]   (26,33]   (33,35] (35, Inf] 
#>       149        91       276        72       412

# example 2
bins <- woebin(germancredit, y = "creditability")
#> [INFO] creating woe binning ...

bins$age.in.years$bin
#> [1] "[-Inf,26)" "[26,28)"   "[28,35)"   "[35,37)"   "[37, Inf)"

Created on 2021-05-23 by the reprex package (v2.0.0)

@ShichenXie
Owner

It should be fixed. Please upgrade to the latest version and try again.

library(scorecard)
data("germancredit")

options(bin_close_right = TRUE)
binsR <- woebin(germancredit, y = "creditability", x = c("age.in.years"))
#> [INFO] creating woe binning ...
binsR
#> $age.in.years
#>        variable       bin count count_distr neg pos   posprob        woe
#> 1: age.in.years (-Inf,25]   190       0.190 110  80 0.4210526  0.5288441
#> 2: age.in.years   (25,27]   101       0.101  74  27 0.2673267 -0.1609304
#> 3: age.in.years   (27,34]   257       0.257 172  85 0.3307393  0.1424546
#> 4: age.in.years   (34,36]    79       0.079  67  12 0.1518987 -0.8724881
#> 5: age.in.years (36, Inf]   373       0.373 277  96 0.2573727 -0.2123715
#>         bin_iv  total_iv breaks is_special_values
#> 1: 0.057921024 0.1304985     25             FALSE
#> 2: 0.002528906 0.1304985     27             FALSE
#> 3: 0.005359008 0.1304985     34             FALSE
#> 4: 0.048610052 0.1304985     36             FALSE
#> 5: 0.016079553 0.1304985    Inf             FALSE

options(bin_close_right = FALSE)
binsL <- woebin(germancredit, y = "creditability", x = c("age.in.years"))
#> [INFO] creating woe binning ...
binsL
#> $age.in.years
#>        variable       bin count count_distr neg pos   posprob        woe
#> 1: age.in.years [-Inf,26)   190       0.190 110  80 0.4210526  0.5288441
#> 2: age.in.years   [26,28)   101       0.101  74  27 0.2673267 -0.1609304
#> 3: age.in.years   [28,35)   257       0.257 172  85 0.3307393  0.1424546
#> 4: age.in.years   [35,37)    79       0.079  67  12 0.1518987 -0.8724881
#> 5: age.in.years [37, Inf)   373       0.373 277  96 0.2573727 -0.2123715
#>         bin_iv  total_iv breaks is_special_values
#> 1: 0.057921024 0.1304985     26             FALSE
#> 2: 0.002528906 0.1304985     28             FALSE
#> 3: 0.005359008 0.1304985     35             FALSE
#> 4: 0.048610052 0.1304985     37             FALSE
#> 5: 0.016079553 0.1304985    Inf             FALSE
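The two tables above report identical counts under different labels. A hedged explanation, assuming the ages are whole numbers: for integer-valued data, (25,27] and [26,28) contain exactly the same values. A base-R sketch with made-up ages and the breaks shown in the two outputs:

```r
ages <- c(19, 25, 26, 27, 28, 34, 35, 36, 37, 60)  # hypothetical whole-number ages

# Breaks as labelled with bin_close_right = TRUE
bin_r <- as.integer(cut(ages, c(-Inf, 25, 27, 34, 36, Inf), right = TRUE))

# Breaks as labelled with bin_close_right = FALSE
bin_l <- as.integer(cut(ages, c(-Inf, 26, 28, 35, 37, Inf), right = FALSE))

# Every age falls into the same bin index under both schemes
same_bins <- all(bin_r == bin_l)
same_bins  # TRUE
```

This equivalence would not hold for a variable with fractional values, where the two sets of breaks can assign observations differently.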

Created on 2021-05-24 by the reprex package (v1.0.0)

@jbkunst
Contributor Author

jbkunst commented May 24, 2021

Thanks so much @ShichenXie, the example works perfectly, but when I use woebin(..., x = NULL) the bins are still closed on the left. Please see the reprex.

Lastly, what do you think about prefixing all the options with the package name? For example, the dplyr and data.table packages use this pattern/nomenclature.

options(scorecard.bin_close_right = TRUE)

# like:
options(datatable.auto.index = TRUE)
options(dplyr.show_progress = FALSE)

Thanks again!

# devtools::install_github("shichenxie/scorecard")
library(scorecard)

packageVersion("scorecard")
#> [1] '0.3.2.999'

options(bin_close_right = TRUE)

data("germancredit")

# example 1: age in years plus one other variable
bins1 <- woebin(germancredit, y = "creditability", x = c("age.in.years", "duration.in.month"))
#> [INFO] creating woe binning ...
bins1$age.in.years[, c(1, 2, 3)]
#>        variable       bin count
#> 1: age.in.years (-Inf,25]   190
#> 2: age.in.years   (25,27]   101
#> 3: age.in.years   (27,34]   257
#> 4: age.in.years   (34,36]    79
#> 5: age.in.years (36, Inf]   373

# example 2: x = NULL
# the bins are still closed on the left
bins2 <- woebin(germancredit, y = "creditability")
#> [INFO] creating woe binning ...
bins2$age.in.years[, c(1, 2, 3)]
#>        variable       bin count
#> 1: age.in.years [-Inf,26)   190
#> 2: age.in.years   [26,28)   101
#> 3: age.in.years   [28,35)   257
#> 4: age.in.years   [35,37)    79
#> 5: age.in.years [37, Inf)   373

Created on 2021-05-24 by the reprex package (v2.0.0)

@ShichenXie
Owner

Good point. I have changed the argument to options(scorecard.bin_close_right=TRUE).
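For readers unfamiliar with the pattern, a namespaced option is typically read with `getOption` and a fallback default. The sketch below is only an illustration: `bin_close_right()` is a hypothetical helper, not a function from scorecard, and the default of FALSE is assumed from the left-closed behaviour discussed in this thread.

```r
# Hypothetical helper (not part of scorecard): read the namespaced option,
# falling back to FALSE (left-closed bins) when it has not been set.
bin_close_right <- function() {
  isTRUE(getOption("scorecard.bin_close_right", default = FALSE))
}

# Set the option temporarily, keeping the previous value so it can be restored
old <- options(scorecard.bin_close_right = TRUE)
during <- bin_close_right()

options(old)                 # restore whatever was there before
after <- bin_close_right()

during  # TRUE
after   # FALSE in a session where the option was not set beforehand
```

The package-name prefix keeps the option from colliding with options set by other packages, which is the point of the suggestion above.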

@ShichenXie
Owner

This issue should be solved. I am closing it now.
