Open
Labels: feature (a feature request or enhancement)
Description
Feature
Thanks for your work on this package.
It would be great to have a recipe step that bins numeric and factor features using weight of evidence (WoE) against a binary outcome. Functions such as `woebin()` from {scorecard} and `woe.binning()` from {woeBinning} already do this. The recipe step would do two things:
- Bin the numeric or factor features (possibly lumping some factor levels together)
- Replace the bin values / factor levels with their WoE values (as `step_woe()` currently does)
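For comparison, here is a minimal sketch of the second part on its own, using the existing `step_woe()`. It assumes {recipes}, {embed}, {dplyr} and {scorecard} are installed and that `creditability` has exactly two levels. It only encodes the factor predictors and does no binning, which is the gap this request is about.

```r
library(recipes)
library(embed)
library(dplyr)     # for vars()
library(scorecard) # only for the germancredit data

data("germancredit")

# step_woe() is applied to the nominal predictors only;
# numeric predictors are not selected and so are left unchanged.
rec <- recipe(creditability ~ ., data = germancredit) %>%
  step_woe(all_nominal_predictors(), outcome = vars(creditability))

woe_data <- bake(prep(rec), new_data = NULL)
```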
Example with `woebin()` from {scorecard}:
library(scorecard)
library(rsample)
data("germancredit")
data_split <- initial_split(germancredit, strata = creditability)
germancredit_train <- training(data_split)
germancredit_test <- testing(data_split)
bins <- woebin(germancredit_train, "creditability")
#> ℹ Creating woe binning ...
#> ✔ Binning on 750 rows and 21 columns in 00:00:02
bins$duration.in.month
#> variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
#> <char> <char> <int> <num> <int> <int> <num> <num> <num> <num> <char> <lgcl>
#> 1: duration.in.month [-Inf,8) 68 0.09066667 60 8 0.1176471 -1.1676052 0.091925740 0.2587426 8 FALSE
#> 2: duration.in.month [8,14) 205 0.27333333 151 54 0.2634146 -0.1809979 0.008618949 0.2587426 14 FALSE
#> 3: duration.in.month [14,16) 53 0.07066667 45 8 0.1509434 -0.8799231 0.044135825 0.2587426 16 FALSE
#> 4: duration.in.month [16,34) 291 0.38800000 197 94 0.3230241 0.1073889 0.004568290 0.2587426 34 FALSE
#> 5: duration.in.month [34,44) 76 0.10133333 46 30 0.3947368 0.4198538 0.019193319 0.2587426 44 FALSE
#> 6: duration.in.month [44, Inf) 57 0.07600000 26 31 0.5438596 1.0231885 0.090300448 0.2587426 Inf FALSE
bins$purpose
#> variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
#> <char> <char> <int> <num> <int> <int> <num> <num> <num> <num> <char> <lgcl>
#> 1: purpose retraining%,%car (used) 83 0.11066667 70 13 0.1566265 -0.8362480 0.06318318 0.1960758 retraining%,%car (used) FALSE
#> 2: purpose radio/television%,%repairs 220 0.29333333 172 48 0.2181818 -0.4289956 0.04902807 0.1960758 radio/television%,%repairs FALSE
#> 3: purpose furniture/equipment%,%business%,%domestic appliances%,%car (new) 395 0.52666667 257 138 0.3493671 0.2254755 0.02791601 0.1960758 furniture/equipment%,%business%,%domestic appliances%,%car (new) FALSE
#> 4: purpose education%,%others 52 0.06933333 26 26 0.5000000 0.8472979 0.05594856 0.1960758 education%,%others FALSE
germancredit_test_woe <- woebin_ply(germancredit_test, bins=bins)
#> ℹ Converting into woe values ...
#> ✔ Woe transformating on 250 rows and 20 columns in 00:00:00
head(germancredit_test_woe)
#> creditability status.of.existing.checking.account_woe duration.in.month_woe credit.history_woe purpose_woe credit.amount_woe savings.account.and.bonds_woe present.employment.since_woe
#> <fctr> <num> <num> <num> <num> <num> <num> <num>
#> 1: good 0.7901394 -0.83910109 -0.73005174 -0.5518446 0.01369884 -0.7833423 -0.34989526
#> 2: bad 0.7901394 0.06578153 -0.05715841 0.3677248 0.31508105 0.2344150 0.06559728
#> 3: good 0.2814901 0.80349524 0.10090617 -0.5518446 0.82320031 0.2344150 0.06559728
#> 4: good -1.2599785 -0.30766736 0.10090617 -0.5518446 -0.33683660 -0.7833423 -0.34989526
#> 5: bad 0.2814901 0.06578153 -0.73005174 0.3677248 0.31508105 0.2344150 0.21868920
#> 6: bad 0.7901394 0.06578153 -0.73005174 0.3677248 0.01369884 0.2344150 -0.34989526
#> installment.rate.in.percentage.of.disposable.income_woe personal.status.and.sex_woe other.debtors.or.guarantors_woe present.residence.since_woe property_woe age.in.years_woe other.installment.plans_woe
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0.095061763 -0.09790421 0.0287165 -0.01712104 -0.56976816 -0.1941560 -0.1688382
#> 2: -0.004073325 -0.09790421 0.0287165 -0.01712104 0.49062292 -0.1941560 -0.1688382
#> 3: -0.077291674 -0.09790421 0.0287165 0.14090545 0.09425254 -0.9650809 -0.1688382
#> 4: -0.077291674 -0.09790421 0.0287165 -0.01712104 -0.56976816 -0.1941560 -0.1688382
#> 5: 0.095061763 -0.09790421 0.0287165 0.14090545 0.09425254 -0.1044233 -0.1688382
#> 6: 0.095061763 -0.09790421 0.0287165 -0.01712104 0.09425254 -0.1941560 -0.1688382
#> housing_woe number.of.existing.credits.at.this.bank_woe job_woe number.of.people.being.liable.to.provide.maintenance.for_woe telephone_woe foreign.worker_woe
#> <num> <num> <num> <num> <num> <num>
#> 1: -0.2121896 -0.1009105 -0.02034658 0.01369884 -0.14732471 0
#> 2: 0.4616354 -0.1009105 -0.02034658 -0.06899287 0.09352606 0
#> 3: 0.4944765 0.0534367 0.09858083 0.01369884 -0.14732471 0
#> 4: -0.2121896 0.0534367 -0.00836825 0.01369884 0.09352606 0
#> 5: -0.2121896 -0.1009105 0.09858083 0.01369884 0.09352606 0
#> 6: -0.2121896 -0.1009105 -0.00836825 0.01369884 0.09352606 0
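To make the `woe` and `bin_iv` columns concrete, here is a small base R sketch that recomputes them from the `neg` and `pos` counts printed for `duration.in.month` above. The formula is inferred from those printed numbers (log of the ratio of each bin's share of `pos` to its share of `neg`), not taken from the {scorecard} source, so treat it as an assumption about the convention used.

```r
# Counts copied from the bins$duration.in.month output above
neg <- c(60, 151, 45, 197, 46, 26)
pos <- c(8, 54, 8, 94, 30, 31)

# Share of all positives / negatives that falls into each bin
pos_distr <- pos / sum(pos)
neg_distr <- neg / sum(neg)

# Weight of evidence per bin, and each bin's contribution to the information value
woe <- log(pos_distr / neg_distr)
bin_iv <- (pos_distr - neg_distr) * woe

round(woe, 7)  # compare with the woe column
sum(bin_iv)    # compare with total_iv
```

A recipe step would need to learn the breaks and these per-bin values on the training set during `prep()`, then apply them to new data in `bake()`, as `woebin_ply()` does with `bins`.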