step_woe_bin() for binning numeric and factor predictors #239

@AndrewKostandy

Feature

Thanks for your work on this package.

It would be great to have a recipe step that bins numeric and factor predictors using weight of evidence (WoE) against a binary outcome. Functions such as woebin() from {scorecard} and woe.binning() from {woeBinning} already do this. The recipe step would do two things:

  1. Bin the numeric or factor predictors (this can lump some factor levels together)
  2. Replace the bins / factor levels with their WoE values (as step_woe() currently does)
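
To make the two steps concrete, here is a minimal base R sketch, independent of any package. It is only an illustration under stated assumptions: the break points are fixed by hand rather than learned, the toy data and the helper names fit_woe_bins() and apply_woe_bins() are hypothetical, and the sign convention (log of the share of positives over the share of negatives) is an assumption chosen to agree with the {scorecard} output in the example that follows.

```r
# Toy training data (made up): one numeric predictor, binary outcome (1 = positive).
train <- data.frame(
  x = c(2, 5, 7, 9, 12, 15, 18, 22, 30, 40),
  y = c(0, 0, 1, 0, 0, 1, 0, 1, 0, 1)
)

# Hypothetical helper: bin x at the given breaks and compute one WoE value per bin.
fit_woe_bins <- function(x, y, breaks) {
  bin <- cut(x, breaks = breaks, right = FALSE)    # step 1: binning
  pos <- as.numeric(tapply(y == 1, bin, sum))      # positives per bin
  neg <- as.numeric(tapply(y == 0, bin, sum))      # negatives per bin
  woe <- log((pos / sum(pos)) / (neg / sum(neg)))  # assumed WoE convention
  names(woe) <- levels(bin)
  list(breaks = breaks, woe = woe)
}

# Hypothetical helper: bin new values with the stored breaks and look up
# the WoE learned on the training data (step 2: replacement).
apply_woe_bins <- function(x, fit) {
  bin <- cut(x, breaks = fit$breaks, right = FALSE)
  unname(fit$woe[as.character(bin)])
}

fit <- fit_woe_bins(train$x, train$y, breaks = c(-Inf, 10, 20, Inf))
fit$woe

# New values fall into the first, second and third bin respectively,
# so they are encoded as log(0.5), log(0.75) and log(3).
new_woe <- apply_woe_bins(c(3, 16, 35), fit)
new_woe
```

Keeping the fit and apply parts separate mirrors how a recipe step would learn the bins when the recipe is prepped and reuse them on new data.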

Example with woebin() from {scorecard}:

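library(scorecard)
library(rsample)

# German credit data shipped with {scorecard}; the outcome column is creditability
data("germancredit")
# Stratified train/test split (rsample's default keeps 75% for training)
data_split <- initial_split(germancredit, strata = creditability)

germancredit_train <- training(data_split)
germancredit_test <- testing(data_split)

# Learn the bins and their WoE values on the training set only
bins <- woebin(germancredit_train, "creditability")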
#> ℹ Creating woe binning ...
#> ✔ Binning on 750 rows and 21 columns in 00:00:02

bins$duration.in.month
#>             variable       bin count count_distr   neg   pos   posprob        woe      bin_iv  total_iv breaks is_special_values
#>               <char>    <char> <int>       <num> <int> <int>     <num>      <num>       <num>     <num> <char>            <lgcl>
#> 1: duration.in.month  [-Inf,8)    68  0.09066667    60     8 0.1176471 -1.1676052 0.091925740 0.2587426      8             FALSE
#> 2: duration.in.month    [8,14)   205  0.27333333   151    54 0.2634146 -0.1809979 0.008618949 0.2587426     14             FALSE
#> 3: duration.in.month   [14,16)    53  0.07066667    45     8 0.1509434 -0.8799231 0.044135825 0.2587426     16             FALSE
#> 4: duration.in.month   [16,34)   291  0.38800000   197    94 0.3230241  0.1073889 0.004568290 0.2587426     34             FALSE
#> 5: duration.in.month   [34,44)    76  0.10133333    46    30 0.3947368  0.4198538 0.019193319 0.2587426     44             FALSE
#> 6: duration.in.month [44, Inf)    57  0.07600000    26    31 0.5438596  1.0231885 0.090300448 0.2587426    Inf             FALSE

bins$purpose
#>    variable                                                              bin count count_distr   neg   pos   posprob        woe     bin_iv  total_iv                                                           breaks is_special_values
#>      <char>                                                           <char> <int>       <num> <int> <int>     <num>      <num>      <num>     <num>                                                           <char>            <lgcl>
#> 1:  purpose                                          retraining%,%car (used)    83  0.11066667    70    13 0.1566265 -0.8362480 0.06318318 0.1960758                                          retraining%,%car (used)             FALSE
#> 2:  purpose                                       radio/television%,%repairs   220  0.29333333   172    48 0.2181818 -0.4289956 0.04902807 0.1960758                                       radio/television%,%repairs             FALSE
#> 3:  purpose furniture/equipment%,%business%,%domestic appliances%,%car (new)   395  0.52666667   257   138 0.3493671  0.2254755 0.02791601 0.1960758 furniture/equipment%,%business%,%domestic appliances%,%car (new)             FALSE
#> 4:  purpose                                               education%,%others    52  0.06933333    26    26 0.5000000  0.8472979 0.05594856 0.1960758                                               education%,%others             FALSE

# Apply the bins learned on the training set to the test set
germancredit_test_woe <- woebin_ply(germancredit_test, bins = bins)
#> ℹ Converting into woe values ...
#> ✔ Woe transformating on 250 rows and 20 columns in 00:00:00

head(germancredit_test_woe)
#>    creditability status.of.existing.checking.account_woe duration.in.month_woe credit.history_woe purpose_woe credit.amount_woe savings.account.and.bonds_woe present.employment.since_woe
#>           <fctr>                                   <num>                 <num>              <num>       <num>             <num>                         <num>                        <num>
#> 1:          good                               0.7901394           -0.83910109        -0.73005174  -0.5518446        0.01369884                    -0.7833423                  -0.34989526
#> 2:           bad                               0.7901394            0.06578153        -0.05715841   0.3677248        0.31508105                     0.2344150                   0.06559728
#> 3:          good                               0.2814901            0.80349524         0.10090617  -0.5518446        0.82320031                     0.2344150                   0.06559728
#> 4:          good                              -1.2599785           -0.30766736         0.10090617  -0.5518446       -0.33683660                    -0.7833423                  -0.34989526
#> 5:           bad                               0.2814901            0.06578153        -0.73005174   0.3677248        0.31508105                     0.2344150                   0.21868920
#> 6:           bad                               0.7901394            0.06578153        -0.73005174   0.3677248        0.01369884                     0.2344150                  -0.34989526
#>    installment.rate.in.percentage.of.disposable.income_woe personal.status.and.sex_woe other.debtors.or.guarantors_woe present.residence.since_woe property_woe age.in.years_woe other.installment.plans_woe
#>                                                      <num>                       <num>                           <num>                       <num>        <num>            <num>                       <num>
#> 1:                                             0.095061763                 -0.09790421                       0.0287165                 -0.01712104  -0.56976816       -0.1941560                  -0.1688382
#> 2:                                            -0.004073325                 -0.09790421                       0.0287165                 -0.01712104   0.49062292       -0.1941560                  -0.1688382
#> 3:                                            -0.077291674                 -0.09790421                       0.0287165                  0.14090545   0.09425254       -0.9650809                  -0.1688382
#> 4:                                            -0.077291674                 -0.09790421                       0.0287165                 -0.01712104  -0.56976816       -0.1941560                  -0.1688382
#> 5:                                             0.095061763                 -0.09790421                       0.0287165                  0.14090545   0.09425254       -0.1044233                  -0.1688382
#> 6:                                             0.095061763                 -0.09790421                       0.0287165                 -0.01712104   0.09425254       -0.1941560                  -0.1688382
#>    housing_woe number.of.existing.credits.at.this.bank_woe     job_woe number.of.people.being.liable.to.provide.maintenance.for_woe telephone_woe foreign.worker_woe
#>          <num>                                       <num>       <num>                                                        <num>         <num>              <num>
#> 1:  -0.2121896                                  -0.1009105 -0.02034658                                                   0.01369884   -0.14732471                  0
#> 2:   0.4616354                                  -0.1009105 -0.02034658                                                  -0.06899287    0.09352606                  0
#> 3:   0.4944765                                   0.0534367  0.09858083                                                   0.01369884   -0.14732471                  0
#> 4:  -0.2121896                                   0.0534367 -0.00836825                                                   0.01369884    0.09352606                  0
#> 5:  -0.2121896                                  -0.1009105  0.09858083                                                   0.01369884    0.09352606                  0
#> 6:  -0.2121896                                  -0.1009105 -0.00836825                                                   0.01369884    0.09352606                  0
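
As a sanity check on what the woe, bin_iv and total_iv columns mean, the duration.in.month table above can be reproduced from its neg and pos counts alone. The formulas below are inferred from the printed numbers rather than taken from the {scorecard} documentation, so treat them as an assumption that happens to match this output.

```r
# neg / pos counts copied from the duration.in.month table above
neg <- c(60, 151, 45, 197, 46, 26)
pos <- c(8, 54, 8, 94, 30, 31)

# Share of all positives / negatives that fall into each bin
pos_distr <- pos / sum(pos)
neg_distr <- neg / sum(neg)

# Assumed definitions: WoE per bin and its information value contribution
woe <- log(pos_distr / neg_distr)
bin_iv <- (pos_distr - neg_distr) * woe
total_iv <- sum(bin_iv)

round(woe, 7)       # compare with the woe column
round(bin_iv, 9)    # compare with the bin_iv column
round(total_iv, 7)  # compare with the total_iv column
```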

Labels: feature (a feature request or enhancement)
