how to deal with mixed data ? #3

Closed
xiaoguozhi opened this issue Feb 6, 2018 · 6 comments

xiaoguozhi commented Feb 6, 2018

Great package on this subject, very nice job! I have a problem. For example, I have a variable such as dat<-data.frame(y=c(0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,0),x=c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20)). In this case I want to treat 888 and 666 as two special classes, in the same way that missing values get their own WoE, and I want a separate WoE for each of 888 and 666. The other values should be binned as usual. How can I handle this type of data? Thanks!

Owner

ShichenXie commented Feb 6, 2018

Thanks. You can specify the breakpoints via the breaks_list argument of the woebin function, and you can get the optimal binning from the dataset that excludes the two special values.

library(scorecard)
library(data.table)

dat <- data.table(
  y = c(0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,0),
  x = c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20))

# get optimal breakpoints for the remaining data (special values excluded)
bins <- woebin(dat[x != 666 & x != 888], "y")

# specify the breakpoints, adding 666 and 888 as extra break values
bins2 <- woebin(dat, "y", breaks_list = list(x=c(16, 18, 666, 888)))
woebin_plot(bins2)
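
To see why adding 666 and 888 to the breakpoints isolates them, here is a minimal base R sketch. It assumes the bins are left-closed intervals of the form [a, b), which is an assumption about how the breakpoints are applied rather than something stated above. It uses cut() only as a stand-in, not woebin itself.

```r
# Same data as in the example above.
y <- c(0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,0)
x <- c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20)

# Breakpoints from breaks_list, padded with -Inf and Inf.
# Assumption: intervals are left-closed, i.e. [a, b).
breaks <- c(-Inf, 16, 18, 666, 888, Inf)
bin <- cut(x, breaks = breaks, right = FALSE)

# Cross-tabulate the bins against the target.
counts <- table(bin, y)
print(counts)
```

With this data no regular value is 666 or larger, so the last two intervals hold only the 666 rows and only the 888 rows respectively.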

Author

xiaoguozhi commented Feb 7, 2018

Thanks, a very detailed answer. In my case the values 666 and 888 are categorical, so the variable should be converted to a factor before calling woebin.
However, the following code treats 666 and 888 as numeric values:
bins2 <- woebin(dat, "y", breaks_list = list(x=c(16, 18, 666, 888)))
In my opinion, I could do the following:
library(scorecard)
library(data.table)
library(dplyr)
dat<-data.table( y=c(0,0,0,1,1,0,0,1,0,0,1,1,1,0,0,0,1,1,1,0), x=c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20))

# special values
sp<-c(666,888)
#get the special data
dat_sp<-filter(dat, x %in% sp)
#get normal data
dat_nor<-filter(dat, !x %in% sp)

# convert it to a factor
dat_sp$x<-as.factor(dat_sp$x)
bins_sp <- woebin(dat_sp, "y")
woebin_plot(bins_sp)

# bin the normal data
bins_nor <- woebin(dat_nor, "y")
woebin_plot(bins_nor)

Now the questions are: 1) how to combine these two plots into one plot; 2) how to combine the two WoE tables for this variable, because we can't simply do it with rbind. If we have many such variables, how do we get the WoE? What really worries me is that we can't bin many variables automatically. For example, many functions in your package support batch processing; obviously, if we bin the special values and the normal values separately, that breaks batch processing.

Author

xiaoguozhi commented Feb 7, 2018

# get optimal breakpoints for rest dataset
bins <- woebin(dat[x != 666 & x != 888], "y")
Here, do we really get the optimal breakpoints once we omit some special samples?
I also realized that computing the WoE separately is wrong.
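
One way to see the problem with separate computation: the WoE of a bin depends on the totals of the whole sample, so the same rows get a different value when the totals come from a subset. The sketch below uses base R only and assumes the convention WoE = ln(share of 1s / share of 0s). The helper woe_by_group is hypothetical and not part of scorecard.

```r
# Data from the earlier comment that splits special and normal rows.
y <- c(0,0,0,1,1,0,0,1,0,0,1,1,1,0,0,0,1,1,1,0)
x <- c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20)
sp <- c(666, 888)

# Hypothetical helper. Assumed convention: ln(share of 1s / share of 0s),
# where the shares are taken over whatever rows are passed in.
woe_by_group <- function(y, g) {
  ones  <- tapply(y == 1, g, sum)
  zeros <- tapply(y == 0, g, sum)
  log((ones / sum(ones)) / (zeros / sum(zeros)))
}

is_sp <- x %in% sp

# WoE of the special values when the totals come from the full data.
grp <- ifelse(is_sp, as.character(x), "other")
woe_full <- woe_by_group(y, grp)[c("666", "888")]

# WoE of the same rows when the totals come from the special subset only.
woe_subset <- woe_by_group(y[is_sp], as.character(x[is_sp]))[c("666", "888")]

print(rbind(full = woe_full, subset = woe_subset))
```

The two rows of the printed table differ, which is why WoE values from separately binned subsets cannot simply be stacked together.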

@ShichenXie
Owner

It could be a problem if you have many variables that need to be handled this way.
I'll add this feature to the woebin function. Thanks.

@xiaoguozhi
Author

Thanks again for your nice solution! Looking forward to your improved version.

Owner

ShichenXie commented Mar 21, 2018

See the following example:

library(scorecard)

dat <- data.frame(y=c(0,0,0,1,1,1,1,1,0,0,1,1,1,0,0,0,1,1,1,0),
x=c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20))

#' treat the two values as two separate classes
bin = woebin(dat, "y", special_values = c(666,888))
#' treat the two values as one combined class
bin2 = woebin(dat, "y", special_values = c("666%,%888"))
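
As a rough illustration of the difference between the two calls, the base R sketch below groups the same x values both ways. It assumes, based only on the comments in the example above, that "%,%" joins several values into one class. The splitting shown here is an illustration and not the package's internal code, and the label "regular" is made up for this sketch.

```r
x <- c(1,2,3,4,5,888,888,888,9,10,666,666,666,666,15,16,17,18,19,20)

# Two separate special classes: each value keeps its own label.
two_classes <- ifelse(x %in% c(666, 888), as.character(x), "regular")

# One combined class: split the "%,%" string into its member values.
# Assumption: "%,%" is only a separator between the grouped values.
merged_label  <- "666%,%888"
merged_values <- as.numeric(strsplit(merged_label, "%,%", fixed = TRUE)[[1]])
one_class <- ifelse(x %in% merged_values, merged_label, "regular")

# Row counts per class under each grouping.
print(table(two_classes))
print(table(one_class))
```

Under the first grouping, 666 and 888 are counted separately. Under the second, their rows are pooled into a single class.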
