/
introduction.Rmd
160 lines (117 loc) · 4.12 KB
/
introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: "Introduction to rbin"
author: "Aravind Hebbali"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to rbin}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r, echo=FALSE, eval=TRUE}
library(rbin)
```
## Introduction
Binning is the process of transforming numerical or continuous data into categorical data. It is
a common data pre-processing step of the model building process.
rbin has the following features:
- manual binning using shiny app
- equal length binning method
- winsorized binning method
- quantile binning method
- combine levels of categorical data
- create dummy variables based on binning method
- calculates weight of evidence (WOE), entropy and information value (IV)
- provides summary information about binning pre-processing
## Manual Binning
For manual binning, you need to specify the cut points for the bins. `rbin` follows the left closed and
right open interval (`[0,1) = {x | 0 ≤ x < 1}`) for creating bins. The number of cut points you specify
is one less than the number of bins you want to create i.e. if you want to create 10 bins, you
need to specify only 9 cut points as shown in the below example. The accompanying RStudio addin,
`rbinAddin()` can be used to iteratively bin the data and to enforce monotonic increasing/decreasing trend.
After finalizing the bins, you can use `rbin_create()` to create the dummy variables.
### Bins
```{r manual}
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56))
bins
```
### Plot
```{r manual_plot, fig.width=7, fig.height=5, fig.align='center'}
# plot
plot(bins)
```
### Dummy Variables
```{r manual_dummy}
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56))
rbin_create(mbank, age, bins)
```
## Factor Binning
You can collapse or combine levels of a factor/categorical variable using `rbin_factor_combine()`
and then use `rbin_factor()` to look at weight of evidence, entropy and information value. After
finalizing the bins, you can use `rbin_factor_create()` to create the dummy variables. You can
use the RStudio addin, `rbinFactorAddin()` to interactively combine the levels and create dummy variables
after finalizing the bins.
### Combine Levels
```{r factor_combine}
upper <- c("secondary", "tertiary")
out <- rbin_factor_combine(mbank, education, upper, "upper")
table(out$education)
out <- rbin_factor_combine(mbank, education, c("secondary", "tertiary"), "upper")
table(out$education)
```
### Bins
```{r factor_bins}
bins <- rbin_factor(mbank, y, education)
bins
```
#### Plot
```{r factor_plot, fig.width=7, fig.height=5, fig.align='center'}
# plot
plot(bins)
```
### Create Bins
```{r factor_create}
upper <- c("secondary", "tertiary")
out <- rbin_factor_combine(mbank, education, upper, "upper")
rbin_factor_create(out, education)
```
## Quantile Binning
Quantile binning aims to bin the data into roughly equal groups using quantiles.
```{r quantile}
bins <- rbin_quantiles(mbank, y, age, 10)
bins
```
#### Plot
```{r quantile_plot, fig.width=7, fig.height=5, fig.align='center'}
# plot
plot(bins)
```
## Equal Length Binning
Equal length binning creates bins of equal widths. It is different from equal frequency binning which
creates bins of equal size.
```{r equal_length}
bins <- rbin_equal_length(mbank, y, age, 10)
bins
```
#### Plot
```{r equal_length_plot, fig.width=7, fig.height=5, fig.align='center'}
# plot
plot(bins)
```
## Winsorized Binning
Winsorized binning is similar to equal length binning except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data pre-processing stage. For Winsorized binning, the Winsorized statistics are computed first. After the minimum and maximum have been found, the split points are calculated the same way as in equal length binning.
```{r winsorize}
bins <- rbin_winsorize(mbank, y, age, 10, winsor_rate = 0.05)
bins
```
#### Plot
```{r winsorize_plot, fig.width=7, fig.height=5, fig.align='center'}
# plot
plot(bins)
```