# Day13 Converting Continuous Variables into Discrete Values

## Why convert continuous variables into discrete values?
Discretization bins the many values of a continuous variable into a smaller number of groups. The main reasons to do it are:
1. Simplify the model - binning reduces the number of distinct values a variable can take, which can speed up iteration and make the data cheaper to store.
2. Enhance robustness - binning reduces the influence of extreme values and outliers on the analysis. ex: if ages above 50 are encoded as 1 and all others as 0, an anomalous age of 200 distorts the result far less than it would in the raw data.
3. Fit model requirements - some models, such as certain decision tree algorithms and naive Bayes classifiers, expect discrete features.
4. Introduce non-linearity - when each bin is treated as its own category, a model (a linear one in particular) can fit a different effect per bin, which gives it more expressive power.
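The robustness point can be made concrete with a small sketch. The ages below are made-up illustration data (not from the article's dataset), and the variable names are hypothetical; the idea is only to compare how much one bad value moves a summary of the raw feature versus the binarized one.

```python
import pandas as pd

# Hypothetical ages; the last value (200) stands in for a data-entry error.
age = pd.Series([23, 35, 47, 52, 68, 200])

# Binary feature from the text: 1 if age is above 50, otherwise 0.
over_50 = (age > 50).astype(int)

# Mean of each feature without and with the outlier.
raw_mean_clean = age.iloc[:-1].mean()
raw_mean_dirty = age.mean()
bin_mean_clean = over_50.iloc[:-1].mean()
bin_mean_dirty = over_50.mean()

print("raw age mean:   ", raw_mean_clean, "->", raw_mean_dirty)
print("binarized mean: ", bin_mean_clean, "->", bin_mean_dirty)
```

In the binarized feature the value 200 is just one more 1, so it cannot contribute more than any other person above 50, whereas in the raw feature it pulls the mean far away from the typical ages.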

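## Main ways to convert
1. Equal-width binning - split the range of the data into intervals of the same width. ex: one group per 10 years of age.
2. Equal-frequency binning - split the data into groups that each contain the same number of observations. ex: 10 groups.
3. Cluster-based binning - run a clustering algorithm on the values and use the resulting clusters as bins.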
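The article's examples cover the first two methods only, so here is a minimal sketch of the third one. It assumes plain 1-D k-means (Lloyd's algorithm) as the clustering step, which is just one possible choice; `kmeans_1d_bins` is a hypothetical helper written for this illustration, not a pandas function.

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100],
    name="age",
)

def kmeans_1d_bins(values, k, n_iter=100):
    """Hypothetical helper: cluster 1-D values with plain k-means and
    return (bin label per value, sorted cluster centers)."""
    x = np.asarray(values, dtype=float)
    # Deterministic start: spread the initial centers evenly over the value range.
    centers = np.linspace(x.min(), x.max(), k)
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sort the centers so that bin 0 holds the smallest values, then assign final labels.
    centers = np.sort(centers)
    labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    return labels, centers

cluster_bin, cluster_centers = kmeans_1d_bins(ages, k=3)
print(pd.DataFrame({"age": ages, "cluster_bin": cluster_bin}))
print(cluster_centers)
```

Unlike equal-width or equal-frequency binning, the bin boundaries here follow where the data actually clumps, so neither the widths nor the counts are guaranteed to be equal.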

## 範例 Example
'(' means the endpoint is excluded; ']' means the endpoint is included.
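To see what the bracket notation means in practice, the sketch below bins two boundary values with `pd.cut`, once with its default right-closed intervals and once with `right=False`. The values and edges are made up for illustration.

```python
import pandas as pd

values = pd.Series([10, 20])
edges = [0, 10, 20, 30]

# Default (right=True): intervals are (0, 10], (10, 20], (20, 30],
# so a value equal to an edge belongs to the interval ending at that edge.
right_closed = pd.cut(values, bins=edges)

# right=False: intervals are [0, 10), [10, 20), [20, 30),
# so the same value belongs to the interval starting at that edge.
left_closed = pd.cut(values, bins=edges, right=False)

print(right_closed)
print(left_closed)
```

With the default setting, 10 lands in the first bin; with `right=False` it lands in the second.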

```python
# import packages
import pandas as pd

# create some data
ages = pd.DataFrame({"age": [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100]})
```

### Equal-width binning

```python
# create a new column that bins age into 5 equal-width intervals
ages["equal_width"] = pd.cut(ages["age"], 5)
print(ages["equal_width"])
```

```
0     (1.902, 21.6]
1      (21.6, 41.2]
2      (21.6, 41.2]
3      (21.6, 41.2]
4     (1.902, 21.6]
5     (1.902, 21.6]
6      (21.6, 41.2]
7      (21.6, 41.2]
8      (21.6, 41.2]
9      (60.8, 80.4]
10     (41.2, 60.8]
11    (1.902, 21.6]
12    (1.902, 21.6]
13    (1.902, 21.6]
14    (1.902, 21.6]
15     (21.6, 41.2]
16    (1.902, 21.6]
17    (1.902, 21.6]
18     (60.8, 80.4]
19    (80.4, 100.0]
Name: equal_width, dtype: category
Categories (5, interval[float64]): [(1.902, 21.6] < (21.6, 41.2] < (41.2, 60.8] < (60.8, 80.4] < (80.4, 100.0]]
```


```python
# count the observations in each equal-width bin
ages["equal_width"].value_counts() # the range of each bin is the same
```

```
(1.902, 21.6]    9
(21.6, 41.2]     7
(60.8, 80.4]     2
(80.4, 100.0]    1
(41.2, 60.8]     1
Name: equal_width, dtype: int64
```
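The edges in the output can be reproduced by hand, which also explains why the first interval starts at 1.902 rather than at the minimum age of 2. The sketch below assumes the behaviour described in the pandas documentation for an integer `bins` argument: the range is split into evenly spaced edges and then extended slightly (by 0.1% of the range) so that the minimum value falls inside the left-open first interval.

```python
import numpy as np
import pandas as pd

age = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100])

# retbins=True also returns the bin edges that pd.cut computed.
binned, edges = pd.cut(age, 5, retbins=True)

# Equal-width edges computed by hand: 6 evenly spaced points from min to max.
manual_edges = np.linspace(age.min(), age.max(), 6)

print(edges)                  # edges used by pd.cut
print(manual_edges)           # evenly spaced edges
print(np.diff(manual_edges))  # width of each bin
```

Apart from the nudged lowest edge, the two sets of edges agree, and every bin has the same width of (100 - 2) / 5 = 19.6.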

### Equal-frequency binning

```python
# create a new column that bins age into 5 groups with the same number of observations
ages["equal_freq"] = pd.qcut(ages["age"], 5)
print(ages["equal_freq"])
```

```
0      (10.6, 19.8]
1      (19.8, 25.8]
2      (19.8, 25.8]
3      (25.8, 41.8]
4     (1.999, 10.6]
5      (19.8, 25.8]
6      (19.8, 25.8]
7      (25.8, 41.8]
8      (25.8, 41.8]
9     (41.8, 100.0]
10    (41.8, 100.0]
11     (10.6, 19.8]
12     (10.6, 19.8]
13    (1.999, 10.6]
14    (1.999, 10.6]
15     (25.8, 41.8]
16    (1.999, 10.6]
17     (10.6, 19.8]
18    (41.8, 100.0]
19    (41.8, 100.0]
Name: equal_freq, dtype: category
Categories (5, interval[float64]): [(1.999, 10.6] < (10.6, 19.8] < (19.8, 25.8] < (25.8, 41.8] < (41.8, 100.0]]
```


```python
# count the observations in each equal-frequency bin
ages["equal_freq"].value_counts() # each bin contains the same number of observations
```

```
(41.8, 100.0]    4
(25.8, 41.8]     4
(19.8, 25.8]     4
(10.6, 19.8]     4
(1.999, 10.6]    4
Name: equal_freq, dtype: int64
```
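The equal-frequency edges are simply quantiles of the data. As a rough sketch of what `pd.qcut(..., 5)` does here, the edges can be computed with `Series.quantile` at 0%, 20%, ..., 100% and passed to `pd.cut`; the variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

age = pd.Series([18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 13, 11, 5, 2, 41, 9, 18, 80, 100])

# Quantile edges at 0%, 20%, 40%, 60%, 80% and 100% of the data.
quantile_edges = age.quantile(np.linspace(0, 1, 6)).to_numpy()

# Cut at those edges; include_lowest=True keeps the minimum value in the first bin.
by_quantile = pd.cut(age, bins=quantile_edges, include_lowest=True)

# The same binning done directly with qcut, for comparison.
by_qcut = pd.qcut(age, 5)

print(quantile_edges)
print(by_quantile.value_counts().sort_index())
```

The counts come out exactly equal here because 20 observations divide evenly into 5 groups and no quantile edge coincides with a repeated value. With heavily tied data the counts can differ, and `pd.qcut` raises an error on duplicate edges unless `duplicates="drop"` is passed.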

### Custom bins: add a column with the five groups (0, 10], (10, 20], (20, 30], (30, 50], (50, 100], where '(' means excluded and ']' means included

```python
# specify the bin edges explicitly
ages["exact_bins"] = pd.cut(ages["age"], bins=(0,10,20,30,50,100))
print(ages["exact_bins"])
```

```
0      (10, 20]
1      (20, 30]
2      (20, 30]
3      (20, 30]
4       (0, 10]
5      (20, 30]
6      (20, 30]
7      (30, 50]
8      (20, 30]
9     (50, 100]
10     (30, 50]
11     (10, 20]
12     (10, 20]
13      (0, 10]
14      (0, 10]
15     (30, 50]
16      (0, 10]
17     (10, 20]
18    (50, 100]
19    (50, 100]
Name: exact_bins, dtype: category
Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 50] < (50, 100]]
```


```python
# count the observations in each specified bin
ages["exact_bins"].value_counts().sort_index() # sorted in bin order
```

```
(0, 10]      4
(10, 20]     4
(20, 30]     6
(30, 50]     3
(50, 100]    3
Name: exact_bins, dtype: int64
```
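When the bins are specified by hand, it is often convenient to name them and to know what happens to values that fall outside every interval. The sketch below uses the `labels` parameter of `pd.cut`; the ages and the group names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Made-up ages; 120 lies outside the last interval (50, 100].
age = pd.Series([5, 18, 30, 45, 100, 120])

edges = [0, 10, 20, 30, 50, 100]
# Hypothetical group names, one per interval defined by the edges above.
names = ["child", "teen", "young adult", "adult", "senior"]

age_group = pd.cut(age, bins=edges, labels=names)

print(age_group)
# Values outside every interval are left as NaN rather than forced into a bin.
print(age_group.isna().tolist())
```

Because the intervals are right-closed, 30 goes to the (20, 30] group and 100 to the (50, 100] group, while 120 has no bin and becomes missing. Widening the outer edges is one way to avoid silently losing such rows.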

Please let me know if there’s any mistake in this article. Thanks for reading.

References:

[1] Course materials from the 2nd 100-Day Machine Learning Marathon (第二屆機器學習百日馬拉松)

[2] [Continuous or discrete variable](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable)

[3] [Discretization of continuous features (连续特征的离散化)](https://www.zhihu.com/question/31989952)

[4] [Feature discretization (特征离散化)](https://mp.weixin.qq.com/s/KF2-ejxxapISaj4bIMaIdw)
