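# Discretization
Discretizing a continuous attribute means partitioning its value range into a number of discrete intervals, then representing the values that fall into each subinterval with a distinct symbol or integer.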

# There are many ways to discretize; here we use one of the simplest

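Raw height data for a group of people: 165, 174, 160, 180, 159, 163, 192, 184

Suppose we split the heights into several intervals: 150~165, 165~180, 180~195

This sorts the data into three intervals, which can be labeled short, medium, and tall. The end result is a "dummy variable" matrix.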
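
The height example above can be sketched in a few lines of pandas. This is a minimal illustration, not part of the original notebook: the label names and the choice of right-closed intervals (the `pd.cut` default, so 165 counts as short and 180 as medium) are assumptions.

```python
import pandas as pd

# Height data from the text above
heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Bin edges from the text; pd.cut uses right-closed intervals by default,
# so 165 falls into (150, 165] and 180 into (165, 180]
bins = [150, 165, 180, 195]
labels = ["short", "medium", "tall"]  # illustrative label names
categories = pd.cut(heights, bins=bins, labels=labels)

# One 0/1 column per category: the "dummy variable" matrix
dummies = pd.get_dummies(categories, dtype=int)
print(dummies)
```

Each row of `dummies` contains a single 1 marking the category that height belongs to.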

In [3]:
import pandas as pd
data = pd.read_csv("./data/stock_day.csv")
p_change = data['p_change']
p_change.head()

2018-02-27    2.68
2018-02-26    3.02
2018-02-23    2.42
2018-02-22    1.64
2018-02-14    2.05
Name: p_change, dtype: float64

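## pd.qcut(data, q):
- q is the number of groups; the data points are automatically distributed across the groups as evenly as possible
- Bins the data; usually paired with value_counts to count how many values fall into each group
- series.value_counts(): counts the number of occurrences of each group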

In [15]:
qcut = pd.qcut(p_change, q=10)
# Count how many values fall into each group
qcut.value_counts()

(5.27, 10.03]                    65
(0.26, 0.94]                     65
(-0.462, 0.26]                   65
(-10.030999999999999, -4.836]    65
(2.938, 5.27]                    64
(1.738, 2.938]                   64
(-1.352, -0.462]                 64
(-2.444, -1.352]                 64
(-4.836, -2.444]                 64
(0.94, 1.738]                    63
Name: p_change, dtype: int64
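
To see the equal-frequency behaviour without the stock file, here is a small sketch on synthetic data. The random series is only a stand-in for `p_change` (an assumption, as are the seed and distribution); `retbins=True` additionally returns the computed bin edges.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for p_change; the real CSV is not assumed to be available
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(0, 3, size=643))

# Split into 10 quantile-based bins and also get the bin edges back
binned, edges = pd.qcut(values, q=10, retbins=True)

# Count how many values landed in each bin, in bin order
counts = binned.value_counts().sort_index()
print(counts)
print(edges)
```

Because the edges are quantiles of the data, the bin widths vary while the counts stay nearly equal, which matches the pattern in the output above.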

# pd.cut(data, bins): group data by custom intervals
- bins: the custom bin edges

In [11]:
# Specify the bin edges ourselves
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
p_counts = pd.cut(p_change, bins)
p_counts.head()

2018-02-27    (0, 3]
2018-02-26    (3, 5]
2018-02-23    (0, 3]
2018-02-22    (0, 3]
2018-02-14    (0, 3]
Name: p_change, dtype: category
Categories (8, interval[int64]): [(-100, -7] < (-7, -5] < (-5, -3] < (-3, 0] < (0, 3] < (3, 5] < (5, 7] < (7, 100]]
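
The intervals above are right-closed, which matters for values that sit exactly on an edge. The following sketch reuses the same `bins` on a handful of values; the first five come from the `head()` output earlier, while `3.0` and `-7.0` are made-up boundary cases added for illustration.

```python
import pandas as pd

# First five p_change values from above, plus two hypothetical edge cases
sample = pd.Series([2.68, 3.02, 2.42, 1.64, 2.05, 3.0, -7.0])

bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
binned = pd.cut(sample, bins)

# Show each value next to the interval it was assigned to
result = pd.DataFrame({"value": sample, "bin": binned.astype(str)})
print(result)
```

A value equal to an edge goes to the interval on its left: `3.0` is placed in `(0, 3]` and `-7.0` in `(-100, -7]`.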

# One-hot encoding

- pandas.get_dummies(data, prefix=None)

- data: array-like, Series, or DataFrame

- prefix: string prepended to the generated column names

In [14]:
bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
p_counts = pd.cut(p_change, bins)
dummies = pd.get_dummies(p_counts, prefix="rise")
dummies.head()

| | rise_(-100, -7] | rise_(-7, -5] | rise_(-5, -3] | rise_(-3, 0] | rise_(0, 3] | rise_(3, 5] | rise_(5, 7] | rise_(7, 100] |
|---|---|---|---|---|---|---|---|---|
| 2018-02-27 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-26 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2018-02-23 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2018-02-14 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
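
As a self-contained check of how `prefix` shapes the column names, the sketch below runs the same steps on four made-up values (the values and index labels are hypothetical). Passing `dtype=int` is an assumption made here to get 0/1 output as shown above, since pandas 2.0 and later return boolean columns by default.

```python
import pandas as pd

# Hypothetical values standing in for p_change
sample = pd.Series([2.68, 3.02, -6.1, 7.5], index=["a", "b", "c", "d"])

bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
binned = pd.cut(sample, bins)

# Column names become "<prefix>_<interval>"; dtype=int forces 0/1 values
dummies = pd.get_dummies(binned, prefix="rise", dtype=int)
print(dummies.columns.tolist())

# Each row holds exactly one 1, so idxmax recovers the row's category column
recovered = dummies.idxmax(axis=1)
print(recovered)
```

Unused intervals still get a column of zeros, because `pd.cut` returns a categorical that carries all eight categories.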
