**<font size=6>Handling Categorical Data</font>**

In [1]:
import pandas as pd

In [2]:
data = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']
])

In [3]:
data.columns = ['color', 'size', 'price', 'category']

In [4]:
data

Unnamed: 0,color,size,price,category
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


**<font size=4>1. Handling ordinal features</font>**<br>
The category values are ordered, or can be ranked. For example, clothing sizes S/M/L.

In [5]:
size_mapping = {'XL':4, 'L':3, 'M':2, 'S':1}

In [7]:
data['size'] = data['size'].map(size_mapping)

In [8]:
data

Unnamed: 0,color,size,price,category
0,green,2,10.1,class1
1,red,3,13.5,class2
2,blue,4,15.3,class1
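The ordinal mapping can also be reversed by inverting the dictionary. The following is a minimal sketch that reuses the same `size_mapping`; the names `inv_size_mapping` and `sizes` are introduced here only for illustration.

```python
import pandas as pd

size_mapping = {'XL': 4, 'L': 3, 'M': 2, 'S': 1}

# Illustrative helper: swap keys and values to map integers back to labels.
inv_size_mapping = {v: k for k, v in size_mapping.items()}

sizes = pd.Series(['M', 'L', 'XL'])
encoded = sizes.map(size_mapping)        # labels -> ordered integers
decoded = encoded.map(inv_size_mapping)  # integers -> original labels

print(encoded.tolist())
print(decoded.tolist())
```

Note that `map` produces NaN for any value that is missing from the dictionary, so it is worth checking that every size in the data has an entry in the mapping.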


**<font size=4>2. Encoding class labels as integers</font>**<br>

In [9]:
from sklearn.preprocessing import LabelEncoder

In [10]:
class_le = LabelEncoder()

In [11]:
y = class_le.fit_transform(data['category'].values)

In [12]:
y

array([0, 1, 0])

Use the inverse_transform method to restore the integer class labels to their original string representation.

In [13]:
class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)
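The fitted encoder also keeps the label-to-integer correspondence in its `classes_` attribute, where the position of each label is its integer code. A small self-contained sketch, using a stand-in `labels` array rather than the DataFrame above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in for data['category'].values
labels = np.array(['class1', 'class2', 'class1'])

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; index i corresponds to code i.
print(le.classes_)
print(codes)
print(le.inverse_transform(codes))
```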

In [14]:
data

Unnamed: 0,color,size,price,category
0,green,2,10.1,class1
1,red,3,13.5,class2
2,blue,4,15.3,class1


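Alternatively, pd.Categorical can be used to convert the labels to integer codes.

<font color='red'>The drawback of this approach is that, once only the `.codes` are kept (as below), the operation cannot be reversed to recover the original categories. The mapping survives only if the Categorical object and its `.categories` are kept.</font>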

In [15]:
data["category"] = pd.Categorical(data["category"]).codes

In [16]:
data

Unnamed: 0,color,size,price,category
0,green,2,10.1,0
1,red,3,13.5,1
2,blue,4,15.3,0
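If the original labels may be needed later, one option is to hold on to the Categorical object instead of overwriting the column with its codes. This is a sketch under that assumption; `category`, `cat`, and `restored` are illustrative names.

```python
import pandas as pd

category = pd.Series(['class1', 'class2', 'class1'])

# Keep the Categorical itself so that the code -> label mapping is not lost.
cat = pd.Categorical(category)
codes = cat.codes  # integer codes, one per row

# cat.categories is an Index of the unique labels; position i is code i.
restored = cat.categories.take(codes)

print(codes.tolist())
print(list(restored))
```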


Separate the feature data from the sample labels.

In [17]:
X = data[['color', 'size', 'price']]

In [18]:
X

Unnamed: 0,color,size,price
0,green,2,10.1
1,red,3,13.5
2,blue,4,15.3


In [19]:
y = data["category"].values

In [20]:
y

array([0, 1, 0], dtype=int8)

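**<font size=4>3. Handling nominal features</font>**<br>

**One-hot encoding**<br>
Each category gets its own binary column that indicates whether the sample belongs to that category.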

In [21]:
from sklearn.preprocessing import OneHotEncoder

In [22]:
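# encoder = OneHotEncoder(n_values='auto')
# Note: n_values and categorical_features were deprecated in scikit-learn 0.20
# and removed in 0.22, so this call only works on older versions.
encoder = OneHotEncoder(categorical_features=[0])

With this older API, string-valued data must first be converted to integers with LabelEncoder before one-hot encoding can be applied.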

In [23]:
X = X.values

In [24]:
X[:,0] = class_le.fit_transform(X[:,0])

In [25]:
X

array([[1, 2, 10.1],
       [2, 3, 13.5],
       [0, 4, 15.3]], dtype=object)

In [26]:
encoder.fit_transform(X).toarray()

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


array([[ 0. ,  1. ,  0. ,  2. , 10.1],
       [ 0. ,  0. ,  1. ,  3. , 13.5],
       [ 1. ,  0. ,  0. ,  4. , 15.3]])
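On newer scikit-learn versions, where `categorical_features` is no longer available, a similar result can be obtained with `ColumnTransformer`. The following is a sketch, not a drop-in replacement for the cells above: it assumes scikit-learn 0.20 or later, rebuilds the feature table inline, and passes the string column straight to `OneHotEncoder` without a `LabelEncoder` step.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same features as above, with size already mapped to integers.
X = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [2, 3, 4],
    'price': [10.1, 13.5, 15.3],
})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['color'])],  # encode only the color column
    remainder='passthrough',                   # keep size and price as they are
    sparse_threshold=0,                        # always return a dense array
)

X_encoded = ct.fit_transform(X)
print(X_encoded)
```

The one-hot columns come first, in sorted category order (blue, green, red), followed by the passed-through `size` and `price` columns.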

So, in my opinion, pandas' built-in get_dummies() is more convenient.

In [27]:
X = data[['color', 'size', 'price']]

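This method does not encode numeric columns: by default, get_dummies converts only string-valued columns (object or category dtype) and leaves the rest unchanged.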

In [30]:
pd.get_dummies(X[['color', 'size', 'price']])

Unnamed: 0,size,price,color_blue,color_green,color_red
0,2,10.1,0,1,0
1,3,13.5,0,0,1
2,4,15.3,1,0,0
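A couple of `get_dummies` options may be useful here. The sketch below rebuilds the feature table inline and shows `columns` to pick the columns to encode explicitly, `dtype=int` to get 0/1 integers regardless of the pandas version's default, and `drop_first=True` to drop one redundant indicator column, which can help avoid perfectly collinear features in linear models.

```python
import pandas as pd

X = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [2, 3, 4],
    'price': [10.1, 13.5, 15.3],
})

# Encode only the color column and force integer 0/1 output.
dummies = pd.get_dummies(X, columns=['color'], dtype=int)

# Same, but drop the first category (blue); a row of all zeros then means blue.
reduced = pd.get_dummies(X, columns=['color'], drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(reduced.columns.tolist())
```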


<hr>