### Introduction

Handling categorical data is somewhat tricky: it carries extra semantic information, and most machine learning algorithms cannot process it directly.

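Categorical data comes in two kinds: nominal and ordinal. Nominal values have no order, such as types of weather; ordinal values are ordered, such as clothing sizes.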
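
The distinction can be made explicit in pandas. The sketch below uses made-up weather and size values (they do not come from the datasets loaded later) and assumes `pd.Categorical` with `ordered=True` is an acceptable way to declare the ordinal case.

```python
import pandas as pd

# Nominal: categories have no inherent order (toy values for illustration).
weather = pd.Series(pd.Categorical(['sunny', 'rainy', 'cloudy', 'sunny']))

# Ordinal: the order of the categories is declared explicitly.
sizes = pd.Series(pd.Categorical(
    ['M', 'S', 'XL', 'L'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True,
))

print(weather.cat.ordered)         # False
print(sizes.cat.ordered)           # True
print(sizes.min(), sizes.max())    # S XL
print((sizes > 'M').tolist())      # [False, False, True, True]
```

Comparisons and `min`/`max` are only meaningful for the ordered series; pandas raises an error if they are attempted on the unordered one.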

The general strategy is to first convert the categorical values to numbers, and then apply an encoding scheme to those numbers.

In [1]:
import pandas as pd
import numpy as np

### Transforming Nominal Attributes

In [2]:
vg_df = pd.read_csv('data/vgsales.csv', encoding='utf-8')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

|   | Name | Platform | Year | Genre | Publisher |
|---|------|----------|------|-------|-----------|
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo |
| 5 | Tetris | GB | 1989.0 | Puzzle | Nintendo |
| 6 | New Super Mario Bros. | DS | 2006.0 | Platform | Nintendo |


'Genre' is clearly a nominal attribute. We can list all of its distinct values:

In [3]:
genres = np.unique(vg_df['Genre'])
genres

array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

We can use sklearn to map each value to an integer:

In [4]:
from sklearn.preprocessing import LabelEncoder
gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in 
                  enumerate(gle.classes_)}
genre_mappings    # all possible values of this attribute, keyed by integer label

{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}

In [5]:
genre_labels

array([10,  4,  6, ...,  6,  5,  4])

In [6]:
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]

|   | Name | Platform | Year | Genre | GenreLabel |
|---|------|----------|------|-------|------------|
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | 4 |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | 6 |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | 10 |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 7 |
| 5 | Tetris | GB | 1989.0 | Puzzle | 5 |
| 6 | New Super Mario Bros. | DS | 2006.0 | Platform | 4 |
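
The same label mapping can be reproduced and reversed on a small list. This is a minimal sketch on toy genre values (not rows read from `vgsales.csv`); `toy_genres`, `le`, and `toy_labels` are names introduced only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy genre values for illustration; not rows from vgsales.csv.
toy_genres = ['Sports', 'Platform', 'Racing', 'Sports', 'Puzzle']

le = LabelEncoder()
toy_labels = le.fit_transform(toy_genres)

# classes_ is sorted; each label is an index into it.
print(list(le.classes_))     # ['Platform', 'Puzzle', 'Racing', 'Sports']
print(toy_labels.tolist())   # [3, 0, 2, 3, 1]

# inverse_transform recovers the original strings from the labels.
print(list(le.inverse_transform(toy_labels)))
```

Because 'Genre' is nominal, the integers are only identifiers. Their numeric order (for example, Sports > Racing) carries no meaning, which is why an encoding scheme is normally applied on top of them.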


### Transforming Ordinal Attributes

Below is the 'Generation' attribute of the Pokemon dataset:

In [7]:
poke_df = pd.read_csv('data/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
np.unique(poke_df['Generation'])

array([1, 2, 3, 4, 5, 6], dtype=int64)

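Here the Generation attribute is already mapped to integers. If it were not (for example, if it were stored as strings such as 'Gen 1'), we could map it ourselves as follows: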

In [None]:
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]
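
Since the cell above was not executed against string data, here is a self-contained sketch of the same mapping on a hypothetical frame in which Generation is stored as strings. `toy_df` and its rows are assumptions made for illustration. The last row is deliberately outside the mapping, to show what `map` does with unknown values.

```python
import pandas as pd

# Hypothetical frame where Generation is stored as strings.
toy_df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Chikorita', 'Treecko', 'Turtwig', 'Rowlet'],
    'Generation': ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 7'],
})

gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
toy_df['GenerationLabel'] = toy_df['Generation'].map(gen_ord_map)

# Values missing from the mapping silently become NaN, so check for them.
unmapped = toy_df.loc[toy_df['GenerationLabel'].isna(), 'Generation']
print(toy_df)
print(unmapped.tolist())   # ['Gen 7']
```

Unlike the nominal case, the integers here preserve a real order, so comparisons such as `GenerationLabel >= 3` are meaningful.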

### Encoding Categorical Attributes

#### One-hot Encoding Scheme
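
One-hot encoding turns a nominal attribute with m distinct values into m binary columns, exactly one of which is 1 in each row. The following is a minimal sketch using `pd.get_dummies` on toy genre values (not rows from `vgsales.csv`); `toy` and `onehot` are illustrative names.

```python
import pandas as pd

# Toy genre values for illustration.
toy = pd.DataFrame({'Genre': ['Sports', 'Platform', 'Racing', 'Sports']})

# One column per distinct value; dtype=int gives 0/1 instead of booleans.
onehot = pd.get_dummies(toy['Genre'], dtype=int)
print(onehot)

# Every row has exactly one active column.
print((onehot.sum(axis=1) == 1).all())   # True
```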

#### Dummy Coding Scheme
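
Dummy coding represents m distinct values with m-1 binary columns; the omitted value becomes the baseline and is encoded as all zeros. The sketch below assumes `pd.get_dummies` with `drop_first=True`, which drops the first category level, on toy genre values chosen for illustration.

```python
import pandas as pd

# Toy genre values for illustration.
toy = pd.DataFrame({'Genre': ['Sports', 'Platform', 'Racing', 'Sports']})

# drop_first=True removes the first level ('Platform'), leaving m-1 columns.
dummy = pd.get_dummies(toy['Genre'], drop_first=True, dtype=int)
print(dummy)

# The baseline row ('Platform', index 1) is all zeros.
print(dummy.iloc[1].tolist())   # [0, 0]
```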

#### Effect Coding Scheme
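
Effect coding also uses m-1 columns, but the baseline value is encoded as -1 in every column rather than as all zeros. A minimal sketch on toy genre values: it builds an m-1 column frame with `pd.get_dummies` and then overwrites the baseline rows. This is one possible way to construct the encoding, not necessarily the only one.

```python
import pandas as pd

# Toy genre values for illustration.
toy = pd.DataFrame({'Genre': ['Sports', 'Platform', 'Racing', 'Sports']})

# Start from m-1 indicator columns; 'Platform' is the dropped baseline.
dummy = pd.get_dummies(toy['Genre'], drop_first=True, dtype=int)

# Baseline rows are the ones with no active indicator; set them to -1.
effect = dummy.copy()
baseline_rows = dummy.sum(axis=1) == 0
effect.loc[baseline_rows, :] = -1
print(effect)
```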

#### Bin-counting Scheme
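
Bin-counting replaces each category with a statistic computed from the data, such as how often the category occurs or the target rate observed for it. The sketch below is built entirely on assumptions: the `Publisher` values and the binary `Hit` target are hypothetical and do not come from the datasets above.

```python
import pandas as pd

# Hypothetical data: a category column and a binary target.
toy = pd.DataFrame({
    'Publisher': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Hit':       [1,   0,   1,   1,   0,   0],
})

# Per-category statistics: occurrence count and mean of the target.
stats = toy.groupby('Publisher')['Hit'].agg(['count', 'mean'])

# Replace the category with its statistics.
toy['PublisherCount'] = toy['Publisher'].map(stats['count'])
toy['PublisherHitRate'] = toy['Publisher'].map(stats['mean'])
print(toy)
```

In practice such statistics would usually be computed on training data only, so that information about the target does not leak into evaluation.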

#### Feature Hashing Scheme
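
Feature hashing maps each category through a hash function into a fixed number of columns, so the output width does not grow with the number of distinct values, at the cost of possible collisions. A minimal sketch with sklearn's `FeatureHasher` on toy genre values; the choice of `n_features=6` is arbitrary and made only for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Toy genre values for illustration.
toy_genres = ['Sports', 'Platform', 'Racing', 'Sports']

fh = FeatureHasher(n_features=6, input_type='string')

# Each sample must be an iterable of strings, hence the one-element lists.
hashed = fh.fit_transform([[g] for g in toy_genres]).toarray()

print(hashed.shape)   # (4, 6)
print(hashed)
```

Identical inputs always hash to identical rows, and with the default `alternate_sign=True` the single non-zero entry in each row may be +1 or -1.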

### Conclusion