## Categorical Feature Encoding Challenge

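This is a playground competition on Kaggle, available [here](https://www.kaggle.com/c/cat-in-the-dat/overview), whose dataset consists entirely of categorical features. Working through it shows how to handle categorical features and, when the dimensionality gets very high, how to store the data as sparse vectors.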

In [2]:
"""
Data analysis
"""
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [3]:
train = pd.read_csv('../data/cat-in-the-dat/train.csv')
test = pd.read_csv('../data/cat-in-the-dat/test.csv')

train_copy, test_copy = train.copy(), test.copy()

## Inspecting the data


In [4]:
train.shape, test.shape

((300000, 25), (200000, 24))

In [5]:
train.iloc[:,:12].head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,nom_5
0,0,0,0,0,T,Y,Green,Triangle,Snake,Finland,Bassoon,50f116bcf
1,1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,Piano,b3b4d25d0
2,2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,Theremin,3263bdce5
3,3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,Oboe,f12246592
4,4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,Oboe,5b0f5acd5


In [6]:
train.iloc[:,12:].head()

Unnamed: 0,nom_6,nom_7,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target
0,3ac1b8814,68f6ad3e9,c389000ab,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2,0
1,fbcb50fc1,3b6dd5612,4cd920251,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8,0
2,0922e3cb8,a6a36f527,de9c9f684,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2,0
3,50d7ad46a,ec69236eb,4ade6ab69,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1,1
4,1fe17a1fd,04ddac2be,cb43ab175,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8,0


In [7]:
train.columns.values

array(['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0',
       'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7',
       'nom_8', 'nom_9', 'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4',
       'ord_5', 'day', 'month', 'target'], dtype=object)

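All columns prefixed with `bin_` are binary. Columns prefixed with `nom_` are nominal (enumerated) values. Columns prefixed with `ord_` are also enumerated but carry an order; for example, `ord_2` describes temperature, which ranges from low to high.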

In [8]:
for i in range(10):
    count = train.loc[:, 'nom_{}'.format(i)].unique().shape[0]
    print("nom_{} has {} unique values".format(i, count))

nom_0 has 3 unique values
nom_1 has 6 unique values
nom_2 has 6 unique values
nom_3 has 6 unique values
nom_4 has 4 unique values
nom_5 has 222 unique values
nom_6 has 522 unique values
nom_7 has 1220 unique values
nom_8 has 2215 unique values
nom_9 has 11981 unique values


In [9]:
for i in range(6):
    count = train.loc[:, 'ord_{}'.format(i)].unique().shape[0]
    print("ord_{} has {} unique values".format(i, count))

ord_0 has 3 unique values
ord_1 has 5 unique values
ord_2 has 6 unique values
ord_3 has 15 unique values
ord_4 has 26 unique values
ord_5 has 192 unique values


In [10]:
print(train.info())
print("-" * 40)
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 25 columns):
id        300000 non-null int64
bin_0     300000 non-null int64
bin_1     300000 non-null int64
bin_2     300000 non-null int64
bin_3     300000 non-null object
bin_4     300000 non-null object
nom_0     300000 non-null object
nom_1     300000 non-null object
nom_2     300000 non-null object
nom_3     300000 non-null object
nom_4     300000 non-null object
nom_5     300000 non-null object
nom_6     300000 non-null object
nom_7     300000 non-null object
nom_8     300000 non-null object
nom_9     300000 non-null object
ord_0     300000 non-null int64
ord_1     300000 non-null object
ord_2     300000 non-null object
ord_3     300000 non-null object
ord_4     300000 non-null object
ord_5     300000 non-null object
day       300000 non-null int64
month     300000 non-null int64
target    300000 non-null int64
dtypes: int64(8), object(17)
memory usage: 57.2+ MB
None
---------------

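The analysis shows that every feature is discrete and that there are no missing values, so one option is simply to one-hot encode all of them. Since the `ord_` features are ordered, though, another option is to map them onto the range 0 to 1. Both approaches are tried below.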
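The idea of mapping an ordered feature onto the range 0 to 1 can be sketched on a toy column. This is a minimal illustration rather than the notebook's own code: the `temps` series is made up, and the coldest-to-hottest ordering in `order` is an assumption about how the `ord_2` labels should be ranked.

```python
import pandas as pd

# Toy column using the same temperature labels as `ord_2` (illustrative data).
temps = pd.Series(["Cold", "Lava Hot", "Freezing", "Hot", "Warm", "Boiling Hot"])

# Assumed ordering, from coldest to hottest.
order = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]

# Rank each label, then divide by the largest rank so values fall in [0, 1].
rank = {label: i for i, label in enumerate(order)}
scaled = temps.map(rank) / (len(order) - 1)

print(scaled.tolist())
```

The lowest label maps to 0, the highest to 1, and the rest are spaced evenly in between, which keeps the order but also assumes equal gaps between neighbouring levels.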

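## Approach 1

One-hot encode every feature.

### Data preprocessing

One-hot encoding every feature is all that is needed, so each sample becomes a sparse vector with 16552 dimensions. `day` and `month` can also be combined to identify a specific day of the year, which gives one extra constructed feature. (The shape printed in cell 15 is 16636 columns; the 84 additional columns presumably come from this combined `date` feature.)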

In [11]:
train = train_copy.copy()
test = test_copy.copy()

target = train['target']
train = train.drop(['target'], axis=1)

In [12]:
dataset = pd.concat([train, test], ignore_index=True)
dataset = dataset.drop(['id'], axis=1)
# np.str was removed in NumPy 1.24; the built-in str does the same job.
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)

In [13]:
"""
Takes about a minute to run
"""
X = pd.get_dummies(dataset, columns=dataset.columns, sparse=True)

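With `sparse=True`, `pandas.get_dummies` returns a DataFrame, and its `to_coo` method (reached through the `.sparse` accessor in current pandas) yields a sparse matrix. To split the result back into training and test sets, however, the COO matrix first has to be converted to a CSR matrix, which supports row slicing.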

In [14]:
# On old pandas versions that still return a SparseDataFrame, call X.to_coo() directly.
X = X.sparse.to_coo().tocsr()
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]

In [15]:
X_train.shape, X_test.shape

((300000, 16636), (200000, 16636))
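The COO-to-CSR step above can be tried on a toy matrix. This is a small sketch with made-up data, assuming SciPy is installed; it only shows that a CSR matrix can be split by row the same way `X` is split into `X_train` and `X_test`.

```python
import numpy as np
from scipy import sparse

# A small matrix standing in for the one-hot encoded data (illustrative values).
dense = np.array([[1, 0, 0],
                  [0, 0, 2],
                  [0, 3, 0],
                  [4, 0, 0]])

# COO stores (row, col, value) triplets; it is convenient for construction.
coo = sparse.coo_matrix(dense)

# CSR stores rows contiguously, so row slices are cheap.
csr = coo.tocsr()
top, bottom = csr[:2], csr[2:]

print(top.shape, bottom.shape)
print(top.toarray())
```

Both halves stay sparse after slicing, so no dense copy of the full matrix is ever materialised.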

### 训练模型

The dataset is large, but because the samples are stored as sparse vectors, training logistic regression is still fast.

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(solver='lbfgs', C=0.1)

cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.80145132, 0.80058547, 0.80766639, 0.80311128, 0.80372911])
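To get a feel for why the sparse representation keeps training cheap, the rough sketch below compares the storage of a CSR matrix with that of an equivalent dense array. The sizes are assumptions chosen for illustration: 1000 rows, the 16636 columns reported above, and 24 active entries per row placed at random positions rather than taken from the real data.

```python
import numpy as np
from scipy import sparse

# Illustrative sizes: only the column count comes from the notebook output.
n_rows, n_cols, active_per_row = 1000, 16636, 24
rng = np.random.default_rng(0)

# Place `active_per_row` ones at random columns in every row.
rows = np.repeat(np.arange(n_rows), active_per_row)
cols = rng.integers(0, n_cols, size=n_rows * active_per_row)
vals = np.ones(n_rows * active_per_row, dtype=np.float64)
X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))

# CSR keeps three arrays: the values, their column indices, and row pointers.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# A dense float64 array needs 8 bytes for every cell, zero or not.
dense_bytes = n_rows * n_cols * 8

print(sparse_bytes, dense_bytes)
```

The sparse form only pays for the non-zero entries, so its size grows with the number of features per sample rather than with the number of one-hot columns.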

### Prediction

In [18]:
lr.fit(X_train, target)

y_proba = lr.predict_proba(X_test)

submission = pd.DataFrame({
    "id": test['id'],
    # Column 1 of predict_proba is P(target == 1); column 0 is P(target == 0).
    "target": y_proba[:, 1]
})

submission.to_csv("../data/cat-in-the-dat/submission.csv", index=False)
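Which column of `predict_proba` holds the probability of the positive class can be checked against `classes_`. The sketch below uses a tiny made-up dataset, not the competition data, to show the lookup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset: larger x means the label is more likely 1.
X_toy = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Columns of predict_proba follow the order of clf.classes_.
positive_col = list(clf.classes_).index(1)
p_positive = proba[:, positive_col]

print(clf.classes_, positive_col)
```

Looking the column up through `classes_` avoids hard-coding an index, which matters for a metric such as ROC AUC where submitting the probability of the wrong class inverts the ranking.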

Submitted to Kaggle, this scored 0.80780, placing 56/347 on the leaderboard.

## Approach 2

One-hot encode everything except the `ord_` features, which are converted to values between 0 and 1.

In [64]:
train = train_copy.copy()
test = test_copy.copy()

target = train['target']
train = train.drop(['target'], axis=1)

In [65]:
dataset = pd.concat([train, test], ignore_index=True)
# np.str was removed in NumPy 1.24; the built-in str does the same job.
dataset['date'] = dataset['month'].astype(str) + '-' + dataset['day'].astype(str)

In [66]:
dataset.shape

(500000, 25)

In [68]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder, OrdinalEncoder, MinMaxScaler

def make_pipeline(estimator_list):
    return Pipeline([
        (estimator.__class__.__name__ + str(i), estimator)
        for i,estimator  in enumerate(estimator_list)
    ])

full_pipeline =  ColumnTransformer([
    ("nom_*", OneHotEncoder(), ['nom_'+str(i) for i in range(0, 10)]),
    ("day/month", OneHotEncoder(), ['day','month']),
    ("ord_0", make_pipeline([
        OrdinalEncoder(),
        MinMaxScaler()
    ]), ["ord_0", "ord_3", "ord_4", "ord_5", "ord_6"]),
    ("ord_1", make_pipeline([
        OrdinalEncoder(categories=[['Novice','Contributor','Expert','Master','Grandmaster']]),
        MinMaxScaler()
    ]), ["ord_1"]),
    ("ord_2", make_pipeline([
        OrdinalEncoder(categories=[['Freezing','Cold','Warm', 'Hot', 'Boiling Hot', 'Lava Hot']]),
        MinMaxScaler()
    ]), ["ord_2"]),
], remainder='passthrough')

for df in [train, test]:
    df.drop(['id'], axis=1, inplace=True)
    df['bin_3'] = df['bin_3'].map({'F': 0, 'T': 1})
    df['bin_4'] = df['bin_4'].map({'N': 0, 'Y': 1})
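    # ord_5 is a two-letter code: store its second letter in a new ord_6 column...
    df['ord_6'] = df['ord_5'].str[1]
    # ...then overwrite ord_5 with its first letter.
    df['ord_5'] = df['ord_5'].str[0]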

dataset = pd.concat([train, test], ignore_index=True)
    
full_pipeline.fit(dataset)

X_train = full_pipeline.transform(train)
X_test = full_pipeline.transform(test)
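When no `categories` argument is given, `OrdinalEncoder` ranks the values in sorted order, which for mixed-case letters puts uppercase before lowercase. The sketch below, on a made-up single-letter column, shows what the `OrdinalEncoder` plus `MinMaxScaler` pair produces in that case; the step names `ordinal` and `scale` are chosen here for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy single-letter column mixing upper and lower case (illustrative data).
letters = np.array([["b"], ["A"], ["a"], ["B"]], dtype=object)

pipe = Pipeline([
    ("ordinal", OrdinalEncoder()),  # default: categories in sorted order
    ("scale", MinMaxScaler()),      # rescale the integer codes to [0, 1]
])
scaled = pipe.fit_transform(letters).ravel()

# The learned category order decides which letter gets which code.
categories = pipe.named_steps["ordinal"].categories_[0].tolist()

print(categories)
print(scaled)
```

If the intended order of a column differs from the sorted order, passing an explicit `categories` list, as is done for `ord_1` and `ord_2`, is the way to control it.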

In [70]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

lr = LogisticRegression(solver='lbfgs', C=0.2)

cross_val_score(lr, X_train, target, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.80179762, 0.80051735, 0.8070874 , 0.80310585, 0.80400994])

In [71]:
"""
Prediction
"""

lr.fit(X_train, target)

y_proba = lr.predict_proba(X_test)

submission = pd.DataFrame({
    "id": test_copy['id'],
    # Column 1 of predict_proba is P(target == 1); column 0 is P(target == 0).
    "target": y_proba[:, 1]
})

submission.to_csv("../data/cat-in-the-dat/submission_order.csv", index=False)