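### Course Example Exercise  
During initial EDA, a few questions inevitably come up:  
How many columns are there of each data type?  
How many distinct categories does each categorical column (`object` in pandas) have?  
How do models handle categorical data, and what representations are available?  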
  

  
  
**[Label Encoder vs. One Hot Encoder in Machine Learning](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621)**  
  * Label encoding: maps each category to an integer; no new columns are added  
  * One-hot encoding: adds one new column per category, using 0/1 to indicate membership  
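
To make the difference concrete, here is a minimal sketch on a toy column. The `color` data and the variable names are made up for illustration and are not part of the course dataset; `pd.factorize` is used here only as one simple way to produce integer codes.

```python
import pandas as pd

# Toy data (hypothetical, not from the course dataset)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Label encoding: one integer code per category, still a single column.
# sort=True assigns codes in sorted category order (blue=0, green=1, red=2).
codes, categories = pd.factorize(df['color'], sort=True)
df['color_label'] = codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['color'], prefix='color', dtype=int)

print(df)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column matching that row's category.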

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
# Directory containing the data files
dir_path = './data/home_credit_default_risk/'

# Build the paths to the files we need
f_app_train = os.path.join(dir_path, 'application_train.csv')
f_app_test = os.path.join(dir_path, 'application_test.csv')

# Read the CSV files with pandas
app_train = pd.read_csv(f_app_train)
app_test = pd.read_csv(f_app_test)

Inspect the data types of the columns and how many columns there are of each type  
**[Pandas](https://blog.csdn.net/starter_____/article/details/79184196)**  
Unique values: `unique()`; value counts: `value_counts()`
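
As a quick illustration of these two methods, here is a small sketch on a made-up Series (the `gender` values are hypothetical):

```python
import pandas as pd

# Toy Series (hypothetical values)
s = pd.Series(['M', 'F', 'F', 'M', 'F'], name='gender')

# Distinct values, in order of first appearance
unique_values = s.unique()

# Frequency of each value, most frequent first
counts = s.value_counts()

print(unique_values)
print(counts)
```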

In [3]:
app_train.dtypes.value_counts()


float64    65
int64      41
object     16
dtype: int64

Inspect the categorical columns (`object` in pandas) in the data  
The values in an `object` column here are Python `str` objects  
[Overview of pandas data types](https://juejin.im/post/5acc36e66fb9a028d043c2a5)  
Reference for [apply()](https://blog.csdn.net/qq_19528953/article/details/79348929)
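
The sketch below shows `select_dtypes` and `apply` working together on a toy frame. The column names and values are invented for illustration, and `dtype=object` is set explicitly so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Toy frame with hypothetical columns; dtype=object is set explicitly
toy = pd.DataFrame({
    'contract': pd.Series(['Cash', 'Revolving', 'Cash'], dtype=object),
    'owns_car': pd.Series(['Y', 'N', 'N'], dtype=object),
    'income': [100.0, 250.0, 80.0],
})

# Keep only the object-typed columns
object_cols = toy.select_dtypes(include=['object'])

# apply() with axis=0 calls the function once per column;
# here it checks that every element of the column is a Python str
all_str = object_cols.apply(
    lambda col: col.map(lambda v: isinstance(v, str)).all(), axis=0
)

print(toy.dtypes)
print(all_str)
```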


In [4]:
# Show the contents of all object columns in application_train.csv
app_train.select_dtypes(include =['object'])

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,OCCUPATION_TYPE,WEEKDAY_APPR_PROCESS_START,ORGANIZATION_TYPE,FONDKAPREMONT_MODE,HOUSETYPE_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE
0,Cash loans,M,N,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,Laborers,WEDNESDAY,Business Entity Type 3,reg oper account,block of flats,"Stone, brick",No
1,Cash loans,F,N,N,Family,State servant,Higher education,Married,House / apartment,Core staff,MONDAY,School,reg oper account,block of flats,Block,No
2,Revolving loans,M,Y,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,Laborers,MONDAY,Government,,,,
3,Cash loans,F,N,Y,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,Laborers,WEDNESDAY,Business Entity Type 3,,,,
4,Cash loans,M,N,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,Core staff,THURSDAY,Religion,,,,
5,Cash loans,M,N,Y,"Spouse, partner",State servant,Secondary / secondary special,Married,House / apartment,Laborers,WEDNESDAY,Other,,,,
6,Cash loans,F,Y,Y,Unaccompanied,Commercial associate,Higher education,Married,House / apartment,Accountants,SUNDAY,Business Entity Type 3,,,,
7,Cash loans,M,Y,Y,Unaccompanied,State servant,Higher education,Married,House / apartment,Managers,MONDAY,Other,,,,
8,Cash loans,F,N,Y,Children,Pensioner,Secondary / secondary special,Married,House / apartment,,WEDNESDAY,XNA,,,,
9,Revolving loans,M,N,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,Laborers,THURSDAY,Electricity,,,,


In [5]:
# Count the distinct categories (unique values) in each object column of application_train.csv
app_train.select_dtypes(include =['object']).apply(pd.Series.nunique , axis=0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
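
One detail worth keeping in mind when reading these counts: `nunique()` skips missing values by default, whereas `unique()` returns `NaN` as a value of its own. A column such as EMERGENCYSTATE_MODE, which has empty entries in the preview above, can therefore report 2 here and still contain three distinct entries once `NaN` is counted. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy column with two categories and one missing value (hypothetical data)
s = pd.Series(['No', 'Yes', np.nan, 'No'], dtype=object)

n_without_nan = s.nunique()              # NaN is ignored
n_with_nan = s.nunique(dropna=False)     # NaN counted as a value
n_from_unique = len(s.unique())          # unique() also keeps NaN

print(n_without_nan, n_with_nan, n_from_unique)
```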

#### Label encoding
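Readers who went through the [reference](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621) carefully will have noticed that label encoding imposes an ordering on the categories within a column (0 < 1 < 2 < ...). For that reason, we demonstrate label encoding here only on categorical columns with at most 2 categories. This does not mean it is the best choice; which representation fits depends on what the column actually means.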
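
The ordering issue can be seen on a small example. In this sketch the list of education levels is made up for illustration; `LabelEncoder` assigns codes following the sorted order of the class labels, which has nothing to do with the actual level of education.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values (hypothetical, loosely modeled on an education column)
education = ['Secondary', 'Higher education', 'Academic degree', 'Secondary']

encoder = LabelEncoder()
encoded = encoder.fit_transform(education)

# classes_ holds the categories in sorted order; the position is the code
print(list(encoder.classes_))
print(list(encoded))
```

A model that treats these codes as numbers would see 'Secondary' (2) as greater than 'Academic degree' (0), an ordering that only reflects alphabetical sorting.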

In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique values (note: unique() counts NaN as a value)
        if len(app_train[col].unique()) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # keep track of how many columns were label encoded
            le_count += 1
        
print('%d columns were label encoded.' %le_count)

3 columns were label encoded.
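
Fitting on the training data and then transforming the test data works only when the test column contains no category that was absent during fitting; otherwise `LabelEncoder.transform` raises a `ValueError`. The sketch below demonstrates this with made-up values (the label 'XNA' is used purely as an example of an unseen category):

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the categories seen in a toy "training" column
encoder = LabelEncoder().fit(['N', 'Y'])

try:
    # 'XNA' was never seen during fit, so transform() fails
    encoder.transform(['N', 'Y', 'XNA'])
    unseen_error = None
except ValueError as exc:
    unseen_error = exc

print(type(unseen_error).__name__)
```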


#### One Hot encoding
One-hot encoding is very convenient in pandas: a single line of code does the job

In [8]:
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print(app_train['CODE_GENDER_F'].head())
print(app_train['CODE_GENDER_M'].head())
print(app_train['NAME_EDUCATION_TYPE_Academic degree'].head())

0    0
1    1
2    0
3    1
4    0
Name: CODE_GENDER_F, dtype: uint8
0    1
1    0
2    1
3    0
4    1
Name: CODE_GENDER_M, dtype: uint8
0    0
1    0
2    0
3    0
4    0
Name: NAME_EDUCATION_TYPE_Academic degree, dtype: uint8
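
Because `pd.get_dummies` is applied to the training and test sets separately, a category that appears in only one of them produces a column the other lacks. One possible way to handle this, shown here as a sketch on toy data rather than as part of the course notebook, is to keep only the shared columns with `DataFrame.align`. The `gender` values and the `TARGET` label column are assumptions made for this example.

```python
import pandas as pd

# Toy frames (hypothetical): 'XNA' appears only in the training data
train = pd.DataFrame({'gender': ['M', 'F', 'XNA'], 'TARGET': [0, 1, 0]})
test = pd.DataFrame({'gender': ['F', 'M']})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Set the label column aside; the inner join would drop it,
# since the test frame has no such column
labels = train_encoded['TARGET']

# Keep only the columns present in both frames
train_aligned, test_aligned = train_encoded.align(
    test_encoded, join='inner', axis=1
)

# Put the label column back on the training frame
train_aligned['TARGET'] = labels

print(train_aligned.columns.tolist())
print(test_aligned.columns.tolist())
```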
