### A. Questions we inevitably ask during initial EDA

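- How many columns are there of each data type?
- How many distinct categories does each categorical column (`object` in pandas) have?
- How does a model handle categorical data, and how can it be represented?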

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
f = os.path.join('data', 'application_test.csv')
df = pd.read_csv(f)

- Count how many columns there are of each data type

In [3]:
df.dtypes.value_counts()

float64    65
int64      40
object     16
dtype: int64

- Count the distinct categories in each categorical column (`object` in pandas)

In [4]:
df.select_dtypes(include=["object"]).nunique()

NAME_CONTRACT_TYPE             2
CODE_GENDER                    2
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               7
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             5
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
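
These counts matter because one-hot encoding adds one column per category, so they give a rough estimate of how wide the encoded frame becomes. The following is a minimal sketch on a made-up toy frame; the helper name `estimate_one_hot_width` and the toy columns are hypothetical and do not come from the dataset above.

```python
import pandas as pd


def estimate_one_hot_width(frame: pd.DataFrame) -> int:
    """Estimate the column count after pd.get_dummies with default settings.

    Numeric columns are kept as-is; each object column contributes one
    column per distinct non-missing category.
    """
    cat_cols = frame.select_dtypes(include=["object"])
    n_other = frame.shape[1] - cat_cols.shape[1]
    return n_other + int(cat_cols.nunique().sum())


# Toy data (hypothetical). dtype=object mirrors the 'object' dtype shown above.
toy = pd.DataFrame({
    "income": [100.0, 250.0, 80.0, 120.0],
    "housing": pd.Series(["rent", "own", "rent", "parents"], dtype=object),
    "gender": pd.Series(["F", "M", "M", "F"], dtype=object),
})

estimated = estimate_one_hot_width(toy)
actual = pd.get_dummies(toy).shape[1]
print(estimated, actual)
```

Under this estimate, a column with many categories (such as one with 58 distinct values) would by itself contribute 58 indicator columns.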

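### B. All three questions are implemented in the reference notebook (see Day_004_column_data_type.ipynb). The third is somewhat more involved; in short, there are two ways to handle categorical data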

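- Label encoding: maps each category to an integer; no new columns are added
- One-hot encoding: adds one new column per category, using 0/1 to indicate whether the row belongs to that category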
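
To make the difference concrete, here is a small sketch on a made-up `color` column (hypothetical data). It uses pandas category codes as a stand-in for a label encoder, so it is an illustration of the idea rather than the notebook's own method. The category codes are assigned in sorted order of the category names.

```python
import pandas as pd

# Toy data (hypothetical); not part of the dataset used in this notebook.
toy = pd.DataFrame(
    {"color": pd.Series(["red", "green", "blue", "green"], dtype=object)}
)

# Label encoding: each category becomes an integer, the column count is unchanged.
# pd.Categorical sorts the categories, so blue=0, green=1, red=2.
label_codes = pd.Series(pd.Categorical(toy["color"]).codes, name="color")

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy, dtype=int)

print(label_codes.tolist())
print(list(one_hot.columns))
print(one_hot.shape)
```

Label encoding keeps the frame narrow but imposes an arbitrary ordering on the categories. One-hot encoding avoids that ordering at the cost of extra columns.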

In [5]:
from sklearn.preprocessing import LabelEncoder

In [6]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in df:
    if df[col].dtype == 'object':
        # Only encode columns with 2 or fewer unique values.
        # unique() counts NaN as a value, so a two-category column
        # that also has missing values is skipped here.
        if len(df[col].unique()) <= 2:
            # Fit the encoder on this column and replace its
            # strings with integer codes in one step
            df[col] = le.fit_transform(df[col])

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)

4 columns were label encoded.


In [9]:
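# One-hot encode the remaining object columns. Columns that were
# label encoded are already integers and are left unchanged.
df = pd.get_dummies(df)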

print(df['CODE_GENDER'].head())


0    0
1    1
2    1
3    0
4    1
Name: CODE_GENDER, dtype: int64


In [11]:
print(df['FLAG_OWN_CAR'].head())
print(df['FLAG_OWN_REALTY'].head())

0    0
1    0
2    1
3    0
4    1
Name: FLAG_OWN_CAR, dtype: int64
0    1
1    1
2    1
3    1
4    0
Name: FLAG_OWN_REALTY, dtype: int64
