# Scikit-learn Feature Engineering

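We use the Titanic dataset again, this time to try some feature engineering and to judge how useful each feature is.

[Reference link](https://medium.com/@yulongtsai/https-medium-com-yulongtsai-titanic-top3-8e64741cc11f)

[Reference link 2](http://www.jasongj.com/ml/classification/)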

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

In [None]:
# load data
train_dat = pd.read_csv('titanic/train.csv')
test_dat = pd.read_csv('titanic/test.csv')

full_dat = pd.concat([train_dat, test_dat], sort = False)
full_dat.reset_index(drop = True, inplace = True)


## Feature engineering

In [None]:
full_dat.head()

In [None]:
full_dat.info()

In [None]:
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
for col in ['PassengerId', 'Age', 'Fare']:
    sns.violinplot(x = full_dat['Survived'], y = full_dat[col])
    plt.show()

In [None]:
# Bar height is the mean of Survived (the survival rate) within each category
for col in ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']:
    sns.barplot(x = full_dat[col], y = full_dat['Survived'])
    plt.show()

In [None]:
full_dat.Cabin.value_counts().head()

In [None]:
full_dat.Ticket.value_counts().head()

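- Categorical variables
    - **PassengerId** : drop
    - **Name** : drop
    - **Ticket** : drop

    - **Cabin** : take the leading letter of the cabin code as a categorical variable; missing values form their own category

    - **Pclass** : categorical; one-hot encode
    - **Sex** : categorical; one-hot encode
    - **Embarked** : categorical; one-hot encode


- Continuous variables

    - **Survived** : the target variable; no special processing
    - **Age** : impute missing values with the group mean for each combination of port of embarkation, passenger class, and sex
    - **Fare** : impute missing values with the median fare; also bin the fare into groups as an additional variable
    - **SibSp** : no special processing; summed with Parch to create an additional variable
    - **Parch** : no special processing; summed with SibSp to create an additional variable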


In [None]:
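# missing-value imputation---#
# Assign the result back instead of calling fillna(..., inplace = True) on a column selection,
# which may not modify full_dat in recent pandas versions.
full_dat['Embarked'] = full_dat['Embarked'].fillna(full_dat['Embarked'].mode()[0])
full_dat['Fare'] = full_dat['Fare'].fillna(full_dat['Fare'].median())

# transform keeps the original row index, so the result aligns with full_dat
full_dat['Age'] = full_dat.groupby(['Pclass', 'Sex', 'Embarked'])['Age'].transform(lambda x: x.fillna(x.mean()))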

In [None]:
# new feature : Family size
full_dat['Family_size'] = full_dat.SibSp+full_dat.Parch+1


# new feature : Fare_bin
full_dat['Fare_bin'] = pd.qcut(full_dat['Fare'], 5)


# new feature : Cabin group
full_dat['Cabin_group'] = full_dat.Cabin.fillna('Z').apply(lambda x: x[0])


In [None]:
# drop columns---#
full_dat.drop(['Name', 'Ticket', 'Cabin', 'PassengerId', 'Fare'], axis = 1, inplace = True)

In [None]:
#one-hot encoding---#
one_hot_dat = pd.get_dummies(full_dat, columns = ['Pclass','Sex','Embarked','Fare_bin','Cabin_group'])
one_hot_dat.head()


#standardization (zero mean, unit variance per column)---#
std_s = StandardScaler()

survived_ = one_hot_dat['Survived']
one_hot_dat.drop('Survived', axis = 1, inplace = True)

normalize_dat = std_s.fit_transform(one_hot_dat)

In [None]:
#train test split---#
test_index = survived_.isna()

train_x = normalize_dat[~test_index]
test_x = normalize_dat[test_index]
train_y = survived_[~test_index]

t_x, v_x, t_y, v_y = train_test_split(train_x, train_y, test_size = 0.2, shuffle = True, random_state = 412)

## Build Model

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(t_x, t_y)

print('training score (decision tree): {:.3f}'.format(dt_model.score(t_x, t_y)))
print('validation score (decision tree): {:.3f}'.format(dt_model.score(v_x, v_y)))

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_true = v_y, y_pred = dt_model.predict(v_x))
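
To read the confusion matrix, it helps to know that scikit-learn puts the true classes on the rows and the predicted classes on the columns. The sketch below uses made-up labels (not the notebook's predictions) to show how accuracy, precision, and recall follow from the four cells.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration): 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels 0/1, rows are true classes and columns are predicted classes,
# so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, the share that is correct
recall = tp / (tp + fn)      # of the actual survivors, the share that is found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```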

## Feature importance

In [None]:
for c,i in zip(one_hot_dat.columns, dt_model.feature_importances_):
    print('{}:{:.3f}'.format(c,i))
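
Importances printed in column order can be hard to scan, so one option is to wrap them in a `pandas.Series` and sort. The sketch below is a minimal example on made-up data, where the column names `signal` and `noise` are hypothetical; the same pattern should apply to `dt_model` and `one_hot_dat.columns`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): 'signal' determines the label, 'noise' is constant.
X = pd.DataFrame({
    'signal': [0, 0, 0, 0, 1, 1, 1, 1],
    'noise':  [5, 5, 5, 5, 5, 5, 5, 5],
})
y = [0, 0, 0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Note that after one-hot encoding, the importance of an original variable is spread over its dummy columns, so it can be worth adding those up before comparing variables.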

---

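## Supervised Learning 3.0

After working through the example above, you should be able to:

- Understand how to do simple data exploration and feature engineering
- Use the feature importance of a tree-based model to judge how much each feature matters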