# **Feature Engineering**

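In the 'Data Preprocessing' and 'Data Visualization' lessons, the processed training data was saved as 'train_processed.csv'. In the examples below, we read this preprocessed training data and engineer its features.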

# **Loading the Data**

First, load the required packages. The preprocessed data file 'train_processed.csv' is stored in the directory 'ml-course'.

In [None]:
import pandas as pd
import numpy as np
import os
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

In [None]:
# List every file found under /kaggle/ (walks all subfolders);
import os
for dirname, _, filenames in os.walk('/kaggle/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load train data already pre-processed;
df_train = pd.read_csv('/kaggle/input/ml-course/train_processed.csv',index_col=0)

Inspect the structure of the data. All features have already been converted to numeric types, and missing values have been filled in.

In [None]:
df_train.head()
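
Looking at the first rows only gives a partial view. As a minimal sketch, the two claims above (numeric types, no missing values) can also be checked programmatically. The frame `toy` and its values are made up for illustration and stand in for `df_train`.

```python
import pandas as pd

# Toy stand-in for the preprocessed table; the values are made up.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# Number of missing values per column (all zeros: nothing left to impute).
missing_per_column = toy.isna().sum()

# Columns whose dtype is not numeric (empty list: everything is numeric).
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

Running the same two checks on `df_train` should confirm whether the preprocessing left any gaps or text columns behind.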

# **Constructing New Features**

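From everyday experience, we guess that combining SibSp (siblings and spouses aboard) and Parch (parents and children aboard), that is, the size of the family travelling together, may carry extra useful information. We add these two columns, plus 1 for the passenger themselves, to form a new feature 'Family'.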

In [None]:
# Combine SibSp and Parch into a new feature;
# Collect the datasets to transform in one list (only the training set here);
all_data=[df_train]

for dataset in all_data:
    dataset['Family'] = dataset['SibSp'] + dataset['Parch'] + 1

The new feature column 'Family' now appears in the table. Next, we plot the relationship between 'Family' and the target 'Survived'.

In [None]:
df_train.head()

In [None]:
# Helper that labels each bar of a plot with its height;
def draw(graph):
    for p in graph.patches:
        height = p.get_height()
        graph.text(p.get_x()+p.get_width()/2., height + 5,height ,ha= "center")

In [None]:
# Family vs survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Family", hue ="Survived", data = df_train)
draw(graph)
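
The count plot shows absolute numbers, which makes small family sizes hard to compare with large ones. As a complementary sketch, the survival rate per family size can be tabulated with `groupby`. The frame `toy` below is made-up illustration data, not the course dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "SibSp":    [1, 0, 0, 1, 3, 0],
    "Parch":    [0, 0, 0, 2, 1, 0],
    "Survived": [0, 1, 0, 1, 0, 1],
})

# Same construction as above: relatives aboard plus the passenger themselves.
toy["Family"] = toy["SibSp"] + toy["Parch"] + 1

# Passenger count and survival rate for each family size.
survival_by_family = toy.groupby("Family")["Survived"].agg(["count", "mean"])
print(survival_by_family)
```

The `count` column matters when reading the rates: a rate computed from one or two passengers says very little.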

# **Binning Features**

The summary statistics show that 'Age' ranges from 0.42 to 80. We expect age groups to be more informative than the continuous age values.

In [None]:
df_train.describe()

In [None]:
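# Convert ages into ordinal bins 0-6;
# Run this cell only once: it overwrites 'Age' in place, so a second run would map every row to bin 0;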
all_data=[df_train]

for dataset in all_data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 15, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 15) & (dataset['Age'] <= 20), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 20) & (dataset['Age'] <= 26), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 28), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 28) & (dataset['Age'] <= 35), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 35) & (dataset['Age'] <= 45), 'Age'] = 5
    dataset.loc[ dataset['Age'] > 45, 'Age'] = 6

df_train['Age'].value_counts()
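
The same grouping can be sketched with `pd.cut`, which states the bin edges once instead of repeating them across several `.loc` assignments. The sample ages and the names `edges` and `age_group` are illustrative assumptions. The edges mirror the thresholds used above.

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on and around every threshold.
ages = pd.Series([0.42, 15, 16, 20, 21, 26, 27, 28, 29, 35, 36, 45, 46, 80])

# Truncate to whole years, as the cell above does with astype(int).
ages_int = ages.astype(int)

# Right-closed intervals: (-inf, 15], (15, 20], ..., (45, inf).
edges = [-np.inf, 15, 20, 26, 28, 35, 45, np.inf]

# labels=False returns the integer index of the bin (0 to 6).
age_group = pd.cut(ages_int, bins=edges, labels=False)
print(age_group.tolist())
```

Because the result is computed from the untouched ages and stored in a separate variable, this version can be re-run without changing the outcome.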

Plot the relationship between 'Age' and 'Survived'.

In [None]:
plt.figure(figsize = (8, 5))
ag = sns.countplot(x='Age', hue='Survived', data=df_train)
draw(ag)

Similarly, we bin 'Fare', dividing the nearly continuous ticket prices into four groups: 'Low_fare', 'median_fare', 'Average_fare', 'high_fare'.

In [None]:
# Bin 'Fare' into four categories, then plot Fare_cat vs Pclass;
# include_lowest=True keeps a fare of exactly 0 in the first bin instead of turning it into NaN;
for dataset in all_data:
    dataset['Fare_cat'] = pd.cut(dataset['Fare'], bins=[0,10,50,100,550], labels=['Low_fare','median_fare','Average_fare','high_fare'], include_lowest=True)
plt.figure(figsize = (8, 5))
ag = sns.countplot(x='Pclass', hue='Fare_cat', data=df_train)

In [None]:
# Fare vs survived;
sns.barplot(x='Fare_cat', y='Survived', data=df_train)
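
One detail of `pd.cut` is worth checking when the lowest bin edge is 0: intervals are right-closed by default, so a value equal to the lowest edge belongs to no bin unless `include_lowest=True` is passed. The sketch below uses the same edges and labels as the 'Fare' binning. The fares themselves are made up.

```python
import pandas as pd

# Made-up fares, including one that is exactly 0.
fares = pd.Series([0.0, 7.25, 10.0, 50.0, 71.28, 512.33])
bins = [0, 10, 50, 100, 550]
labels = ['Low_fare', 'median_fare', 'Average_fare', 'high_fare']

# Default behaviour: the first interval is (0, 10], so 0.0 becomes NaN.
default_cut = pd.cut(fares, bins=bins, labels=labels)

# With include_lowest=True the first interval also contains 0.0.
inclusive_cut = pd.cut(fares, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"fare": fares, "default": default_cut, "inclusive": inclusive_cut}))
```

Whether this matters depends on whether the data contains a fare of exactly 0. Counting the missing values in the binned column is a quick way to find out.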

# **Feature Correlations**

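The correlation coefficients between features indicate how redundant they are. Two features whose correlation coefficient is close to 1.0 in absolute value are largely redundant, so one of them can be dropped from the feature set.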

In [None]:
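# Correlate numeric columns only; the categorical 'Fare_cat' is excluded;
corr=df_train.select_dtypes(include='number').corr()#['Survived']

# np.bool was removed from NumPy; the builtin bool works in every version;
mask = np.zeros_like(corr, dtype=bool)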
mask[np.triu_indices_from(mask)] = True
plt.subplots(figsize = (12,8))
sns.heatmap(corr, 
            annot=True,
            mask = mask,
            cmap = 'RdBu',
            linewidths=.9, 
            linecolor='white',
            vmax = 0.3,
            fmt='.2f',
            center = 0,
            square=True)
plt.title("Correlations Matrix", y = 1,fontsize = 20, pad = 20);
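
Besides reading the heatmap by eye, strongly correlated pairs can be listed programmatically. The following is a minimal sketch on synthetic data. The column names, the frame `toy`, and the cut-off `threshold = 0.9` are all illustrative assumptions, not values taken from the course.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is a linear function of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 1.0,
    "b": rng.normal(size=200),
})

threshold = 0.9  # hypothetical cut-off for "close to 1.0"

# Absolute correlations, so strong negative correlation is caught as well.
corr_abs = toy.corr().abs()

# Keep only the strict upper triangle so that each pair appears once.
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

# Collect the pairs above the threshold (NaN entries compare as False).
redundant_pairs = [
    (row, col, float(upper.loc[row, col]))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(redundant_pairs)
```

Which column of a flagged pair to drop is a judgment call. A common choice is to keep the one that is easier to interpret.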