### 介绍
    * 对数据进行探索性数据分析和特征工程；分成三个部分
    * 探索性数据分析、特征工程、模型

In [636]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go

from sklearn.preprocessing import OneHotEncoder,LabelEncoder,StandardScaler
from sklearn.metrics import roc_curve,auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

import string
import warnings

SEED = 42

### 一、探索新数据分析

In [637]:
# 对训练集和测试集进行合并，进行统一的操作和特征工程
def concat_df(train_data,test_data):
    return pd.concat([train_data,test_data],sort=True).reset_index(drop=True) # sort=true 连接时对列进行排序；drop=true 重新设置索引，并删除原始索引


def devide_df(all_data):
    return all_data.loc[:890],all_data.loc[891:].drop(['Survived'],axis=1)


In [638]:
df_train = pd.read_csv('/Users/duoduo/Desktop/讲义/pandas/泰坦尼克/train.csv')
df_test = pd.read_csv('/Users/duoduo/Desktop/讲义/pandas/泰坦尼克/test.csv')
df_all = concat_df(df_train,df_test)

print('number of train examples = {}'.format(df_train.shape[0]))
print('number od test example = {}'.format(df_test.shape[0]))
print('-'*40)
print(df_train.columns)
print(df_test.columns)

number of train examples = 891
number od test example = 418
----------------------------------------
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


### 1.1 概述：
    passengerID
    Survived(0,1)
    pclass(1,2,3)
    name
    sibsp
    parch
    ticker
    fare
    cabin
    embarked(cqs)

In [639]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [640]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [641]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


### 1.2 缺失值：
    df_train：Age、embarked、cabin列有缺失;
    df_test：Age、fare、cabin列有缺失
    如下所示，age、embarked和fare中的缺失值较少。它们的缺失值可以用描述性统计方法来填补。但大约 80% 的机舱数据是缺失的，不能用描述性统计方法来填补
    

In [642]:
df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [643]:
df_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [644]:
df_all.describe()

Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
count,1046.0,1308.0,1309.0,1309.0,1309.0,1309.0,891.0
mean,29.881138,33.295479,0.385027,655.0,2.294882,0.498854,0.383838
std,14.413493,51.758668,0.86556,378.020061,0.837836,1.041658,0.486592
min,0.17,0.0,0.0,1.0,1.0,0.0,0.0
25%,21.0,7.8958,0.0,328.0,2.0,0.0,0.0
50%,28.0,14.4542,0.0,655.0,3.0,0.0,0.0
75%,39.0,31.275,0.0,982.0,3.0,1.0,1.0
max,80.0,512.3292,9.0,1309.0,3.0,8.0,1.0


### 1.2.1 Age 填充
    年龄数据分布呈偏态，有异常值，选择用中位数填充；


In [645]:
import plotly.figure_factory as ff
fig = ff.create_distplot([df_all['Age'].dropna()],
                         group_labels=['Age'],
                         show_hist=False,
                         show_rug=False)
fig.show()

In [646]:
fig = px.histogram(df_all,x='Age',histnorm='probability density',nbins=50,
                   marginal='rug')
fig.show()

fig2 = px.box(df_all,x='Age')
fig2.show()

计算特征在之间的相关性，不同的客户群体(性别、船舱等级)对年龄分布有影响

In [647]:
# age 和pcalss 相关系数0.408106，survived 与pcalss 相关系数0.338481
df_all_corr = df_all.corr().abs()
df_all_corr

Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
Age,1.0,0.17874,0.150917,0.028814,0.408106,0.243699,0.077221
Fare,0.17874,1.0,0.221539,0.031428,0.558629,0.160238,0.257307
Parch,0.150917,0.221539,1.0,0.008942,0.018322,0.373587,0.081629
PassengerId,0.028814,0.031428,0.008942,1.0,0.038354,0.055224,0.005007
Pclass,0.408106,0.558629,0.018322,0.038354,1.0,0.060832,0.338481
SibSp,0.243699,0.160238,0.373587,0.055224,0.060832,1.0,0.035322
Survived,0.077221,0.257307,0.081629,0.005007,0.338481,0.035322,1.0


* 为了更加准确，在填补缺失的年龄值时，使用'sex'特征作为第二级分组。
* 如下所示可以看出，等级组和性别组具有不同的年龄中值,当乘客等级增加时，男性和女性的年龄中位数也会增加；不过男性的年龄中值 一直 高于女性年龄中值；


In [648]:
age_by_pclass = df_all.groupby('Pclass')['Age'].median()
age_by_pclass

Pclass
1    39.0
2    29.0
3    24.0
Name: Age, dtype: float64

In [649]:
age_by_sex_pclass = df_all.groupby(['Pclass','Sex'])['Age'].median()
age_by_sex_pclass

Pclass  Sex   
1       female    36.0
        male      42.0
2       female    28.0
        male      29.5
3       female    22.0
        male      25.0
Name: Age, dtype: float64

In [650]:
df_all['Age'] = df_all.groupby(['Sex','Pclass'])['Age'].apply(lambda x:x.fillna(x.median()))

## 1.2.2 embarked
Embarked 是一个分类特征，整个数据集中只有 2 个缺失值。这两名乘客都是女性，上层阶级，并且拥有相同的票号。这意味着她们相互认识，并一起从同一港口登船。
高舱位女性乘客的模式登船值为 C（瑟堡），但这并不意味着她们一定从该港口登船。


In [651]:
df_all[df_all.Embarked.isnull()]

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
61,38.0,B28,,80.0,"Icard, Miss. Amelie",0,62,1,female,0,1.0,113572
829,62.0,B28,,80.0,"Stone, Mrs. George Nelson (Martha Evelyn)",0,830,1,female,0,1.0,113572


google查询信息。发现它们canbinB-28

In [652]:
df_all['Embarked'] = df_all['Embarked'].fillna('S')

### 1.2.3 Fare
只有一个缺失值。假设fare与 家庭规模（parch\sibsp)和 pclass特征有关；如下，3-pclass,无家庭成员，男性；进行填充

In [653]:
df_all[df_all.Fare.isnull()]

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
1043,60.5,,S,,"Storey, Mr. Thomas",0,1044,3,male,0,,3701


In [654]:
me_fare = df_all.groupby(['Pclass','Parch','SibSp']).Fare.median()[3][0][0] #[3][0][0]对于的3pclass,0parch，0sibsp
df_all['Fare'] = df_all['Fare'].fillna(me_fare)

### 1.2.4 Cabin
数据中的cabin出现大量缺失。首先看一下对应的存活率；
机舱值的第一个字母代表甲板，甲板的为了给仓位分割，pclass与存活有相关性，因此cabin的特征不能删除

In [655]:
# 提取cabin首字母
import re
df_all['Cabin'] = df_all['Cabin'].str.extract(r'([A-Z])')

In [656]:
df_all_iscabin = df_all[df_all.Cabin.notnull()]
df_all_iscabin.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
1,38.0,C,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599
3,35.0,C,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803
6,54.0,E,S,51.8625,"McCarthy, Mr. Timothy J",0,7,1,male,0,0.0,17463
10,4.0,G,S,16.7,"Sandstrom, Miss. Marguerite Rut",1,11,3,female,1,1.0,PP 9549
11,58.0,C,S,26.55,"Bonnell, Miss. Elizabeth",0,12,1,female,0,1.0,113783


In [657]:
fig = px.histogram(df_all_iscabin,x='Cabin',color='Pclass',
                   facet_col='Survived',
                   text_auto=True,
                   category_orders=dict(Cabin=['A','B','C','D','E','T']),
                   hover_data=['Sex','Pclass']
)
fig.show()



In [658]:
# Creating Deck column from the first letter of the Cabin column (M stands for Missing)
df_all['Deck'] = df_all['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')

df_all_decks = df_all.groupby(['Deck', 'Pclass']).count().drop(columns=['Survived', 'Sex', 'Age', 'SibSp', 'Parch', 
                                                                        'Fare', 'Embarked', 'Cabin', 'PassengerId', 'Ticket']).rename(columns={'Name': 'Count'}).transpose()

def get_pclass_dist(df):
    
    # Creating a dictionary for every passenger class count in every deck
    deck_counts = {'A': {}, 'B': {}, 'C': {}, 'D': {}, 'E': {}, 'F': {}, 'G': {}, 'M': {}, 'T': {}}
    decks = df.columns.levels[0] 
    for deck in decks:
        for pclass in range(1, 4):
            try:
                count = df[deck][pclass][0]
                deck_counts[deck][pclass] = count 
            except KeyError:
                deck_counts[deck][pclass] = 0
                
    df_decks = pd.DataFrame(deck_counts)    
    deck_percentages = {}

    # Creating a dictionary for every passenger class percentage in every deck
    for col in df_decks.columns:
        deck_percentages[col] = [(count / df_decks[col].sum()) * 100 for count in df_decks[col]]
        
    return deck_counts, deck_percentages


def display_pclass_dist(percentages):
    
    df_percentages = pd.DataFrame(percentages).transpose()
    deck_names = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'M', 'T')
    bar_count = np.arange(len(deck_names))  
    bar_width = 0.85
    
    pclass1 = df_percentages[0]
    pclass2 = df_percentages[1]
    pclass3 = df_percentages[2]


all_deck_count, all_deck_per = get_pclass_dist(df_all_decks)
display_pclass_dist(all_deck_per)

In [659]:
# Passenger in the T deck is changed to A
idx = df_all[df_all['Deck'] == 'T'].index
df_all.loc[idx, 'Deck'] = 'A'

In [660]:
df_all_decks_survived = df_all.groupby(['Deck', 'Survived']).count().drop(columns=['Sex', 'Age', 'SibSp', 'Parch', 'Fare', 
                                                                                   'Embarked', 'Pclass', 'Cabin', 'PassengerId', 'Ticket']).rename(columns={'Name':'Count'}).transpose()

def get_survived_dist(df):
    
    # Creating a dictionary for every survival count in every deck
    surv_counts = {'A':{}, 'B':{}, 'C':{}, 'D':{}, 'E':{}, 'F':{}, 'G':{}, 'M':{}}
    decks = df.columns.levels[0]    

    for deck in decks:
        for survive in range(0, 2):
            surv_counts[deck][survive] = df[deck][survive][0]
            
    df_surv = pd.DataFrame(surv_counts)
    surv_percentages = {}

    for col in df_surv.columns:
        surv_percentages[col] = [(count / df_surv[col].sum()) * 100 for count in df_surv[col]]
        
    return surv_counts, surv_percentages

def display_surv_dist(percentages):
    
    df_survived_percentages = pd.DataFrame(percentages).transpose()
    deck_names = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'M')
    bar_count = np.arange(len(deck_names))  
    bar_width = 0.85    

    not_survived = df_survived_percentages[0]
    survived = df_survived_percentages[1]

In [661]:
df_all['Deck'] = df_all.Deck.replace(['A','B','C'],'ABC')
df_all['Deck'] = df_all.Deck.replace(['D','E'],'DE')
df_all['Deck'] = df_all.Deck.replace(['F','G'],'FG')

df_all.Deck.value_counts()

M      1014
ABC     182
DE       87
FG       26
Name: Deck, dtype: int64

In [662]:
df_all.drop(['Cabin'], inplace=True, axis=1)


In [663]:
df_train,df_test = devide_df(df_all)

In [664]:
fig = px.histogram(df_all,x='Age',color='Survived',barmode='group',nbins=20)
fig.show()

In [665]:
fig = px.histogram(df_train,x='Survived',color='Survived',height=400,width=700,text_auto=True
                   )
fig.update_layout(bargap=0.1)
fig.show()

### 二、特征工程

## 2.1 fare

In [666]:
fig = px.histogram(df_all,x='Fare',
                   nbins=30,
                   color='Survived',
                   barmode='group',
                  )
fig.update_layout(bargap=0.2,
                  )
fig.show()

In [667]:
fig = px.histogram(df_all,x='Age',color='Survived',barmode='group',nbins=20)
fig.update_layout(bargap=0.2)
fig.show()

#### familysize

In [668]:
df_all['family_size'] = df_all['SibSp'] + df_all['Parch'] + 1

In [669]:
fig = px.histogram(df_all,x='family_size',color='Survived')
fig.update_layout(bargap=0.1)
fig.show()

有很多 Ticket 值需要分析，因此将它们按频率分组更方便。
该功能与 Family_Size 有何不同？
许多乘客都是随团旅行。这些团体包括朋友、保姆、女佣等。他们不被算作家人，但却使用同一张车票。
为什么不对车票进行分组？如果票据功能中的前缀有任何意义，那么它们已经在Polass 或 Embarked 功能中，因为这可能是从 Ticket 功能中得出的唯一逻辑信息。
票据特征。
根据下图，有 2、3 和 4 名成员的团队存活率较高。单独旅行的乘客存活率最低。4 名成员后，存活率急剧下降。这种模式与
Family_Size 特征非常相似，但也有细微差别。
Ticket_Frequency 值没有像 Family_Size 一样分组，因为这样基本上会产生相同的特征。
那样分组，因为那样基本上就会产生具有完美相关性的相同特征。这种特征不会提供任何
额外的信息增益。

In [670]:
df_all['ticket_frequency'] = df_all.groupby('Ticket')['Ticket'].transform('count')


In [671]:
fig = px.histogram(df_all,x='ticket_frequency',color='Survived',barmode='group')
fig.update_layout(bargap=0.1)
fig.show()


In [672]:
df_all.Name[:5]

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

In [673]:
fig = px.histogram(df_all,x='title')
fig.show()

ValueError: Value of 'x' is not the name of a column in 'data_frame'. Expected one of ['Age', 'Embarked', 'Fare', 'Name', 'Parch', 'PassengerId', 'Pclass', 'Sex', 'SibSp', 'Survived', 'Ticket', 'Deck', 'family_size', 'ticket_frequency'] but received: title

In [None]:
# creating a  categorical variable for age
df_train['agecat'] = ''
df_train['agecat'].loc[(df_train.Age < 18)] = 'young'
df_train['agecat'].loc[(df_train.Age >= 18)&(df_train.Age < 56)] = 'mature'
df_train['agecat'].loc[(df_train.Age >= 56)] = 'senior'


df_train['Familysize'] = ''
df_train['Familysize'].loc[(df_train.SibSp <= 2)] = 'small'
df_train['Familysize'].loc[(df_train.SibSp > 2)&(df_train.SibSp <= 5)] = 'medium'
df_train['Familysize'].loc[(df_train.SibSp > 5)] = 'large'

df_train['isalone'] = ''
df_train['isalone'].loc[(df_train.SibSp) + (df_train.Parch) > 0] ='no'
df_train['isalone'].loc[(df_train.SibSp) + (df_train.Parch) == 0] = 'yes'

df_train['sexcat'] = ''
df_train['sexcat'].loc[(df_train.Sex == 'male')&(df_train.Age <= 21)] ='youngmale'
df_train['sexcat'].loc[(df_train.Sex == 'male')&(df_train.Age > 21)&(df_train.Age < 50)] = 'maturemale'
df_train['sexcat'].loc[(df_train.Sex == 'male')&(df_train.Age > 50)] = 'seniormale'
df_train['sexcat'].loc[(df_train['Sex'] == 'female') & (df_train['Age'] <= 21)] = 'youngfemale'
df_train['sexcat'].loc[(df_train['Sex'] == 'female') & ((df_train['Age'] > 21) & (df_train['Age']) < 50)] = 'maturefemale'
df_train['sexcat'].loc[(df_train['Sex'] == 'female') & (df_train['Age'] > 50)] = 'seniorfemale'


df_train['title'] = df_train['Name'].str.split(',',expand=True)[1].str.split('.',expand=True)[0]
df_train['ismarried'] = 0
df_train['ismarried'].loc[df_train.title == 'Mrs'] = 1
df_train['title'] = df_train['title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')
df_train['title'] = df_train['title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Dr/Military/Noble/Clergy')


df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,agecat,Familysize,isalone,sexcat,title,ismarried
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,mature,small,no,maturemale,Mr,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,mature,small,no,maturefemale,Mrs,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,mature,small,yes,maturefemale,Miss,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,mature,small,no,maturefemale,Mrs,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,mature,small,yes,maturemale,Mr,0


### 三、模型

管道建设
好了，你现在可能想知道 什么是管道？


我们可以把管道理解为应用于数据的一系列操作。就像上图一样，你可以看到一个完整的管道是由多个不同的小管道组成的。将其理解为数据科学：想象一下，每个小管道都是建模过程中的一个步骤。例如

-> 第 1 步：从数值列中填充空值。

-> 第 2 步：对数值特征进行归一化处理，使它们具有相同的比例。

-> 第 3 步：从分类特征中填充空值。

-> 第 4 步：对分类特征进行 OneHotEncode。

-> 第 5 步：拟合机器学习模型并进行评估。

我们可以创建一个管道对象，将所有这些步骤整合在一起，然后将该对象拟合到我们的训练数据中，而不是分别执行这些步骤。

我们为什么要这么做呢？

使用管道有很多好处。以下是我认为与本次讨论最相关的几个方面：

1 - 生产代码更容易实现

将机器学习模型部署到生产环境中时，主要目标是将其用于之前从未见过的数据。为此，需要对新数据进行与训练数据相同的转换。您可以使用一个管道对象来依次应用所有预处理任务，而不是为每个预处理任务设置几个不同的函数。这意味着，只需一行代码，就能应用所有需要的转换。请查看本笔记本 "预测 "部分中的示例。

2 - 与 RandomSearchCV 结合使用时，可以测试多个不同的管道选项

在训练模型时，您一定已经问过自己 "对于这类数据，什么方法最有效？用平均值还是列的中位数来填补缺失值？我应该使用 MinMaxScaler 还是 StandardScaler？应用降维？使用多项式特征等创建更多特征？使用管道和超参数搜索功能（如 RandomSearchCV），您可以自动搜索整套数据管道、模型和参数，从而节省您在搜索最佳特征工程方法和模型/超参数时投入的精力。

假设我们有 4 个不同的管道：

-> 管道 1：通过估算每列的平均值来填补数值特征中的缺失值 - 应用 MinMaxScaler - 对分类特征应用 OneHotEncoder - 将数据拟合到 KNN 分类器中，n_neighbors = 15.

-> 管道 2：通过估算每列的平均值来填补数值特征中的缺失值 - 应用 StandardScaler - 对分类特征应用 OneHotEncoder - 将数据拟合到 n_neighbors = 30 的 KNN 分类器中。

-> 管道 3：通过估算每列的中位数来填补数值特征的缺失值 - 应用 MinMaxScaler - 对分类特征应用 OneHotEncoder - 将数据拟合到 n_estimators = 100 的随机森林分类器中。

->管道 4：通过估算每列的中位数来填补数值特征的缺失值--应用 StandardScaler--对分类特征应用 OneHotEncoder--将数据拟合到 n_estimators = 150 的随机森林分类器中。

最初，您可能会认为，要检查哪个管道更好，只需手动创建所有管道，拟合数据，然后评估结果即可。但是，如果我们想把搜索范围扩大到数百种不同的管道呢？这就很难手动完成了。这就是 RandomSearchCV 发挥作用的地方。

3-交叉验证时不会泄露信息
这一点比较棘手，尤其是对初学者而言。基本上，在交叉验证时，数据应该在每个交叉验证步骤中进行转换，而不是在转换之前。在转换训练集（例如使用 StandardScaler）后进行交叉验证时，训练集的信息会泄露到验证集。这可能会导致有偏差/不理想的结果。

正确的做法是在交叉验证中对数据进行归一化处理。也就是说，在每个交叉验证步骤中，只在训练集上拟合一个标度器。然后，这个标度器对验证集进行转换，并对模型进行评估。这样，训练集的信息就不会泄露给验证集。如果在 RandomSearchCV（或 GridSearchCV）中使用管道，就能解决这个问题。

In [None]:
df = df_train

In [None]:
def get_feature_names(df):
    target = df['Survived']

    df.drop(['PassengerId', 'Survived', 'Ticket', 'Name', 'Cabin'],axis=1,inplace=True)
    # spliting categorical and numerical column dataframes
    categorical_df = df.select_dtypes(include=['object'])
    numeric_df = df.select_dtypes(exclude=['object'])

    categorical_columns = list(categorical_df.columns)
    numeric_columns = list(numeric_df.columns)

    print('categorical_columns:\n',categorical_columns)
    print('numerica_columns:\n',numeric_columns)

    return target,categorical_columns,numeric_columns

target,categorical_columns,numeric_columns = get_feature_names(df)


categorical_columns:
 ['Sex', 'Embarked', 'agecat', 'Familysize', 'isalone', 'sexcat', 'title']
numerica_columns:
 ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'ismarried']
