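Background: The sinking of the Titanic is one of the most infamous maritime disasters in history. On April 15, 1912, during her maiden voyage, the ship struck an iceberg and sank, killing roughly 1,500 people. The disaster drew worldwide attention to ship safety regulations. Some groups of passengers were more likely to survive than others, such as women, children, and the upper class. Our goal is to use machine learning algorithms to predict accurately which passengers survived.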

# Data processing

In [3]:
# Import third-party libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Suppress warnings for the whole notebook
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Load the data
df_train = pd.read_csv('/Users/fangcheng/sklearn/项目一：Titanic数据集乘客获救预测/train.csv')
df_test = pd.read_csv('/Users/fangcheng/sklearn/项目一：Titanic数据集乘客获救预测/test.csv')
# Check the shape of each dataset
print(df_train.shape, df_test.shape)

(891, 12) (418, 11)


The dataset consists of two parts:

Training set: 891 rows
Test set: 418 rows

In [5]:
# Preview the data
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
# Check the column dtypes
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
# Count missing values per column
df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Three features have missing values: "Age", "Cabin", and "Embarked".

In [8]:
# Summary statistics
df_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


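Feature selection
No elaborate feature engineering is done here; the aim is to get a prediction as quickly as possible.

Handling missing values

1. The Cabin column has too many missing values (687 of 891). Filling them in directly would introduce large errors, so this feature is left out for now.
2. Age is expected to have a strong influence on the outcome, so its missing values are filled with the median age.
3. PassengerId is just a sequential identifier with no bearing on the outcome, so it is not used as a feature.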
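The median fill described in item 2 can be sketched on a toy column first. The `toy` DataFrame below is made up for illustration and is not part of the Titanic files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic files
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, np.nan]})

# median() skips missing values by default (skipna=True)
age_median = toy['Age'].median()

# Replace every missing age with that median
toy['Age'] = toy['Age'].fillna(age_median)
print(age_median)
print(toy['Age'].tolist())
```

Because the median is computed before filling, the filled values do not shift it.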

In [11]:
# Fill missing Age values with the median age
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Three models are used to predict the target: linear regression, logistic regression, and random forest.

# Linear regression model

In [12]:
# Import third-party libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

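KFold is the cross-validation splitter class in sklearn.
In machine learning, a dataset A is normally divided into a training set B and a test set C. When samples are scarce, k-fold cross-validation makes fuller use of A for evaluating an algorithm: A is divided into k folds, and each fold in turn serves as the test set while the remaining k-1 folds are used for training. The folds are only random if shuffle=True is passed; with shuffle=False they are consecutive blocks of rows.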
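To make the splitting concrete, here is a minimal sketch of what `KFold.split` yields. The six-row array `X` is a made-up stand-in for the real data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 6 samples with a single feature each
X = np.arange(6).reshape(-1, 1)

kf = KFold(n_splits=3, shuffle=False)

# split() yields one (train indices, test indices) pair per fold
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf.split(X)]
for train_idx, test_idx in folds:
    print('train:', train_idx, 'test:', test_idx)
```

With `shuffle=False` the test folds are consecutive blocks in row order, so concatenating the per-fold predictions gives an array aligned with the original rows.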

In [13]:
# Pick the features that are usable without further processing
features = ['Pclass','Age','SibSp','Parch','Fare']

In [22]:
LR = LinearRegression()
# Split the samples into three folds for cross-validation
kf = KFold(n_splits=3, shuffle=False)
predictions = []
for train_index, test_index in kf.split(df_train):
    # Fit on the training folds
    train_predictors = df_train[features].iloc[train_index, :]
    train_target = df_train['Survived'].iloc[train_index]
    LR.fit(train_predictors, train_target)
    # Predict on the held-out fold
    test_predictions = LR.predict(df_train[features].iloc[test_index, :])
    predictions.append(test_predictions)

In [23]:
predictions=np.concatenate(predictions,axis=0)

In [24]:
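# Threshold the regression output at 0.5 to get 0/1 class labels
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0
# Fraction of passengers whose predicted label matches the true label
accuracy = sum(predictions == df_train['Survived'])/len(predictions)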
print('accuracy:',accuracy)

accuracy: 0.7037037037037037


# Logistic regression model

In [28]:
from sklearn.model_selection import cross_val_score # cross-validation helper
from sklearn.linear_model import LogisticRegression
Lr = LogisticRegression()
score = cross_val_score(Lr,df_train[features],df_train['Survived'],cv=3)
print(score.mean())

0.7003367003367004


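cross_val_score: a cross-validation function for evaluating model performance. It splits the dataset into k subsets, uses each subset in turn as the test set and the remaining subsets as the training set, and returns the k test-set scores. It works with different kinds of estimators, including classifiers and regressors.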
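As a rough illustration of what `cross_val_score` returns, the sketch below runs it on synthetic data. `X` and `y` are invented for the example; the label simply follows the sign of the first feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data: the label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression()
# cv=3 yields three scores, one per held-out fold
scores = cross_val_score(model, X, y, cv=3)
print(scores.shape)
print(scores.mean())
```

For a classifier, the default score is accuracy, and an integer `cv` uses stratified folds, so each fold keeps roughly the same class balance as the full dataset.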

# Adding the "Sex" and "Embarked" features

In [29]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [30]:
sex_map = {'male':0,'female':1}
df_train['Sex'] = df_train['Sex'].map(sex_map)

In [31]:
df_train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [32]:
# Fill the two missing values with the most frequent port, S
df_train['Embarked'] = df_train['Embarked'].fillna('S')

In [33]:
embarked_map = {'S':0,'C':1,'Q':2}
df_train['Embarked'] = df_train['Embarked'].map(embarked_map)

In [34]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,0


In [35]:
features = ['Pclass','Age','SibSp','Parch','Fare','Sex','Embarked']

score = cross_val_score(Lr,df_train[features],df_train['Survived'],cv=3)
print(score.mean())

0.7957351290684623


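The result above shows that adding the 'Sex' and 'Embarked' features improves the model substantially: mean cross-validated accuracy rises from about 0.70 to about 0.80.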
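The integer codes 0/1/2 assigned to Embarked imply an ordering among the ports that a linear model takes literally. One common alternative is one-hot encoding with `pd.get_dummies`, sketched here on made-up rows. The notebook itself does not use this, so treat it as an option rather than the method applied here.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Titanic files
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2],
                    'Embarked': ['S', 'C', 'Q', 'S']})

# One 0/1 indicator column per port instead of a single ordered code
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)
encoded = pd.concat([toy.drop(columns='Embarked'), dummies], axis=1)
print(encoded.columns.tolist())
```

Each row then has exactly one indicator set to 1, so no port is treated as "larger" than another.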

# Random forest model

In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

RFC = RandomForestClassifier(n_estimators = 10,min_samples_split = 2,min_samples_leaf = 1)

kf = KFold(n_splits = 3)

scores = cross_val_score(RFC,df_train[features],df_train['Survived'],cv = kf)
print(scores.mean())

0.7946127946127947


## Using grid search to find the best parameter combination

In [37]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators':[10,20,30,40,50,60,70,80,90,100],'max_depth':[2,3,4,5,6,7,8,9]}

grid = GridSearchCV(RFC,param_grid = param_grid,scoring = 'roc_auc',cv = 5)
grid.fit(df_train[features],df_train['Survived'])

In [38]:
print(grid.best_params_,grid.best_score_)

{'max_depth': 6, 'n_estimators': 20} 0.8727576013543541
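Since the search uses `scoring='roc_auc'`, `best_score_` is a mean cross-validated ROC AUC and is not directly comparable with the accuracy values reported earlier. The sketch below uses synthetic data and a deliberately tiny grid (both are assumptions made for the example). It also shows that the refitted winner is available as `best_estimator_`, so the best parameters do not have to be copied by hand.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Small illustrative grid; random_state fixed so reruns give the same result
param_grid = {'n_estimators': [10, 20], 'max_depth': [2, 4]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid=param_grid, scoring='roc_auc', cv=3)
grid.fit(X, y)

# With the default refit=True, the best model is retrained on all of X, y
best_model = grid.best_estimator_
print(grid.best_params_)
print(best_model.predict(X[:5]))
```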


In [39]:
# Rebuild the model with the best parameters found

RFC = RandomForestClassifier(n_estimators = 20,max_depth = 6)

## Preparing the test set

In [40]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [42]:
# Fill the missing values in df_test

df_test['Age'] = df_test['Age'].fillna(df_test['Age'].median())
# Fill with the median fare; the maximum would assign an atypically high fare
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median())

In [43]:
sex_map = {'male':0,'female':1}
df_test['Sex'] = df_test['Sex'].map(sex_map)
embarked_map = {'S':0,'C':1,'Q':2}
df_test['Embarked'] = df_test['Embarked'].map(embarked_map)
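The cells above compute fill values from the test set itself. A common alternative is to compute them once on the training set and reuse them, so that both sets go through the same transformation. The helper `prepare` and the toy frames below are hypothetical and only sketch that idea.

```python
import numpy as np
import pandas as pd

def prepare(df, age_fill, fare_fill):
    """Hypothetical helper: apply the same cleaning steps to any split."""
    out = df.copy()
    out['Age'] = out['Age'].fillna(age_fill)
    out['Fare'] = out['Fare'].fillna(fare_fill)
    out['Sex'] = out['Sex'].map({'male': 0, 'female': 1})
    out['Embarked'] = out['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})
    return out

# Made-up rows standing in for the real training and test sets
train = pd.DataFrame({'Age': [22.0, 38.0, np.nan],
                      'Fare': [7.25, 71.28, 8.05],
                      'Sex': ['male', 'female', 'male'],
                      'Embarked': ['S', 'C', np.nan]})
test = pd.DataFrame({'Age': [np.nan, 47.0],
                     'Fare': [7.83, np.nan],
                     'Sex': ['male', 'female'],
                     'Embarked': ['Q', 'S']})

# Fill values come from the training set only
age_fill = train['Age'].median()
fare_fill = train['Fare'].median()

train_clean = prepare(train, age_fill, fare_fill)
test_clean = prepare(test, age_fill, fare_fill)
print(test_clean)
```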

In [44]:
# Train the RFC model on the training set, then predict on the test set

RFC.fit(df_train[features],df_train['Survived'])
prediction = RFC.predict(df_test[features])
prediction[:10]

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

In [47]:
prediction

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [50]:
# Combine the prediction array with PassengerId into a DataFrame
submission = pd.DataFrame({
    'PassengerId':df_test['PassengerId'],
    'Survived':prediction
})
submission

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0
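If the result is to be saved as a CSV file (for example, for a Kaggle submission), a minimal sketch could look like the following. The three-row `submission` frame and the output path are stand-ins chosen for the example, not values from this notebook.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the `submission` DataFrame built above
submission = pd.DataFrame({'PassengerId': [892, 893, 894],
                           'Survived': [0, 0, 1]})

# The output path is an assumption; a temp directory avoids side effects
out_path = os.path.join(tempfile.mkdtemp(), 'submission.csv')

# index=False keeps the row index out of the file
submission.to_csv(out_path, index=False)

# Read the file back to confirm it holds exactly the two expected columns
check = pd.read_csv(out_path)
print(check.columns.tolist())
```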
