#  Titanic Data Exploration

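#### This article explores the factors that influenced the survival rate of Titanic passengers:
It starts from three initial hypotheses and explores **three aspects**:
 1. **Women were more likely to survive**;
 2. **Children were more likely to survive**;
 3. **Upper-class passengers were more likely to survive**;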

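**A few notes on this article:**
 1. The analysis uses data on 891 passengers. Every observation here is based on this sample, which is simply assumed to be representative of all passengers;
 2. The real factors are many and interrelated. The conclusions below come from very simple analyses of how individual variables relate to the survival rate. They are one person's view, and discussion is welcome;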

In [20]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



# Load the data from the CSV file and preview the first rows
titanic_df = pd.read_csv('titanic-data.csv')
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [21]:
# Take a quick look at summary statistics for the numeric columns
titanic_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [22]:
titanic_df.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Graham, Mr. George Edward",male,CA. 2343,C23 C25 C27,S
freq,1,577,7,4,644
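
The `count` rows in the two summaries above are lower than 891 for `Age` (714), `Cabin` (204) and `Embarked` (889), which implies 177, 687 and 2 missing values respectively. A minimal sketch of how such gaps could be counted directly is shown here. It uses a small made-up frame (`toy_df` and its values are hypothetical stand-ins, not the real data) so that it runs without the CSV file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df: hypothetical values, real column names
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Number of missing entries in each column
missing_counts = toy_df.isna().sum()

# Fraction of missing entries in each column
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real data the same two calls would be applied to `titanic_df`.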


A few useful facts can be read off these summaries:
 1. The overall survival rate in this sample is about 0.38;
 2.



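#  A few simple explorations
#### 1. Effect of sex on survival rate

The survival rate is 0.74 for women and 0.19 for men. The gap is very large, which supports the conclusion:
####  Women were more likely to survive;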

In [23]:
titanic_df[['Survived','Sex']].groupby('Sex', as_index=False).mean()

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908
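
A rate on its own hides how many passengers stand behind it. As a hedged sketch, the group size and the number of survivors could be reported next to the rate with a named aggregation. The frame below is a hypothetical toy sample (`toy_df` is not the real data), used only to keep the example self-contained:

```python
import pandas as pd

# Toy stand-in for titanic_df: hypothetical values, real column names
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Group size, survivor count and survival rate per sex.
# Survived is coded 0/1, so its mean is the survival rate.
summary = toy_df.groupby("Sex")["Survived"].agg(
    passengers="count",
    survivors="sum",
    survival_rate="mean",
)

print(summary)
```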


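#### 2. Effect of age on survival rate
The `Age` column has only 714 values for 891 passengers, so 177 ages are missing and need to be handled. The intended approach is to fill each gap with the mean age of the passenger's sex. Note that the fill cell shown below uses the overall mean age (about 29.70) instead, as its output confirms;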
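
The per-sex fill described above could be sketched as follows, assuming a frame with `Sex` and `Age` columns. The data here is a hypothetical toy sample, and the names `toy_df`, `sex_age_mean` and `Age_filled` are introduced only for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df: hypothetical values, real column names
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 20.0, np.nan, 40.0, np.nan, 50.0],
})

# For every row, the mean age of that row's sex (NaN ages are ignored
# when the mean is computed). The result has the same index as toy_df.
sex_age_mean = toy_df.groupby("Sex")["Age"].transform("mean")

# Fill only the missing ages; existing ages are left unchanged.
toy_df["Age_filled"] = toy_df["Age"].fillna(sex_age_mean)

print(toy_df)
```

Writing the result to a new column keeps the original `Age` values available, so the imputed rows remain easy to identify later.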

In [30]:
# Overall mean age (missing values are skipped by default)
titanic_age_mean = titanic_df['Age'].mean()
titanic_age_mean

29.69911764705882

In [26]:
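# Mean age per sex (a Series indexed by 'female' / 'male')
sex_age_mean = titanic_df.groupby('Sex')['Age'].mean()
female_age_mean = sex_age_mean['female']
male_age_mean = sex_age_mean['male']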


In [32]:
# Preview the filled ages; fillna returns a new Series and does not modify titanic_df
titanic_df["Age"].fillna(titanic_age_mean)

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
5      29.699118
6      54.000000
7       2.000000
8      27.000000
9      14.000000
10      4.000000
11     58.000000
12     20.000000
13     39.000000
14     14.000000
15     55.000000
16      2.000000
17     29.699118
18     31.000000
19     29.699118
20     35.000000
21     34.000000
22     15.000000
23     28.000000
24      8.000000
25     38.000000
26     29.699118
27     19.000000
28     29.699118
29     29.699118
         ...    
861    21.000000
862    48.000000
863    29.699118
864    24.000000
865    42.000000
866    27.000000
867    31.000000
868    29.699118
869     4.000000
870    26.000000
871    47.000000
872    33.000000
873    47.000000
874    28.000000
875    15.000000
876    20.000000
877    19.000000
878    29.699118
879    56.000000
880    25.000000
881    33.000000
882    22.000000
883    28.000000
884    25.000000
885    39.000000
886    27.000000
887    19.000000
888    29.6991
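
The hypothesis that children were more likely to survive could be checked by grouping ages into bands and comparing survival rates per band. The following is only a sketch on hypothetical toy data: the cut points in `age_bins`, the labels in `age_labels` and the frame `toy_df` are assumptions chosen for illustration, not values taken from this analysis:

```python
import pandas as pd

# Toy stand-in for titanic_df: hypothetical values, real column names
toy_df = pd.DataFrame({
    "Age": [4.0, 9.0, 15.0, 30.0, 45.0, 70.0],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Hypothetical age bands: (0, 12], (12, 18], (18, 60], (60, 120]
age_bins = [0, 12, 18, 60, 120]
age_labels = ["child", "teen", "adult", "senior"]
toy_df["AgeGroup"] = pd.cut(toy_df["Age"], bins=age_bins, labels=age_labels)

# Survival rate within each age band
age_group_rate = toy_df.groupby("AgeGroup", observed=False)["Survived"].mean()

print(age_group_rate)
```

One design consideration: ages filled with a mean all land in a single band, so it may be preferable to run this step on the rows with a recorded age. `pd.cut` leaves missing ages as `NaN`, and `groupby` drops those rows by default.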

#### 3. Effect of passenger class (Pclass) on survival rate

In [28]:
# Survival rate by passenger class (Pclass)
titanic_df[['Survived','Pclass']].groupby('Pclass', as_index=False).mean()

Unnamed: 0,Pclass,Survived
0,1,0.62963
1,2,0.472826
2,3,0.242363


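Survival rates differ markedly across passenger classes:
* Pclass 1 (first class, the upper class): survival rate 0.63;
* Pclass 2 (second class, the middle class): survival rate 0.47;
* Pclass 3 (third class, the lower class): survival rate 0.24;

This supports the conclusion:

**Upper-class passengers were more likely to survive**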
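
Because the factors are interrelated, as noted at the start, it can help to look at two variables at once. The sketch below cross-tabulates survival rate by class and sex with `pivot_table`. It runs on a hypothetical toy sample (`toy_df` is not the real data), so the printed numbers say nothing about the actual passengers:

```python
import pandas as pd

# Toy stand-in for titanic_df: hypothetical values, real column names
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows: passenger class, columns: sex, cells: mean of Survived (survival rate)
rate_table = toy_df.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
)

print(rate_table)
```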
