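# Understanding the Dataset

## Loading the Dataset

The dataset chosen for this analysis is the Titanic passenger data, stored as a CSV file.

Inspecting the file shows a `PassengerId` column that serves as an index, so the data is loaded with pandas and `index_col` is set to `PassengerId`.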

In [9]:
import pandas as pd
import numpy as np

titanic_df = pd.read_csv('titanic-data.csv', index_col='PassengerId')
titanic_df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


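## Field Meanings

Kaggle describes the dataset fields as follows. Kaggle lists the names mostly in lowercase, while the corresponding columns in the loaded DataFrame are capitalized (for example `Survived`, `Pclass`, `SibSp`).

|Variable|Definition|Description|Values|
| :--------- | :----- | :---- | :---- |
|survival|Survival|Whether the passenger survived|0 = No, 1 = Yes|
|pclass|Ticket class|Class of the ticket|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex|Sex of the passenger||
|Age|Age in years|Age of the passenger||
|sibsp|# of siblings / spouses aboard the Titanic|Number of siblings/spouses on board||
|parch|# of parents / children aboard the Titanic|Number of parents/children on board||
|ticket|Ticket number|Ticket number||
|fare|Passenger fare|Ticket price||
|cabin|Cabin number|Cabin number||
|embarked|Port of Embarkation|Port of departure|C = Cherbourg, Q = Queenstown, S = Southampton|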
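The coded values in the table can be mapped to readable labels before plotting or grouping. The following is a minimal sketch, not part of the original notebook: the mapping dictionaries and the `decoded` frame are hypothetical names, and the sample rows reuse the first three rows of the `head()` output instead of reading `titanic-data.csv`.

```python
import pandas as pd

# Code-to-label mappings copied from the field table above.
SURVIVED_LABELS = {0: "No", 1: "Yes"}
PCLASS_LABELS = {1: "1st", 2: "2nd", 3: "3rd"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# First three rows of the head() output (subset of columns).
sample = pd.DataFrame(
    {"Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Embarked": ["S", "C", "S"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# assign() returns a new frame, so the coded columns in `sample` stay intact.
decoded = sample.assign(
    Survived=sample["Survived"].map(SURVIVED_LABELS),
    Pclass=sample["Pclass"].map(PCLASS_LABELS),
    Embarked=sample["Embarked"].map(EMBARKED_LABELS),
)
print(decoded)
```

Keeping the numeric codes in the original frame is a deliberate choice here: aggregations such as a mean of `Survived` need the 0/1 values, while the labels are only useful for display.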

<br>

Notes on the values of some fields:

***pclass:*** A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

<br>

***age:*** Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
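The age convention can be turned into a quick check for estimated values. This is a minimal sketch under stated assumptions: the helper name `flag_estimated_age` and the sample values are hypothetical, and ages below 1 are excluded because the note says they are fractional for a different reason.

```python
import pandas as pd


def flag_estimated_age(age: pd.Series) -> pd.Series:
    """Return True where an age of 1 or more ends in .5 (the 'estimated' form)."""
    # Missing ages compare as False on both conditions, so they are never flagged.
    return (age >= 1) & ((age % 1) == 0.5)


# Hypothetical sample values, not taken from the dataset.
sample_age = pd.Series([22.0, 0.75, 28.5, None, 0.5], name="Age")
estimated = flag_estimated_age(sample_age)
print(estimated.tolist())
```

On the real data the same helper could be applied to `titanic_df['Age']` to count how many ages are estimates.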

<br>

***sibsp:*** The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

<br>

***parch:*** The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.


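# Questions and Hypotheses

## Question: Which factors affected passenger survival, and how did each one affect it?

## Hypotheses:

Before exploring the data, the following factors are assumed to have a substantial effect on survival:

1. Sex: physical and fitness differences between men and women may affect the chance of survival.

2. Age: age directly affects physical ability, so it may be strongly related to the probability of survival.

3. Number of relatives: this covers the two fields sibsp and parch. Having relatives on board may affect a passenger's state of mind.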
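Since the third hypothesis combines two fields, one possible way to work with it is a single derived column. The sketch below is an assumption about how this could be done, not the notebook's own method: the column names `Relatives` and `HasRelatives` are hypothetical, and the sample values are the `SibSp` and `Parch` entries of the first five rows shown by `head()`.

```python
import pandas as pd

# SibSp and Parch values from the first five rows of the head() output.
sample = pd.DataFrame(
    {"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

# Hypothetical derived columns: total relatives on board and a yes/no flag.
sample["Relatives"] = sample["SibSp"] + sample["Parch"]
sample["HasRelatives"] = sample["Relatives"] > 0
print(sample)
```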

<br>

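In addition, it is not yet clear whether the following factors affect the probability of survival, so they are kept for analysis as well:

1. Ticket class and fare

2. Port of embarkation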

<br>

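The following factors are assumed unlikely to be related to the survival rate and are not considered in this analysis:

1. Passenger name

2. Cabin number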
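If the excluded fields are to be removed from the working frame, `DataFrame.drop` is one option. This is only a sketch: the names `EXCLUDED_COLUMNS` and `reduced` are hypothetical, and the sample rows are a subset of the `head()` output, not the full file.

```python
import pandas as pd

# First three rows of the head() output (subset of columns).
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Th...",
            "Heikkinen, Miss. Laina",
        ],
        "Sex": ["male", "female", "female"],
        "Cabin": [None, "C85", None],
    },
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# Columns that the hypotheses above leave out of the analysis.
EXCLUDED_COLUMNS = ["Name", "Cabin"]

# drop() returns a new frame; the original `sample` keeps all columns.
reduced = sample.drop(columns=EXCLUDED_COLUMNS)
print(reduced.columns.tolist())
```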


# Cleaning the Dataset

In [10]:
titanic_df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [12]:
titanic_df.isna()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| 1 | False | False | False | False | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | True | False |
| 6 | False | False | False | False | True | False | False | False | False | True | False |
| 7 | False | False | False | False | False | False | False | False | False | False | False |
| 8 | False | False | False | False | False | False | False | False | False | True | False |
| 9 | False | False | False | False | False | False | False | False | False | True | False |
| 10 | False | False | False | False | False | False | False | False | False | True | False |
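The boolean frame returned by `isna()` is hard to read row by row, so a per-column summary is usually more practical. The sketch below is an illustration, not the notebook's own next step: `missing_counts` and `missing_share` are hypothetical names, and the sample reproduces only the `Age` and `Cabin` columns for passengers 1 to 6, using the values from `head()` and the missing entries shown in the output above.

```python
import numpy as np
import pandas as pd

# Age and Cabin for passengers 1-6; passenger 6 has both values missing.
sample = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
        "Cabin": [None, "C85", None, "C123", None, None],
    },
    index=pd.Index(range(1, 7), name="PassengerId"),
)

# Summing booleans counts the True values, i.e. missing entries per column.
missing_counts = sample.isna().sum()

# The mean of booleans gives the fraction of missing entries per column.
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

Applied to the full frame, `titanic_df.isna().sum()` would show at a glance which columns need cleaning.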
