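# Data Visualization

We again use the Titanic dataset as the running example and walk through the steps of data visualization.

First, load the data and preprocess it: fill in missing values and drop columns that carry no useful information. See the lesson "Data Processing (EDA)" for details.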

# Reading the Data

Start by loading the packages.

In [None]:
import pandas as pd
import numpy as np
import os
# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

The Titanic data is already available in a remote directory; read it into a DataFrame.

In [None]:
# Load data
df_train = pd.read_csv('/kaggle/input/titanic-machine-learning-from-disaster/train.csv')
# Show first lines of data
df_train.head()

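# Data Preprocessing

Following the steps from the lesson "Data Processing (EDA)", preprocessing fills in missing values, converts string-valued features to numbers, and drops columns that are not needed.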

In [None]:
# Handle missing data:
# Cabin has a high rate of missing values; instead of deleting the column,
# encode it as 1 if Cabin is present and 0 otherwise.
df_train['Cabin'] = np.where(df_train['Cabin'].isnull(), 0, 1)

# Only the training set is processed here; the list makes it easy to add more frames.
dataset = [df_train]

# Fill the remaining missing values.
for data in dataset:
    # Fill missing Age with the column mean
    data['Age'] = data['Age'].fillna(data['Age'].mean())

    # Fill missing Embarked with the mode (most frequent value)
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

    # Fill missing Fare with the column mean
    data['Fare'] = data['Fare'].fillna(data['Fare'].mean())

# Delete columns not used below:
# Name, Ticket (the ticket code), and Embarked
drop_column = ['Name', 'Ticket', 'Embarked']
df_train.drop(drop_column, axis=1, inplace=True)
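
Before settling on a fill strategy, it can help to count the missing values per column and to compare the mean with the median. The sketch below is illustrative only: `toy` is a small made-up DataFrame (an assumption, not the Titanic data) so that it runs without the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic table.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 80.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Number of missing values in each column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# The mean is pulled toward large values; the median is not.
age_mean = toy["Age"].mean()      # (22 + 30 + 80) / 3
age_median = toy["Age"].median()  # middle of 22, 30, 80

# Fill by assignment rather than a chained inplace call.
toy["Age"] = toy["Age"].fillna(age_median)
```

On skewed columns such as Fare, the two statistics can differ noticeably, so the choice is worth checking against the data.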

In [None]:
# Convert the 'Sex' feature to numeric.
genders = {"male": 0, "female": 1}
all_data = [df_train]

# Use `data` as the loop variable so the `dataset` list above is not overwritten.
for data in all_data:
    data['Sex'] = data['Sex'].map(genders)
df_train['Sex'].value_counts()

Preprocessing is now complete. Before plotting, take a look at the first few rows of the data.

In [None]:
df_train.head()

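# Data Visualization

Seaborn is a popular plotting library. We use it here to examine visually how each feature relates to the target (Survived).

The helper function below annotates each bar with its height, placing the text centered just above the bar.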

In [None]:
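# Annotate each bar of a plot with its height (for a count plot, the count).
def draw(graph):
    for p in graph.patches:
        height = p.get_height()
        # Skip empty bars (NaN or zero height); there is nothing to label.
        if np.isnan(height) or height == 0:
            continue
        graph.text(p.get_x() + p.get_width() / 2., height + 5, int(height), ha="center")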
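
As an alternative to a hand-written helper, Matplotlib 3.4 and later provide `Axes.bar_label`, which labels the bars of a container. The sketch below assumes such a Matplotlib version and uses a made-up DataFrame (`toy`); the `Agg` backend line is only there so the code runs outside a notebook and can be omitted alongside `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: three non-survivors and two survivors.
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

fig, ax = plt.subplots(figsize=(4, 3))
sns.countplot(x="Survived", data=toy, ax=ax)

# Each container groups the bars drawn in one call; label all of them.
label_texts = []
for container in ax.containers:
    labels = ax.bar_label(container)
    label_texts.extend(t.get_text() for t in labels)

print(label_texts)
```

With a `hue` argument there is typically one container per hue level, so the same loop covers grouped bars as well.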

# **CountPlot**

The plots below cover:

* Survived vs. non-survived
* Cabin vs. survived
* Sex vs. survived
* Pclass vs. survived
* Parch vs. survived
* SibSp vs. survived

In [None]:
# Draw survived vs. non-survived
sns.set(style="darkgrid")

plt.figure(figsize = (8, 5))
graph= sns.countplot(x='Survived', hue="Survived", data=df_train)

draw(graph)

In [None]:
# Cabin and survived;
sns.set(style="darkgrid")
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Cabin", hue ="Survived", data = df_train)
draw(graph)

In [None]:
# Sex and survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Sex", hue ="Survived", data = df_train)
draw(graph)

In [None]:
# Pclass and survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Pclass", hue ="Survived", data = df_train)
draw(graph)

In [None]:
# Parch vs. survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Parch", hue ="Survived", data = df_train)
draw(graph)

In [None]:
# SibSp vs. survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="SibSp", hue ="Survived", data = df_train)
draw(graph)
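
A count plot with `hue` is the picture of a contingency table. As a rough numerical counterpart, `pd.crosstab` produces the same counts as a table. The data below is made up for illustration (`toy` is an assumption, not the Titanic data).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Rows: Pclass, columns: Survived, cells: number of passengers.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])
print(counts)

# Normalizing each row gives the share of each outcome within a class.
shares = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(shares)
```

Row-normalized shares make classes of different sizes easier to compare than raw counts.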

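From everyday experience, we suspect that the combination of SibSp and Parch, that is, the size of the family travelling together, may carry additional useful information. We start with a plot to get an intuitive picture.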

In [None]:
# Combine SibSp and Parch into a new feature, Family
# (the +1 counts the passenger); only the training set is used here.
all_data = [df_train]

for data in all_data:
    data['Family'] = data['SibSp'] + data['Parch'] + 1

In [None]:
# Family vs. survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Family", hue ="Survived", data = df_train)
draw(graph)
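
Raw counts can be hard to compare when group sizes differ a lot. One possible complement is the survival rate per family size: because Survived is 0 or 1, its mean within a group is the fraction that survived. The sketch uses made-up rows (`toy` is an assumption).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 3],
    "Parch":    [0, 0, 2, 0, 2],
    "Survived": [0, 1, 1, 1, 0],
})
toy["Family"] = toy["SibSp"] + toy["Parch"] + 1

# Mean of a 0/1 column = fraction of 1s, i.e. the survival rate.
rate_by_family = toy.groupby("Family")["Survived"].mean()
print(rate_by_family)
```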

# **Line Plot**

In the line plot, the x-axis is the row index and the y-axis shows two columns: Fare and Age.

In [None]:
plt.figure(figsize=(14,6))

plt.title("Line Plot of Age and Fare")

graph = sns.lineplot(data=df_train[['Fare', 'Age']])

The scatter plot below shows each passenger's Age against Survived.

In [None]:
plt.figure(figsize=(14,6))

plt.title("Scatter Plot of Age")

graph = sns.scatterplot(x=df_train['Age'], y=df_train['Survived'])
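
Because Survived takes only the values 0 and 1, the points in such a scatter plot fall on two horizontal lines. One way to summarize the same relationship is to bin Age and compute the survival rate per bin. The bin edges and labels below are illustrative assumptions, and `toy` is made-up data.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age":      [4, 9, 25, 33, 40, 62, 71],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

# Illustrative bins: (0, 12], (12, 60], (60, 100].
bins = [0, 12, 60, 100]
labels = ["child", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survival rate within each age group.
rate_by_age = toy.groupby("AgeGroup", observed=False)["Survived"].mean()
print(rate_by_age)
```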

# **Histogram**

The histogram shows the distribution of Fare; most fares are concentrated between 0 and 100.

In [None]:
plt.figure(figsize=(14,6))

plt.title("Histogram of Fare")

# histplot replaces the deprecated distplot (seaborn >= 0.11)
graph = sns.histplot(x=df_train['Fare'])
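
A histogram is a set of bin counts drawn as bars. To check a reading such as "most fares lie between 0 and 100" numerically, the counts can be computed directly with `np.histogram`. The fares and bin edges below are made up for illustration.

```python
import numpy as np

# Hypothetical fares, not the real column.
fares = np.array([7.25, 8.05, 13.0, 26.0, 71.28, 151.55, 512.33])

# Bins of width 100 from 0 to 600.
edges = np.arange(0, 700, 100)
counts, edges = np.histogram(fares, bins=edges)
print(counts)

# Share of fares that fall in the first bin, [0, 100).
share_below_100 = counts[0] / counts.sum()
print(share_below_100)
```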

A smoothed, continuous version of the histogram: a kernel density estimate (KDE).

In [None]:
plt.figure(figsize=(14,6))

plt.title("KDE Plot of Fare")

# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
graph = sns.kdeplot(data=df_train['Fare'], fill=True)
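
To make the idea of a "continuous histogram" concrete, the sketch below computes a Gaussian kernel density estimate by hand: one Gaussian bump per sample, averaged. It is a simplified illustration, not seaborn's implementation; the function name `gaussian_kde_sketch`, the fixed bandwidth of 10, and the sample fares are all assumptions (seaborn chooses a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance from grid point to sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical fares, not the real column.
fares = [7.25, 8.05, 13.0, 26.0, 71.28]
grid = np.linspace(-100, 200, 3001)
density = gaussian_kde_sketch(fares, grid, bandwidth=10.0)

# A density should integrate to roughly 1 over a wide enough grid.
dx = grid[1] - grid[0]
area = density.sum() * dx
print(area)
```

A smaller bandwidth gives a bumpier curve that follows individual samples; a larger one gives a smoother curve that hides detail.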

# **Bar Chart**

The bar chart shows how Fare varies from passenger to passenger, with one bar per row.

In [None]:
plt.figure(figsize=(14,6))

plt.title("Bar Chart of Fare")

graph = sns.barplot(x=df_train.index, y=df_train['Fare'])
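
With one bar per row, the chart becomes crowded on a dataset of this size. A bar chart is often easier to read when it shows one aggregated value per category, for example the mean Fare per Pclass. The sketch below computes those values with pandas on made-up data (`toy` is an assumption).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare":   [80.0, 60.0, 20.0, 8.0, 7.0],
})

# One value per class: the mean fare.
mean_fare = toy.groupby("Pclass")["Fare"].mean()
print(mean_fare)
```

`sns.barplot(x="Pclass", y="Fare", data=df_train)` would draw the corresponding chart, since `barplot` aggregates with the mean by default.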

# **Heat Map**

The heat map uses color to represent the Fare values.

In [None]:
plt.figure(figsize=(14,6))

plt.title("Heatmap of Fare")

graph = sns.heatmap(data=df_train[['Fare']], annot=True)
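
Heat maps are commonly applied to a matrix rather than a single column. One typical use, sketched here on made-up data, is to display the pairwise correlations between numeric features. The column selection, color map, and `toy` values are illustrative assumptions; the `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Pairwise Pearson correlations between the numeric columns.
corr = toy.corr()

fig, ax = plt.subplots(figsize=(4, 3))
# Fix the color scale to [-1, 1] so that 0 sits in the middle of the map.
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
ax.set_title("Correlation heatmap (toy data)")
```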