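# **A Complete Model**

Using the Titanic dataset as an example, this notebook builds a complete machine learning model. The workflow consists of:
* Loading packages and data
* EDA
  * Inspecting the data
  * Handling missing data
  * Irrelevant and redundant information
  * Handling non-numeric data
* Data visualization
* Feature engineering
* Machine learning models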



# **Loading Packages and Data**

Load the packages and the data.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# ML algorithms
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

In [None]:
# Get train/test data
# Notice that train and test have the same columns EXCEPT Survived
titanic_train = pd.read_csv('/kaggle/input/titanic-machine-learning-from-disaster/train.csv')
titanic_test = pd.read_csv('/kaggle/input/titanic-machine-learning-from-disaster/test.csv')

# **Inspecting the Data**

Check the basic information and summary statistics of the data.

In [None]:
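# A notebook cell only renders its last expression, so display() each frame explicitly
display(titanic_train.head(10))
display(titanic_test.head(10))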

In [None]:
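# Size of the training data as (rows, columns)
print(titanic_train.shape)

# Summary of numeric features; a count below the row total signals missing values
display(titanic_train.describe())

# Column dtypes and non-null counts (info() prints its report directly)
titanic_train.info()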

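# **Handling Missing Data**

The function below reports the missing values in a DataFrame; we use it to check both the training and the test data. The output shows that 'Age' and 'Cabin' have missing values in both sets, 'Embarked' in the training set, and 'Fare' in the test set.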

In [None]:
# Report the number and percentage of missing values in each column of a DataFrame
def check_missing_data(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (total * 100 / len(df)).round(2)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
# Check both train and test data; display() is needed to render both tables
display(check_missing_data(titanic_train))
display(check_missing_data(titanic_test))

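A simple way to handle the missing 'Cabin' values is to recode the column as 0 and 1. Because the Cabin string itself carries no meaning here, we put 0 where Cabin is missing and replace every existing non-null value with 1.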

In [None]:
# Missing data: Cabin has a high rate of missing values; instead of deleting the column,
# encode it as 1 if Cabin is not null, otherwise 0
titanic_train['Cabin']=np.where(titanic_train['Cabin'].isnull(),0,1)
titanic_test['Cabin']=np.where(titanic_test['Cabin'].isnull(),0,1)
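
The recoding above can be checked on a small frame. The sketch below uses a made-up `toy` DataFrame (not the Titanic data) and shows that `np.where` on `isnull()` gives the same 0/1 indicator as the pandas-only form `notna().astype(int)`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cabin column (values are made up for illustration).
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", np.nan]})

# Approach used above: 0 where Cabin is missing, 1 otherwise.
via_where = np.where(toy["Cabin"].isnull(), 0, 1)

# Equivalent pandas-only form: notna() gives booleans, astype(int) turns them into 0/1.
via_notna = toy["Cabin"].notna().astype(int)

print(via_where.tolist())  # [1, 0, 1, 0]
print(via_notna.tolist())  # [1, 0, 1, 0]
```

Either form works; the second one keeps the result as a pandas Series with the original index.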

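For 'Age' and 'Fare', we fill the missing values with the mean of the corresponding column. For 'Embarked', we fill the missing values with the most frequent value in that column, obtained with `data['Embarked'].mode()`.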

In [None]:
# Process train and test with the same steps; each frame is filled with its own statistics
dataset = [titanic_train, titanic_test]

for data in dataset:
    # Fill missing Age with the column mean
    # (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
    data['Age'] = data['Age'].fillna(data['Age'].mean())

    # Fill missing Embarked with the mode (most frequent value)
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

    # Fill missing Fare with the column mean
    data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
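
A common variation, sketched here on made-up frames, is to compute the fill values from the training data only and reuse them for the test data, so that no test statistics enter the preprocessing. The names `toy_train`, `toy_test`, and `fill_values` are illustrative and do not appear in the notebook; this is an alternative to the per-frame means used above, not a description of it.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the train and test data.
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0],
                          "Embarked": ["S", "S", None]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0],
                         "Embarked": [None, "C"]})

# Learn the fill values from the training frame only.
fill_values = {
    "Age": toy_train["Age"].mean(),               # mean() skips NaN -> 30.0
    "Embarked": toy_train["Embarked"].mode()[0],  # most frequent value -> "S"
}

# fillna with a dict fills each column with its own value and returns a new frame.
toy_train_filled = toy_train.fillna(value=fill_values)
toy_test_filled = toy_test.fillna(value=fill_values)

print(toy_test_filled)
```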

After handling the missing data, check again to confirm that neither titanic_train nor titanic_test has any missing values left.

In [None]:
display(check_missing_data(titanic_train))
display(check_missing_data(titanic_test))

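# **Irrelevant and Redundant Information**

In the later steps, we will use the information in the training data to predict survival (i.e. Survived) in the test data. Based on experience, we believe that 'Name' and 'Ticket' have no effect on survival, so we drop these two columns from the data.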

In [None]:
# Delete the irrelevant columns: Name, Ticket (which is the ticket code)
drop_column = ['Name','Ticket']
titanic_train.drop(drop_column, axis= 1, inplace = True)
titanic_test.drop(drop_column,axis = 1,inplace = True)

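# **Handling Non-numeric Variables**

'Sex' is still a string column, and we need to convert it to a numeric variable. In the code below, 'male' is replaced with 0 and 'female' with 1.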

In [None]:
all_data = [titanic_train, titanic_test]

# Convert the 'Sex' feature to numeric
genders = {"male": 0, "female": 1}

for dataset in all_data:
    dataset['Sex'] = dataset['Sex'].map(genders)
titanic_train['Sex'].value_counts()
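
One thing worth checking after `map` is that every value was covered by the dictionary, because values that are not keys become NaN without any error. The sketch below uses a made-up `toy` frame with one deliberately mis-cased entry to show how such rows could be spotted.

```python
import pandas as pd

genders = {"male": 0, "female": 1}

# Made-up data; the last value is deliberately mis-cased.
toy = pd.DataFrame({"Sex": ["male", "female", "Female"]})

encoded = toy["Sex"].map(genders)

# Values that are not keys of the dict become NaN after map().
unmapped = toy.loc[encoded.isna(), "Sex"].tolist()
print(unmapped)  # ['Female']
```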

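# **Data Visualization**


Seaborn is a popular plotting library. We use it here to show visually how each feature relates to the target (Survived).

The plots below include:
* Survived vs. not survived
* Cabin vs. Survived
* Sex vs. Survived
* Pclass vs. Survived
* Embarked vs. Survived
* Parch vs. Survived
* SibSp vs. Survived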

In [None]:
# Annotate each bar of a count plot with its height (the count)
def draw(graph):
    for p in graph.patches:
        height = p.get_height()
        graph.text(p.get_x() + p.get_width() / 2., height + 5, height, ha="center")

In [None]:
# Draw survived vs. not survived
sns.set(style="darkgrid")
plt.figure(figsize = (8, 5))
graph= sns.countplot(x='Survived', hue="Survived", data=titanic_train)
draw(graph)

In [None]:
# Cabin and survived;
sns.set(style="darkgrid")
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Cabin", hue ="Survived", data = titanic_train)
draw(graph)

In [None]:
# Sex and survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Sex", hue ="Survived", data = titanic_train)
draw(graph)

In [None]:
# Pclass and survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Pclass", hue ="Survived", data = titanic_train)
draw(graph)

In [None]:
# Embarked and survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Embarked", hue ="Survived", data = titanic_train)
draw(graph)

The 'Embarked' plot shows that 'Embarked' has little effect on 'Survived'. We treat this column as unimportant and drop it from the data.

In [None]:
# We consider Embarked unimportant, so drop it
drop_column = ['Embarked']
titanic_train.drop(drop_column, axis=1, inplace = True)
titanic_test.drop(drop_column,axis=1,inplace=True)

In [None]:
# Parch vs. survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Parch", hue ="Survived", data = titanic_train)
draw(graph)

In [None]:
# SibSp vs. survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="SibSp", hue ="Survived", data = titanic_train)
draw(graph)

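# **Feature Engineering**

From everyday experience, we guess that the combination of SibSp and Parch, i.e. the number of family members on board (the passenger included), may provide additional useful information. We first plot it to get an intuitive view.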

In [None]:
# Combine SibSp and Parch into a new feature;
# Collect train and test in one list first;
all_data=[titanic_train,titanic_test]

for dataset in all_data:
    dataset['Family'] = dataset['SibSp'] + dataset['Parch'] + 1

In [None]:
# Family vs. survived
plt.figure(figsize = (8, 5))
graph  = sns.countplot(x ="Family", hue ="Survived", data = titanic_train)
draw(graph)

Bin the ages into ranges to replace the original continuous age values.

In [None]:
# Create bins of ages and check ages vs survived;
# Notice that different bins can be used;
# Add new column in all_data;
for dataset in all_data:
    dataset['Age_cat'] = pd.cut(dataset['Age'], bins=[0,12,20,40,120], labels=['Children','Teenage','Adult','Elder'])
    
plt.figure(figsize = (8, 5))
sns.barplot(x='Age_cat', y='Survived', data=titanic_train)
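
How `pd.cut` treats the bin edges is easy to get wrong, so here is a small check on made-up ages. With the default `right=True`, every bin includes its right edge and excludes its left edge, which means a value equal to the lowest edge falls outside all bins.

```python
import pandas as pd

edges = [0, 12, 20, 40, 120]
labels = ['Children', 'Teenage', 'Adult', 'Elder']

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([0.42, 12, 12.5, 20, 40, 80])
age_cat = pd.cut(ages, bins=edges, labels=labels)
print(age_cat.tolist())
# ['Children', 'Children', 'Teenage', 'Teenage', 'Adult', 'Elder']

# A value exactly on the lowest edge (0) is not in (0, 12], so it becomes NaN.
on_lowest_edge = pd.cut(pd.Series([0.0]), bins=edges, labels=labels)
print(on_lowest_edge.isna().tolist())  # [True]
```

Passing `include_lowest=True` would close the first interval on the left if values equal to the lowest edge need to be kept.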

In [None]:
plt.figure(figsize = (8, 5))
ag = sns.countplot(x='Age_cat', hue='Survived', data=titanic_train)
draw(ag)

Bin the fares into ranges to replace the original continuous fares.

In [None]:
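# Check fare vs. survived;
# First bin Fare into categories and plot them against Pclass;
# include_lowest=True keeps fares of exactly 0 in the first bin instead of turning them into NaN
for dataset in all_data:
    dataset['Fare_cat'] = pd.cut(dataset['Fare'], bins=[0,10,50,100,550], include_lowest=True, labels=['Low_fare','median_fare','Average_fare','high_fare'])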
plt.figure(figsize = (8, 5))
ag = sns.countplot(x='Pclass', hue='Fare_cat', data=titanic_train)

In [None]:
# Fare vs survived;
sns.barplot(x='Fare_cat', y='Survived', data=titanic_train)

Based on the age-bin plots above, we decide to use the following age binning.

In [None]:
# Truncate ages to integers, then map each age to an ordinal bin code (0 to 6)
for dataset in all_data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 15, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 15) & (dataset['Age'] <= 20), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 20) & (dataset['Age'] <= 26), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 28), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 28) & (dataset['Age'] <= 35), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 35) & (dataset['Age'] <= 45), 'Age'] = 5
    dataset.loc[ dataset['Age'] > 45, 'Age'] = 6
titanic_train['Age'].value_counts()
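
The chained `.loc` assignments above work because every code they write (0 to 6) is at most 15, so it can never match a later condition, all of which require a value above 15. The same mapping can be written more compactly with `pd.cut`. The sketch below, on made-up integer ages, is one possible equivalent; the `edges` list and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up integer ages sitting on both sides of every boundary.
ages = pd.Series([0, 15, 16, 20, 21, 26, 27, 28, 29, 35, 36, 45, 46, 80])

# Manual version, mirroring the chained .loc assignments.
df = pd.DataFrame({"Age": ages})
df.loc[df["Age"] <= 15, "Age"] = 0
df.loc[(df["Age"] > 15) & (df["Age"] <= 20), "Age"] = 1
df.loc[(df["Age"] > 20) & (df["Age"] <= 26), "Age"] = 2
df.loc[(df["Age"] > 26) & (df["Age"] <= 28), "Age"] = 3
df.loc[(df["Age"] > 28) & (df["Age"] <= 35), "Age"] = 4
df.loc[(df["Age"] > 35) & (df["Age"] <= 45), "Age"] = 5
df.loc[df["Age"] > 45, "Age"] = 6

# pd.cut version: right-inclusive bins, labels=False returns the bin index.
edges = [-np.inf, 15, 20, 26, 28, 35, 45, np.inf]
binned = pd.cut(ages, bins=edges, labels=False)

print(binned.tolist())
# [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
```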

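Based on the feature analysis above, the following features are considered unimportant (or have already been combined into other features) and are dropped from the data.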

In [None]:
# Remove features that are not used or have been combined into other features
for dataset in all_data:
    drop_column = ['Age_cat','Fare','SibSp','Parch','Fare_cat','PassengerId']
    dataset.drop(drop_column, axis=1, inplace = True)

Check the features that remain.

In [None]:
titanic_train.head()

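# **Feature Correlation**

Feature correlations help reveal whether the data contains redundant information. The correlation matrix shows no redundant information.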

In [None]:
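# Correlation matrix of the remaining (all numeric) columns
corr = titanic_train.corr()

# Mask the upper triangle, diagonal included, so each pair is shown once;
# np.bool was removed from NumPy, so use the built-in bool
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True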
plt.subplots(figsize = (12,8))
sns.heatmap(corr, 
            annot=True,
            mask = mask,
            cmap = 'RdBu',
            linewidths=.9, 
            linecolor='white',
            vmax = 0.3,
            fmt='.2f',
            center = 0,
            square=True)
plt.title("Correlations Matrix", y = 1,fontsize = 20, pad = 20);
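
To see what the mask does without drawing anything, the sketch below builds the same kind of upper-triangle mask for a made-up three-column frame and applies it directly to the correlation matrix. `np.triu` on an all-True array is an alternative way to build the mask used above.

```python
import numpy as np
import pandas as pd

# Made-up numeric data, only to obtain a small correlation matrix.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})
corr = toy.corr()

# True on and above the diagonal, False below it.
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)

# DataFrame.mask() replaces the True positions with NaN,
# leaving only the lower triangle that the heatmap would display.
lower = corr.mask(mask)
print(lower)
```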

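# **Machine Learning Models**

Before running any machine learning model, we first separate the features from the label in the training data: the 'Survived' column is the label and all other columns are features. The features are stored in X_train and the label in y_train.

Next, select the test data. Note that the test data has no 'Survived' label column; its other columns are the same as in the training data. The goal of the model is to predict the labels of the test data.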

In [None]:
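# Re-organize the data; keep the columns with useful features
input_cols = ['Pclass', "Sex", "Age", "Cabin", "Family"]
output_cols = ["Survived"]
X_train = titanic_train[input_cols]
# Flatten the label to 1-D; scikit-learn warns when y is passed as a one-column DataFrame
y_train = titanic_train[output_cols].values.ravel()

# Select the same feature columns, in the same order, from the test data
X_test = titanic_test[input_cols]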

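Calling a scikit-learn model typically follows these standard steps:

* Build the model and set its parameters
* Fit the model to the training data to obtain the final model
* Use the model to predict the labels of the test data and compute performance metrics

The examples below apply different models to the same training and test data.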
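
The three steps can be sketched on synthetic data as follows. Because the Titanic test set has no labels, this sketch assumes part of the labelled data is held out with `train_test_split` so that a performance metric can be computed on rows the model has not seen. The data and the names `X_fit`, `X_val`, `y_fit`, `y_val` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, nearly linearly separable data (not the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 25% of the labelled rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression()              # 1. build the model and set its parameters
clf.fit(X_fit, y_fit)                   # 2. fit it to the training portion
y_val_pred = clf.predict(X_val)         # 3. predict labels for unseen rows ...
val_accuracy = clf.score(X_val, y_val)  # ... and compute accuracy on them

print(f"validation accuracy: {val_accuracy:.3f}")
```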

1. Logistic Regression

In [None]:
# Logistic regression

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_lr = model.predict(X_test)
# Accuracy on the training data; optimistic, since the model has already seen these rows
model.score(X_train, y_train)


Apply 5-fold cross-validation to the Logistic Regression model above and display the result of each fold.

In [None]:

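from sklearn.model_selection import cross_val_score
# For 0/1 labels the mean absolute error equals the misclassification rate,
# so each value below is the error rate of one fold
-cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')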
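
For a binary classifier, the negated `neg_mean_absolute_error` score is simply one minus the accuracy of each fold. The sketch below illustrates this on synthetic data; it passes an explicit `StratifiedKFold` object with a fixed seed so that both calls use identical folds. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic noisy binary data (not the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Fixed folds so the two scoring runs are directly comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression()

fold_error = -cross_val_score(clf, X, y, cv=cv, scoring='neg_mean_absolute_error')
fold_accuracy = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')

print(fold_error)
print(fold_accuracy)
print(np.allclose(fold_error + fold_accuracy, 1.0))
```

Reporting `scoring='accuracy'` directly is usually easier to read for a classification task.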

2. K-Nearest Neighbor

In [None]:
# KNN
model = KNeighborsClassifier(n_neighbors = 3) 
model.fit(X_train, y_train)  
y_pred_knn = model.predict(X_test)
model.score(X_train,y_train)

3. Gaussian Naive Bayes

In [None]:
# Gaussian naive Bayes (GaussianNB was already imported in the first cell)
model= GaussianNB()
model.fit(X_train,y_train)
y_pred_gnb=model.predict(X_test) 
model.score(X_train,y_train)

4. Support Vector Machines

In [None]:
# Linear SVM
model  = LinearSVC()
model.fit(X_train, y_train)

y_pred_svc = model.predict(X_test)
model.score(X_train,y_train)

5. Random Forest

In [None]:
# Random forest
model  = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

y_pred_rf = model.predict(X_test)
model.score(X_train,y_train)

6. Decision Tree

In [None]:
# Decision tree
model = DecisionTreeClassifier() 
model.fit(X_train, y_train)
y_pred_dt = model.predict(X_test) 
model.score(X_train,y_train)
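
The training-set accuracies reported above tend to favour flexible models such as decision trees and random forests, which can fit the training rows almost perfectly. A fairer comparison, sketched below on synthetic data, is to score every candidate with the same cross-validation. The `candidates` dictionary and the data are made up for illustration and cover only some of the models used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy binary data (not the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

candidates = {
    "LogisticRegression": LogisticRegression(),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(est, X, y, cv=5, scoring='accuracy').mean()
    for name, est in candidates.items()
}

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```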