# **Machine Learning Models**

Continuing with the Titanic data as the example, this section shows how to use several common machine learning models.

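In the previous lessons, we pre-processed the training data and constructed new features. The training and test data, arranged by feature, are saved in files under the directory 'ml_course\'. Here we skip the data pre-processing and feature engineering, read in the prepared training and test data directly, and try out several different machine learning models.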

# **Packages and Data Loading**

First, load the packages. The models used below are imported from scikit-learn.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# ML algorithms and utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

List the files in the directory.

In [None]:
# List every file under the '/kaggle/' directory tree;
import os
for dirname, _, filenames in os.walk('/kaggle/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Read in the training data train_features.csv and the test data test_features.csv, which contain the prepared features.

In [None]:
# Load the pre-processed training and test data;
titanic_train = pd.read_csv('/kaggle/input/ml-course/train_features.csv', index_col=0)
titanic_test = pd.read_csv('/kaggle/input/ml-course/test_features.csv', index_col=0)

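Check the first few rows of the training and test data. The prepared features are visible: 'Pclass', 'Sex', 'Age', 'Cabin', 'Family'. All features are numeric. The column 'Survived' in the training data 'titanic_train' serves as the label.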

In [None]:
titanic_train.head()

In [None]:
titanic_test.head()

# **Training Machine Learning Models**

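Before running a machine learning model, the features and labels in the training data must be separated: the column 'Survived' is the label, and all other columns are features. The features are stored in X_train and the labels in y_train.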

In [None]:
# Re-organize the data; keep the columns with useful features;
input_cols = ["Pclass", "Sex", "Age", "Cabin", "Family"]
output_cols = ["Survived"]
X_train = titanic_train[input_cols].values
# Flatten the labels to a 1-D array, the shape scikit-learn classifiers expect;
y_train = titanic_train[output_cols].values.ravel()
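How the label column is selected determines the shape of the resulting array, and scikit-learn classifiers expect labels as a 1-D array (a 2-D column vector typically triggers a `DataConversionWarning`). The sketch below illustrates the difference on a hypothetical four-row table; `toy_train` and its values are made up and are not taken from train_features.csv.

```python
import pandas as pd

# Hypothetical miniature of the training table (made-up values).
toy_train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": [0, 1, 1, 0],
    "Age": [22, 38, 26, 35],
    "Cabin": [0, 1, 0, 0],
    "Family": [1, 1, 0, 0],
})

input_cols = ["Pclass", "Sex", "Age", "Cabin", "Family"]

X_toy = toy_train[input_cols].values      # list of columns -> 2-D feature matrix
y_2d = toy_train[["Survived"]].values     # list with one column -> 2-D column vector
y_1d = toy_train["Survived"].values       # single column name -> 1-D vector

print(X_toy.shape, y_2d.shape, y_1d.shape)
```

Calling `.ravel()` on the 2-D version gives the same 1-D array as selecting the column by its name.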

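Select the test data. Note that the test data has no label column 'Survived'; its other columns are the same as in the training data. The goal of the model is to predict the labels of the test data.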

In [None]:
# Select the same feature columns, in the same order, as for X_train;
X_test = titanic_test[input_cols].values
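Because the model only sees a plain array, the test columns must match the training columns in both name and order. The following is a minimal sketch of a check one might run first; `toy_test` is a hypothetical table with made-up values, not the contents of test_features.csv.

```python
import pandas as pd

input_cols = ["Pclass", "Sex", "Age", "Cabin", "Family"]

# Hypothetical test table (made-up values) whose columns are in a different order.
toy_test = pd.DataFrame({
    "Family": [0, 2],
    "Age": [34, 47],
    "Pclass": [3, 1],
    "Cabin": [0, 1],
    "Sex": [0, 1],
})

# Fail early if a feature column is missing or a label column is present.
missing = [c for c in input_cols if c not in toy_test.columns]
assert not missing, f"missing feature columns: {missing}"
assert "Survived" not in toy_test.columns

# Selecting by name fixes the column order regardless of the file layout.
X_toy_test = toy_test[input_cols].values
print(X_toy_test.shape)
print(X_toy_test[0])
```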

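Typically, using a scikit-learn machine learning model involves the following standard steps:
* Construct the model and set its parameters
* Fit the model to the training data to obtain the final model
* Use the model to predict the labels of the test data, and compute performance metrics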
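The three steps above can be sketched on a small synthetic problem. The data here is randomly generated for illustration only (it is not the Titanic data), and the names `X_demo`, `y_demo`, and `clf` are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data: the label is 1 when the two features sum to a positive value.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Keep the last 50 rows aside to play the role of unseen data.
X_fit, y_fit = X_demo[:150], y_demo[:150]
X_new, y_new = X_demo[150:], y_demo[150:]

clf = LogisticRegression()        # step 1: construct the model and set its parameters
clf.fit(X_fit, y_fit)             # step 2: fit the model to the training data
y_hat = clf.predict(X_new)        # step 3: predict labels for unseen data ...
acc = clf.score(X_new, y_new)     # ... and compute a metric (mean accuracy)

print(y_hat.shape, acc)
```

Every scikit-learn classifier exposes the same `fit`, `predict`, and `score` methods, so swapping in another model only changes the constructor line.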

In the examples below, different models are applied to the same training and test data.

1. Logistic Regression

In [None]:
# Logistic regression;

# Construct the model; the parameters are set to their default values;
model = LogisticRegression(penalty='l2',tol=0.0001,random_state=None,solver='lbfgs')
# Fit the model to the data;
model.fit(X_train,y_train)

# Use the model to predict the labels of test data;
y_pred_lr=model.predict(X_test)

# Check the accuracy of the model on the training data it was fitted to;
model.score(X_train,y_train)
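The score above is computed on the same rows the model was fitted on, so it tends to be optimistic. One common alternative, sketched here under the assumption that part of the labelled data can be held out, uses `train_test_split` (imported at the top but not used in the cells). The data is synthetic, generated with `make_classification`, and the 25% split size is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labelled table with five numeric features.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Hold out 25% of the labelled rows; stratify keeps the class ratio similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

val_model = LogisticRegression(max_iter=1000)
val_model.fit(X_tr, y_tr)

train_acc = val_model.score(X_tr, y_tr)    # accuracy on rows the model has seen
val_acc = val_model.score(X_val, y_val)    # accuracy on rows the model has not seen

print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

The validation accuracy is usually the more honest guide when comparing models, since the Titanic test data here carries no labels to score against.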

2. KNN

In [None]:
# KNN
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred_knn = model.predict(X_test)
model.score(X_train, y_train)

3. Gaussian Naive Bayes

In [None]:
# Gaussian naive Bayes (GaussianNB was already imported at the top)
model = GaussianNB()
model.fit(X_train,y_train)
y_pred_gnb=model.predict(X_test) 
model.score(X_train,y_train)

4. Linear Support Vector Machines

In [None]:
# Linear SVM
model = LinearSVC()
model.fit(X_train, y_train)

y_pred_svc = model.predict(X_test)
model.score(X_train,y_train)

5. Random Forest

In [None]:
# Random forest
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

y_pred_rf = model.predict(X_test)
model.score(X_train,y_train)

6. Decision Tree

In [None]:
# Decision tree
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred_dt = model.predict(X_test)
model.score(X_train,y_train)
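The six cells above all follow the same pattern, so a side-by-side comparison can be sketched as a single loop. The version below is an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the Titanic features, and it adds 5-fold cross-validation via `cross_val_score`, which the cells above do not use. The `max_iter` and `random_state` settings are choices made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a labelled table with five numeric features.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Gaussian Naive Bayes": GaussianNB(),
    "Linear SVM": LinearSVC(max_iter=10000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, clf in models.items():
    # Mean accuracy over 5 folds; cross_val_score fits clones and leaves clf unfitted.
    cv_acc = cross_val_score(clf, X_syn, y_syn, cv=5).mean()
    # Accuracy on the very rows used for fitting, as in the cells above.
    train_acc = clf.fit(X_syn, y_syn).score(X_syn, y_syn)
    results[name] = (train_acc, cv_acc)
    print(f"{name:22s} train={train_acc:.3f}  cv={cv_acc:.3f}")
```

An unpruned decision tree can memorize its training rows, so its training accuracy says little about how it will do on new data; the cross-validated column is the one to compare.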