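I am new to machine learning and recently finished Google's Machine Learning Crash Course ([https://developers.google.com/machine-learning/crash-course/](https://developers.google.com/machine-learning/crash-course/)). Most machine learning examples online use R or Python's sklearn, so here I try to solve the Titanic problem with TensorFlow instead.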

**1.  Import Libraries and Dataset**

1.1 Import libraries

In [None]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset
import datetime
import random
import seaborn as sns
from functools import reduce

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.2f}'.format


1.2 Import dataset

In [None]:
train = pd.read_csv("../input/train.csv", sep=",")
test = pd.read_csv("../input/test.csv", sep=",")
full = pd.concat([train, test])  # DataFrame.append was removed in pandas 2.0

#train = train.reindex(np.random.permutation(train.index))
full.info()

In [None]:
train.head(5)

In [None]:
train.describe()

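With a first impression of the data, the next step is to look at how each column relates to Survived.

**2.  Feature Engineering**

Find the existing features that are related to the target, repair the missing entries, and derive potentially useful new features from the existing ones.

2.1 Observation

Feature engineering usually takes the most time in data mining. Since I have little experience in machine learning, I also consulted other excellent kernels for this part. I start with Pclass, Sex, Embarked, SibSp and Parch, the features with few distinct values, and plot them for analysis.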

In [None]:
f,ax = plt.subplots(2,3,figsize=(16,10))
# Pass the column as x=...; recent seaborn versions no longer accept it positionally
sns.countplot(x='Pclass',hue='Survived',data=train,ax=ax[0,0])
sns.countplot(x='Sex',hue='Survived',data=train,ax=ax[0,1])
sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[0,2])
sns.countplot(x='SibSp',hue='Survived',data=train,ax=ax[1,0])
sns.countplot(x='Parch',hue='Survived',data=train,ax=ax[1,1])

From the charts above, a first impression of the survival rates is:

Pclass: 1 > 2 > 3

Sex: female > male

Embarked: C > Q > S

SibSp: 1 > others

Parch: 1 > others

All of these can serve as useful features for the model.

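2.2 Missing Data

full.info() shows that Age, Cabin, Embarked and Fare all have missing entries. Left untreated, they would distort the computation and could keep training from converging, so the gaps are filled first.

A gap can simply be filled with the mean, median or mode of its column. Alternatively, rows can be grouped by other features and the group's average used as the replacement. A new placeholder value is also an option, because the fact that a value is missing is itself a piece of information.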
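
The three options can be sketched as follows. This is only an illustration on a small hypothetical frame called `toy` (not the Titanic data); the sentinel value `-1.0` and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the three strategies
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# 1) Fill with a global statistic (here the median of the whole column)
global_fill = toy["Age"].fillna(toy["Age"].median())

# 2) Fill with a statistic computed within each group (mean Age per Pclass)
group_fill = toy["Age"].fillna(toy.groupby("Pclass")["Age"].transform("mean"))

# 3) Keep the missingness as information: a flag plus a sentinel value
is_missing = toy["Age"].isna()
sentinel_fill = toy["Age"].fillna(-1.0)
```

Which option fits best depends on the column; the rest of this section picks one per feature after looking at the plots.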

Embarked serves as the first example below: mark the missing values as X and look at how they relate to Fare and Pclass:

In [None]:
x = 'Embarked'
y = 'Fare'
hue = 'Pclass'
data = full.copy()
data['Embarked'] = data['Embarked'].fillna('X')  # assign back instead of chained inplace fill
f, ax = plt.subplots(figsize=(8, 5))
fig = sns.boxplot(x=x, y=y,  data=data)
fig.axis(ymin=0, ymax=200);

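As the plot shows, the fares of the passengers with a missing Embarked are outliers relative to Embarked S and Q, so their port is more likely to be C. Since Fare is also related to Pclass, the next plot splits the boxes by Pclass: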

In [None]:
f, ax = plt.subplots(figsize=(8, 5))
fig = sns.boxplot(x=x, y=y, hue=hue, data=data)
fig.axis(ymin=0, ymax=250);

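The passengers with a missing Embarked are in Pclass 1, and their Fare matches the median for Embarked=C, Pclass=1, so the missing values are filled with C:

In [None]:
full['Embarked'] = full['Embarked'].fillna('C')

Next, look at how Fare relates to Pclass, Sex and Embarked: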

In [None]:
f,ax = plt.subplots(1,3,figsize=(16,5))

x = 'Embarked'
y = 'Fare'
hue = 'Sex'
data = full.copy()
fig = sns.boxplot(x=x, y=y, hue=hue, data=data,ax=ax[0])
fig.axis(ymin=0, ymax=300);

x = 'Sex'
y = 'Fare'
hue = 'Pclass'
data = full.copy()
fig = sns.boxplot(x=x, y=y, hue=hue, data=data,ax=ax[1])
fig.axis(ymin=0, ymax=300);

x = 'Pclass'
y = 'Fare'
hue = 'Embarked'
data = full.copy()
fig = sns.boxplot(x=x, y=y, hue=hue, data=data,ax=ax[2])
fig.axis(ymin=0, ymax=300);

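Fare is related to Embarked, Sex and Pclass, and it contains a fair number of outliers. The missing Fare values are therefore filled with the median of the group defined by these three features: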

In [None]:
for sex in full.Sex.unique():
    for pclass in full.Pclass.unique():
        for embarked in full.Embarked.unique():
            features = (full.Sex == sex) & (full.Pclass == pclass) & (full.Embarked == embarked)
            select_nan = np.isnan(full["Fare"]) & features
            full.loc[select_nan,'Fare'] = full[features].Fare.median()
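
The nested loops above could also be expressed with `groupby` and `transform`. The sketch below runs on a hypothetical toy frame rather than on `full`, and is meant as an alternative formulation, not as the code this kernel used.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Embarked": ["S", "S", "S", "C", "C"],
    "Fare": [7.0, 9.0, np.nan, 80.0, np.nan],
})

# Median Fare of each (Sex, Pclass, Embarked) group, aligned to the original rows
group_median = toy.groupby(["Sex", "Pclass", "Embarked"])["Fare"].transform("median")

# Only the missing entries are replaced; observed fares stay untouched
toy["Fare"] = toy["Fare"].fillna(group_median)
```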

Next comes Age. As before, look at how Age relates to the other features:

In [None]:
f,ax = plt.subplots(1,5,figsize=(20,5))

x = 'Embarked'
y = 'Age'
data = full.copy()
fig = sns.boxplot(x=x, y=y, data=data,ax=ax[0])
fig.axis(ymin=0, ymax=85);

x = 'Sex'
y = 'Age'
data = full.copy()
fig = sns.boxplot(x=x, y=y,  data=data,ax=ax[1])
fig.axis(ymin=0, ymax=85);

x = 'Pclass'
y = 'Age'
data = full.copy()
fig = sns.boxplot(x=x, y=y,  data=data,ax=ax[2])
fig.axis(ymin=0, ymax=85);

x = 'Parch'
y = 'Age'
data = full.copy()
fig = sns.boxplot(x=x, y=y,  data=data,ax=ax[3])
fig.axis(ymin=0, ymax=85);

x = 'SibSp'
y = 'Age'
data = full.copy()
fig = sns.boxplot(x=x, y=y,  data=data,ax=ax[4])
fig.axis(ymin=0, ymax=85);

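Judging from the five plots above, Age shows little correlation with Sex, and Parch is a poor grouping key because parents and children within the same Parch value differ widely in age. Embarked, Pclass and SibSp are therefore used for grouping, and each group's mean fills the missing values: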

In [None]:
for sibSp in full.SibSp.unique():
    for pclass in full.Pclass.unique():
        for embarked in full.Embarked.unique():
            features = (full.SibSp == sibSp) & (full.Pclass == pclass) & (full.Embarked == embarked)
            select_nan = np.isnan(full["Age"]) & features
            full.loc[select_nan,'Age'] = full[features].Age.mean()

Cabin has too many missing values, so that feature is dropped. With the other gaps filled, check full.info() again to make sure nothing else is missing:

In [None]:
full.info()

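Age still has three missing rows: grouped by SibSp, Pclass and Embarked, these three fall into the same group, which has no observed value at all. They are therefore filled again using only two of the three features for grouping: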

In [None]:
full['Age'].fillna(full[(full.SibSp == 2) & (full.Pclass == 3)]['Age'].mean(), inplace=True)
full.info()
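
A group with no observed value, as happened here, can be handled more generally by falling back from finer to coarser groupings. The following is a minimal sketch on hypothetical toy data; `fill_with_fallback` is an illustrative helper and not part of this kernel.

```python
import numpy as np
import pandas as pd

def fill_with_fallback(df, column, groupings):
    """Fill NaNs in `column` with group means, trying each grouping in order."""
    filled = df[column].copy()
    for keys in groupings:
        # Groups whose values are all missing yield NaN and are left for the next pass
        group_mean = df.groupby(keys)[column].transform("mean")
        filled = filled.fillna(group_mean)
    # Last resort: the overall mean of the observed values
    return filled.fillna(df[column].mean())

# Hypothetical data: the (SibSp=8, Pclass=3) group has no observed Age
toy = pd.DataFrame({
    "SibSp": [8, 8, 0, 0, 1],
    "Pclass": [3, 3, 3, 3, 3],
    "Age": [np.nan, np.nan, 20.0, 30.0, 40.0],
})
ages = fill_with_fallback(toy, "Age", [["SibSp", "Pclass"], ["Pclass"]])
```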

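After filling, look at how the continuous features Age and Fare relate to Survived:

In [None]:
# Refresh train from the imputed full so the plots use the filled-in values
train = full.head(891)
a = sns.FacetGrid( train, hue = 'Survived', aspect=3 )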
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , train['Age'].max()))
a.add_legend()

a = sns.FacetGrid( train, hue = 'Survived', aspect=3 )
a.map(sns.kdeplot, 'Fare', shade= True )
a.set(xlim=(0 , train['Fare'].quantile(0.95)))
a.add_legend()

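The survival rate differs across age ranges and fare ranges, so the points where the curves cross are used as cut points to define the boundaries for Age and Fare: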

In [None]:
age_boundaries = [14, 30, 40, 49, 57]
fare_boundaries = [18, 25]

2.3 Derived Features
New features can be computed by combining existing ones, or extracted from the values of a feature.

In [None]:
train.info()

In [None]:
full['FamilySize'] = full.SibSp + full.Parch + 1
train = full.head(891)
test = full.tail(418)
sns.countplot(x='FamilySize',hue='Survived',data=train)

The chart shows that survival is more likely when FamilySize is 2 to 4, so family_size_boundaries is set to 1 and 4:

In [None]:
family_size_boundaries=[1, 4]
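
To check which bucket each family size lands in, the boundaries can be tried out with `np.digitize`. Its default left-inclusive convention matches the one documented for `tf.feature_column.bucketized_column`, where a bucket includes its left boundary and excludes its right one. This is only a sketch; the alternative boundaries `[2, 5]` are an assumption added for comparison and are not used in this kernel.

```python
import numpy as np

family_sizes = np.array([1, 2, 3, 4, 5, 6])

# Boundaries used in this kernel: buckets are (-inf, 1), [1, 4), [4, +inf)
buckets_1_4 = np.digitize(family_sizes, [1, 4])

# Hypothetical alternative: buckets are (-inf, 2), [2, 5), [5, +inf)
buckets_2_5 = np.digitize(family_sizes, [2, 5])

print(buckets_1_4)  # [1 1 1 2 2 2]
print(buckets_2_5)  # [0 1 1 1 2 2]
```

With `[1, 4]`, passengers travelling alone share a bucket with families of 2 or 3, whereas `[2, 5]` would put sizes 2 to 4 in a bucket of their own.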

In [None]:
full['Single'] = full.FamilySize == 1  # True for passengers travelling alone

Next, take a look at the Name data:

In [None]:
full.Name.sample(10)

Every name contains a title in the middle, which can be extracted as a Title feature:

In [None]:
full['Title'] = full['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
full['Title'].unique()
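
The same extraction can be written with a single regular expression. The sketch below uses three sample strings in the dataset's `Surname, Title. Given names` format; the pattern itself is an assumption about that format, not something taken from this kernel.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```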

Plot the titles that occur more than twice and the remaining rare titles in two separate charts:

In [None]:
train = full.head(891).copy()  # copy so that adding a column does not touch a slice of full
test = full.tail(418)

# True for titles that occur more than twice in the training set
title_names = (train['Title'].value_counts() > 2) 
train.insert(loc = len(train.columns),column='BigTitle', value=train['Title'].apply(lambda x: title_names.loc[x]))
train[train.BigTitle == True].Title.unique()
f,ax = plt.subplots(1,2,figsize=(16,5))
sns.countplot(x='Title',hue='Survived',data=train[train.BigTitle == True], ax=ax[0])
sns.countplot(x='Title',hue='Survived',data=train[train.BigTitle == False], ax=ax[1])

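Merge the rare titles (those occurring at most twice) into a single title so they are handled as one category:

In [None]:
# Titles absent from the training set are first mapped to 'Don' so the lookup below succeeds
# (Index.contains was removed in pandas 1.0, so use the `in` operator instead)
full['Title'] = full['Title'].apply(lambda title: title if title in title_names.index else 'Don')
full['Title'] = full['Title'].apply(lambda title: title if title_names.loc[title] == True else 'X')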
full.Title.unique()
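
The grouping of rare categories can also be sketched without the intermediate lookup. The titles below are hypothetical sample values, and treating the last row as "unseen" is an assumption made to mimic a title that only appears in the test set.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mr", "Mr", "Miss", "Miss", "Miss", "Col", "Capt", "Dona"])
seen = titles.iloc[:8]  # pretend the last row comes from the test set

counts = seen.value_counts()
frequent = counts[counts > 2].index  # titles seen more than twice

# Anything not frequent, including titles never seen in `seen`, becomes 'X'
grouped = titles.where(titles.isin(frequent), "X")
print(grouped.tolist())  # ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss', 'X', 'X', 'X']
```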

Look at the relationship between name length and survival rate:

In [None]:
full['NameLength'] = full['Name'].str.len()
train = full.head(891)
test = full.tail(418)
a = sns.FacetGrid( train, hue = 'Survived', aspect=3 )
a.map(sns.kdeplot, 'NameLength', shade= True )
a.set(xlim=(0 , train['NameLength'].quantile(0.95)))
a.add_legend()

The chart shows that survival is less likely when NameLength is about 12 to 25, so name_length_boundaries is set to 12 and 28:

In [None]:
name_length_boundaries = [12, 28]

In [None]:
full.columns

Next, look at feature combinations to understand the survival rate across several dimensions at once:

In [None]:
a = sns.FacetGrid( train,col='Sex', hue = 'Survived', aspect=3 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , train['Age'].max()))
a.add_legend()

a = sns.FacetGrid( train,col='Pclass', hue = 'Survived', aspect=3 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , train['Age'].max()))
a.add_legend()

a = sns.FacetGrid( train,col='Pclass', hue = 'Survived', aspect=3 )
a.map(sns.kdeplot, 'Fare', shade= True )
a.set(xlim=(0 , 100))
a.add_legend()

a = sns.FacetGrid( train,col='Sex', hue = 'Survived', aspect=3 )
a.map(sns.kdeplot, 'NameLength', shade= True )
a.set(xlim=(0 , 100))
a.add_legend()

In [None]:
a = sns.FacetGrid( train,col='Sex', row='Single', hue = 'Survived', aspect=3 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , 100))
a.add_legend()

In [None]:
sex_cross_age_boundaries = [15, 26, 32,46, 54]
parch_cross_age_boundaries = [5, 10, 18, 30, 35]
pclass_cross_age_boundaries = [18, 30, 36, 40, 47]
pclass_cross_fare_boundaries = [8,18,25, 55]
sex_cross_name_length_boundaries = [12, 26, 42]

In [None]:
train = full.head(891).copy()  # copy to avoid SettingWithCopyWarning when adding SexCode
test = full.tail(418)

train['SexCode'] = train.Sex.apply(lambda sex: 1 if sex == 'male' else 0)

f,ax = plt.subplots(2,2,figsize=(20,16))

sns.swarmplot(x='Single',y='Pclass',hue='Survived',data=train,palette='husl',ax=ax[0,0])
sns.swarmplot(x='Parch',y='SexCode',hue='Survived',data=train,palette='husl',ax=ax[0,1])
sns.swarmplot(x='Embarked',y='Pclass',hue='Survived',data=train,palette='husl',ax=ax[1,0])
sns.swarmplot(x='Embarked',y='SexCode',hue='Survived',data=train,palette='husl',ax=ax[1,1])

In [None]:
# Cap the rare large values: Parch at 2 and SibSp at 5
full['Parch'] = full['Parch'].clip(upper=2)
full['SibSp'] = full['SibSp'].clip(upper=5)

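Plot the correlation matrix between survival and the other features:

In [None]:
train = full.head(891).copy()  # copy to avoid SettingWithCopyWarning when adding SexCode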
test = full.tail(418)
train['SexCode'] = train.Sex.apply(lambda sex: 1 if sex == 'male' else 0)
#train = pd.get_dummies(data=train, columns = ['Sex'])
corrmat = train[['Survived', 'SexCode', 'Pclass', 'Age', 'SibSp', 'FamilySize', 'Parch', 'Fare']].corr()
f, ax = plt.subplots(figsize=(12, 9))
colormap = plt.cm.RdBu
sns.heatmap(corrmat,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

**3.  Machine Learning with TensorFlow**

The functions below are adapted from the exercises in Google's course, modified to take their feature columns from the DataFrame.

In [None]:
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Feeds batches of features and labels to an estimator.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """
    
    # Convert pandas data into a dict of np arrays.
    features = {key:np.array(value) for key,value in dict(features).items()}                             

    # Construct a dataset, and configure batching/repeating
    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit      
    ds = ds.batch(batch_size).repeat(num_epochs)
    
    # Shuffle the data, if specified
    if shuffle:
        ds = ds.shuffle(10000)

    # Return the next batch of data
    features, labels = ds.make_one_shot_iterator().get_next()

    return features, labels
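
To make the roles of `batch_size`, `shuffle` and `num_epochs` concrete, here is a simplified stand-in written with NumPy only. `batch_iterator` is a hypothetical helper; unlike the `tf.data` pipeline above, it shuffles individual examples once per epoch and requires a finite `num_epochs`.

```python
import numpy as np

def batch_iterator(features, targets, batch_size=1, shuffle=True, num_epochs=1, seed=0):
    """Yield (feature_dict, label_array) batches from in-memory data."""
    n = len(targets)
    rng = np.random.RandomState(seed)
    labels = np.asarray(targets)
    columns = {key: np.asarray(value) for key, value in features.items()}
    for _ in range(num_epochs):
        # A new order for every pass over the data when shuffling is on
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield {key: col[idx] for key, col in columns.items()}, labels[idx]

# Hypothetical sample values
features = {"Age": [22.0, 38.0, 26.0, 35.0, 54.0]}
targets = [0, 1, 1, 1, 0]
batches = list(batch_iterator(features, targets, batch_size=2, shuffle=False))
print([len(labels) for _, labels in batches])  # [2, 2, 1]
```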

In [None]:
def train_linear_classifier_model(
    learning_rate,
    steps,
    batch_size,
    periods,
    regularization_strength,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets):
    """Trains a DNN classifier on the given features.

    In addition to training, this function also prints training progress information,
    as well as a plot of the training and validation loss over time.

    Args:
      learning_rate: A `float`, the learning rate.
      steps: A non-zero `int`, the total number of training steps. A training step
        consists of a forward and backward pass using a single batch.
      batch_size: A non-zero `int`, the batch size.
      periods: A non-zero `int`, the number of periods after which loss is reported.
      regularization_strength: A `float`, the L1 regularization strength passed
        to the FTRL optimizer.
      training_examples: A `DataFrame` of input features for training.
      training_targets: A `DataFrame` with a single `target` column for training.
      validation_examples: A `DataFrame` of input features for validation.
      validation_targets: A `DataFrame` with a single `target` column for validation.

    Returns:
      A `DNNClassifier` object trained on the training data.
    """

    # Integer division so that train() receives a whole number of steps
    steps_per_period = steps // periods
    
    # Create a linear classifier object.
    my_optimizer = tf.train.FtrlOptimizer(learning_rate=learning_rate, l1_regularization_strength=regularization_strength)
    #my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)    
    linear_classifier = tf.estimator.DNNClassifier(
    #linear_classifier = tf.estimator.LinearClassifier(
      feature_columns=construct_feature_columns(training_examples),
      hidden_units=[10, 10],
      optimizer=my_optimizer
    )
    
    # Create input functions
    training_input_fn = lambda: my_input_fn(training_examples, 
                                          training_targets["target"], 
                                          batch_size=batch_size)
    predict_training_input_fn = lambda: my_input_fn(training_examples, 
                                                  training_targets["target"], 
                                                  num_epochs=1, 
                                                  shuffle=False)
    predict_validation_input_fn = lambda: my_input_fn(validation_examples, 
                                                    validation_targets["target"], 
                                                    num_epochs=1, 
                                                    shuffle=False)
    # Train the model, but do so inside a loop so that we can periodically assess
    # loss metrics.
    print("Training model...")
    print("LogLoss (on training data):")
    training_log_losses = []
    validation_log_losses = []
    for period in range (0, periods):
        # Train the model, starting from the prior state.
        linear_classifier.train(
            input_fn=training_input_fn,
            steps=steps_per_period
        )
        # Take a break and compute predictions.    
        training_probabilities = linear_classifier.predict(input_fn=predict_training_input_fn)
        training_probabilities = np.array([item['probabilities'] for item in training_probabilities])

        validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)
        validation_probabilities = np.array([item['probabilities'] for item in validation_probabilities])

        training_log_loss = metrics.log_loss(training_targets, training_probabilities)
        validation_log_loss = metrics.log_loss(validation_targets, validation_probabilities)
        # Occasionally print the current loss.
        print( "  period %02d : %0.2f" % (period, training_log_loss))
        # Add the loss metrics from this period to our list.
        training_log_losses.append(training_log_loss)
        validation_log_losses.append(validation_log_loss)
    print("Model training finished.")
    
    # Output a graph of loss metrics over periods.
    plt.ylabel("LogLoss")
    plt.xlabel("Periods")
    plt.title("LogLoss vs. Periods")
    plt.tight_layout()
    plt.plot(training_log_losses, label="training")
    plt.plot(validation_log_losses, label="validation")
    plt.legend()

    return linear_classifier
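
The LogLoss tracked in each period is the mean negative log-likelihood of the true labels. For binary labels it can be sketched as below; `binary_log_loss` is an illustrative helper, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood of binary labels given P(label == 1)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_positive, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Confident and correct predictions give a low loss
confident = binary_log_loss([1, 0], [0.9, 0.1])

# Predicting 0.5 everywhere gives log(2), regardless of the labels
unsure = binary_log_loss([1, 0], [0.5, 0.5])
```

A validation loss that stays well above the training loss is the overfitting signal that the plot at the end of the function is meant to reveal.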

In [None]:
def preprocess_features(df):
    """Prepares input features from the Titanic data set.

    Args:
    df: A Pandas DataFrame expected to contain data
      from the train data set.
    Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
    """
    selected_features = df[
        ['Sex', 'Pclass', 'Age', 'Parch', 'SibSp', 'FamilySize', 'Single', 'Fare', 'Title', 'Embarked', 'NameLength']]
    processed_features = selected_features.copy()
    
    return processed_features

def preprocess_targets(df):
    """Prepares target features (i.e., labels) from the Titanic data set.

    Args:
    df: A Pandas DataFrame expected to contain data
      from the train data set.
    Returns:
    A DataFrame that contains the target feature.
    """
    output_targets = pd.DataFrame()
    output_targets["target"] =  df['Survived'] 
    return output_targets

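With the functions defined, split the data into the labelled set train and the unlabelled set test, then split train further into a training set and a validation set: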

In [None]:
train = full.head(891)
test = full.tail(418)

training_examples = preprocess_features(train.head(700))
training_targets = preprocess_targets(train.head(700))

# Use the remaining 191 rows (891 - 700) so that validation does not overlap training
validation_examples = preprocess_features(train.tail(191))
validation_targets = preprocess_targets(train.tail(191))

# Double-check that we've done the right thing.
print ("Training examples summary:")
display.display(training_examples.describe())
print( "Validation examples summary:")
display.display(validation_examples.describe())

#print( "Training targets summary:")
#display.display(training_targets.describe())
#print( "Validation targets summary:")
#display.display(validation_targets.describe())
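
Because `head` and `tail` take rows in file order, a shuffled split can be a useful variant. The sketch below uses a hypothetical 891-row frame and an arbitrary seed; it illustrates the idea and is not what this kernel does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 891-row training frame
toy_train = pd.DataFrame({"PassengerId": np.arange(1, 892)})

# Shuffle once with a fixed seed, then cut into two disjoint parts
shuffled = toy_train.sample(frac=1.0, random_state=42)
toy_training = shuffled.iloc[:700]
toy_validation = shuffled.iloc[700:]

print(len(toy_training), len(toy_validation))  # 700 191
```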

Plug the findings from the previous section into the function below to build the feature columns for training, mainly bucketized columns and numeric columns:

In [None]:
def cross_columns(cross_array, hash_bucket_size=1000):
    cross_column = tf.feature_column.indicator_column(tf.feature_column.crossed_column(cross_array , hash_bucket_size=hash_bucket_size))
    return cross_column
    
def construct_feature_columns(input_features):
    """Construct the TensorFlow Feature Columns.

    Args:
    input_features: The names of the numerical input features to use.
    Returns:
    A set of feature columns
    """
    features = []

    # The Sex column holds 'male' / 'female', so the vocabulary must list those values
    sex_categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key='Sex',vocabulary_list=["male", "female"])
    sex_indicator_column = tf.feature_column.indicator_column(sex_categorical_column)
    features.append(sex_indicator_column)

    pclass_categorical_column = tf.feature_column.categorical_column_with_identity(key='Pclass',num_buckets=4)
    pclass_indicator_column = tf.feature_column.indicator_column(pclass_categorical_column)
    features.append(pclass_indicator_column)
    
    embarked_categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key='Embarked',vocabulary_list=["S", "C", "Q"])
    embarked_indicator_column = tf.feature_column.indicator_column(embarked_categorical_column)
    features.append(embarked_indicator_column)
        
    title_categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key='Title',vocabulary_list=full.Title.unique())
    title_indicator_column = tf.feature_column.indicator_column(title_categorical_column)
    features.append(title_indicator_column)
    
    name_length_categorical_column = tf.feature_column.numeric_column("NameLength")
    name_length_bucket_column = tf.feature_column.bucketized_column(name_length_categorical_column, boundaries=name_length_boundaries)
    features.append(name_length_bucket_column)

    parch_categorical_column = tf.feature_column.categorical_column_with_identity(key='Parch',num_buckets=4)
    parch_indicator_column = tf.feature_column.indicator_column(parch_categorical_column)
    #features.append(parch_indicator_column)
    
    # SibSp is capped at 5 earlier, so ids 0..5 need 6 buckets
    sibsp_categorical_column = tf.feature_column.categorical_column_with_identity(key='SibSp',num_buckets=6)
    sibsp_indicator_column = tf.feature_column.indicator_column(sibsp_categorical_column)
    #features.append(sibsp_indicator_column)
    
    family_size_categorical_column = tf.feature_column.numeric_column("FamilySize")
    family_size_bucket_column = tf.feature_column.bucketized_column(family_size_categorical_column, boundaries=family_size_boundaries)
    features.append(family_size_bucket_column)
    
    single_numeric_column = tf.feature_column.numeric_column('Single')
    features.append(single_numeric_column)
    
    age_categorical_column = tf.feature_column.numeric_column("Age")
    age_bucket_column = tf.feature_column.bucketized_column(age_categorical_column, boundaries=age_boundaries)
    #features.append(age_bucket_column)
    
    fare_categorical_column = tf.feature_column.numeric_column("Fare")
    fare_bucket_column = tf.feature_column.bucketized_column(fare_categorical_column, boundaries=fare_boundaries)
    features.append(fare_bucket_column)    
    
    
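    # Use the boundaries derived for the Sex x Age cross rather than the plain age_boundaries
    sex_cross_age_bucket_column = tf.feature_column.bucketized_column(age_categorical_column, boundaries=sex_cross_age_boundaries)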
    features.append(cross_columns(['Sex', 'Single', sex_cross_age_bucket_column]))        
            
    parch_cross_age_bucket_column = tf.feature_column.bucketized_column(age_categorical_column, boundaries=parch_cross_age_boundaries)
    #features.append(cross_columns(['Parch', parch_cross_age_bucket_column]))
    
    pclass_cross_fare_bucket_column = tf.feature_column.bucketized_column(fare_categorical_column, boundaries=pclass_cross_fare_boundaries)
    #features.append(cross_columns(['Pclass', pclass_cross_fare_bucket_column]))
    
    pclass_cross_age_bucket_column = tf.feature_column.bucketized_column(age_categorical_column, boundaries=pclass_cross_age_boundaries)
    #features.append(cross_columns(['Pclass', pclass_cross_age_bucket_column]))
    
    sex_cross_name_length_bucket_column = tf.feature_column.bucketized_column(name_length_categorical_column, boundaries=sex_cross_name_length_boundaries)
    #features.append(cross_columns(['Sex', sex_cross_name_length_bucket_column]))
        
    #features.append(cross_columns(['Sex', 'Pclass']))
    features.append(cross_columns(['SibSp', 'Parch'], 18))
    features.append(cross_columns(['SibSp', 'Sex'], 12))
    features.append(cross_columns(['Single', 'Pclass'], 6))
    
    #features.append(cross_columns([age_bucket_column, 'Pclass']))
    
    #features.append(cross_columns([age_bucket_column, 'Sex']))
    
    #features.append(cross_columns(['Embarked', 'Sex']))
    
    #features.append(cross_columns(['Embarked', age_bucket_column]))
    
    features.append(cross_columns(['Embarked', 'Pclass']))

    feature_columns = set(features)
    return feature_columns
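
A crossed column turns each combination of values into one id by hashing it into `hash_bucket_size` buckets. The sketch below imitates that idea with `hashlib.md5`; TensorFlow uses its own internal hash function, so `cross_bucket`, the `"_X_"` separator and the resulting ids are illustrative assumptions only.

```python
import hashlib

def cross_bucket(values, hash_bucket_size):
    """Map a combination of feature values to a bucket id in [0, hash_bucket_size)."""
    key = "_X_".join(str(v) for v in values)  # arbitrary separator for the example
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# The same combination always lands in the same bucket
bucket_a = cross_bucket(["S", 3], 1000)
bucket_b = cross_bucket(["C", 1], 1000)
```

A small `hash_bucket_size` makes it more likely that different combinations collide in one bucket, which is the trade-off behind per-cross sizes such as 18, 12 and 6 above.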

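Start training the model. The resulting chart compares the loss on the training set (blue line) with the loss on the validation set (orange line); a large gap between the two indicates overfitting.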

In [None]:
linear_classifier = train_linear_classifier_model(
    learning_rate=0.16,
    steps=200,
    batch_size=500,
    periods=15,
    regularization_strength=0.015,
    training_examples=training_examples,
    training_targets=training_targets,
    validation_examples=validation_examples,
    validation_targets=validation_targets)

Use the trained model to predict the validation set and report its accuracy and other metrics:

In [None]:
predict_validation_input_fn = lambda: my_input_fn(validation_examples, 
                                                    validation_targets["target"], 
                                                    num_epochs=1, 
                                                    shuffle=False)

evaluation_metrics = linear_classifier.evaluate(input_fn=predict_validation_input_fn)
print(evaluation_metrics.keys())
print("AUC on the validation set: %0.2f" % evaluation_metrics['auc'])
print("Accuracy on the validation set: %0.2f" % evaluation_metrics['accuracy'])
print(evaluation_metrics)

In [None]:
validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)
# Get just the probabilities for the positive class
validation_probabilities = np.array([item['probabilities'][1] for item in validation_probabilities])

false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(
    validation_targets, validation_probabilities)
plt.plot(false_positive_rate, true_positive_rate, label="our model")
plt.plot([0, 1], [0, 1], label="random classifier")
_ = plt.legend(loc=2)

true_positive_rate

In [None]:
def assign_probability(df, linear_classifier, validation=False,field='probability'):
    result = df.copy()
    fake = df.copy()
    fake['Survived'] = 0
    v_examples = preprocess_features(result)
    if validation:
        v_targets = preprocess_targets(result)
    else:
        v_targets = preprocess_targets(fake)
    predict_validation_input_fn = lambda: my_input_fn(v_examples, 
                                                        v_targets['target'], 
                                                        num_epochs=1, 
                                                        shuffle=False)
    validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)
    result[field] = np.array([item['probabilities'][1] for item in validation_probabilities])
    return result

In [None]:
result = assign_probability(test, linear_classifier,False)
validation = assign_probability(train, linear_classifier,True)
validation

In [None]:
def find_treshold(validation):
    best_accuracy = 0
    best_threshold = 0
    target = 'Survived'
    for i in range(0, 101):
        threshold = i/100.0
        validation['new_survived'] = validation['probability'].apply(lambda p: 1 if p >= threshold else 0)
        accuracy = validation[validation['new_survived'] == validation['Survived']]['Survived'].count()/validation['Survived'].count().astype(float)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_threshold = threshold
    threshold = best_threshold
    validation['new_survived'] = validation['probability'].apply(lambda p: 1 if p >= best_threshold else 0)

    p = validation[validation['probability'] >= threshold]
    n = validation[validation['probability'] < threshold]
    tp = p[p[target] == 1]
    fp = p[p[target] == 0]
    tn = n[n[target] == 0]
    fn = n[n[target] == 1]

    pn = p['Survived'].count().astype(float)
    nn = n['Survived'].count().astype(float)
    tpn = tp['Survived'].count().astype(float)
    fpn = fp['Survived'].count().astype(float)
    tnn = tn['Survived'].count().astype(float)
    fnn = fn['Survived'].count().astype(float)

    print ('best_threshold: %s' % threshold)
    print ('best_accuracy: %s' % best_accuracy)
    print ('result number: %s' % pn)
    print ('tpn: %s' % tpn)
    print ('fpn: %s' % fpn)
    print ('tnn: %s' % tnn)
    print ('fnn: %s' % fnn)

    precision = tpn / pn
    tp_rate = tpn / (tpn + fnn)
    fp_rate = fpn / (fpn + tnn)
    precision_n = tnn / (tnn + fnn)

    print ('precision: %s' % precision)
    print ('tp_rate: %s' % tp_rate)
    print ('fp_rate: %s' % fp_rate)
    print ('precision_n: %s' % precision_n)
    return best_threshold
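
The threshold search can also be done in a vectorised way. This is a sketch under the same rule as above (predict 1 when the probability is at least the threshold, keep the first threshold that reaches the best accuracy); `best_threshold_search` and the sample values are hypothetical.

```python
import numpy as np

def best_threshold_search(labels, probabilities):
    """Return (threshold, accuracy) maximising accuracy over 0.00, 0.01, ..., 1.00."""
    labels = np.asarray(labels).astype(bool)
    probabilities = np.asarray(probabilities, dtype=float)
    thresholds = np.arange(101) / 100.0
    # One row per threshold, one column per example
    predictions = probabilities[None, :] >= thresholds[:, None]
    accuracies = (predictions == labels[None, :]).mean(axis=1)
    best = int(np.argmax(accuracies))  # argmax keeps the first maximum
    return float(thresholds[best]), float(accuracies[best])

threshold_demo, accuracy_demo = best_threshold_search([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(threshold_demo, accuracy_demo)  # 0.11 0.75
```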

In [None]:
threshold = find_treshold(validation)
threshold

In [None]:
result["Survived"] = result["probability"].apply(lambda a: 1 if a >= threshold else 0)  # same >= rule as in find_treshold
evaluation = result[["PassengerId", "Survived"]]
evaluation

In [None]:
evaluation.to_csv("evaluation_submission.csv",index=False)