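Think in terms of family survival rates.
Large families have a low survival rate.
Cabin has too many missing values to be used for prediction
→ but it looks usable for identifying families.
Passengers who share a surname and the same Embarked value (port of embarkation) are likely to be family.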

# Kaggle-Titanic


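## About the dataset

|    Column    | Description                                         | Use                                                                                                         |
|:-------------|----------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------:|
| PassengerId  | Passenger ID                                        | Not used for prediction                                                                                     |
| Survived     | Whether the passenger survived                      | Target label: this is what we predict                                                                       |
| Pclass       | Ticket class (proxy for the passenger's social class) | Used for prediction                                                                                       |
| Name         | Passenger name                                      | Missing-value imputation: helps estimate age and family structure                                           |
| Sex          | Sex                                                 | Used for prediction: women were given priority for the lifeboats (historical fact)                          |
| Age          | Age                                                 | Used for prediction: younger passengers were given priority for the lifeboats (historical fact)             |
| SibSp        | Number of spouses and siblings aboard               | Feature engineering: used to compute family size                                                            |
| Parch        | Number of parents and children aboard               | Missing-value imputation: helps estimate age and family structure                                           |
| Ticket       | Ticket number                                       | Not used for prediction: Cabin might be inferable from the ticket order?                                    |
| Fare         | Fare paid                                           | Used for prediction                                                                                         |
| Cabin        | Cabin                                               | Feature engineering: used to identify families                                                              |
| Embarked     | Port of embarkation (3 ports)                       | Feature engineering: looks usable for identifying families (same surname and same port likely means family) |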


### Loading the data

In [1]:
import numpy as np
import pandas as pd


In [2]:
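def load_csv():
    # Load the Kaggle Titanic CSV files from the local datasets/ directory.
    train = pd.read_csv("datasets/train.csv")
    test1 = pd.read_csv("datasets/test.csv")
    # gender_submission.csv is Kaggle's sample submission, not ground-truth labels.
    test2 = pd.read_csv("datasets/gender_submission.csv")

    Y_train = train['Survived']
    Y_test = test2['Survived']  # placeholder only; not used for scoring below

    # Kept separately so the submission file can be indexed by PassengerId.
    PassengerId = np.array(test1["PassengerId"]).astype(int)

    return train, Y_train, test1, Y_test, PassengerId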

In [3]:
X_train, Y_train, X_test, Y_test, PassengerId = load_csv()
print("Train Size: {}".format(X_train.size))
print("Test Size: {}".format(X_test.size))
print("Train Shape: {}".format(np.shape(X_train)))
X_train.head(10)

Train Size: 10692
Test Size: 4598
Train Shape: (891, 12)


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


### Checking for missing values

In [4]:
X_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [5]:
X_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

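* These counts show that "Cabin" is missing for most of the dataset, so it does not look usable for prediction.

* "Age" matters for predicting survival, so its missing values are imputed and the column is kept.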
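
The raw counts are easier to compare as ratios: from the output above, 687 of 891 training rows (about 77%) lack "Cabin", while 177 of 891 (about 20%) lack "Age". A minimal sketch of such a summary follows; `missing_summary` is a hypothetical helper and the toy frame is made-up data, not the Titanic files.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and ratios, highest ratio first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),   # number of NaN cells per column
        "ratio": df.isnull().mean(),      # fraction of rows that are NaN
    })
    return summary.sort_values("ratio", ascending=False)


# Toy data (hypothetical values) standing in for X_train.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(X_train)` on the real frame would give the same ranking as the counts above, with "Cabin" first.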



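## Feature selection

- Prediction uses "Pclass", "Sex", "Age", "Fare", and a newly generated "nFamily" (the number of family members aboard)
    - Families tried to board the same lifeboat, which put large families at a disadvantage.
- The missing "Age" values are estimated from the titles in the "Name" feature ("Mr", "Miss", "Mrs", "Master")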



### Imputing missing values

Fill in the missing "Age" values.

First, extract the title from "Name".

In [6]:
datasets = [X_train, X_test]

for data in datasets:
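    # Raw string avoids the invalid-escape warning for "\."; the pattern
    # captures the letters between a space and the first period.
    data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)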
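
To see what the pattern above captures, here is a small self-contained check on three names taken from the `head` output earlier. It is only an illustration of the regular expression, not an extra step in the pipeline.

```python
import pandas as pd

# Three names copied from the X_train.head(10) output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# A space, then a run of letters (captured), then a literal period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Master']
```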

In [7]:
X_train['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Mlle          2
Col           2
Ms            1
Countess      1
Jonkheer      1
Sir           1
Don           1
Mme           1
Lady          1
Capt          1
Name: Title, dtype: int64

In [8]:
X_test['Title'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Dr          1
Ms          1
Dona        1
Name: Title, dtype: int64

As expected, the titles (especially "Mr", "Miss", "Mrs", "Master") look usable.
Map each title to a number.

In [9]:
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Dr": 4, "Rev": 4, "Col": 4, "Major": 4, "Mlle": 4,"Countess": 4,
                 "Ms": 4, "Lady": 4, "Jonkheer": 4, "Don": 4, "Dona": 4, "Mme": 4, "Capt": 4, "Sir": 4}
for data in datasets:
    data['Title'] = data['Title'].map(title_mapping)
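
One caveat with an explicit dictionary: `Series.map` returns NaN for any title that is not a key, so a title that appears only in new data would silently become missing. A possible variant, sketched below under the assumption that every title outside the four common ones should share code 4 (as in the mapping above), lists only the common titles and fills the rest. `COMMON_TITLES`, `RARE_CODE`, and `encode_titles` are hypothetical names.

```python
import pandas as pd

COMMON_TITLES = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}
RARE_CODE = 4  # the code the mapping above assigns to every other title


def encode_titles(titles: pd.Series) -> pd.Series:
    """Map common titles to 0-3 and everything else (including NaN) to 4."""
    return titles.map(COMMON_TITLES).fillna(RARE_CODE).astype(int)


# "Esq" is a made-up title that is absent from the dictionary.
sample = pd.Series(["Mr", "Dona", "Master", "Esq"])
codes = encode_titles(sample)
print(codes.tolist())  # [0, 4, 3, 4]
```

Note that this also sends names with no extractable title to code 4, which may or may not be desirable.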

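Fill the missing "Age" values with the median age of passengers who share the same title.

In [10]:
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply.
X_train["Age"] = X_train["Age"].fillna(X_train.groupby("Title")["Age"].transform("median"))
X_test["Age"] = X_test["Age"].fillna(X_test.groupby("Title")["Age"].transform("median"))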
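
The group-wise fill can be checked on a toy frame. The sketch below uses made-up values and also shows one alternative for the test rows, namely reusing medians learned from the training rows through `map`. That alternative is an assumption about what might be preferable, not what the cell above does (it uses each set's own medians).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): Title codes with some missing ages.
train = pd.DataFrame({
    "Title": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 4.0, 10.0, np.nan],
})
test = pd.DataFrame({"Title": [1, 0], "Age": [np.nan, 33.0]})

# Per-title medians from the training rows only (NaN is skipped).
title_medians = train.groupby("Title")["Age"].median()

# transform("median") broadcasts each group's median back onto every row,
# so fillna can align it with the original column.
train["Age"] = train["Age"].fillna(
    train.groupby("Title")["Age"].transform("median")
)

# Alternative for the test rows: look up the training medians by title.
test["Age"] = test["Age"].fillna(test["Title"].map(title_medians))

print(train["Age"].tolist())  # [20.0, 40.0, 30.0, 4.0, 10.0, 7.0]
print(test["Age"].tolist())   # [7.0, 33.0]
```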

In [11]:
X_train.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 |


In [12]:
X_test.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S | 2 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 2 |


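### Creating nFamily
"nFamily" is built mainly by grouping passengers on the last name found in the "Name" feature. Passengers who share a surname and also share "Cabin" or "Embarked" are likely to be family. Conversely, passengers who share a surname but differ in "Cabin" or "Embarked" are probably not family.

Create "nFamily" with these points in mind.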


In [13]:
for data in datasets:
    # Raw string; captures the run of letters immediately before the first comma.
    data['LastName'] = data['Name'].str.extract(r'([A-Za-z]+),', expand=False)

In [14]:
X_train['LastName'].value_counts()

Andersson     9
Sage          7
Carter        6
Panula        6
Skoog         6
Johnson       6
Goodwin       6
Rice          5
Harris        4
Williams      4
Kelly         4
Fortune       4
Lefebre       4
Asplund       4
Brown         4
Ford          4
Hart          4
Jensen        4
Baclini       4
Gustafsson    4
Harper        4
Palsson       4
Smith         4
Goldsmith     3
Taussig       3
West          3
Laroche       3
Allison       3
Boulos        3
Newell        3
             ..
Radeff        1
Gavey         1
Molson        1
Augustsson    1
Smiljanic     1
Laitinen      1
Lines         1
Osman         1
Hassab        1
Pelsmaeker    1
Montvila      1
Olsvigen      1
Allum         1
Touma         1
Sedgwick      1
McGowan       1
Phillips      1
Maenpaa       1
Berriman      1
Lievens       1
Sunderland    1
Sharp         1
Heininen      1
Lahoud        1
Widener       1
Sullivan      1
Stankovic     1
Hunt          1
Dahlberg      1
Madigan       1
Name: LastName, Length: 
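
The cells here extract "LastName" but do not show the "nFamily" column itself. One possible way to build it from the rule described above (same surname and same "Embarked" counts as one family) is sketched below. The names are made up, and "Cabin" is left out of the grouping as an assumption because most of its values are missing.

```python
import pandas as pd

# Toy passengers (hypothetical names): three "Alpha" from the same port,
# two "Beta" from different ports.
toy = pd.DataFrame({
    "Name": [
        "Alpha, Mr. A",
        "Alpha, Mrs. B",
        "Alpha, Master. C",
        "Beta, Mr. D",
        "Beta, Miss. E",
    ],
    "Embarked": ["S", "S", "S", "S", "Q"],
})

# Same extraction as in the notebook cell above.
toy["LastName"] = toy["Name"].str.extract(r"([A-Za-z]+),", expand=False)

# Count the passengers in each (surname, port) group and broadcast the
# count back to every row of that group.
toy["nFamily"] = toy.groupby(["LastName", "Embarked"])["Name"].transform("count")

print(toy[["LastName", "Embarked", "nFamily"]])
```

The two "Beta" passengers boarded at different ports, so each gets `nFamily == 1`. On the real data, the two missing "Embarked" values in the training set would need to be filled before grouping.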

In [14]:
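# Encode Sex as 0/1 so the classifiers can consume it.
sex_mapping = {"male": 0, "female": 1}
for data in datasets:
    data['Sex'] = data['Sex'].map(sex_mapping)
# Drop identifier, text, and helper columns that are not model inputs.
drop_features = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked', 'Title', 'LastName']
X_train.drop(drop_features, axis=1, inplace=True)
# The target is already stored in Y_train, so remove it from the features.
X_train.drop(['Survived'], axis=1, inplace=True)
X_test.drop(drop_features, axis=1, inplace=True)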

In [15]:
X_train.head()

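|   | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 3 | 0 | 35.0 | 0 | 0 | 8.05 |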


In [16]:
X_test.head()

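|   | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 34.5 | 0 | 0 | 7.8292 |
| 1 | 3 | 1 | 47.0 | 1 | 0 | 7.0 |
| 2 | 2 | 0 | 62.0 | 0 | 0 | 9.6875 |
| 3 | 3 | 0 | 27.0 | 0 | 0 | 8.6625 |
| 4 | 3 | 1 | 22.0 | 1 | 1 | 12.2875 |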


In [18]:
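# Fill missing Fare values with the median fare of the same Pclass.
# Assigning back avoids fillna(inplace=True) on a column selection.
X_train["Fare"] = X_train["Fare"].fillna(X_train.groupby("Pclass")["Fare"].transform("median"))
X_test["Fare"] = X_test["Fare"].fillna(X_test.groupby("Pclass")["Fare"].transform("median"))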

In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def Model(X_train, Y_train, X_test, PassengerId):

    '''Select the best model via grid search.'''
    pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])
    param_grid = [
        {'classifier': [LogisticRegression()], 'preprocessing': [StandardScaler(), None],
         'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
        {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
         'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
         'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
        {'classifier': [RandomForestClassifier()],
         'preprocessing': [None], 'classifier__max_features': [1, 2, 3],
         'classifier__n_estimators': [10, 20, 30, 50, 80]},
    ]
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(X_train, Y_train)

    print("Best parameters: {}".format(grid.best_params_))
    print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))

    Titanic_Solution = pd.DataFrame(grid.predict(X_test), PassengerId, columns=["Survived"])
    Titanic_Solution.to_csv("Titanic_Solution.csv", index_label=["PassengerId"])

In [20]:
Model(X_train, Y_train, X_test, PassengerId)

Best parameters: {'classifier': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False), 'classifier__C': 10, 'classifier__gamma': 0.1, 'preprocessing': StandardScaler(copy=True, with_mean=True, with_std=True)}
Best cross-validation accuracy: 0.83
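
The printed score is the cross-validated accuracy of the single best configuration on the training data. To see how the other model families compared, `grid.cv_results_` can be summarised per classifier. The sketch below illustrates the idea on synthetic data from `make_classification` with a reduced grid, so its scores say nothing about the Titanic results; inside `Model`, the same few lines could be applied to the fitted `grid`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six Titanic features (not real data).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", LogisticRegression()),
])
# Reduced grid: each dict swaps in a classifier and its own hyperparameters.
param_grid = [
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": [0.1, 1, 10]},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "preprocessing": ["passthrough"],  # trees do not need scaling
     "classifier__n_estimators": [10, 30]},
]
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

# One row per tried configuration; label each row with its classifier name.
results = pd.DataFrame(grid.cv_results_)
results["model"] = results["param_classifier"].map(lambda est: type(est).__name__)

# Best mean cross-validation accuracy reached by each model family.
best_per_model = results.groupby("model")["mean_test_score"].max()
print(best_per_model)
```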
