# Titanic: Machine Learning from Disaster


Source: https://github.com/minsuk-heo/kaggle-titanic/blob/master/titanic-solution.ipynb

I follow along with that notebook here and focus on the questions that came up along the way.

## Question 1

What score (%) do we get if we predict that everyone survived?

In [1]:
## Common setup
import pandas as pd

train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')

As the counts below show, both train and test contain many null values, mostly in Age and Cabin.

In [2]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [3]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

How many rows are left if we drop every row that has at least one null (NaN)?

In [4]:
# 891 rows -> only 183 remain
train.dropna(axis=0).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
PassengerId    183 non-null int64
Survived       183 non-null int64
Pclass         183 non-null int64
Name           183 non-null object
Sex            183 non-null object
Age            183 non-null float64
SibSp          183 non-null int64
Parch          183 non-null int64
Ticket         183 non-null object
Fare           183 non-null float64
Cabin          183 non-null object
Embarked       183 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 15.0+ KB


In [5]:
# 418 rows -> only 87 remain
test.dropna(axis=0).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 87 entries, 12 to 414
Data columns (total 11 columns):
PassengerId    87 non-null int64
Pclass         87 non-null int64
Name           87 non-null object
Sex            87 non-null object
Age            87 non-null float64
SibSp          87 non-null int64
Parch          87 non-null int64
Ticket         87 non-null object
Fare           87 non-null float64
Cabin          87 non-null object
Embarked       87 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 6.5+ KB


Dropping rows would throw away most of the data, so we need a way to estimate survival even for rows that contain null values.
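To put a number on how much data dropping would cost, here is a minimal sketch. The `missing_summary` helper and the toy frame are my own hypothetical names and data, not part of the source notebook. In practice you would pass `train` or `test` instead.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count and ratio, most-missing column first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "ratio": df.isnull().mean(),
    })
    return summary.sort_values("missing", ascending=False)


# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Pclass": [3, 1, 3, 1],
})

summary = missing_summary(toy)
complete_rows = len(toy.dropna())  # rows with no NaN in any column
print(summary)
print("rows left after dropna:", complete_rows)
```

The ratio column makes it easy to see which columns are worth imputing (Age) and which are mostly empty (Cabin).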

### 1.1 Mark everyone as survived

In [6]:
test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [16]:
## Build the submission
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": 1
    })
submission.to_csv('submission_y01.csv', index=False)

In [17]:
submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 1 |
| 1 | 893 | 1 |
| 2 | 894 | 1 |
| 3 | 895 | 1 |
| 4 | 896 | 1 |


### Result: 0.37320 / rank 11336

## Question 2: What if all 418 passengers died?

In [18]:
## Build the submission
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": 0
    })
submission.to_csv('submission_y02.csv', index=False)

In [19]:
submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |


### Result: 0.62679 / rank 11096 (up about 240 places)
So what should I do next?

I would actually like to run the same check on train as well, but I will leave that for another time.
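For reference, checking a constant guess on train only needs the `Survived` column, since accuracy is simply the share of rows that match the guess. Below is a minimal sketch. `constant_baseline_accuracy` and the toy labels are hypothetical, so swap in `train["Survived"]` to get the real numbers.

```python
import pandas as pd


def constant_baseline_accuracy(target: pd.Series, guess: int) -> float:
    """Accuracy obtained by predicting `guess` for every row."""
    return float((target == guess).mean())


# Hypothetical labels standing in for train["Survived"]
toy_target = pd.Series([0, 0, 1, 0, 1])

all_survived = constant_baseline_accuracy(toy_target, 1)
all_died = constant_baseline_accuracy(toy_target, 0)
print(all_survived, all_died)
```

The two baselines always add up to 1, which is also why the two leaderboard scores above (0.37320 and 0.62679) are complements of each other.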

## Question 3: Now let's do it properly

The main features are Pclass, Sex, and Age.
The model is trained on the train data and then applied to test.

In [22]:
## Is there an easier way to pull out just the columns we want?
cpy = pd.DataFrame({
        "Pclass": train["Pclass"],
        "Sex": train["Sex"],
        "Age": train["Age"]
    }).copy()
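One possible answer to the question in the comment above: indexing a DataFrame with a list of column names returns just those columns, in the order listed. This is a sketch on a hypothetical toy frame. With the real data it would be `train[features].copy()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for train
toy_train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

features = ["Pclass", "Sex", "Age"]
cpy = toy_train[features].copy()  # .copy() so later edits do not touch toy_train

cpy.loc[0, "Age"] = 99.0          # only the copy changes
print(cpy)
print(toy_train.loc[0, "Age"])
```

Unlike the dict-based construction, the column order here follows the `features` list, and the same list can be reused for test.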

In [25]:
cpy.head()

| | Age | Pclass | Sex |
|---|---|---|---|
| 0 | 22.0 | 3 | male |
| 1 | 38.0 | 1 | female |
| 2 | 26.0 | 3 | female |
| 3 | 35.0 | 1 | female |
| 4 | 35.0 | 3 | male |


In [24]:
cpy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
Age       714 non-null float64
Pclass    891 non-null int64
Sex       891 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 17.4+ KB


### 3.2 Feature Engineering

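The model cannot be trained while NaN values remain in the features.
Age therefore needs feature engineering to fill in its missing values.
> The reference solution fills each missing Age with the median age for the passenger's Title (Mr., Mrs., etc.).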

In [2]:
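train_test_data = [train, test]  # list holding both frames so each step applies to train and test

for dataset in train_test_data:
    # capture the word that follows a space and ends with a period, e.g. "Mr" in "Kelly, Mr. James"
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)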

In [3]:
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, 
                 "Master": 3, "Dr": 3, "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3,"Countess": 3,
                 "Ms": 3, "Lady": 3, "Jonkheer": 3, "Don": 3, "Dona" : 3, "Mme": 3,"Capt": 3,"Sir": 3 }
for dataset in train_test_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)

In [4]:
# fill missing age with median age for each title (Mr, Mrs, Miss, Others)
train["Age"] = train["Age"].fillna(train.groupby("Title")["Age"].transform("median"))
test["Age"] = test["Age"].fillna(test.groupby("Title")["Age"].transform("median"))
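To convince myself of what the title-wise median fill does, here is a small sketch on a hypothetical frame (the values are made up). It assigns the result back to the column instead of calling `fillna(..., inplace=True)` on a selected column, because recent pandas versions can warn about that chained pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two title groups, one missing Age in each
toy = pd.DataFrame({
    "Title": [0, 0, 0, 1, 1, 1],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform("median") returns a Series aligned with toy,
# holding each row's group median (NaN values are skipped)
group_median = toy.groupby("Title")["Age"].transform("median")

# fillna only touches the rows where Age is missing
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Rows that already had an Age keep it, and each gap receives the median of its own Title group.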

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
Title          891 non-null int64
dtypes: float64(2), int64(6), object(5)
memory usage: 73.1+ KB


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
Title          418 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 31.1+ KB


### 3.2.1 Age binning

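Feature vector map (Age range to code):

- child (Age <= 16): 0
- young (16 < Age <= 26): 1
- adult (26 < Age <= 36): 2
- mid-age (36 < Age <= 62): 3
- senior (Age > 62): 4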

In [13]:
for dataset in train_test_data:
    # replace each Age with its bin code; every later rule requires Age > 16, so codes 0-3 are never re-binned
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 26), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 36), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 36) & (dataset['Age'] <= 62), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 62, 'Age'] = 4
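The same bins can probably be produced in one call with `pd.cut`, which avoids overwriting Age step by step. This is only a sketch with made-up ages. The bin edges are copied from the thresholds above, and `pd.cut` treats intervals as right-inclusive by default, which matches the `<=` comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary values 16, 26 and 62
ages = pd.Series([2.0, 16.0, 22.0, 26.0, 35.0, 38.0, 62.0, 70.0])

# Edges: (-inf, 16], (16, 26], (26, 36], (36, 62], (62, inf)
bins = [-np.inf, 16, 26, 36, 62, np.inf]

# labels=False returns the integer bin index instead of interval labels
age_bin = pd.cut(ages, bins=bins, labels=False)
print(age_bin.tolist())
```

Writing the result to a new column (for example a hypothetical `AgeBin`) would also keep the original Age values available for later features.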

### 3.2.2 Sex

In [14]:
sex_mapping = {"male": 0, "female": 1}
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)
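One thing to watch with `map` and a dict: values that are not keys of the dict become NaN. So if this cell is run a second time in the notebook, the already-encoded 0/1 values no longer match "male"/"female" and the whole column turns into NaN. A small sketch with a hypothetical Series:

```python
import pandas as pd

sex_mapping = {"male": 0, "female": 1}

sex = pd.Series(["male", "female", "male"])  # hypothetical values

once = sex.map(sex_mapping)    # strings -> 0/1 as intended
twice = once.map(sex_mapping)  # 0/1 are not keys of the dict -> all NaN

print(once.tolist())
print(twice.isnull().all())
```

The same applies to the Title mapping: any title missing from `title_mapping` would silently become NaN, so a quick `isnull().sum()` after mapping is a cheap check.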

### 3.3 Machine learning

In [15]:
## Is there an easier way to pull out just the columns we want?
cpy = pd.DataFrame({
        "Pclass": train["Pclass"],
        "Sex": train["Sex"],
        "Age": train["Age"]
    }).copy()

In [16]:
# Confirm that Age, Pclass, and Sex are all filled in (test should be fully filled in as well)
cpy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
Age       891 non-null float64
Pclass    891 non-null int64
Sex       891 non-null int64
dtypes: float64(1), int64(2)
memory usage: 20.9 KB


In [17]:
cpy.head()

| | Age | Pclass | Sex |
|---|---|---|---|
| 0 | 1.0 | 3 | 0 |
| 1 | 3.0 | 1 | 1 |
| 2 | 1.0 | 3 | 1 |
| 3 | 2.0 | 1 | 1 |
| 4 | 2.0 | 3 | 0 |


#### Now let's run the model (using SVM, since it is said to work best(?))

In [9]:
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import numpy as np

In [10]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
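The `k_fold` object is what `cross_val_score` takes as its `cv` argument, which gives a local accuracy estimate before submitting. Here is a minimal sketch on synthetic data. The random features and the toy target are my own assumptions and only stand in for the (Age bin, Pclass, Sex) columns and `train['Survived']`.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for the three encoded features
X = np.column_stack([
    rng.integers(0, 5, size=200),  # Age bin 0..4
    rng.integers(1, 4, size=200),  # Pclass 1..3
    rng.integers(0, 2, size=200),  # Sex 0/1
])
y = X[:, 2]  # toy target equal to the Sex code, purely illustrative

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold; the mean is the usual summary
scores = cross_val_score(SVC(), X, y, cv=k_fold, scoring="accuracy")
mean_accuracy = scores.mean()
print(scores.round(3), round(mean_accuracy, 3))
```

With the real data this would be called with `cpy` and `train['Survived']`, and the mean could then be compared against the leaderboard score.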

In [None]:
# train_data = train.drop('Survived', axis=1)
# target = train['Survived']

In [20]:
train_data = cpy
target = train['Survived']
clf = SVC()
clf.fit(train_data, target)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [21]:
test_data = pd.DataFrame({
        "Pclass": test["Pclass"],
        "Sex": test["Sex"],
        "Age": test["Age"]
    }).copy()
prediction = clf.predict(test_data)

In [22]:
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": prediction
    })

submission.to_csv('submission_y03.csv', index=False)
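Before uploading, a quick sanity check on the submission frame can catch shape or value mistakes. This is a sketch only: `check_submission` is a hypothetical helper of mine, and the rules it checks (two columns, one row per test passenger, 0/1 predictions) are taken from what the submissions in this notebook look like.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_rows: int) -> None:
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError("columns must be PassengerId, Survived")
    if len(submission) != expected_rows:
        raise ValueError("unexpected number of rows")
    if not submission["PassengerId"].is_unique:
        raise ValueError("duplicate PassengerId")
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must be 0 or 1")


# Hypothetical three-row submission; with the real data use expected_rows=len(test)
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})
check_submission(toy_submission, expected_rows=3)
print("submission looks valid")
```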

In [24]:
submission = pd.read_csv('submission_y03.csv')
submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


### Result: 0.77511 / rank 6373 (up 4760 places)
That's it for now.