## Titanic Dataset Challenge

- The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation.

- Download `train.csv` and `test.csv` from the [Titanic challenge](https://www.kaggle.com/c/titanic) on [Kaggle](https://www.kaggle.com).
- Save the two files in the `datasets` directory as `titanic_train.csv` and `titanic_test.csv`, respectively.

### 1. Loading the data

In [74]:
import pandas as pd
# Forward slashes work on Windows as well as on Linux and macOS
train_data = pd.read_csv("datasets/titanic_train.csv")
test_data = pd.read_csv("datasets/titanic_test.csv")

### 2. Exploring the data

#### A look at train_data

In [75]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


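* **Survived**: the target. 0 means the passenger did not survive, 1 means the passenger survived.
* **Pclass**: passenger class (1st, 2nd, or 3rd).
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: number of siblings and spouses aboard with the passenger.
* **Parch**: number of children and parents aboard with the passenger.
* **Ticket**: ticket ID.
* **Fare**: ticket fare (in pounds).
* **Cabin**: cabin number.
* **Embarked**: where the passenger boarded. C (Cherbourg), Q (Queenstown), S (Southampton).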


#### Checking for missing data

In [76]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


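- Some values of **Age**, **Cabin**, and **Embarked** are null.
- **Cabin** in particular is about 77% null (only 204 of 891 values are present), so ignore it for now and work with the rest.
- **Age** has 177 null values (about 20%), so we need to decide how to handle them. Replacing the nulls with the median age is one option.
- **Name** and **Ticket** are somewhat tricky to convert to numbers, so ignore them for now.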
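The percentages in the list above can be read off directly instead of being computed by hand from `info()`. The following is a minimal sketch on a made-up five-row frame (the `toy` frame and its values are assumptions for illustration, not the real data); the same expression should apply to `train_data`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train_data (made-up values, not the real Titanic rows).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# isnull() marks missing cells; the column-wise mean of that boolean
# frame is the fraction of missing values per column.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```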

#### Summary statistics

In [77]:
train_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


* Only about 38% of the passengers **Survived**.
* The mean **Fare** is 32.20 pounds.
* The mean **Age** is just under 30.

#### Check that Survived (the target for machine learning) contains only 0 and 1

In [78]:
train_data["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64
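The counts above can also be expressed as proportions, which makes the class balance explicit. This is a small sketch that rebuilds only the `Survived` column from the two printed counts; on the real frame the same call would be `train_data["Survived"].value_counts(normalize=True)`.

```python
import pandas as pd

# Counts taken from the value_counts() output above: 549 zeros and 342 ones.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns proportions instead of raw counts.
class_share = survived.value_counts(normalize=True)
print(class_share.round(3))
```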

#### Check the categorical features

In [79]:
train_data["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [80]:
train_data["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [81]:
train_data["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The **Embarked** feature is the port where the passenger boarded: C=Cherbourg, Q=Queenstown, S=Southampton.

### 3. Exploring the data (in detail)

#### View the Name and Age columns sorted by Age

In [82]:
train_data[["Name", "Age"]].sort_values(by="Age")

Unnamed: 0,Name,Age
803,"Thomas, Master. Assad Alexander",0.42
755,"Hamalainen, Master. Viljo",0.67
644,"Baclini, Miss. Eugenie",0.75
469,"Baclini, Miss. Helene Barbara",0.75
78,"Caldwell, Master. Alden Gates",0.83
...,...,...
859,"Razi, Mr. Raihed",
863,"Sage, Miss. Dorothy Edith ""Dolly""",
868,"van Melkebeke, Mr. Philemon",
878,"Laleff, Mr. Kristo",


#### Check the Name and Age of passengers aged 60 or older

In [83]:
train_data[train_data["Age"] >= 60][["Name", "Age"]]

Unnamed: 0,Name,Age
33,"Wheadon, Mr. Edward H",66.0
54,"Ostby, Mr. Engelhart Cornelius",65.0
96,"Goldschmidt, Mr. George B",71.0
116,"Connors, Mr. Patrick",70.5
170,"Van der hoef, Mr. Wyckoff",61.0
252,"Stead, Mr. William Thomas",62.0
275,"Andrews, Miss. Kornelia Theodosia",63.0
280,"Duane, Mr. Frank",65.0
326,"Nysveen, Mr. Johan Hansen",61.0
366,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",60.0


#### Check passengers who are exactly 60 years old (Age), in first class, and female

In [84]:
train_data[(train_data["Age"] == 60) & (train_data["Pclass"] == 1) & (train_data["Sex"] == "female")]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
366,367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",female,60.0,1,0,110813,75.25,D37,C


#### Check the maximum and minimum fare (Fare)

In [85]:
train_data["Fare"].max()

512.3292

In [86]:
train_data["Fare"].min()

0.0

#### Check the survival rate by passenger class (Pclass)

In [87]:
train_data.groupby("Pclass").size()  # same counts as train_data["Pclass"].value_counts(), ordered by class

Pclass
1    216
2    184
3    491
dtype: int64

In [88]:
train_data.groupby(["Pclass", "Survived"]).size()

Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
dtype: int64

In [89]:
train_data.groupby("Pclass")["Survived"].mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
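The two-key group sizes and the per-class mean above can be combined into a single table of rates. The sketch below rebuilds only the `Pclass` and `Survived` columns from the group sizes printed above and uses `pd.crosstab` with `normalize="index"`; the `toy` frame is a reconstruction for illustration, not the original data.

```python
import pandas as pd

# Rebuild just the two columns from the group sizes printed above.
counts = {(1, 0): 80, (1, 1): 136, (2, 0): 97, (2, 1): 87, (3, 0): 372, (3, 1): 119}
rows = [(pclass, survived) for (pclass, survived), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# normalize="index" turns each row (each class) into proportions that sum to 1.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates.round(3))
```

The column labelled 1 in the result should match the per-class means shown above.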

#### Combine features to create a new feature (RelativesOnboard), separating passengers travelling with family from those travelling alone

In [90]:
# The + 1 counts the passenger, so a value of 1 means travelling alone
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"] + 1

In [91]:
train_data["RelativesOnboard"].value_counts()

1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: RelativesOnboard, dtype: int64

#### Mean survival rate for each RelativesOnboard group

In [92]:
train_data.groupby("RelativesOnboard")["Survived"].mean()

RelativesOnboard
1     0.303538
2     0.552795
3     0.578431
4     0.724138
5     0.200000
6     0.136364
7     0.333333
8     0.000000
11    0.000000
Name: Survived, dtype: float64
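Some of these means rest on very few passengers (the earlier `value_counts()` shows only 6 and 7 passengers in groups 8 and 11), so it can help to print the group size next to the rate. The following sketch uses a made-up eight-row frame (an assumption for illustration) and `agg` with two functions.

```python
import pandas as pd

# Made-up rows standing in for train_data.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 4, 0, 1, 0],
    "Parch":    [0, 0, 2, 0, 2, 0, 1, 0],
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0],
})
toy["RelativesOnboard"] = toy["SibSp"] + toy["Parch"] + 1

# One row per group: survival rate and the number of passengers behind it.
summary = toy.groupby("RelativesOnboard")["Survived"].agg(["mean", "count"])
print(summary)
```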

### 4. Data preprocessing (handling missing data, categorization, etc.)

#### Cabin column: drop it entirely

In [93]:
# Drop columns with fewer than 500 non-null values; only Cabin (204 non-null) qualifies
train_data.dropna(thresh=500, axis=1, inplace=True)

In [94]:
train_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked', 'RelativesOnboard'],
      dtype='object')

#### Embarked column: replace missing values with the most frequent port

In [95]:
train_data["Embarked"].value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

In [96]:
# Replace NaN values in the Embarked column with the most frequent port of embarkation
most_freq = train_data["Embarked"].value_counts().idxmax()
most_freq

'S'

In [97]:
# Assign the result back instead of filling in place on a column selection
train_data["Embarked"] = train_data["Embarked"].fillna(most_freq)

In [98]:
train_data["Embarked"].value_counts(dropna=False)

S    646
C    168
Q     77
Name: Embarked, dtype: int64
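An alternative way to find the most frequent value is `Series.mode()`. This is a minimal sketch on a made-up series (the values are assumptions for illustration); it should give the same result as `value_counts().idxmax()` when there is a single most frequent value.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for train_data["Embarked"], with two missing entries.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores NaN by default and returns a Series (ties would give
# several entries), so take the first one.
most_freq = embarked.mode()[0]
filled = embarked.fillna(most_freq)
print(filled.value_counts(dropna=False))
```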

#### Age column: replace missing values with the median

In [99]:
train_data["Age"].isnull().sum()

177

In [100]:
median_age = train_data["Age"].median(axis=0)
median_age

28.0

In [101]:
# Assign the result back instead of filling in place on a column selection
train_data["Age"] = train_data["Age"].fillna(median_age)

In [102]:
train_data["Age"].isnull().sum()

0

#### Age column: split into categories

In [103]:
bins = [0, 18, 25, 35, 60, 80]
labels = ["Children", "Youth", "YoungAdult", "MiddleAged", "Senior"]
age_cats = pd.cut(train_data["Age"], bins=bins, labels=labels)
age_cats

0           Youth
1      MiddleAged
2      YoungAdult
3      YoungAdult
4      YoungAdult
          ...    
886    YoungAdult
887         Youth
888    YoungAdult
889    YoungAdult
890    YoungAdult
Name: Age, Length: 891, dtype: category
Categories (5, object): ['Children' < 'Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [104]:
age_cats.value_counts()  # Series method; equivalent to pd.value_counts(age_cats)

YoungAdult    373
MiddleAged    195
Youth         162
Children      139
Senior         22
Name: Age, dtype: int64
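Which category an age on a bin edge falls into depends on the `right` parameter of `pd.cut`. The sketch below reuses the `bins` and `labels` from the cell above on a few hand-picked edge ages (the `ages` series is an assumption for illustration) and compares the default with `right=False`.

```python
import pandas as pd

bins = [0, 18, 25, 35, 60, 80]
labels = ["Children", "Youth", "YoungAdult", "MiddleAged", "Senior"]

# Ages chosen to sit on or near the bin edges.
ages = pd.Series([0.42, 18.0, 18.5, 25.0, 60.0, 80.0])

# Default right=True: intervals are (0, 18], (18, 25], ..., so 18 is "Children".
right_closed = pd.cut(ages, bins=bins, labels=labels)
# right=False: intervals are [0, 18), [18, 25), ..., so 18 is "Youth"
# and 80 falls outside the last bin and becomes NaN.
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "right=True": right_closed, "right=False": left_closed}))
```

With the default setting used in this notebook, the maximum age of 80 is still covered by the last bin.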

In [114]:
Age_dummies = pd.get_dummies(age_cats)
Age_dummies = Age_dummies.add_prefix("Age_")
Age_dummies

Unnamed: 0,Age_Children,Age_Youth,Age_YoungAdult,Age_MiddleAged,Age_Senior
0,0,1,0,0,0
1,0,0,0,1,0
2,0,0,1,0,0
3,0,0,1,0,0
4,0,0,1,0,0
...,...,...,...,...,...
886,0,0,1,0,0
887,0,1,0,0,0
888,0,0,1,0,0
889,0,0,1,0,0


#### Check for duplicate rows

In [106]:
train_data.duplicated().sum()

0

#### One-hot encode each of ["Pclass", "Sex", "Embarked"]

In [107]:
Pclass_dummies = pd.get_dummies(train_data["Pclass"])
Pclass_dummies = Pclass_dummies.add_prefix("Pclass_")
Pclass_dummies

Unnamed: 0,Pclass_1,Pclass_2,Pclass_3
0,0,0,1
1,1,0,0
2,0,0,1
3,1,0,0
4,0,0,1
...,...,...,...
886,0,1,0
887,1,0,0
888,0,0,1
889,1,0,0


In [108]:
Sex_dummies = pd.get_dummies(train_data["Sex"])
Sex_dummies = Sex_dummies.add_prefix("Sex_")
Sex_dummies

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
...,...,...
886,0,1
887,1,0
888,1,0
889,0,1


In [109]:
Embarked_dummies = pd.get_dummies(train_data["Embarked"])
Embarked_dummies = Embarked_dummies.add_prefix("Embarked_")
Embarked_dummies

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
886,0,0,1
887,0,0,1
888,0,0,1
889,1,0,0
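The three cells above can also be written as one call by passing `columns=` to `pd.get_dummies`, which prefixes each new column with the original column name. The following sketch runs on a made-up four-row frame (an assumption for illustration). The `dtype=int` argument requests 0/1 integers like the outputs shown above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Made-up rows standing in for train_data.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.2833, 7.925, 13.0],
})

# columns= selects what to encode; each new column is named <original>_<value>.
# The encoded columns are removed and the dummy columns are appended.
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
```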


#### Get the labels

In [110]:
y_train = train_data["Survived"].copy()

#### Combine the dummy-variable frames created above

In [111]:
train_data.drop(["PassengerId", "Survived", "Pclass", "Name", "Sex", "Embarked", "Ticket", "Age"], 
                axis=1, inplace=True)

In [117]:
X_train = pd.concat([train_data, Age_dummies, Pclass_dummies, Sex_dummies, Embarked_dummies], axis=1)

In [118]:
X_train.shape

(891, 17)

In [119]:
X_train.head()

Unnamed: 0,SibSp,Parch,Fare,RelativesOnboard,Age_Children,Age_Youth,Age_YoungAdult,Age_MiddleAged,Age_Senior,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,7.25,2,0,1,0,0,0,0,0,1,0,1,0,0,1
1,1,0,71.2833,2,0,0,0,1,0,1,0,0,1,0,1,0,0
2,0,0,7.925,1,0,0,1,0,0,0,0,1,1,0,0,0,1
3,1,0,53.1,2,0,0,1,0,0,1,0,0,1,0,0,0,1
4,0,0,8.05,1,0,0,1,0,0,0,0,1,0,1,0,0,1


#### Train a classifier

SVC(gamma='auto')
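The cell that produced this output is not included here. Judging only from the printed estimator, it plausibly created an `SVC` with `gamma="auto"` and fitted it. The sketch below is an assumption along those lines and uses synthetic data from `make_classification` in place of `X_train` and `y_train`, so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data; in the notebook these would be X_train and y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=17, random_state=42)

svm_clf = SVC(gamma="auto")
# fit() returns the estimator itself, which is what a notebook cell would display.
svm_clf.fit(X_demo, y_demo)
print(svm_clf)
```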

#### Evaluate with cross-validation

NameError: name 'svm_clf' is not defined
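A `NameError` like this one means `svm_clf` did not exist in the running session when the cell was executed, for example because the cell that defines it had not been run. The following is a hedged sketch of what the evaluation might look like once the classifier is defined. The fold count `cv=10` and the synthetic stand-in data are assumptions, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; in the notebook these would be X_train and y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=17, random_state=42)

# The classifier must be defined before it is passed to cross_val_score.
svm_clf = SVC(gamma="auto")

# cross_val_score clones and fits the estimator once per fold and
# returns one accuracy score per fold (cv=10 is an assumed setting).
svm_scores = cross_val_score(svm_clf, X_demo, y_demo, cv=10)
print(svm_scores.mean())
```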

#### Apply RandomForestClassifier

0.8035705368289637
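The code behind this number is also missing, so the value is presumably a mean cross-validation accuracy, but that cannot be confirmed from the output alone. A minimal sketch of such an evaluation follows. The hyperparameters (`n_estimators=100`, `random_state=42`), the fold count, and the synthetic data are all assumptions, so the printed score will not match the value above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook these would be X_train and y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=17, random_state=42)

# Assumed hyperparameters; the original cell is not shown.
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_demo, y_demo, cv=10)
print(forest_scores.mean())
```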

#### Write the predictions to a CSV file and upload it
(test_data must go through the same preprocessing as train_data before prediction)

In [120]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
