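# <center><span style="color:DarkBlue">Titanic Survival Prediction (2): Second-Round Preprocessing and Modeling</span></center> <a class="tocSkip">
- The sinking of the Titanic is remembered for the deaths of many celebrities and wealthy passengers.
- The goal is to predict whether each passenger survived, based on the passenger data.
- Working through this gives hands-on experience with a basic data analysis workflow and a classification problem.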

**Variable descriptions**

\**As a first pass, the following variables are not used for modeling.*
<br>\**Unused variables: PassengerId, Name, Surname, Firstname, SibSp, Parch, Ticket, Cabin*


- ~(1) PassengerId: unique passenger identifier (each passenger in the dataset has a unique ID)~
- (2) Survived: survival status (0 = died, 1 = survived)
- (3) Pclass: ticket class (1 = first class, 2 = second class, 3 = third class)
- ~(4) Name: passenger name (format: Surname, Title. Firstname)~
- ~(5) Surname: derived variable holding the surname part of Name~
- (6) Title: derived variable holding the title (honorific) part of Name
- ~(7) Firstname: derived variable holding the first-name part of Name~
- (8) Sex: passenger sex ("male" or "female")
- (9) Age: passenger age
- ~(10) SibSp: number of siblings or spouses aboard~
- ~(11) Parch: number of parents or children aboard~
- ~(12) Ticket: ticket number~
- (13) Fare: fare paid
- ~(14) Cabin: cabin number~
- (15) Embarked: port of embarkation ("C" = Cherbourg, "Q" = Queenstown, "S" = Southampton)
- (16) FamilySize: derived variable for the number of family members aboard, computed as SibSp + Parch
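
The derived variables in the list above were created during the first round of preprocessing, which is not shown in this notebook. The following is only a minimal sketch of how they could be built, assuming every Name follows the "Surname, Title. Firstname" format; the toy rows and the regular expression are illustrative assumptions, and the grouping of rare titles into "Others" is not reproduced here.

```python
import pandas as pd

# Toy rows in the "Surname, Title. Firstname" format (illustrative data)
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})

# Split Name into three named groups with one regular expression
pattern = r"^(?P<Surname>[^,]+),\s*(?P<Title>[^.]+)\.\s*(?P<Firstname>.+)$"
parts = toy["Name"].str.extract(pattern)
toy = pd.concat([toy, parts], axis=1)

# FamilySize is the plain sum of SibSp and Parch
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

print(toy[["Surname", "Title", "Firstname", "FamilySize"]])
```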

## Library imports and environment setup <a class="tocSkip">

In [48]:
#Import the basic libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_format = 'retina'  #renders text in plots more sharply

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler

In [3]:
#Check the working directory
os.getcwd() #print the current working directory

'C:\\Users\\lys17\\Desktop'

# Loading the data
Load Titanic_clean.csv, the file produced by the first round of preprocessing.

In [4]:
data = pd.read_csv("Titanic_clean.csv")
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Surname,Title,Firstname,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",Braund,Mr,Owen Harris,male,22.000000,1,0,A/5 21171,7.2500,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Cumings,Mrs,John Bradley (Florence Briggs Thayer),female,38.000000,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",Heikkinen,Miss,Laina,female,26.000000,0,0,STON/O2. 3101282,7.9250,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Futrelle,Mrs,Jacques Heath (Lily May Peel),female,35.000000,1,0,113803,53.1000,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",Allen,Mr,William Henry,male,35.000000,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Montvila,Others,Juozas,male,27.000000,0,0,211536,13.0000,,S,0
887,888,1,1,"Graham, Miss. Margaret Edith",Graham,Miss,Margaret Edith,female,19.000000,0,0,112053,30.0000,B42,S,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Johnston,Miss,"Catherine Helen ""Carrie""",female,21.845638,1,2,W./C. 6607,23.4500,,S,3
889,890,1,1,"Behr, Mr. Karl Howell",Behr,Mr,Karl Howell,male,26.000000,0,0,111369,30.0000,C148,C,0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Surname      891 non-null    object 
 5   Title        891 non-null    object 
 6   Firstname    891 non-null    object 
 7   Sex          891 non-null    object 
 8   Age          891 non-null    float64
 9   SibSp        891 non-null    int64  
 10  Parch        891 non-null    int64  
 11  Ticket       891 non-null    object 
 12  Fare         891 non-null    float64
 13  Cabin        204 non-null    object 
 14  Embarked     891 non-null    object 
 15  FamilySize   891 non-null    int64  
dtypes: float64(2), int64(6), object(8)
memory usage: 111.5+ KB


# Encoding categorical data as numbers
- Nominal variables: Title, Sex, Embarked
- Ordinal variable: Pclass

## One-Hot Encoding
Used to convert nominal variables to numbers.

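### Using pd.get_dummies()
- Drawback: get_dummies does not learn the categories from the train data, so a category that appears only in train and not in test gets no column when test is one-hot encoded, and the two encoded sets end up with different columns.
<br>=> sklearn's OneHotEncoder is therefore the better choice.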
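
The column mismatch described above can be reproduced on toy data. This is a minimal sketch: the two small frames are made up for illustration, and the `reindex` step is one possible workaround, not something the notebook itself uses.

```python
import pandas as pd

# Toy data: "Q" appears in train only
toy_train = pd.DataFrame({"Embarked": ["C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "S", "C"]})

# Each call builds columns from the categories it sees in that frame only
train_dummy = pd.get_dummies(toy_train, dtype=int)
test_dummy = pd.get_dummies(toy_test, dtype=int)

print(list(train_dummy.columns))
print(list(test_dummy.columns))  # Embarked_Q is missing here

# Possible workaround: force the test frame onto the train columns
test_aligned = test_dummy.reindex(columns=train_dummy.columns, fill_value=0)
print(test_aligned)
```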

In [8]:
#Convert the Title, Sex, and Embarked variables to dummy variables
df_dummy = pd.get_dummies(data[["Title","Sex","Embarked"]])
df_dummy

Unnamed: 0,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Others,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1,0,0,0,1,0,0,1
1,0,0,0,1,0,1,0,1,0,0
2,0,1,0,0,0,1,0,0,0,1
3,0,0,0,1,0,1,0,0,0,1
4,0,0,1,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...
886,0,0,0,0,1,0,1,0,0,1
887,0,1,0,0,0,1,0,0,0,1
888,0,1,0,0,0,1,0,0,0,1
889,0,0,1,0,0,0,1,1,0,0


In [9]:
#Append the dummy variables to the DataFrame
data_v1 = data.copy() #deep copy (DataFrame.copy() defaults to deep=True)

data_v1 = pd.concat([data_v1, df_dummy], axis=1)
data_v1

Unnamed: 0,PassengerId,Survived,Pclass,Name,Surname,Title,Firstname,Sex,Age,SibSp,...,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Others,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",Braund,Mr,Owen Harris,male,22.000000,1,...,0,0,1,0,0,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Cumings,Mrs,John Bradley (Florence Briggs Thayer),female,38.000000,1,...,0,0,0,1,0,1,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",Heikkinen,Miss,Laina,female,26.000000,0,...,0,1,0,0,0,1,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Futrelle,Mrs,Jacques Heath (Lily May Peel),female,35.000000,1,...,0,0,0,1,0,1,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",Allen,Mr,William Henry,male,35.000000,0,...,0,0,1,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",Montvila,Others,Juozas,male,27.000000,0,...,0,0,0,0,1,0,1,0,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",Graham,Miss,Margaret Edith,female,19.000000,0,...,0,1,0,0,0,1,0,0,0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",Johnston,Miss,"Catherine Helen ""Carrie""",female,21.845638,1,...,0,1,0,0,0,1,0,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",Behr,Mr,Karl Howell,male,26.000000,0,...,0,0,1,0,0,0,1,1,0,0


### Using sklearn's OneHotEncoder
- The same caveat as for feature scaling applies.
<br>The encoder fit on the train set must be the one used to encode the valid and test sets.

In [13]:
ohe = OneHotEncoder(sparse_output=False) #sparse_output=False: return a dense array instead of a sparse matrix

result_ohe = ohe.fit_transform(data_v1[["Title","Sex","Embarked"]])
result_ohe

array([[0., 0., 1., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 1.],
       ...,
       [0., 1., 0., ..., 0., 0., 1.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])

In [15]:
#Confirm that the categories of each variable were learned
ohe.categories_

[array(['Master', 'Miss', 'Mr', 'Mrs', 'Others'], dtype=object),
 array(['female', 'male'], dtype=object),
 array(['C', 'Q', 'S'], dtype=object)]

In [22]:
#result_ohe is a numpy array, so convert it to a DataFrame
df_ohe = pd.DataFrame(result_ohe, columns=ohe.get_feature_names_out(["Title","Sex","Embarked"]))
df_ohe

Unnamed: 0,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Others,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
886,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
887,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
888,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
889,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0


## Label Encoding
Used to convert ordinal variables to numbers.

In [37]:
#Numeric mapping
#1st class -> 2
#2nd class -> 1
#3rd class -> 0

data_v1["Pclass"] = data_v1["Pclass"].map({1:2, 2:1, 3:0})
data_v1["Pclass"]

0      0
1      2
2      0
3      2
4      0
      ..
886    1
887    2
888    0
889    2
890    0
Name: Pclass, Length: 891, dtype: int64
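
The notebook imports OrdinalEncoder but performs the mapping by hand. As a hedged alternative, the same 3 -> 0, 2 -> 1, 1 -> 2 mapping could be sketched with OrdinalEncoder as below. The cast to strings is an assumption made for this sketch, because scikit-learn expects numeric category lists to be given in sorted order and a descending numeric list would be rejected.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy Pclass values (illustrative)
pclass = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# Cast to str so that a custom category order can be passed
pclass_str = pclass.astype(str)

# The position in the list becomes the code: "3" -> 0, "2" -> 1, "1" -> 2
oe = OrdinalEncoder(categories=[["3", "2", "1"]])
encoded = oe.fit_transform(pclass_str).ravel().astype(int)

# Same result as the manual mapping used in the notebook
mapped = pclass["Pclass"].map({1: 2, 2: 1, 3: 0})
print(encoded.tolist(), mapped.tolist())
```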

# Feature selection

In [40]:
data_v1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   891 non-null    int64  
 1   Survived      891 non-null    int64  
 2   Pclass        891 non-null    int64  
 3   Name          891 non-null    object 
 4   Surname       891 non-null    object 
 5   Title         891 non-null    object 
 6   Firstname     891 non-null    object 
 7   Sex           891 non-null    object 
 8   Age           891 non-null    float64
 9   SibSp         891 non-null    int64  
 10  Parch         891 non-null    int64  
 11  Ticket        891 non-null    object 
 12  Fare          891 non-null    float64
 13  Cabin         204 non-null    object 
 14  Embarked      891 non-null    object 
 15  FamilySize    891 non-null    int64  
 16  Title_Master  891 non-null    uint8  
 17  Title_Miss    891 non-null    uint8  
 18  Title_Mr      891 non-null    

In [41]:
#Drop the columns that are not needed
data_final = data_v1.drop(["PassengerId","Name","Surname","Title","Firstname","Sex","SibSp","Parch","Ticket","Cabin","Embarked"], axis=1)
data_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Age           891 non-null    float64
 3   Fare          891 non-null    float64
 4   FamilySize    891 non-null    int64  
 5   Title_Master  891 non-null    uint8  
 6   Title_Miss    891 non-null    uint8  
 7   Title_Mr      891 non-null    uint8  
 8   Title_Mrs     891 non-null    uint8  
 9   Title_Others  891 non-null    uint8  
 10  Sex_female    891 non-null    uint8  
 11  Sex_male      891 non-null    uint8  
 12  Embarked_C    891 non-null    uint8  
 13  Embarked_Q    891 non-null    uint8  
 14  Embarked_S    891 non-null    uint8  
dtypes: float64(2), int64(3), uint8(10)
memory usage: 43.6 KB


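# Splitting into train, valid, and test sets
- The Titanic data comes with a separate test set file, so a single train_test_split call (train vs. valid) is enough.
- With real-world data that has no separate test file, train_test_split has to be called twice to obtain train, valid, and test sets.
- stratify=target: splits the dataset while preserving the class proportions of the target.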
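
For the case without a separate test file, the two-call split could look like the sketch below. The synthetic data, the 60/20/20 proportions, and the `_demo` variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 40% positive class (illustrative)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 400 + [0] * 600)

# First call: hold out 20% as the test set
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=100)

# Second call: 25% of the remaining 80% becomes the valid set (20% overall)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=100)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
# With stratify, each subset keeps roughly the original 40% positive rate
print(y_train_demo.mean(), y_val_demo.mean(), y_test_demo.mean())
```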

In [43]:
#Split into X (independent variables) and y (dependent variable)
X = data_final.drop("Survived", axis=1)
y = data_final["Survived"]

In [46]:
#Split into train and valid sets
#For classification problems, the 'stratify=target' option is important

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, test_size=0.2, stratify=y, random_state=100)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(712, 14) (179, 14) (712,) (179,)


=> After splitting the dataset, check the dimensions of each part.

# Feature scaling

## Standard Scaler

In [56]:
#Create the scaler object
ss = StandardScaler()

#Fit on train, then transform
X_train_ss = ss.fit_transform(X_train)
X_val_ss = ss.transform(X_val)
#X_test_ss = ss.transform(X_test) #when a test set is available

#Note: .std() returns the standard deviation, not the variance
print("<train set>")
print(f'mean: {X_train_ss.mean()}\nstd: {X_train_ss.std()}')
print("<valid set>")
print(f'mean: {X_val_ss.mean()}\nstd: {X_val_ss.std()}')

<train set>
mean: 3.599760047741278e-17
std: 0.9999999999999999
<valid set>
mean: -0.0029392635343672546
std: 0.9792864271985184
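
The valid-set statistics are close to, but not exactly, 0 and 1 because the scaler reuses the mean and standard deviation learned from train. A minimal sketch on made-up numbers, comparing StandardScaler with the manual computation (x - mean) / std per column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train and valid arrays with two features (illustrative values)
train_arr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
val_arr = np.array([[2.5, 40.0]])

scaler = StandardScaler().fit(train_arr)

# Statistics come from train only; StandardScaler uses the population std (ddof=0)
mu = train_arr.mean(axis=0)
sigma = train_arr.std(axis=0)
manual_val = (val_arr - mu) / sigma

# Per-column check: train columns have mean ~0 and std ~1 after scaling
train_scaled = scaler.transform(train_arr)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(np.allclose(scaler.transform(val_arr), manual_val))
```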


## MinMax Scaler

In [58]:
#Create the scaler object
mms = MinMaxScaler()

#Fit on train, then transform
X_train_mms = mms.fit_transform(X_train)
X_val_mms = mms.transform(X_val)
#X_test_mms = mms.transform(X_test) #when a test set is available

print("<train set>")
print(f'max: {X_train_mms.max()}\nmin: {X_train_mms.min()}')
print("<valid set>")
print(f'max: {X_val_mms.max()}\nmin: {X_val_mms.min()}')

<train set>
max: 1.0
min: 0.0
<valid set>
max: 1.0
min: 0.0
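
Here the valid set also spans exactly 0 to 1, but that is not guaranteed in general: the scaler applies (x - min) / (max - min) with the min and max learned from train, so valid values outside the train range fall outside [0, 1]. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: valid contains values below and above the train range
train_col = np.array([[1.0], [3.0], [5.0]])
val_col = np.array([[0.0], [10.0]])

mms_demo = MinMaxScaler().fit(train_col)   # learns min=1, max=5 from train

train_scaled = mms_demo.transform(train_col)
val_scaled = mms_demo.transform(val_col)

print(train_scaled.ravel())  # stays inside [0, 1]
print(val_scaled.ravel())    # (0-1)/4 = -0.25 and (10-1)/4 = 2.25
```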


## MaxAbs Scaler

In [59]:
#Create the scaler object
mas = MaxAbsScaler()

#Fit on train, then transform
X_train_mas = mas.fit_transform(X_train)
X_val_mas = mas.transform(X_val)
#X_test_mas = mas.transform(X_test) #when a test set is available

print("<train set>")
print(f'max: {X_train_mas.max()}\nmin: {X_train_mas.min()}')
print("<valid set>")
print(f'max: {X_val_mas.max()}\nmin: {X_val_mas.min()}')

<train set>
max: 1.0
min: 0.0
<valid set>
max: 1.0
min: 0.0
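
MaxAbsScaler divides each feature by its largest absolute value in train, so the scaled train values lie in [-1, 1]. The minimum of 0.0 above is consistent with every selected feature being non-negative. A minimal sketch with made-up numbers that include a negative value:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy feature with a negative value; the largest absolute value is 4
train_col = np.array([[-4.0], [2.0], [1.0]])

mas_demo = MaxAbsScaler().fit(train_col)
scaled = mas_demo.transform(train_col)

# Equivalent manual computation: x / max(|x|)
manual = train_col / np.abs(train_col).max(axis=0)

print(scaled.ravel())
```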


## Robust Scaler

In [60]:
#Create the scaler object
rs = RobustScaler()

#Fit on train, then transform
X_train_rs = rs.fit_transform(X_train)
X_val_rs = rs.transform(X_val)
#X_test_rs = rs.transform(X_test) #when a test set is available

#Note: .std() returns the standard deviation, not the variance
print("<train set>")
print(f'mean: {X_train_rs.mean()}\nstd: {X_train_rs.std()}')
print("<valid set>")
print(f'mean: {X_val_rs.mean()}\nstd: {X_val_rs.std()}')

<train set>
mean: 0.16630486268773823
std: 0.9550796180307523
<valid set>
mean: 0.16763370385787865
std: 0.885141523125194
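
With its default settings, RobustScaler centers each feature on the median and divides by the interquartile range (IQR), which is why the mean and standard deviation above are not 0 and 1. A minimal sketch with made-up numbers, including one outlier, comparing it with the manual computation:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature with one large outlier (illustrative values)
train_col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

rs_demo = RobustScaler().fit(train_col)
scaled = rs_demo.transform(train_col)

# Manual computation: (x - median) / (Q3 - Q1)
median = np.median(train_col, axis=0)
q25, q75 = np.percentile(train_col, [25, 75], axis=0)
manual = (train_col - median) / (q75 - q25)

# The outlier barely affects the median and IQR, so the other points keep a sensible scale
print(scaled.ravel())
```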
