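## Building a Survival Prediction Model
### Using the training data (X_train, y_train), build a survival prediction model, apply it to the evaluation data (X_test), and write the resulting predictions to a CSV file in the format shown below (the submitted model is scored by accuracy)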

(a) Provided data
- y_train: survival label (training set)
- X_train, X_test: passenger information (training and evaluation sets)

(b) Data format and contents
- y_train (712 passengers)

**The exam-environment setup exists only to create X_train, y_train, and X_test in the same form as the sample problem.**

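### Notes
- Building a high-performing prediction model requires appropriate data preprocessing, feature engineering, classification algorithms, hyperparameter tuning, model ensembling, and so on.
- Submit code that produces a file named {exam number}.csv.
- The submitted model is evaluated by accuracy.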
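
Because the submission is scored by accuracy, it may help to recall that accuracy is simply the share of predictions that match the true labels. The following is a minimal sketch with made-up labels (the two lists are illustrative and are not exam data):

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for five passengers (illustrative values only)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# accuracy = number of correct predictions / number of samples
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Four of the five predictions match, so both values come out to 0.8.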

CSV output format

![image.png](attachment:de1920de-121e-47c3-a61f-e905386713bf.png)

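## [Reference] Task type 2 instructions
- Use the print() function if you want to display output
- Example: print(df.head())
- There is no need to set the working directory with getcwd(), chdir(), etc.
- Internal drive paths (C: etc.) cannot be accessed in file paths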

### Example of reading the data files
- import pandas as pd
- X_test = pd.read_csv("data/X_test.csv")
- X_train = pd.read_csv("data/X_train.csv")
- y_train = pd.read_csv("data/y_train.csv")

### User code

### Answer submission reference
- Adapt the code below by replacing the prediction variable and the exam number with your own
- pd.DataFrame({'cust_id': X_test.cust_id, 'gender': pred}).to_csv('003000000.csv', index=False)
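
The template line above uses column names from a different sample problem (cust_id, gender). For this task the file would presumably carry PassengerId and Survived instead. A minimal sketch of that step, assuming hypothetical IDs and predictions and writing into a temporary directory so the demo leaves no stray file:

```python
import os
import tempfile

import pandas as pd

# Hypothetical IDs and predictions; in the notebook these come from the test set and the model
ids = pd.Series([211, 877, 667], name="PassengerId")
pred = [0, 0, 1]

submission = pd.DataFrame({"PassengerId": ids, "Survived": pred})

# "003000000.csv" stands in for {exam number}.csv
out_path = os.path.join(tempfile.mkdtemp(), "003000000.csv")
submission.to_csv(out_path, index=False)  # index=False keeps the row index out of the file

print(pd.read_csv(out_path))
```

Reading the file back is a cheap way to confirm that the header and row count look as expected before submitting.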

In [341]:
# Exam environment setup (do not modify this code)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def exam_data_load(df, target, id_name="", null_name=""):
    if id_name == "":
        df = df.reset_index().rename(columns={"index": "id"})
        id_name = 'id'
    else:
        id_name = id_name
    
    if null_name != "":
        df[df == null_name] = np.nan
    
    X_train, X_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=2021)
    y_train = X_train[[id_name, target]]
    X_train = X_train.drop(columns=[id_name, target])
    y_test = X_test[[id_name, target]]
    X_test = X_test.drop(columns=[id_name, target])
    return X_train, X_test, y_train, y_test 
    
df = pd.read_csv("../input/titanic/train.csv")
X_train, X_test, y_train, y_test = exam_data_load(df, 
                                                  target='Survived', 
                                                  id_name='PassengerId')

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 10), (179, 10), (712, 2), (179, 2))

## Start
### Load libraries and data


In [342]:
# Load libraries and data
import pandas as pd
import numpy as np
import sklearn
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [343]:
X_train.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
90,3,"Christmann, Mr. Emil",male,29.0,0,0,343276,8.05,,S
103,3,"Johansson, Mr. Gustaf Joel",male,33.0,0,0,7540,8.6542,,S
577,1,"Silvey, Mrs. William Baird (Alice Munger)",female,39.0,1,0,13507,55.9,E44,S
215,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C
191,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S


In [344]:
X_test.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
210,3,"Ali, Mr. Ahmed",male,24.0,0,0,SOTON/O.Q. 3101311,7.05,,S
876,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S
666,2,"Butler, Mr. Reginald Fenton",male,25.0,0,0,234686,13.0,,S
819,3,"Skoog, Master. Karl Thorsten",male,10.0,3,2,347088,27.9,,S
736,3,"Ford, Mrs. Edward (Margaret Ann Watson)",female,48.0,1,3,W./C. 6608,34.375,,S


In [345]:
y_train.head()

Unnamed: 0,PassengerId,Survived
90,91,0
103,104,0
577,578,1
215,216,1
191,192,0


### Preprocessing and EDA

In [346]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 90 to 116
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    712 non-null    int64  
 1   Name      712 non-null    object 
 2   Sex       712 non-null    object 
 3   Age       575 non-null    float64
 4   SibSp     712 non-null    int64  
 5   Parch     712 non-null    int64  
 6   Ticket    712 non-null    object 
 7   Fare      712 non-null    float64
 8   Cabin     170 non-null    object 
 9   Embarked  711 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 61.2+ KB


In [347]:
X_train.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
count,712.0,575.0,712.0,712.0,712.0
mean,2.285112,29.414783,0.533708,0.391854,33.388155
std,0.842875,14.589601,1.099284,0.802311,50.807818
min,1.0,0.42,0.0,0.0,0.0
25%,1.0,20.0,0.0,0.0,7.925
50%,3.0,28.0,0.0,0.0,15.0479
75%,3.0,37.0,1.0,0.0,31.3875
max,3.0,74.0,8.0,6.0,512.3292


In [348]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 210 to 45
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    179 non-null    int64  
 1   Name      179 non-null    object 
 2   Sex       179 non-null    object 
 3   Age       139 non-null    float64
 4   SibSp     179 non-null    int64  
 5   Parch     179 non-null    int64  
 6   Ticket    179 non-null    object 
 7   Fare      179 non-null    float64
 8   Cabin     34 non-null     object 
 9   Embarked  178 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 15.4+ KB


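Age, Cabin, and Embarked contain missing values.

Cabin has too many missing values (only 170 of the 712 training rows are non-null), so it is dropped.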
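
The decision to drop Cabin can be backed by computing the fraction of missing values per column. The following is a minimal sketch on a tiny made-up frame (column names mirror the notebook, values do not):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})

# Fraction of missing values per column, largest first
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

Applied to X_train, the same expression would rank Cabin first, which is the pattern already visible in the info() output above.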

In [349]:
temp = X_train.copy()
temp['target'] = y_train.iloc[:,1]

In [350]:
temp.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,target
90,3,"Christmann, Mr. Emil",male,29.0,0,0,343276,8.05,,S,0
103,3,"Johansson, Mr. Gustaf Joel",male,33.0,0,0,7540,8.6542,,S,0
577,1,"Silvey, Mrs. William Baird (Alice Munger)",female,39.0,1,0,13507,55.9,E44,S,1
215,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C,1
191,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S,0


In [351]:
temp['Embarked'].describe()

count     711
unique      3
top         S
freq      514
Name: Embarked, dtype: object

In [352]:
temp['Embarked'] = temp['Embarked'].fillna('S') # fill with the mode

In [353]:
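# Age differs by class and sex, so missing values are filled with the group median.
# Selecting 'Age' before median() avoids aggregating the text columns.
temp.groupby(['Pclass', 'Sex'])['Age'].median()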

Pclass  Sex   
1       female    35.0
        male      38.0
2       female    30.0
        male      30.0
3       female    19.0
        male      25.0
Name: Age, dtype: float64

In [354]:
temp.loc[(temp['Pclass']==1)&(temp['Sex']=='female')&(temp['Age'].isna()),'Age'] = 35
temp.loc[(temp['Pclass']==1)&(temp['Sex']=='male')&(temp['Age'].isna()),'Age'] = 38
temp.loc[(temp['Pclass']==2)&(temp['Sex']=='female')&(temp['Age'].isna()),'Age'] = 30
temp.loc[(temp['Pclass']==2)&(temp['Sex']=='male')&(temp['Age'].isna()),'Age'] = 30
temp.loc[(temp['Pclass']==3)&(temp['Sex']=='female')&(temp['Age'].isna()),'Age'] = 19
temp.loc[(temp['Pclass']==3)&(temp['Sex']=='male')&(temp['Age'].isna()),'Age'] = 25
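
The six hard-coded assignments above can also be expressed as a lookup of the group medians, which avoids copying numbers by hand. This is a sketch on toy frames, not the notebook's method; `fill_age` and the `Age_median` helper column are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd

# Toy train/test frames (made-up values; column names follow the notebook)
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [30.0, 40.0, np.nan, 18.0, 20.0, np.nan],
})
test = pd.DataFrame({
    "Pclass": [1, 3],
    "Sex": ["male", "female"],
    "Age": [np.nan, 50.0],
})

# Learn the medians from the training data only
group_median = train.groupby(["Pclass", "Sex"])["Age"].median().rename("Age_median")

def fill_age(df, medians):
    """Fill missing Age with the median of the row's (Pclass, Sex) group."""
    out = df.join(medians, on=["Pclass", "Sex"])
    out["Age"] = out["Age"].fillna(out["Age_median"])
    return out.drop(columns="Age_median")

train_filled = fill_age(train, group_median)
test_filled = fill_age(test, group_median)
print(train_filled["Age"].tolist(), test_filled["Age"].tolist())
```

Reusing the training medians for the test set mirrors what the notebook does with its fixed values (35, 38, 30, 30, 19, 25).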

In [355]:
temp = pd.concat([temp, pd.get_dummies(temp.Sex).iloc[:,:-1]], axis = 1)
temp = pd.concat([temp, pd.get_dummies(temp.Embarked).iloc[:,:-1]], axis = 1)

temp.drop(['Name', 'Cabin', 'Sex', 'Embarked', 'Ticket'],axis=1, inplace=True)

In [356]:
X_test['Embarked'] = X_test['Embarked'].fillna('S')

X_test.loc[(X_test['Pclass']==1)&(X_test['Sex']=='female')&(X_test['Age'].isna()),'Age'] = 35
X_test.loc[(X_test['Pclass']==1)&(X_test['Sex']=='male')&(X_test['Age'].isna()),'Age'] = 38
X_test.loc[(X_test['Pclass']==2)&(X_test['Sex']=='female')&(X_test['Age'].isna()),'Age'] = 30
X_test.loc[(X_test['Pclass']==2)&(X_test['Sex']=='male')&(X_test['Age'].isna()),'Age'] = 30
X_test.loc[(X_test['Pclass']==3)&(X_test['Sex']=='female')&(X_test['Age'].isna()),'Age'] = 19
X_test.loc[(X_test['Pclass']==3)&(X_test['Sex']=='male')&(X_test['Age'].isna()),'Age'] = 25

X_test = pd.concat([X_test, pd.get_dummies(X_test.Sex).iloc[:,:-1]], axis = 1)
X_test = pd.concat([X_test, pd.get_dummies(X_test.Embarked).iloc[:,:-1]], axis = 1)

X_test.drop(['Name', 'Cabin', 'Sex', 'Embarked' ,'Ticket'], axis=1, inplace=True)
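
Encoding the training and test sets separately and dropping the last dummy column with `.iloc[:, :-1]` works here because both sets contain the same categories. If a category were missing from the test set, the columns would no longer line up. One possible safeguard, sketched on toy data, is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Toy frames (made-up values): the test set happens to contain no "Q" passengers
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_dummies = pd.get_dummies(train["Embarked"])
test_dummies = pd.get_dummies(test["Embarked"])

# Give the test set exactly the training columns; absent categories become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns), list(test_aligned.columns))
```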

In [357]:
temp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 90 to 116
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  712 non-null    int64  
 1   Age     712 non-null    float64
 2   SibSp   712 non-null    int64  
 3   Parch   712 non-null    int64  
 4   Fare    712 non-null    float64
 5   target  712 non-null    int64  
 6   female  712 non-null    uint8  
 7   C       712 non-null    uint8  
 8   Q       712 non-null    uint8  
dtypes: float64(2), int64(4), uint8(3)
memory usage: 41.0 KB


In [358]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 210 to 45
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  179 non-null    int64  
 1   Age     179 non-null    float64
 2   SibSp   179 non-null    int64  
 3   Parch   179 non-null    int64  
 4   Fare    179 non-null    float64
 5   female  179 non-null    uint8  
 6   C       179 non-null    uint8  
 7   Q       179 non-null    uint8  
dtypes: float64(2), int64(3), uint8(3)
memory usage: 8.9 KB


In [359]:
temp.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,target,female,C,Q
count,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0
mean,2.285112,28.763343,0.533708,0.391854,33.388155,0.380618,0.352528,0.198034,0.078652
std,0.842875,13.436188,1.099284,0.802311,50.807818,0.48588,0.478093,0.398798,0.269384
min,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,20.0,0.0,0.0,7.925,0.0,0.0,0.0,0.0
50%,3.0,26.0,0.0,0.0,15.0479,0.0,0.0,0.0,0.0
75%,3.0,36.0,1.0,0.0,31.3875,1.0,1.0,0.0,0.0
max,3.0,74.0,8.0,6.0,512.3292,1.0,1.0,1.0,1.0


In [360]:
X_test.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,female,C,Q
count,179.0,179.0,179.0,179.0,179.0,179.0,179.0,179.0
mean,2.402235,29.702067,0.480447,0.340782,27.494878,0.351955,0.150838,0.117318
std,0.80392,12.995215,1.11849,0.821797,44.811156,0.47892,0.358895,0.322702
min,1.0,0.67,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,22.0,0.0,0.0,7.8958,0.0,0.0,0.0
50%,3.0,26.0,0.0,0.0,13.0,0.0,0.0,0.0
75%,3.0,38.0,1.0,0.0,28.91875,1.0,0.0,0.0
max,3.0,80.0,8.0,5.0,512.3292,1.0,1.0,1.0


In [361]:
temp.groupby('target').mean()

Unnamed: 0_level_0,Pclass,Age,SibSp,Parch,Fare,female,C,Q
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,2.514739,29.520408,0.544218,0.31746,22.523278,0.136054,0.14059,0.0839
1,1.911439,27.531365,0.516605,0.512915,51.068636,0.704797,0.291513,0.070111


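- Died: the death rate is higher for men, for lower passenger classes, and for cheaper fares.
- Survived: the survival rate is higher for women, for higher passenger classes, and for more expensive fares.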

In [362]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBRFClassifier, XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [363]:
X_tra, X_val, y_tra, y_val = train_test_split(temp.drop('target',axis=1), temp.target, test_size=0.2, random_state=42)

In [364]:
list_ml = [AdaBoostClassifier(random_state=42), GradientBoostingClassifier(random_state=42), 
           RandomForestClassifier(random_state=42), ExtraTreesClassifier(random_state=42),
           SVC(random_state=42), DecisionTreeClassifier(random_state=42), XGBRFClassifier(random_state=42), XGBClassifier(random_state=42)]

In [365]:
for i in range(len(list_ml)):
    clf = list_ml[i]
    clf.fit(X_tra, y_tra)
    print(list_ml[i],':', clf.score(X_val, y_val))

AdaBoostClassifier(random_state=42) : 0.7972027972027972
GradientBoostingClassifier(random_state=42) : 0.8671328671328671
RandomForestClassifier(random_state=42) : 0.8391608391608392
ExtraTreesClassifier(random_state=42) : 0.8181818181818182
SVC(random_state=42) : 0.6713286713286714
DecisionTreeClassifier(random_state=42) : 0.8041958041958042




XGBRFClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                colsample_bytree=1, enable_categorical=False, gamma=0,
                gpu_id=-1, importance_type=None, interaction_constraints='',
                max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
                monotone_constraints='()', n_estimators=100, n_jobs=4,
                num_parallel_tree=100, objective='binary:logistic',
                predictor='auto', random_state=42, reg_alpha=0,
                scale_pos_weight=1, tree_method='exact', validate_parameters=1,
                verbosity=None) : 0.8461538461538461
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             

- GradientBoostingClassifier(random_state=42) : 0.8671328671328671
- RandomForestClassifier(random_state=42) : 0.8391608391608392
- XGBRFClassifier() : 0.8461538461538461

Next, we search for the best hyperparameters of the top three models.
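
The ranking above rests on a single 80/20 split, so small differences between models may partly reflect which rows landed in the validation set. A k-fold comparison is one way to reduce that noise. The sketch below uses synthetic data from `make_classification` as a stand-in for the preprocessed features; the model list and settings are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold accuracy per model; less sensitive to one lucky or unlucky split
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```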

In [366]:
X_train = temp.drop('target',axis=1)
y_train = temp.target

In [367]:
parameters = {'learning_rate':[0.01, 0.005], 'n_estimators':[100,300,500]}
clf = GridSearchCV(GradientBoostingClassifier(random_state=42), parameters)
clf.fit(X_train, y_train)
print(clf.best_estimator_)
print(clf.best_score_)

GradientBoostingClassifier(learning_rate=0.01, n_estimators=500,
                           random_state=42)
0.8328769821727569


In [368]:
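# Note: max_depth=-1 is not a valid value for RandomForestClassifier (None means no depth limit).
# Those candidates fail to fit, which likely explains the traceback printed below.
parameters = {'max_depth':[-1,5,10], 'n_estimators':[100,300,500]}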
clf = GridSearchCV(RandomForestClassifier(random_state=42), parameters)
clf.fit(X_train, y_train)
print(clf.best_estimator_)
print(clf.best_score_)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py", line 392, in fit
    for i, t in enumerate(trees))
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()


RandomForestClassifier(max_depth=10, n_estimators=500, random_state=42)
0.8483305426967398
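
The traceback above most likely comes from the `max_depth=-1` candidates: scikit-learn's RandomForestClassifier expects either None (no depth limit) or a positive integer. The sketch below shows a grid that uses None instead, on synthetic stand-in data and with a smaller grid than the notebook's, so the printed numbers are not comparable to the result above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# None lets each tree grow without a depth limit; it replaces the invalid -1
param_grid = {"max_depth": [None, 5, 10], "n_estimators": [50, 100]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With every candidate valid, no fit fails and every row of `search.cv_results_` carries a real score.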


In [369]:
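# Note: the max_depth=-1 candidates appear to fail here as well (see the traceback below);
# XGBoost expects a non-negative max_depth.
parameters = {'max_depth':[-1,5,10], 'learning_rate':[0.01, 0.005], 'n_estimators':[100,300,500]}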
clf = GridSearchCV(XGBRFClassifier(random_state=42), parameters)
clf.fit(X_train, y_train)
print(clf.best_estimator_)
print(clf.best_score_)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/xgboost/core.py", line 506, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/xgboost/sklearn.py", line 1462, in fit
    super().fit(**args)
  File "/opt/conda/lib/python3.7/site-packages/xgboost/core.py", line 506, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/xgboost/sklearn.py", line 1261, in fit
    callbacks=callbacks,
  File "/opt/conda/lib/python3.7/site-packages/xgboost/training.py", line 196, in train
    early_stopping_rounds=early_stopping_rounds)
  File "/opt/conda/lib/python3.7/site-packages/xgboost/training.py", line 81, in _train_internal
    bst.update(dtrain, i, obj)
  File "/opt/conda/lib/python3.7/site-packages/xgboost/core.py", line 1682, in update
    dtrai

XGBRFClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                colsample_bytree=1, enable_categorical=False, gamma=0,
                gpu_id=-1, importance_type=None, interaction_constraints='',
                learning_rate=0.01, max_delta_step=0, max_depth=10,
                min_child_weight=1, missing=nan, monotone_constraints='()',
                n_estimators=100, n_jobs=4, num_parallel_tree=100,
                objective='binary:logistic', predictor='auto', random_state=42,
                reg_alpha=0, scale_pos_weight=1, tree_method='exact',
                validate_parameters=1, verbosity=None)
0.841307987786861


RandomForestClassifier(max_depth=10, n_estimators=500, random_state=42) achieved the highest cross-validation score, 0.8483305426967398, so it is selected as the final model.

In [370]:
clf = RandomForestClassifier(max_depth=10, n_estimators=500, random_state=42)
clf.fit(X_train, y_train)
pred = y_test.copy()
pred.iloc[:,1] = clf.predict(X_test)

In [371]:
print('Final score:', accuracy_score(y_test.iloc[:,1], pred.iloc[:,1]))

Final score: 0.770949720670391


In [372]:
# pred.to_csv('titanic.csv', index = False)

## Kaggle code

## Load libraries and data

In [373]:
# Exam environment setup (do not modify this code)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def exam_data_load(df, target, id_name="", null_name=""):
    if id_name == "":
        df = df.reset_index().rename(columns={"index": "id"})
        id_name = 'id'
    else:
        id_name = id_name
    
    if null_name != "":
        df[df == null_name] = np.nan
    
    X_train, X_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=2021)
    y_train = X_train[[id_name, target]]
    X_train = X_train.drop(columns=[id_name, target])
    y_test = X_test[[id_name, target]]
    X_test = X_test.drop(columns=[id_name, target])
    return X_train, X_test, y_train, y_test 
    
df = pd.read_csv("../input/titanic/train.csv")
X_train, X_test, y_train, y_test = exam_data_load(df, 
                                                  target='Survived', 
                                                  id_name='PassengerId')

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 10), (179, 10), (712, 2), (179, 2))

In [374]:
# Load libraries
import pandas as pd

In [375]:
# Load data (omitted)
X_train.shape, y_train.shape, X_test.shape

((712, 10), (712, 2), (179, 10))

## EDA

In [376]:
X_train.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
90,3,"Christmann, Mr. Emil",male,29.0,0,0,343276,8.05,,S
103,3,"Johansson, Mr. Gustaf Joel",male,33.0,0,0,7540,8.6542,,S
577,1,"Silvey, Mrs. William Baird (Alice Munger)",female,39.0,1,0,13507,55.9,E44,S
215,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C
191,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S


In [377]:
# float64(2), int64(3), object(5)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 90 to 116
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    712 non-null    int64  
 1   Name      712 non-null    object 
 2   Sex       712 non-null    object 
 3   Age       575 non-null    float64
 4   SibSp     712 non-null    int64  
 5   Parch     712 non-null    int64  
 6   Ticket    712 non-null    object 
 7   Fare      712 non-null    float64
 8   Cabin     170 non-null    object 
 9   Embarked  711 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 61.2+ KB


In [378]:
y_train.head()

Unnamed: 0,PassengerId,Survived
90,91,0
103,104,0
577,578,1
215,216,1
191,192,0


In [379]:
# Survival counts (0 = died, 1 = survived)
y_train['Survived'].value_counts()

0    441
1    271
Name: Survived, dtype: int64

## Data preprocessing

In [380]:
y = y_train["Survived"]

# Only Sex is one-hot encoded; the other features are already numeric
features = ["Pclass", "Sex", "SibSp", "Parch"]
X_train = pd.get_dummies(X_train[features])
X_test = pd.get_dummies(X_test[features])

In [381]:
X_train.shape, X_test.shape

((712, 5), (179, 5))

In [382]:
X_train.head()

Unnamed: 0,Pclass,SibSp,Parch,Sex_female,Sex_male
90,3,0,0,0,1
103,3,0,0,0,1
577,1,1,0,1,0
215,1,1,0,1,0
191,2,0,0,0,1
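
The table above shows why only Sex was expanded: when called on a whole DataFrame, `pd.get_dummies` leaves numeric columns untouched and encodes only the text columns. A minimal sketch on a made-up frame with the same four feature names:

```python
import pandas as pd

# Toy frame with the same four feature columns (values are made up)
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "male"],
    "SibSp": [0, 1, 0],
    "Parch": [0, 0, 2],
})

encoded = pd.get_dummies(toy)

# Numeric columns pass through; the text column Sex is expanded into indicators
print(list(encoded.columns))
```

Note that Pclass is therefore treated as an ordinary number here, not as a category.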


## Model and evaluation

In [383]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=2021)
model.fit(X_train, y)
predictions = model.predict(X_test)

In [384]:
model.score(X_train, y)

0.8356741573033708
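
The score above is measured on the same rows the model was trained on, so it tends to be optimistic. Holding out part of the training data gives a more realistic estimate before submitting. The sketch below reuses the notebook's model settings on synthetic stand-in data with some label noise; the data and the `_demo` variable names are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (not the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, flip_y=0.2, random_state=2021
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2021
)

model_demo = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=2021)
model_demo.fit(X_tr, y_tr)

train_acc = model_demo.score(X_tr, y_tr)   # accuracy on data the model has seen
val_acc = model_demo.score(X_val, y_val)   # accuracy on held-out data
print(f"train={train_acc:.3f}  validation={val_acc:.3f}")
```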

In [385]:
output = pd.DataFrame({'PassengerId': y_test.PassengerId, 'Survived': predictions})
output.head()

Unnamed: 0,PassengerId,Survived
210,211,0
876,877,0
666,667,0
819,820,0
736,737,0


In [386]:
# Write the output to {exam number}.csv
output.to_csv('1234567.csv', index=False)

## Scoring the result (this part is not visible to the examinee)

In [387]:
model.score(X_test, y_test['Survived'])

0.7318435754189944