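## Q. Titanic analysis dataset for developing a Titanic survivor prediction model

#### Titanic data preprocessing
- Analysis data: titanic3.csv
- Write a reusable user-defined preprocessing function and use it to preprocess the data
    - Null handling: replace missing Age values with the mean age, and missing values in the other columns with 'N'
    - Drop unnecessary attribute columns
    - Label-encode string columns
- Derive insights through statistical and visual exploration
- Feature engineering and derived variables based on exploratory analysis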
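
The three preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's actual function: the name `preprocess`, the default `drop_cols`, and the choice to apply the 'N' fill only to non-numeric columns are all assumptions, and pandas category codes stand in for sklearn's `LabelEncoder` (both assign integers in sorted label order).

```python
import pandas as pd

def preprocess(df, drop_cols=("name", "ticket")):
    """Fill nulls, drop unneeded columns, and label-encode string columns."""
    out = df.copy()
    # Age: fill missing values with the mean age
    out["age"] = out["age"].fillna(out["age"].mean())
    # Non-numeric columns: fill missing values with 'N' (assumption: numeric columns are left alone)
    str_cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    out[str_cols] = out[str_cols].fillna("N")
    # Drop columns judged unnecessary (the default list is hypothetical)
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    # Label-encode the remaining string columns as integer codes
    for col in [c for c in str_cols if c in out.columns]:
        out[col] = out[col].astype("category").cat.codes
    return out

sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "sex": ["female", "male", "male"],
    "age": [20.0, None, 40.0],
    "cabin": ["B5", None, None],
    "ticket": ["1", "2", "3"],
})
result = preprocess(sample)
print(result)
```

Because the function works on a copy and returns it, the same call can be reused on any frame with the same columns.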

#### Column information

- survived : survival status (1 : survived, 0 : died)
- pclass : ticket class (1 : 1st, 2 : 2nd, 3 : 3rd)
- name : passenger name
- sex : passenger sex
- age : passenger age
- sibsp : number of siblings and spouses aboard
- parch : number of parents and children aboard
- ticket : ticket number
- fare : ticket fare
- cabin : cabin number
- embarked : port of embarkation (C : Cherbourg, Q : Queenstown, S : Southampton)
- boat: Lifeboat
- body: Body Identification Number
- home.dest: Home/Destination

In [1]:
import pandas as pd
titanic_df = pd.read_csv('dataset/titanic3.csv')
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [2]:
# One-hot encoding: sex, embarked

onehot_sex = pd.get_dummies(titanic_df['sex'])
onehot_embarked = pd.get_dummies(titanic_df['embarked'], prefix='town')
titanic_df = pd.concat([titanic_df,onehot_sex,onehot_embarked], axis=1)
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,female,male,town_C,town_Q,town_S
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO",True,False,False,False,True
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",False,True,False,False,True
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",True,False,False,False,True
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON",False,True,False,False,True
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",True,False,False,False,True
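
Note that `female` and `male` are exact complements, so keeping both adds no information for a model. A small sketch of one way to avoid the redundant column with `drop_first=True`, shown on a toy frame (the frame and the `sex` prefix are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "embarked": ["S", "C", None],
})

# drop_first=True drops the first level of each variable (female, C),
# leaving one fewer indicator column per variable.
dummies = pd.get_dummies(
    toy,
    columns=["sex", "embarked"],
    prefix={"sex": "sex", "embarked": "town"},
    drop_first=True,
    dtype=int,
)
print(dummies)
```

Rows with a missing `embarked` value get 0 in every `town_` column, which is also how the cell above treats them.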


In [9]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 19 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
 14  female     1309 non-null   bool   
 15  male       1309 non-null   bool   
 16  town_C     1309 non-null   bool   
 17  town_Q     1309 non-null   bool   
 18  town_S     1309 non-null   bool   
dtypes: bool(5), float64(3), int64(4), object(7)
memo

In [84]:
titanic_df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest', 'female',
       'male', 'town_C', 'town_Q', 'town_S'],
      dtype='object')

In [3]:
# Number of passengers sharing the same ticket number
ticket_series = titanic_df['ticket'].value_counts()
ticket_series['CA. 2343']  # spot-check a single ticket

# Map each passenger's ticket to the size of its ticket group
titanic_df['company'] = titanic_df['ticket'].map(ticket_series)
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,female,male,town_C,town_Q,town_S,company
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO",True,False,False,False,True,4
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",False,True,False,False,True,6
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",True,False,False,False,True,6
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON",False,True,False,False,True,6
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",True,False,False,False,True,6


In [4]:
# Create the family variable (siblings/spouses + parents/children)
titanic_df['family'] = titanic_df['sibsp'] + titanic_df['parch']
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,...,boat,body,home.dest,female,male,town_C,town_Q,town_S,company,family
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,...,2.0,,"St Louis, MO",True,False,False,False,True,4,0
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,...,11.0,,"Montreal, PQ / Chesterville, ON",False,True,False,False,True,6,3
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,...,,,"Montreal, PQ / Chesterville, ON",True,False,False,False,True,6,3
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,...,,135.0,"Montreal, PQ / Chesterville, ON",False,True,False,False,True,6,3
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,...,,,"Montreal, PQ / Chesterville, ON",True,False,False,False,True,6,3


In [87]:
titanic_df[['pclass', 'survived','age', 'sibsp', 'parch',
       'fare','female',
       'male', 'town_C', 'town_Q', 'town_S','company','family']].corr()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,female,male,town_C,town_Q,town_S,company,family
pclass,1.0,-0.312469,-0.408106,0.060832,0.018322,-0.558629,-0.124617,0.124617,-0.269658,0.230491,0.096335,-0.078554,0.050027
survived,-0.312469,1.0,-0.055512,-0.027825,0.08266,0.244265,0.528693,-0.528693,0.182123,-0.016071,-0.154558,0.075293,0.026876
age,-0.408106,-0.055512,1.0,-0.243699,-0.150917,0.17874,-0.063645,0.063645,0.085777,-0.019458,-0.075972,-0.185284,-0.240229
sibsp,0.060832,-0.027825,-0.243699,1.0,0.373587,0.160238,0.109609,-0.109609,-0.048396,-0.048678,0.075198,0.679444,0.861952
parch,0.018322,0.08266,-0.150917,0.373587,1.0,0.221539,0.213125,-0.213125,-0.008635,-0.100943,0.073258,0.647029,0.792296
fare,-0.558629,0.244265,0.17874,0.160238,0.221539,1.0,0.185523,-0.185523,0.286269,-0.130059,-0.172683,0.47894,0.226492
female,-0.124617,0.528693,-0.063645,0.109609,0.213125,0.185523,1.0,-1.0,0.066564,0.088651,-0.119504,0.172765,0.188583
male,0.124617,-0.528693,0.063645,-0.109609,-0.213125,-0.185523,-1.0,1.0,-0.066564,-0.088651,0.119504,-0.172765,-0.188583
town_C,-0.269658,0.182123,0.085777,-0.048396,-0.008635,0.286269,0.066564,-0.066564,1.0,-0.164166,-0.775441,0.028193,-0.036553
town_Q,0.230491,-0.016071,-0.019458,-0.048678,-0.100943,-0.130059,0.088651,-0.088651,-0.164166,1.0,-0.489874,-0.114046,-0.08719


In [None]:
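# Null handling: age,
# Feature selection: exclude identifier-like columns (e.g. serial numbers) and redundant ones => family size matters too, but the larger the group tied to one ticket ...
# Exclude body,
# Derived variables, normalization
# Check correlation coefficients with corr.
# Model choice: Logit, since the dependent variable is 0 or 1.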


In [5]:
# Null handling for age: fill with the mean age of passengers who share the same title
# if titanic_df['name'].find('Miss.') >= 0: 'Miss.'
# elif titanic_df['name'].find('Mister.') >= 0: 'Mister.'
# elif titanic_df['name'].find('Master.') >=0: 'Master.'
# elif titanic_df['name'].find('Mrs.') >=0: 'Mrs.'
# elif titanic_df['name'].find('Dr.') >=0: 'Dr.'
def address_type (x):
    prefix = ''
    if x.find('Miss.') >= 0: prefix = 'Miss.'
    elif x.find('Mr.') >= 0: prefix = 'Mr.'
    elif x.find('Master.') >=0: prefix = 'Master.'
    elif x.find('Mrs.') >=0: prefix = 'Mrs.'
    elif x.find('Dr.') >=0: prefix = 'Dr.'
    elif x.find('Sir.') >=0: prefix = 'Sir.'
    elif x.find('Rev.') >=0: prefix = 'Rev.'
    elif x.find('Col.') >=0: prefix = 'Col.'
    elif x.find('Ms.') >=0: prefix = 'Ms.'
    elif x.find('Lady.') >=0: prefix = 'Lady.'
    elif x.find('Mme.') >=0: prefix = 'Mme.'
    elif x.find('Major.') >=0: prefix = 'Major.'
    elif x.find('Capt.') >=0: prefix = 'Capt.'
    elif x.find('Mlle.') >=0: prefix = 'Mlle.'
    elif x.find('Dona.') >=0: prefix = 'Dona.'
    elif x.find('Jonkheer.') >=0: prefix = 'Jonkheer.'
    elif x.find('Countess.') >=0: prefix = 'Countess.'
    elif x.find('Don.') >=0: prefix = 'Don.'
    else: prefix = 'Not mentioned'
    return prefix
        
titanic_df['prefix'] = titanic_df['name'].apply(address_type)
prefix_group = titanic_df.groupby('prefix')
age_mean_series = prefix_group.age.mean()
# age_median_series = prefix_group.age.median()
# age_2interquartile_series = prefix_group.age.describe()['50%']

# Fill each missing age with the mean age of that passenger's title group
age_null = titanic_df[titanic_df['age'].isnull()].index.tolist()
for x in age_null:
    titanic_df.loc[x,'age'] = age_mean_series[titanic_df.loc[x,'prefix']]


titanic_df['age'].isnull().sum()
titanic_df['age'].describe()

count    1309.000000
mean       29.896894
std        13.193803
min         0.170000
25%        21.774238
50%        30.000000
75%        36.000000
max        80.000000
Name: age, dtype: float64
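
The same title-based fill can also be written without an explicit loop. The sketch below is one possible alternative on toy data, not the notebook's code: it assumes names follow the "Surname, Title. Given names" pattern and extracts the title with a regular expression, so the labels come out without the trailing period (for example `Miss` rather than `Miss.`).

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth", "Brown, Mr. John",
             "Clark, Miss. Ann", "Davis, Mr. Tom"],
    "age": [20.0, 40.0, None, None],
})

# Title = text between the comma and the first period
toy["prefix"] = toy["name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mean age per title, broadcast to every row, used only where age is null
title_mean = toy.groupby("prefix")["age"].transform("mean")
toy["age"] = toy["age"].fillna(title_mean)
print(toy)
```

A title whose members all lack an age would still be left as NaN, so checking `isnull().sum()` afterwards remains worthwhile.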

In [6]:
# Fill the missing 'fare' with the mean fare of passengers who embarked at S in 3rd class
titanic_df['fare'].describe()
group = titanic_df.groupby(['embarked','pclass'])
group.describe().fare
titanic_df[titanic_df['fare'].isnull()].index[0]
# Mean fare for embarked == 'S', pclass == 3: 14.435422 (computed instead of hard-coded)
s3_mean = group['fare'].mean().loc[('S', 3)]
titanic_df.loc[titanic_df['fare'].isnull(), 'fare'] = s3_mean
titanic_df['fare'].isnull().sum()

0

In [8]:
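def get_cat(age):
    # Age bands: <5 child, <15 school age, <25 youth, <45 prime age, <65 middle age, else old age
    if age < 5: cat = 1
    elif age < 15: cat = 2
    elif age < 25: cat = 3
    elif age < 45: cat = 4
    elif age < 65: cat = 5
    else: cat = 6
    return cat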

# f = lambda x: ('child' if x <5 else 
#                'school age' if x <15 else 
#                'youth' if x < 25 else 
#                'prime age' if x < 45 else 
#                'middle age' if x < 65 else 
#                'old age')

titanic_df['age_cat'] = titanic_df['age'].apply(get_cat)
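
As an alternative to the chain of comparisons, the age bands can be expressed with `pd.cut`. This is a sketch on a toy Series; the band edges and labels are taken from the commented-out lambda above, and numbering the codes from 1 is an assumption.

```python
import pandas as pd

ages = pd.Series([0.92, 10.0, 22.0, 30.0, 50.0, 80.0])

bins = [0, 5, 15, 25, 45, 65, float("inf")]
labels = ["child", "school age", "youth", "prime age", "middle age", "old age"]

# right=False makes each band [low, high), matching the `age < 5` style checks
age_label = pd.cut(ages, bins=bins, labels=labels, right=False)

# Integer codes starting at 1, one per band
age_code = age_label.cat.codes + 1
print(pd.DataFrame({"age": ages, "label": age_label, "code": age_code}))
```

Keeping the edges in one list makes it easy to change a band without touching the branching logic.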

In [9]:
# normalization with minmax
# 'fare' and 'age'
titanic_df['fare_norm'] = (titanic_df['fare']-titanic_df['fare'].min())/(titanic_df['fare'].max()-titanic_df['fare'].min())
titanic_df['age_norm'] = (titanic_df['age']-titanic_df['age'].min())/(titanic_df['age'].max()-titanic_df['age'].min())
titanic_df['company_norm'] = (titanic_df['company']-titanic_df['company'].min())/(titanic_df['company'].max()-titanic_df['company'].min())
titanic_df['pclass_norm'] = (titanic_df['pclass']-titanic_df['pclass'].min())/(titanic_df['pclass'].max()-titanic_df['pclass'].min())
titanic_df['age_cat_norm'] = (titanic_df['age_cat']-titanic_df['age_cat'].min())/(titanic_df['age_cat'].max()-titanic_df['age_cat'].min())

In [136]:
# normalization with z-score
# # (x - mean(x)) / std(x)
# titanic_df['fare_norm'] = (titanic_df['fare']-titanic_df['fare'].mean())/(titanic_df['fare'].std())
# titanic_df['age_norm'] = (titanic_df['age']-titanic_df['age'].mean())/titanic_df['age'].std()
# titanic_df['company_norm'] = (titanic_df['company']-titanic_df['company'].min())/(titanic_df['company'].max()-titanic_df['company'].min())
# titanic_df['pclass_norm'] = (titanic_df['pclass']-titanic_df['pclass'].min())/(titanic_df['pclass'].max()-titanic_df['pclass'].min())
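
The two scaling schemes in these cells differ only in what is subtracted and what is divided by. A small sketch with hypothetical helper names (`minmax`, `zscore`) makes the contrast explicit and avoids repeating the formula per column:

```python
import pandas as pd

def minmax(s):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    return (s - s.min()) / (s.max() - s.min())

def zscore(s):
    """Standardize: (x - mean) / std, using pandas' sample std (ddof=1)."""
    return (s - s.mean()) / s.std()

fare = pd.Series([0.0, 10.0, 20.0])
print(minmax(fare).tolist())
print(zscore(fare).tolist())
```

One caveat: pandas' `std()` uses the sample standard deviation, while sklearn's `StandardScaler` uses the population one, so the two give slightly different z-scores on small data.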

In [93]:
titanic_df.columns.tolist()

['pclass',
 'survived',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'boat',
 'body',
 'home.dest',
 'female',
 'male',
 'town_C',
 'town_Q',
 'town_S',
 'company',
 'family',
 'prefix',
 'fare_norm',
 'age_cat']

In [13]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Separate the independent and dependent variables
y_t_df = titanic_df['survived'] # dependent variable
X_t_df = titanic_df.drop(['survived','name','sex','age','sibsp','ticket','fare','cabin','embarked','boat','body','home.dest','female','town_C','town_Q','town_S','prefix','family','company','pclass_norm','age_cat','age_cat_norm'], axis = 1) # independent variables

# Standardize the independent variables
# X_t_df = preprocessing.StandardScaler().fit(X_t_df).transform(X_t_df)

# Split into training and test data (8:2 here; 7:3 is also common)
X_train, X_test, y_train, y_test = train_test_split(X_t_df, y_t_df, test_size = 0.2,
                                                   random_state = 11)

# print(X_train.shape)
# print(X_test.shape)
X_t_df.head()

Unnamed: 0,pclass,parch,male,fare_norm,age_norm,company_norm
0,1,0,False,0.412503,0.361142,0.3
1,1,2,True,0.295806,0.009395,0.5
2,1,2,False,0.295806,0.022924,0.5
3,1,2,True,0.295806,0.373669,0.5
4,1,2,False,0.295806,0.311036,0.5


In [24]:
# Train and evaluate the models
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
# round() works whether accuracy_score returns a Python or a NumPy float
accuracy_rf = round(accuracy_score(y_test, rf_pred), 2)

lr_model = LogisticRegression()
lr_model.fit(X_train,y_train)
lr_pred = lr_model.predict(X_test)
accuracy_lr = round(accuracy_score(y_test, lr_pred), 2)

print('rf accuracy:{}, lr accuracy:{}'.format(accuracy_rf,accuracy_lr))

rf accuracy:0.81, lr accuracy:0.83
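
These accuracies come from a single 80/20 split, and `RandomForestClassifier()` is created without a `random_state`, so a 0.02 gap between the two models may fall within run-to-run noise. A rough sketch of k-fold cross-validation follows. It uses synthetic data from `make_classification` as a stand-in for the Titanic features (an assumption for illustration), and places the scaler inside a pipeline so that it is fit on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the six model features (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=6, random_state=11)

# Scaling happens inside each fold, so the test fold never influences min/max
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```

Comparing the mean and spread of the fold scores gives a steadier basis for choosing between models than one split.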


In [11]:
pip install statsmodels

Collecting statsmodels
  Downloading statsmodels-0.14.0-cp311-cp311-win_amd64.whl (9.2 MB)

In [48]:
from statsmodels.discrete.discrete_model import Probit
from sklearn.metrics import accuracy_score
import warnings
import numpy as np
warnings.filterwarnings('ignore')
model = Probit(y_train, X_train.astype(float))
probit_model = model.fit()
print(probit_model.summary())
probit_pred = probit_model.predict(X_test.astype(float))

# Convert predicted probabilities to class labels: 1 if above 0.5, otherwise 0
y_pred = (np.asarray(probit_pred) > 0.5).astype('int64')
y_pred

accuracy_probit = round(accuracy_score(y_test, y_pred), 2)
print('probit accuracy:{}'.format(accuracy_probit))

Optimization terminated successfully.
         Current function value: 0.513399
         Iterations 6
                          Probit Regression Results                           
Dep. Variable:               survived   No. Observations:                 1047
Model:                         Probit   Df Residuals:                     1041
Method:                           MLE   Df Model:                            5
Date:                Thu, 30 Nov 2023   Pseudo R-squ.:                  0.2247
Time:                        12:53:23   Log-Likelihood:                -537.53
converged:                       True   LL-Null:                       -693.36
Covariance Type:            nonrobust   LLR p-value:                 3.125e-65
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
pclass           0.0459      0.037      1.238      0.216      -0.027       0.119
parch            0.0194