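# Reading Books and Code: 파이썬 머신러닝 완벽가이드 (Python Machine Learning Perfect Guide)
 - Ch.6 Titanic survivor prediction with scikit-learn (pp. 131-146)
 - Predict survivors from the Titanic passenger data
 - Goal: 82% prediction accuracy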

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

pd.set_option('display.max_rows', 500) # show up to 500 rows at once
pd.set_option('display.max_columns', 100) # show up to 100 columns at once

In [None]:
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')

## LOAD DATA SET (step.01)

In [None]:
train = pd.read_csv("C:/Users/User/Downloads/data/titanic/train.csv")
print(train.shape)
train.head()

In [None]:
test = pd.read_csv("C:/Users/User/Downloads/data/titanic/test.csv")
print(test.shape)
test.head()

## DATA INFO FIND NULL DATA (step.02)

In [None]:
# DATA INFO
train.info()
# The 'Survived' column records whether each passenger survived.

In [None]:
test.info()

In [None]:
# FIND NULL DATA
train.isnull().sum()
# The 'Age', 'Cabin', and 'Embarked' columns have missing values.

In [None]:
test.isnull().sum()
# The 'Age', 'Fare', and 'Cabin' columns have missing values.

## Explore (step.03)

#### 1. SEX

In [None]:
sns.countplot(data = train, x = 'Sex', hue = 'Survived')

In [None]:
pd.pivot_table(data= train, 
               index = 'Sex', 
               values = 'Survived', 
               aggfunc = [np.mean, np.sum])
# In the train set, about 74% of women (233) and about 19% of men (109) survived.
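
The same kind of summary can also be written with `groupby` and string aggregation names, which avoids passing NumPy functions to `aggfunc` (recent pandas releases may emit a FutureWarning for that). A minimal sketch on a made-up six-row frame, not the Kaggle data; `toy` and its values are assumptions for illustration only:

```python
import pandas as pd

# Made-up rows for illustration only (not the Kaggle train.csv).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Named aggregation: survival rate, survivor count, and group size per Sex.
summary = toy.groupby("Sex")["Survived"].agg(rate="mean", survivors="sum", n="count")
print(summary)
```

Adding the group size (`n`) next to the rate makes it easier to judge how much weight a percentage deserves.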

#### 2. Pclass

In [None]:
sns.countplot(data = train, x = 'Pclass', hue = 'Survived')

In [None]:
pd.pivot_table(data = train, 
               index = 'Pclass', 
               values = 'Survived', 
               aggfunc = [np.mean, np.sum])
# In the train set, about 63% of Pclass 1 (136), 47% of Pclass 2 (87), and 24% of Pclass 3 (119) survived.

#### 3. Embarked

In [None]:
sns.countplot(data = train, x = 'Embarked', hue = 'Survived')

In [None]:
pd.pivot_table(data = train, 
               index = 'Embarked', 
               values = "Survived", 
               aggfunc =[np.sum, np.mean])
# In the train set, about 55% of Embarked C (93), 39% of Q (30), and 34% of S (217) survived.

#### 4. Age&Fare

In [None]:
sns.lmplot(data = train, 
           x = "Age", 
           y = "Fare", 
           hue = 'Survived', 
           fit_reg = False)
# There are outliers with Fare above 500.

In [None]:
fa = sns.FacetGrid(train, hue = 'Survived', aspect = 3)
fa.map(sns.kdeplot, 'Age')
fa.add_legend()
# Both curves are roughly bell-shaped, but children show a higher survival rate.

In [None]:
fa = sns.FacetGrid(train, hue = 'Survived', aspect = 5)
fa.map(sns.kdeplot, 'Fare')
fa.add_legend()

#### 5. SibSp & Parch

In [None]:
train['Family_size'] = train['SibSp'] + train["Parch"] + 1
print(train.shape)
train[['SibSp', 'Parch', 'Family_size']].head()
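# Create a new column for the number of family members aboard, including the passenger.
# Only immediate family (SibSp and Parch) is counted, so passengers traveling with other relatives cannot be identified.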

In [None]:
sns.countplot(data = train, x = 'Family_size', hue = "Survived")

In [None]:
pd.pivot_table(data = train, 
               index = "Family_size", 
               values = 'Survived', 
               aggfunc = [np.mean, np.sum])
# Passengers traveling alone or in families of 5 or more have low survival rates; families of 2 to 4 have high survival rates.

#### 6. Name

In [None]:
#train['Title'] = train['Name'].str.split(',')
train['Title'] = train['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()
print(train.shape)
train[["Name", "Title"]].head()
# Titles in the name, such as Mr and Mrs, hint at marital status, social standing, and so on.
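
The chained `str.split` calls can also be expressed as a single regular expression. The sketch below is an alternative, not the notebook's method: it captures the text between the comma and the first period, using a few sample names in the "Last, Title. First" layout. The variable names `names` and `titles` are assumptions for illustration.

```python
import pandas as pd

# Sample names in the "Last, Title. First" layout of the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture everything between the comma and the first period, then trim spaces.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A name that lacks the comma or the period yields `NaN` with this pattern, so malformed rows are easy to spot with `isnull()`.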

In [None]:
sns.countplot(data = train, x = 'Title', hue = 'Survived')

In [None]:
train['Title'].value_counts()

In [None]:
pd.pivot_table(data = train, 
               index = 'Title', 
               values = 'Survived', 
               aggfunc = ['mean', 'sum', 'count'])

# Hypothesis testing, round 1 (4.12)

- 1. Passengers with a missing Age will have a lower survival rate.
- 2. Survival rate will differ by cabin class and sex.

#### Hypothesis testing, round 1 (4.12) : 1. Passengers with a missing Age will have a lower survival rate -> True

In [None]:
train['Age_null'] = pd.isna(train['Age'])
print(train.shape)
train[['Age_null', 'Age']].head()

In [None]:
sns.countplot(data = train, x = 'Age_null', hue = "Survived")

In [None]:
pd.pivot_table(data = train, 
               index = ["Age_null"], 
               values = 'Survived', 
               aggfunc = [np.mean, np.sum])
# Passengers with a missing Age have a survival rate of 29%, versus about 41% when Age is recorded.

In [None]:
pd.pivot_table(data = train, 
               index = ["Sex", "Pclass", "Age_null"], 
               values = 'Survived', 
               aggfunc = [np.mean, np.sum])
# For women, whether Age is missing has little effect on survival, but for third-class men..

#### Hypothesis testing, round 1 (4.12) : 2. Survival rate will differ by cabin class and sex. -> True

In [None]:
pd.pivot_table(data = train, 
               index = ['Sex', 'Pclass'], 
               values = 'Survived', 
               aggfunc = [np.sum, np.mean])
# Third-class women have a survival rate of 50%; first-class men have a survival rate of about 37%.

In [None]:
# What characterizes the third-class women who died?

In [None]:
pclass3_female = train[(train['Sex'] == 'female') & (train['Pclass'] == 3)]
pd.pivot_table(data = pclass3_female, 
               index = ["Embarked", 'Family_size'], 
               values = 'Survived', 
               aggfunc = [np.mean, np.sum])
# Third-class female passengers who embarked at 'S' have a low survival rate.

# Hypothesis testing, round 2 (4.14)

- 1. Within a Pclass, survival rate will differ by Fare.
- 2. Survival rate will differ by Age.

In [None]:
pd.pivot_table(data = train, 
               index = ['Sex', 'Pclass'], 
               values = 'Survived', 
               aggfunc = [np.sum, np.mean, 'count'])
# Mean survival rate by Pclass within each Sex

In [None]:
train['Fare'].value_counts()
# Ticket prices take 248 distinct values.

In [None]:
figure, (ax1) = plt.subplots(nrows=1, ncols=1)
figure.set_size_inches(32, 8)
sns.countplot(data = train, x = 'Fare', hue = 'Survived', ax = ax1)

In [None]:
# Survival is unusually low at certain Fare values. Why?

In [None]:
fare_pivot = pd.pivot_table(data = train, 
                            index = ['Pclass','Sex','Fare'], 
                            values = 'Survived', 
                            aggfunc = [np.sum, np.mean, 'count'])

fare_pivot.sort_values(by = ('mean', 'Survived'), ascending=False).sort_index()
# Even within the same Pclass and Sex, survival rate splits sharply at certain Fare values.
# Fare probably determined the assigned cabin, and the cabin's location appears to have driven survival.

In [None]:
# Survival rate by Age
age_pivot = pd.pivot_table(data = train[train['Age_null'] == False], 
               index = ['Age'], 
               values = 'Survived', 
               aggfunc = [np.sum, np.mean, 'count'])

age_pivot.sort_values(by = ('mean', 'Survived'), ascending=False).sort_index()
# Overall, passengers aged 6 or under have high survival rates.
# Why do 2-year-old passengers have a low survival rate?

In [None]:
age_pivot = pd.pivot_table(data = train[train['Age_null'] == False], 
               index = ['Family_size','Age'], 
               values = 'Survived', 
               aggfunc = [np.sum, np.mean, 'count'])

age_pivot.sort_values(by = ('mean', 'Survived'), ascending=False).sort_index()
# The answer was family size. Passengers aged 9 or under in families of 4 or fewer have high survival rates.

# Conclusions and preprocessing plan:
1. null data: train 'Age', test 'Age' and 'Fare'
2. 'Family_size' 1 = 'single' / 2~4 = 'small' / 5~ = 'big', one-hot encoding
3. 'Name' - Master T/F
4. 'small_family_baby' = 'Age' 9 or under & 'Family_size' 'small' T/F

5. label encoding
6. one-hot encoding

7. 'Fare' = good_ticket / bad_ticket

## Preprocessing

#### 1. null data

In [None]:
# train 'Age'
train["Age_fillin"] = train["Age"]
train.loc[train['Age'].isnull(), 'Age_fillin'] = train['Age'].mean()
train.loc[train['Age'].isnull(), ['Age', 'Age_fillin']].head()

In [None]:
# test 'Age'
test["Age_fillin"] = test["Age"]
test.loc[test['Age'].isnull(), 'Age_fillin'] = test['Age'].mean()
test.loc[test['Age'].isnull(), ['Age', 'Age_fillin']].head()

In [None]:
# train 'Fare'
train["Fare_fillin"] = train["Fare"]
train.loc[train['Fare'].isnull(), 'Fare_fillin'] = train['Fare'].mean()
train.loc[train['Fare'].isnull(), ['Fare', 'Fare_fillin']].head()

In [None]:
# test 'Fare'
test["Fare_fillin"] = test["Fare"]
test.loc[test['Fare'].isnull(), 'Fare_fillin'] = test['Fare'].mean()
test.loc[test['Fare'].isnull(), ['Fare', 'Fare_fillin']].head()
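
The cells above fill each set with its own mean. A common alternative is to compute the statistic on the train set only and reuse it for the test set, so both sets are filled with the same value. The sketch below illustrates that idea on made-up numbers; `fill_with_train_mean` is a hypothetical helper, not part of the notebook or of pandas.

```python
import numpy as np
import pandas as pd

def fill_with_train_mean(train_df, test_df, column):
    """Hypothetical helper: fill NaNs in both frames with the train mean."""
    train_mean = train_df[column].mean()  # NaNs are skipped by default
    new_col = f"{column}_fillin"
    train_out = train_df.assign(**{new_col: train_df[column].fillna(train_mean)})
    test_out = test_df.assign(**{new_col: test_df[column].fillna(train_mean)})
    return train_out, test_out

# Made-up values for illustration only.
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 10.0]})

toy_train, toy_test = fill_with_train_mean(toy_train, toy_test, "Age")
print(toy_train["Age_fillin"].tolist())
print(toy_test["Age_fillin"].tolist())
```

Whether to use the train mean or each set's own mean is a judgment call; the train-only version keeps test information out of preprocessing.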

#### 2. Family size

In [None]:
train["Family_size"] = train["SibSp"] + train["Parch"] + 1
print(train.shape)
train[["SibSp", "Parch", "Family_size"]].head()

In [None]:
test["Family_size"] = test["SibSp"] + test["Parch"] + 1
print(test.shape)
test[["SibSp", "Parch", "Family_size"]].head()

In [None]:
train.loc[train['Family_size'] == 1, 'Family_size_name'] = 'single'
train.loc[(train['Family_size'] > 1) & (train['Family_size'] < 5), 'Family_size_name'] = 'small'
train.loc[train['Family_size'] > 4, 'Family_size_name'] = 'big'

train[['Family_size', 'Family_size_name']].head()

In [None]:
test.loc[test['Family_size'] == 1, 'Family_size_name'] = 'single'
test.loc[(test['Family_size'] > 1) & (test['Family_size'] < 5), 'Family_size_name'] = 'small'
test.loc[test['Family_size'] > 4, 'Family_size_name'] = 'big'

test[['Family_size', 'Family_size_name']].head()
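
The three `.loc` assignments can also be written as one binning step with `pd.cut`. This is a sketch of an equivalent approach on made-up family sizes, assuming the same boundaries as above (1 is 'single', 2 to 4 is 'small', 5 or more is 'big'):

```python
import pandas as pd

# Made-up family sizes for illustration only.
family_size = pd.Series([1, 2, 4, 5, 11])

# Bins are right-inclusive: (0, 1] -> single, (1, 4] -> small, (4, inf] -> big.
family_size_name = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["single", "small", "big"],
)
print(family_size_name.tolist())
```

Because the bin edges live in one list, applying identical boundaries to train and test only requires reusing that list.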

In [None]:
one_got_train_Family_size_name = pd.get_dummies(train['Family_size_name'], prefix = 'Family_size_name')
one_got_train_Family_size_name.head()

In [None]:
one_got_test_Family_size_name = pd.get_dummies(test['Family_size_name'], prefix = 'Family_size_name')
one_got_test_Family_size_name.head()

#### 3.Name

In [None]:
train['Title'] = train['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()

print(train.shape)
train[['Name', 'Title']].head()

In [None]:
test['Title'] = test['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()

print(test.shape)
test[['Name', 'Title']].head()

In [None]:
train["Master"] = train["Title"].str.contains("Master")
print(train.shape)
train[['Master', 'Name']].head()

In [None]:
test["Master"] = test["Title"].str.contains("Master")
print(test.shape)
test[['Master', 'Name']].head()

#### 4. small_family_baby

In [None]:
# True for passengers aged 9 or under who travel in a 'small' (2 to 4 person) family.
train['small_family_baby'] = (train['Family_size_name'] == 'small') & (train['Age_fillin'] <= 9)
train[['small_family_baby', 'Family_size_name','Age_fillin']].head()

In [None]:
# True for passengers aged 9 or under who travel in a 'small' (2 to 4 person) family.
test['small_family_baby'] = (test['Family_size_name'] == 'small') & (test['Age_fillin'] <= 9)
test[['small_family_baby', 'Family_size_name','Age_fillin']].head()

#### 5. Pclass

In [None]:
one_got_train_Pclass = pd.get_dummies(train['Pclass'], prefix = 'Pclass')
print(one_got_train_Pclass.shape)
one_got_train_Pclass.head()

In [None]:
one_got_test_Pclass = pd.get_dummies(test['Pclass'], prefix = 'Pclass')
print(one_got_test_Pclass.shape)
one_got_test_Pclass.head()

#### 6. Concat

In [None]:
train_concat = pd.concat([train,one_got_train_Pclass, one_got_train_Family_size_name], axis = 1)
print(train_concat.shape)
train_concat.head()

In [None]:
test_concat = pd.concat([test,one_got_test_Pclass, one_got_test_Family_size_name], axis = 1)
print(test_concat.shape)
test_concat.head()
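
Calling `pd.get_dummies` separately on train and test produces matching columns only if both sets contain every category. A hedged sketch of one way to guard against a mismatch is to reindex the test dummies to the train columns. The frames below are made up, and the test frame deliberately has no 'big' family:

```python
import pandas as pd

# Made-up frames: the test set happens to contain no 'big' family.
toy_train = pd.DataFrame({"Family_size_name": ["single", "small", "big"]})
toy_test = pd.DataFrame({"Family_size_name": ["small", "single"]})

train_dummies = pd.get_dummies(toy_train["Family_size_name"], prefix="Family_size_name")
test_dummies = pd.get_dummies(toy_test["Family_size_name"], prefix="Family_size_name")

# Give the test dummies exactly the train columns, filling absent ones with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_dummies.columns.tolist())
```

With the columns aligned, selecting the same feature list from both frames cannot fail with a missing-column error.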

## Feature engineering

In [None]:
train_concat.columns

In [None]:
feature_names = ["Sex",  
                 'small_family_baby', 
                 'Family_size_name_big',
                 'Family_size_name_single', 
                 'Family_size_name_small',
                 'Pclass_1',
                 'Pclass_2', 
                 'Pclass_3',
                 "Master", ]

feature_names

In [None]:
df = train_concat.copy()
df['target'] = df['Survived']
print(df.shape)

In [None]:
label_name = "target"
label_name

In [None]:
train_feature_names = train_concat[feature_names]
test_feature_names = test_concat[feature_names]

In [None]:
dtypes_train = train_feature_names.dtypes
encoders = {}
for column in train_feature_names.columns:
    if str(dtypes_train[column]) == 'object':
        encoder = LabelEncoder()
        encoder.fit(train_feature_names[column])
        encoders[column] = encoder
        
df_train = train_feature_names.copy()        
for column in encoders.keys():
    encoder = encoders[column]
    df_train[column] = encoder.transform(train_feature_names[column])

print(df_train.shape)
df_train.head()

In [None]:
# Reuse the encoders fitted on the train features so both sets share one label mapping.
df_test = test_feature_names.copy()
for column, encoder in encoders.items():
    df_test[column] = encoder.transform(test_feature_names[column])

print(df_test.shape)
df_test.head()

In [None]:
X_train = df_train.copy()
print(X_train.shape)
X_train.head()

In [None]:
X_test = df_test.copy()
print(X_test.shape)
X_test.head()

In [None]:
y_train = df[label_name]
print(y_train.shape)
y_train.head()

## Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=8, random_state=0)
model

In [None]:
model.fit(X_train, y_train)
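
The Kaggle test set has no labels, so progress toward the 82% accuracy goal cannot be measured on it locally. One way to get an estimate is cross-validation on the training data. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; the names `X_demo`, `y_demo`, and `model_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

model_demo = DecisionTreeClassifier(max_depth=8, random_state=0)

# 5-fold cross-validated accuracy; each fold is held out exactly once.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```

In the notebook, passing `X_train` and `y_train` in place of the synthetic arrays would give a comparable estimate for the Titanic features.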

In [None]:
predictions = model.predict(X_test)
print(predictions.shape)
predictions[0:10]

In [None]:
submission = pd.read_csv("C:/Users/User/Downloads/data/titanic/gender_submission.csv", index_col="PassengerId")
print(submission.shape)
submission.head()

In [None]:
submission["Survived"] = predictions

print(submission.shape)
submission.head()

In [None]:
submission.to_csv("C:/Users/User/Downloads/data/titanic/20210418_gender_submission.csv")
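
The submission frame can also be built directly from the test set's `PassengerId` column instead of loading `gender_submission.csv` as a template. A minimal sketch with made-up IDs and predictions; `toy_test`, `toy_predictions`, and `submission_demo` are illustrative names, and the CSV is written to an in-memory buffer rather than to disk:

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for the test frame and the model output.
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894]})
toy_predictions = np.array([0, 1, 0])

submission_demo = pd.DataFrame(
    {"Survived": toy_predictions},
    index=pd.Index(toy_test["PassengerId"], name="PassengerId"),
)

# Write to an in-memory buffer here; pass a file path to save to disk instead.
buffer = io.StringIO()
submission_demo.to_csv(buffer)
print(buffer.getvalue())
```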

In [None]:
# Module imports
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

# Other libraries
import random
import gc
import os


# 0. Import the models you want to use.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. Build a dictionary of candidate models.
models_list = {'DecisionTreeClassifier': DecisionTreeClassifier(),
              'RandomForestClassifier': RandomForestClassifier(),
              'svm':svm.SVC(),
              'SGDClassifier':SGDClassifier(),
              'LogisticRegression':LogisticRegression()}


# 2. Wrap the workflow in a class

class AutoML:
    
    def __init__(self, data, target,test_size, model):
        
        # Model list
        models_list = {'DecisionTreeClassifier': DecisionTreeClassifier(),
              'RandomForestClassifier': RandomForestClassifier(),
              'svm':svm.SVC(),
              'SGDClassifier':SGDClassifier(),
              'LogisticRegression':LogisticRegression()}
        
        self.data = data
        self.target = target
        self.test_size = test_size
        self.model = models_list[model]
        self.results = dict()
        
        # Split into features and target
        X = self.data
        Y = self.target
        
        # Split into train and test data
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X,
                                                           Y,
                                                           test_size = self.test_size,
                                                           random_state = 31)
    
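    def fit(self):
        # Do not assign to self.fit: that would replace this method on the instance.
        self.model.fit(self.X_train, self.y_train)

    def predict(self):
        # Store predictions under a separate name so predict() stays callable.
        self.predictions = self.model.predict(self.X_test)

    def show(self):
        print('accuracy_score:',accuracy_score(self.y_test, self.predictions))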
        
    def kfold(self, nfold):
        self.nfold = nfold
        folds = KFold(n_splits = nfold)
        splits = folds.split(self.X_train, self.y_train)
        columns = self.X_train.columns
        y_preds = np.zeros(self.X_test.shape[0])
        y_oof = np.zeros(self.X_train.shape[0])
        score = 0
        
        
        for fold_n, (trn_idx, val_idx)in enumerate(splits):
            X_trn, X_val = self.X_train[columns].iloc[trn_idx], self.X_train[columns].iloc[val_idx]
            y_trn, y_val = self.y_train.iloc[trn_idx], self.y_train.iloc[val_idx]
            
            # Fit on this fold's training rows only, so X_val stays unseen.
            self.model.fit(X_trn, y_trn)
            
            y_pred_val = self.model.predict(X_val)
            y_pred_val = [int(v >= 0.5) for v in y_pred_val]
            y_oof[val_idx] = y_pred_val
            
            print(f"Fold {fold_n + 1} | F1 Score: {f1_score(y_val, y_pred_val, average='weighted')}")
    
            score += f1_score(y_val, y_pred_val, average='weighted') / self.nfold
            y_preds += self.model.predict(self.X_test) / self.nfold
    
            del X_trn, X_val, y_trn, y_val
            gc.collect()
            
        print(f"\nMean F1 score = {score}")
        
        
    def Coarse_Finer_Search(self):
        n_estimators = 300
        num_epoch = 100
        coarse_hyperparameters_list = []

        for epoch in range(num_epoch):
            max_depth = np.random.randint(low=2, high=100)
            max_features = np.random.uniform(low=0.1, high=1.0)

            model = RandomForestClassifier(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  n_jobs=-1,
                                  random_state=37)
            
            score = cross_val_score(model, 
                                    self.data, self.target, 
                                    cv=20).mean()
    
            # Collect this trial's hyperparameters and score in a dictionary.
            hyperparameters = {'epoch': epoch,
                               'score': score,
                               'n_estimators': n_estimators,
                               'max_depth': max_depth,
                               'max_features': max_features
                              }

            # Append the trial result to the list.
            coarse_hyperparameters_list.append(hyperparameters)

            # Print the trial result.
            print(f"{epoch:2} n_estimators = {n_estimators}, max_depth = {max_depth:2}, max_features = {max_features:.6f}, Score = {score:.5f}")

        # Convert coarse_hyperparameters_list to a pandas DataFrame.
        coarse_hyperparameters_list = pd.DataFrame.from_dict(coarse_hyperparameters_list)

        # Sort so that the highest scores come first.
        coarse_hyperparameters_list = coarse_hyperparameters_list.sort_values(by="score", ascending = False)

        # Print the shape of the result table.
        print(coarse_hyperparameters_list.shape)

        # Return the top 10 rows so the caller can display them.
        return coarse_hyperparameters_list.head(10)
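
In a K-fold loop, each fold's model should see only that fold's training rows; otherwise the validation score is measured on data the model was trained on. The standalone sketch below shows that pattern with a fresh `clone` of the estimator per fold. It runs on synthetic data, and the names `X_demo`, `y_demo`, and `base_model` are assumptions for illustration, not part of the `AutoML` class.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

base_model = RandomForestClassifier(n_estimators=50, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

y_oof = np.zeros(len(y_demo), dtype=int)  # out-of-fold predictions
fold_scores = []

for fold_n, (trn_idx, val_idx) in enumerate(folds.split(X_demo)):
    # A fresh, unfitted copy per fold, trained only on that fold's training rows.
    fold_model = clone(base_model)
    fold_model.fit(X_demo[trn_idx], y_demo[trn_idx])

    y_oof[val_idx] = fold_model.predict(X_demo[val_idx])
    fold_scores.append(f1_score(y_demo[val_idx], y_oof[val_idx], average="weighted"))
    print(f"Fold {fold_n + 1} | F1 Score: {fold_scores[-1]:.4f}")

print(f"Mean F1 score = {np.mean(fold_scores):.4f}")
```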

In [None]:
# Named `automl` so the `test` DataFrame loaded earlier is not overwritten.
automl = AutoML(df_train, df.target, 0.3, 'RandomForestClassifier')

In [None]:
automl.fit()
automl.predict()
automl.show()

In [None]:
automl.kfold(5)

In [None]:
automl.Coarse_Finer_Search()
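
The hand-written random search samples `max_depth` and `max_features` and scores each draw with cross-validation. scikit-learn's `RandomizedSearchCV` packages the same idea; the sketch below swaps it in for the manual loop, using synthetic data and deliberately small settings (`n_iter=5`, `cv=3`, 30 trees) so that it runs quickly. These settings and the `X_demo`/`y_demo` names are assumptions for illustration, not the notebook's values.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=37),
    param_distributions={
        "max_depth": randint(2, 100),       # integers in [2, 100)
        "max_features": uniform(0.1, 0.9),  # floats in [0.1, 1.0]
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

The full trial table is available afterwards as `search.cv_results_`, which can be loaded into a DataFrame and sorted by `mean_test_score`.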