- [First look](#first_look)
    - [(1) Check data for NA](#check_na)
    
- [EDA(Exploratory Data Analysis)](#eda)
    - [(1) Survived](#survived)
    - [(2) Age](#age)
    - [(3) Cabin](#cabin)
    - [(4) Family](#family)
    - [(5) Pclass](#pclass)
    - [(6) Sex](#sex)
    - [(7) Embarked](#embarked)
    - [(8) Fare](#fare)



<a id="eda"></a>
## EDA(Exploratory Data Analysis)
> *references*
> [https://www.kaggle.com/demidova/titanic-eda-tutorial](https://www.kaggle.com/demidova/titanic-eda-tutorial)
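> [https://www.kaggle.com/demidova/titanic-logistic-regression-random-forest-xgboost?scriptVersionId=46567425](https://www.kaggle.com/demidova/titanic-logistic-regression-random-forest-xgboost?scriptVersionId=46567425)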

![titanic](https://ww.namu.la/s/1cc50931b5875401a9465ba06eaaf3d357ebfeabdf50346cd03636ab60ef0a9783a060b2c9cc7808148cd2a075699fa0f094e7f34df4a69fd5fdeb31137a37ef6e3a6c57a0b629606097a954052b7abba6a51a1a32ed5be9a92174b2ada23080602e6277fd7f0a200ddffcbd5b581746)
- [https://namu.wiki/jump/9AGb4mj%2Bgar2D116rRySHULPcuF9aQA9dU1%2FKaQlJabHnX1Bwo7dW3QKZZU5EDX7tyS7%2BeKInzFlBX0PyH2gvmr0xlEeT19AQhYRU4yv8erx25eqVyS5NlWU2pDAk3mhBaO4i%2BaABck5vAWwFaAE0g%3D%3D](https://namu.wiki/jump/9AGb4mj%2Bgar2D116rRySHULPcuF9aQA9dU1%2FKaQlJabHnX1Bwo7dW3QKZZU5EDX7tyS7%2BeKInzFlBX0PyH2gvmr0xlEeT19AQhYRU4yv8erx25eqVyS5NlWU2pDAk3mhBaO4i%2BaABck5vAWwFaAE0g%3D%3D)

<a id='data_import'></a>
### (1) Data Import

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
import os

print("Version Pandas", pd.__version__)
print("Version Matplotlib", matplotlib.__version__)
print("Version Numpy", np.__version__)
print("Version Seaborn", sb.__version__)

os.listdir('../input/tabular-playground-series-apr-2021/')

In [None]:
BASE_DIR = '../input/tabular-playground-series-apr-2021/'
train = pd.read_csv(BASE_DIR + 'train.csv')
test = pd.read_csv(BASE_DIR + 'test.csv')
sample_submission = pd.read_csv(BASE_DIR + 'sample_submission.csv')

train.shape, test.shape, sample_submission.shape

In [None]:
train.head()

In [None]:
test.head()

In [None]:
sample_submission.head()

In [None]:
frames= [train, test]
total_df=pd.concat(frames, sort=False)
print('total data shape: ', total_df.shape)
total_df.head()

In [None]:
total_df.describe(include=[object])

<a id="check_na"></a>
### (2) Check data for NA
- Look through the features of the dataset and check whether any of them contain null data.

> Dependent variable
> - **Survived**: target label (1, 0) -> integer

> Independent variables
> - **PassengerId**: 10000 passengers
> - **Pclass (ticket class)**: Upper(1), Middle(2), Lower(3) -> categorical -> integer
> - **Name**: passenger names
> - **Sex**: Male, Female -> binary -> string
> - **Age**: continuous -> float
> - **SibSp (number of siblings and spouses aboard)**: quantitative -> integer
> - **Parch (number of parents and children aboard)**: quantitative -> integer
> - **Ticket (ticket number)**: alphabet + integer -> string
> - **Fare**: continuous -> float
> - **Cabin (cabin number)**: alphabet + integer -> string
> - **Embarked (port of embarkation)**: C(Cherbourg), Q(Queenstown), S(Southampton) -> string

*references*
- [https://kaggle-kr.tistory.com/17](https://kaggle-kr.tistory.com/17)

In [None]:
total_df.info()

Age and Fare are numeric variables.\
Pclass is stored as an integer but is in fact a categorical variable.

In [None]:
total_df_na=total_df.isna().sum()
train_na=train.isna().sum()
test_na=test.isna().sum()

pd.concat([train_na, test_na, total_df_na], axis=1, sort=False, keys=['Train NA','Test NA','Total NA'])

The train and test sets are combined during EDA so that missing data can be handled in one place, but for ML only the train set will be used, to avoid data leakage.
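
A minimal sketch of what leakage-free imputation could look like at modeling time: the per-class medians are learned from the train rows only and then applied to both sets. The toy frames and the names `toy_train`, `toy_test`, and `class_median` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values)
toy_train = pd.DataFrame({'Pclass': [1, 1, 3, 3, 3],
                          'Age': [40.0, np.nan, 20.0, 30.0, np.nan]})
toy_test = pd.DataFrame({'Pclass': [1, 3],
                         'Age': [np.nan, np.nan]})

# Learn the statistic from the train set only
class_median = toy_train.groupby('Pclass')['Age'].median()

# Apply the train-derived medians to both sets
toy_train['Age'] = toy_train['Age'].fillna(toy_train['Pclass'].map(class_median))
toy_test['Age'] = toy_test['Age'].fillna(toy_test['Pclass'].map(class_median))

print(toy_test)
```

Because `class_median` never sees the test rows, nothing about the test distribution influences the values used for training.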

In [None]:
total_df.describe()

<a id="survived"></a>
### (1) Survived
- Check how the 0/1 values of Survived are distributed in the train set.
- The way the model should be evaluated can depend on this distribution.
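
To make the second point concrete, here is a small sketch on made-up labels: a model that always predicts the majority class already reaches an accuracy equal to the majority share, so plain accuracy is only informative relative to that baseline. The `labels_demo` series is a hypothetical example, not the competition data.

```python
import pandas as pd

# Hypothetical target column: 6 drowned (0), 4 survived (1)
labels_demo = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name='Survived')

# Share of each class, ordered by label
class_share = labels_demo.value_counts(normalize=True).sort_index()

# Accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy = class_share.max()

print(class_share)
print('Majority-class baseline accuracy:', baseline_accuracy)
```

When the baseline is close to 0.5, accuracy is a reasonable metric; the further it drifts from 0.5, the more metrics such as F1 or ROC AUC are worth reporting alongside it.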

In [None]:
plt.figure(figsize=(6, 4.5))

ax= sb.countplot(x='Survived', data=total_df, palette=['#4287f5','#7cd91e'])

plt.xticks(np.arange(2), ['Drowned','Survived'])
plt.title('Overall survival', fontsize=14)
plt.xlabel('Survived vs Drowned')
plt.ylabel('Number of Passengers')

# sort by label (0, 1) so the counts line up with the bars
labels=total_df['Survived'].value_counts().sort_index()

for i,v in enumerate(labels):
    ax.text(i, v-40, str(v), horizontalalignment='center', size=14, color='w', fontweight='bold')
    
plt.show()

In [None]:
total_df['Survived'].value_counts(normalize=True)

<a id="independent_variables"></a>
### (2) Independent Variables 
> *references*
> - [https://wikidocs.net/75068](https://wikidocs.net/75068)

#### 1) Age
Age has 6779 missing values in total:
- 3292 in the train dataset
- 3487 in the test dataset

In [None]:
plt.figure(figsize=(15,3))

sb.distplot(total_df[(total_df['Age']>0)].Age, kde_kws={'lw':3}, bins=50)

plt.title('Distribution of passengers age (total data)', fontsize=14)
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.tight_layout()

In [None]:
age_distr= pd.DataFrame(total_df['Age'].describe())
age_distr.transpose()

Ages range from 0.08 to 87 years, with a mean of 34.46 years.

##### 1-1) Age by surviving status

In [None]:
plt.figure(figsize=(15,3))

sb.boxplot(y='Survived', x='Age', data=train, palette=['#4287f5','#7cd91e'], fliersize=0, orient='h')

sb.stripplot(y='Survived',x='Age', data=train, linewidth=0.6, palette=['#4287f5','#7cd91e'], orient='h')
plt.yticks(np.arange(2), ['Drowned','Survived'])
plt.title('Age distribution grouped by surviving status (train data)', fontsize=14)
plt.ylabel('Passengers status after the tragedy')
plt.tight_layout()

In [None]:
pd.DataFrame(total_df.groupby('Survived')['Age'].describe())

##### 1-2) Age by Pclass

In [None]:
plt.figure(figsize=(20,6))

palette=sb.cubehelix_palette(5, start=3)

plt.subplot(1,2,1)
sb.boxplot(x='Pclass', y='Age', data=total_df, palette=palette, fliersize=0)

plt.xticks(np.arange(3), ['1st class','2nd class','3rd class'])
plt.title('Age distribution grouped by ticket class (total data)', fontsize=16)
plt.xlabel('Ticket class')

plt.subplot(1,2,2)

age_1_class = total_df[(total_df['Age']>0)&(total_df['Pclass']==1)]
age_2_class = total_df[(total_df['Age']>0)&(total_df['Pclass']==2)]
age_3_class = total_df[(total_df['Age']>0)&(total_df['Pclass']==3)]

# Plotting the 3 subsets created above
sb.kdeplot(age_1_class["Age"], shade=True, color='#eed4d0', label = '1st class')
sb.kdeplot(age_2_class["Age"], shade=True,  color='#cda0aa', label = '2nd class')
sb.kdeplot(age_3_class["Age"], shade=True,color='#a2708e', label = '3rd class')
# show the class labels; recent seaborn versions do not draw the legend automatically
plt.legend()
plt.title('Age distribution grouped by ticket class (total data)',fontsize= 16)
plt.xlabel('Age')
plt.xlim(0, 90)
plt.tight_layout()
plt.show()

In [None]:
pd.DataFrame(total_df.groupby('Pclass')['Age'].describe())

The 2nd class has a wider age distribution than the 1st and 3rd classes, and it is nearly symmetric.\
The youngest passenger is 0.08 years old in each of the three classes.\
The oldest passenger, aged 87, is in 2nd class.

3rd class mean age = 30.2 years\
2nd class mean age = 36.9 years\
1st class mean age = 40.7 years

##### 1-3) Age vs Pclass vs Sex

In [None]:
plt.figure(figsize=(20, 5))
palette = "Set3"

plt.subplot(1, 3, 1)
sb.boxplot(x = 'Sex', y = 'Age', data = age_1_class,
     palette = palette, fliersize = 0)
#sb.stripplot(x = 'Sex', y = 'Age', data = age_1_class,linewidth = 0.6, palette = palette)
plt.title('1st class Age distribution by Sex',fontsize= 14)
plt.ylim(-5, 80)

plt.subplot(1, 3, 2)
sb.boxplot(x = 'Sex', y = 'Age', data = age_2_class,
     palette = palette, fliersize = 0)
#sb.stripplot(x = 'Sex', y = 'Age', data = age_2_class,linewidth = 0.6, palette = palette)
plt.title('2nd class Age distribution by Sex',fontsize= 14)
plt.ylim(-5, 80)

plt.subplot(1, 3, 3)
sb.boxplot(x = 'Sex', y = 'Age',  data = age_3_class,
     order = ['female', 'male'], palette = palette, fliersize = 0)
#sb.stripplot(x = 'Sex', y = 'Age', data = age_3_class,order = ['female', 'male'], linewidth = 0.6, palette = palette)
plt.title('3rd class Age distribution by Sex',fontsize= 14)
plt.ylim(-5, 80)

plt.show()

In [None]:
age_1_class_stat = pd.DataFrame(age_1_class.groupby('Sex')['Age'].describe())
age_2_class_stat = pd.DataFrame(age_2_class.groupby('Sex')['Age'].describe())
age_3_class_stat = pd.DataFrame(age_3_class.groupby('Sex')['Age'].describe())

pd.concat([age_1_class_stat, age_2_class_stat, age_3_class_stat], axis=0, sort = False, keys = ['1st', '2nd', '3rd'])

#### 2) Cabin
- Only the first character of the cabin code (the deck letter) is extracted
- A: 1st class
- B
- C: 3rd class
- D: walking area
- E: 1st and 2nd class
- F: 2nd class, 3rd class
- G: boiler room
- T: boat deck
- X: unknown (missing cabin)

In [None]:
# Keep only the deck letter (first character of the cabin code); missing cabins become 'X'
total_df['Cabin']=total_df['Cabin'].str[0].fillna('X')

In [None]:
fig = plt.figure(figsize=(20, 5))

ax1 = fig.add_subplot(131)
sb.countplot(x = 'Cabin', data = total_df, palette = "hls", order = total_df['Cabin'].value_counts().index, ax = ax1)
plt.title('Passengers distribution by Cabin',fontsize= 16)
plt.ylabel('Number of passengers')

ax2 = fig.add_subplot(132)
Cabin_by_class = total_df.groupby('Cabin')['Pclass'].value_counts(normalize = True).unstack()
Cabin_by_class.plot(kind='bar', stacked=True, color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax2)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of classes on each Cabin',fontsize= 16)
plt.xticks(rotation = 0)

ax3 = fig.add_subplot(133)
Cabin_by_survived = total_df.groupby('Cabin')['Survived'].value_counts(normalize = True).unstack()
Cabin_by_survived = Cabin_by_survived.sort_values(by = 1, ascending = False)
Cabin_by_survived.plot(kind='bar', stacked=True, color=["#3f3e6fd1", "#85c6a9"], ax = ax3)
plt.title('Proportion of survived/drowned passengers by Cabin',fontsize= 16)
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
plt.xticks(rotation = 0)
plt.tight_layout()

plt.show()


- Most passengers have no cabin code.
- Among passengers with a cabin code, deck 'C' is the most common, and its passengers hold 1st class tickets. Deck 'C' ranks fourth in survival rate.
- Deck 'F' has the highest survival rate.
- Deck 'A' was the closest to the lifeboats, yet it shows the lowest survival rate.

#### 3) Family
- Family_size = SibSp + Parch + 1

In [None]:
total_df['Family_size']=total_df['SibSp']+total_df['Parch']+1
family_size=total_df['Family_size'].value_counts()
print('Family size and number of passengers:')
print(family_size)

In [None]:
fig = plt.figure(figsize = (12,4))

ax1 = fig.add_subplot(121)
ax = sb.countplot(x = 'Family_size', data = total_df, ax = ax1)

# calculate passengers for each category, in the same (ascending) order as the bars
labels = total_df['Family_size'].value_counts().sort_index()
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+6, str(v), horizontalalignment = 'center', size = 10, color = 'black')
    
plt.title('Passengers distribution by family size')
plt.ylabel('Number of passengers')

ax2 = fig.add_subplot(122)
d = total_df.groupby('Family_size')['Survived'].value_counts(normalize = True).unstack()
d.plot(kind='bar', color=["#3f3e6fd1", "#85c6a9"], stacked=True, ax = ax2)
plt.title('Proportion of survived/drowned passengers by family size (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
plt.xticks(rotation = 0)

plt.tight_layout()

- Nobody in the groups with a family size of 15 survived.
- Most passengers traveled alone, and their survival rate is about 40%.
- Family sizes of 2 and 3 show the highest survival rates.
- The family sizes are split into 4 categories:
    - single
    - usual (sizes 2, 3, 4, 5)
    - big (sizes 6, 7, 8, 9)
    - large (sizes 10 and above)
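
The same grouping can also be sketched with `pd.cut`, which avoids nested conditionals. This is an alternative formulation rather than the notebook's own code; the bin edges are chosen to match the four groups listed above, and `sizes_demo` is a made-up example.

```python
import pandas as pd

# Made-up family sizes covering every group
sizes_demo = pd.Series([1, 2, 5, 6, 9, 10, 15])

# Right-closed bins: (0, 1], (1, 5], (5, 9], (9, inf]
groups_demo = pd.cut(sizes_demo,
                     bins=[0, 1, 5, 9, float('inf')],
                     labels=['f_single', 'f_usual', 'f_big', 'f_large'])

print(groups_demo.tolist())
```

One difference worth noting: `pd.cut` returns an ordered categorical column, whereas mapping with a lambda returns plain strings.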

In [None]:
total_df['Family_size_group']=total_df['Family_size'].map(lambda x: 'f_single' if x ==1
                                                         else('f_usual' if 6>x>=2
                                                             else('f_big' if 10>x>=6
                                                                 else('f_large'))))

In [None]:
fig = plt.figure(figsize = (14,5))

ax1 = fig.add_subplot(121)
d = total_df.groupby('Family_size_group')['Survived'].value_counts(normalize = True).unstack()
d = d.sort_values(by = 1, ascending = False)
d.plot(kind='bar', stacked=True, color = ["#3f3e6fd1", "#85c6a9"], ax = ax1)
plt.title('Proportion of survived/drowned passengers by family size')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=0)


ax2 = fig.add_subplot(122)
d2 = total_df.groupby('Family_size_group')['Pclass'].value_counts(normalize = True).unstack()
d2 = d2.sort_values(by = 1, ascending = False)
d2.plot(kind='bar', stacked=True, color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax2)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of 1st/2nd/3rd ticket class in family group size')
_ = plt.xticks(rotation=0)

plt.tight_layout()

#### 4) Pclass

In [None]:
ax = sb.countplot(x = 'Pclass', data = total_df, palette = ['#eed4d0', '#cda0aa', '#a2708e'])
# calculate passengers for each category, in the same (ascending) order as the bars
labels = total_df['Pclass'].value_counts().sort_index()
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+2, str(v), horizontalalignment = 'center', size = 12, color = 'black', fontweight = 'bold')
    
    
plt.title('Passengers distribution by Pclass')
plt.ylabel('Number of passengers')
plt.tight_layout()

In [None]:
fig = plt.figure(figsize=(14, 5))

ax1 = fig.add_subplot(121)
sb.countplot(x = 'Pclass', hue = 'Survived', data = total_df, palette=["#3f3e6fd1", "#85c6a9"], ax = ax1)
plt.title('Number of survived/drowned passengers by class (train data)')
plt.ylabel('Number of passengers')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=0)

ax2 = fig.add_subplot(122)
d = total_df.groupby('Pclass')['Survived'].value_counts(normalize = True).unstack()
d.plot(kind='bar', stacked=True, ax = ax2, color =["#3f3e6fd1", "#85c6a9"])
plt.title('Proportion of survived/drowned passengers by class (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=0)

plt.tight_layout()

- Although 3rd class carried the most passengers, its survival rate is lower than that of 1st class, which carried the fewest.

##### 4-1) Pclass vs Surviving vs Sex

In [None]:
sb.catplot(x = 'Pclass', hue = 'Survived', col = 'Sex', kind = 'count', data = total_df , palette=["#3f3e6fd1", "#85c6a9"])

plt.tight_layout()

- In 1st class, most male passengers did not survive, while most female passengers did.
- More than half of the women in 3rd class survived.
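
The counts in the plot can be turned into survival rates with a pivot table, which makes statements like the ones above easy to check numerically. This is a sketch on made-up rows; on the real data, `toy` would be replaced by the labeled (train) rows of `total_df`.

```python
import pandas as pd

# Made-up passengers (hypothetical values)
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female'],
    'Survived': [0, 1, 1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate of each (Pclass, Sex) group
survival_rate = toy.pivot_table(index='Pclass', columns='Sex',
                                values='Survived', aggfunc='mean')

print(survival_rate)
```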

#### 5) Embarked

In [None]:
fig = plt.figure(figsize = (15,4))

ax1 = fig.add_subplot(131)
palette = sb.cubehelix_palette(5, start = 2)
ax = sb.countplot(x = 'Embarked', data = total_df, palette = palette, order = ['C', 'Q', 'S'], ax = ax1)
plt.title('Number of passengers by Embarked')
plt.ylabel('Number of passengers')

# calculate passengers for each category
labels = (total_df['Embarked'].value_counts())
labels = labels.sort_index()
# add result numbers on barchart
for i, v in enumerate(labels):
    ax.text(i, v+10, str(v), horizontalalignment = 'center', size = 10, color = 'black')
    

ax2 = fig.add_subplot(132)
surv_by_emb = total_df.groupby('Embarked')['Survived'].value_counts(normalize = True)
surv_by_emb = surv_by_emb.unstack().sort_index()
surv_by_emb.plot(kind='bar', stacked=True, color=["#3f3e6fd1", "#85c6a9"], ax = ax2)
plt.title('Proportion of survived/drowned passengers by Embarked (train data)')
plt.legend(( 'Drowned', 'Survived'), loc=(1.04,0))
_ = plt.xticks(rotation=0)


ax3 = fig.add_subplot(133)
class_by_emb = total_df.groupby('Embarked')['Pclass'].value_counts(normalize = True)
class_by_emb = class_by_emb.unstack().sort_index()
class_by_emb.plot(kind='bar', stacked=True, color = ['#eed4d0', '#cda0aa', '#a2708e'], ax = ax3)
plt.legend(('1st class', '2nd class', '3rd class'), loc=(1.04,0))
plt.title('Proportion of classes by Embarked')
_ = plt.xticks(rotation=0)

plt.tight_layout()

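- Most passengers (140981) embarked at port S, and they had the lowest survival rate. Most of them were also in 3rd class.
- Passengers who embarked at port C show a survival rate above 75%.
- Port Q, where the fewest passengers embarked, has the largest share of 1st class passengers.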

In [None]:
sb.catplot(x="Embarked", y="Fare", kind="violin", inner=None,
            data=total_df, height = 6, palette = palette, order = ['C', 'Q', 'S'])
plt.title('Distribution of Fare by Embarked')
plt.tight_layout()

In [None]:
pd.DataFrame(total_df.groupby('Embarked')['Fare'].describe())

#### 6) Fare

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sb.distplot(total_df['Fare'], color='r', label='Skewness : {:.2f}'.format(total_df['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

In [None]:
fare_map = total_df[['Fare', 'Pclass']].dropna().groupby('Pclass').median().to_dict()
total_df['Fare'] = total_df['Fare'].fillna(total_df['Pclass'].map(fare_map['Fare']))

total_df['Fare'] = total_df['Fare'].map(lambda i: np.log(i) if i > 0 else 0)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sb.distplot(total_df['Fare'], color='b', label='Skewness : {:.2f}'.format(total_df['Fare'].skew()), ax=ax)
g = g.legend(loc='best')
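
The effect of a log transform on skewness can be seen on a small made-up sample. This sketch uses `np.log1p` (log(1 + x)), a common alternative that is defined at zero, whereas the cell above uses `np.log` with an explicit guard for non-positive fares; the `fares_demo` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed fares (hypothetical values)
fares_demo = pd.Series([5, 7, 8, 10, 12, 15, 20, 30, 80, 500], dtype=float)

raw_skew = fares_demo.skew()
log_skew = np.log1p(fares_demo).skew()

print(f'Skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}')
```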

# Feature Engineering

- Check for null values

In [None]:
total_df.isna().sum()

### 1) Data Correlation

In [None]:
fig, ax=plt.subplots(1, 3, figsize=(17,5))
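feature_lst=['Pclass','Age','Fare','Sex','Family_size']

# 'Sex' is still a string column at this point, so keep only the numeric features;
# newer pandas versions raise an error instead of silently dropping non-numeric columns
corr_df=total_df[feature_lst].select_dtypes(include='number')
corr=corr_df.corr()

# mask the upper triangle (np.bool was removed from NumPy, so use the builtin bool)
mask=np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)]=True

for idx, method in enumerate(['pearson','kendall','spearman']):
    sb.heatmap(corr_df.corr(method=method), ax=ax[idx],
              square=True, annot=True, fmt='.2f', center=0, linewidth=2,
              cbar=False, cmap=sb.diverging_palette(240, 10, as_cmap=True),
        mask=mask)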
    ax[idx].set_title(f'{method.capitalize()} Correlation', loc='left', fontweight='bold')
    
plt.show()
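
The three panels differ only in the correlation method. As a small illustration of why the methods can disagree, the sketch below compares Pearson and Spearman on made-up data where `y` grows monotonically but non-linearly with `x`; Kendall, like Spearman, is rank-based and is omitted for brevity. The frame `df_demo` is a hypothetical example.

```python
import pandas as pd

# Made-up data: y is a monotonic but non-linear function of x
df_demo = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
df_demo['y'] = df_demo['x'] ** 3

# Pearson measures linear association; Spearman measures rank (monotonic) association
pearson = df_demo.corr(method='pearson').loc['x', 'y']
spearman = df_demo.corr(method='spearman').loc['x', 'y']

print(f'Pearson: {pearson:.3f}, Spearman: {spearman:.3f}')
```

Spearman reports a perfect association here because the ranks of `x` and `y` agree exactly, while Pearson is lower because the relationship is not a straight line.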

### 2) Age
- Missing ages are filled with the median age of the passenger's ticket class.

In [None]:
age_map= total_df[['Age','Pclass']].dropna().groupby('Pclass').median().to_dict()
total_df['Age']=total_df['Age'].fillna(total_df['Pclass'].map(age_map['Age']))

### 3) Embarked

In [None]:
print('Embarked has ', sum(total_df['Embarked'].isnull()), ' Null values')

In [None]:
total_df['Embarked'] = total_df['Embarked'].fillna('S')

In [None]:
total_df

### 4) Name

In [None]:
total_df['Name'] = total_df['Name'].map(lambda x: x.split(',')[0])

### 5) Ticket

In [None]:
total_df['Ticket'] = total_df['Ticket'].fillna('X').map(lambda x:str(x).split()[0] if len(str(x).split()) > 1 else 'X')

### 6) Drop
- PassengerId, Name, Family_size_group, Family_size

In [None]:
total_df

In [None]:
total_df.drop(['PassengerId','Name','Family_size_group','Family_size'], axis=1, inplace=True)
total_df.shape

In [None]:
total_df['Sex']=total_df['Sex'].map({'female':0, 'male':1})
total_df=pd.get_dummies(total_df, columns=['Embarked'], prefix='Embarked')
total_df=pd.get_dummies(total_df, columns=['Cabin'], prefix='Cabin')
total_df=pd.get_dummies(total_df, columns=['Ticket'], prefix='Ticket')
#total_df=pd.get_dummies(total_df, columns=['Family_size_group'], prefix='Family_size_group')

In [None]:
total_df

## Split data

In [None]:
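# The train rows come first in total_df; copy the slice so the in-place drop
# below does not trigger a SettingWithCopyWarning
X = total_df.iloc[:train.shape[0]].copy()
print("X Shape is:", X.shape)
y = X['Survived']
X.drop(['Survived'], axis=1, inplace=True)
test_data = total_df.iloc[train.shape[0]:].drop(columns=['Survived'])
test_data.info()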

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, random_state=42)
#X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, stratify = X[['Pclass']], random_state=42)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

# Scikit Learn

## 1. Modeling
> *references*
> - [https://www.kaggle.com/j2hoon85/tps-april-sklearn-pycaret-for-newbies](https://www.kaggle.com/j2hoon85/tps-april-sklearn-pycaret-for-newbies)
> - [https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking](https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking)

### 1-1. Hyper Parameter Tuning - Bayesian Optimization
> *references*
> - [https://www.kaggle.com/elon4773/titanic-visualization-bayesian-optimization](https://www.kaggle.com/elon4773/titanic-visualization-bayesian-optimization)

In [None]:
!pip install catboost

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score,roc_auc_score, f1_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn import model_selection

In [None]:
gbc_params = {
    'max_depth': int(round(7.18465375161774)),
    'max_features': 0.4861929134696539,
    'min_samples_leaf': int(round(113.13022692803058)),
    'min_samples_split': int(round(8.386778166939953))
}

lgbm_params = {
    'colsample_bytree': 0.6283725788215941,
    'max_bin': int(round(15.826197551963968)),
    'max_depth': int(round(39.32209311790955)),
    'min_child_weight': 44.95339851660889,
    'min_split_gain': 0.04358718365142237,
    'num_leaves': int(round(24.715504910160405)),
    'reg_alpha': 0.4127198530404361,
    'reg_lambda': 0.0006949333245371281,
    'subsample': 0.7192205961769677,
    'subsample_freq': int(round(13.984681107001574))
}

cb_params = {
    'bagging_temperature': 534.445170361156,
    'border_count': int(round(230.32755580650806)),
    'depth': int(round(5.969930611242375)),
    'learning_rate': 0.01966964700090523,
    # integer-valued parameter, rounded like the other integer parameters
    'min_data_in_leaf': int(round(2.208728103621775))
}

xgb_params = {
    'colsample_bynode': 0.2816652230511576,
    'colsample_bytree': 0.6123746062455153,
    'learning_rate': 0.04706823512500192,
    'max_bin': int(round(118.8222831831757)),
    'max_depth': int(round(6.3341943151448135)),
    'min_child_weight': 25.890720015058704,
    'subsample': 0.9115493169826735
}

In [None]:
gbc = GradientBoostingClassifier(**gbc_params)
lgbm = LGBMClassifier(**lgbm_params)
cb = CatBoostClassifier(**cb_params)
xgb = XGBClassifier(**xgb_params)

mlr= LogisticRegression()

### 1-2. Blending Model
> *references*
> - [https://www.kaggle.com/eraaz1/a-comprehensive-guide-to-titanic-machine-learning](https://www.kaggle.com/eraaz1/a-comprehensive-guide-to-titanic-machine-learning)

In [None]:
from mlens.ensemble import BlendEnsemble

In [None]:
X_train.info()

In [None]:
%%time
blend= BlendEnsemble(n_jobs=-1, test_size=0.5, random_state=17)
baseModels=[lgbm, gbc, cb]
blend.add(baseModels)

blend.add_meta(mlr)
print("Fitting Blending ...")
display(blend.fit(X_train, y_train))
print("done.")
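
As a rough illustration of the idea behind a blend ensemble, the sketch below writes the two layers out by hand with scikit-learn on synthetic data. It is not the internals of `mlens`: the choice of base models, the use of predicted probabilities as meta-features, and all `*_demo` names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Blending: base models learn on one half, the meta-learner on the other half
X_base, X_meta, y_base, y_meta = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=17)

base_models = [GradientBoostingClassifier(random_state=0),
               RandomForestClassifier(n_estimators=50, random_state=0)]
for model in base_models:
    model.fit(X_base, y_base)

def base_predictions(models, X):
    """Stack each base model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# The meta-learner sees only the base models' predictions on held-out rows
meta_model = LogisticRegression()
meta_model.fit(base_predictions(base_models, X_meta), y_meta)

# Prediction chains both layers
blend_pred = meta_model.predict(base_predictions(base_models, X_demo))
print('Blend predictions shape:', blend_pred.shape)
```

Holding out part of the data for the meta-learner keeps it from being trained on predictions the base models made for rows they had already seen.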

In [None]:
#test= test.drop('Name', axis=1)
#test.info()

In [None]:
pred= blend.predict(test_data).astype(int)
pred_df=pd.DataFrame(pred)
pred_df[0]

### File submission - Scikit Learn

In [None]:
sklearn_submission = pd.read_csv(BASE_DIR + 'sample_submission.csv')
sklearn_submission['Survived']= pred_df[0]
sklearn_submission

In [None]:
sklearn_submission.to_csv('Scikit Learn Submission.csv', index=False)
sklearn_submission.head()