# **Titanic Tutorial for Beginners**

<h4>Thank you for visiting my notebook :)</h4>
<h4>This notebook walks beginners through how to get started with a competition!!</h4>

<div class="list-group" id="list-tab" role="tablist">
<h2 style='color:white; background:#707C4F; border:0' role="tab" aria-controls="home"><center>Contents</center></h2>
    
* **Import Library**
    
* **Load Data**
    
* **EDA & Preprocessing**
    
* **Modeling**
    
* **Evaluation**
    
* **Submission**

# **Import Library**


<h4>In the Kaggle notebook environment, most of the libraries you want to use are already installed and can be imported directly</h4>

* pandas → Python Data Analysis Library

* numpy → Numerical computing library for array operations such as vector and matrix arithmetic

* matplotlib & seaborn → Visualization Library

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split, StratifiedKFold
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

In [None]:
# You don't have to run this code!
# It's just for clean visualization :)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

palette = sns.color_palette("bright")
sns.set_palette("Paired")

# **Load Data**

<h4>With the 'read_csv()' function in pandas, you can easily read a .csv file</h4>

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
sub = pd.read_csv('../input/titanic/gender_submission.csv')

all_data = pd.concat([train, test]).reset_index(drop = True)
all_data
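A small sketch of why `reset_index(drop = True)` is used after `pd.concat`, with two tiny made-up frames (`toy_train` and `toy_test` are illustrative names, not part of the dataset):

```python
import pandas as pd

# Made-up stand-ins for train and test (the test frame has no Survived column)
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
toy_test = pd.DataFrame({'PassengerId': [3]})

stacked = pd.concat([toy_train, toy_test])
print(stacked.index.tolist())            # row labels repeat: [0, 1, 0]

toy_all = stacked.reset_index(drop=True)
print(toy_all.index.tolist())            # unique again: [0, 1, 2]
print(toy_all['Survived'].isna().sum())  # the test row has no label: 1
```

Without the reset, label-based selection such as `loc[0]` would return two rows, which makes later edits by index ambiguous.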

*****

# **EDA & Preprocessing**

<h4>EDA stands for Exploratory Data Analysis!</h4>

<h4>You can use 'Matplotlib & Seaborn' for basic EDA :)</h4>

<h4>With visualization, we can see how the features are distributed in the train and test data</h4>

#### **Based on what we learn from the EDA, we can decide how to preprocess the data**

* ### **Countplot**

#### Since this is a binary classification competition, you can check the balance of the target column using **countplot.**

<h4>The graph below shows that the target column (Survived) is imbalanced.</h4>

<h4>Sadly, more passengers died than survived.</h4>

In [None]:
# Pass the column as x= so it also works on newer seaborn versions
sns.countplot(x=train['Survived']);
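If you want numbers next to the plot, `value_counts()` gives the same information as a table. A minimal sketch on a made-up target column (the values are illustrative, not the real counts):

```python
import pandas as pd

# Made-up target column: 3 non-survivors, 2 survivors
toy_target = pd.Series([0, 0, 0, 1, 1], name='Survived')

counts = toy_target.value_counts()                # rows per class
shares = toy_target.value_counts(normalize=True)  # fraction of rows per class
print(counts)
print(shares)
```

On the real data you would call the same two methods on `train['Survived']`.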

#### **We can also check the distribution of categorical columns!**

#### **You can use the 'nunique()' function to find columns with few distinct values, which are likely to be categorical**

In [None]:
all_data.nunique().sort_values()

<h4>In the output above, we can see that the top 4 columns are categorical.</h4>

#### SibSp & Parch have few distinct values too, but they are counts, not categories!! You can visit [here](https://www.kaggle.com/c/titanic/data) to see detailed explanations of the columns
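A minimal sketch of the `nunique()` check on a made-up frame. The cut-off of 3 distinct values is an assumption chosen for this toy example, not a general rule:

```python
import pandas as pd

# Made-up rows that mimic three Titanic columns
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 3, 2],
    'Fare': [7.25, 71.28, 7.92, 13.0],
})

n_unique = toy.nunique().sort_values()
print(n_unique)  # Sex 2, Pclass 3, Fare 4

# Assumed cut-off: treat columns with at most 3 distinct values as candidates
low_cardinality = n_unique[n_unique <= 3].index.tolist()
print(low_cardinality)  # ['Sex', 'Pclass']
```

A low count only makes a column a candidate. You still need to read the column description, as with SibSp and Parch.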

#### So! Let's check the distribution of those 4 columns using **Countplot**

In [None]:
fig, ax = plt.subplots(1, 4, figsize = (12, 4)) # Making Subplots

# Pass each column as x= so the calls also work on newer seaborn versions
sns.countplot(x=train['Survived'], ax=ax[0]);
sns.countplot(x=all_data['Sex'], ax=ax[1]);
sns.countplot(x=all_data['Pclass'], ax=ax[2]);
sns.countplot(x=all_data['Embarked'], ax=ax[3]);

plt.tight_layout() # keeps the subplots from overlapping
plt.show()

#### **We checked the distribution of categorical columns!**

#### **Now, aren't you curious about gender and survival rate?**

#### Using the **groupby()** function, we can see how one column relates to another!

In [None]:
sex_survived_rate = all_data.groupby('Sex')['Survived'].mean()
sex_survived_rate

#### From the graph below, we can see that despite the large number of male passengers,
#### the survival rate of male passengers is significantly lower than that of female passengers.

#### Maybe this is thanks to the captain's leadership

In [None]:
sex_survived_rate.plot(kind = 'bar');

#### You can practice with the Pclass and Embarked columns!!
#### Do it yourself :)
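As a starting point for that practice, here is the same groupby pattern on a made-up frame (the rows are invented, so the rates are not the real ones):

```python
import pandas as pd

# Made-up passengers: two in 1st class, two in 2nd, four in 3rd
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
pclass_rate = toy.groupby('Pclass')['Survived'].mean()
print(pclass_rate)  # 1 -> 1.0, 2 -> 0.5, 3 -> 0.25
```

Swapping `'Pclass'` for `'Embarked'` on the real data gives the rate per port.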

*****

* ### **distplot**

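#### With distplot, you can check the distribution of numeric columns! (distplot is deprecated since seaborn 0.11; on newer versions, histplot with kde=True gives a similar plot)

#### The graphs below show the distribution of Age and Fare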

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (10, 5))

sns.distplot(all_data['Age'], ax=ax[0], color='y');
sns.distplot(all_data['Fare'], ax=ax[1], color='violet');

plt.tight_layout()
plt.show()

#### The Fare column looks skewed!!

#### We may need to rescale it! (e.g. StandardScaler, RobustScaler, log scaling)

#### With a suitable scaler (RobustScaler or log scaling in particular), our model will be less affected by outliers. Check the plots below :)

#### You can apply a scaler with its 'fit_transform()' method

#### **!! fit_transform() expects a 2D array, so reshape a single column using reshape(-1, 1) !!**

In [None]:
ss = StandardScaler()
rb = RobustScaler()
mm = MinMaxScaler()

fare_standard = ss.fit_transform(all_data['Fare'].values.reshape(-1, 1))
fare_robust = rb.fit_transform(all_data['Fare'].values.reshape(-1, 1))
fare_minmax = mm.fit_transform(all_data['Fare'].values.reshape(-1, 1))

fig, ax = plt.subplots(1, 4, figsize=(15, 5))

sns.distplot(fare_standard, color='violet', ax = ax[0]).set_title('StandardScaler');
sns.distplot(fare_robust, color='y', ax = ax[1]).set_title('RobustScaler');
sns.distplot(fare_minmax, color='r', ax = ax[2]).set_title('MinMaxScaler');
sns.distplot(np.log1p(train['Fare']), color='b', ax = ax[3]).set_title('LogScaling');

plt.tight_layout()
plt.show()
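A minimal sketch of the `reshape(-1, 1)` step and of how two of the scalers treat an outlier. The fares are made up, with one extreme value on purpose:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

fare = np.array([7.0, 8.0, 9.0, 10.0, 500.0])  # made-up fares, one outlier
fare_2d = fare.reshape(-1, 1)                  # one column, as many rows as needed
print(fare.shape, fare_2d.shape)               # (5,) (5, 1)

# StandardScaler: subtract the mean, divide by the standard deviation
standard = StandardScaler().fit_transform(fare_2d).ravel()
# RobustScaler: subtract the median, divide by the interquartile range
robust = RobustScaler().fit_transform(fare_2d).ravel()

print(standard.round(2))
print(robust.round(2))  # the four ordinary fares land at -1, -0.5, 0, 0.5
```

Because the median and interquartile range barely move when one value is extreme, RobustScaler keeps the ordinary fares evenly spread, while the mean and standard deviation used by StandardScaler are pulled by the outlier.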

#### You can try all four scalers above and keep the one that gives the best model performance

#### **Log Scaling** seems attractive → because of the **skewness**

#### In this notebook, we will use **Log Scaling** :)
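To put a number on the skewness, pandas offers `Series.skew()`. A small sketch on made-up fares (not the real column) showing how `np.log1p` reduces the skew and how `np.expm1` undoes it:

```python
import numpy as np
import pandas as pd

# Made-up fares with one very expensive ticket
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])

logged = np.log1p(fare)       # log(1 + x), safe for a fare of 0
print(fare.skew())            # skew of the raw values
print(logged.skew())          # smaller after the log transform

restored = np.expm1(logged)   # inverse of log1p
```

`log1p` is used instead of a plain log so that a fare of 0 maps to 0 rather than negative infinity.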

*****

*  ## **Now, we need to check Missing Values**

### You can use 'isna().sum()' to see how many missing values each column has

In [None]:
all_data.isna().sum()

#### We can make good features using missing values!

* **The number of missing values**
* **One-Hot-Encoding - Missing values Y/N**

#### Let's make a feature of **the number of missing values**

* **Excluding the PassengerId and Survived columns**

In [None]:
all_data['missing_counts'] = all_data[all_data.columns[2:]].isna().sum(axis = 1)
all_data
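The difference between counting per column and counting per row is only the `axis` argument. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0],
    'Cabin': [np.nan, np.nan, 'C85'],
    'Fare':  [7.25, 71.28, 7.92],
})

per_column = toy.isna().sum()        # default axis=0: one count per column
per_row = toy.isna().sum(axis=1)     # axis=1: one count per passenger
print(per_column)  # Age 1, Cabin 2, Fare 0
print(per_row)     # 1, 2, 0
```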

#### Next, let's make features that flag whether each value is missing

In [None]:
miss_one_hot = all_data[['Age', 'Cabin', 'Fare', 'Embarked']].isna()
miss_one_hot.columns = ['Age_miss', 'Cabin_miss', 'Fare_miss', 'Embarked_miss']
miss_one_hot

#### You must concat those dataframes using axis = 1, so the new columns are added side by side!!

In [None]:
all_data = pd.concat([all_data, miss_one_hot], axis = 1)
all_data
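A compact variant of the same idea on a made-up frame, using `add_suffix` to name the flag columns. Converting the flags to 0/1 with `astype(int)` is an assumption made here for readability; the notebook keeps them as booleans:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Age': [22.0, np.nan], 'Cabin': [np.nan, 'C85']})

# True/False per cell, renamed to <column>_miss and stored as 0/1
flags = toy.isna().add_suffix('_miss').astype(int)

# axis=1 places the flag columns next to the original ones
toy_flagged = pd.concat([toy, flags], axis=1)
print(toy_flagged.columns.tolist())  # ['Age', 'Cabin', 'Age_miss', 'Cabin_miss']
```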

#### **EDA is a really important technique in data science**

#### **It lets us extract many features, which play an important role in improving model performance**

#### **Then, how can we handle those missing values?**

#### Some options are..

* **Fill the numeric columns with -1 and the categorical columns with the string 'nan'**

* **Fill the numeric columns with each column's mean value**

* **Predict the missing values using ML models** → **KNN Imputer**

#### In this notebook, I'll use KNNImputer for the numeric columns and fill the categorical columns with 'nan'

#### KNNImputer only works on numeric data, so the categorical columns are first converted to integers with LabelEncoder

In [None]:
all_data['Cabin'] = all_data['Cabin'].fillna('nan')
all_data['Embarked'] = all_data['Embarked'].fillna('nan')

le = LabelEncoder()

all_data['Cabin'] = le.fit_transform(all_data['Cabin'])
all_data['Sex'] = le.fit_transform(all_data['Sex'])
all_data['Embarked'] = le.fit_transform(all_data['Embarked'])
all_data
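A minimal sketch of what LabelEncoder does to one column, using made-up Embarked values. The separate encoder name `le_embarked` is an illustrative choice: calling `fit_transform` again on a shared encoder refits it, so it only remembers the classes of the last column it saw.

```python
from sklearn.preprocessing import LabelEncoder

embarked = ['S', 'C', 'nan', 'Q', 'S']  # made-up values, 'nan' is the filled string

le_embarked = LabelEncoder()
codes = le_embarked.fit_transform(embarked)

print(list(le_embarked.classes_))  # sorted labels: ['C', 'Q', 'S', 'nan']
print(list(codes))                 # [2, 0, 3, 1, 2]

# A dedicated encoder can map the integers back to the original labels
print(list(le_embarked.inverse_transform(codes)))
```

The integers follow the sorted order of the labels, so they carry no meaning beyond identifying the category.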

#### We need to exclude the Name, PassengerId, Survived, and Ticket columns from imputation

In [None]:
columns = list(all_data.columns)
columns.remove('PassengerId')
columns.remove('Survived')
columns.remove('Ticket')
columns.remove('Name')
columns

#### **Knn Imputer**

In [None]:
knn = KNNImputer()

imputed_data = all_data[columns]
imputed_data = knn.fit_transform(imputed_data)
imputed_data
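To see what KNNImputer does to a single gap, here is a minimal sketch on a tiny made-up array. It uses `n_neighbors=2` so the arithmetic is easy to follow; the notebook keeps the default number of neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
# Each gap becomes the mean of that column over the 2 nearest rows:
# row 0, col 2 -> (3 + 5) / 2 = 4.0
# row 2, col 0 -> (3 + 8) / 2 = 5.5
```

Distances are computed only on the columns both rows have, which is why rows with gaps can still serve as neighbors.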

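#### You must use iloc or loc to edit the dataframe!

In [None]:
# Copy the imputed Age and Fare back, looking up their positions by column name
all_data.loc[:, 'Age'] = imputed_data[:, columns.index('Age')]
all_data.loc[:, 'Fare'] = imputed_data[:, columns.index('Fare')]
all_data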

### **Clear!** (Survived is still missing for the test rows, which is expected)

In [None]:
all_data.isna().sum()

*****

* ### **Preprocessing Name, Ticket columns**

#### Thank you for following me all the way here. Hang in there a little longer. We're almost there :)

#### **I think that the Name column is very important**

#### **If a family in the train data survived, other family members in the test data are more likely to survive.**

#### The code below extracts each passenger's title (Mr, Miss, ...) from the name and encodes it as a number

In [None]:
rare_titles = [' Dr',' Mlle',' Rev',' Major',' Col',' Don',' the Countess',' Lady',' Jonkheer',' Sir',' Mme',' Ms',' Capt',' Dona']
title_codes = {' Mr':1,' Miss':2,' Mrs':2,' Master':3,'Rare':4}

def encode_title(names):
    # 'Braund, Mr. Owen Harris' -> ' Mr' (the leading space is kept, matching the lists above)
    titles = names.apply(lambda x: x.split(',')[1].split('.')[0])
    # Group the uncommon titles together, then map every title to a number
    return titles.replace(rare_titles, 'Rare').replace(title_codes)

all_data['Name'] = encode_title(all_data['Name'])
test['Name'] = encode_title(test['Name'])
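The same name strings can also be parsed with pandas string methods. A hedged sketch on three example names: the regular expression and the surname split are illustrative choices, not what the notebook uses, and the titles come out without the leading space.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Title: the text between the first comma and the next period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())    # ['Mr', 'Miss', 'the Countess']

# Surname: everything before the comma, a rough proxy for the family
surnames = names.str.split(',').str[0]
print(surnames.tolist())  # ['Braund', 'Heikkinen', 'Rothes']
```

The surname is one possible way to act on the family idea above, though unrelated passengers can share a surname.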

## **Ticket?**

#### **First, we need to split each ticket on spaces**

In [None]:
ticket_split = all_data['Ticket'].apply(lambda x : x.split(' '))
ticket_split

#### **How about grouping the tickets by how many parts they have??**

#### Using agg(len), we get the number of parts in each ticket and can group the tickets by it

#### The first element of each list appears to be a code for the ticket type (A/5, STON/, etc.)

In [None]:
ticket_2 = ticket_split[ticket_split.agg(len) == 2]
ticket_2

In [None]:
ticket_2_index = ticket_split[ticket_split.agg(len) == 2].index
ticket_2_index

In [None]:
ticket_code_2 = []

for i in ticket_2.index:
    ticket_code_2.append(ticket_2[i][0])
    
ticket_code_2

In [None]:
ticket_code_2_labeled = le.fit_transform(ticket_code_2)
ticket_code_2_labeled += 1
ticket_code_2_labeled

In [None]:
ticket_3 = ticket_split[ticket_split.agg(len) == 3]
ticket_3

In [None]:
ticket_3_index = ticket_split[ticket_split.agg(len) == 3].index

ticket_code_3 = []

for i in ticket_3.index:
    ticket_code_3.append(ticket_3[i][0])
    
ticket_code_3

In [None]:
ticket_code_3_labeled = le.fit_transform(ticket_code_3)
# Labels start at 0, so shift past the largest 2-part code to avoid reusing it
ticket_code_3_labeled += ticket_code_2_labeled.max() + 1
ticket_code_3_labeled

#### Making a new column! (tickets without a prefix keep the code 0)

In [None]:
all_data['ticket_code'] = 0
all_data

In [None]:
all_data.loc[ticket_2_index, 'ticket_code'] = list(ticket_code_2_labeled)
all_data.loc[ticket_3_index, 'ticket_code'] = list(ticket_code_3_labeled)
all_data
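For comparison, a shorter sketch of the prefix idea using `pd.factorize` on made-up ticket strings. This is an alternative, not the notebook's method: it gives one code per prefix regardless of how many parts the ticket has.

```python
import pandas as pd

tickets = pd.Series(['A/5 21171', '113803', 'STON/O2. 3101282',
                     'A/5 3540', 'SC/AH Basle 541'])

parts = tickets.str.split(' ')
# Keep the first part only when the ticket has more than one part
prefix = parts.str[0].where(parts.str.len() > 1)

# factorize numbers the prefixes in order of appearance and marks missing as -1
codes, uniques = pd.factorize(prefix)
ticket_code = codes + 1  # shift so that "no prefix" becomes 0
print(list(uniques))      # ['A/5', 'STON/O2.', 'SC/AH']
print(list(ticket_code))  # [1, 0, 2, 1, 3]
```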

#### **Done!!!**

In [None]:
all_data2 = all_data.drop(columns = ['Ticket', 'Survived', 'PassengerId'])
all_data2

## Scaling

In [None]:
all_data2['Fare'] = np.log1p(all_data2['Fare'])

# **Modeling**

## There are many **categorical columns!**

### How about using **CatBoost**?

* ### **Train_Test_Split**

#### **Using validation data for evaluation**

#### **In a classification competition, you can use the 'stratify' option to keep the class balance the same in both splits**

In [None]:
train2 = all_data2[:len(train)]
test2 = all_data2[len(train):].reset_index(drop = True)

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(train2, train['Survived'], test_size = 0.2, random_state = 42, stratify = train['Survived'])
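A minimal sketch of what `stratify` guarantees, on made-up labels with a 60/40 class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # dummy feature, one column
y = np.array([0] * 60 + [1] * 40)   # made-up labels, 40% positives

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(y_tr), len(y_va))      # 80 20
print(y_tr.mean(), y_va.mean())  # both 0.4, the same share of positives
```

Without `stratify`, the share of positives in the validation split would vary with the random seed.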

In [None]:
cat = CatBoostClassifier(verbose = 1000,
                         eval_metric='Accuracy',
                         early_stopping_rounds=1000,
                         n_estimators=10000,
                         learning_rate = 0.025,
                         max_depth=7)

cat.fit(x_train, y_train, eval_set=[(x_valid, y_valid)])

# Predict class labels (0/1) for the test set and write the submission file
result = cat.predict(test2)

sub['Survived'] = result

sub.to_csv('sub_catboost.csv', index=False)

* ### **Stratified Kfold**

#### You can use the code below as your StratifiedKFold baseline

In [None]:
stk = StratifiedKFold(n_splits=5, random_state = 42, shuffle = True)

result_cat = 0  # running average of the predicted survival probabilities

for fold, (train_index, valid_index) in enumerate(stk.split(train2, train['Survived'])):
    # split() yields positional indices, so select rows with iloc
    x_train, y_train = train2.iloc[train_index], train['Survived'].iloc[train_index]
    x_valid, y_valid = train2.iloc[valid_index], train['Survived'].iloc[valid_index]
    
    cat = CatBoostClassifier(verbose = 1000,
                         eval_metric='Accuracy',
                         early_stopping_rounds=1000,
                         n_estimators=10000,
                         learning_rate = 0.02,
                         max_depth=8)
    print('----------Fold', fold+1, 'Start!--------')
    cat.fit(x_train, y_train, eval_set=[(x_valid, y_valid)])
    print('----------Fold', fold+1, 'Done!--------')
    # Each fold contributes an equal share of the averaged probability
    result_cat += cat.predict_proba(test2)[:, 1] / stk.get_n_splits()

print('All Done!')
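A minimal sketch of how StratifiedKFold divides the rows, on made-up labels (10 negatives and 5 positives) with a dummy feature matrix:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 10 + [1] * 5)  # made-up labels
X = np.zeros((len(y), 1))         # dummy features, only the row count matters

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
positives_per_fold = []
for train_idx, valid_idx in skf.split(X, y):
    fold_sizes.append(len(valid_idx))
    positives_per_fold.append(int(y[valid_idx].sum()))

print(fold_sizes)          # [3, 3, 3, 3, 3]
print(positives_per_fold)  # [1, 1, 1, 1, 1]
```

Every validation fold keeps the 2:1 class ratio of the full label vector, and each row is used for validation exactly once.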

In [None]:
# result_cat holds averaged probabilities, so threshold at 0.5 to get 0/1 labels
sub['Survived'] = (result_cat >= 0.5).astype(np.int64)
sub.to_csv('sub_cat_stratifiedkfold.csv', index=False)
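Averaging `predict_proba` over the folds gives floats between 0 and 1, while the submission needs 0/1 labels. A minimal sketch with made-up probabilities comparing a plain integer cast with a threshold (the 0.5 cut-off is an assumption, a common default):

```python
import numpy as np

avg_proba = np.array([0.12, 0.49, 0.51, 0.97])  # made-up averaged probabilities

# Casting to int drops the fractional part, so every value below 1 becomes 0
truncated = avg_proba.astype(np.int64)
print(truncated)    # [0 0 0 0]

# Comparing against a threshold first gives the intended labels
thresholded = (avg_proba >= 0.5).astype(np.int64)
print(thresholded)  # [0 0 1 1]
```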

## **Thank you so much for reading until the end**
## **I'm glad if it helped you!**
## **If this notebook helped you learn, please don't forget to upvote!!**