# Titanic

<img src="https://upload.wikimedia.org/wikipedia/commons/a/a2/Titanic_lifeboat.jpg" width="500">

# 1. Load Data & Check Information

In [None]:
import pandas as pd
import numpy as np
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')

<font size="3">
As you can see, the data is divided into two groups: a train set and a test set. <br /> The train set contains 891 passenger records and the test set contains 418, each identified by PassengerId. Each record describes a different passenger.

In [None]:
df_train.head()

In [None]:
df_train.tail()

In [None]:
df_test.head()

In [None]:
df_test.tail()

<font size=3>
Based on `describe()`, only about 38% of the passengers in the train set survived. <br />
Pclass (ticket class) is divided into three classes: 1st, 2nd, and 3rd. <br />
The oldest passenger on the Titanic was 80 years old and the youngest was less than one year old. <br />
More than 50% of passengers traveled without any siblings or spouses. <br />
Also, more than 75% of passengers traveled without any parents or children. <br />
Lastly, the highest ticket price was \$512.3 and the lowest was \$0!

In [None]:
df_train.describe()
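
The statement about siblings and spouses comes from reading the percentile rows of `describe()`. Here is a minimal sketch of that reasoning on a toy column; the values are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-in for the SibSp column (hypothetical values, not the real data).
sibsp = pd.Series([0, 0, 0, 1, 0, 2, 0, 1], name="SibSp")

summary = sibsp.describe()

# If the "50%" row (the median) is 0, at least half of the values are 0.
median_is_zero = summary["50%"] == 0

# Direct check: share of passengers with no siblings or spouses.
share_without = (sibsp == 0).mean()
print(summary["50%"], share_without)
```

The same reading applies to the "75%" row of Parch: when it is 0, at least three quarters of the values are 0.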

# 2. Data Overview

<font size=3>
1. Passenger Class = More than half of the 1st class passengers survived! Unfortunately, many 2nd and 3rd class passengers did not make it... <br />
2. Female or Male = Interestingly, the survival rate of females is more than 3 times that of males. <br />
3. Port of Embarkation = The survival rate was highest for passengers who embarked at C (=Cherbourg), second for Q (=Queenstown), and third for S (=Southampton).

In [None]:
from matplotlib import pyplot as plt
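# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6; use whichever exists
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')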

fig1, ax1 = plt.subplots(nrows=1, ncols=3,figsize=(10,5))
df_train.groupby('Pclass')['Survived'].mean().plot.bar(ax=ax1[0],rot=0, title='Passenger Class',edgecolor="k", xlabel='')
df_train.groupby('Sex')['Survived'].mean().plot.bar(ax=ax1[1],rot=0, title = 'Female or Male',edgecolor="k", xlabel='')
df_train.groupby('Embarked')['Survived'].mean().plot.bar(ax=ax1[2],rot=0, title = 'Port of Embarkation',edgecolor="k",xlabel='')

plt.tight_layout()
plt.show()
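
Each bar above is the mean of the 0/1 Survived column within a group, which is the survival rate of that group. A small sketch on a toy frame (hypothetical rows, not the real data) shows the calculation:

```python
import pandas as pd

# Toy passengers (hypothetical): three women and three men.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = fraction of survivors in that group.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```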

<font size=3>
1. Age = The 'Under 20' group has the highest survival rate and the 'Under 80' group has the lowest. <br />
2. # of siblings / spouses = Passengers with few siblings or spouses aboard had a better chance to survive than those with many. <br />
3. # of parents / children = This result is kind of tricky for me: passengers with 3 parents or children aboard have the highest survival rate. <br />
4. Passenger Fare = This result is closely related to Passenger Class.

In [None]:
#Age
age_bins = np.arange(0, 100, 20, dtype='int')
age_labels = [f'Under {i}' for i in age_bins[1:]]
age_group = pd.cut(df_train['Age'], bins=age_bins, labels=age_labels, right=False).rename(None)

#Fare
fare_bins = np.arange(0, 601, 200, dtype='int')
fare_labels = [f'Under {i}' for i in fare_bins[1:]]
fare_group = pd.cut(df_train['Fare'], bins=fare_bins, labels=fare_labels, right=False).rename(None)

fig2, ax2 = plt.subplots(nrows=2, ncols=2,figsize=(15,10))
df_train.groupby(age_group)['Survived'].mean().plot.bar(ax=ax2[0][0], rot=0, title='Age',edgecolor="k")
df_train.groupby('SibSp')['Survived'].mean().plot.bar(ax=ax2[0][1],rot=0, title = '# of siblings / spouses', edgecolor="k",xlabel='')
df_train.groupby('Parch')['Survived'].mean().plot.bar(ax=ax2[1][0],rot=0, title = '# of parents / children',edgecolor="k",xlabel='')
df_train.groupby(fare_group)['Survived'].mean().plot.bar(ax=ax2[1][1], rot=0, title='Passenger Fare',edgecolor="k")

plt.show()
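
One detail of the binning is worth a quick check: with `right=False` every bin includes its left edge and excludes its right edge, so a value equal to the last edge falls outside all bins. A minimal sketch with a few made-up ages:

```python
import numpy as np
import pandas as pd

# A few illustrative ages (hypothetical), including both edge cases.
ages = pd.Series([5, 20, 79, 80])

age_bins = np.arange(0, 100, 20)                    # [0, 20, 40, 60, 80]
age_labels = [f"Under {i}" for i in age_bins[1:]]   # Under 20 ... Under 80

# right=False gives the intervals [0, 20), [20, 40), [40, 60), [60, 80).
groups = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(groups)
```

Age 20 lands in 'Under 40', and age 80 gets no group at all (NaN), so it is left out of the grouped mean.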

# 3. Data Engineering

<font size=3>
Based on the Data Dictionary in 'Titanic - Machine Learning from Disaster', the categorical features are Pclass, Sex, and Embarked. The numerical features used here are Age, SibSp, Parch, and Fare; Ticket and Cabin are stored as text and are not used in the model. <br />
Categorical and numerical features will be handled separately. <br />
After splitting them, each group is preprocessed so that it can be fed to the model.

In [None]:
df_train.info()

<font size=3>
Since scikit-learn transformers work on arrays, I made a new class, DataFrameSelector, which takes a DataFrame and returns the selected columns as a NumPy array.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

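class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Nothing to learn; the selector is stateless
        return self

    def transform(self, X):
        return X[self.attribute_names].values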
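
To see what the selector returns, here is a small usage sketch. It repeats the class so that it runs on its own, and the toy frame is hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

# Toy frame (hypothetical values) with one column we do not want.
toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],
    "Name": ["A", "B"],
})

# Only the requested columns come back, as a plain NumPy array.
selected = DataFrameSelector(["Age", "Fare"]).fit_transform(toy)
print(type(selected).__name__, selected.shape)
```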

**Categorical Features** <br />
Replacing Age's null values with the median or mean can mislead the model, so I dropped the rows with null values. <br />
OneHotEncoder turns the categorical features into numeric columns that a machine learning model can use.


In [None]:
df_train_drop = df_train.dropna(subset=['Age', 'Embarked']).drop('Cabin', axis=1)
df_test_drop = (df_test.dropna(subset=['Age', 'Embarked'])).drop('Cabin', axis=1)
X_train = df_train_drop.drop('Survived', axis=1)
y_train = df_train_drop['Survived']
X_test = df_test_drop.copy()


cat_attribute = ['Pclass', 'Sex', 'Embarked']
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribute)),
    # `sparse` was renamed `sparse_output` in scikit-learn 1.2 (older versions: sparse=False)
    ('cat_encoder', OneHotEncoder(categories='auto', sparse_output=False))
])
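
As a quick illustration of what one-hot encoding produces, here is a sketch on a toy Embarked column (hypothetical values). It calls `.toarray()` on the default sparse output, which avoids depending on how the dense-output argument is named in a given scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy Embarked column (hypothetical), shaped as one feature with four rows.
embarked = np.array([["S"], ["C"], ["Q"], ["S"]])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(embarked).toarray()

# Categories are sorted, so the output columns are C, Q, S in that order.
print(encoder.categories_[0])
print(onehot)
```

Each row has exactly one 1, in the column of its category.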

**Numerical Features** <br />
Standardizing the numerical features with StandardScaler puts them on a comparable scale, which can improve the accuracy of scale-sensitive models such as SVM!

In [None]:
num_attribute = ['Age', 'SibSp', 'Parch', 'Fare']
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribute)),
    ('Imputer', SimpleImputer()),
    ('num_scale', StandardScaler())
])
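
Here is a small sketch of what the imputer and scaler do together, on a toy array with one missing value (hypothetical numbers). `SimpleImputer()` fills with the column mean by default, and `StandardScaler` then centers each column at 0 with unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data (hypothetical); the first column has a missing value.
X_toy = np.array([[20.0, 10.0],
                  [np.nan, 30.0],
                  [40.0, 50.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer()),    # default strategy: column mean
    ("scale", StandardScaler()),
])
scaled = toy_pipeline.fit_transform(X_toy)

# The missing value was filled with the column mean (30), which scales to 0.
print(scaled)
```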

**ColumnTransformer** <br />
After handling the categorical and numerical features, both are combined with ColumnTransformer so they can be fed to the machine learning model.

In [None]:
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
    ('num_pipeline', num_pipeline, num_attribute),
    ('cat_pipeline', cat_pipeline, cat_attribute)
])

train_prepared = full_pipeline.fit_transform(X_train)
train_prepared.shape
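
The number of columns in the prepared array is the number of numerical features plus one column per category of each categorical feature. A minimal sketch on a toy frame (hypothetical rows, and simpler transformers than the pipelines above) makes the count easy to verify:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values): 2 numeric and 2 categorical columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
})

toy_transformer = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "Fare"]),
        ("cat", OneHotEncoder(), ["Sex", "Pclass"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
prepared = toy_transformer.fit_transform(toy)

# 2 numeric + 2 Sex categories + 2 Pclass categories = 6 columns.
print(prepared.shape)
```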

# 4. Model Selection & GridSearch

<font size=3>
I am going to use Logistic Regression and SVM. Each model will be tuned with grid search. After tuning, we will check which model has the higher accuracy.

**Logistic Regression**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import math

log_reg = LogisticRegression(random_state=42, n_jobs=-1)
log_params = {'tol':[1e-4, 1e-2, 1e-1], 'C':[1,3,5]}
log_grid = GridSearchCV(log_reg, log_params, cv=3)
log_grid.fit(train_prepared, y_train)
log_predict = log_grid.predict(train_prepared)
log_graph = log_grid.predict_proba(train_prepared)[:,1]

**SVM**

In [None]:
from sklearn.svm import SVC

svc = SVC(random_state=42)
svc_params = {'kernel':('poly', 'rbf'), 'C':[1,3,5], 'tol':[1e-4, 1e-2, 1e-1]}
svc_grid = GridSearchCV(svc, svc_params, cv=3)
svc_grid.fit(train_prepared, y_train)
svc_predict = svc_grid.predict(train_prepared)
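
After fitting, a grid search object exposes the winning parameter combination and its mean cross-validated score. The sketch below shows where to find them, using synthetic data from `make_classification` as a stand-in for the prepared Titanic features; the data and the small grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features (not the Titanic data).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [1, 3, 5]}, cv=3)
grid.fit(X_toy, y_toy)

# best_params_: the winning combination; best_score_: its mean CV accuracy.
print(grid.best_params_, round(grid.best_score_, 3))
```

The same attributes are available on `log_grid` and `svc_grid` once they have been fitted.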

<font size=3>
As you can see, SVC is almost 4 percentage points more accurate than Logistic Regression on the training data. <br />
Therefore, I am going to use SVC on the test data set.

In [None]:
from sklearn.metrics import accuracy_score

log_score = str(round(accuracy_score(df_train_drop['Survived'], log_predict) * 100, 2))
svc_score = str(round(accuracy_score(df_train_drop['Survived'], svc_predict) * 100, 2))

print('The accuracy score for Logistic Regression is ' + log_score + '%')
print('The accuracy score for SVC is ' + svc_score + '%')
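
The scores above are measured on the same rows the models were trained on, so they can be optimistic. One possible cross-check is to compare training accuracy with cross-validated accuracy. This is a sketch on synthetic data (an assumption, not the Titanic features), so the numbers it prints say nothing about the models above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Accuracy on the rows the model was fitted on.
model = SVC(random_state=42).fit(X_toy, y_toy)
train_acc = accuracy_score(y_toy, model.predict(X_toy))

# Mean accuracy on held-out folds, each scored by a model that never saw them.
cv_acc = cross_val_score(SVC(random_state=42), X_toy, y_toy, cv=3).mean()

print(round(train_acc, 3), round(cv_acc, 3))
```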

# 5. Submission

<font size=3>
Now, it's time to submit our machine learning model!

In [None]:
# Reuse the preprocessing fitted on the train set; do not refit it on the test set
test_prepared = full_pipeline.transform(df_test)
Y_pred = svc_grid.predict(test_prepared)
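
The preprocessing should apply the statistics learned from the train set to the test set, because the model was trained on features scaled that way. A small sketch with a toy column (hypothetical numbers) shows how refitting on the test set changes the result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values).
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

# Consistent: mean and std come from the train set only.
scaler = StandardScaler().fit(train)
test_consistent = scaler.transform(test)

# Refit on test: the same raw values map to different scaled values.
test_refit = StandardScaler().fit_transform(test)

print(test_consistent.ravel(), test_refit.ravel())
```

The raw value 20 equals the train mean, so it scales to 0 with the train-fitted scaler, but it becomes -1 when the scaler is refit on the two test rows.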

In [None]:
submission = pd.DataFrame({
        "PassengerId": df_test["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('titanic.csv', index=False)
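
Before uploading, a few quick checks on the submission frame can catch format problems: the two expected columns, one row per test passenger, and only 0/1 predictions. This sketch uses a toy frame with hypothetical IDs and predictions standing in for `df_test["PassengerId"]` and `Y_pred`.

```python
import pandas as pd

# Hypothetical IDs and predictions (stand-ins for the real ones).
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = [0, 1, 0]

toy_submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions,
})

# Checks: column names, one row per passenger, binary predictions.
columns_ok = list(toy_submission.columns) == ["PassengerId", "Survived"]
rows_ok = len(toy_submission) == len(passenger_ids)
binary_ok = bool(toy_submission["Survived"].isin([0, 1]).all())
print(columns_ok, rows_ok, binary_ok)
```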