<a href="https://colab.research.google.com/github/zzoobro/tongteuk/blob/main/ANN_practice_with_titanic_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

-----------
# Keras learning on Titanic Survival data
-----------

## Load dataset

Import library

In [31]:
from google.colab import files
import io
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')  # ignore warning messages

Load train dataset

In [32]:
# upload the train dataset

file_uploaded = files.upload()

Saving train.csv to train (1).csv


In [33]:
# save the train dataset as 'train'

train = pd.read_csv(io.BytesIO(file_uploaded['train.csv']))
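
The `Saving train.csv to train (1).csv` message shows that Colab renamed the upload, and the dictionary returned by `files.upload()` may then be keyed by a name other than `'train.csv'`. A minimal sketch of reading whichever single file was uploaded without hard-coding the key is given here; `fake_upload` and `read_single_upload` are hypothetical names used only for illustration, and the bytes are made up.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dict returned by files.upload():
# it maps each uploaded file name to the file's raw bytes.
fake_upload = {'train (1).csv': b'PassengerId,Survived\n1,0\n2,1\n'}

def read_single_upload(uploaded):
    """Read the only uploaded file into a DataFrame, whatever its key is."""
    if len(uploaded) != 1:
        raise ValueError('expected exactly one uploaded file')
    raw_bytes = next(iter(uploaded.values()))
    return pd.read_csv(io.BytesIO(raw_bytes))

df = read_single_upload(fake_upload)
print(df.shape)
```

In the notebook, the same helper could be called on `file_uploaded` in place of the hard-coded lookup.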

Load test dataset

In [34]:
# upload the test dataset

file_uploaded = files.upload()

Saving test.csv to test (1).csv


In [35]:
# save the test dataset as 'test'

test = pd.read_csv(io.BytesIO(file_uploaded['test.csv']))

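Load test labels (Kaggle's `gender_submission.csv` is a sample submission that predicts survival for every female passenger, so it only stands in for the true test labels)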

In [36]:
# upload the test labels

file_uploaded = files.upload()

Saving gender_submission.csv to gender_submission (1).csv


In [37]:
# save the test labels as 'test_label'

test_label = pd.read_csv(io.BytesIO(file_uploaded['gender_submission.csv']))

## Data cleaning: training dataset

Some variables seem unimportant for predicting a passenger's survival,

e.g. PassengerId, Name, Ticket.

So we can drop these variables from the data.

In [38]:
# delete the variables that seem unimportant for predicting a passenger's survival

train.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

Finding missing values

  - Age : 177 missing values; we delete these observations.
  - Embarked : 2 missing values; we delete these observations.
  - Cabin : 687 missing values. That is most of the rows, so we drop the variable instead of deleting the observations.

In [39]:
for col in list(train.columns):
  NA_cnt = sum(pd.isnull(train[col]))
  print('# of NaN in {variable} : {number}'.format(variable=col, number=NA_cnt))

# of NaN in Survived : 0
# of NaN in Pclass : 0
# of NaN in Sex : 0
# of NaN in Age : 177
# of NaN in SibSp : 0
# of NaN in Parch : 0
# of NaN in Fare : 0
# of NaN in Cabin : 687
# of NaN in Embarked : 2


In [40]:
# drop Cabin and Missing observations

train.drop('Cabin', axis=1, inplace=True)
train.dropna(subset=['Age', 'Embarked'], inplace=True)
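
The count-then-drop steps above can also be written with vectorized pandas calls. The following is a minimal sketch on a made-up toy frame (`toy` is not the Titanic data), assuming the same policy: drop the mostly-empty column, then drop rows that are missing `Age` or `Embarked`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic training set (values are made up).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Count missing values per column in one call.
na_counts = toy.isnull().sum()
print(na_counts)

# Drop the mostly-empty column, then drop rows missing Age or Embarked,
# and renumber the remaining rows from 0.
cleaned = (toy.drop(columns='Cabin')
              .dropna(subset=['Age', 'Embarked'])
              .reset_index(drop=True))
print(cleaned)
```

Chaining without `inplace=True` keeps the original frame intact, which makes it easier to re-run a cell.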

Change Dtype of categorical variables

In [41]:
train.Survived = train.Survived.astype('category')
train.Pclass = train.Pclass.astype('category')
train.Sex = train.Sex.astype('category')
train.Embarked = train.Embarked.astype('category')

Summary of train

In [42]:
# reset the row index

train = train.reset_index(drop=True, inplace=False)

In [43]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Survived  712 non-null    category
 1   Pclass    712 non-null    category
 2   Sex       712 non-null    category
 3   Age       712 non-null    float64 
 4   SibSp     712 non-null    int64   
 5   Parch     712 non-null    int64   
 6   Fare      712 non-null    float64 
 7   Embarked  712 non-null    category
dtypes: category(4), float64(2), int64(2)
memory usage: 25.5 KB


## Data cleaning: test dataset

In [44]:
# clean the test data in the same way as the training data

test = pd.merge(test, test_label, on='PassengerId', how='left')

test.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True)
test.dropna(subset=['Age', 'Embarked', 'Pclass', 'Sex', 'SibSp', 'Fare'], inplace=True)

In [45]:
# make dummy variables for machine learning

test['Pclass_1st'] = [1 if x==1 else 0 for x in test.Pclass]
test['Pclass_2nd'] = [1 if x==2 else 0 for x in test.Pclass]

test['Sex_male'] = [1 if x=='male' else 0 for x in test.Sex]

test['Embarked_Chb'] = [1 if x=='C' else 0 for x in test.Embarked]
test['Embarked_Sth'] = [1 if x=='S' else 0 for x in test.Embarked]
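
The list comprehensions above build 0/1 indicator columns by hand. As an alternative, `pd.get_dummies` can generate them; the sketch below uses a made-up toy frame, and note that it produces column names such as `Pclass_1` and `Embarked_C` rather than the `Pclass_1st` and `Embarked_Chb` names used in this notebook.

```python
import pandas as pd

# Toy rows (made up) with the three categorical variables.
toy = pd.DataFrame({
    'Pclass': [1, 3, 2],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# One 0/1 column per category value.
dummies = pd.get_dummies(toy, columns=['Pclass', 'Sex', 'Embarked'], dtype=int)

# Keep the same indicators as the hand-written version; the dropped ones
# (Pclass 3, female, Q) act as the reference categories.
kept = dummies[['Pclass_1', 'Pclass_2', 'Sex_male', 'Embarked_C', 'Embarked_S']]
print(kept)
```

Dropping one level per variable avoids redundant columns, since the omitted level is implied when all the others are 0.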

In [46]:
# split label and features

Y_test = test.Survived
X_test = test.loc[:, ['Pclass_1st', 'Pclass_2nd', 'Sex_male', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Chb', 'Embarked_Sth']]

In [47]:
# apply min_max scaling

import numpy as np

def min_max_scaler(vec):
  min_value = min(vec)
  max_value = max(vec)
  return (np.array(vec) - np.array(min_value)) / (np.array(max_value)-np.array(min_value))
  
X_test = X_test.apply(min_max_scaler, axis=0)
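
Scaling the test features with their own minimum and maximum means the same raw value can map to different scaled values in the training and test sets. A common alternative is to learn the range from the training data and reuse it for the test data. The sketch below assumes that approach and uses made-up toy frames (`train_toy`, `test_toy`), not the notebook's variables.

```python
import pandas as pd

# Made-up toy features.
train_toy = pd.DataFrame({'Age': [20.0, 40.0, 60.0], 'Fare': [10.0, 30.0, 50.0]})
test_toy = pd.DataFrame({'Age': [30.0, 70.0], 'Fare': [10.0, 20.0]})

# Learn the column-wise minimum and range from the training data only.
col_min = train_toy.min()
col_range = train_toy.max() - col_min

# Apply the same transformation to both sets.
train_scaled = (train_toy - col_min) / col_range
test_scaled = (test_toy - col_min) / col_range
print(test_scaled)
```

With this scheme, test values outside the training range fall outside [0, 1] (the age of 70 here scales to 1.25), which is expected.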

In [48]:
# convert the label to one-hot (categorical) form

from tensorflow.keras.utils import to_categorical

Y_test = to_categorical(Y_test)

## Predict Survived

### Making features and label for training dataset

In [49]:
# make dummy variables for machine learning

train['Pclass_1st'] = [1 if x==1 else 0 for x in train.Pclass]
train['Pclass_2nd'] = [1 if x==2 else 0 for x in train.Pclass]

train['Sex_male'] = [1 if x=='male' else 0 for x in train.Sex]

train['Embarked_Chb'] = [1 if x=='C' else 0 for x in train.Embarked]
train['Embarked_Sth'] = [1 if x=='S' else 0 for x in train.Embarked]

In [50]:
# split label and features

Y = train.Survived
X = train.loc[:, ['Pclass_1st', 'Pclass_2nd', 'Sex_male', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Chb', 'Embarked_Sth']]

In [51]:
# apply min_max scaling to X

import numpy as np

def min_max_scaler(vec):
  min_value = min(vec)
  max_value = max(vec)
  return (np.array(vec) - np.array(min_value)) / (np.array(max_value)-np.array(min_value))

X = X.apply(min_max_scaler, axis=0)

In [52]:
from tensorflow.keras.utils import to_categorical

Y = to_categorical(Y)
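
For a 0/1 label, `to_categorical` produces a two-column one-hot array. A minimal NumPy sketch of the same encoding, on made-up labels, looks like this:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])   # made-up labels: 0 = died, 1 = survived

# Row i of the 2x2 identity matrix is the one-hot vector for class i.
one_hot = np.eye(2)[labels]
print(one_hot)

# argmax along each row recovers the original labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

This is why the network below has two output units: one per column of the encoded label.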

### Multi-Layer Perceptron(ANN)

In [53]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [54]:
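# MLP: 9 input features -> 512 -> 10 -> 5 hidden units -> 2 outputs
model_MLP = Sequential()
model_MLP.add(Dense(512, input_shape=(9,), activation='sigmoid'))
model_MLP.add(Dense(10, activation='sigmoid'))
model_MLP.add(Dense(5, activation='sigmoid'))
# NOTE: no activation here, so the two outputs are unbounded linear scores, not probabilities
model_MLP.add(Dense(2))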
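
The final `Dense(2)` layer has no activation, so it emits raw scores rather than probabilities. With one-hot labels over two classes, a usual pairing in Keras is `activation='softmax'` on the last layer together with `loss='categorical_crossentropy'`. The NumPy sketch below only illustrates what that pairing computes; the `logits` values are made up and the helper functions are written here for illustration, not taken from Keras.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-(y_true * np.log(probs)).sum(axis=1).mean())

logits = np.array([[2.0, -1.0], [0.0, 0.0]])   # made-up outputs of the last layer
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])    # one-hot labels

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs)
print(loss)
```

Equal scores, as in the second row, give a probability of 0.5 for each class.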

In [55]:
from tensorflow.keras import optimizers

model_MLP.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

In [59]:
model_MLP.fit(X, Y, batch_size=50, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fdf6c475fd0>

Check the performance of the MLP model

In [60]:
test_loss, test_acc = model_MLP.evaluate(X_test, Y_test)



In [61]:
print('Accuracy of test : {}'.format(round(test_acc, 3)))

Accuracy of test : 0.384
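
On a two-class problem, always predicting the more frequent class scores at least 0.5, so a test accuracy of 0.384 suggests that the output layer and loss are worth revisiting. The sketch below shows how accuracy and that majority-class baseline can be computed from model outputs and one-hot labels; both arrays are made up for illustration.

```python
import numpy as np

# Made-up model outputs (one row per passenger) and one-hot labels.
outputs = np.array([[0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
y_true = np.array([[0, 1], [1, 0], [1, 0], [1, 0]])

# The predicted class is the column with the largest output.
pred_class = outputs.argmax(axis=1)
true_class = y_true.argmax(axis=1)
accuracy = (pred_class == true_class).mean()

# Baseline: always predict the most frequent class.
majority_share = np.bincount(true_class).max() / len(true_class)
print(accuracy, majority_share)
```

Comparing the model against this baseline is a quick sanity check before tuning anything else.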
