<a href="https://colab.research.google.com/github/yusica09/seoul-AI-hub-study/blob/main/4%EC%A3%BC%EC%B0%A8/Proj)_Spaceship_Titanic_classification_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spaceship Titanic with DL
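### Background
* Welcome to the year 2912, where data science skills are needed to solve a cosmic mystery. We have received a transmission from four light-years away, and the situation does not look good.

* The Spaceship Titanic was an interstellar passenger liner launched a month ago. With about 13,000 passengers on board, it set out on its maiden voyage, carrying emigrants to three newly habitable exoplanets orbiting stars near our solar system.

* While rounding Alpha Centauri on the way to its first destination, 55 Cancri E, the unwary Spaceship Titanic collided with a spacetime anomaly hidden inside a dust cloud. Sadly, it met a fate similar to that of its namesake from 1,000 years earlier. The ship stayed intact, but almost half of the passengers were transported to another dimension!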

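### Data description
* *PassengerId*
    - A unique ID for each passenger. Each ID has the form group_number (for example, 2513_01): the first part identifies the group the passenger is travelling with, and the second part is the passenger's number within that group. People in a group are often family members, but not always.

* *HomePlanet*
    - The planet the passenger departed from, typically their planet of permanent residence.

* *CryoSleep*
    - Whether the passenger elected to be put into cryosleep for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

* *Cabin*
    - The cabin number where the passenger is staying, in the form deck/num/side, where side is P for Port or S for Starboard.

* *Destination*
    - The planet where the passenger will disembark.

* *Age*
    - The age of the passenger.

* *VIP*
    - Whether the passenger paid for special VIP service during the voyage.

* *RoomService, FoodCourt, ShoppingMall, Spa, VRDeck*
    - The amount the passenger was billed at each of the Spaceship Titanic's luxury amenities.

* *Name*
    - The first and last names of the passenger.

* *Transported*
    - Whether the passenger was transported to another dimension. This is the target label.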
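
The composite `PassengerId` and `Cabin` fields described above can be split into their parts with pandas. The following is only a minimal sketch: the three rows are copied from the tables shown later in this notebook, and the new column names (`Group`, `GroupNum`, `Deck`, `CabinNum`, `Side`) are hypothetical names chosen for illustration, not columns of the dataset.

```python
import pandas as pd

# Three sample rows taken from the training table shown in this notebook.
sample = pd.DataFrame({
    "PassengerId": ["2513_01", "2774_02", "8862_04"],
    "Cabin": ["F/575/P", "F/575/P", "C/329/S"],
})

# PassengerId -> group identifier and number within the group.
sample[["Group", "GroupNum"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin -> deck / num / side.
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

print(sample[["Group", "GroupNum", "Deck", "CabinNum", "Side"]])
```

All derived columns are strings at this point; cast them if numeric values are needed.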

## Import libraries

In [3]:
import pandas as pd
import tensorflow as tf

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Data Load
### Read CSV files with pandas

In [5]:
train_data = pd.read_csv("/content/drive/MyDrive/Datasets/spaceship_titanic_train_data.csv")
train_labels = pd.read_csv("/content/drive/MyDrive/Datasets/spaceship_titanic_train_labels.csv")

test_data = pd.read_csv("/content/drive/MyDrive/Datasets/spaceship_titanic_test_data.csv")
test_labels = pd.read_csv("/content/drive/MyDrive/Datasets/spaceship_titanic_test_labels.csv")

train = pd.concat([train_data, train_labels], axis=1)
test = pd.concat([test_data, test_labels], axis=1)
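
`pd.concat(..., axis=1)` aligns rows by index, so features and labels only line up if both frames have the same length and index. A minimal sketch of a sanity check, using two small hypothetical frames in place of the CSV files:

```python
import pandas as pd

# Hypothetical stand-ins for the feature and label frames.
features = pd.DataFrame({"Age": [28.0, 17.0, 28.0]})
labels = pd.DataFrame({"Transported": [False, False, True]})

# Both checks should hold before concatenating column-wise.
assert len(features) == len(labels), "row counts differ"
assert features.index.equals(labels.index), "indexes differ"

combined = pd.concat([features, labels], axis=1)
print(combined)
```

If either check fails, the concatenated frame would contain NaN rows where the indexes do not match.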

### Preprocessing
* Fill missing values (backward fill), then connect the data to the data loader

In [6]:
# Backward-fill each missing value with the next valid value in the column
train = train.bfill()
test = test.bfill()
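
A backward fill copies the next valid value upward, so a missing value in the last row has nothing to copy from and stays missing. The sketch below shows this on a hypothetical one-column frame and one possible way to handle it (a forward fill afterwards); whether that is appropriate for this dataset is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value at the start and at the end.
toy = pd.DataFrame({"Age": [np.nan, 20.0, np.nan]})

backfilled = toy.bfill()
remaining = int(backfilled.isna().sum().sum())  # trailing NaN is still there

# One option: forward-fill whatever the backward fill could not reach.
filled_both = toy.bfill().ffill()

print(remaining)
print(filled_both["Age"].tolist())
```

Checking `isna().sum()` after filling is a cheap way to confirm that no missing values reach the data loader.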

In [7]:
train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep          bool
Cabin            object
Destination      object
Age             float64
VIP                bool
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [8]:
# Some dtypes cannot be converted to tensors, so cast them explicitly
train['HomePlanet'] = train['HomePlanet'].astype('category')
train['CryoSleep'] = train['CryoSleep'].astype('int64')
train['VIP'] = train['VIP'].astype('int64')
train['Transported'] = train['Transported'].map({True: 1, False: 0})
train['RoomService'] = train['RoomService'].astype('float64')
train['FoodCourt'] = train['FoodCourt'].astype('float64')
train['Spa'] = train['Spa'].astype('float64')
train['VRDeck'] = train['VRDeck'].astype('float64')


test['HomePlanet'] = test['HomePlanet'].astype('category')
test['CryoSleep'] = test['CryoSleep'].astype('int64')
test['VIP'] = test['VIP'].astype('int64')
test['Transported'] = test['Transported'].map({True: 1, False: 0})
test['RoomService'] = test['RoomService'].astype('float64')
test['FoodCourt'] = test['FoodCourt'].astype('float64')
test['Spa'] = test['Spa'].astype('float64')
test['VRDeck'] = test['VRDeck'].astype('float64')
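
The same casts are applied to `train` and `test`, so they could be collected in one helper. This is a sketch only: `cast_for_tensors` and the two column lists are hypothetical names, and the toy frame stands in for the real data. It also casts `ShoppingMall`, which the cell above leaves untouched because it is already `float64`.

```python
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
FLAG_COLS = ["CryoSleep", "VIP", "Transported"]


def cast_for_tensors(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the dtypes used in this notebook."""
    dtype_map = {"HomePlanet": "category"}
    dtype_map.update({col: "int64" for col in FLAG_COLS})     # True/False -> 1/0
    dtype_map.update({col: "float64" for col in SPEND_COLS})
    return df.astype(dtype_map)


# Hypothetical two-row frame with the relevant columns.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars"],
    "CryoSleep": [False, True],
    "VIP": [False, False],
    "Transported": [False, True],
    "RoomService": [0, 417],
    "FoodCourt": [55, 349],
    "ShoppingMall": [0, 634],
    "Spa": [656, 3],
    "VRDeck": [0, 1057],
})

converted = cast_for_tensors(toy)
print(converted.dtypes)
```

Calling the helper once per frame keeps the two conversions from drifting apart.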

In [9]:
train

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,2513_01,Earth,0,F/575/P,TRAPPIST-1e,28.0,0,0.0,55.0,0.0,656.0,0.0,Loree Mathison,0
1,2774_02,Earth,0,F/575/P,TRAPPIST-1e,17.0,0,0.0,1195.0,31.0,0.0,0.0,Crisey Mcbriddley,0
2,8862_04,Europa,1,C/329/S,55 Cancri e,28.0,0,0.0,0.0,0.0,0.0,0.0,Alramix Myling,1
3,8736_02,Mars,0,F/1800/P,TRAPPIST-1e,20.0,0,0.0,2.0,289.0,976.0,0.0,Tros Pota,1
4,0539_02,Europa,1,C/18/P,55 Cancri e,36.0,0,0.0,0.0,0.0,0.0,0.0,Achyon Nalanet,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6949,6076_01,Earth,0,G/988/S,TRAPPIST-1e,18.0,0,14.0,2.0,144.0,610.0,0.0,Therry Cames,1
6950,5537_01,Mars,0,F/1063/S,TRAPPIST-1e,50.0,0,690.0,0.0,30.0,762.0,428.0,Herms Bancy,0
6951,5756_06,Earth,0,F/1194/P,PSO J318.5-22,22.0,0,158.0,0.0,476.0,0.0,26.0,Karena Briggston,0
6952,0925_01,Mars,0,F/191/P,TRAPPIST-1e,34.0,0,379.0,0.0,1626.0,0.0,0.0,Skix Kraie,0


In [10]:
mean_values = train[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].mean()
variance_values = train[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].var()

# Similar to reduce_mean: average the per-column means and variances into single scalars
mean_values = mean_values.sum() / len(mean_values)
variance_values = variance_values.sum() / len(variance_values)
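
The scalars above average the statistics of the five columns, while the model later normalizes the sum of those five columns. These are different quantities. The sketch below compares them on a hypothetical two-passenger frame; using the statistics of the total instead is a possible alternative, not what this notebook does.

```python
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Hypothetical two-passenger frame, only for illustration.
toy = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0, 0.0],
     [10.0, 20.0, 30.0, 40.0, 50.0]],
    columns=spend_cols,
)

# Average of the per-column means (the quantity computed in the cell above).
avg_of_column_means = toy[spend_cols].mean().mean()

# Statistics of each passenger's total spend.
total_spend = toy[spend_cols].sum(axis=1)
total_mean = total_spend.mean()
total_var = total_spend.var(ddof=0)  # population variance; pandas defaults to ddof=1

print(avg_of_column_means, total_mean, total_var)
```

On this toy frame the mean of the total is five times the averaged column mean, so a layer that normalizes the total with the averaged statistics would not center it at zero.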

In [11]:
test

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0337_02,Mars,0,F/63/S,TRAPPIST-1e,19.0,0,417.0,349.0,634.0,3.0,1057.0,Weros Perle,1
1,2891_01,Earth,0,G/460/S,TRAPPIST-1e,18.0,0,4.0,904.0,0.0,0.0,1.0,Gleney Ortinericey,0
2,8998_01,Earth,1,G/1449/S,TRAPPIST-1e,41.0,0,0.0,0.0,0.0,0.0,0.0,Gerry Englence,0
3,1771_01,Earth,0,G/291/P,TRAPPIST-1e,35.0,0,0.0,338.0,436.0,0.0,0.0,Antone Cardner,1
4,9034_02,Europa,1,D/288/P,TRAPPIST-1e,43.0,0,0.0,0.0,0.0,0.0,0.0,Errairk Crakete,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1734,7656_01,Earth,1,G/1244/S,TRAPPIST-1e,16.0,0,0.0,0.0,0.0,0.0,0.0,Moniey Belley,0
1735,3437_02,Earth,1,G/553/S,TRAPPIST-1e,0.0,0,0.0,0.0,0.0,0.0,0.0,Carly Pager,1
1736,1384_01,Earth,0,E/105/S,TRAPPIST-1e,17.0,0,21.0,0.0,690.0,260.0,5.0,Violan Mayods,0
1737,6300_01,Mars,1,F/1303/P,TRAPPIST-1e,42.0,0,0.0,0.0,0.0,0.0,0.0,Risps Pure,1


### Data Loader

In [12]:
batch_size = 32

def df_to_dataset(dataframe, label_name="Transported", shuffle=True, batch_size=batch_size):
    dataframe = dataframe.copy()
    labels = dataframe.pop(label_name)
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
        ds = ds.repeat()
    ds = ds.batch(batch_size)

    return ds
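
To see what the batching step produces, here is a minimal sketch on a hypothetical five-row frame, without shuffling or repeating. It follows the same `from_tensor_slices((dict(features), labels))` pattern as the function above.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical five-row frame; only the structure matters here.
toy = pd.DataFrame({
    "Age": [28.0, 17.0, 20.0, 36.0, 18.0],
    "Transported": [0, 0, 1, 1, 1],
})

features = toy.copy()
labels = features.pop("Transported")

# Each element is (dict of feature tensors, label tensor).
ds = tf.data.Dataset.from_tensor_slices((dict(features), labels)).batch(2)

batch_sizes = [int(y.shape[0]) for _, y in ds]
print(batch_sizes)
```

The last batch is smaller when the row count is not a multiple of the batch size. With `shuffle=True` the function also calls `repeat()`, which makes the dataset infinite, so training has to be bounded with `steps_per_epoch`.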


In [13]:
train_ds = df_to_dataset(train)
train_ds

<_BatchDataset element_spec=({'PassengerId': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'HomePlanet': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'CryoSleep': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'Cabin': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Destination': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Age': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VIP': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'RoomService': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'FoodCourt': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'ShoppingMall': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Spa': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VRDeck': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Name': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

In [14]:
test_ds = df_to_dataset(test, shuffle=False)
test_ds

<_BatchDataset element_spec=({'PassengerId': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'HomePlanet': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'CryoSleep': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'Cabin': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Destination': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'Age': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VIP': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'RoomService': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'FoodCourt': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'ShoppingMall': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Spa': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'VRDeck': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'Name': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

In [15]:
for t, l in train_ds:
  print(t, l)
  break

for t, l in test_ds:
  print(t, l)
  break


{'PassengerId': <tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'5205_01', b'5927_07', b'7406_01', b'7365_01', b'3946_01',
       b'6693_01', b'2022_01', b'8723_01', b'0571_06', b'8807_01',
       b'6837_01', b'3287_02', b'3181_01', b'5625_05', b'6016_02',
       b'4817_01', b'9280_02', b'3768_01', b'7082_01', b'7156_03',
       b'1154_02', b'2215_03', b'9132_01', b'8049_01', b'0012_01',
       b'6453_01', b'3745_01', b'6947_02', b'3212_01', b'2189_01',
       b'4668_01', b'4663_02'], dtype=object)>, 'HomePlanet': <tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'Earth', b'Europa', b'Earth', b'Earth', b'Mars', b'Earth',
       b'Earth', b'Earth', b'Europa', b'Earth', b'Mars', b'Earth',
       b'Earth', b'Mars', b'Earth', b'Mars', b'Europa', b'Earth',
       b'Earth', b'Mars', b'Europa', b'Europa', b'Mars', b'Earth',
       b'Earth', b'Earth', b'Earth', b'Europa', b'Earth', b'Earth',
       b'Europa', b'Earth'], dtype=object)>, 'CryoSleep': <tf.Tensor: shape=(32,), dtype=int6

## Preprocessing with layers

In [16]:
inputs = {
  'CryoSleep': tf.keras.Input(shape=(), dtype='int64'),
  'HomePlanet': tf.keras.Input(shape=(), dtype='string'),
  'RoomService': tf.keras.Input(shape=(), dtype='float64'),
  'FoodCourt': tf.keras.Input(shape=(), dtype='float64'),
  'ShoppingMall': tf.keras.Input(shape=(), dtype='float64'),
  'Spa': tf.keras.Input(shape=(), dtype='float64'),
  'VRDeck': tf.keras.Input(shape=(), dtype='float64'),
  'VIP': tf.keras.Input(shape=(), dtype='int64'),
  'Cabin': tf.keras.Input(shape=(), dtype='string'),
  "Destination": tf.keras.Input(shape=(), dtype='string')
}

# Convert a 0/1 index to one-hot; e.g. [1] -> [0, 1].
type_output = tf.keras.layers.CategoryEncoding(num_tokens=2, output_mode='one_hot')(inputs['CryoSleep'])
dense_type = tf.keras.layers.Dense(2)(type_output)

vip = tf.keras.layers.CategoryEncoding(num_tokens=2, output_mode='one_hot')(inputs['VIP'])
dense_vip = tf.keras.layers.Dense(2)(vip)

# Convert home-planet strings to integer indices; index 0 is reserved for out-of-vocabulary values.
home_output = tf.keras.layers.StringLookup(vocabulary=list(set(train['HomePlanet'])))(inputs['HomePlanet'])
home_output = tf.keras.layers.Reshape([-1])(home_output)
dense_home = tf.keras.layers.Dense(3)(home_output)

destination_output = tf.keras.layers.StringLookup(vocabulary=list(set(train['Destination'])))(inputs['Destination'])
destination_output = tf.keras.layers.Reshape([-1])(destination_output)
dense_destination = tf.keras.layers.Dense(4)(destination_output)

# Sum the five spending columns, then normalize the total with the precomputed mean and variance.
weight_sum = tf.keras.layers.Add()([inputs['RoomService'], inputs['FoodCourt'], inputs['ShoppingMall'], inputs['Spa'], inputs['VRDeck']])
weight_sum = tf.keras.layers.Reshape([-1])(weight_sum)
weight_output = tf.keras.layers.Normalization(
      axis=None, mean=mean_values, variance=variance_values)(weight_sum)
weight_output = tf.keras.layers.Reshape([-1])(weight_output)
dense_weight = tf.keras.layers.Dense(1)(weight_output)

# Split the cabin string on "/"
cabin_split = tf.strings.split(inputs['Cabin'], "/")

# Keep only the last element (the side)
cabin_last = cabin_split.to_tensor()[:, -1]

# One-hot encode the side with a StringLookup layer
cabin_output = tf.keras.layers.StringLookup(vocabulary=["S", "P"], num_oov_indices=1, output_mode='one_hot')(cabin_last)
dense_cabin = tf.keras.layers.Dense(2)(cabin_output)

x = tf.concat([dense_type, dense_vip, dense_home, dense_destination, dense_weight, dense_cabin], -1) # (batch, features): concatenate along the feature axis
x = tf.keras.layers.Reshape([-1, 1])(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(1024, return_sequences=False))(x)

outputs = tf.keras.layers.Dense(1)(x) # Single logit; the sigmoid is applied inside the BCE loss (from_logits=True)

preprocessing_model = tf.keras.Model(inputs, outputs)
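
The preprocessing layers used above can be tried on their own with small constant tensors. This is a sketch under assumptions: the vocabulary order for the planets is fixed by hand here (the model builds it from a `set`, whose order is not guaranteed), and "Pluto" is a made-up value used to show the out-of-vocabulary case.

```python
import tensorflow as tf

# One-hot encode a 0/1 flag such as CryoSleep or VIP.
flag_encoder = tf.keras.layers.CategoryEncoding(num_tokens=2, output_mode="one_hot")
flag_one_hot = flag_encoder(tf.constant([0, 1]))

# Map strings to integer indices; index 0 is reserved for unseen values.
planet_lookup = tf.keras.layers.StringLookup(vocabulary=["Earth", "Europa", "Mars"])
planet_ids = planet_lookup(tf.constant(["Mars", "Earth", "Pluto"]))

# Take the last "/"-separated field of Cabin (the side), then one-hot encode it.
cabins = tf.constant(["F/575/P", "C/329/S"])
sides = tf.strings.split(cabins, "/").to_tensor()[:, -1]
side_lookup = tf.keras.layers.StringLookup(
    vocabulary=["S", "P"], num_oov_indices=1, output_mode="one_hot")
side_one_hot = side_lookup(sides)

print(flag_one_hot.numpy().tolist())
print(planet_ids.numpy().tolist())
print(side_one_hot.numpy().tolist())
```

The one-hot side vector has three slots (one out-of-vocabulary slot plus "S" and "P"). The planet and destination branches in the model feed the raw integer index into a `Dense` layer, which treats the index as a magnitude; a one-hot output mode would avoid imposing that order.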

### Model training

In [17]:
preprocessing_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
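
Because the loss is configured with `from_logits=True`, the model's raw output is a logit, not a probability. A minimal NumPy sketch of turning logits into probabilities and class predictions follows; the logit values and the 0.5 probability threshold are illustrative assumptions.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Logistic function mapping logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical logits, as a model with a Dense(1) output might produce.
logits = np.array([-2.0, 0.0, 3.0])

probs = sigmoid(logits)
preds = (probs >= 0.5).astype(int)  # a probability of 0.5 corresponds to a logit of 0

print(probs)
print(preds)
```

The same conversion would apply to the output of `preprocessing_model.predict(...)` before interpreting it as a probability of being transported.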

In [18]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     monitor='val_loss',
                                                     restore_best_weights=True,
                                                     verbose=1)
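
The `patience` setting can be read as "stop after this many consecutive epochs without a new best `val_loss`". The function below is a simplified pure-Python sketch of that rule, not the Keras implementation; `early_stopping_epoch` and the loss values are hypothetical.

```python
from typing import Optional, Sequence


def early_stopping_epoch(val_losses: Sequence[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses.
losses = [0.70, 0.60, 0.65, 0.64, 0.63]
stop_small_patience = early_stopping_epoch(losses, patience=2)
stop_large_patience = early_stopping_epoch(losses, patience=10)

print(stop_small_patience, stop_large_patience)
```

With `patience=10` and at most 20 epochs, the callback can only trigger if the best validation loss occurs within the first 10 epochs.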

In [20]:
max_epochs = 20

history = preprocessing_model.fit(train_ds,
                                  epochs=max_epochs,
                                  steps_per_epoch=len(train) // batch_size,
                                  validation_data=test_ds,
                                  validation_steps=len(test) // batch_size,
                                  callbacks=[early_stopping_cb])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
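
The step counts passed to `fit` come from integer division, so a partial final batch is dropped. The arithmetic can be sketched as follows; the row counts are an assumption inferred from the last index values shown in the tables above (6952 and 1737) with a default zero-based index.

```python
batch_size = 32

# Assumed row counts: last index shown + 1 (default RangeIndex).
n_train = 6953
n_test = 1738

steps_per_epoch = n_train // batch_size
validation_steps = n_test // batch_size

# Rows that do not fit into a full batch.
train_leftover = n_train - steps_per_epoch * batch_size
test_leftover = n_test - validation_steps * batch_size

print(steps_per_epoch, validation_steps, train_leftover, test_leftover)
```

Under these assumptions, validation during training skips the final partial batch, while a later `evaluate(test_ds)` call without a step limit iterates over every test row, so the two losses can differ slightly.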


In [21]:
preprocessing_model.evaluate(test_ds)



[0.5352100729942322, 0.7308798432350159]