<a href="https://colab.research.google.com/github/thomaschiari/Spaceship-Titanic-Kaggle-Competition/blob/main/ST_TFDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spaceship Titanic with TensorFlow

We will now use the decision-tree models from the TensorFlow Decision Forests (TFDF) library to predict whether or not a passenger of the Spaceship Titanic was transported.

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
import logging
import os
import numpy as np
warnings.filterwarnings('ignore')
logging.disable(sys.maxsize)

### Loading the data

In [13]:
train = pd.read_csv(os.path.join('data', 'train.csv'))
test = pd.read_csv(os.path.join('data', 'test.csv'))

In [14]:
train.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


### Preprocessing the data

In [15]:
train_ = train.drop(['PassengerId', 'Name'], axis=1)

In this dataset, null values in numeric columns are filled with 0. Null values in categorical columns are left untouched, since the TFDF library handles them itself.

In [16]:
num_cols = train_.select_dtypes(include=['int64', 'float64']).columns
train_[num_cols] = train_[num_cols].fillna(0)
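As a small illustration of the rule above, the sketch below applies the same fill to a made-up three-row frame (the values are invented for the example, not taken from the competition files). It also shows a median fill for `Age` as one possible alternative, since a 0 in that column reads as a newborn; that alternative is a suggestion, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values only).
raw = pd.DataFrame({
    'HomePlanet': ['Europa', None, 'Earth'],
    'Age': [39.0, np.nan, 16.0],
    'Spa': [0.0, 549.0, np.nan],
})

# Same rule as the notebook: numeric nulls become 0, categorical nulls stay.
filled = raw.copy()
num_cols = filled.select_dtypes(include='number').columns
filled[num_cols] = filled[num_cols].fillna(0)

# Possible alternative for Age (an assumption, not the notebook's choice):
# fill with the median of the observed ages instead of 0.
age_median_filled = raw['Age'].fillna(raw['Age'].median())

print(filled)
print(age_median_filled)
```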

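The TFDF library cannot handle booleans, so we convert the target variable to an integer. The `VIP` and `CryoSleep` features get the same treatment. Note that `astype(bool)` maps missing values to `True`, so nulls in these two columns end up as 1.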

In [17]:
train_.Transported = train.Transported.astype(int)
train_.VIP = train_.VIP.astype(bool).astype(int)
train_.CryoSleep = train_.CryoSleep.astype(bool).astype(int)
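One detail of the conversion above is easy to miss: in an object column, `NaN` is truthy, so `astype(bool)` turns a missing `VIP` or `CryoSleep` value into `True`. The sketch below shows this on a made-up series and, as one possible alternative (an assumption, not what this notebook does), an explicit mapping that keeps the value missing.

```python
import numpy as np
import pandas as pd

# Object column with a missing entry, as read_csv produces for a
# True/False column that contains nulls (illustrative values).
vip = pd.Series([True, False, np.nan], dtype=object)

# The notebook's conversion: NaN -> True -> 1.
naive = vip.astype(bool).astype(int)

# Alternative sketch: map known values explicitly; NaN has no entry in
# the dict, so it stays NaN and the result is a float column.
kept_missing = vip.map({True: 1, False: 0})

print(naive.tolist())
print(kept_missing.tolist())
```

Whether keeping the value missing helps the model is an empirical question; the point here is only that the two conversions differ on null rows.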

Another important feature is the cabin. According to the dataset documentation, the first part of each string is the deck the passenger is on, followed by the cabin number and the side of the ship. This information may turn out to be useful, and since the parts are separated by a slash, we split them into separate columns. Only `Deck` and `Side` are kept; the cabin number is dropped along with the original column.

In [18]:
# Split "deck/num/side"; the cabin number overwrites 'Cabin' and is dropped right after.
train_[['Deck', 'Cabin', 'Side']] = train_.Cabin.str.split('/', expand=True)
train_ = train_.drop(columns=['Cabin'])
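To make the split concrete, here is a minimal sketch on a made-up `Cabin` series that includes a missing value. The column name `Num` and the numeric conversion are assumptions added for illustration; the notebook itself discards the cabin number.

```python
import numpy as np
import pandas as pd

# Illustrative cabin strings in the "deck/num/side" layout, plus a null.
cabins = pd.Series(['B/0/P', 'F/1/S', np.nan], name='Cabin')

# expand=True returns one column per part; a null row stays null everywhere.
parts = cabins.str.split('/', expand=True)
parts.columns = ['Deck', 'Num', 'Side']

# Optional (not done in the notebook): keep the cabin number as a number.
parts['Num'] = pd.to_numeric(parts['Num'])

print(parts)
```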

In [19]:
train_.head()

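| | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Deck | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | 0 | TRAPPIST-1e | 39.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | B | P |
| 1 | Earth | 0 | TRAPPIST-1e | 24.0 | 0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 | F | S |
| 2 | Europa | 0 | TRAPPIST-1e | 58.0 | 1 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 | A | S |
| 3 | Europa | 0 | TRAPPIST-1e | 33.0 | 0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 0 | A | S |
| 4 | Earth | 0 | TRAPPIST-1e | 16.0 | 0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 1 | F | S |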


### Building the model

To learn more about how the model works, see the [`RandomForestModel` documentation](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel).

In [20]:
import tensorflow_decision_forests as tfdf

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_, label='Transported')

rf = tfdf.keras.RandomForestModel(hyperparameter_template='benchmark_rank1')

Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpbrmpyiod as temporary training directory


In [21]:
rf.compile(metrics=['accuracy'])

In [22]:
rf.fit(x=train_ds)

Reading training dataset...
Training dataset read in 0:00:06.133710. Found 8693 examples.
Training model...
Model trained in 0:00:14.579127
Compiling model...
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code
Model compiled.


<keras.callbacks.History at 0x7f167eec0d30>

In [23]:
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)
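The model above is fitted on every training row, so the `accuracy` metric passed to `compile` is never measured on unseen data. A hold-out split is one common way to get such a number. The sketch below shows only the splitting step in plain pandas on a toy frame; the helper name `split_dataset`, the 20% ratio, and the seed are assumptions, and the TFDF calls are left as comments because they need the real data and the library installed.

```python
import numpy as np
import pandas as pd

def split_dataset(df, valid_ratio=0.2, seed=42):
    """Randomly assign roughly valid_ratio of the rows to a validation part."""
    rng = np.random.default_rng(seed)
    is_valid = rng.random(len(df)) < valid_ratio
    return df[~is_valid], df[is_valid]

# Toy stand-in for the preprocessed training frame (illustrative only).
toy = pd.DataFrame({'Age': np.arange(100.0),
                    'Transported': np.arange(100) % 2})
train_part, valid_part = split_dataset(toy)
print(len(train_part), len(valid_part))

# With the notebook's objects, the two parts would be used roughly as:
#   train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_part, label='Transported')
#   valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_part, label='Transported')
#   rf.fit(x=train_ds)
#   rf.evaluate(valid_ds, return_dict=True)
```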

### Creating a submission

In [24]:
id = test.PassengerId
test_ = test.drop(['PassengerId', 'Name'], axis=1)

In [25]:
num_cols = test_.select_dtypes(include=['int64', 'float64']).columns
test_[num_cols] = test_[num_cols].fillna(0)

In [26]:
test_.VIP = test_.VIP.astype(bool).astype(int)
test_.CryoSleep = test_.CryoSleep.astype(bool).astype(int)

In [27]:
# Same cabin split as for the training set; the cabin number is dropped.
test_[['Deck', 'Cabin', 'Side']] = test_.Cabin.str.split('/', expand=True)
test_ = test_.drop(columns=['Cabin'])

In [28]:
test_.head()

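| | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Deck | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Earth | 1 | TRAPPIST-1e | 27.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | G | S |
| 1 | Earth | 0 | TRAPPIST-1e | 19.0 | 0 | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | F | S |
| 2 | Europa | 1 | 55 Cancri e | 31.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | C | S |
| 3 | Europa | 0 | TRAPPIST-1e | 38.0 | 0 | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | C | S |
| 4 | Earth | 0 | TRAPPIST-1e | 20.0 | 0 | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | F | S |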
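The test-set preparation repeats the training-set steps cell by cell. One way to keep the two in sync is a single helper applied to both frames. The sketch below is a hypothetical refactoring (the name `preprocess` is not from the notebook) that reproduces the same steps, including the `astype(bool)` behaviour, and is run here on a two-row sample built from the rows shown earlier with one age blanked out.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's cleaning steps to a train or test frame."""
    out = df.drop(columns=['PassengerId', 'Name'])
    # Numeric nulls become 0; categorical nulls are left for TFDF.
    num_cols = out.select_dtypes(include=['int64', 'float64']).columns
    out[num_cols] = out[num_cols].fillna(0)
    # Booleans to integers (nulls become 1, as in the original cells).
    for col in ('VIP', 'CryoSleep'):
        out[col] = out[col].astype(bool).astype(int)
    if 'Transported' in out.columns:  # the test set has no label
        out['Transported'] = out['Transported'].astype(int)
    # Keep deck and side from "deck/num/side"; drop the cabin number.
    parts = out['Cabin'].str.split('/', expand=True)
    out['Deck'], out['Side'] = parts[0], parts[2]
    return out.drop(columns=['Cabin'])

sample = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01'],
    'HomePlanet': ['Europa', 'Earth'],
    'CryoSleep': [False, False],
    'Cabin': ['B/0/P', 'F/0/S'],
    'Age': [39.0, np.nan],  # second age blanked out for the example
    'VIP': [False, False],
    'Name': ['Maham Ofracculy', 'Juanna Vines'],
    'Transported': [False, True],
})
clean = preprocess(sample)
print(clean)
```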


In [29]:
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_)

In [30]:
predictions = rf.predict(test_ds)



In [31]:
predictions = (predictions > 0.5).astype(bool)

In [32]:
submission = pd.DataFrame({'PassengerId': id, 'Transported': predictions.squeeze()})
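The last two cells turn probabilities into the boolean labels the competition expects. The sketch below replays them on made-up numbers, assuming (as the `squeeze()` call above suggests) that `rf.predict` returns an array of shape `(n, 1)` holding the probability of the positive class. The probabilities and the variable name `ids` are invented for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for rf.predict(test_ds): one probability per row, shape (n, 1).
probs = np.array([[0.91], [0.12], [0.50]], dtype=np.float32)
ids = pd.Series(['0013_01', '0018_01', '0019_01'])

# The comparison already yields booleans; squeeze drops the extra axis.
# A probability of exactly 0.5 is classified as False by this rule.
labels = (probs > 0.5).squeeze(axis=1)

submission = pd.DataFrame({'PassengerId': ids, 'Transported': labels})
print(submission)
```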

In [33]:
submission.head()

| | PassengerId | Transported |
|---|---|---|
| 0 | 0013_01 | True |
| 1 | 0018_01 | False |
| 2 | 0019_01 | True |
| 3 | 0021_01 | True |
| 4 | 0023_01 | True |


In [35]:
submission.to_csv(os.path.join('submission', 'submission3.csv'), index=False)
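`to_csv` does not create missing folders, so the cell above fails if the `submission` directory does not exist yet. A minimal sketch of a safer write follows. It uses a temporary directory and a made-up two-row frame so that it can be run anywhere; in the notebook, the target would simply be `'submission'`.

```python
import os
import tempfile

import pandas as pd

# Illustrative frame in the submission layout.
submission = pd.DataFrame({'PassengerId': ['0013_01', '0018_01'],
                           'Transported': [True, False]})

# Create the output folder first; exist_ok avoids an error on re-runs.
out_dir = os.path.join(tempfile.mkdtemp(), 'submission')
os.makedirs(out_dir, exist_ok=True)

path = os.path.join(out_dir, 'submission3.csv')
submission.to_csv(path, index=False)

# Read the file back to confirm the layout before uploading.
check = pd.read_csv(path)
print(check)
```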