
# Spaceship Titanic with TensorFlow

We now use the decision-tree models from the TensorFlow Decision Forests (TFDF) library to predict whether a passenger of the Spaceship Titanic was transported.

In [25]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
import logging
import os
import numpy as np
import tensorflow_decision_forests as tfdf
warnings.filterwarnings('ignore')
logging.disable(sys.maxsize)

In [26]:
train = pd.read_csv(os.path.join('data', 'train.csv'))
test = pd.read_csv(os.path.join('data', 'test.csv'))

In [27]:
train.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


### Preparing the data

In [28]:
train_ = train.drop(['PassengerId', 'Name'], axis=1)

In this dataset, we fill missing numeric values with 0. Missing categorical values are left for the TFDF library to handle.

In [29]:
num_cols = train_.select_dtypes(include=['int64', 'float64']).columns
train_[num_cols] = train_[num_cols].fillna(0)
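
As a small illustration of the step above, the following sketch applies the same `select_dtypes` and `fillna` calls to a toy DataFrame. The values are made up for the example. It shows that only the numeric columns are filled, while the missing categorical value is left as it is.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Age': [39.0, np.nan, 58.0],
    'Spa': [0.0, 549.0, np.nan],
    'HomePlanet': ['Europa', None, 'Earth'],
})

# Select the numeric columns and fill their missing values with 0.
num_cols = toy.select_dtypes(include=['int64', 'float64']).columns
toy[num_cols] = toy[num_cols].fillna(0)

# Count the missing values that remain in each column.
print(toy.isna().sum())
```

Filling with 0 is a simple choice; for a column such as `Age` it places missing passengers at an age of zero, which the trees may treat as a group of its own.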

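The TFDF library cannot handle booleans, so we convert the response variable, along with the boolean features `VIP` and `CryoSleep`, to integers. Note that `astype(bool)` maps missing values to `True`, so passengers with a missing `VIP` or `CryoSleep` value end up encoded as 1.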

In [30]:
train_.Transported = train.Transported.astype(int)
train_.VIP = train_.VIP.astype(bool).astype(int)
train_.CryoSleep = train_.CryoSleep.astype(bool).astype(int)
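
One detail of this conversion is worth checking: in an object column, `astype(bool)` turns a missing value into `True`, because `bool(float('nan'))` is `True`. The sketch below shows this on a made-up column, together with one possible alternative that encodes missing values as 0. Treating a missing value as `False` is an assumption made for illustration, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value, stored as object dtype
# (the dtype pandas uses for a True/False column that contains NaN).
vip = pd.Series([True, False, np.nan], dtype=object)

# Same conversion as in the cell above: NaN is truthy, so it becomes 1.
as_in_notebook = vip.astype(bool).astype(int)

# Alternative (assumption): only an explicit True counts as 1.
alternative = vip.eq(True).astype(int)

print(as_in_notebook.tolist())
print(alternative.tolist())
```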

Another important factor is the cabin identifier. According to the dataset documentation, the first part of the string in each observation is the deck the passenger is on, followed by the cabin number and the side of the ship. This information may prove useful, and since the parts are separated by a slash, we split them apart. The cell below keeps the deck and the side as new columns and discards the cabin number.

In [31]:
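# Split 'deck/num/side' into three columns; the cabin number overwrites Cabin.
train_[['Deck', 'Cabin', 'Side']] = train_.Cabin.str.split('/', expand=True)
# Drop the cabin number, keeping only Deck and Side.
train_ = train_.drop(columns=['Cabin'])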
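
The cabin split can also be wrapped in a small helper, so that the same logic is applied to the training and test sets. This is a sketch rather than the notebook's code: `split_cabin` and `CabinNum` are hypothetical names, and unlike the cell above it keeps the cabin number as a numeric column instead of discarding it.

```python
import numpy as np
import pandas as pd


def split_cabin(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Cabin ('deck/num/side') split into three columns."""
    out = df.copy()
    parts = out['Cabin'].str.split('/', expand=True)
    out['Deck'] = parts[0]
    out['CabinNum'] = pd.to_numeric(parts[1])  # NaN where Cabin is missing
    out['Side'] = parts[2]
    return out.drop(columns=['Cabin'])


# Toy data: two cabins in the dataset's format and one missing value.
toy = pd.DataFrame({'Cabin': ['B/0/P', 'F/1/S', np.nan]})
result = split_cabin(toy)
print(result)
```

Working on a copy leaves the input DataFrame untouched, so the helper can be re-run without the original `Cabin` column disappearing.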

In [32]:
train_.head()

,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Side
0,Europa,0,TRAPPIST-1e,39.0,0,0.0,0.0,0.0,0.0,0.0,0,B,P
1,Earth,0,TRAPPIST-1e,24.0,0,109.0,9.0,25.0,549.0,44.0,1,F,S
2,Europa,0,TRAPPIST-1e,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0,A,S
3,Europa,0,TRAPPIST-1e,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0,A,S
4,Earth,0,TRAPPIST-1e,16.0,0,303.0,70.0,151.0,565.0,2.0,1,F,S


### Building the model

To learn more about how the model works, see the documentation, available [here](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel).

In [33]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_, label='Transported')

rf = tfdf.keras.RandomForestModel(hyperparameter_template='benchmark_rank1')

Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'winner_take_all': True, 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmpx3au3ixp as temporary training directory


In [34]:
rf.compile(metrics=['accuracy'])

In [35]:
rf.fit(x=train_ds)

Reading training dataset...


2023-04-26 20:19:07.141902: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype double and shape [8693]
	 [[{{node Placeholder/_0}}]]


Training dataset read in 0:00:00.348708. Found 8693 examples.
Training model...


[INFO 23-04-26 20:19:11.8442 -03 kernel.cc:1242] Loading model from path /tmp/tmpx3au3ixp/model/ with prefix 02827165eafc446a


Model trained in 0:00:05.402594
Compiling model...


[INFO 23-04-26 20:19:12.8253 -03 decision_forest.cc:660] Model loaded with 300 root(s), 258566 node(s), and 12 input feature(s).
[INFO 23-04-26 20:19:12.8253 -03 abstract_model.cc:1311] Engine "RandomForestGeneric" built
[INFO 23-04-26 20:19:12.8254 -03 kernel.cc:1074] Use fast generic engine
2023-04-26 20:19:12.965018: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_7' with dtype double and shape [8693]
	 [[{{node Placeholder/_7}}]]


Model compiled.


<keras.callbacks.History at 0x7f1e141341f0>

In [36]:
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)

### Creating a submission

In [37]:
id = test.PassengerId
test_ = test.drop(['PassengerId', 'Name'], axis=1)

In [38]:
num_cols = test_.select_dtypes(include=['int64', 'float64']).columns
test_[num_cols] = test_[num_cols].fillna(0)

In [39]:
test_.VIP = test_.VIP.astype(bool).astype(int)
test_.CryoSleep = test_.CryoSleep.astype(bool).astype(int)

In [40]:
# Same cabin split as for the training set: keep Deck and Side, drop the number.
test_[['Deck', 'Cabin', 'Side']] = test_.Cabin.str.split('/', expand=True)
test_ = test_.drop(columns=['Cabin'])

In [41]:
test_.head()

,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Deck,Side
0,Earth,1,TRAPPIST-1e,27.0,0,0.0,0.0,0.0,0.0,0.0,G,S
1,Earth,0,TRAPPIST-1e,19.0,0,0.0,9.0,0.0,2823.0,0.0,F,S
2,Europa,1,55 Cancri e,31.0,0,0.0,0.0,0.0,0.0,0.0,C,S
3,Europa,0,TRAPPIST-1e,38.0,0,0.0,6652.0,0.0,181.0,585.0,C,S
4,Earth,0,TRAPPIST-1e,20.0,0,10.0,0.0,635.0,0.0,0.0,F,S


In [42]:
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_)

In [43]:
predictions = rf.predict(test_ds)

1/5 [=====>........................] - ETA: 0s

2023-04-26 20:19:13.446424: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_10' with dtype int64 and shape [4277]
	 [[{{node Placeholder/_10}}]]




In [44]:
predictions = (predictions > 0.5).astype(bool)

In [45]:
submission = pd.DataFrame({'PassengerId': id, 'Transported': predictions.squeeze()})
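
The thresholding and `squeeze()` steps above can be illustrated without a trained model. In the sketch below, a made-up NumPy array of shape `(n, 1)` stands in for the output of `rf.predict(test_ds)`; that shape is an assumption inferred from the `squeeze()` call in the notebook, and the probabilities are invented.

```python
import numpy as np
import pandas as pd

# Stand-in for rf.predict(test_ds): invented probabilities, shape (3, 1).
probs = np.array([[0.81], [0.12], [0.50]])
# The first three test IDs shown in the submission preview.
ids = pd.Series(['0013_01', '0018_01', '0019_01'])

# Strictly greater than 0.5, so a probability of exactly 0.5 maps to False.
# squeeze() flattens (3, 1) to (3,), as DataFrame columns must be one-dimensional.
labels = (probs > 0.5).squeeze()

submission = pd.DataFrame({'PassengerId': ids, 'Transported': labels})
print(submission)
```

The comparison already yields a boolean array, so the extra `astype(bool)` in the notebook cell does not change the result.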

In [46]:
submission.head()

,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True


In [47]:
# submission.to_csv(os.path.join('submission', 'submission3.csv'), index=False)

---

### Using Gradient Boosted Trees

In [48]:
gbt = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template='benchmark_rank1')

Resolve hyper-parameter template "benchmark_rank1" to "benchmark_rank1@v1" -> {'growing_strategy': 'BEST_FIRST_GLOBAL', 'categorical_algorithm': 'RANDOM', 'split_axis': 'SPARSE_OBLIQUE', 'sparse_oblique_normalization': 'MIN_MAX', 'sparse_oblique_num_projections_exponent': 1.0}.
Use /tmp/tmphihvda2f as temporary training directory




In [49]:
gbt.compile(metrics=['accuracy'])

In [50]:
gbt.fit(x=train_ds)

Reading training dataset...
Training dataset read in 0:00:00.263965. Found 8693 examples.
Training model...
Model trained in 0:00:03.674321
Compiling model...
Model compiled.


[INFO 23-04-26 20:19:38.0589 -03 kernel.cc:1242] Loading model from path /tmp/tmphihvda2f/model/ with prefix cd236a782a5c4cf0
[INFO 23-04-26 20:19:38.0861 -03 decision_forest.cc:660] Model loaded with 93 root(s), 5465 node(s), and 12 input feature(s).
[INFO 23-04-26 20:19:38.0863 -03 abstract_model.cc:1311] Engine "GradientBoostedTreesGeneric" built
[INFO 23-04-26 20:19:38.0865 -03 kernel.cc:1074] Use fast generic engine
2023-04-26 20:19:38.097442: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_6' with dtype double and shape [8693]
	 [[{{node Placeholder/_6}}]]


<keras.callbacks.History at 0x7f1e1424c9d0>

In [51]:
tfdf.model_plotter.plot_model_in_colab(gbt, tree_idx=0, max_depth=3)

In [52]:
predictions = gbt.predict(test_ds)



In [53]:
predictions = (predictions > 0.5).astype(bool)

In [54]:
submission = pd.DataFrame({'PassengerId': id, 'Transported': predictions.squeeze()})

In [55]:
submission.head()

,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True


In [56]:
submission.to_csv(os.path.join('submission', 'submission4.csv'), index=False)