### Description

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

![Spaceship Titanic](https://storage.googleapis.com/kaggle-media/competitions/Spaceship%20Titanic/joel-filipe-QwoNAhbmLLo-unsplash.jpg)

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

## Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Load data

In [None]:
train_df = pd.read_csv('../input/spaceship-titanic/train.csv')
train_df.head()

In [None]:
test_df = pd.read_csv('../input/spaceship-titanic/test.csv')
test_df.head()

In [None]:
sub_df = pd.read_csv('../input/spaceship-titanic/sample_submission.csv')
sub_df.head()

## EDA

This notebook is based on the Titanic competition notebook "EDA To Prediction(DieTanic)". \
https://www.kaggle.com/code/ash316/eda-to-prediction-dietanic

In [None]:
train_df.isnull().sum() #checking for total null values

In [None]:
train_df['CryoSleep'] = train_df['CryoSleep'].astype('float')
train_df['VIP'] = train_df['VIP'].astype('float')    # change dtype from object to float

test_df['CryoSleep'] = test_df['CryoSleep'].astype('float')
test_df['VIP'] = test_df['VIP'].astype('float')    # change dtype from object to float
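
The cast above relies on how pandas converts an object column holding `True`, `False` and missing values: booleans become `1.0`/`0.0` and missing entries stay `NaN`. A minimal sketch on a made-up column (the `toy` series is an assumption for illustration, not competition data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a column like CryoSleep: booleans plus one missing value,
# stored with object dtype as read_csv would produce for such a column.
toy = pd.Series([True, False, np.nan, True], dtype="object")

# Casting to float maps True -> 1.0, False -> 0.0 and keeps NaN as NaN.
as_float = toy.astype("float")
print(as_float.tolist())
print(as_float.dtype)
```

Keeping the missing values as `NaN` instead of filling them is a choice that suits tree models such as LightGBM, which can handle missing inputs natively.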

### How many transported?

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
train_df['Transported'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Transported')
ax[0].set_ylabel('')
sns.countplot(x='Transported',data=train_df,ax=ax[1])
ax[1].set_title('Transported')
plt.show()

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
train_df[['HomePlanet','Transported']].groupby(['HomePlanet']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Transported vs HomePlanet')
sns.countplot(x='HomePlanet',hue='Transported',data=train_df,ax=ax[1])
ax[1].set_title('HomePlanet:Transported vs not transported')
plt.show()

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
train_df[['Destination','Transported']].groupby(['Destination']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Transported vs Destination')
sns.countplot(x='Destination',hue='Transported',data=train_df,ax=ax[1])
ax[1].set_title('Destination:Transported vs not transported')
plt.show()

### Correlation between features

In [None]:
train_df_corr = \
train_df[['CryoSleep', 'Age', 'VIP', 'RoomService','FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Transported']].corr()
sns.heatmap(train_df_corr,annot=True,cmap='RdYlGn',linewidths=0.2) #train_df_corr-->correlation matrix
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()

## Train model

In [None]:
import warnings
warnings.filterwarnings('ignore')
import lightgbm as lgb # LightGBM
from sklearn.model_selection import train_test_split # dataset splitting
from sklearn.metrics import accuracy_score # model evaluation (accuracy)
from sklearn.metrics import log_loss # model evaluation (log loss)
from sklearn.metrics import roc_auc_score # model evaluation (AUC)

# Helper that renders DataFrames neatly (first 10 rows unless head=False)
from IPython.display import display as ipy_display
def display(*dfs, head=True):
    for df in dfs:
        ipy_display(df.head(10) if head else df)

In [None]:
train_df = pd.get_dummies(train_df, columns=['HomePlanet','Destination'])
test_df = pd.get_dummies(test_df, columns=['HomePlanet','Destination'])
train_df
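
Encoding the train and test frames separately only works when both contain the same categories. If a category were missing from one of them, the column sets would differ and the model would see mismatched inputs. A hedged sketch of one way to guard against that, using made-up frames (`toy_train` and `toy_test` are assumptions for illustration):

```python
import pandas as pd

# Toy frames: the test set happens to lack the "Mars" and "Europa" categories.
toy_train = pd.DataFrame({"HomePlanet": ["Earth", "Mars", "Europa"],
                          "Age": [30.0, 41.0, 19.0]})
toy_test = pd.DataFrame({"HomePlanet": ["Earth", "Earth"],
                         "Age": [25.0, 52.0]})

train_enc = pd.get_dummies(toy_train, columns=["HomePlanet"])
test_enc = pd.get_dummies(toy_test, columns=["HomePlanet"])

# Reindex the test frame to the training columns; categories absent
# from the test data become all-zero columns.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

In the real data the target column exists only in the training frame, so it would need to be excluded from the column list before reindexing.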

In [None]:
train_df = train_df.drop(['PassengerId', 'Cabin', 'Name'], axis=1)
test_df = test_df.drop(['PassengerId', 'Cabin', 'Name'], axis=1)

In [None]:
# check data
print(train_df.shape) # data size (number of rows, number of columns)
display(train_df) # same as df.head(10); use display() when the call is not the last line of a cell

# features and target
X = train_df.drop(['Transported'],axis=1).values # features (every column except the target)
y = train_df['Transported'].values # target

# split into training data and hold-out (validation) data
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state=2)
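
The split above is purely random. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on made-up labels (`toy_X` and `toy_y` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 samples, 20 of them positive (40%).
toy_X = np.arange(100, dtype=float).reshape(50, 2)
toy_y = np.array([1] * 20 + [0] * 30)

# stratify=toy_y keeps the 40% positive share in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=2, stratify=toy_y
)
print(y_tr.mean(), y_te.mean())
```

Since the two classes here are close to balanced, a plain random split is usually adequate; stratification matters more when one class is rare.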

### Train LGBM model

In [None]:
# train the model
model = lgb.LGBMClassifier() # create the model instance
model.fit(X_train, y_train) # fit on the training split

# predicted class for each hold-out row
y_pred = model.predict(X_test)
# predicted class probabilities for each hold-out row: [P(class 0), P(class 1)]
y_pred_prob = model.predict_proba(X_test)

### Check predictions

In [None]:
# show true labels next to predicted labels
df_pred = pd.DataFrame({'target':y_test,'target_pred':y_pred})
display(df_pred)

# show true labels next to predicted probabilities
df_pred_prob = pd.DataFrame({'target':y_test, 'target0_prob':y_pred_prob[:,0], 'target1_prob':y_pred_prob[:,1]})
display(df_pred_prob)

### Evaluate model

In [None]:
# model evaluation
# acc : accuracy
acc = accuracy_score(y_test,y_pred)
print('Acc :', acc)

# logloss
logloss =  log_loss(y_test,y_pred_prob) # arguments: log_loss(true labels, [P(class 0), P(class 1)])
print('logloss :', logloss)

# AUC
auc = roc_auc_score(y_test,y_pred_prob[:,1]) # arguments: roc_auc_score(true labels, P(class 1))
print('AUC :', auc)
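
To make the three metrics concrete, the sketch below recomputes accuracy and log loss by hand on four made-up predictions and compares them with scikit-learn. The labels and probabilities are assumptions for illustration, not model output:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Toy true labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 0])
p1 = np.array([0.9, 0.2, 0.6, 0.4])
proba = np.column_stack([1 - p1, p1])  # same layout as predict_proba

# Accuracy: share of rows whose thresholded prediction matches the label.
y_hat = (p1 >= 0.5).astype(int)
manual_acc = np.mean(y_hat == y_true)

# Log loss: mean negative log-probability assigned to the true class.
manual_ll = -np.mean(y_true * np.log(p1) + (1 - y_true) * np.log(1 - p1))

print(manual_acc, accuracy_score(y_true, y_hat))
print(manual_ll, log_loss(y_true, proba))
# AUC depends only on the ranking of the scores, not on a threshold.
print(roc_auc_score(y_true, p1))
```

Accuracy is the metric used on the competition leaderboard, while log loss and AUC are useful here because they also reflect how confident the probabilities are.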

### ROC curve

In [None]:
# plot the ROC curve
# cf : https://tips-memo.com/python-roc
from sklearn import metrics
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test,y_pred_prob[:,1])
auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc)
plt.legend()
plt.xlabel('FPR: False positive rate')
plt.ylabel('TPR: True positive rate')
plt.grid()
plt.show()

In [None]:
y_prediction = model.predict(test_df.values)
y_prediction

## Submission

In [None]:
sub_df['Transported'] = y_prediction
sub_df

In [None]:
sub_df.to_csv('submission.csv', index=False)
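
Before uploading, it can help to confirm what the written file looks like. The sketch below writes a made-up two-row submission to an in-memory buffer; the passenger IDs and predictions are assumptions for illustration, and the `astype(bool)` cast is only a guard in case predictions arrive as 0/1 integers:

```python
import io
import numpy as np
import pandas as pd

# Toy submission frame with the same two columns as the sample submission.
toy_sub = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                        "Transported": [False, False]})

# Hypothetical predictions given as integers; cast them to booleans.
toy_pred = np.array([1, 0])
toy_sub["Transported"] = toy_pred.astype(bool)

# Write to a buffer instead of disk so the result can be inspected.
buf = io.StringIO()
toy_sub.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```

Passing `index=False`, as the notebook does, keeps the row index out of the file so only the two expected columns are written.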

- If you find this notebook useful, please upvote it!!