# Use PyCaret

#### Forked from https://www.kaggle.com/drcapa/spaceship-titanic-starter?scriptVersionId=88595626

# Intro
Welcome to the [Spaceship Titanic](https://www.kaggle.com/c/spaceship-titanic/overview) competition.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/34377/logos/header.png)

<font size="4"><span style="color: royalblue;">Please vote the notebook up if it helps you. Feel free to leave a comment above the notebook. Thank you. </span></font>

# Libraries

In [None]:
!tar -zxf ../input/pycaret-v235/wheelhouse.tar.gzy
!pip install -q -r wheelhouse/requirements.txt --no-index --find-links wheelhouse
!rm -r wheelhouse

In [None]:
import pycaret
pycaret.__version__

In [None]:
import os
import shutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Path

In [None]:
path = "/kaggle/input/spaceship-titanic/"
os.listdir(path)

# Load Data

In [None]:
train_data = pd.read_csv(path+'train.csv')
test_data = pd.read_csv(path+'test.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

# Overview
In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
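
The `gggg_pp` form of PassengerId described above can be taken apart with a plain string split. This is a minimal sketch, not part of the original pipeline (which drops PassengerId later on); the helper name `split_passenger_id` and the example Ids are made up for illustration.

```python
from typing import Tuple


def split_passenger_id(passenger_id: str) -> Tuple[str, int]:
    """Split an Id of the form gggg_pp into (group, number within group)."""
    group, number = passenger_id.split("_")
    return group, int(number)


# Hypothetical Ids, made up for illustration.
example_ids = ["0001_01", "0003_01", "0003_02"]
parsed = [split_passenger_id(pid) for pid in example_ids]
print(parsed)
```

On the real data the same helper could be applied with `train_data['PassengerId'].apply(split_passenger_id)`.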

In [None]:
print('Number train samples:', len(train_data.index))
train_data.head()

**test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

In [None]:
print('Number test samples:', len(test_data.index))
test_data.head()

**sample_submission.csv** - A submission file in the correct format.
* PassengerId - Id for each passenger in the test set.
* Transported - The target. For each passenger, predict either True or False.
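
The format described above can be checked before writing the file. The sketch below is an assumption about what a reasonable check looks like, not an official validator; `check_submission` and the toy Ids are hypothetical.

```python
import pandas as pd


def check_submission(subm: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise AssertionError if the frame does not match the described format."""
    # Exactly the two columns from sample_submission.csv, in that order.
    assert list(subm.columns) == ["PassengerId", "Transported"]
    # One row per test passenger, in the same order.
    assert subm["PassengerId"].tolist() == expected_ids.tolist()
    # Predictions must be real booleans, not strings such as "True".
    assert subm["Transported"].dtype == bool


# Toy submission with made-up Ids.
toy_subm = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                         "Transported": [False, True]})
check_submission(toy_subm, toy_subm["PassengerId"])
```

The dtype check is there because some prediction outputs carry labels as strings, which would need converting to booleans first.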

In [None]:
samp_subm.head()

# Exploratory Data Analysis

In [None]:
train_data['Transported'].value_counts()

# Prepare Data

## Handle Missing Values

In [None]:
cols_with_missing_train = [col for col in train_data.columns if train_data[col].isnull().any()]
cols_with_missing_test = [col for col in test_data.columns if test_data[col].isnull().any()]
print('train columns with missing data:', cols_with_missing_train)
print('test columns with missing data:', cols_with_missing_test)

We fill missing values in both sets with each feature's most frequent value in the training data:

In [None]:
for col in cols_with_missing_train:
    most_freq = train_data[col].value_counts().index[0]
    train_data[col] = train_data[col].fillna(most_freq)
    test_data[col] = test_data[col].fillna(most_freq)
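
To make the behaviour of the loop above concrete, here is the same logic on two toy frames. The values are made up; only the column names mirror the competition data. Note that the test frame is filled with the mode computed on the training frame, and that numeric columns such as Age are treated the same way as categorical ones.

```python
import numpy as np
import pandas as pd

# Toy frames with made-up values.
toy_train = pd.DataFrame({"HomePlanet": ["Earth", "Earth", "Mars", np.nan],
                          "Age": [30.0, np.nan, 30.0, 41.0]})
toy_test = pd.DataFrame({"HomePlanet": [np.nan, "Europa"],
                         "Age": [np.nan, 22.0]})

for col in toy_train.columns:
    # value_counts() ignores NaN and sorts by frequency, so index[0] is the mode.
    most_freq = toy_train[col].value_counts().index[0]
    toy_train[col] = toy_train[col].fillna(most_freq)
    toy_test[col] = toy_test[col].fillna(most_freq)

print(toy_train)
print(toy_test)
```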

## Split Cabin
We extract the features deck, num and side from the cabin:


In [None]:
def extract_deck(s):
    return s.split('/')[0]

def extract_num(s):
    return s.split('/')[1]

def extract_side(s):
    return s.split('/')[2]

train_data['Deck'] = train_data['Cabin'].apply(extract_deck)
train_data['Num'] = train_data['Cabin'].apply(extract_num)
train_data['Side'] = train_data['Cabin'].apply(extract_side)

test_data['Deck'] = test_data['Cabin'].apply(extract_deck)
test_data['Num'] = test_data['Cabin'].apply(extract_num)
test_data['Side'] = test_data['Cabin'].apply(extract_side)
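
As an alternative to the three helper functions, the same split can be done in one vectorized step with pandas string methods. This is a sketch on made-up cabin values, assuming every cabin has the full `deck/num/side` form (which holds here because missing values were filled beforehand).

```python
import pandas as pd

# Made-up cabins in the deck/num/side form.
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", "G/734/S"]})

# expand=True returns one column per part, labelled 0, 1, 2.
parts = toy["Cabin"].str.split("/", expand=True)
toy["Deck"] = parts[0]
toy["Num"] = parts[1].astype(int)
toy["Side"] = parts[2]
print(toy)
```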

## Encode
We encode the categorical data:

In [None]:
data = pd.concat([train_data[test_data.columns], test_data])
features_cat = ['HomePlanet', 'Destination', 'Deck', 'Side']
for feature in features_cat:
    data[pd.get_dummies(data[feature], prefix=feature).columns] = pd.get_dummies(data[feature], prefix=feature)
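
One likely reason for concatenating train and test before encoding is that both sets then get the same dummy columns. The toy sketch below (made-up values) shows what can go wrong otherwise: a category that is absent from the test set yields no column when the test set is encoded on its own.

```python
import pandas as pd

toy_train = pd.DataFrame({"Side": ["P", "S"]})
toy_test = pd.DataFrame({"Side": ["S", "S"]})  # "P" never occurs here

# Encoding the test set alone misses the Side_P column.
separate = pd.get_dummies(toy_test["Side"], prefix="Side")

# Encoding the concatenated frame keeps the columns consistent.
combined = pd.concat([toy_train, toy_test])
joint = pd.get_dummies(combined["Side"], prefix="Side")
joint_test = joint[len(toy_train):]

print(separate.columns.tolist())
print(joint_test.columns.tolist())
```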

We drop the features that we do not want to use:

In [None]:
features_drop = features_cat+['Name', 'PassengerId', 'Cabin']
data.drop(features_drop, axis=1, inplace=True)

We cast the feature Num to integer:

In [None]:
data['Num'] = data['Num'].astype('int')

In [None]:
X = data[:len(train_data)]
y = train_data['Transported']
train_df = pd.concat([X,y],axis=1)
test_df = data[len(train_data):]

In [None]:
train_df.head()

In [None]:
test_df.info()

## Pycaret Setup

In [None]:
from pycaret.classification import *
target = 'Transported'
clf1 = setup(train_df, target = target, session_id=42, log_experiment=True, experiment_name='spaceship1',
                  normalize = True, 
                  transformation = True, 
                  ignore_low_variance = True, silent=True)

## Pycaret Compare Model


In [None]:
top_model = compare_models(fold=5, n_select=5, exclude=['svm','ridge'])

In [None]:
results = pull()
model_names = results.index[:5]
print(model_names)

In [None]:
tuned_models = []
for name in model_names:
    model = create_model(name)
    # tune_model returns a new estimator; keep it instead of the untuned model
    tuned = tune_model(model, fold=5, n_iter=50)
    tuned_models.append(tuned)

In [None]:
evaluate_model(tuned_models[0])

In [None]:
f_model = stack_models(tuned_models)

## Predict Test Data

In [None]:
unseen_predictions = predict_model(f_model, data=test_df)

# Export

In [None]:
# unseen_predictions

In [None]:
samp_subm['Transported'] = unseen_predictions['Label']
samp_subm['Transported'].value_counts()

In [None]:
samp_subm

In [None]:
samp_subm.to_csv('submission.csv', index=False)

In [None]:
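# Remove logging artifacts that otherwise cause a Kaggle submission error
shutil.rmtree('catboost_info', ignore_errors=True)
shutil.rmtree('mlruns', ignore_errors=True)
if os.path.exists('logs.log'):
    os.remove('logs.log')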