# 🌌 PyCaret, Spaceship Starter Model

Hello! This is a simple starter model. **Stay tuned for more updates...**

### File and Data Field Descriptions

**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
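
The `gggg_pp` and `deck/num/side` formats described above can be unpacked with plain string splitting. Here is a minimal sketch; the helper names `parse_passenger_id` and `parse_cabin` and the sample values are made up for illustration and are not part of the dataset or the notebook.

```python
def parse_passenger_id(passenger_id):
    """Split an Id of the form 'gggg_pp' into (group, number within group)."""
    group, number = passenger_id.split('_')
    return group, int(number)


def parse_cabin(cabin):
    """Split a cabin of the form 'deck/num/side' into (deck, num, side)."""
    deck, num, side = cabin.split('/')
    return deck, int(num), side


# Illustrative values only, not rows taken from train.csv
print(parse_passenger_id('0003_02'))  # ('0003', 2)
print(parse_cabin('F/12/S'))          # ('F', 12, 'S')
```

The group part is kept as a string so that leading zeros survive, which matters if it is later used as a grouping key.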

**test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

**sample_submission.csv** - A submission file in the correct format.

* PassengerId - Id for each passenger in the test set.
* Transported - The target. For each passenger, predict either True or False.

# Installing Some Libraries...

In [None]:
%%capture
!pip install pycaret

# Loading Libraries...

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Setting Notebook Parameters...

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Number of rows we want to load into the Model...
DATA_ROWS = None
# Dataframe display, the number of rows and columns to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to use only 2 decimal places, so tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

# Loading Information from CSV...

In [None]:
%%time
trn_data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
tst_data = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

sub = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')

# Exploring the Information Available...

In [None]:
%%time
trn_data.info()

In [None]:
%%time
trn_data.head()

In [None]:
%%time
trn_data.describe()

In [None]:
%%time
def describe_categ(df):
    for col in df.columns:
        unique_samples = list(df[col].unique())
        unique_values = df[col].nunique()

        print(f' {col}: {unique_values} Unique Values,  Data Sample >> {unique_samples[:5]}')
    print(' ...')
    return None

In [None]:
%%time
describe_categ(trn_data)

In [None]:
%%time
describe_categ(tst_data)

In [None]:
%%time
trn_data.isnull().sum()

In [None]:
%%time
tst_data.head()

In [None]:
%%time
tst_data.isnull().sum()

In [None]:
%%time
sub.sample(10)

# Exploring the Target Variable...

In [None]:
%%time
def analyse_categ_target(df, target = 'Transported'):
    
    transported = df[df[target] == True].shape[0]
    not_transported = df[df[target] == False].shape[0]
    total = transported + not_transported
    
    print(f'Transported     : {transported / total:.2%}')
    print(f'Not Transported : {not_transported / total:.2%}')
    print(f'Total Passengers: {total}')
    print('...')

In [None]:
%%time
analyse_categ_target(trn_data)
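
As a quick cross-check on the class balance, the mean of a boolean column is the share of `True` values. A minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the competition data):

```python
import pandas as pd

# Toy target column: three transported passengers out of four
toy = pd.DataFrame({'Transported': [True, False, True, True]})

# The mean of a boolean Series is the fraction of True values
share_transported = toy['Transported'].mean()

print(f'Transported     : {share_transported:.2%}')      # 75.00%
print(f'Not Transported : {1 - share_transported:.2%}')  # 25.00%
```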

In [None]:
%%time
trn_passenger_ids = set(trn_data['PassengerId'].unique())
tst_passenger_ids = set(tst_data['PassengerId'].unique())
intersection = trn_passenger_ids.intersection(tst_passenger_ids)
print('Overlapped Passengers:', len(intersection))

# Feature Engineering...

In [None]:
trn_data.isnull().sum()

In [None]:
tst_data.isnull().sum()

In [None]:
%%time
def fill_missing(df):
    '''
    Fill NaN values or missing data with the mean (numeric columns) or the most common value (all other columns)...
    
    '''
    
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric_tmp = df.select_dtypes(include = numerics)
    categ_tmp = df.select_dtypes(exclude = numerics)

    for col in numeric_tmp.columns:
        print(col)
        df[col] = df[col].fillna(value = df[col].mean())
        
    for col in categ_tmp.columns:
        print(col)
        df[col] = df[col].fillna(value = df[col].mode()[0])
        
    print('...')
    
    return df

In [None]:
%%time
trn_data =  fill_missing(trn_data)
tst_data =  fill_missing(tst_data)
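
To see what the mean and most-common-value fills do, here is a small sketch on a toy frame. The `toy` data is invented for illustration; only the `fillna` pattern mirrors `fill_missing` above.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical gap
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'HomePlanet': ['Earth', 'Earth', np.nan],
})

# Numeric column: fill with the column mean (30.0 here)
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: fill with the most frequent value ('Earth' here)
toy['HomePlanet'] = toy['HomePlanet'].fillna(toy['HomePlanet'].mode()[0])

print(toy)
```

Note that in this notebook the fill values are computed separately for the train and test sets, so the two sets can end up with slightly different means and modes.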

In [None]:
%%time
def total_billed(df):
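    '''
    Calculates the total amount billed to the passenger during the trip...
    Args:
        df: DataFrame with the RoomService, FoodCourt, ShoppingMall, Spa and VRDeck columns.
    Returns:
        The same DataFrame with a new Total_Billed column.
    
    '''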
    
    df['Total_Billed'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
    return df

In [None]:
%%time
trn_data = total_billed(trn_data)
tst_data = total_billed(tst_data)

In [None]:
%%time
def name_ext(df):
    '''
    Split the Name of the passenger into First and Family...
    
    '''
    
    # Split once and reuse the result for both columns
    name_parts = df['Name'].str.split(' ', expand=True)
    df['FirstName'] = name_parts[0]
    df['FamilyName'] = name_parts[1]
    df.drop(columns = ['Name'], inplace = True)
    return df

In [None]:
%%time
trn_data = name_ext(trn_data)
tst_data = name_ext(tst_data)

In [None]:
%%time
trn_relatives = trn_data.groupby('FamilyName')['PassengerId'].count().reset_index()
tst_relatives = tst_data.groupby('FamilyName')['PassengerId'].count().reset_index()

In [None]:
%%time
trn_relatives = trn_relatives.rename(columns = {'PassengerId': 'NumRelatives'})
tst_relatives = tst_relatives.rename(columns = {'PassengerId': 'NumRelatives'})

In [None]:
%%time
trn_data = trn_data.merge(trn_relatives, how = 'left', on = ['FamilyName'])
tst_data = tst_data.merge(tst_relatives, how = 'left', on = ['FamilyName'])
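
The count, rename and merge steps above can also be written as a single `groupby(...).transform('count')`. This is only an equivalent sketch on invented toy data, not a change to the notebook's pipeline.

```python
import pandas as pd

# Toy passengers; Ids and family names are made up for illustration
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01', '0002_02', '0003_01'],
    'FamilyName': ['Ofracculy', 'Vines', 'Vines', 'Susent'],
})

# transform('count') returns one value per row, aligned with the original index,
# so no separate rename and merge are needed
toy['NumRelatives'] = toy.groupby('FamilyName')['PassengerId'].transform('count')

print(toy)
```

As in the notebook, the count includes the passenger themself, so a passenger travelling with no one of the same family name gets `NumRelatives = 1`.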

In [None]:
%%time
def cabin_separation(df):
    '''
    Split the Cabin name into Deck, Number and Side
    
    '''
    
    # Split once and reuse the result for the three columns
    cabin_parts = df['Cabin'].str.split('/', expand=True)
    df['CabinDeck'] = cabin_parts[0]
    df['CabinNum'] = cabin_parts[1]
    df['CabinSide'] = cabin_parts[2]
    df.drop(columns = ['Cabin'], inplace = True)
    return df

In [None]:
%%time
trn_data = cabin_separation(trn_data)
tst_data = cabin_separation(tst_data)

In [None]:
%%time
def route(df):
    '''
    Combines origin and destination into a single Route feature for training.
    Args:
        df: DataFrame with the HomePlanet and Destination columns.
    Returns:
        The same DataFrame with a new Route column.
    '''
    
    df['Route'] = df['HomePlanet'] + df['Destination']
    return df

In [None]:
%%time
trn_data = route(trn_data)
tst_data = route(tst_data)

In [None]:
def age_groups(df):
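    '''
    Flags age groups: IsKid (Age <= 10), IsAdult (Age > 10) and IsOlder (Age >= 60).
    Note that IsOlder passengers are also flagged as IsAdult.
    '''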
    df['IsKid'] = np.where(df['Age'] <= 10, 1, 0)
    df['IsAdult'] = np.where(df['Age'] > 10, 1, 0)
    df['IsOlder'] = np.where(df['Age'] >= 60, 1, 0)
    return df

In [None]:
%%time
trn_data = age_groups(trn_data)
tst_data = age_groups(tst_data)
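
Because `IsAdult` (Age > 10) also covers every passenger flagged `IsOlder` (Age >= 60), one alternative is a single, mutually exclusive age band. A minimal sketch using the same thresholds; the `AgeBand` labels and the toy ages are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy ages, including both threshold values
ages = pd.Series([5, 10, 35, 60, 79])

# Conditions are checked in order; anything matching neither is an adult
conditions = [ages <= 10, ages >= 60]
age_band = np.select(conditions, ['Kid', 'Older'], default='Adult')

print(age_band.tolist())  # ['Kid', 'Kid', 'Adult', 'Older', 'Older']
```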

In [None]:
trn_data['Money_rank'] = trn_data['Total_Billed'].rank(method='max')
tst_data['Money_rank'] = tst_data['Total_Billed'].rank(method='max')
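
With `method='max'`, tied values all receive the highest rank in their tie group. The sketch below shows this on made-up amounts. It also shows `pct=True`, which rescales ranks to the 0 to 1 range; that may be worth considering here, since the train and test sets have different sizes and their raw ranks are therefore on different scales.

```python
import pandas as pd

# Toy totals with two tie groups
billed = pd.Series([0.0, 0.0, 150.0, 150.0, 900.0])

# Ties share the highest rank of their group
ranks = billed.rank(method='max')
print(ranks.tolist())      # [2.0, 2.0, 4.0, 4.0, 5.0]

# Same ranking, expressed as a fraction of the number of rows
pct_ranks = billed.rank(method='max', pct=True)
print(pct_ranks.tolist())
```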

In [None]:
def aggregate_by_group(df):
    pass  # TODO: not implemented yet

# Pre-Processing for Training

In [None]:
%%time
# Numerical features: original columns plus the engineered Total_Billed and NumRelatives
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Total_Billed', 'NumRelatives']

categorical_features = ['FirstName',
                        'FamilyName',
                        'CabinNum',]


categorical_features_onehot = ['HomePlanet',
                               'CryoSleep',
                               'CabinDeck',
                               'CabinSide',
                               'Destination',
                               'VIP',]

target_feature = 'Transported'

In [None]:
%%time
ignore = ['PassengerId', 
          'Route',
          'FirstName',
          'CabinNum',
          'IsKid',
          'IsAdult',
          'IsOlder'
         ]
features = [feat for feat in trn_data.columns if feat not in ignore]

In [None]:
%%time
features

# PyCaret Model Development...

In [None]:
from pycaret.classification import *

In [None]:
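# Note: `silent` and `ignore_low_variance` are PyCaret 2.x arguments; they were removed in PyCaret 3.x.
clf = setup(data = trn_data,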
            target = 'Transported',
            train_size = 0.99,
            fold_strategy = 'stratifiedkfold',
            fold = 5,
            fold_shuffle = True,
            numeric_features = numerical_features,
            ignore_low_variance=True,
            remove_multicollinearity = True,
            normalize = True,
            normalize_method = 'robust',
            data_split_stratify = True,
            ignore_features = ignore,
            silent = True)

remove_metric('kappa')
remove_metric('mcc')
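
For intuition on `normalize_method = 'robust'`: as I understand it, robust scaling centres each feature on its median and divides by its interquartile range, so a few very large bills do not dominate the scale. A rough NumPy sketch of that idea follows; `robust_scale` is a hypothetical helper written for illustration, not PyCaret's internal code.

```python
import numpy as np


def robust_scale(values):
    """Centre on the median and scale by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)


# Toy amounts with one extreme value
scaled = robust_scale([0, 10, 20, 30, 1000])
print(scaled)  # the typical values stay within [-1, 0.5]; only the outlier is large
```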

In [None]:
best = compare_models(n_select = 10, include = ['catboost', 'lightgbm', 'xgboost'])

In [None]:
catboost = tune_model(create_model('catboost'), choose_better = True, n_iter = 20, early_stopping = True,  optimize = 'Accuracy')

In [None]:
lightgbm = tune_model(create_model('lightgbm'), choose_better = True, n_iter = 20, early_stopping = True,  optimize = 'Accuracy')

In [None]:
xgboost = tune_model(create_model('xgboost'), choose_better = True, n_iter = 20, early_stopping = True,  optimize = 'Accuracy')

In [None]:
blend_soft = blend_models(estimator_list = [catboost, lightgbm, xgboost], optimize = 'Accuracy', method = 'soft')
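
Soft blending averages the predicted class probabilities of the individual models and then picks the more likely class. A minimal sketch of that idea with invented probabilities (the numbers are made up and do not come from the models above):

```python
import numpy as np

# Hypothetical P(Transported = True) from three models for four passengers
proba = np.array([
    [0.90, 0.40, 0.55, 0.10],  # model A
    [0.80, 0.60, 0.45, 0.20],  # model B
    [0.70, 0.35, 0.65, 0.30],  # model C
])

# Soft vote: average the probabilities across models, then threshold at 0.5
avg_proba = proba.mean(axis=0)
blended = avg_proba >= 0.5

print(avg_proba)
print(blended.tolist())  # [True, False, True, False]
```

For the second passenger, model B alone would predict `True`, but the averaged probability falls below 0.5, so the blend predicts `False`.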

In [None]:
cali_model = calibrate_model(blend_soft)

In [None]:
plot_model(xgboost, plot = 'feature_all')

In [None]:
# Mean Accuracy = 0.8076 >> 0.8099 >> 0.8089 >> 0.8092

In [None]:
df_pred = predict_model(cali_model, tst_data)

In [None]:
df_pred

In [None]:
sub = df_pred.loc[:, ['PassengerId', 'Label']].rename(columns = {'Label':'Transported'})
sub.to_csv('py_caret_submission_03112022.csv', index = False)
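
Before uploading, it can help to check the frame against the format described for sample_submission.csv: a `PassengerId` column and a `Transported` column holding True or False. The helper below, `submission_problems`, is a hypothetical sanity check sketched for illustration, and the Ids in the usage example are made up.

```python
import pandas as pd


def submission_problems(sub, expected_ids):
    """Return a list of problems found; an empty list means the frame looks fine."""
    if list(sub.columns) != ['PassengerId', 'Transported']:
        return ['columns must be exactly PassengerId and Transported']
    problems = []
    if len(sub) != len(expected_ids) or set(sub['PassengerId']) != set(expected_ids):
        problems.append('PassengerId values do not match the test set')
    if sub['Transported'].dtype != bool:
        problems.append('Transported must be a boolean column')
    return problems


# Usage with made-up Ids
ids = ['0013_01', '0018_01']
good = pd.DataFrame({'PassengerId': ids, 'Transported': [True, False]})
bad = good.assign(Transported=[1, 0])

print(submission_problems(good, ids))  # []
print(submission_problems(bad, ids))   # ['Transported must be a boolean column']
```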