# Welcome to Spaceship Titanic!

We will do a quick clean-up of the data and then show you how to stack models to improve your score. If you want to see my EDA, go here:
https://www.kaggle.com/code/crained/spaceship-titanic-pandas-profiling-eda

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import (
    ensemble,
    model_selection,    
    preprocessing,
    tree,
)

In [None]:
# Load the dataset
train = pd.read_csv("../input/spaceship-titanic/train.csv")
train.head()

# Drop columns

In [None]:
train = train.drop(['PassengerId','Cabin', 'Name'], axis=1)

# Create Features

We need to create dummy columns from the string columns. This will create new indicator columns for HomePlanet, CryoSleep, Destination and VIP. Pandas has a convenient get_dummies function for that.

In [None]:
train = pd.get_dummies(train)

In [None]:
train.columns
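To see what get_dummies does, here is a minimal sketch on a toy frame. The values and the `toy` / `toy_dummies` names are made up for illustration; only the column names mirror the competition data.

```python
import pandas as pd

# Toy frame with hypothetical values (not taken from train.csv).
toy = pd.DataFrame(
    {
        "Age": [39.0, 24.0, 58.0],
        "HomePlanet": ["Europa", "Earth", None],
    }
)

# Numeric columns pass through unchanged. Each string column becomes one
# indicator column per category. A missing value is 0/False in every indicator.
toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))
print(toy_dummies)
```

Note that the row with the missing HomePlanet ends up with no indicator set, which is why the dummy columns can still be used before any imputation.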

At this point the "VIP_True" and "CryoSleep_True" columns are almost perfectly inversely correlated with their "_False" counterparts (they differ only on rows where the original value was missing). Typically we remove columns with perfect or very high positive or negative correlation, because multicollinearity can distort the interpretation of feature importance and coefficients in some models.

In [None]:
train = train.drop(columns=["VIP_True",
                            "CryoSleep_True"])
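As a small illustration of why one column of each pair is redundant, the sketch below uses a hypothetical CryoSleep column with no missing values. It also shows `drop_first=True`, a get_dummies option that drops one indicator per column automatically. It keeps the "_True" column rather than the "_False" column we kept above, which carries the same information.

```python
import pandas as pd

# Hypothetical CryoSleep values, stored as strings, with no missing entries.
toy = pd.DataFrame({"CryoSleep": ["True", "False", "False", "True", "False"]})

dummies = pd.get_dummies(toy).astype(int)

# With no missing values the two indicators always sum to 1,
# so their Pearson correlation is -1.
corr = dummies["CryoSleep_True"].corr(dummies["CryoSleep_False"])
print(round(corr, 3))

# drop_first=True drops the first category of each encoded column.
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))
```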

In [None]:
# Transported is what we are trying to predict so we will make it our y variable
y = train.Transported
X = train.drop(columns="Transported")

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42
)

Many of the columns have missing values. We need to impute the numeric values. We only want to fit the imputer on the training set and then use that fitted imputer to fill in the data for the test set. Otherwise we are leaking data (cheating by giving future information to the model).

In [None]:
# we can look at the data once more to see missing values
train.isnull().sum()

In [None]:
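# IterativeImputer is experimental, so this import is needed to enable it
from sklearn.experimental import (
    enable_iterative_imputer,
)
from sklearn import impute

# Numeric columns that contain missing values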
num_cols = [
    "Age",
    "RoomService",
    "FoodCourt",
    "ShoppingMall",
    "Spa",
    "VRDeck",
]

Use scikit-learn's IterativeImputer to fill in the missing data.

In [None]:
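# Fit the imputer on the training split only
imputer = impute.IterativeImputer()
imputed = imputer.fit_transform(
    X_train[num_cols]
)
X_train.loc[:, num_cols] = imputed
# transform (not fit_transform) so no test-set statistics leak in
imputed = imputer.transform(X_test[num_cols])
X_test.loc[:, num_cols] = imputed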
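The fit-on-train, transform-on-test pattern is easier to check by hand with a simpler imputer. In this sketch I swap in `SimpleImputer` with the mean strategy (not the IterativeImputer used above), and the toy ages and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import impute

# Hypothetical ages. NaN marks a missing value.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 90.0]})

# SimpleImputer is used here only because the result is easy to verify.
# The fit/transform split works the same way for IterativeImputer.
toy_imputer = impute.SimpleImputer(strategy="mean")
train_filled = toy_imputer.fit_transform(toy_train)  # learns the train mean (30.0)
test_filled = toy_imputer.transform(toy_test)        # reuses the train mean

print(train_filled.ravel())
print(test_filled.ravel())
```

The missing test value is filled with 30.0, the training mean. The 90.0 in the test set never influences the fill value, which is the whole point of fitting on the training split only.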

# Normalize the data

Putting the features on a comparable scale helps many models perform better, particularly those that depend on a distance metric to determine similarity.

In [None]:
cols = ["Age",
        "RoomService",
        "FoodCourt",
        "ShoppingMall",
        "Spa",
        "VRDeck",
        "HomePlanet_Earth",
        "HomePlanet_Europa",
        "HomePlanet_Mars",
        "CryoSleep_False",
        "Destination_55 Cancri e",
        "Destination_PSO J318.5-22",
        "Destination_TRAPPIST-1e",
        "VIP_False"
]

# Preprocessing

We are going to standardize the data for the preprocessing. Standardizing shifts and rescales each column so that it has a mean value of zero and a standard deviation of one. This way models don’t treat variables with larger scales as more important than smaller scaled variables. I’m going to stick the result (numpy array) back into a pandas DataFrame for easier manipulation (and to keep column names).

In [None]:
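sca = preprocessing.StandardScaler()
# Fit on the training split only, then apply the same scaling to the test split.
# cols must list the columns in the same order as X_train.columns.
train_index, test_index = X_train.index, X_test.index
X_train = sca.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns=cols, index=train_index)
X_test = sca.transform(X_test)
X_test = pd.DataFrame(X_test, columns=cols, index=test_index)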
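Here is a quick sanity check of what standardizing does, on a toy frame with made-up values (the `toy_*` and `scaled_*` names are hypothetical). After scaling, each training column should have mean 0 and standard deviation 1, while the test rows are scaled with the training statistics.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Two columns on very different scales (hypothetical values).
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Spa": [0.0, 100.0, 2000.0]})
toy_test = pd.DataFrame({"Age": [50.0], "Spa": [100.0]})

toy_scaler = preprocessing.StandardScaler()
scaled_train = pd.DataFrame(
    toy_scaler.fit_transform(toy_train),
    columns=toy_train.columns,
    index=toy_train.index,
)
scaled_test = pd.DataFrame(
    toy_scaler.transform(toy_test),
    columns=toy_test.columns,
    index=toy_test.index,
)

# StandardScaler uses the population standard deviation (ddof=0).
print(scaled_train.mean().round(6).tolist())
print(scaled_train.std(ddof=0).round(6).tolist())
print(scaled_test)
```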

# Baseline Model

Creating a baseline model that does something really simple can give us something to compare our model to. Note that using the default .score result gives us the accuracy which can be misleading. A problem where a positive case is 1 in 10,000 can easily get over 99% accuracy by always predicting negative.

In [None]:
from sklearn.dummy import DummyClassifier
bm = DummyClassifier()
bm.fit(X_train, y_train)
bm.score(X_test, y_test)

In [None]:
from sklearn import metrics
metrics.precision_score(
    y_test, bm.predict(X_test)
)
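The 1 in 10,000 example can be worked through in a few lines of plain Python. The labels below are synthetic and only illustrate the arithmetic. They have nothing to do with the Spaceship Titanic data, where the classes are far more balanced.

```python
# Hypothetical data set: 1 positive case among 10,000 samples.
n_samples = 10_000
y_true = [1] + [0] * (n_samples - 1)
y_pred = [0] * n_samples  # a "model" that always predicts negative

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

# Recall: fraction of true positives that the model actually found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy={accuracy:.4f} recall={recall:.1f}")
# prints: accuracy=0.9999 recall=0.0
```

The accuracy looks excellent, yet the model never finds the one positive case. That is why it is worth looking at a second metric such as precision or recall next to the baseline accuracy.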

# Model Tests

This code tries a variety of algorithm families. The “No Free Lunch” theorem states that no algorithm performs well on all data. However, for some finite set of data, there may be an algorithm that does well on that set. (A popular choice for structured, tabular data these days is a tree-boosted algorithm such as XGBoost.)

In [None]:
# Because we are using k-fold cross-validation, 
# we will feed the model all of X and y:
X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

In [None]:
from sklearn import model_selection
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import (
    LogisticRegression,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import (
    KNeighborsClassifier,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import (
    RandomForestClassifier,
)

# Build our models

In [None]:
for model in [
    DummyClassifier,
    LogisticRegression,
    DecisionTreeClassifier,
    KNeighborsClassifier,
    GaussianNB,
    SVC,
    RandomForestClassifier
]:
    cls = model()
    kfold = model_selection.KFold(
        n_splits=10
    )
    s = model_selection.cross_val_score(
        cls, X, y, scoring="accuracy", cv=kfold
    )
    print(
        f"{model.__name__:22} Accuracy: "
        f"{s.mean():.3f} STD: {s.std():.2f}"
    )
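If you want to add a tree-boosted model to the comparison, the same cross-validation call works. This is only a sketch: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, runs on synthetic data from `make_classification` instead of the competition data, and the `*_demo` names and settings are my own assumptions.

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (not the Spaceship Titanic data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=42
)

# Shuffled k-fold with a fixed seed so the folds are reproducible.
kfold_demo = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

gb_scores = model_selection.cross_val_score(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    scoring="accuracy",
    cv=kfold_demo,
)
print(
    f"{'GradientBoostingClassifier':22} Accuracy: "
    f"{gb_scores.mean():.3f} STD: {gb_scores.std():.2f}"
)
```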

# Stacking 

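We will now take the models above and stack them, aiming for a more accurate combined model. Stacking feeds the predictions of the base models into a meta-classifier (here a logistic regression) that learns how to combine them.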

To learn more about this method see the documentation here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/

In [None]:
from mlxtend.classifier import (
    StackingClassifier,
)

clfs = [
    x()
    for x in [
        LogisticRegression,
        DecisionTreeClassifier,
        KNeighborsClassifier,
        GaussianNB,
        SVC,
        RandomForestClassifier,
    ]
]

stack = StackingClassifier(
    classifiers=clfs,
    meta_classifier=LogisticRegression(),
)
kfold = model_selection.KFold(
    n_splits=10
)

s = model_selection.cross_val_score(
    stack, X, y, scoring="accuracy", cv=kfold
)

print(
    f"{stack.__class__.__name__} "
    f"Accuracy: {s.mean():.3f}  STD: {s.std():.2f}"
)
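If mlxtend is not available, scikit-learn ships its own stacking estimator. The sketch below is an alternative, not the mlxtend class used above. It runs on synthetic data, the `*_demo` and `sk_stack` names are hypothetical, and I import the class under an alias so it does not shadow the mlxtend `StackingClassifier`. One difference to be aware of: scikit-learn trains the final estimator on cross-validated predictions of the base models (the `cv` argument).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier as SkStackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the competition data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0
)

# Base models are passed as (name, estimator) pairs.
sk_stack = SkStackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # folds used to build the meta-classifier's training features
)

stack_scores = cross_val_score(
    sk_stack, X_demo, y_demo, scoring="accuracy", cv=3
)
print(
    f"{sk_stack.__class__.__name__} "
    f"Accuracy: {stack_scores.mean():.3f}  STD: {stack_scores.std():.2f}"
)
```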

# GridSearch

Let's try GridSearchCV to tune the hyperparameters of the stacked model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Initializing models

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)

params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf, 
                    param_grid=params, 
                    cv=5,
                    refit=True)
grid.fit(X, y)

cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
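Before growing the grid it helps to know how many models it will train. This sketch uses scikit-learn's `ParameterGrid` to enumerate the same parameter dictionary. The `demo_params`, `combos` and `n_fits` names are just for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above: three parameters with two candidate values each.
demo_params = {
    "kneighborsclassifier__n_neighbors": [1, 5],
    "randomforestclassifier__n_estimators": [10, 50],
    "meta_classifier__C": [0.1, 10.0],
}

combos = list(ParameterGrid(demo_params))
n_fits = len(combos) * 5  # cv=5 means five fits per combination

print(len(combos), n_fits)
# prints: 8 40
```

With `refit=True` there is one more fit at the end, on all of the data with the best combination. Adding a third value to each parameter would already mean 27 combinations, so the grid gets expensive quickly.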

If this helped you, please don't forget to upvote :)