# Welcome to Spaceship Titanic!

This is going to be a basic EDA using pandas that looks at each feature in turn, followed by a few quick baseline models.

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import (
    ensemble,
    model_selection,    
    preprocessing,
    tree,
)

In [None]:
#load dataset
train=pd.read_csv("../input/spaceship-titanic/train.csv")
train.head()

In [None]:
train.shape

There are 8693 passengers and 14 columns (13 features plus the target, Transported)!

# Clean Data

The original Titanic dataset had leaky features: variables that contain information about the future or the target. There's nothing wrong with having data about the target, and we often have that data at model creation time. However, if those variables are not available when we predict on a new sample, we should remove them from the model, because they leak data from the future.

In [None]:
# Let's have a look to see if there is missing data
missing = train.isnull().sum().sort_values(ascending=False)
missing

Many features have missing values that need a closer look. The counts aren't massive next to 8693 passengers, but they still need to be dealt with.
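
To judge how large those counts are relative to the number of rows, one option is to express them as percentages. The sketch below uses a small made-up frame (`toy`) in place of `train`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Age": [24.0, np.nan, 58.0, 33.0],
        "HomePlanet": ["Earth", "Europa", None, None],
        "VRDeck": [0.0, 44.0, 49.0, 193.0],
    }
)

# isnull().mean() is the fraction of missing rows per column;
# multiply by 100 to read it as a percentage.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on `train` unchanged.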

In [None]:
# what types of data are we dealing with?
train.dtypes

**object** typically means that it is holding string data, though it could be a combination of string and other types.

**float64** is a numeric type.

**bool** is True/False or 0/1.
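
One reason a True/False column can show up as **object** rather than **bool** is a missing value. A minimal sketch on two made-up Series (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# A complete True/False column is stored as bool.
complete = pd.Series([True, False, True])

# A single missing value forces pandas to fall back to the object dtype.
with_gap = pd.Series([True, False, np.nan])

print(complete.dtype)
print(with_gap.dtype)
```

This is probably why columns such as CryoSleep and VIP are later treated as string-like columns when dummy variables are created.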

In [None]:
train.describe().iloc[:,:2]

The **count** statistic only includes values that are not NaN, so it is useful for checking whether a column is missing data. 

Spot-check the **minimum and maximum values** to see if there are outliers. 
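
As a rough illustration of both checks, the sketch below compares **count** with the number of rows and reads off the extremes. The frame `toy` and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Age": [24.0, np.nan, 58.0, 33.0],
        "Spa": [0.0, 549.0, 6715.0, 3329.0],
    }
)

summary = toy.describe()

# count ignores NaN, so rows minus count gives the missing values per column.
n_missing = len(toy) - summary.loc["count"]
print(n_missing)

# min and max sit in the same table and make extreme values easy to spot.
print(summary.loc[["min", "max"]])
```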



In [None]:
# Let's look at missing data in each column again so we can work on it
train.isnull().sum()

By default, calling these methods will apply the operation along axis 0, which is along the index, so the result has one entry per column. If you want to flag the samples that have at least one missing feature, you can apply `.any` along axis 1 (along the columns):

In [None]:
mask = train.isnull().any(axis=1)
mask.head()
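
To make the axis difference concrete, here is a small sketch on a made-up frame. It also shows `.sum(axis=1)`, which returns the number of missing features per sample instead of a True/False flag.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Age": [24.0, np.nan, 58.0],
        "HomePlanet": ["Earth", None, "Mars"],
        "Spa": [0.0, np.nan, np.nan],
    }
)

per_column = toy.isnull().sum()        # axis=0 (default): one count per column
per_row = toy.isnull().sum(axis=1)     # axis=1: one count per sample
any_missing = toy.isnull().any(axis=1) # axis=1: True if the sample has any gap

print(per_column)
print(per_row)
print(any_missing)
```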

Now we will look at individual columns to better understand the data we have.

In [None]:
# If you want to include null or NaN values use dropna=False
train.Name.value_counts(dropna=False)

The values look like individual full names, though we have not yet checked whether they are all unique.

You could use NLP on free-text columns, but more than likely your model will not be able to take advantage of them. The Name column is an example of this.
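
Checking uniqueness does not need the full `value_counts` table. A minimal sketch on an invented Series of names (none of them come from the dataset):

```python
import pandas as pd

# Invented names for illustration; one is repeated and one is missing.
names = pd.Series(["Ada Lovell", "Bo Stark", "Bo Stark", None])

n_unique = names.nunique()   # distinct non-missing names
all_unique = names.is_unique # False as soon as any value repeats

# Rows whose (non-missing) name appears more than once.
repeated = names[names.duplicated(keep=False) & names.notna()]

print(n_unique, all_unique)
print(repeated)
```

On the real data the equivalent calls would be `train.Name.nunique()` and `train.Name.is_unique`.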

# View Features

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities. Let's have a look.

In [None]:
train.VRDeck.value_counts(dropna=False)

That's a wide range of billed amounts, with many distinct values. Again, this is something that will have to be dealt with later, and we may decide that the column is not that valuable.
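
Because the spending columns are continuous amounts, a summary can be easier to read than `value_counts`. The sketch below is one possible approach on a made-up frame: the share of passengers who spent nothing per amenity, and a total spend per passenger. The name `spend_cols` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "RoomService": [0.0, 100.0, 0.0, np.nan],
        "FoodCourt": [0.0, 10.0, 300.0, 0.0],
        "ShoppingMall": [0.0, 0.0, 0.0, 50.0],
        "Spa": [0.0, 500.0, np.nan, 0.0],
        "VRDeck": [0.0, 40.0, 0.0, 200.0],
    }
)

# Fraction of rows with a bill of exactly zero (NaN does not count as zero).
zero_share = (toy[spend_cols] == 0).mean()

# Total billed per passenger; sum() skips NaN by default.
total_spend = toy[spend_cols].sum(axis=1)

print(zero_share)
print(total_spend)
```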

In [None]:
train.Spa.value_counts(dropna=False)

In [None]:
train.ShoppingMall.value_counts(dropna=False)

In [None]:
train.FoodCourt.value_counts(dropna=False)

In [None]:
train.RoomService.value_counts(dropna=False)

In [None]:
train.VIP.value_counts(dropna=False)

In [None]:
train.Age.value_counts(dropna=False)

In [None]:
train.Destination.value_counts(dropna=False)

In [None]:
train.Cabin.value_counts(dropna=False)

In [None]:
train.CryoSleep.value_counts(dropna=False)

In [None]:
train.HomePlanet.value_counts(dropna=False)

In [None]:
train.PassengerId.value_counts(dropna=False)

# Drop columns

In [None]:
train = train.drop(['PassengerId','Cabin', 'Name'], axis=1)

# Create Features

We need to create dummy columns from string columns. This will create new columns for HomePlanet, CryoSleep, Destination, and VIP. Pandas has a convenient get_dummies function for that.

In [None]:
train = pd.get_dummies(train)

In [None]:
train.columns

In [None]:
train.head()

At this point the "VIP_True" and "CryoSleep_True" columns are almost perfectly inversely correlated with the matching "_False" columns (the only exceptions are rows where the original value was missing, which get 0 in both). Typically we remove one column from any pair with perfect or very high positive or negative correlation, because multicollinearity can distort the interpretation of feature importance and coefficients in some models.
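
A small sketch of what `get_dummies` does with a True/False column that contains a gap, using a made-up single-column frame. It also shows `drop_first=True`, an alternative to dropping columns by hand; note that it keeps the "_True" column and drops "_False", the opposite of the choice made in this notebook.

```python
import pandas as pd

# Made-up column with one missing value, so pandas stores it as object.
toy = pd.DataFrame({"VIP": [True, False, None, False]})

# Default behaviour: one indicator column per level.
full = pd.get_dummies(toy)

# drop_first=True removes the first level of each encoded column.
reduced = pd.get_dummies(toy, drop_first=True)

# The row with the missing value is 0 in both indicator columns.
both_zero = full.sum(axis=1) == 0

print(list(full.columns))
print(list(reduced.columns))
print(both_zero.tolist())
```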

In [None]:
train = train.drop(columns=["VIP_True",
                            "CryoSleep_True"])

In [None]:
y = train.Transported
X = train.drop(columns="Transported")

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42
)

Many of the columns have missing values. We need to impute the numeric values. We only want to fit the imputer on the training set and then use that fitted imputer to fill in the data for the test set. Otherwise we are leaking data (cheating by giving future information to the model).
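
The fit-on-train rule is easiest to see with a simpler imputer. The sketch below swaps in `SimpleImputer` with a median strategy purely for illustration (it is not the imputer used in this notebook) and runs on two made-up frames.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train and test splits with one gap each.
X_tr = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
X_te = pd.DataFrame({"Age": [np.nan, 90.0]})

imp = SimpleImputer(strategy="median")

# fit_transform learns the median from the training rows only.
X_tr_filled = imp.fit_transform(X_tr)

# transform reuses that training median; the test value 90 is never seen in fit.
X_te_filled = imp.transform(X_te)

print(imp.statistics_)  # median learned from the training rows
print(X_te_filled)
```

The test gap is filled with the training median, not with a statistic that includes the test row.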

In [None]:
# we can look at the data once more to see missing values
train.isnull().sum()

In [None]:
from sklearn.experimental import (
    enable_iterative_imputer,
)
from sklearn import impute
num_cols = [
    "Age",
    "RoomService",
    "FoodCourt",
    "ShoppingMall",
    "Spa",
    "VRDeck",
]

Use scikit-learn's IterativeImputer to fill in the missing data.

In [None]:
imputer = impute.IterativeImputer()
imputed = imputer.fit_transform(
    X_train[num_cols]
)
X_train.loc[:, num_cols] = imputed
imputed = imputer.transform(X_test[num_cols])
X_test.loc[:, num_cols] = imputed

In [None]:
# look at all of our columns again
train.dtypes

# Normalize the data

Normalizing or preprocessing the data helps many models perform better, particularly those that depend on a distance metric to determine similarity.

In [None]:
cols = [
    "Age",
    "RoomService",
    "FoodCourt",
    "ShoppingMall",
    "Spa",
    "VRDeck",
    "HomePlanet_Earth",
    "HomePlanet_Europa",
    "HomePlanet_Mars",
    "CryoSleep_False",
    "Destination_55 Cancri e",
    "Destination_PSO J318.5-22",
    "Destination_TRAPPIST-1e",
    "VIP_False",
]

# Preprocessing

We are going to standardize the data for the preprocessing. Standardizing is translating the data so that it has a mean value of zero and a standard deviation of one. This way models don’t treat variables with larger scales as more important than smaller scaled variables. I’m going to stick the result (numpy array) back into a pandas DataFrame for easier manipulation (and to keep column names).

In [None]:
sca = preprocessing.StandardScaler()
X_train = sca.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = sca.transform(X_test)
X_test = pd.DataFrame(X_test, columns=cols)
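
A quick sanity check of what standardizing does, sketched on made-up frames. Passing `index=` when rebuilding the DataFrame is an optional extra that keeps the original row labels, which can help if the features are later aligned with `y` by index.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up train and test frames for illustration.
train_toy = pd.DataFrame({"Age": [10.0, 20.0, 30.0], "Spa": [0.0, 100.0, 200.0]})
test_toy = pd.DataFrame({"Age": [20.0], "Spa": [300.0]})

scaler = StandardScaler()

# Fit on the training rows, then wrap the array back into a DataFrame.
train_scaled = pd.DataFrame(
    scaler.fit_transform(train_toy),
    columns=train_toy.columns,
    index=train_toy.index,
)

# The test rows are scaled with the training mean and standard deviation.
test_scaled = pd.DataFrame(
    scaler.transform(test_toy),
    columns=test_toy.columns,
    index=test_toy.index,
)

print(train_scaled.mean())       # approximately 0 for every column
print(train_scaled.std(ddof=0))  # approximately 1 for every column
print(test_scaled)
```

StandardScaler uses the population standard deviation, hence `ddof=0` in the check.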

# Baseline Model

Creating a baseline model that does something really simple gives us something to compare our model to. Note that the default .score result is accuracy, which can be misleading. A problem where a positive case is 1 in 10,000 can easily get over 99% accuracy by always predicting negative.

In [None]:
from sklearn.dummy import DummyClassifier
bm = DummyClassifier()
bm.fit(X_train, y_train)
bm.score(X_test, y_test)

In [None]:
from sklearn import metrics
metrics.precision_score(
    y_test, bm.predict(X_test)
)
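
The accuracy caveat can be reproduced on a made-up imbalanced problem (1 positive in 100 samples). This is a sketch with synthetic labels, unrelated to the Spaceship Titanic target.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Synthetic labels: 99 negatives and a single positive.
y_toy = np.array([0] * 99 + [1])
X_toy = np.zeros((100, 1))  # the features are ignored by the dummy model

# Always predict the most frequent class (negative).
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = metrics.accuracy_score(y_toy, pred)
rec = metrics.recall_score(y_toy, pred)

print(f"accuracy={acc:.2f} recall={rec:.2f}")
```

Accuracy looks excellent while recall shows that the single positive case is never found.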

# Model Tests

This code tries a variety of algorithm families. The “No Free Lunch” theorem states that no algorithm performs well on all data. However, for some finite set of data, there may be an algorithm that does well on that set. (A popular choice for structured learning these days is a tree-boosted algorithm such as XGBoost.)

In [None]:
# Because we are using k-fold cross-validation, 
# we will feed the model all of X and y:
X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

In [None]:
from sklearn import model_selection
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import (
    LogisticRegression,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import (
    KNeighborsClassifier,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import (
    RandomForestClassifier,
)
import xgboost

# Build our models

In [None]:
for model in [
    DummyClassifier,
    LogisticRegression,
    DecisionTreeClassifier,
    KNeighborsClassifier,
    GaussianNB,
    SVC,
    RandomForestClassifier,
    xgboost.XGBClassifier,
]:
    cls = model()
    kfold = model_selection.KFold(
        n_splits=10
    )
    s = model_selection.cross_val_score(
        cls, X, y, scoring="roc_auc", cv=kfold
    )
    print(
        f"{model.__name__:22} AUC: "
        f"{s.mean():.3f} STD: {s.std():.2f}"
    )

from sklearn.metrics import accuracy_score

In [None]:
for model in [
    DummyClassifier,
    LogisticRegression,
    DecisionTreeClassifier,
    KNeighborsClassifier,
    GaussianNB,
    SVC,
    RandomForestClassifier,
    xgboost.XGBClassifier,
]:
    cls = model()
    kfold = model_selection.KFold(
        n_splits=10
    )
    s = model_selection.cross_val_score(
        cls, X, y, scoring="accuracy", cv=kfold
    )
    print(
        f"{model.__name__:22} Accuracy: "
        f"{s.mean():.3f} STD: {s.std():.2f}"
    )
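
One caveat with the loops above: the imputer and scaler were fitted before cross-validation, so each validation fold has already influenced the preprocessing. A common way to avoid this is to put the preprocessing in a Pipeline so it is refitted inside every fold. The sketch below shows the idea on synthetic data from `make_classification`; the choice of `SimpleImputer`, `LogisticRegression`, and `StratifiedKFold` is an assumption for illustration, not what this notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, random_state=42
)

# Imputing and scaling are refitted on the training part of each fold.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)

# Shuffled, stratified folds keep the class balance similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, scoring="roc_auc", cv=cv)

print(f"AUC: {scores.mean():.3f} STD: {scores.std():.2f}")
```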

If this helped you, please don't forget to upvote :)