<a id="0"></a>
# Survival in space

<b>CONTENTS</b>
<br>

- [1. Problem statement](#1)<br>
- [2. Cleaning data](#2)<br>
    - [2.1 Managing null values](#2.1)<br>
    - [2.2 Preprocessing data](#2.2)<br>
- [3. Visualizing data](#3)<br>
- [4. Classification](#4)<br>
    - [4.1 Logistic regression](#4.1)<br>
    - [4.2 Random forest](#4.2)<br>
    - [4.3 LGBM](#4.3)<br>
- [5. Model choice and submission](#5)<br>

</p>

<a id="1"></a>
## 1. Problem Statement
[Back to top](#0)<br>

We have a dataset describing the passengers aboard *Spaceship Titanic*. 
We need to predict whether each passenger has been transported to another dimension or has "survived".

We can avail ourselves of a *labeled* training dataset, hence this is a typical **supervised classification problem**.

<a id="2"></a>
## 2. Cleaning data
[Back to top](#0)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

In [None]:
%matplotlib inline

In [None]:
# loading datasets
train_data, test_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv"), pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")

In [None]:
train_data.head()

So we have several **categorical** as well as **numerical** columns. 

* Categorical: `HomePlanet`, `CryoSleep`, `Destination`, `VIP`
* Numerical: `Age`, `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck`

The `Name` column might be useful to check whether people belong to the same family, together with the group information encoded in the `PassengerId` column (the idea being that if two people share a last name and are in the same group, they are probably related).
Note that `Age` might help with determining what the family relation is, but maybe that's an inference too far. 
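
The family idea above can be sketched as follows. This is only an illustration on made-up rows: the `LastName` and `SharesSurnameInGroup` column names are assumptions introduced here, and the last name is assumed to be the final whitespace-separated token of `Name`.

```python
import pandas as pd

# Toy rows mimicking the competition layout (all values are made up).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01", "0003_02"],
    "Name": ["Ann Lee", "Bo Kim", "Cy Kim", "Di Roy", None],
})

# The group ID is the part of PassengerId before the underscore.
toy["GroupId"] = toy["PassengerId"].str.split("_").str[0]
# Assumed convention: the last name is the final token; missing names stay missing.
toy["LastName"] = toy["Name"].str.split().str[-1]

# A passenger is flagged when someone else in the same group has the same last name.
toy["SharesSurnameInGroup"] = (
    toy["LastName"].notna()
    & toy.duplicated(["GroupId", "LastName"], keep=False)
)
print(toy[["PassengerId", "LastName", "SharesSurnameInGroup"]])
```

Only the two passengers in group `0002` are flagged here; the passenger with a missing name is never flagged.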

The data description says:
> Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

so it probably makes sense to split that column into `Deck` and `Side`. I am still unsure whether the cabin number is relevant.

The *target* is encoded as a `bool` in the `Transported` column.

Finally, we note that there are **null values** in all the columns except for `PassengerId` and `Transported`. However, these should be ok for the numerical spending columns, as they presumably represent the fact that some passengers haven't spent any money on extra services.

In [None]:
train_data.describe()

<a id="2.1"></a>
### 2.1 Managing null values
[Back to top](#0)

Since we do have nulls, let's dig in and manage them.

In [None]:
train_data.isnull().sum()

In [None]:
train_data[train_data["Cabin"].isnull()].head()

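Let's fill the missing `Destination`, `HomePlanet` and `CryoSleep` values proportionally, i.e. by sampling from the observed categories with probabilities equal to their relative frequencies.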

In [None]:
def fill_proportionally(col, dataset):
    # relative frequency of each observed category; index and values stay aligned
    # (unique() and value_counts() return categories in different orders)
    freqs = dataset[col].value_counts(normalize=True)
    values, weights = freqs.index.tolist(), freqs.values.tolist()
    # filling each null with a category drawn according to those frequencies
    dataset[col] = dataset[col].apply(lambda x: random.choices(values, weights=weights)[0] if pd.isnull(x) else x)
    # checking
    assert dataset[col].isna().sum() == 0

In [None]:
for column in ["Destination", "HomePlanet", "CryoSleep"]:
    fill_proportionally(column, train_data)
    fill_proportionally(column, test_data)

In [None]:
train_data.isna().sum()

For simplicity, let's `fillna` `VIP` as `False`, as it's residual anyway, and let's use the training *median* for the `Age` column and the other numerical ones.

In [None]:
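for col in ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Age"]:
    # use the training median for both sets so no test statistics leak in
    median = train_data[col].median()
    # assign back instead of inplace=True on a column selection,
    # which newer pandas versions warn about or silently ignore
    train_data[col] = train_data[col].fillna(median)
    test_data[col] = test_data[col].fillna(median)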

In [None]:
test_data["VIP"] = test_data["VIP"].fillna(False)
train_data["VIP"] = train_data["VIP"].fillna(False)

In [None]:
print(train_data.isna().sum())
# remaining NaNs relative to the number of rows, expressed as a percentage
print("Missing data %: ", 100 * train_data.isna().sum().sum() / train_data.shape[0])

We still have ca. 400 NaNs. That's approx. 4%, half of which are in the `Name` column, which we are going to drop anyway. I could live with just filling those with empty strings and assigning the missing cabins a dummy value such as `Z/0000/Z`.

In [None]:
train_data["Name"] = train_data["Name"].fillna("")

In [None]:
train_data.shape, train_data.isna().sum().sum()

In [None]:
train_data["Cabin"] = train_data["Cabin"].fillna("Z/0000/Z")
test_data["Cabin"] = test_data["Cabin"].fillna("Z/0000/Z")

In [None]:
train_data["Cabin"].value_counts(dropna=False)

Interestingly, up to 8 passengers stayed in the same cabin. 

In [None]:
train_data[train_data["Cabin"] == "G/734/S"]

It doesn't look like the last name gives any indication of kinship worth pursuing.

In [None]:
train_data["Deck"] = train_data["Cabin"].apply(lambda x: str(x)[0])
test_data["Deck"] = test_data["Cabin"].apply(lambda x: str(x)[0]) 

In [None]:
train_data["Deck"].value_counts(dropna=False)

In [None]:
test_data["Deck"].value_counts(dropna=False)

In [None]:
train_data["Side"] = train_data["Cabin"].apply(lambda x: str(x).split("/")[2])
test_data["Side"] = test_data["Cabin"].apply(lambda x: str(x).split("/")[2])

In [None]:
print(test_data["Side"].value_counts(dropna=False))
print(train_data["Side"].value_counts(dropna=False))

Now let's find out how many passengers were travelling in groups. The first part of `PassengerId` is in fact a Group ID, with the second being the progressive number of the passenger in that group.

In [None]:
train_data["GroupId"] = train_data["PassengerId"].apply(lambda x: x.split("_")[0])
test_data["GroupId"] = test_data["PassengerId"].apply(lambda x: x.split("_")[0])

In [None]:
train_data["GroupIdProgNumber"] = train_data["PassengerId"].apply(lambda x: x.split("_")[1])
test_data["GroupIdProgNumber"] = test_data["PassengerId"].apply(lambda x: x.split("_")[1])

In [None]:
groups = train_data[train_data["GroupId"].duplicated()]["GroupId"]
train_data["InGroup"] = train_data["GroupId"].apply(lambda x: x in groups.values)

In [None]:
groups = test_data[test_data["GroupId"].duplicated()]["GroupId"]
test_data["InGroup"] = test_data["GroupId"].apply(lambda x: x in groups.values)

In [None]:
train_data["InGroup"].value_counts()

In [None]:
# map each GroupId to its frequency; the counts are computed once instead of once per row
train_data["GroupSize"] = train_data["GroupId"].map(train_data["GroupId"].value_counts())
test_data["GroupSize"] = test_data["GroupId"].map(test_data["GroupId"].value_counts())

In [None]:
train_data["GroupSize"].value_counts()

It might be interesting to find out how many of the passengers in the same group were also in the same cabin. These are possibly also a family.
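
A minimal sketch of that check, on made-up rows. The `SharesCabinWithGroup` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy rows (made up): group 0003 has two passengers in one cabin and one elsewhere.
toy = pd.DataFrame({
    "GroupId": ["0001", "0002", "0002", "0003", "0003", "0003"],
    "Cabin":   ["B/0/P", "F/1/S", "F/1/S", "G/2/P", "G/2/P", "G/3/P"],
})

# For every row, count the passengers that share both its group and its cabin.
same_cabin_size = toy.groupby(["GroupId", "Cabin"])["Cabin"].transform("size")
toy["SharesCabinWithGroup"] = same_cabin_size > 1

print(toy)
print("Share of passengers rooming with a group member:",
      toy["SharesCabinWithGroup"].mean())
```

On the real data, rows carrying the placeholder cabin `Z/0000/Z` would probably need to be excluded first, since passengers with an unknown cabin are not necessarily roommates.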

<a id="3" class="anchor"></a>
## 3. Visualizing data
[Back to top](#0)<br>

Let's first take a look at the distribution of categorical values.

In [None]:
columns_to_plot = ["Destination", "VIP", "HomePlanet", "InGroup", "CryoSleep", "Transported"]
rows = 3
columns = 2
ix = 0
fig, axes = plt.subplots(rows, columns, figsize=(9, 7))
for row in range(rows):
    for col in range(columns):
        try:
            sns.countplot(data=train_data, x=columns_to_plot[ix], ax=axes[row][col])
            sns.despine()
            ix += 1
        except Exception:
            axes[row][col].set_visible(False)
plt.tight_layout()

* By far the vast majority of people were heading to TRAPPIST-1e
* Very few passengers were VIPs
* Earth was the most common HomePlanet with Mars and Europa more or less equivalently represented
* More than half of the passengers travelled alone
* Approx. 35% of the passengers were in CryoSleep
* Passengers had an overall even chance of being transported

Importantly, the **target** column doesn't exhibit any noticeable **class imbalance** so we won't need to do extra work on that.
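
The balance can also be read off numerically instead of from the plot. A small sketch on a made-up target series (the values and the `imbalance_ratio` name are illustrative assumptions, not results from the dataset):

```python
import pandas as pd

# Made-up target values standing in for train_data["Transported"].
target = pd.Series([True, False, True, True, False, False, True, False],
                   name="Transported")

# Relative frequency of each class.
shares = target.value_counts(normalize=True)
# Ratio between the most and the least frequent class; 1.0 means perfectly balanced.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print("Imbalance ratio:", imbalance_ratio)
```

A ratio close to 1 supports skipping resampling or class weighting.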

Now let's look at relations between these and being transported.

In [None]:
sns.countplot(data=train_data, x="HomePlanet", hue="Transported");

You are less likely to be transported if you are from Earth.

In [None]:
sns.countplot(data=train_data, x="Destination", hue="Transported");

Passengers headed to TRAPPIST-1e are less likely to have been transported. Passengers headed to 55 Cancri e are more likely.

In [None]:
sns.countplot(data=train_data, x="VIP", hue="Transported");

Being a VIP doesn't seem to significantly affect your chances of being transported.

In [None]:
sns.countplot(data=train_data, x="InGroup", hue="Transported");

Passengers travelling alone are less likely to have been transported.

In [None]:
sns.countplot(data=train_data, x="CryoSleep", hue="Transported");

Passengers in cryosleep were considerably more likely to have been transported.

In [None]:
sns.countplot(x="Side", data=train_data, hue="Transported");

Being Starboard-side meant more likelihood of being transported.

In [None]:
sns.countplot(x="Deck", data=train_data, hue="Transported");

The choice of deck does seem to have some effect on being transported (e.g. deck B increases the likelihood of transportation and deck F decreases it). 
Since starboard-side passengers were also more likely to be transported, we'd expect:
- passengers on deck B *and* starboard-side to be much _more_ likely to be transported. 
- passengers on deck F *and* port-side to be much _less_ likely to be transported.

Let's see.
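
The joint effect of deck and side can also be checked numerically rather than visually. A minimal sketch on made-up rows; on the notebook's data the same `groupby` would be applied to `train_data`.

```python
import pandas as pd

# Made-up rows with the same three columns used in the plots.
toy = pd.DataFrame({
    "Deck": ["B", "B", "B", "B", "F", "F", "F", "F"],
    "Side": ["P", "P", "S", "S", "P", "P", "S", "S"],
    "Transported": [True, False, True, True, False, False, True, False],
})

# Mean of a boolean column is the share of True values,
# so this gives the transport rate per (Deck, Side) cell.
rates = toy.groupby(["Deck", "Side"])["Transported"].mean().unstack("Side")
print(rates)
```

Each cell is the fraction of transported passengers for that deck and side, which makes the two expectations directly comparable.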

It might also make sense to consider assigning passengers in deck "T" to deck "F", as they look like outliers or just a mistake.

In [None]:
train_data["Deck"] = train_data["Deck"].apply(lambda x: "F" if x=="T" else x)
test_data["Deck"] = test_data["Deck"].apply(lambda x: "F" if x=="T" else x)


In [None]:
sns.catplot(x="Deck", data=train_data, hue="Transported", col="Side", kind="count")
sns.despine()

In [None]:
sns.countplot(x="GroupSize", data=train_data, hue="Transported");

This fits with our previous finding to an extent. 

Now let's take a look at numerical features.

In [None]:
sns.kdeplot(data=train_data, x="Age", hue="Transported", fill=True)
plt.title("Age distribution");

Age does seem to affect chances of being transported but not to a great extent. To turn this into a useful feature we could add an `AgeGroup` feature.

In [None]:
min_age, max_age = train_data["Age"].min(), train_data["Age"].max()
bins = np.linspace(min_age,max_age, 6)
print(bins)
labels = ["Child", "Young", "Middle", "Senior", "Elder"]
train_data["AgeGroup"] = pd.cut(train_data["Age"], bins=bins, labels=labels, include_lowest=True)
sns.countplot(data=train_data, x="AgeGroup", hue="Transported");

In [None]:
test_data["AgeGroup"] = pd.cut(test_data["Age"], bins=bins, labels=labels, include_lowest=True)

In [None]:
train_data["all"] = ""
sns.violinplot(data=train_data, y="Age", x="all", hue="Transported", split=True);
train_data.drop("all", axis=1, inplace=True)

In [None]:
data_to_plot = train_data.describe().columns
rows=3
cols=2

fig, axes = plt.subplots(rows, cols, figsize=(12,8)) 
ix = 0
for i in range(rows):
    for j in range(cols):
        sns.kdeplot(x=data_to_plot[ix], ax=axes[i][j], hue="Transported", data=train_data, fill=True)
        sns.despine()
        ix += 1
plt.tight_layout()

Expenditure does seem to affect chance of transportation, especially for certain types of expense.

In [None]:
data_to_plot = train_data.describe().columns
rows=3
cols=2

fig, axes = plt.subplots(rows, cols, figsize=(12,8)) 
ix = 0
for i in range(rows):
    for j in range(cols):
        sns.boxenplot(x=data_to_plot[ix], ax=axes[i][j], data=train_data)
        sns.despine()
        ix += 1
plt.tight_layout()

It might make sense to get rid of at least some of those **outliers** in the numerical columns.
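
One common way to do this is capping at the Tukey fences (quartiles plus or minus 1.5 times the interquartile range). The helper below is a sketch under that assumption; `cap_iqr` and its parameters are names introduced here, not part of the notebook.

```python
import pandas as pd

def cap_iqr(series, reference=None, k=1.5):
    """Clip a series to [Q1 - k*IQR, Q3 + k*IQR].

    The quartiles are taken from `reference` when given (e.g. the training
    column when capping the test column), otherwise from `series` itself.
    """
    ref = series if reference is None else reference
    q1, q3 = ref.quantile(0.25), ref.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up spending values with one extreme outlier.
spend = pd.Series([0.0, 0.0, 10.0, 20.0, 30.0, 40.0, 5000.0])
capped_spend = cap_iqr(spend)
print(capped_spend.tolist())
```

Here Q1 is 5 and Q3 is 35, so the upper fence is 80 and the value 5000 is capped to it. Note that if more than 75% of a column is zero, both quartiles are zero and the fences collapse to zero, so this kind of capping would flatten the whole column.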

In [None]:
sns.kdeplot(data=train_data, x="Spa");

In [None]:
capped = train_data.copy()
# quartiles are both taken from the training data
upper_limit = train_data["RoomService"].quantile(0.75)
lower_limit = train_data["RoomService"].quantile(0.25)
iqr = upper_limit - lower_limit
upper_limit += iqr * 1.5
lower_limit -= iqr * 1.5
# cap values outside the fences, always comparing the RoomService column itself
capped["RoomService"] = np.where(capped["RoomService"] > upper_limit, upper_limit, capped["RoomService"])
capped["RoomService"] = np.where(capped["RoomService"] < lower_limit, lower_limit, capped["RoomService"])

In [None]:
print(capped["RoomService"].skew(), train_data["RoomService"].skew())

In [None]:
fig, axes = plt.subplots(1, 2)
# original data on the left, capped data on the right
sns.kdeplot(data=train_data, x="RoomService", ax=axes[0])
axes[0].set_title("With outliers")
sns.kdeplot(data=capped, x="RoomService", ax=axes[1])
axes[1].set_title("Without outliers")
plt.tight_layout()


In [None]:
sns.boxenplot(data=capped, x="RoomService");

In [None]:
'''
for col in ['RoomService','FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']:
    upper_limit = train_data[col].quantile(0.75)
    lower_limit = train_data[col].quantile(0.25)
    iqr = upper_limit - lower_limit
    upper_limit += iqr * 1.5
    lower_limit -= iqr * 1.5
    train_data[col] = np.where(train_data[col] > upper_limit, upper_limit, train_data[col])
    train_data[col] = np.where(train_data[col] < lower_limit, lower_limit, train_data[col])
'''

In [None]:
"""
for col in ['RoomService','FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']:
    train_data[col] = np.log(1 + train_data[col])
    test_data[col] = np.log(1+ test_data[col])
"""

Now let's drop columns we won't need and reorder our dataframes to get to the modelling part.

In [None]:
train_data = train_data[['HomePlanet', 'CryoSleep', 'Destination', 'AgeGroup', 'VIP', 'RoomService',
       'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Deck',
       'Side', 'InGroup', 'GroupSize', 'Age', 'Transported']]

In [None]:
test_data = test_data[['HomePlanet', 'CryoSleep', 'Destination', 'AgeGroup', 'VIP', 'RoomService',
       'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Deck',
       'Side', 'InGroup', 'GroupSize', 'Age']]

<a id="4"></a><br>
## 4. Classification
[Back to top](#0)<br>

I am initializing an empty list to store the models (as tuples) together with their accuracy and F1 scores.

In [None]:
models = []

First I will make copies of the data. This will be useful to test different hypotheses without having to reprocess the data.

In [None]:
train_dataset = train_data.copy()
test_dataset = test_data.copy()

In [None]:
categoricals = ['HomePlanet', 'CryoSleep', 'Destination', 'Deck', 'Side', 'InGroup', 'AgeGroup']
numericals = ['VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'GroupSize', 'Age']

<a id="4.1"></a><br>
## 4.1 Logistic Regression
[Back to top](#0)<br>

### Training the model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve, KFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score, roc_curve

In [None]:
train_data.columns, test_data.columns

In [None]:
train_data.head()

In [None]:
transformer = ColumnTransformer([
    ("num", StandardScaler(), numericals),
    ("cat", OneHotEncoder(), categoricals),
])

In [None]:
pipeline = Pipeline([
    ('transformer', transformer),
    ('classifier', LogisticRegression(max_iter=500))
])

In [None]:
X = train_data.iloc[:, :-1]
y = train_data.iloc[:, -1]

In [None]:
X

In [None]:
cv = KFold(n_splits=10, random_state=42, shuffle=True)

In [None]:
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [None]:
print("Mean accuracy of K-fold cross validation: {:.2f} %".format(np.mean(scores) * 100))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

In [None]:
pipeline.fit(X_train, y_train)

### Analyzing the results

In [None]:
preds = pipeline.predict(X_test)

Now we can look at the results of the model's predictions. Note that since the target classes are balanced, plain accuracy would be ok too, but we use the **F1-score** as the most relevant metric.
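
As a reminder of how these metrics relate to the confusion matrix, here is a small worked example in plain Python. The counts are made up, and `precision_recall_f1` is a helper name introduced only for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts: 30 true positives, 10 false positives, 10 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=10)
print(precision, recall, f1)
```

When false positives and false negatives are equally frequent, as in this example, precision, recall and F1 coincide.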

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, preds);

In [None]:
print(f"Training set accuracy score: { accuracy_score(y_train, pipeline.predict(X_train))*100:.2f}%")

In [None]:
print("Validation scores")
print("="*25)
print(f"Accuracy score: { accuracy_score(y_test, preds)*100:.2f}")
print(f"Precision score: { precision_score(y_test, preds)*100:.2f}")
print(f"Recall score: { recall_score(y_test, preds)*100:.2f}")
print(f"F-1 score: {f1_score(y_test, preds)*100:.2f}")

In [None]:
logreg_pipeline = pipeline
models.append((logreg_pipeline, accuracy_score(y_test, preds), f1_score(y_test, preds)))

We could also analyze false positives and false negatives.

In [None]:
FN = X_test[(preds==False) & (y_test==True)]
FP = X_test[(preds==True) & (y_test==False)]

In [None]:
diff_fn = FN.describe()-X_test.describe()
diff_fn.loc[["mean", "std"]]

In [None]:
diff_fp = FP.describe()-X_test.describe()
diff_fp.loc[["mean", "std"]]

Let's see if there is any one feature patently responsible for the false negatives.

In [None]:
fig, axes = plt.subplots(1,2)
sns.countplot(data=FN, x="VIP", ax=axes[0])
sns.countplot(data=test_data, x="VIP", ax=axes[1]);
axes[0].set_title("False Negatives")
axes[1].set_title("Test Data")
plt.tight_layout()

In [None]:
col = "Deck"
fig, axes = plt.subplots(1,2)
sns.countplot(data=FN, x=col, ax=axes[0], order=FN[col].value_counts().index)
sns.countplot(data=test_data, x=col, ax=axes[1], order=train_data[col].value_counts().index)
axes[0].set_title("False Negatives")
axes[1].set_title("Test Data")
plt.suptitle(col)
plt.tight_layout();

In [None]:
col="CryoSleep"
fig, axes = plt.subplots(1,2)
sns.countplot(data=FN, x=col, ax=axes[0], order=FN[col].value_counts().index)
sns.countplot(data=test_data, x=col, ax=axes[1], order=train_data[col].value_counts().index)
axes[0].set_title("False Negatives")
axes[1].set_title("Test Data")
plt.suptitle(col)
plt.tight_layout();

This could actually be interesting. It's as though `CryoSleep` was underweighted. I have no idea how to manage this yet.
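
One possible way to quantify this, offered as a sketch rather than a fix, is to compare the false-negative rate per `CryoSleep` value. The rows below are made up, and the `y_true`, `y_pred` and `is_fn` column names are assumptions introduced here.

```python
import pandas as pd

# Made-up validation rows: feature value, true label, predicted label.
val = pd.DataFrame({
    "CryoSleep": [True, True, True, False, False, False, False, False],
    "y_true":    [True, True, False, True, True, True, True, False],
    "y_pred":    [True, False, False, True, True, True, False, True],
})

# A false negative is an actually transported passenger predicted as not transported.
val["is_fn"] = val["y_true"] & ~val["y_pred"]

# False-negative rate among actually transported passengers, per CryoSleep value.
fn_rate = val[val["y_true"]].groupby("CryoSleep")["is_fn"].mean()
print(fn_rate)
```

In the notebook, a comparable frame could presumably be built with `X_test.assign(y_true=y_test, y_pred=preds)`. A clearly higher rate for one `CryoSleep` value would back up the impression from the plots.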

And finally, let's see which features were the most relevant.

In [None]:
coefficient_importance = list(zip(pipeline["transformer"].get_feature_names_out(), pipeline["classifier"].coef_[0]))
coefficient_importance.sort(key=lambda x: x[1])
coefficient_importance

Most of these results are consistent with our EDA. 

In [None]:
sns.countplot(data=train_data, x="HomePlanet", hue="Transported");

In [None]:
sns.countplot(data=train_data, x="Deck", hue="Transported");

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,5))
sns.kdeplot(data=train_data, x="FoodCourt", hue="Transported", fill=True, ax=axes[0]);
sns.kdeplot(data=train_data, x="Spa", hue="Transported", fill=True, ax=axes[1]);
plt.tight_layout()

<a id="4.2"></a>
## 4.2 Random Forest Classifier
[Back to top](#0)<br>

Reference: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from pprint import pprint

In [None]:
X, y = train_dataset.copy().iloc[:, :-1], train_dataset.copy().iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

- We are going to start with a base estimator for benchmarking
- Then we are going to narrow down the space of hyperparameters with `RandomizedSearchCV`
- Finally we will fine-tune the hyperparameters using `GridSearchCV`

In [None]:
base_pipeline = Pipeline([
    ('transformer', 
         ColumnTransformer([
            ("cat", OneHotEncoder(), categoricals),
            ("num", StandardScaler(), numericals),
        ])
    ),
    ('classifier', RandomForestClassifier(n_estimators=10))
])
base_pipeline.fit(X_train, y_train)
base_preds = base_pipeline.predict(X_test)
base_accuracy = accuracy_score(y_test, base_preds)
print("Base model accuracy: {:.2f} %".format(base_accuracy*100))

In [None]:
rf_pipeline = Pipeline([
    ('transformer', transformer),
    ('classifier', RandomForestClassifier())
])

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed for classifiers in scikit-learn 1.3; it was equivalent to 'sqrt')
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 6]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'classifier__n_estimators': n_estimators,
               'classifier__max_features': max_features,
               'classifier__max_depth': max_depth,
               'classifier__min_samples_split': min_samples_split,
               'classifier__min_samples_leaf': min_samples_leaf,
               'classifier__bootstrap': bootstrap}
pprint(random_grid)

In [None]:
rf_pipeline = Pipeline([
    ('transformer', 
         ColumnTransformer([
            ("cat", OneHotEncoder(handle_unknown = "ignore"), categoricals),
            ("num", StandardScaler(), numericals),
        ])
    ),
    ('classifier', RandomForestClassifier())
])

In [None]:
rf_random = RandomizedSearchCV(estimator = rf_pipeline, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=1, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

In [None]:
best_random = rf_random.best_estimator_
best_random.fit(X_train, y_train)
random_preds = best_random.predict(X_test)
random_accuracy = accuracy_score(y_test, random_preds)
print("Random accuracy: {:.2f} %".format(random_accuracy*100))

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))

In [None]:
rf_random.best_params_

In [None]:
param_grid = {
    'classifier__bootstrap': [True],
    'classifier__max_depth': [110, 220, 80],
    'classifier__max_features': ['sqrt'],  # 'auto' meant 'sqrt' and is gone in scikit-learn >= 1.3
    'classifier__min_samples_leaf': [4, 6, 8],
    'classifier__min_samples_split': [2, 4, 6, 12],
    'classifier__n_estimators': [377, 450, 800]
}


grid_pipeline = Pipeline([
    ('transformer', 
         ColumnTransformer([
            ("cat", OneHotEncoder(handle_unknown = "ignore"), categoricals),
            ("num", StandardScaler(), numericals),
        ])
    ),
    ('classifier', RandomForestClassifier())
])

In [None]:
grid_search = GridSearchCV(estimator = grid_pipeline, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 1)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
best_grid = grid_search.best_estimator_

In [None]:
grid_preds = best_grid.predict(X_test)
grid_accuracy = accuracy_score(y_test, grid_preds)
grid_f1 = f1_score(y_test, grid_preds)
print("Grid accuracy: {:.2f} %".format(grid_accuracy*100))
print("Grid f1: {:.2f} %".format(grid_f1*100))

In [None]:
rf_pipeline = best_grid
models.append((rf_pipeline, grid_accuracy, grid_f1))

In [None]:
print('Improvement of {:0.2f}% vs. base.'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy))
print('Improvement of {:0.2f}% vs. random.'.format( 100 * (grid_accuracy - random_accuracy) / random_accuracy))

<a id="4.3"></a>
## 4.3 Gradient Boosting Classifier
[Back to top](#0)<br>

In [None]:
from lightgbm import LGBMClassifier

In [None]:
lgbm_pipeline = Pipeline([
    ("transformer", transformer),
    ("classifier", LGBMClassifier())
])

In [None]:
lgbm_pipeline.fit(X_train, y_train)

In [None]:
lgbm_preds = lgbm_pipeline.predict(X_test)

In [None]:
lgbm_accuracy = accuracy_score(y_test, lgbm_preds)
lgbm_f1 = f1_score(y_test, lgbm_preds)
print(lgbm_accuracy, lgbm_f1)

In [None]:
print("LGBM accuracy: {:.2f} %".format(lgbm_accuracy*100))
print("LGBM f1: {:.2f} %".format(lgbm_f1*100))

In [None]:
models.append((lgbm_pipeline, lgbm_accuracy, lgbm_f1))

<a id="5"></a>
## 5. Model choice and submission
[Back to top](#0)<br>

In [None]:
models.sort(key=lambda x: x[2], reverse=True)

In [None]:
models

In [None]:
best_model = models[0]
final_predictor = best_model[0]
final_predictor_name = str(final_predictor.get_params()["classifier"])
print(final_predictor_name + "\nf1 score: "+ str(best_model[2]))

In [None]:
models_df = pd.DataFrame.from_dict({
    "Models": [str(model[0].get_params()["classifier"]).split("(")[0] for model in models],
    "Accuracy":[ model[1] for model in models],
    "F1_score": [model[2] for model in models]
})

models_df

In [None]:
final_data = test_data

In [None]:
to_submit = final_predictor.predict(final_data)

In [None]:
to_submit = pd.DataFrame(to_submit, columns=["Transported"])
to_submit.head()

In [None]:
submission = pd.concat([pd.read_csv("/kaggle/input/spaceship-titanic/test.csv"), pd.DataFrame(to_submit)], axis=1)[["PassengerId", "Transported"]]

In [None]:
submission.to_csv("submission.csv", index=False)