# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing model you have so far and fine-tune its hyperparameters.
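As a quick illustration of what "default values" means here, the sketch below prints the defaults of the hyperparameters tuned later in this lab. It assumes a recent scikit-learn release; the list of names is chosen to match the grid used further down.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate with no arguments so every hyperparameter takes its default value
rf = RandomForestClassifier()
defaults = rf.get_params()

# Print only the hyperparameters that are tuned later in this lab
tuned_names = ["n_estimators", "max_depth", "min_samples_split",
               "min_samples_leaf", "criterion"]
for name in tuned_names:
    print(f"{name}: {defaults[name]}")
```

Any value not passed explicitly to the constructor keeps the default shown by `get_params()`.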

In [12]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [13]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now apply the same steps as before:
- Feature Scaling
- Feature Selection


In [14]:
# 1. Decompose Cabin
if 'Cabin' in spaceship.columns:
    spaceship[['Deck', 'Num', 'Side']] = spaceship['Cabin'].str.split('/', expand=True)

# 2. Create Group ID and Group Size
spaceship['Group_ID'] = spaceship['PassengerId'].str.split('_').str[0]
group_sizes = spaceship['Group_ID'].value_counts()
spaceship['Group_Size'] = spaceship['Group_ID'].map(group_sizes)

# 3. Create Total Spent
spent_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
spaceship['Total_Spent'] = spaceship[spent_cols].sum(axis=1)

# 4. Convert VIP from object/boolean to float (True -> 1.0, False -> 0.0); missing values stay NaN for later imputation
spaceship['VIP'] = spaceship['VIP'].astype(float)

# --- Feature Selection ---
# Drop identifier columns and intermediate columns that are no longer needed
cols_to_drop = ['PassengerId', 'Name', 'Cabin', 'Group_ID', 'Num']
spaceship.drop(columns=cols_to_drop, inplace=True)

# Separate features (X) and target (y)
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

# Identify feature types for preprocessing
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include='object').columns.tolist()



In [15]:
# Define preprocessor steps for the pipeline
# Numerical features: impute with the median, then standardize
# Categorical features: impute with the most frequent value, then one-hot encode
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create the preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep other columns if any, but in this case, all are handled
)

- Now let's use the best model we got so far to see how much it improves when we fine-tune its hyperparameters.

In [16]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 2. Define the model pipeline (Preprocessor + Classifier)
model_default = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)) # Default hyperparameters
])

# Train the default model
model_default.fit(X_train, y_train)

- Evaluate your model

In [17]:
# 3. Evaluate the default model
y_pred_default = model_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

print(f"✅ Baseline Random Forest Accuracy (Default HPs): {accuracy_default:.4f}")

✅ Baseline Random Forest Accuracy (Default HPs): 0.7982


**Grid/Random Search**

For this lab we will use Grid Search.
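For comparison, the random-search alternative can be sketched as follows. This is a minimal example on synthetic data, not part of the lab's workflow: `X_demo`, `y_demo`, the sampling ranges, and `n_iter=5` are all assumptions made for illustration. Instead of trying every combination, `RandomizedSearchCV` samples a fixed number of combinations from the given lists or distributions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption); the lab itself uses X_train, y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Lists are sampled uniformly; randint(low, high) samples integers in [low, high)
param_distributions = {
    "n_estimators": randint(20, 61),
    "max_depth": [5, 10, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
    "criterion": ["gini", "entropy"],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled combinations (illustrative value)
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
```

When the estimator is a pipeline, the keys would need the step-name prefix (for example `classifier__n_estimators`), exactly as with Grid Search.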

- Define hyperparameters to fine tune.

In [19]:
# 5. Define the parameter grid for Grid Search
# Note: Hyperparameter names must be prefixed with the pipeline step name ('classifier__')
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [5, 10, None], # None = no depth limit: nodes expand until leaves are pure or hold fewer than min_samples_split samples
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__criterion': ['gini', 'entropy']
}

# Define the full pipeline structure for Grid Search
# We use the same preprocessor as before
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
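Before launching a search, it can help to check that every grid key is a valid pipeline parameter and to count how many fits the grid implies. The sketch below is self-contained and uses a stand-in pipeline (a `StandardScaler` in place of the lab's preprocessor, which is an assumption); the names `demo_pipeline` and `demo_grid` are illustrative only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline (assumption): a scaler replaces the lab's full preprocessor
demo_pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

demo_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [5, 10, None],
    "classifier__min_samples_split": [2, 5, 10],
    "classifier__min_samples_leaf": [1, 2, 4],
    "classifier__criterion": ["gini", "entropy"],
}

# Every grid key must be a parameter name the pipeline recognises
valid_names = demo_pipeline.get_params().keys()
unknown = [name for name in demo_grid if name not in valid_names]
print("Unknown parameter names:", unknown)

# Number of combinations: 3 * 3 * 3 * 3 * 2
n_candidates = len(ParameterGrid(demo_grid))
cv_folds = 5
print(f"{n_candidates} candidates x {cv_folds} folds = {n_candidates * cv_folds} fits")
```

The fit count grows multiplicatively with each added value, so this check gives a rough idea of the running time before the search starts.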

- Run Grid Search

In [None]:
# 6. Run Grid Search
grid_search = GridSearchCV(
    estimator=rf_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit the Grid Search to the training data
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 162 candidates, totalling 810 fits


In [23]:
# Retrieve the best model and parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score_cv = grid_search.best_score_

print("---")
print(f"Best Cross-Validation Accuracy: {best_score_cv:.4f}")
print(f"Optimal Hyperparameters: {best_params}")
print("---")

---
Best Cross-Validation Accuracy: 0.8005
Optimal Hyperparameters: {'classifier__criterion': 'entropy', 'classifier__max_depth': None, 'classifier__min_samples_leaf': 4, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 300}
---
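Beyond the single best combination, the full search results can be inspected through `cv_results_`. The sketch below shows one way to rank candidates; it runs on synthetic data with a deliberately tiny grid (both are assumptions, chosen so the example runs quickly), and `demo_search` stands in for the lab's `grid_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); the lab itself uses X_train, y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Deliberately small grid so the example runs in a few seconds
small_grid = {"n_estimators": [20, 40], "min_samples_leaf": [1, 4]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to sort and read
results = pd.DataFrame(demo_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = results[columns].sort_values("rank_test_score").reset_index(drop=True)
print(top)
```

Comparing `mean_test_score` against `std_test_score` for the top few rows gives a sense of whether the best candidate is clearly ahead or effectively tied with its neighbours.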


- Evaluate your model

In [24]:
# 7. Evaluate the best model on the test set
y_pred_tuned = best_model.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)

print(f"Tuned Random Forest Test Accuracy: {accuracy_tuned:.4f}")

# Comparison
print(f"\n Improvement from Default: {accuracy_tuned - accuracy_default:.4f}")

Tuned Random Forest Test Accuracy: 0.8068

 Improvement from Default: 0.0086


In [26]:
# Improvement: 0.0086, i.e. 0.86 percentage points of test accuracy.
# In a competitive machine learning task, even a small gain like this can be worth the effort,
# although a difference this small on a single test split may partly reflect sampling noise.
# The tuned Random Forest test accuracy of 0.8068 is a respectable score for the Spaceship Titanic dataset.

In [27]:
# The best model uses a large number of trees (300) with no depth limit (max_depth=None);
# overfitting is limited by requiring at least 4 samples in every leaf node (min_samples_leaf=4).