# LAB | Ensemble Methods

**Load the data**

In this challenge, we will be working with the same Spaceship Titanic data, like the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods in order to see if can obtain a better model than before. In order to do a fair comparison, you should perform the same feature scaling, engineering applied in previous Lab.

In [6]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [7]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

# 1. Preprocessing (Required to handle NaNs and Strings)
# Drop non-predictive columns
df = spaceship.drop(['PassengerId', 'Cabin', 'Name'], axis=1)

# Separate Features (X) and Target (y)
X = df.drop('Transported', axis=1)
y = df['Transported'].astype(int)

# Identify numeric and categorical columns for imputation
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()

# Impute missing values (median for numbers, most frequent for categories)
X[numeric_cols] = SimpleImputer(strategy='median').fit_transform(X[numeric_cols])
X[categorical_cols] = SimpleImputer(strategy='most_frequent').fit_transform(X[categorical_cols])

# Convert categorical variables to numeric using One-Hot Encoding
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Train-Test Split (Ensures we fit our scaler/selector only on training data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- STEP 1: FEATURE SCALING ---
# Standardizing features to have a mean of 0 and a variance of 1
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# --- STEP 2: FEATURE SELECTION ---
# Selecting the top 10 most influential features using ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)

# Identify which features were kept
selected_features = X_train_scaled.columns[selector.get_support()].tolist()

print("Top 10 Selected Features:")
print(selected_features)

Top 10 Selected Features:
['Age', 'RoomService', 'FoodCourt', 'Spa', 'VRDeck', 'HomePlanet_Europa', 'HomePlanet_Mars', 'CryoSleep_True', 'Destination_TRAPPIST-1e', 'VIP_True']


**Perform Train Test Split**

In [9]:
# 1. Define Features (X) and Target (y)
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

# 2. Perform the split
# We'll use 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

# 3. Quick check of the results
print(f"Original dataset shape: {spaceship.shape}")
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")

Original dataset shape: (8693, 14)
Training features shape: (6954, 13)
Testing features shape: (1739, 13)


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [10]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging (bootstrap=True)
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1
)

# Pasting (bootstrap=False)
pasting_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1
)

- Random Forests

In [11]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the forest
# n_estimators = number of trees
# max_features = 'sqrt' is the standard for classification
rf_model = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)

# Fit the model (using the processed data from previous steps)
rf_model.fit(X_train_scaled, y_train)

# Check Accuracy
print(f"Training Score: {rf_model.score(X_train_scaled, y_train):.2f}")
print(f"Test Score: {rf_model.score(X_test_scaled, y_test):.2f}")

Training Score: 0.86
Test Score: 0.49


- Gradient Boosting

In [12]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the model
gb_model = GradientBoostingClassifier(
    n_estimators=100, 
    learning_rate=0.1, 
    max_depth=3, 
    random_state=42
)

# Fit the model
gb_model.fit(X_train_scaled, y_train)

# Evaluation
print(f"Gradient Boosting Test Accuracy: {gb_model.score(X_test_scaled, y_test):.4f}")

Gradient Boosting Test Accuracy: 0.4928


- Adaptive Boosting

In [13]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Initialize AdaBoost
# n_estimators: Number of stumps to build
# learning_rate: Controls how much the weights are adjusted each step
ada_model = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)

# Fit to our spaceship data
ada_model.fit(X_train_scaled, y_train)

print(f"AdaBoost Accuracy: {ada_model.score(X_test_scaled, y_test):.4f}")

AdaBoost Accuracy: 0.4836


Which model is the best and why?

For tabular data like this, Gradient Boosting (specifically variants like XGBoost or LightGBM) is generally considered the "best" choice. While Gradient Boosting usually wins on accuracy, Random Forest is often the "best" for a first-time build because it is incredibly hard to mess up. Gradient Boosting requires careful tuning of the learning rate and tree depth.