# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [70]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [71]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [72]:
spaceship.shape

(8693, 14)

**Check for data types**

In [73]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [74]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column is missing only about 2% of its values, we will drop rows containing any missing value.
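For comparison, the second strategy (filling with the mean or the mode) can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the approach used in this lab; the `toy` and `imputed` names and their values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the spaceship data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

imputed = toy.copy()
# Continuous column: fill with the mean of the observed values
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())
# Categorical column: fill with the most frequent value (the mode)
imputed["HomePlanet"] = imputed["HomePlanet"].fillna(imputed["HomePlanet"].mode()[0])

print(imputed)
```

Imputation keeps every row, at the cost of pulling the filled columns slightly toward their central values.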

In [75]:
spaceship.dropna(inplace=True)

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [76]:

len(spaceship["Cabin"].unique())

5305

In [77]:
spaceship['cabin'] = spaceship['Cabin'].str[0]

In [78]:

print(spaceship[['Cabin', 'cabin']].head())


   Cabin cabin
0  B/0/P     B
1  F/0/S     F
2  A/0/S     A
3  A/0/S     A
4  F/1/S     F
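Taking the first character works because the deck letter leads the cabin string. Assuming the values follow a `deck/num/side` pattern (as `B/0/P` suggests), a hedged alternative is to split on `/`, which also exposes the other two parts. The `cabins` frame and the column names `deck`, `num`, `side` are illustrative choices, not part of the lab.

```python
import pandas as pd

# A few cabin strings copied from the head() output above
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split "deck/num/side" into three columns; expand=True returns a DataFrame
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

print(parts["deck"].tolist())  # ['B', 'F', 'A', 'F']
```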


In [79]:
spaceship.drop('Cabin', axis=1, inplace=True)

- Drop PassengerId and Name

In [80]:
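# Drop identifier columns that carry no predictive signal
spaceship.drop(['Name', 'PassengerId'], axis=1, inplace=True)
# Note: this also drops the derived deck letter, so it is not used as a feature below
spaceship.drop('cabin', axis=1, inplace=True)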


- For non-numerical columns, create dummy variables.

In [81]:
columns_to_encode = ['HomePlanet', 'VIP', 'Destination', 'CryoSleep']

# Create dummies and drop the original columns
spaceship = pd.get_dummies(spaceship, columns=columns_to_encode, drop_first=True)
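To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up column. With three categories, the full encoding produces three indicator columns, while `drop_first=True` keeps two and treats the first category (alphabetically) as the implicit baseline. The `toy`, `full`, and `reduced` names are illustrative.

```python
import pandas as pd

# Toy column with three categories (values are made up for illustration)
toy = pd.DataFrame({"HomePlanet": ["Europa", "Earth", "Mars", "Earth"]})

# One indicator column per category
full = pd.get_dummies(toy, columns=["HomePlanet"])
# Same encoding, minus the first category, which becomes the baseline
reduced = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

print(list(full.columns))     # ['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars']
print(list(reduced.columns))  # ['HomePlanet_Europa', 'HomePlanet_Mars']
```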

**Perform Train Test Split**

In [82]:
# Define target and features (train_test_split was imported in the first cell)
X = spaceship.drop(['Spa'], axis=1)  # Features
y = spaceship['Spa']  # Target

# 30% test size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
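The split above is random, so scores will differ from run to run. A minimal sketch of how a fixed seed makes the split repeatable, using made-up arrays; the seed value `42` and the `X_demo`/`y_demo` names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same random_state -> identical partitions on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```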


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [83]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Features and target ('Segment' is not a column here; errors='ignore' makes dropping it a no-op)
X = spaceship.drop(columns=['Spa', 'Segment'], errors='ignore')
y = spaceship['Spa']

# Train-test split
# Note: this re-split (25% test) replaces the 30% split created in the previous cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Data Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize KNN Regressor
knn_regressor = KNeighborsRegressor(n_neighbors=5)  # Default is 5 neighbors

# Train the Model
knn_regressor.fit(X_train_scaled, y_train)

# Predictions
y_pred = knn_regressor.predict(X_test_scaled)

r2 = r2_score(y_test, y_pred)

print(f"R2 Score: {r2}")

R2 Score: 0.2287005110256527
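R² alone does not show how large the errors are in the target's own units. As a hedged sketch, MAE and RMSE can be reported next to it; the arrays below are made up for illustration and are not the lab's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, in spending units
y_true = np.array([0.0, 0.0, 10.0, 500.0])
y_hat = np.array([5.0, 0.0, 20.0, 300.0])

mae = mean_absolute_error(y_true, y_hat)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # penalizes large misses more heavily
r2 = r2_score(y_true, y_hat)                        # share of variance explained

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R2: {r2:.3f}")
```

In the notebook, the same three calls would take `y_test` and `y_pred`. An RMSE much larger than the MAE is a sign that a few large errors dominate, which is typical for a skewed target.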


- Evaluate your model's performance and comment on it.

The KNN model achieved an R² score of about 0.23, indicating weak predictive power for spa spending. Because the train-test split is not seeded, the exact value varies between runs. Several factors may contribute to the low score: the target variable is highly skewed, likely because spa services are a luxury; the dummy variables add dimensions that can make KNN's distance measure less informative; and some features may have little relevance to spa spending.
Improving feature quality and relevance could yield higher scores.
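One common way to address a skewed, non-negative target is to model `log1p` of it and map predictions back with `expm1`. The sketch below only demonstrates the transform on made-up spending values; whether it would improve this model is untested, and the `spend` series is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up spending values: many zeros and a few large amounts
spend = pd.Series([0.0, 0.0, 0.0, 25.0, 549.0, 6715.0])

# log1p(x) = log(1 + x) is defined at zero and compresses large values
log_spend = np.log1p(spend)

# The transformed series is less skewed than the raw one
print(spend.skew() > log_spend.skew())  # True

# expm1 inverts log1p, so predictions can be mapped back to spending units
recovered = np.expm1(log_spend)
```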