# LAB | Intro to Machine Learning

**Load the data**

In this challenge, we will be working with Spaceship Titanic data. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The column metadata is described here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [3]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [5]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [6]:
#your code here
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical columns).
- Filling all missing values with an algorithm.

For this exercise, because each column is missing only about 2% of its values, we will drop every row that contains any missing value.

In [None]:
#your code here
# Drop rows with any missing values
spaceship_clean = spaceship.dropna().reset_index(drop=True)

# Quick check
print("Before:", spaceship.shape, "After:", spaceship_clean.shape)
print("Remaining missing values:", spaceship_clean.isna().sum().sum())

Before: (8693, 14) After: (6606, 14)
Remaining missing values: 0
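
As an alternative to dropping rows, the "fill with a value" strategy listed above could be sketched as follows. This is a minimal illustration on a small toy DataFrame with hypothetical values (not the Spaceship Titanic file), using the mean for a continuous column and the mode for a categorical one.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, only to illustrate the idea
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

filled = toy.copy()
# Continuous column: fill missing entries with the column mean
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
# Categorical column: fill missing entries with the most frequent value (mode)
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(filled)
print("Remaining missing values:", filled.isna().sum().sum())
```

Filling keeps every row, at the cost of introducing values that were never observed, so which strategy is preferable depends on the data.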


**KNN**

K Nearest Neighbors is a distance-based algorithm and requires all **input data to be numerical.**

Let's only select numerical columns as our features.

In [8]:
#your code here
# Keep only numeric feature columns for KNN
num_cols = spaceship_clean.select_dtypes(include=[np.number]).columns
X = spaceship_clean[num_cols].copy()

print(num_cols.tolist())
print("X shape:", X.shape)


['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
X shape: (6606, 6)
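
To see why a distance-based algorithm needs numeric inputs, and why the scale of each column matters, here is a small sketch of the Euclidean distance between the first two passengers shown in `head()`, using only their `Age` and `Spa` values. Euclidean distance is assumed here because it is the usual default for KNN; the helper variable names are illustrative.

```python
import numpy as np

# [Age, Spa] for the first two rows of the head() output
passenger_0 = np.array([39.0, 0.0])
passenger_1 = np.array([24.0, 549.0])

# Squared difference per feature, then the Euclidean distance
squared_diff = (passenger_0 - passenger_1) ** 2
distance = np.sqrt(squared_diff.sum())

# Share of the squared distance contributed by each feature
share = squared_diff / squared_diff.sum()

print("Distance:", distance)
print("Contribution of Age:", share[0])
print("Contribution of Spa:", share[1])
```

Almost all of the distance comes from `Spa`, simply because its values are much larger than ages. Standardizing the features before running KNN is one common way to keep a single column from dominating.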


Let's also define our target.

In [9]:
#your code here
# Target column (binary: True/False -> 1/0)
target_col = "Transported"
y = spaceship_clean[target_col].astype(int)   # make it numeric for KNN

print(y.value_counts().rename(index={1:'Transported', 0:'Not transported'}))
print("y shape:", y.shape)


Transported
Transported        3327
Not transported    3279
Name: count, dtype: int64
y shape: (6606,)


**Train Test Split**

Now that we have split the data into **features** and **target** variables and imported the **train_test_split** function, split X and y into X_train, X_test, y_train, and y_test. 80% of the data should be in the training set and 20% in the test set.

In [10]:
#your code here
# 80/20 train–test split (stratified for class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(5284, 6) (1322, 6) (5284,) (1322,)
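
The `stratify` argument used in the split keeps the class proportions the same in the training and test sets. A minimal sketch on a hypothetical, deliberately imbalanced target (80 zeros and 20 ones, not the lab data) makes the effect easier to see:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# With stratify, each subset keeps the same share of positives as the full data
print("Positive share, full :", y_toy.mean())
print("Positive share, train:", y_tr.mean())
print("Positive share, test :", y_te.mean())
```

In the Spaceship Titanic data the two classes are already close to balanced, so stratification changes little there, but it is a cheap safeguard.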


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

You need to choose between a **Classifier** and a **Regressor**. Consider the type of the target variable when deciding.
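
One way to check the kind of target you have is scikit-learn's `type_of_target` helper. The sketch below uses two small hypothetical targets rather than the lab data: a True/False column cast to integers, and a column of real-valued measurements.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets for illustration
binary_target = pd.Series([True, False, True, True]).astype(int)
continuous_target = pd.Series([39.5, 24.1, 58.7, 33.2])

binary_kind = type_of_target(binary_target)
continuous_kind = type_of_target(continuous_target)

print(binary_kind)      # a two-class target calls for a classifier
print(continuous_kind)  # a real-valued target calls for a regressor
```

A target with two discrete classes, like `Transported`, points to a classifier; a continuous target would point to a regressor.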

Initialize a KNN instance without setting any hyperparameter.

In [11]:
#your code here
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# KNN CLASSIFIER (scale features inside a pipeline)
knn_clf = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=7)  # n_neighbors=7 overrides the default of 5; you can tune this
)

# train
knn_clf.fit(X_train, y_train)

# quick check
acc = knn_clf.score(X_test, y_test)
print(f"Test accuracy: {acc:.3f}")


Test accuracy: 0.752


Fit the model to your data.

In [12]:
#your code here
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# KNN classifier (scale features inside the pipeline)
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=7)   # tune later if you like
)

# fit
knn.fit(X_train, y_train)

# quick eval
print("Train accuracy:", knn.score(X_train, y_train))
print("Test accuracy :", knn.score(X_test, y_test))



Train accuracy: 0.7973126419379258
Test accuracy : 0.7518910741301059
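
The comments above mention tuning `n_neighbors` later. One possible way to do that is a cross-validated grid search over the pipeline, sketched here on synthetic data from `make_classification` so the block runs on its own. The candidate values and the 5-fold setting are arbitrary assumptions, not recommendations from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (6 numeric features)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("Best n_neighbors:", best_k)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

With the lab data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`, leaving the test set untouched until the final evaluation.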


Evaluate your model.

In [None]:
#your code here
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

# predictions
y_pred = knn.predict(X_test)
y_proba = knn.predict_proba(X_test)[:, 1]  # for AUC

# metrics
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)

print(f"Accuracy : {acc:.4f}")
print(f"ROC AUC  : {auc:.4f}")

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion matrix:\n", cm)

print("\nClassification report:\n",
      classification_report(y_test, y_pred, digits=4))

Accuracy : 0.7519
ROC AUC  : 0.8130

Confusion matrix:
 [[467 189]
 [139 527]]

Classification report:
               precision    recall  f1-score   support

           0     0.7706    0.7119    0.7401       656
           1     0.7360    0.7913    0.7627       666

    accuracy                         0.7519      1322
   macro avg     0.7533    0.7516    0.7514      1322
weighted avg     0.7532    0.7519    0.7515      1322
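
To connect the numbers in the report to the confusion matrix, the sketch below recomputes accuracy and the class-1 precision, recall, and F1 by hand from the matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[467, 189],
               [139, 527]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of those predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)         # of those truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Accuracy   : {accuracy:.4f}")     # 0.7519
print(f"Precision 1: {precision_1:.4f}")  # 0.7360
print(f"Recall 1   : {recall_1:.4f}")     # 0.7913
print(f"F1 1       : {f1_1:.4f}")         # 0.7627
```

These values match the accuracy line and the row for class 1 in the classification report.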



**Congratulations, you have just developed your first Machine Learning model!**