# LAB | Intro to Machine Learning

**Load the data**

In this challenge, we will work with the Spaceship Titanic dataset. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [2]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [4]:
# Check the shape of the data
print(spaceship.shape)

(8693, 14)


**Check for data types**

In [5]:
# Check data types for all columns
print(spaceship.dtypes)

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object


**Check for missing values**

In [7]:
# Count missing values per column
missing_values = spaceship.isnull().sum()

# Display the counts
print(missing_values)

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical columns).
- Filling all missing values with an algorithm.

For this exercise, because each column has relatively few null values (about 2% of its rows), we will drop every row that contains a missing value.

In [9]:
# Drop rows with any missing values
spaceship_cleaned = spaceship.dropna()

# Verify the new shape
print(f"Original shape: {spaceship.shape}")
print(f"Cleaned shape: {spaceship_cleaned.shape}")

# Confirm no missing values remain
print(f"Missing values remaining: {spaceship_cleaned.isnull().sum().sum()}")

Original shape: (8693, 14)
Cleaned shape: (6606, 14)
Missing values remaining: 0
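
Dropping rows removed 2,087 of the 8,693 rows (about 24%), so the fill strategy listed above may be worth trying as an alternative. The following is a minimal sketch on a small made-up DataFrame, not the Spaceship Titanic data itself; the `toy` frame and its values are assumptions for illustration. It fills a continuous column with its mean and a categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real dataset), with one gap per column
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Europa", "Earth", None],
})

toy_filled = toy.copy()

# Continuous column: replace missing values with the column mean
toy_filled["Age"] = toy_filled["Age"].fillna(toy_filled["Age"].mean())

# Categorical column: replace missing values with the most frequent value
toy_filled["HomePlanet"] = toy_filled["HomePlanet"].fillna(
    toy_filled["HomePlanet"].mode()[0]
)

print(toy_filled)
print(f"Missing values remaining: {toy_filled.isnull().sum().sum()}")
```

The same two `fillna` calls could be applied per column to `spaceship` in place of `dropna()`, which would keep all rows at the cost of introducing estimated values.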


**KNN**

K Nearest Neighbors is a distance-based algorithm, and requires all **input data to be numerical.**
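
To make the role of distance concrete, here is a minimal sketch of the idea behind a KNN classifier, written in plain Python rather than with scikit-learn. The helper `knn_predict` and the tiny dataset are hypothetical, and Euclidean distance is assumed as the metric. A new point receives the majority label of its `k` closest training points, which is why every feature must be a number.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features per passenger
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
train_labels = [False, False, False, True, True, True]

print(knn_predict(train_points, train_labels, query=(1.2, 1.1)))
print(knn_predict(train_points, train_labels, query=(8.7, 9.2)))
```

The first query sits next to the `False` cluster and the second next to the `True` cluster, so the votes go to those labels respectively.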

Let's only select numerical columns as our features.

In [10]:
# 1. Separate the target (y)
y = spaceship_cleaned['Transported']

# 2. Select only numerical columns for features (X)
# This excludes PassengerId, HomePlanet, Cabin, etc.
X = spaceship_cleaned.select_dtypes(include=['int64', 'float64'])

# 3. Check the results
print("Numerical features selected:")
print(X.columns.tolist())
print(f"\nShape of X: {X.shape}")

Numerical features selected:
['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

Shape of X: (6606, 6)


Let's also define our target.

In [11]:
# 1. Define the target (y)
y = spaceship_cleaned['Transported']

# 2. Define the features (X)
# Only the numerical columns are used
X = spaceship_cleaned.select_dtypes(include=['int64', 'float64'])

# Verify the definitions
print(f"Target (y) shape: {y.shape}")
print(f"Features (X) shape: {X.shape}")
print(f"Features being used: {X.columns.tolist()}")

Target (y) shape: (6606,)
Features (X) shape: (6606, 6)
Features being used: ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']


**Train Test Split**

Now that we have split the data into **features** and **target** variables and imported the **train_test_split** function, split X and y into X_train, X_test, y_train, and y_test. 80% of the data should be in the training set and 20% in the test set.

In [12]:
# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Verify the split
print(f"Training features (X_train): {X_train.shape}")
print(f"Testing features (X_test):   {X_test.shape}")
print(f"Training target (y_train):   {y_train.shape}")
print(f"Testing target (y_test):     {y_test.shape}")

Training features (X_train): (5284, 6)
Testing features (X_test):   (1322, 6)
Training target (y_train):   (5284,)
Testing target (y_test):     (1322,)
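
The sizes above follow from simple arithmetic. As a rough sketch, assuming the test count is rounded up and the remainder goes to training (which is consistent with the shapes printed above):

```python
import math

n_samples = 6606   # rows left after dropping missing values
test_size = 0.20

# Assumption: the test count is rounded up, the rest is used for training
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 5284 1322
```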


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

You need to choose between a **Classifier** and a **Regressor**. Consider the type of the target variable when deciding.
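
One way to reason about this choice is to look at the target's data type: a boolean or categorical target points to classification, a continuous numeric target to regression. The helper below, `suggest_knn_model`, is hypothetical and only a rule of thumb; numeric targets that encode classes (for example 0/1 labels) would still need a classifier.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def suggest_knn_model(target: pd.Series) -> str:
    """Rule of thumb (hypothetical helper): pick a KNN flavor from the target dtype."""
    # Check bool first, because pandas also treats bool as numeric
    if is_bool_dtype(target):
        return "KNeighborsClassifier"
    if is_numeric_dtype(target):
        return "KNeighborsRegressor"
    # Strings and other categories are discrete labels
    return "KNeighborsClassifier"


# Hypothetical targets for illustration
print(suggest_knn_model(pd.Series([True, False, True])))
print(suggest_knn_model(pd.Series([12.5, 30.1, 7.8])))
```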

Initialize a KNN instance without setting any hyperparameter.

In [13]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN Classifier with default hyperparameters
knn = KNeighborsClassifier()

# Verify the instance
print(knn)

KNeighborsClassifier()


Fit the model to your data.

In [14]:
# Fit the model using the training data
knn.fit(X_train, y_train)

print("The KNN model has been successfully trained.")

The KNN model has been successfully trained.


Evaluate your model.

In [15]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict the target for the test set
y_pred = knn.predict(X_test)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")

# Display the Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Display the Classification Report (Precision, Recall, F1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Model Accuracy: 77.16%

Confusion Matrix:
[[483 170]
 [132 537]]

Classification Report:
              precision    recall  f1-score   support

       False       0.79      0.74      0.76       653
        True       0.76      0.80      0.78       669

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322
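
The figures in the report can be reproduced by hand from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes, both ordered `False`, `True`; the variable names are chosen here for illustration.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 483, 170   # actual False: predicted False / predicted True
fn, tp = 132, 537   # actual True:  predicted False / predicted True

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_true = tp / (tp + fp)   # of all predicted True, how many were right
recall_true = tp / (tp + fn)      # of all actual True, how many were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"Accuracy:         {accuracy:.2%}")
print(f"Precision (True): {precision_true:.2f}")
print(f"Recall (True):    {recall_true:.2f}")
print(f"F1-score (True):  {f1_true:.2f}")
```

The results match the `True` row of the report (0.76, 0.80, 0.78) and the 77.16% accuracy; swapping the roles of the two classes gives the `False` row in the same way.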



**Congratulations, you have just developed your first Machine Learning model!**