**Competition Goal:** Predict which passengers were transported to an alternate dimension during the Spaceship Titanic's collision with a spacetime anomaly.

**What You Will Learn in This Notebook:**
1. How to explore and understand a dataset before building any model
2. How to clean messy data (handle missing values, encode categories, scale numbers)
3. How to train five different machine learning models and compare them fairly
4. How to pick the best model and generate a Kaggle submission file

**Why This Order Matters:**
Think of machine learning like cooking. You cannot just throw raw ingredients into an oven and hope for the best. You need to wash them, chop them, measure them, and then cook them properly. Each step in this notebook exists for a reason, and skipping any step can ruin the final result. We will explain every step as we go.

---

## Section 1 - Import Libraries

Before writing any analysis code, we load the tools (libraries) we will need.

| Library | Purpose |
|---------|---------|
| pandas | Reading CSVs, manipulating tables of data |
| numpy | Fast math operations on arrays |
| matplotlib / seaborn | Drawing charts and visualizations |
| scikit-learn | Machine learning models, preprocessing, evaluation |
| xgboost | Gradient-boosted tree model (often among the strongest performers on tabular data) |

**What happens if you skip this?** Nothing else in the notebook will work. This is like plugging in your appliances before you can use them.

In [1]:
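import warnings
# Silence library warnings to keep the notebook output readable.
# Note: this also hides deprecation warnings that can help while debugging.
warnings.filterwarnings('ignore')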

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier

import time

# Make charts look clean
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print("All libraries loaded successfully.")

All libraries loaded successfully.


## Section 2 - Load the Data

We have three files from Kaggle:

| File | Rows | Purpose |
|------|------|---------|
| train.csv | 8693 | Has the answer (Transported column). We learn from this. |
| test.csv | 4277 | No answer column. We predict on this and submit to Kaggle. |
| sample_submission.csv | 4277 | Shows the exact format Kaggle expects. |

**Why load all three now?** We need to understand the full picture. The test set may have categories or patterns the training set does not, and vice versa.

In [2]:
# Load the datasets
train = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
test  = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
sample_submission = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')

print(f"Training set   : {train.shape[0]} rows, {train.shape[1]} columns")
print(f"Test set       : {test.shape[0]} rows, {test.shape[1]} columns")
print(f"Submission file: {sample_submission.shape[0]} rows, {sample_submission.shape[1]} columns")

Training set   : 8693 rows, 14 columns
Test set       : 4277 rows, 13 columns
Submission file: 4277 rows, 2 columns
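The earlier point about categories appearing in one file but not the other can be checked directly. The following is a minimal sketch, not part of the original notebook: `toy_train`, `toy_test`, and the helper `category_mismatches` are hypothetical names, and the rows are made up for illustration. In the notebook itself you would pass `train` and `test` instead.

```python
import pandas as pd

# Made-up stand-in frames (assumption); replace with `train` and `test` in the notebook.
toy_train = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", None],
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"],
})
toy_test = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Earth", "Europa"],
    "Destination": ["TRAPPIST-1e", "PSO J318.5-22", "TRAPPIST-1e", None],
})

def category_mismatches(left, right, columns):
    """Return, per column, the non-missing values seen in only one of the two frames."""
    report = {}
    for col in columns:
        left_vals = set(left[col].dropna().unique())
        right_vals = set(right[col].dropna().unique())
        report[col] = {
            "only_in_left": sorted(left_vals - right_vals),
            "only_in_right": sorted(right_vals - left_vals),
        }
    return report

mismatches = category_mismatches(toy_train, toy_test, ["HomePlanet", "Destination"])
for col, diff in mismatches.items():
    print(col, diff)
```

An empty list on both sides for every column means the two files share the same set of categories, so an encoder fitted on one will not meet unseen labels in the other.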


## Section 3 - First Look at the Data

Before doing anything fancy, we simply look at the data. This is like a doctor checking your vitals before running tests. We want to know:
- What columns exist and what type each one is (number vs text)
- How many values are missing
- What the first few rows look like

**What happens if you skip this?** You might build a model on data you do not understand. You could accidentally treat a text column as a number, or miss a column that is 90 percent empty.

In [3]:
print("=== First 5 rows of training data ===")
train.head()

=== First 5 rows of training data ===


| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [4]:
print("=== Data types and non-null counts ===")
train.info()

=== Data types and non-null counts ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
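The non-null counts above answer the "how many values are missing" question once they are subtracted from the 8693 total rows. The sketch below does that arithmetic on numbers copied from the `train.info()` output; the names `TOTAL_ROWS`, `non_null`, and `summary` are illustrative and not from the original notebook.

```python
import pandas as pd

TOTAL_ROWS = 8693  # from "RangeIndex: 8693 entries" in the info() output

# Non-null counts copied from the train.info() output above.
non_null = pd.Series({
    "PassengerId": 8693, "HomePlanet": 8492, "CryoSleep": 8476, "Cabin": 8494,
    "Destination": 8511, "Age": 8514, "VIP": 8490, "RoomService": 8512,
    "FoodCourt": 8510, "ShoppingMall": 8485, "Spa": 8510, "VRDeck": 8505,
    "Name": 8493, "Transported": 8693,
})

# Missing count per column, plus the same figure as a percentage of all rows.
missing = TOTAL_ROWS - non_null
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (missing / TOTAL_ROWS * 100).round(2),
}).sort_values("missing", ascending=False)

print(summary)
```

On the live frame, `train.isnull().sum()` should give the same counts without any copying. CryoSleep has the most gaps (217 rows, about 2.5 percent), while PassengerId and Transported are complete.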


In [5]:
print("=== Basic statistics for numeric columns ===")
train.describe()

=== Basic statistics for numeric columns ===


| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


In [6]:
print("=== Basic statistics for categorical (text) columns ===")
train.describe(include='object')

=== Basic statistics for categorical (text) columns ===


| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | VIP | Name |
|---|---|---|---|---|---|---|---|
| count | 8693 | 8492 | 8476 | 8494 | 8511 | 8490 | 8493 |
| unique | 8693 | 3 | 2 | 6560 | 3 | 2 | 8473 |
| top | 9280_02 | Earth | False | G/734/S | TRAPPIST-1e | False | Ankalik Nateansive |
| freq | 1 | 4602 | 5439 | 8 | 5915 | 8291 | 2 |
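CryoSleep and VIP appear in this text-column summary even though they only hold True and False. A likely explanation is that a plain boolean column cannot store a missing value, so pandas falls back to the generic `object` dtype when a True/False column has gaps. The sketch below illustrates this on two tiny made-up CSV snippets (not the Kaggle files); the variable names are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV: the second CryoSleep value is left empty on purpose.
csv_with_gap = "PassengerId,CryoSleep\n0001_01,False\n0002_01,\n0003_01,True\n"
with_gap = pd.read_csv(io.StringIO(csv_with_gap))

# Same layout, but every CryoSleep value is present.
csv_complete = "PassengerId,CryoSleep\n0001_01,False\n0002_01,True\n0003_01,True\n"
complete = pd.read_csv(io.StringIO(csv_complete))

print("with a gap :", with_gap["CryoSleep"].dtype)
print("no gaps    :", complete["CryoSleep"].dtype)
```

This is why a dtype listing alone can be misleading: a column reported as `object` may be a yes/no flag with a few gaps rather than free text.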


## Section 4 - Understand Missing Values

Real-world data is almost never complete. Passengers may not have filled in every field, or the damaged computer system lost some records. We need to know exactly how much is missing and where.

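**Why this matters:** Many machine learning algorithms cannot handle empty cells. Scikit-learn's logistic regression, SVM, and KNN raise an error when the input contains missing values, while XGBoost accepts them natively. Careless workarounds, such as filling every gap with 0, avoid the crash but can silently distort predictions.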
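To see the failure mode concretely, here is a minimal sketch on a made-up four-row feature matrix (not the competition data). It assumes scikit-learn's `LogisticRegression` with default settings. Dropping the incomplete row at the end is only one possible remedy, shown for illustration, and is not necessarily the approach this notebook takes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up feature matrix with one missing value (np.nan) in the second row.
X = np.array([[1.0, 20.0], [2.0, np.nan], [3.0, 40.0], [4.0, 50.0]])
y = np.array([0, 0, 1, 1])

try:
    LogisticRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and refuses to fit on NaN.
    fit_failed = True
    print("Fit refused:", str(err).splitlines()[0])

# Keep only rows without any NaN, then fit again on the complete rows.
complete_rows = ~np.isnan(X).any(axis=1)
model = LogisticRegression().fit(X[complete_rows], y[complete_rows])
print("Rows used after dropping NaN:", int(complete_rows.sum()))
```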