## Welcome to Your ML Journey!

In this notebook, we'll predict which passengers were transported to an alternate dimension during the Spaceship Titanic's collision with a spacetime anomaly.

### What You'll Learn:
1. **Data Exploration** - Understanding what we're working with
2. **Data Cleaning** - Handling missing values
3. **Feature Engineering** - Creating useful features
4. **Model Training** - Testing 5 different algorithms
5. **Model Comparison** - Finding the best performer
6. **Making Predictions** - Submitting to Kaggle

### The 5 Algorithms We'll Compare:
1. **Logistic Regression** - Simple, fast baseline
2. **Random Forest** - Powerful tree-based ensemble
3. **XGBoost** - Advanced gradient boosting
4. **Support Vector Machine (SVM)** - Finds the boundary that separates the classes with the widest margin
5. **K-Nearest Neighbors (KNN)** - Classifies each passenger by the labels of the most similar passengers

## Step 1: Import Libraries

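Think of libraries as toolboxes. Each one contains specialized tools for different tasks:
- **pandas**: Data manipulation (like Excel on steroids)
- **numpy**: Mathematical operations on arrays
- **matplotlib/seaborn**: Data visualization
- **sklearn** (scikit-learn): Machine learning algorithms, preprocessing, and evaluation metrics
- **xgboost**: The gradient boosting library that provides `XGBClassifier`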

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning utilities (data splitting, cross-validation, scaling, encoding, imputation)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# The 5 algorithms we'll compare
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# Evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Settings for nice visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

All libraries imported successfully!


## Step 2: Load the Data

**What's happening here:**
- We're loading the CSV files into pandas DataFrames (think of them as smart spreadsheets)
- `train.csv`: Data with known outcomes (who was transported)
- `test.csv`: Data where we need to predict the outcomes

In [2]:
# Load the datasets
train_df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
test_df = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"\nThis means we have {train_df.shape[0]:,} passengers to learn from")
print(f"And {test_df.shape[0]:,} passengers to make predictions for")

Training data shape: (8693, 14)
Test data shape: (4277, 13)

This means we have 8,693 passengers to learn from
And 4,277 passengers to make predictions for
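
**Why does train have one more column than test?** The shapes above show 14 columns versus 13. The extra column is the target, `Transported`, which only the training data contains. Here is a minimal sketch of how you could check this yourself. The `toy_train` and `toy_test` frames are hypothetical stand-ins, so the snippet runs without the Kaggle files:

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (hypothetical rows, not the real data)
toy_train = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "Age": [39.0, 24.0],
    "Transported": [False, True],
})
toy_test = toy_train.drop(columns="Transported")

# Columns that appear in train but not in test
extra_cols = sorted(set(toy_train.columns) - set(toy_test.columns))
print(extra_cols)
```

On the real data you would swap in `train_df` and `test_df` for the toy frames.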


## Step 3: Understanding the Data Structure

**Why this matters:**
Before building any model, we need to understand our data. It's like studying a map before going on a journey.

In [3]:
# Display first 5 rows
train_df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


### Understanding Each Column

Let's understand what each feature means:

| Column | Type | Description |
|--------|------|-------------|
| **PassengerId** | ID | Format: `gggg_pp` (group_personNumber) |
| **HomePlanet** | Categorical | Earth, Europa, or Mars |
| **CryoSleep** | Boolean | Was passenger in suspended animation? |
| **Cabin** | Categorical | Format: `deck/num/side` (P=Port, S=Starboard) |
| **Destination** | Categorical | Which planet they're going to |
| **Age** | Numerical | Passenger's age |
| **VIP** | Boolean | Did they pay for VIP service? |
| **RoomService** | Numerical | Amount spent on room service |
| **FoodCourt** | Numerical | Amount spent at food court |
| **ShoppingMall** | Numerical | Amount spent shopping |
| **Spa** | Numerical | Amount spent at spa |
| **VRDeck** | Numerical | Amount spent on VR entertainment |
| **Name** | Text | First and last name |
| **Transported** | Target | **This is what we're predicting!** |
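
**Unpacking the structured columns:** `PassengerId` and `Cabin` each pack several pieces of information into one string. The sketch below shows one possible way to split them apart with pandas, using three rows copied from the preview above. The new column names (`Group`, `PersonNum`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are illustrative choices, not names defined by the dataset:

```python
import pandas as pd

# Three rows copied from the preview of train_df
sample = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
})

# PassengerId "gggg_pp" -> travel group and person number within the group
sample[["Group", "PersonNum"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin "deck/num/side" -> three separate columns (P = Port, S = Starboard)
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

# How many passengers share each travel group
sample["GroupSize"] = sample.groupby("Group")["PassengerId"].transform("count")

print(sample)
```

Note that the split pieces are still strings. Convert `CabinNum` with `pd.to_numeric` if you want to treat it as a number.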

In [4]:
print("="*70)
print("DATA LOADING SUMMARY")
print("="*70)
print(f"\nTraining data shape: {train_df.shape}")
print(f"   → {train_df.shape[0]:,} passengers (rows)")
print(f"   → {train_df.shape[1]} features/columns")

print(f"\nTest data shape: {test_df.shape}")
print(f"   → {test_df.shape[0]:,} passengers (rows)")
print(f"   → {test_df.shape[1]} features/columns")

print(f"\nGoal: Predict 'Transported' for {test_df.shape[0]:,} test passengers")
print("="*70)

DATA LOADING SUMMARY

Training data shape: (8693, 14)
   → 8,693 passengers (rows)
   → 14 features/columns

Test data shape: (4277, 13)
   → 4,277 passengers (rows)
   → 13 features/columns

Goal: Predict 'Transported' for 4,277 test passengers


In [5]:
# Get information about the dataset
print("Dataset Info:")
print("=" * 60)
train_df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [6]:
print("\n" + "=" * 60)
print("Missing Values Count:")
print("=" * 60)
missing = train_df.isnull().sum()
missing_pct = (missing / len(train_df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)
print(missing_df[missing_df['Missing Count'] > 0])


Missing Values Count:
              Missing Count  Percentage
CryoSleep               217    2.496261
ShoppingMall            208    2.392730
VIP                     203    2.335212
HomePlanet              201    2.312205
Name                    200    2.300702
Cabin                   199    2.289198
VRDeck                  188    2.162660
Spa                     183    2.105142
FoodCourt               183    2.105142
Destination             182    2.093639
RoomService             181    2.082135
Age                     179    2.059128
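
**What can we do about missing values?** Every affected column is missing only about 2% of its values, so filling the gaps is usually more attractive than dropping rows. The sketch below shows one simple strategy in plain pandas: fill numeric columns with the median and text columns with the most frequent value. This is an assumption for illustration, not necessarily the cleaning approach used later in this notebook, and the `toy` frame is made-up data:

```python
import numpy as np
import pandas as pd

# Made-up rows with one gap in a numeric column and one in a text column
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: fill with the median of the observed values
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Text column: fill with the most frequent observed value
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
print("Remaining missing values:", filled.isnull().sum().sum())
```

The median is less sensitive to extreme values than the mean, which matters for skewed columns such as the spending amounts.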


## Step 4: Explore the Data (EDA - Exploratory Data Analysis)

### Target Variable Distribution

**Key Question:** How many passengers were transported vs not transported?

This tells us if our data is **balanced** (roughly equal) or **imbalanced** (one class dominates).
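
As a minimal sketch of this check, `value_counts` gives the count per class and `value_counts(normalize=True)` gives each class's share. The `toy_target` series is made-up data, and the 60% cutoff for calling the data balanced is an arbitrary rule of thumb chosen for this example:

```python
import pandas as pd

# Made-up target values standing in for train_df['Transported']
toy_target = pd.Series(
    [True, False, True, True, False, False, True, False],
    name="Transported",
)

counts = toy_target.value_counts()                # passengers per class
shares = toy_target.value_counts(normalize=True)  # fraction per class

# Share of the larger class; 0.5 means a perfectly even split
majority_share = shares.max()

# Hypothetical rule of thumb: "balanced" if the larger class is under 60%
is_balanced = majority_share < 0.6

print(counts)
print(f"Majority class share: {majority_share:.1%}")
print(f"Balanced: {is_balanced}")
```

When the classes are balanced, plain accuracy is a reasonable metric. When one class dominates, a model can score high accuracy just by always predicting that class.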