# SpaceX Falcon 9 First Stage Landing Prediction  
**IBM Data Science Professional Certificate – Capstone Project**

This notebook replicates the end‑to‑end workflow used in the capstone project:

1. Data collection (SpaceX API / web data)
2. Data wrangling and feature engineering
3. Exploratory data analysis (EDA)
4. SQL‑based exploration
5. Interactive visual analytics (Folium and Plotly; not implemented in this notebook, listed only as an extension at the end)
6. Predictive analysis with several classification models
7. Model evaluation and comparison


In [None]:
# Import core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Display settings
plt.style.use('seaborn-v0_8')
sns.set(font_scale=1.1)


## 1. Data Collection

In the original labs the data was collected from:
- The public SpaceX REST API
- A web page listing Falcon 9 launches
- Provided CSV files on IBM Skills Network

In this consolidated notebook we assume the cleaned CSV produced in the data‑wrangling labs is available from a URL or local file.
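
For readers who want to see what the API-based collection step might look like, here is a minimal sketch of flattening nested launch records with `pd.json_normalize`. The records are hand-written placeholders that only imitate the shape of an API response; the field names (`flight_number`, `date_utc`, `cores`) and all values are assumptions for illustration, not data from the labs.

```python
import pandas as pd

# Hand-written placeholder records (NOT real API output); field names are assumed.
sample_launches = [
    {"flight_number": 1, "date_utc": "2020-01-01T00:00:00.000Z",
     "cores": [{"gridfins": False, "legs": False, "reused": False}]},
    {"flight_number": 2, "date_utc": "2020-02-01T00:00:00.000Z",
     "cores": [{"gridfins": True, "legs": True, "reused": False}]},
]

# One row per core; launch-level fields are repeated through `meta`.
launches = pd.json_normalize(
    sample_launches,
    record_path="cores",
    meta=["flight_number", "date_utc"],
)

# Keep only the calendar date, which is enough for this analysis.
launches["date"] = pd.to_datetime(launches["date_utc"]).dt.date
print(launches)
```

With a real response, the list passed to `pd.json_normalize` would come from the parsed JSON body instead of the placeholder list.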


In [None]:
# URL to the cleaned dataset used in the capstone labs.
# Replace this with the actual URL from Skills Network or with a local path.
DATA_URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_2.csv"

df = pd.read_csv(DATA_URL)
df.head()

## 2. Data Wrangling and Feature Engineering

We keep the most relevant columns for predicting whether the first stage will land,
use the existing target label `Class` (1 if the first stage landed, 0 otherwise),
impute missing values, and apply one‑hot encoding to categorical features.
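
If your copy of the data has no `Class` column yet, the label can be derived from a textual landing outcome. The sketch below assumes a hypothetical `Outcome` column whose values start with `True` for a successful landing; both the column name and that encoding are assumptions, so check them against your own file.

```python
import pandas as pd

# Toy outcomes for illustration only; the "True ..."/"False ..."/"None ..."
# encoding is an assumption about how outcomes are written.
outcomes = pd.DataFrame(
    {"Outcome": ["True ASDS", "False Ocean", "None None", "True RTLS"]}
)

# 1 = first stage landed, 0 = it did not land (or no landing was attempted).
outcomes["Class"] = outcomes["Outcome"].str.startswith("True").astype(int)
print(outcomes)
```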


In [None]:
# Keep only a subset of relevant columns (adjust to match your dataset).
# The names below assume the column headers of dataset_part_2.csv.
columns_to_keep = [
    'FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite',
    'Flights', 'BoosterVersion', 'GridFins', 'Reused',
    'Legs', 'LandingPad', 'Block', 'ReusedCount', 'Serial',
    'Class'
]

# Selecting with a list raises a KeyError if any column name is missing
df_model = df[columns_to_keep].copy()
df_model.head()

In [None]:
# Check for missing values
df_model.isna().sum()

In [None]:
# Simple example: fill missing numeric feature values with the median.
# The target `Class` is left out so that labels are never imputed.
numeric_cols = df_model.select_dtypes(include='number').columns.drop('Class', errors='ignore')
df_model[numeric_cols] = df_model[numeric_cols].fillna(df_model[numeric_cols].median())

# For categorical columns we can fill missing with a special category
categorical_cols = df_model.select_dtypes(include=['object', 'bool']).columns
df_model[categorical_cols] = df_model[categorical_cols].fillna('Unknown')

df_model.head()

In [None]:
# Separate features and target label
Y = df_model['Class']
X = df_model.drop('Class', axis=1)

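# One‑hot encoding of the categorical (object) columns.
# dtype=float keeps the encoded columns numeric rather than boolean.
X_one_hot = pd.get_dummies(X, drop_first=True, dtype=float)
X_one_hot.shape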
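
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The values are made up; only the column names echo the dataset.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "PayloadMass": [500.0, 3000.0, 5000.0],
    "Orbit": ["LEO", "GTO", "LEO"],
    "GridFins": [False, True, True],
})

# Only "Orbit" is encoded. With drop_first=True the first category in
# sorted order ("GTO") is dropped and becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Orbit"], drop_first=True, dtype=float)
print(encoded)
```

Dropping one level per feature avoids perfectly collinear columns, which matters mostly for linear models such as logistic regression.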

## 3. Exploratory Data Analysis (EDA)

We now explore relationships between features and the landing outcome using
descriptive statistics and visualizations.


In [None]:
# Basic statistics
df_model.describe(include='all').transpose().head()

In [None]:
# Example: distribution of payload mass
plt.figure(figsize=(8,5))
sns.histplot(data=df_model, x='PayloadMass', hue='Class', kde=True)
plt.title('Payload Mass Distribution by Landing Outcome')
plt.show()

In [None]:
# Example: success rate by launch site
plt.figure(figsize=(8,5))
sns.barplot(data=df_model, x='LaunchSite', y='Class')
plt.title('Average Landing Success by Launch Site')
plt.xticks(rotation=45)
plt.show()
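
The bar plot above shows mean success per site but not how many launches stand behind each bar. A grouped summary is one way to see both at once. The sketch below uses a made-up toy frame; on the real data you would group `df_model` in the same way.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Orbit": ["LEO", "LEO", "GTO", "GTO", "ISS"],
    "Class": [1, 0, 0, 0, 1],
})

# Launch count and success rate per group, highest rate first.
summary = (
    toy.groupby("Orbit")["Class"]
    .agg(launches="count", success_rate="mean")
    .sort_values("success_rate", ascending=False)
)
print(summary)
```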

## 4. SQL‑Based Exploration

In the labs, a SQLite database was created from the SpaceX dataset and queried using SQL.
Here we load the wrangled data into an in-memory SQLite database and reproduce one typical
aggregate query using `pandas` and `sqlite3`.


In [None]:
import sqlite3

conn = sqlite3.connect(':memory:')
df_model.to_sql('SPACEX', conn, index=False, if_exists='replace')

query = """
SELECT LaunchSite,
       COUNT(*) AS TotalLaunches,
       AVG(Class) AS SuccessRate
FROM SPACEX
GROUP BY LaunchSite
ORDER BY SuccessRate DESC;
"""

pd.read_sql(query, conn)
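
Queries that filter on a value are safer with bound parameters than with string formatting. The following sketch runs a parameterized query against a toy table; the table name `SPACEX_TOY`, the site names, and the payload bounds are all illustrative assumptions.

```python
import sqlite3

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "LaunchSite": ["Site A", "Site A", "Site B"],
    "PayloadMass": [2000.0, 6000.0, 4500.0],
    "Class": [0, 1, 1],
})

conn_toy = sqlite3.connect(":memory:")
toy.to_sql("SPACEX_TOY", conn_toy, index=False, if_exists="replace")

# "?" placeholders are filled from `params`, in order.
query_toy = """
SELECT LaunchSite,
       COUNT(*) AS Launches,
       AVG(Class) AS SuccessRate
FROM SPACEX_TOY
WHERE PayloadMass BETWEEN ? AND ?
GROUP BY LaunchSite
ORDER BY LaunchSite;
"""

heavy = pd.read_sql(query_toy, conn_toy, params=(4000, 7000))
conn_toy.close()
print(heavy)
```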

## 5. Machine Learning – Model Training

We build several classification models to predict whether the first stage will land:

- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- k‑Nearest Neighbors (KNN)

We first split the data into training and test sets, then standardize the features
using statistics computed on the training set only, and finally use `GridSearchCV`
to tune hyperparameters for each model.


In [None]:
# Train/test split
X_train, X_test, Y_train, Y_test = train_test_split(
    X_one_hot, Y, test_size=0.2, random_state=2, stratify=Y
)

# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape, X_test_scaled.shape
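
Because the scaler above is fitted once on the full training set, each cross-validation fold inside `GridSearchCV` sees scaling statistics that include its own validation rows. One common alternative, sketched here on synthetic data, is to wrap the scaler and the model in a `Pipeline` so that scaling is refitted per fold. This is an optional variation, not the approach used in the labs, and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# The scaler is refitted on the training part of every CV fold.
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
grid_demo = GridSearchCV(pipe_demo, {"clf__C": [0.01, 0.1, 1.0, 10]}, cv=5)
grid_demo.fit(X_tr, y_tr)

demo_test_acc = grid_demo.score(X_te, y_te)
print("Best parameters:", grid_demo.best_params_)
print("Test accuracy:", demo_test_acc)
```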

### 5.1 Logistic Regression

In [None]:
logreg_params = {
    'C': [0.01, 0.1, 1.0, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs']
}
logreg = LogisticRegression(max_iter=1000)
logreg_cv = GridSearchCV(logreg, logreg_params, cv=10)
logreg_cv.fit(X_train_scaled, Y_train)

print("Best parameters:", logreg_cv.best_params_)
print("CV accuracy:", logreg_cv.best_score_)

In [None]:
logreg_test_acc = logreg_cv.score(X_test_scaled, Y_test)
yhat_logreg = logreg_cv.predict(X_test_scaled)

print("Test accuracy:", logreg_test_acc)
print(classification_report(Y_test, yhat_logreg))

### 5.2 Support Vector Machine

In [None]:
svm_params = {
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'C': np.logspace(-3, 3, 5),
    'gamma': np.logspace(-3, 3, 5)
}
svm_model = SVC()
svm_cv = GridSearchCV(svm_model, svm_params, cv=10)
svm_cv.fit(X_train_scaled, Y_train)

print("Best parameters:", svm_cv.best_params_)
print("CV accuracy:", svm_cv.best_score_)

In [None]:
svm_test_acc = svm_cv.score(X_test_scaled, Y_test)
yhat_svm = svm_cv.predict(X_test_scaled)

print("Test accuracy:", svm_test_acc)
print(classification_report(Y_test, yhat_svm))

### 5.3 Decision Tree

In [None]:
tree_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2*n for n in range(1, 10)],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10]
}
tree = DecisionTreeClassifier(random_state=2)
tree_cv = GridSearchCV(tree, tree_params, cv=10)
tree_cv.fit(X_train_scaled, Y_train)

print("Best parameters:", tree_cv.best_params_)
print("CV accuracy:", tree_cv.best_score_)

In [None]:
tree_test_acc = tree_cv.score(X_test_scaled, Y_test)
yhat_tree = tree_cv.predict(X_test_scaled)

print("Test accuracy:", tree_test_acc)
print(classification_report(Y_test, yhat_tree))

### 5.4 k‑Nearest Neighbors (KNN)

In [None]:
knn_params = {
    'n_neighbors': list(range(1, 11)),
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2]
}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, knn_params, cv=10)
knn_cv.fit(X_train_scaled, Y_train)

print("Best parameters:", knn_cv.best_params_)
print("CV accuracy:", knn_cv.best_score_)

In [None]:
knn_test_acc = knn_cv.score(X_test_scaled, Y_test)
yhat_knn = knn_cv.predict(X_test_scaled)

print("Test accuracy:", knn_test_acc)
print(classification_report(Y_test, yhat_knn))

## 6. Model Comparison

In [None]:
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM', 'Decision Tree', 'KNN'],
    'TestAccuracy': [logreg_test_acc, svm_test_acc, tree_test_acc, knn_test_acc]
}).sort_values('TestAccuracy', ascending=False)

results

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(data=results, x='Model', y='TestAccuracy')
plt.title('Model Comparison – Test Accuracy')
plt.ylim(0, 1)
plt.xticks(rotation=30)
plt.show()
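
Accuracy alone does not show which kind of error a model makes. A confusion matrix separates landings that were missed from landings that were wrongly predicted. The sketch below uses made-up labels; on the real data you would pass `Y_test` and one of the `yhat_*` arrays instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (1 = landed, 0 = did not land).
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
cm_table = pd.DataFrame(
    cm_demo,
    index=["actual: did not land", "actual: landed"],
    columns=["predicted: did not land", "predicted: landed"],
)
print(cm_table)
```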

## 7. Conclusion

- We built several classification models to predict Falcon 9 first stage landing success.
- The workflow included data wrangling, EDA, SQL queries, and hyperparameter tuning.
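- In the original labs the decision tree reached around 83% accuracy on the held‑out test set.
  Whether it ranks first here depends on the run, so rely on the results table above; with a
  20% test split, small accuracy differences between models may not be meaningful.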
- These predictions can support estimating launch costs and assessing mission risk.

You can extend this notebook by:
- Adding Folium maps showing launch sites and outcomes.
- Integrating Plotly Dash for fully interactive dashboards.
- Enriching the dataset with weather and mission‑specific metadata.
