# SpaceX Falcon 9 First Stage Landing Prediction - Web Scraping

## Project Overview
This notebook collects SpaceX Falcon 9 launch records by scraping the launch tables on Wikipedia. Where an API returns structured records directly, scraping extracts them from the page's HTML tables. The resulting historical launch data complements our dataset for predicting first-stage landing success.

## Table of Contents
1. Import Libraries and Define Helper Functions
2. Web Scraping Setup and Data Extraction
3. Parse HTML Tables and Extract Launch Records
4. Create DataFrame and Export Data

---

## 1. Import Libraries and Define Helper Functions

### Import Required Libraries
We'll use BeautifulSoup for HTML parsing, requests for web requests, and pandas for data manipulation.

In [None]:
# Import system utilities
import sys

# Import web scraping libraries
import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML content
import re  # For regular expressions
import unicodedata  # For Unicode string normalization

# Import data manipulation library
import pandas as pd

### Helper Function: Extract Date and Time
This function parses date and time information from table cells, returning them as a clean list.

In [None]:
def date_time(table_cells):
    """
    Extract date and time from table cell strings.
    
    Args:
        table_cells: BeautifulSoup table cell containing date and time data
    
    Returns:
        list: First two elements containing date and time strings
    """
    return [text.strip() for text in list(table_cells.strings)][0:2]
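
The two strings returned by `date_time` can later be combined into a single timestamp. The following is a minimal sketch, assuming the date text looks like `4 June 2010` and the time text like `18:45`. The helper name `parse_launch_datetime` is hypothetical and not part of this notebook.

```python
from datetime import datetime


def parse_launch_datetime(date_str, time_str):
    """Combine strings such as '4 June 2010' and '18:45' into a datetime.

    Hypothetical helper: returns None when the text does not match
    the assumed 'day month year hour:minute' layout.
    """
    combined = f"{date_str.strip(', ')} {time_str.strip()}"
    try:
        return datetime.strptime(combined, "%d %B %Y %H:%M")
    except ValueError:
        return None


launch = parse_launch_datetime("4 June 2010,", "18:45")
print(launch)  # 2010-06-04 18:45:00
```

Returning `None` on a mismatch keeps one oddly formatted cell from aborting the whole scrape.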

### Helper Function: Extract Booster Version
Extracts and formats the booster version information from table cells.

In [None]:
def booster_version(table_cells):
    """
    Extract booster version by joining every other string element.
    
    Args:
        table_cells: BeautifulSoup table cell containing booster version data
    
    Returns:
        str: Cleaned booster version string
    """
    # Join every even-indexed string (skipping odd indices) and exclude the last element
    out = ''.join([text for i, text in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
    return out
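
To see what the even-index selection does, it can be applied to a plain list. This is only an illustration: the `fragments` list is a hypothetical stand-in for `list(table_cells.strings)`, assuming text fragments alternate with citation markers and the cell ends with a newline.

```python
# Hypothetical stand-in for list(table_cells.strings): text fragments
# alternate with citation markers, and the cell ends with a newline.
fragments = ["F9 v1.0", "[7]", "B0003.1", "[8]", "\n"]

# Keep even-indexed fragments, then drop the last kept one (the trailing newline)
kept = [text for i, text in enumerate(fragments) if i % 2 == 0][0:-1]
version = "".join(kept)
print(version)  # F9 v1.0B0003.1
```

The result depends entirely on that alternation. If a cell is laid out differently, citation text can end up at even indices and leak into the joined string.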

### Helper Function: Extract Landing Status
Retrieves the landing status text from table cells.

In [None]:
def landing_status(table_cells):
    """
    Extract the first string from table cells representing landing status.
    
    Args:
        table_cells: BeautifulSoup table cell containing landing status
    
    Returns:
        str: Landing status string
    """
    out = [i for i in table_cells.strings][0]
    return out

### Helper Function: Extract Payload Mass
Parses and normalizes payload mass data, extracting the kilogram value.

In [None]:
def get_mass(table_cells):
    """
    Extract and normalize payload mass from table cells.
    
    Args:
        table_cells: BeautifulSoup table cell containing mass data
    
    Returns:
        str or int: Mass value with 'kg' unit, or 0 if not found
    """
    # Normalize Unicode characters to standard format
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    
    if mass and "kg" in mass:
        # Keep everything up to and including the first 'kg'
        new_mass = mass[0:mass.find("kg") + 2]
    else:
        # No mass given, or no 'kg' unit present in the cell
        new_mass = 0
    
    return new_mass
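
`get_mass` returns text such as `4,700 kg`, so a later step still has to turn it into a number. A minimal sketch of that conversion follows, assuming the mass is written with comma thousands separators directly before `kg`. The function name `mass_to_kg` is hypothetical.

```python
import re
import unicodedata


def mass_to_kg(text):
    """Return the mass in kilograms as a float, or None if no number precedes 'kg'."""
    # NFKD turns non-breaking spaces into ordinary spaces
    normalized = unicodedata.normalize("NFKD", text)
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*kg", normalized)
    if match is None:
        return None
    # Remove thousands separators before converting
    return float(match.group(1).replace(",", ""))


print(mass_to_kg("4,700\xa0kg"))  # 4700.0
print(mass_to_kg("Classified"))   # None
```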

### Helper Function: Extract Column Names from Headers
Cleans and extracts column names from HTML table headers by removing unwanted tags.

In [None]:
def extract_column_from_header(row):
    """
    Clean and extract column name from HTML table header row.
    Removes br, a, and sup tags, then returns the cleaned text.
    
    Args:
        row: BeautifulSoup row element from table header
    
    Returns:
        str or None: Cleaned column name, or None if invalid
    """
    # Remove line break tags
    if (row.br):
        row.br.extract()
    
    # Remove anchor (link) tags
    if row.a:
        row.a.extract()
    
    # Remove superscript tags
    if row.sup:
        row.sup.extract()
    
    # Join all remaining content into the column name
    column_name = ' '.join(row.contents)
    
    # Return the cleaned name unless it is digit-only (otherwise the function returns None)
    if not column_name.strip().isdigit():
        column_name = column_name.strip()
        return column_name
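
The effect of removing tags before joining can be checked on a single header cell. The sketch below mirrors the same steps on a hypothetical `<th>` snippet; the markup is an assumption made for illustration, not copied from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical header cell with a line break, a link, and a footnote marker
cell_html = '<th scope="col">Date and<br/>time (<a href="/wiki/UTC">UTC</a>)<sup>[a]</sup></th>'
cell = BeautifulSoup(cell_html, "html.parser").th

# Drop the first br, a, and sup tags, mirroring extract_column_from_header
for tag_name in ("br", "a", "sup"):
    tag = cell.find(tag_name)
    if tag is not None:
        tag.extract()

# Only text nodes remain, so they can be joined as plain strings
column_name = ' '.join(cell.contents).strip()
print(column_name)  # Date and time ( )
```

Because the link text is removed along with the `<a>` tag, the parentheses around it are left empty, which is how a header can come out as `Date and time ( )`.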

---

## 2. Web Scraping Setup and Data Extraction

### Configure Wikipedia URL and Headers
We'll scrape the Wikipedia page containing Falcon 9 and Falcon Heavy launch records. Custom headers are used to mimic a browser request.

In [None]:
# Define the static Wikipedia URL containing SpaceX launch data
# This is a specific revision to ensure data consistency
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Set up headers to mimic a browser request (some websites block requests without proper headers)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/91.0.4472.124 Safari/537.36"
}

### Request and Parse HTML Content
Make an HTTP GET request to fetch the webpage and parse it using BeautifulSoup.

In [None]:
# Send GET request to the Wikipedia page
response = requests.get(static_url, headers=headers)

# Stop early with an HTTPError if the server returned an error status
response.raise_for_status()

# Parse the HTML content using BeautifulSoup's built-in html.parser
soup = BeautifulSoup(response.text, 'html.parser')

### Extract All HTML Tables
Find all table elements in the parsed HTML. The launch data is contained within these tables.

In [None]:
# Use find_all to locate all table elements in the HTML
html_tables = soup.find_all('table')

---

## 3. Parse HTML Tables and Extract Launch Records

### Extract Column Names from Table Headers
Iterate through the table headers to extract clean column names for our dataset.

In [None]:
# Initialize an empty list to store column names
column_names = []

# Access the third table (index 2) which contains the launch data
# Find all table header (th) elements
for row in html_tables[2].find_all('th'):
    # Apply our helper function to clean and extract the column name
    name = extract_column_from_header(row)
    
    # Only append non-empty column names
    if name is not None and len(name) > 0:
        column_names.append(name)

# Display the extracted column names
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']


### Initialize Data Dictionary
Create a dictionary structure to store all launch data, with each column initialized as an empty list.

In [None]:
# Create a dictionary with keys from column_names, all values initially None
launch_dict = dict.fromkeys(column_names)

# Drop the combined date/time header; date and time are stored separately below
del launch_dict['Date and time ( )']

# Initialize each column as an empty list to store data
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []

# Add additional columns for our analysis
launch_dict['Version Booster'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []
launch_dict['Time'] = []
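
The same structure can be built in one step with a dict comprehension. This is a sketch of an alternative, not a replacement for the cell above; the names `extra_columns`, `launch_dict_alt`, and `shared` are introduced here for illustration only.

```python
# Column names copied from the printed header output above
column_names = ['Flight No.', 'Date and time ( )', 'Launch site', 'Payload',
                'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
extra_columns = ['Version Booster', 'Booster landing', 'Date', 'Time']

# One fresh list per column, skipping the combined date/time header
launch_dict_alt = {name: [] for name in column_names + extra_columns
                   if name != 'Date and time ( )'}

# Pitfall: fromkeys with a mutable default shares a single list across all keys
shared = dict.fromkeys(['a', 'b'], [])
shared['a'].append(1)
print(shared['b'])  # [1]
```

The pitfall at the end is the reason the notebook assigns `[]` to each key separately instead of passing a list to `dict.fromkeys`.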

### Parse Launch Tables and Extract Data
This is the core scraping logic that iterates through all launch tables, extracts data from each row, and populates our dictionary with launch information.

In [None]:
# Track the number of successfully extracted rows
extracted_row = 0

# Iterate through each table with class "wikitable plainrowheaders collapsible"
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    
    # Iterate through each row in the current table
    for rows in table.find_all("tr"):
        
        # A launch row starts with a table heading (th) holding a numeric flight number.
        # Reset the flag for every row so a previous row's value cannot carry over.
        flag = False
        if rows.th and rows.th.string:
            flight_number = rows.th.string.strip()
            # Verify that it is a numeric flight number
            flag = flight_number.isdigit()
        
        # Extract all table data cells (td) from the row
        row = rows.find_all('td')
        
        # Process the row only if it contains a valid flight number
        if flag:
            extracted_row += 1
            
            # Extract date and time
            datatimelist = date_time(row[0])
            
            # Extract and clean date (remove trailing comma)
            date = datatimelist[0].strip(',')
            
            # Extract time
            time = datatimelist[1]
            
            # Extract booster version
            bv = booster_version(row[1])
            # If booster version is empty, try to get it from the anchor tag
            if not(bv):
                bv = row[1].a.string
            print(bv)
            
            # Extract launch site name from anchor tag
            launch_site = row[2].a.string if row[2].a else None
            
            # Extract payload name from anchor tag
            payload = row[3].a.string if row[3].a else None
            
            # Extract and normalize payload mass
            payload_mass = get_mass(row[4])
            
            # Extract target orbit from anchor tag
            orbit = row[5].a.string if row[5].a else None
            
            # Extract customer name from anchor tag
            customer = row[6].a.string if row[6].a else None
            
            # Extract launch outcome (first string in the cell)
            launch_outcome = list(row[7].strings)[0]
            
            # Extract booster landing status
            booster_landing = landing_status(row[8])
            
            # Append all extracted data to the launch dictionary
            launch_dict['Flight No.'].append(flight_number)
            launch_dict['Date'].append(date)
            launch_dict['Time'].append(time)
            launch_dict['Version Booster'].append(bv)
            launch_dict['Launch site'].append(launch_site)
            launch_dict['Payload'].append(payload)
            launch_dict['Payload mass'].append(payload_mass)
            launch_dict['Orbit'].append(orbit)
            launch_dict['Customer'].append(customer)
            launch_dict['Launch outcome'].append(launch_outcome)
            launch_dict['Booster landing'].append(booster_landing)

F9 v1.07B0003.18
F9 v1.07B0004.18
F9 v1.07B0005.18
F9 v1.07B0006.18
F9 v1.07B0007.18
F9 v1.17B10038
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1
F9 v1.1[
F9 v1.1[
F9 v1.1[
F9 v1.1[
F9 v1.1[
F9 v1.1[
F9 v1.1[
F9 v1.1[
F9 FT[
F9 v1.1[
F9 FT[
F9 FT[
F9 FT[
F9 FT[
F9 FT[
F9 FT[
F9 FT[
F9 FT[
F9 FT[
F9 FT[
F9 FT♺[
F9 FT[
F9 FT[
F9 FT[
F9 FTB1029.2195
F9 FT[
F9 FT[
F9 B4[
F9 FT[
F9 B4[
F9 B4[
F9 FTB1031.2220
F9 B4[
F9 FTB1035.2227
F9 FTB1036.2227
F9 B4[
F9 FTB1032.2245
F9 FTB1038.2268
F9 B4[
F9 B4B1041.2268
F9 B4B1039.2292
F9 B4[
F9 B5311B1046.1268
F9 B4B1043.2322
F9 B4B1040.2268
F9 B4B1045.2336
F9 B5
F9 B5349B1048[
F9 B5B1046.2354
F9 B5[
F9 B5B1048.2364
F9 B5B1047.2268
F9 B5B1046.3268
F9 B5[
F9 B5[
F9 B5B1049.2397
F9 B5B1048.3399
F9 B5[]413
F9 B5[
F9 B5B1049.3434
F9 B5B1051.2420
F9 B5B1056.2465
F9 B5B1047.3472
F9 B5
F9 B5[
F9 B5B1056.3482
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5
F9 B5[
F9 B5
F9 B5
F9 B5
F9 B5B1058.2544
F9 B5
F9 B5B1049.6544
F9 B5
F9 B5B1060.2563
F9 B5B1058.3565
F9 B5B1051.6568
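
The row filter used in the parsing loop can be tried on a small table before running it against the full page. The sketch below uses a hypothetical three-row table (`sample_html`) with the same class string; only the row with a numeric `<th>` should be kept.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a launch table: one header row, one launch row,
# and one row whose <th> holds more than a single plain string.
sample_html = """
<table class="wikitable plainrowheaders collapsible">
  <tr><th>Flight No.</th><th>Date and time</th></tr>
  <tr><th>1</th><td>4 June 2010,<br/>18:45</td></tr>
  <tr><th><a href="#">note</a> extra</th><td>ignored</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

flight_numbers = []
for table in sample_soup.find_all("table", "wikitable plainrowheaders collapsible"):
    for tr in table.find_all("tr"):
        header = tr.th
        # .string is None when the <th> holds more than one child node
        text = header.string.strip() if header is not None and header.string else ""
        if text.isdigit():
            flight_numbers.append(text)

print(flight_numbers)  # ['1']
```

Computing the text fresh for every row means a row without a usable `<th>` string is always skipped, regardless of what the previous row contained.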


---

## 4. Create DataFrame and Export Data

### Convert Dictionary to DataFrame
Transform the populated dictionary into a pandas DataFrame for analysis and manipulation.

In [None]:
# Create DataFrame from the launch_dict, converting each list to a pandas Series
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})

# Display basic information about the DataFrame
print(f"Total launches scraped: {len(df)}")
df.head()
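
Wrapping each list in `pd.Series` matters when the columns do not all have the same length. A small sketch with a hypothetical two-column dictionary (`ragged`) shows the behavior:

```python
import pandas as pd

# Hypothetical columns of unequal length, as could happen if one append were skipped
ragged = {"Flight No.": ["1", "2", "3"], "Orbit": ["LEO", "LEO"]}

# Wrapping each list in a Series pads the shorter column with missing values
padded = pd.DataFrame({key: pd.Series(value) for key, value in ragged.items()})
print(padded.shape)                   # (3, 2)
print(padded["Orbit"].isna().sum())   # 1
```

Passing `ragged` to `pd.DataFrame` directly would raise a `ValueError`, so the Series form trades a hard failure for silent padding. Checking `df.isna().sum()` after the scrape helps catch a column that came up short.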

### Export to CSV
Save the scraped dataset to a CSV file for future analysis and model training.

In [None]:
# Save the DataFrame to a CSV file
df.to_csv('spacex_web_scraped_data.csv', index=False)
print("✓ Data successfully saved to 'spacex_web_scraped_data.csv'")

---

## Summary

### Data Collection Complete
We successfully scraped SpaceX Falcon 9 launch data from Wikipedia, extracting the following information:

**Key Features Collected:**
- Flight number and launch date/time
- Booster version details
- Launch site locations
- Payload information and mass
- Target orbits
- Customer details
- Launch outcomes
- Booster landing status

The scraped data has been saved to `spacex_web_scraped_data.csv` and is ready for further analysis.