# Phase 1: SpaceX Data Acquisition via REST API

**Executive Summary:** 
This phase builds the project's data foundation by querying the SpaceX REST API programmatically. Retrieving per-launch details on rocket versions, launch sites, payloads, and cores lets us go beyond a single static dataset, so the predictive models in later phases are built on real technical parameters for each launch.

**Objectives**
1.  **Extract:** Programmatically request launch data from the SpaceX REST API.
2.  **Transform:** Process the JSON response into a structured pandas DataFrame.
3.  **Load:** Clean and export the dataset for the Exploratory Data Analysis (EDA) phase.

**Methodology: The "Circuit Breaker" Pattern**
To keep this analysis reproducible across different network environments (e.g., CI/CD pipelines or restricted corporate networks), this notebook implements a simplified "Circuit Breaker" strategy, in practice a live-first fallback. It queries the live SpaceX API first. If the request times out or fails, it degrades gracefully to a static backup dataset, so the pipeline can still complete as long as the backup URL is reachable.
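
The fallback idea described above can be sketched independently of any network call. This is a minimal illustration, not the notebook's actual implementation: `fetch_with_fallback`, `load_live`, and `load_backup` are hypothetical names, and the loaders are stand-ins that simulate a failing live source and a working backup. Note that it is a simple ordered fallback; a full circuit breaker would also track failures and stop calling a failing service for a while.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def fetch_with_fallback(sources: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (label, loader) pair in order and return the first that succeeds."""
    errors = []
    for label, loader in sources:
        try:
            return label, loader()
        except Exception as exc:  # broad on purpose: any failure moves on to the next source
            errors.append(f"{label}: {exc}")
    raise RuntimeError("All sources failed: " + "; ".join(errors))


# Hypothetical loaders standing in for the live API call and the static backup.
def load_live():
    raise TimeoutError("simulated network timeout")


def load_backup():
    return [{"flight_number": 1}]


source, records = fetch_with_fallback([("live", load_live), ("backup", load_backup)])
print(source, len(records))  # backup 1
```

Returning the label alongside the data plays the same role as the `USE_LIVE_API` flag later in the notebook: downstream cells can tell which source actually supplied the records.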

**Acknowledgments:**
- Original lab structure from IBM Data Science Professional Certificate

In [1]:
# importing libraries
import requests
import pandas as pd
import numpy as np
import datetime

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [2]:
# DATA EXTRACTION HELPER FUNCTIONS

def getBoosterVersion(data):
    """Extract booster version information from rocket IDs"""
    BoosterVersion = []
    for rocket_id in data['rocket']:
        try:
            response = requests.get(f"https://api.spacexdata.com/v4/rockets/{rocket_id}", timeout=5)
            BoosterVersion.append(response.json()['name'])
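        except (requests.RequestException, KeyError, ValueError):
            # Network error, missing field, or invalid JSON: record a gap instead of aborting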
            BoosterVersion.append(None)
    return BoosterVersion

def getLaunchSite(data):
    """Extract launch site information including coordinates"""
    LaunchSite, Longitude, Latitude = [], [], []
    for launchpad_id in data['launchpad']:
        try:
            response = requests.get(f"https://api.spacexdata.com/v4/launchpads/{launchpad_id}", timeout=5)
            json_data = response.json()
            LaunchSite.append(json_data['name'])
            Longitude.append(json_data['longitude'])
            Latitude.append(json_data['latitude'])
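        except (requests.RequestException, KeyError, ValueError):
            # Network error, missing field, or invalid JSON: record a gap instead of aborting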
            LaunchSite.append(None)
            Longitude.append(None)
            Latitude.append(None)
    return LaunchSite, Longitude, Latitude

def getPayloadData(data):
    """Extract payload mass and orbit information"""
    PayloadMass, Orbit = [], []
    for payload_id in data['payloads']:
        try:
            response = requests.get(f"https://api.spacexdata.com/v4/payloads/{payload_id}", timeout=5)
            json_data = response.json()
            PayloadMass.append(json_data['mass_kg'])
            Orbit.append(json_data['orbit'])
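        except (requests.RequestException, KeyError, ValueError):
            # Network error, missing field, or invalid JSON: record a gap instead of aborting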
            PayloadMass.append(None)
            Orbit.append(None)
    return PayloadMass, Orbit

def getCoreData(data):
    """Extract core/booster landing and reuse information"""
    Block, ReusedCount, Serial = [], [], []
    Outcome, Flights, GridFins, Reused, Legs, LandingPad = [], [], [], [], [], []
    
    for core_dict in data['cores']:
        # Extract core information
        if core_dict.get('core'):
            try:
                response = requests.get(f"https://api.spacexdata.com/v4/cores/{core_dict['core']}", timeout=5)
                json_data = response.json()
                Block.append(json_data.get('block'))
                ReusedCount.append(json_data.get('reuse_count'))
                Serial.append(json_data.get('serial'))
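            except (requests.RequestException, ValueError):
                # Network error or invalid JSON: record a gap instead of aborting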
                Block.append(None)
                ReusedCount.append(None)
                Serial.append(None)
        else:
            Block.append(None)
            ReusedCount.append(None)
            Serial.append(None)
        
        # Extract landing information
        Outcome.append(f"{core_dict.get('landing_success')} {core_dict.get('landing_type')}")
        Flights.append(core_dict.get('flight'))
        GridFins.append(core_dict.get('gridfins'))
        Reused.append(core_dict.get('reused'))
        Legs.append(core_dict.get('legs'))
        LandingPad.append(core_dict.get('landpad'))
    
    return Block, ReusedCount, Serial, Outcome, Flights, GridFins, Reused, Legs, LandingPad
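
The helper functions above issue one HTTP request per row, even though many launches share the same rocket, launchpad, or core ID. One possible refinement, shown here only as a sketch, is to cache lookups so each distinct ID is requested at most once. `make_cached_lookup` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a `requests.get(...).json()` call so the example runs offline.

```python
from functools import lru_cache
from typing import Callable, Dict, Optional


def make_cached_lookup(fetch: Callable[[str], Dict], field: str) -> Callable[[str], Optional[object]]:
    """Wrap a fetch function so each distinct ID is fetched at most once."""

    @lru_cache(maxsize=None)
    def lookup(item_id: str):
        try:
            return fetch(item_id).get(field)
        except Exception:
            # A failed lookup is cached as None too, so it is not retried
            return None

    return lookup


calls = []


def fake_fetch(rocket_id: str) -> Dict:
    """Hypothetical stand-in for an API call; records every invocation."""
    calls.append(rocket_id)
    return {"name": {"r1": "Falcon 1", "r9": "Falcon 9"}[rocket_id]}


lookup_name = make_cached_lookup(fake_fetch, "name")
names = [lookup_name(rid) for rid in ["r1", "r9", "r9", "r9"]]
print(names)       # ['Falcon 1', 'Falcon 9', 'Falcon 9', 'Falcon 9']
print(len(calls))  # 2
```

Because failures are cached as `None`, this variant trades retry behavior for fewer requests; whether that is acceptable depends on how flaky the connection is.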


In [3]:
# SPACEX DATA COLLECTION

# URLs
spacex_url = "https://api.spacexdata.com/v4/launches/past"
static_json_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/API_call_spacex_api.json"

# Attempt to fetch from live API with fallback to static data
print("\nAttempting to fetch data from SpaceX API...")
try:
    response = requests.get(spacex_url, timeout=10)
    response.raise_for_status()
    data = pd.json_normalize(response.json())
    print(f" Success! Fetched {len(data)} records from live API.")
    USE_LIVE_API = True
except Exception as e:
    print(f" Live API failed: {e}")
    print(" Loading static backup data...")
    response = requests.get(static_json_url, timeout=30)
    response.raise_for_status()  # fail loudly if the backup is unreachable as well
    data = pd.json_normalize(response.json())
    print(f" Loaded {len(data)} records from backup.")
    USE_LIVE_API = False

# Display sample
print("\nData Preview:")
print(data.head(2))


Attempting to fetch data from SpaceX API...
 Success! Fetched 187 records from live API.

Data Preview:
       static_fire_date_utc  static_fire_date_unix    net  window  \
0  2006-03-17T00:00:00.000Z           1.142554e+09  False     0.0   
1                      None                    NaN  False     0.0   

                     rocket success  \
0  5e9d0d95eda69955f709d1eb   False   
1  5e9d0d95eda69955f709d1eb   False   

                                                                                                  failures  \
0                                      [{'time': 33, 'altitude': None, 'reason': 'merlin engine failure'}]   
1  [{'time': 301, 'altitude': 289, 'reason': 'harmonic oscillation leading to premature engine shutdown'}]   

                                                                                                                                                                                details  \
0                                                  

In [4]:
# DATA PREPROCESSING

# Selecting relevant columns
data = data[['rocket', 'payloads', 'launchpad', 'cores', 'flight_number', 'date_utc']]

# Filter: Keeping only single-core, single-payload launches
print(f"\nOriginal records: {len(data)}")
data = data[data['cores'].map(len) == 1]
data = data[data['payloads'].map(len) == 1].copy()  # copy so the column assignments below modify a real frame, not a view
print(f"After filtering (single core/payload): {len(data)}")

# Extracting single values from lists
data['cores'] = data['cores'].map(lambda x: x[0])
data['payloads'] = data['payloads'].map(lambda x: x[0])

# Convert dates and filter by cutoff date
data['date'] = pd.to_datetime(data['date_utc']).dt.date
data = data[data['date'] <= datetime.date(2020, 11, 13)]
print(f"After date filtering (≤ 2020-11-13): {len(data)}")


Original records: 187
After filtering (single core/payload): 172
After date filtering (≤ 2020-11-13): 94
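
To make the filtering rules in the cell above concrete, here is a small sketch that applies the same two conditions (exactly one core and one payload, launch date on or before 2020-11-13) to plain dictionaries. The function name `keep_launch` and the sample records are illustrative assumptions, not data returned by the API; only the field names `cores`, `payloads`, and `date_utc` come from the notebook.

```python
from datetime import date, datetime

CUTOFF = date(2020, 11, 13)


def keep_launch(record: dict) -> bool:
    """Return True for single-core, single-payload launches on or before the cutoff."""
    if len(record["cores"]) != 1 or len(record["payloads"]) != 1:
        return False
    # Replace the trailing 'Z' so fromisoformat also works on older Python versions
    launch_date = datetime.fromisoformat(record["date_utc"].replace("Z", "+00:00")).date()
    return launch_date <= CUTOFF


# Made-up records for illustration only
sample = [
    {"cores": [{"core": "a"}], "payloads": ["p1"], "date_utc": "2010-06-04T18:45:00.000Z"},
    {"cores": [{"core": "b"}, {"core": "c"}, {"core": "d"}], "payloads": ["p2"],
     "date_utc": "2018-02-06T20:45:00.000Z"},  # three cores: dropped
    {"cores": [{"core": "e"}], "payloads": ["p3", "p4"],
     "date_utc": "2019-01-11T15:31:00.000Z"},  # two payloads: dropped
    {"cores": [{"core": "f"}], "payloads": ["p5"],
     "date_utc": "2021-01-08T02:15:00.000Z"},  # after the cutoff: dropped
]

kept = [r for r in sample if keep_launch(r)]
print(len(kept))  # 1
```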


In [5]:
# EXTRACTING DETAILED INFORMATION

# Try to use live API for detail extraction, otherwise load pre-processed dataset
if USE_LIVE_API:
    print("\n Extracting details from API (this may take 1-2 minutes)...")
    try:
        BoosterVersion = getBoosterVersion(data)
        LaunchSite, Longitude, Latitude = getLaunchSite(data)
        PayloadMass, Orbit = getPayloadData(data)
        Block, ReusedCount, Serial, Outcome, Flights, GridFins, Reused, Legs, LandingPad = getCoreData(data)
        
        # Create DataFrame
        data_falcon9 = pd.DataFrame({
            'FlightNumber': list(data['flight_number']),
            'Date': list(data['date']),
            'BoosterVersion': BoosterVersion,
            'PayloadMass': PayloadMass,
            'Orbit': Orbit,
            'LaunchSite': LaunchSite,
            'Outcome': Outcome,
            'Flights': Flights,
            'GridFins': GridFins,
            'Reused': Reused,
            'Legs': Legs,
            'LandingPad': LandingPad,
            'Block': Block,
            'ReusedCount': ReusedCount,
            'Serial': Serial,
            'Longitude': Longitude,
            'Latitude': Latitude
        })
        print(" Detail extraction complete!")
        
    except Exception as e:
        print(f"  API detail extraction failed: {e}")
        print(" Loading pre-processed dataset...")
        dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv"
        data_falcon9 = pd.read_csv(dataset_url)
else:
    print("\n Loading pre-processed dataset (API unavailable)...")
    dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv"
    data_falcon9 = pd.read_csv(dataset_url)


 Extracting details from API (this may take 1-2 minutes)...
 Detail extraction complete!


In [9]:
# FILTER FALCON 9 LAUNCHES ONLY

print(f"\nRecords before Falcon 9 filter: {len(data_falcon9)}")
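# Note: rows whose BoosterVersion lookup failed (None) also pass this filter.
# .copy() avoids a SettingWithCopyWarning when FlightNumber is reassigned below.
data_falcon9 = data_falcon9[data_falcon9['BoosterVersion'] != 'Falcon 1'].copy()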
print(f"Records after Falcon 9 filter: {len(data_falcon9)}")

# Reset flight numbers
data_falcon9['FlightNumber'] = range(1, len(data_falcon9) + 1)


Records before Falcon 9 filter: 90
Records after Falcon 9 filter: 90


In [10]:
# HANDLING MISSING VALUES

print("\nMissing values per column:")
print(data_falcon9.isnull().sum())

# Fill PayloadMass with mean
if data_falcon9['PayloadMass'].isnull().any():
    payload_mean = data_falcon9['PayloadMass'].mean()
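    # Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
    data_falcon9['PayloadMass'] = data_falcon9['PayloadMass'].fillna(payload_mean)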
    print(f"\n Filled PayloadMass NaN with mean: {payload_mean:.2f} kg")


Missing values per column:
FlightNumber       0
Date               0
BoosterVersion     4
PayloadMass        0
Orbit              2
LaunchSite         3
Outcome            0
Flights            0
GridFins           0
Reused             0
Legs               0
LandingPad        26
Block              1
ReusedCount        1
Serial             1
Longitude          3
Latitude           3
dtype: int64
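
The mean-imputation step used for `PayloadMass` can be illustrated in plain Python to make the arithmetic visible. This is a sketch with made-up values, and `fill_with_mean` is a hypothetical helper rather than part of the notebook: the mean is computed over the observed values only, then substituted for each missing entry.

```python
from statistics import fmean
from typing import List, Optional


def fill_with_mean(values: List[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


# Made-up payload masses in kg; None marks a missing value
masses = [500.0, None, 700.0, 600.0, None]
print(fill_with_mean(masses))  # [500.0, 600.0, 700.0, 600.0, 600.0]
```

Mean imputation leaves the column mean unchanged but shrinks its variance, which is worth keeping in mind when `PayloadMass` is used as a model feature later.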


In [11]:
print("\n" + "=" * 70)
print("FINAL DATASET SUMMARY")
print("=" * 70)

print(f"\nTotal records: {len(data_falcon9)}")
print(f"Columns: {len(data_falcon9.columns)}")
print(f"\nData types:\n{data_falcon9.dtypes}")

print("\n" + "=" * 70)
print("FIRST 5 ROWS OF CLEANED DATA")
print("=" * 70)
print(data_falcon9.head())

# Export to CSV
output_file = 'dataset_part_1.csv'
data_falcon9.to_csv(output_file, index=False)
print(f"\n Data saved to '{output_file}'")
print("\n" + "=" * 70)
print("DATA COLLECTION COMPLETE!")
print("=" * 70)


FINAL DATASET SUMMARY

Total records: 90
Columns: 17

Data types:
FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount       float64
Serial             object
Longitude         float64
Latitude          float64
dtype: object

FIRST 5 ROWS OF CLEANED DATA
   FlightNumber        Date BoosterVersion  PayloadMass Orbit    LaunchSite  \
4             1  2010-06-04       Falcon 9  6015.681325   LEO  CCSFS SLC 40   
5             2  2012-05-22       Falcon 9   525.000000   LEO  CCSFS SLC 40   
6             3  2013-03-01       Falcon 9   677.000000   ISS  CCSFS SLC 40   
7             4  2013-09-29       Falcon 9   500.000000    PO   VAFB SLC 4E   
8             5  2013-12-03       Falcon 9  3170.000