# Step 2: Data Wrangling

This notebook covers the data wrangling phase: we take the raw data collected in the previous step and transform it into a clean, structured dataset ready for analysis and modeling.

**Key Tasks:**
1. Load the raw data from `data/raw/`.
2. Clean the API data using the custom `clean_api_data` function from `src/wrangle.py`.
3. Verify that the cleaned data has no missing values.
4. Save the final cleaned dataset to the `data/interim/` directory.

## 2.1: Setup and Imports

In [7]:
import sys
import os
import pandas as pd
import numpy as np
import yaml

# Get the kernel's working directory (assumed to be the folder containing this notebook)
notebook_dir = os.getcwd()
# Go up one level to get to the project's root directory
project_root = os.path.abspath(os.path.join(notebook_dir, '..'))

# Add the project root to the Python path
if project_root not in sys.path:
    sys.path.insert(0, project_root)  # Position 0 gives the project root priority over other entries

# With the project root on sys.path, the src package can be imported
from src.wrangle import clean_api_data
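
The setup cell relies on the kernel having been started one level below the project root. As a minimal sketch of a more defensive alternative, the helper below walks upward from a starting directory until it finds a folder that contains `src/`. The names `find_project_root` and `add_to_sys_path`, and the choice of `src` as the marker directory, are assumptions made for illustration and are not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: using `src` as the marker is an assumption.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found at or above {start}")


def add_to_sys_path(root: Path) -> None:
    """Put `root` at the front of sys.path unless it is already present."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
```

In the notebook this would be called as `add_to_sys_path(find_project_root(Path.cwd()))`, which should keep working if the notebook is later moved into a deeper subfolder.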

## 2.2: Load Configuration and Raw Data

In [8]:
# Load config file
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Define file paths from config
RAW_DATA_PATH = os.path.join('..', config['data_paths']['raw'])
INTERIM_DATA_PATH = os.path.join('..', config['data_paths']['interim'])
API_FILE = os.path.join(RAW_DATA_PATH, config['data_files']['raw_api_data'])

# Load the raw dataset
api_df = pd.read_json(API_FILE)

print("Raw API data shape:", api_df.shape)
api_df.head()

Raw API data shape: (205, 43)


|  | static_fire_date_utc | static_fire_date_unix | net | window | rocket | success | failures | details | crew | ships | ... | links.reddit.media | links.reddit.recovery | links.flickr.small | links.flickr.original | links.presskit | links.webcast | links.youtube_id | links.article | links.wikipedia | fairings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2006-03-17T00:00:00.000Z | 1142554000.0 | False | 0.0 | 5e9d0d95eda69955f709d1eb | 0.0 | [{'time': 33, 'altitude': None, 'reason': 'mer... | Engine failure at 33 seconds and loss of vehicle | [] | [] | ... |  |  | [] | [] |  | https://www.youtube.com/watch?v=0a_00nJ_Y88 | 0a_00nJ_Y88 | https://www.space.com/2196-spacex-inaugural-fa... | https://en.wikipedia.org/wiki/DemoSat |  |
| 1 |  |  | False | 0.0 | 5e9d0d95eda69955f709d1eb | 0.0 | [{'time': 301, 'altitude': 289, 'reason': 'har... | Successful first stage burn and transition to ... | [] | [] | ... |  |  | [] | [] |  | https://www.youtube.com/watch?v=Lk4zQ2wP-Nc | Lk4zQ2wP-Nc | https://www.space.com/3590-spacex-falcon-1-roc... | https://en.wikipedia.org/wiki/DemoSat |  |
| 2 |  |  | False | 0.0 | 5e9d0d95eda69955f709d1eb | 0.0 | [{'time': 140, 'altitude': 35, 'reason': 'resi... | Residual stage 1 thrust led to collision betwe... | [] | [] | ... |  |  | [] | [] |  | https://www.youtube.com/watch?v=v0w9p3U8860 | v0w9p3U8860 | http://www.spacex.com/news/2013/02/11/falcon-1... | https://en.wikipedia.org/wiki/Trailblazer_(sat... |  |
| 3 | 2008-09-20T00:00:00.000Z | 1221869000.0 | False | 0.0 | 5e9d0d95eda69955f709d1eb | 1.0 | [] | Ratsat was carried to orbit on the first succe... | [] | [] | ... |  |  | [] | [] |  | https://www.youtube.com/watch?v=dLQ2tZEH6G0 | dLQ2tZEH6G0 | https://en.wikipedia.org/wiki/Ratsat | https://en.wikipedia.org/wiki/Ratsat |  |
| 4 |  |  | False | 0.0 | 5e9d0d95eda69955f709d1eb | 1.0 | [] |  | [] | [] | ... |  |  | [] | [] | http://www.spacex.com/press/2012/12/19/spacexs... | https://www.youtube.com/watch?v=yTaIDooc8Og | yTaIDooc8Og | http://www.spacex.com/news/2013/02/12/falcon-1... | https://en.wikipedia.org/wiki/RazakSAT |  |
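
The loading cell indexes `config['data_paths']` and `config['data_files']` directly, so a missing key surfaces as a bare `KeyError` partway through path construction. One possible guard, sketched below, checks the keys this notebook uses before any path is built. The helper name `validate_config` and the `REQUIRED_KEYS` table are hypothetical; only the key names themselves come from the notebook's cells.

```python
# Keys this notebook reads from config.yaml (taken from the cells above and below).
REQUIRED_KEYS = {
    "data_paths": ["raw", "interim"],
    "data_files": ["raw_api_data", "wrangled_data"],
}


def validate_config(config: dict) -> None:
    """Raise a descriptive KeyError if any required config entry is absent.

    Hypothetical helper, not part of the project's src package.
    """
    missing = [
        f"{section}.{key}"
        for section, keys in REQUIRED_KEYS.items()
        for key in keys
        if key not in (config.get(section) or {})
    ]
    if missing:
        raise KeyError(f"config.yaml is missing required keys: {', '.join(missing)}")
```

Calling `validate_config(config)` right after `yaml.safe_load(f)` would report every missing entry at once instead of failing on the first lookup.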


In [9]:
# --- GROUND TRUTH DIAGNOSTIC ---
# Print all available column names from the raw dataframe
print("--- ALL AVAILABLE COLUMNS IN THE RAW DATA ---")
print(api_df.columns.tolist())

--- ALL AVAILABLE COLUMNS IN THE RAW DATA ---
['static_fire_date_utc', 'static_fire_date_unix', 'net', 'window', 'rocket', 'success', 'failures', 'details', 'crew', 'ships', 'capsules', 'payloads', 'launchpad', 'flight_number', 'name', 'date_utc', 'date_unix', 'date_local', 'date_precision', 'upcoming', 'cores', 'auto_update', 'tbd', 'launch_library_id', 'id', 'fairings.reused', 'fairings.recovery_attempt', 'fairings.recovered', 'fairings.ships', 'links.patch.small', 'links.patch.large', 'links.reddit.campaign', 'links.reddit.launch', 'links.reddit.media', 'links.reddit.recovery', 'links.flickr.small', 'links.flickr.original', 'links.presskit', 'links.webcast', 'links.youtube_id', 'links.article', 'links.wikipedia', 'fairings']
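
Dotted column names such as `links.patch.small` and `fairings.reused` are what `pandas.json_normalize` produces when it flattens nested dictionaries, so the raw file was presumably written that way in the previous step (an assumption; the collection code is not shown here). The sketch below uses an invented record to show the effect, including the fact that list-valued fields such as `cores` are left as Python lists and need separate extraction.

```python
import pandas as pd

# Invented record that mimics the nesting of a launch entry; values are placeholders.
records = [
    {
        "name": "Example launch",
        "cores": [{"core": None, "flight": 1}],
        "links": {"webcast": None, "patch": {"small": None, "large": None}},
    }
]

# Nested dicts become dotted column names; lists are kept as single cells.
flat = pd.json_normalize(records)
print(sorted(flat.columns))
print(type(flat.loc[0, "cores"]))
```

This is why columns like `failures`, `payloads`, and `cores` in the listing above still hold lists after loading.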


## 2.3: Clean the SpaceX API Data

Next we call `clean_api_data` from `src/wrangle.py`. This single function handles the data extraction and cleaning logic; as its log output below shows, it also fetches payload details from the API to build a lookup map.

In [10]:
# Use our custom function to clean the API data
cleaned_spacex_df = clean_api_data(api_df)

print("Cleaned SpaceX data shape:", cleaned_spacex_df.shape)
cleaned_spacex_df.head()

Fetching payload details from API...
Successfully created payload lookup map.
Cleaned SpaceX data shape: (205, 12)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['PayloadMass'].fillna(value=mean_payload, inplace=True)
  data.fillna({


|  | flight_number | Date | BoosterVersion | PayloadMass | Orbit | LaunchSite | Outcome | Flights | GridFins | Reused | Legs | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2006-03-24 22:30:00+00:00 | FalconSat | 20.0 | LEO | 5e9e4502f5090995de566f86 | False | 1.0 | False | False | False | 0 |
| 1 | 2 | 2007-03-21 01:10:00+00:00 | DemoSat | 7842.388855 | LEO | 5e9e4502f5090995de566f86 | False | 1.0 | False | False | False | 0 |
| 2 | 3 | 2008-08-03 03:34:00+00:00 | Trailblazer | 7842.388855 | LEO | 5e9e4502f5090995de566f86 | False | 1.0 | False | False | False | 0 |
| 3 | 4 | 2008-09-28 23:15:00+00:00 | RatSat | 165.0 | LEO | 5e9e4502f5090995de566f86 | False | 1.0 | False | False | False | 0 |
| 4 | 5 | 2009-07-13 03:35:00+00:00 | RazakSat | 200.0 | LEO | 5e9e4502f5090995de566f86 | False | 1.0 | False | False | False | 0 |
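
The pandas warning in the output above points at the line `data['PayloadMass'].fillna(value=mean_payload, inplace=True)` inside the cleaning function. The message itself recommends assigning the result back to the column. A minimal sketch of that change on a toy frame follows; the names `data` and `mean_payload` are taken from the warning, while the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the frame inside clean_api_data; the values are invented.
data = pd.DataFrame({"PayloadMass": [20.0, np.nan, 165.0, np.nan]})
mean_payload = data["PayloadMass"].mean()  # NaN entries are skipped by default

# Instead of: data['PayloadMass'].fillna(value=mean_payload, inplace=True)
# assign the filled Series back to the column, as the warning suggests.
data["PayloadMass"] = data["PayloadMass"].fillna(mean_payload)
```

Mean imputation of this kind would also explain why rows 1 and 2 of the cleaned output share the same `PayloadMass` of 7842.388855: both launches presumably had no recorded payload mass and received the column mean.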


In [11]:
# Check for null values in the cleaned data. The goal is to see all zeros.
cleaned_spacex_df.isnull().sum()

flight_number     0
Date              0
BoosterVersion    0
PayloadMass       0
Orbit             0
LaunchSite        0
Outcome           0
Flights           0
GridFins          0
Reused            0
Legs              0
class             0
dtype: int64
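
Reading the null counts by eye works for twelve columns, but the check can also be made to fail loudly if a later change reintroduces gaps. The helper below is one way to do that; `assert_no_missing` is a hypothetical name, not a function from `src/wrangle.py`.

```python
import pandas as pd


def assert_no_missing(df: pd.DataFrame) -> None:
    """Raise ValueError listing every column that still contains nulls.

    Hypothetical helper for illustration.
    """
    null_counts = df.isnull().sum()
    offenders = null_counts[null_counts > 0]
    if not offenders.empty:
        raise ValueError(f"Missing values found:\n{offenders.to_string()}")
```

In this notebook it would be called as `assert_no_missing(cleaned_spacex_df)`. Note that `isnull()` only detects `NaN`, `None`, and `NaT`; empty strings or empty lists would pass this check unnoticed.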

## 2.4: Save the Cleaned Data

The dataset is now clean and ready for analysis. We will save it to the `data/interim` folder. This file will be the input for our next notebook (EDA).

In [12]:
# Define the output path
WRANGLED_DATA_FILE = os.path.join(INTERIM_DATA_PATH, config['data_files']['wrangled_data'])

# Save the dataframe to a CSV file
cleaned_spacex_df.to_csv(WRANGLED_DATA_FILE, index=False)

print(f"Cleaned data saved to {WRANGLED_DATA_FILE}")

Cleaned data saved to ..\data/interim\cleaned_launches.csv
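
`to_csv` does not create missing folders, so the save cell fails on a fresh checkout where `data/interim` does not exist yet. A small sketch of a save step that creates the directory first is shown below. The helper name `save_csv` is an assumption; using `pathlib` also avoids the mixed `\` and `/` separators visible in the printed path, which come from joining a forward-slash config value with `os.path.join` on Windows.

```python
from pathlib import Path

import pandas as pd


def save_csv(df: pd.DataFrame, directory, filename: str) -> Path:
    """Write `df` to `directory/filename`, creating the directory if needed.

    Hypothetical helper for illustration.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)
    return out_path
```

Here the call would look like `save_csv(cleaned_spacex_df, INTERIM_DATA_PATH, config['data_files']['wrangled_data'])`. Because CSV stores everything as text, the `Date` column comes back as strings unless the next notebook passes something like `parse_dates=['Date']` to `pd.read_csv`.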
