# Data Pipeline Refresh Playbook

**Purpose:** Re-run source ingestion and keep derived datasets/docs in sync when upstream CSVs change.

**Checklist:**
1) Pull latest raw sources into `data/original` (SATCAT, UCS, countries.geojson).
2) Quick diff: row counts + schema drift; snapshot versions/date ranges.
3) Re-run cleaning: `01_ucs_cleanup` → `02_satcat_cleanup` → `03_orbital_risk_synthesis`.
4) Regenerate outputs: `ucs_cleaned.csv`, `satcat_cleaned.csv`, `kinetic_master.csv`.
5) Refresh visuals: rerun plotting cells to update `/images` exports.
6) Update docs: README stats (objects, mass, KE, zombies, velocity) + figure captions if changed.
7) Log run metadata (source dates, hashes) in this notebook for traceability.
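
Step 2 of the checklist (row counts and schema drift) can be sketched as a comparison between the previous copy of a source file and the freshly downloaded one. This is a minimal sketch that uses only the standard library; the helper names `summarize_csv` and `diff_snapshots` are hypothetical, and it assumes both files are UTF-8 CSVs with a single header row.

```python
import csv


def summarize_csv(path):
    """Return (column_names, data_row_count) for a CSV with one header row."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # empty file -> no columns
        row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return header, row_count


def diff_snapshots(old_path, new_path):
    """Compare two versions of the same source file (hypothetical helper)."""
    old_cols, old_rows = summarize_csv(old_path)
    new_cols, new_rows = summarize_csv(new_path)
    return {
        "row_delta": new_rows - old_rows,
        "added_columns": [c for c in new_cols if c not in old_cols],
        "removed_columns": [c for c in old_cols if c not in new_cols],
    }
```

A non-empty `added_columns` or `removed_columns` list is a signal to review the cleaning notebooks before re-running them, since they may select columns by name.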

**Next step:** wire a small automation cell here to run the above sequence end-to-end.
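
One possible shape for that automation cell is a small runner that executes named steps in order and stops at the first failure. This is only a sketch: `run_pipeline` is a hypothetical helper, and how the cleaning notebooks would be invoked as steps is left open.

```python
def run_pipeline(steps):
    """Run (name, callable) pairs in order; stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, step in steps:
        print(f"--- {name} ---")
        try:
            step()
        except Exception as exc:
            # Later steps depend on earlier outputs, so do not continue.
            print(f"Step '{name}' failed: {exc}")
            break
        completed.append(name)
    return completed
```

It could be called as `run_pipeline([("fetch_ucs", fetch_ucs), ("fetch_celestrak", fetch_celestrak)])`. Note that the fetch functions in this notebook catch their own exceptions and only print a message, so they would need to re-raise (or return a status that the step checks) for the runner to detect a failed download.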

In [1]:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

DATA_DIR = "../data/original"

# create the data/original folder if it doesn't already exist
os.makedirs(DATA_DIR, exist_ok=True)

# print a message to actually show the path so it can be verified
print(f"📂 Saving data to: {os.path.abspath(DATA_DIR)}")

📂 Saving data to: d:\repos\orbital-debris-assessment\data\original


### Fetch CelesTrak

**CelesTrak** SATCAT.csv

In [2]:
def fetch_celestrak():
    """
    Updates the local copy of satcat.csv
    """
    print("--- Fetching CelesTrak (SATCAT) ---")
    url = "https://celestrak.org/pub/satcat.csv"

    # join the file paths
    save_path = os.path.join(DATA_DIR, "satcat.csv")

    try:
        # stream=True avoids loading the whole file into memory;
        # the timeout keeps a stalled connection from hanging the cell
        response = requests.get(url, stream=True, timeout=30)
        
        # raises an HTTPError on 4xx/5xx responses
        response.raise_for_status()
        
        # report when the server last modified the file, if it says so
        last_modified = response.headers.get("Last-Modified")
        if last_modified:
            print(f"📅 Server Last Update: {last_modified}")

        # no error was raised, so write the file to disk in chunks
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                f.write(chunk)

        # report where the file was saved
        print(f"✅ Success! SATCAT saved to: {save_path}")
    except Exception as e:
        # report the failure instead of raising so the notebook keeps running
        print(f"❌ Error downloading CelesTrak: {e}")
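
Checklist step 7 asks for source dates and hashes to be logged. A minimal sketch of what such a record could look like for a downloaded file is shown here; `sha256_of` and `run_record` are hypothetical helpers, and the field names are assumptions rather than an established format for this project.

```python
import hashlib
import os
from datetime import datetime, timezone


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_record(path, last_modified=None):
    """Build a traceability record for one source file (hypothetical layout)."""
    return {
        "file": os.path.basename(path),
        "bytes": os.path.getsize(path),
        "sha256": sha256_of(path),
        # Value of the server's Last-Modified header, if one was returned.
        "server_last_modified": last_modified,
        "fetched_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

Comparing the `sha256` value against the previous run is a cheap way to tell whether the upstream file actually changed before re-running the cleaning notebooks.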

### Fetch UCS

**UCS** UCS-Satellite-Database 5-1-2023.csv

In [3]:
# Scrapes the UCS satellite-database page to find the most up-to-date dataset download link.
# The dataset filename changes with every update, so it's best to save the original copy under a standardized name for later loading.
def fetch_ucs():
    print("\n--- Fetching UCS Satellite Database ---")
    landing_page = "https://www.ucsusa.org/resources/satellite-database"

    # scraping the <a> links didn't work without sending browser-like headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    try:
        print("   👀 Scouting the landing page...")
        response = requests.get(landing_page, headers=headers, timeout=30)
        
        # stop early if the landing page did not load
        if response.status_code != 200:
            print(f"   ❌ Blocked! Status Code: {response.status_code}")
            return

        soup = BeautifulSoup(response.text, 'html.parser')
        
        target_link = None
        
        for link in soup.find_all('a', href=True):
            # strip spaces and convert to lower
            text = link.text.strip().lower()
            
            # we want the link whose text is exactly "database", which skips the "official names only" variant
            if text == "database":
                target_link = link['href']
                print(f"   🎯 Found the link: {target_link}")
                break
        
        if not target_link:
            print("   ❌ Still could not find the link.")
            return

        full_url = urljoin(landing_page, target_link)

        print("   ⬇️  Downloading the Excel file (this might take a moment)...")
        
        # the UCS link often redirects to a file server; requests follows redirects automatically
        # file downloads and API responses are significantly easier to handle in Python
        # UCS provides an Excel download but no CSV, so we convert it once the download finishes
        excel_path = os.path.join(DATA_DIR, "UCS_raw.xlsx")
        
        file_response = requests.get(full_url, headers=headers, stream=True, timeout=60)
        
        # raise an HTTPError rather than saving an error page as the workbook
        file_response.raise_for_status()
        
        with open(excel_path, 'wb') as f:
            for chunk in file_response.iter_content(chunk_size=1024):
                f.write(chunk)

        # now that we've saved the Excel file, convert it to CSV and export it under a standardized name
        print("   🔄 Converting to CSV...")

        # pandas makes this trivial
        df = pd.read_excel(excel_path)
        csv_path = os.path.join(DATA_DIR, "UCS-Satellite-Database.csv")
        df.to_csv(csv_path, index=False)
        
        # I like how we can use images directly in the editor and they render properly in the cell output
        print(f"   ✅ Success! Saved to: {csv_path}")

    except Exception as e:
        print(f"   ❌ Error with UCS data: {e}")
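
`fetch_ucs` writes whatever the server returns to `UCS_raw.xlsx`, so an HTML block page served with a normal status would only surface later as a confusing `read_excel` error. A quick sanity check before conversion could look like the sketch below. It assumes the download is an `.xlsx` file, which is a ZIP-based Office Open XML package; a legacy `.xls` workbook uses a different container and would fail this check. The helper name `looks_like_ooxml` is hypothetical.

```python
import zipfile


def looks_like_ooxml(path):
    """Return True if the file is a ZIP archive with the OOXML content-types part.

    This confirms an Office Open XML package in general, not specifically
    a spreadsheet, so it is a sanity check rather than full validation.
    """
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "[Content_Types].xml" in archive.namelist()
```

Calling `looks_like_ooxml(excel_path)` right after the download loop, and returning early with a message when it is `False`, would keep a bad download from overwriting the standardized CSV.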

### Execute Fetch

In [5]:
fetch_ucs()
print()
fetch_celestrak()


--- Fetching UCS Satellite Database ---
   👀 Scouting the landing page...
   🎯 Found the link: /media/11492
   ⬇️  Downloading the Excel file (this might take a moment)...
   🔄 Converting to CSV...
   ✅ Success! Saved to: ../data/original\UCS-Satellite-Database.csv

--- Fetching CelesTrak (SATCAT) ---
📅 Server Last Update: Sat, 24 Jan 2026 18:41:57 GMT
✅ Success! SATCAT saved to: ../data/original\satcat.csv
