# Data Pipeline Refresh Playbook

**Purpose:** Re-run source ingestion and keep derived datasets/docs in sync when upstream CSVs change.

**Checklist:**
1) Pull latest raw sources into `data/original` (SATCAT, UCS, countries.geojson).
2) Quick diff: row counts + schema drift; snapshot versions/date ranges.
3) Re-run cleaning: `01_ucs_cleanup` → `02_satcat_cleanup` → `03_orbital_risk_synthesis`.
4) Regenerate outputs: `ucs_cleaned.csv`, `satcat_cleaned.csv`, `kinetic_master.csv`.
5) Refresh visuals: rerun plotting cells to update `/images` exports.
6) Update docs: README stats (objects, mass, KE, zombies, velocity) + figure captions if changed.
7) Log run metadata (source dates, hashes) in this notebook for traceability.
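
Step 2 of the checklist (row counts + schema drift) could be sketched roughly as below. This is a minimal, standard-library illustration, not the notebook's actual method: `summarize_csv` and `quick_diff` are hypothetical helper names, and it assumes the previous snapshot of each source is still on disk for comparison.

```python
import csv


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # first row is treated as the schema
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count


def quick_diff(old_path, new_path):
    """Compare two snapshots of one source: row delta and column changes."""
    old_cols, old_rows = summarize_csv(old_path)
    new_cols, new_rows = summarize_csv(new_path)
    return {
        "row_delta": new_rows - old_rows,
        "added_columns": [c for c in new_cols if c not in old_cols],
        "removed_columns": [c for c in old_cols if c not in new_cols],
        # same set of columns, but in a different order
        "reordered": set(old_cols) == set(new_cols) and old_cols != new_cols,
    }
```

A non-empty `added_columns` or `removed_columns` list is a signal to review the cleaning notebooks before re-running them.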

**Next step:** wire a small automation cell here to run the above sequence end-to-end.

In [1]:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

DATA_DIR = "../data/original"

# create the data/original folder if it doesn't already exist
os.makedirs(DATA_DIR, exist_ok=True)

# print the absolute path so it can be verified
print(f"📂 Saving data to: {os.path.abspath(DATA_DIR)}")

📂 Saving data to: d:\repos\orbital-debris-assessment\data\original


### Fetch CelesTrak

**CelesTrak** SATCAT.csv

In [2]:
def fetch_celestrak():
    """
    Update the local copy of satcat.csv from CelesTrak.
    """
    print("--- Fetching CelesTrak (SATCAT) ---")
    url = "https://celestrak.org/pub/satcat.csv"

    # build the destination path inside DATA_DIR
    save_path = os.path.join(DATA_DIR, "satcat.csv")

    try:
        # stream=True avoids loading the whole file into memory;
        # timeout keeps the cell from hanging if the server stalls
        response = requests.get(url, stream=True, timeout=30)

        # raise an HTTPError on 4xx/5xx responses (e.g. a broken link)
        response.raise_for_status()

        # report when the server says the file was last updated
        last_modified = response.headers.get("Last-Modified")
        if last_modified:
            print(f"📅 Server Last Update: {last_modified}")

        # no error was raised, so we're good to save it
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                f.write(chunk)

        # report where the file was saved
        print(f"✅ Success! SATCAT saved to: {save_path}")
    except Exception as e:
        # report the error message
        print(f"❌ Error downloading CelesTrak: {e}")
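
One caveat with writing chunks straight to the final path: if the connection drops mid-download, a truncated `satcat.csv` is left behind. A possible mitigation, sketched here under the assumption that the temporary file lives in the same folder as the destination, is to write to a `.part` file and rename it only on success. `write_chunks_atomically` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile


def write_chunks_atomically(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path, replacing it only on success."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # the temp file sits next to the destination so the final rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.replace(tmp_path, dest_path)  # swap in the complete file
    except BaseException:
        # any previous copy of dest_path is left untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Inside a fetch function this would be called as `write_chunks_atomically(response.iter_content(chunk_size=1024), save_path)` in place of the `with open(...)` loop.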

### Fetch UCS

**UCS** UCS-Satellite-Database 5-1-2023.csv

In [3]:
def fetch_ucs():
    """
    Scrape the UCS landing page for the database link, download the
    Excel file, and convert it to CSV.
    """
    print("\n--- Fetching UCS Satellite Database ---")
    landing_page = "https://www.ucsusa.org/resources/satellite-database"

    # send browser-like headers so the request looks like it comes from a real browser;
    # BeautifulSoup is then used to parse the returned HTML
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
        "Referer": "https://www.google.com/"
    }

    try:
        print("   👀 Scouting the landing page...")
        response = requests.get(landing_page, headers=headers, timeout=30)

        if response.status_code != 200:
            print(f"   ❌ Blocked! Status Code: {response.status_code}")
            return

        soup = BeautifulSoup(response.text, 'html.parser')

        target_link = None

        # look for the first anchor whose visible text is exactly "database"
        for link in soup.find_all('a', href=True):
            text = link.text.strip().lower()
            if text == "database":
                target_link = link['href']
                print(f"   🎯 Found the link: {target_link}")
                break

        if not target_link:
            print("   ❌ Still could not find the link.")
            return

        # the href may be relative (e.g. /media/11492), so resolve it against the landing page
        full_url = urljoin(landing_page, target_link)

        print("   ⬇️  Downloading the Excel file...")

        file_response = requests.get(full_url, headers=headers, stream=True, timeout=30)

        # stop here on 4xx/5xx so an error page is not saved as the spreadsheet
        file_response.raise_for_status()

        last_modified = file_response.headers.get("Last-Modified")
        if last_modified:
            print(f"   📅 Server Last Update: {last_modified}")

        filename_from_url = full_url.split("/")[-1]
        print(f"   🏷️ Remote Filename: {filename_from_url}")

        excel_path = os.path.join(DATA_DIR, "UCS_raw.xlsx")

        with open(excel_path, 'wb') as f:
            for chunk in file_response.iter_content(chunk_size=1024):
                f.write(chunk)

        # pd.read_excel needs an Excel engine such as openpyxl installed for .xlsx files
        print("   🔄 Converting to CSV...")
        df = pd.read_excel(excel_path)
        csv_path = os.path.join(DATA_DIR, "UCS-Satellite-Database.csv")
        df.to_csv(csv_path, index=False)

        print(f"   ✅ Success! Saved to: {csv_path}")

    except Exception as e:
        print(f"   ❌ Error with UCS data: {e}")
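
The link handling in `fetch_ucs` depends on how `urljoin` resolves the scraped `href` against the landing page. The small sketch below illustrates that behaviour. `resolve_link` is a hypothetical helper, and the second `href` is an invented example (not something observed on the UCS site) showing why parsing the URL path can be more robust than `full_url.split("/")[-1]` when a query string is present.

```python
import posixpath
from urllib.parse import urljoin, urlparse

landing_page = "https://www.ucsusa.org/resources/satellite-database"


def resolve_link(base, href):
    """Return (absolute URL, last path segment) for an href found on the page at base."""
    full_url = urljoin(base, href)
    # take the filename from the path only, so any ?query or #fragment is ignored
    filename = posixpath.basename(urlparse(full_url).path)
    return full_url, filename


# root-relative href, as seen in the run output
print(resolve_link(landing_page, "/media/11492"))
# ('https://www.ucsusa.org/media/11492', '11492')

# hypothetical page-relative href with a query string
print(resolve_link(landing_page, "files/db.xlsx?download=1"))
# ('https://www.ucsusa.org/resources/files/db.xlsx?download=1', 'db.xlsx')
```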

In [4]:
# Fetch baseline countries GeoJSON for geopandas maps
# thankfully no scraping is required
def fetch_countries_geojson():
    print("\n--- Fetching countries.geojson ---")

    url = "https://raw.githubusercontent.com/johan/world.geo.json/master/countries.geo.json"
    save_path = os.path.join(DATA_DIR, "countries.geojson")

    try:
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()

        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                f.write(chunk)

        last_modified = response.headers.get("Last-Modified")

        if last_modified:
            print(f"📅 Server Last Update: {last_modified}")

        print(f"✅ Success! GeoJSON saved to: {save_path}")
    except Exception as e:
        print(f"❌ Error downloading GeoJSON: {e}")
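
Because the GeoJSON is saved as raw bytes, a cheap sanity check after the download can catch an HTML error page or a truncated file before geopandas sees it. The sketch below is illustrative only: `summarize_geojson` is a hypothetical helper, and it assumes the file is a single GeoJSON `FeatureCollection`.

```python
import json


def summarize_geojson(path):
    """Load a GeoJSON file and return (type, feature count), raising on bad input."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # raises json.JSONDecodeError if the file is not valid JSON
    if not isinstance(data, dict) or data.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    return data["type"], len(data.get("features", []))
```

Calling `summarize_geojson(os.path.join(DATA_DIR, "countries.geojson"))` after `fetch_countries_geojson()` would then report how many country features were loaded.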

### Execute Fetch

In [5]:
fetch_ucs()
print()
fetch_countries_geojson()
print()
fetch_celestrak()


--- Fetching UCS Satellite Database ---
   👀 Scouting the landing page...
   🎯 Found the link: /media/11492
   ⬇️  Downloading the Excel file...
   📅 Server Last Update: Tue, 02 Jan 2024 14:39:30 GMT
   🏷️ Remote Filename: 11492
   🔄 Converting to CSV...
   ✅ Success! Saved to: ../data/original\UCS-Satellite-Database.csv


--- Fetching countries.geojson ---
✅ Success! GeoJSON saved to: ../data/original\countries.geojson

--- Fetching CelesTrak (SATCAT) ---
📅 Server Last Update: Sun, 25 Jan 2026 20:32:22 GMT
✅ Success! SATCAT saved to: ../data/original\satcat.csv
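
Checklist step 7 asks for source dates and hashes to be logged for traceability. One way this might look, using only the standard library, is sketched below. `file_sha256` and `build_run_record` are hypothetical helpers, and the `source_dates` mapping is assumed to be filled in by hand or from the `Last-Modified` values printed above.

```python
import hashlib
import json
import os
from datetime import datetime, timezone


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(paths, source_dates=None):
    """Collect hash, size, and reported source date for each downloaded file."""
    source_dates = source_dates or {}
    files = {}
    for path in paths:
        name = os.path.basename(path)
        files[name] = {
            "sha256": file_sha256(path),
            "bytes": os.path.getsize(path),
            "last_modified": source_dates.get(name),  # None if the server sent no header
        }
    return {"run_at": datetime.now(timezone.utc).isoformat(), "files": files}
```

The record can be printed in a cell with `print(json.dumps(record, indent=2))` so the hashes stay in the notebook output alongside the fetch log.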
