# UCS Data Pipeline: Standardization & Normalization

**Dataset:** Union of Concerned Scientists (UCS) Satellite Database  
**Objective:** Prepare active satellite registry data for merger with SATCAT.

### **The Engineering Challenge**
The UCS database is human-maintained, leading to significant inconsistencies in categorical fields. To make this data machine-readable for our "Kessler Syndrome" analysis, we must implement a strict cleaning pipeline:
1.  **Ingestion & Sanitization:** Load raw data and neutralize whitespace/character artifacts.
2.  **Normalization:** Standardize categorical and temporal fields (`Class of Orbit`, `Date of Launch`) and preserve the stakeholder order in `Users`.
3.  **Physical Validation:** Enforce orbital mechanics constraints (e.g., Apogee vs. Perigee).
4.  **Mass Imputation:** Address missing values using the "ISS Exception" and grouped median fills.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import Markdown, display

### **Stage 1: Ingestion & String Sanitization**
**The Problem:** Raw human-maintained data often contains hidden whitespace and non-numeric characters (such as thousands-separator commas) in numeric fields, which cause errors during calculation.

**The Solution:**
* Use a **lambda-based stripping** operation to remove leading and trailing whitespace from all text-based columns.
* Remove comma separators from `Perigee`, `Apogee`, and `Launch Mass` to enable float conversion. This is applied in Stages 2 and 3.3, immediately before each numeric conversion.

In [2]:
ucs_sats_messy = pd.read_csv('../data/original/UCS-Satellite-Database 5-1-2023.csv')
text_cols = ucs_sats_messy.select_dtypes(['object']).columns
ucs_sats_messy[text_cols] = ucs_sats_messy[text_cols].apply(lambda x: x.str.strip())
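
The sanitization idea can be illustrated on a toy frame. This is a minimal sketch with invented values, not rows from the UCS file; the explicit `text_cols` list is an assumption used here in place of dtype-based selection.

```python
import pandas as pd

# Toy frame standing in for the raw export (values are illustrative only)
toy = pd.DataFrame({
    "Users": [" Commercial", "Government ", None],
    "Perigee (km)": ["35,786", " 550 ", "1,200"],
})

# Strip leading/trailing whitespace; missing values pass through as missing
text_cols = ["Users", "Perigee (km)"]
toy[text_cols] = toy[text_cols].apply(lambda col: col.str.strip())

# Remove thousands separators, then coerce to numeric (unparseable -> NaN)
toy["Perigee (km)"] = pd.to_numeric(
    toy["Perigee (km)"].str.replace(",", "", regex=False), errors="coerce"
)

print(toy)
```

Stripping before the numeric conversion keeps the two concerns separate: whitespace is a property of every text field, while comma removal only applies to the fields that should become numbers.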

### **Stage 1.1: Strategic Feature Selection**
**The Problem:** The raw UCS export contains various unpopulated placeholders (e.g., `Unnamed` columns) that carry no data. Sparse hardware fields such as `Dry Mass` and `Power` also lack sufficient data density for this analysis; they are archived and removed later, in Stage 3.

**The Solution:** Implement a targeted drop of auxiliary columns to optimize the dataframe's memory footprint and focus strictly on variables essential for orbital tracking and kinetic modeling.

In [3]:
# Drop the unpopulated placeholder columns: 'Unnamed: 28' and 'Unnamed: 37' through 'Unnamed: 67'
unnamed_placeholders = ['Unnamed: 28'] + [f'Unnamed: {i}' for i in range(37, 68)]
ucs_sats_messy.drop(columns=unnamed_placeholders, inplace=True)
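
As a possible alternative to listing placeholder columns by hand, they could be collected by prefix and dropped only when they are completely empty. The sketch below uses a toy frame; whether every `Unnamed` column in the real export is empty is an assumption that the emptiness check is meant to guard.

```python
import pandas as pd

# Toy frame with two empty placeholder columns (illustrative only)
toy = pd.DataFrame({
    "NORAD Number": [25544, 43013],
    "Unnamed: 28": [None, None],
    "Unnamed: 37": [None, None],
})

# Collect placeholder columns by prefix instead of enumerating them
placeholder_cols = [c for c in toy.columns if c.startswith("Unnamed:")]

# Drop only the placeholders that are entirely empty, so populated
# columns are never discarded silently
empty_placeholders = [c for c in placeholder_cols if toy[c].isna().all()]
toy = toy.drop(columns=empty_placeholders)

print(list(toy.columns))
```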

### **Stage 2: Enforcing Orbital Mechanics**
**The Problem:** Observational or data-entry errors can result in physically impossible trajectories, such as a satellite having an apogee lower than its perigee.

**The Solution:**
* Implement a logical filter to ensure **Apogee (km) >= Perigee (km)**.
* Strip comma separators, coerce unparseable values to `NaN`, and drop invalid entries to maintain a "Physics-Ready" dataset.

In [4]:
ucs_sats_messy['Perigee (km)'] = ucs_sats_messy['Perigee (km)'].astype(str).str.replace(',', '', regex=False)
ucs_sats_messy['Apogee (km)'] = ucs_sats_messy['Apogee (km)'].astype(str).str.replace(',', '', regex=False)

ucs_sats_messy['Perigee (km)'] = pd.to_numeric(ucs_sats_messy['Perigee (km)'], errors='coerce')
ucs_sats_messy['Apogee (km)'] = pd.to_numeric(ucs_sats_messy['Apogee (km)'], errors='coerce')

ucs_sats_messy.dropna(subset=['Perigee (km)', 'Apogee (km)'], inplace=True)

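# Keep only physically valid orbits; .copy() avoids SettingWithCopyWarning in later in-place edits
ucs_sats_messy = ucs_sats_messy[ucs_sats_messy['Apogee (km)'] >= ucs_sats_messy['Perigee (km)']].copy()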
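
To see how the filter behaves, here is a small sketch on invented rows. It applies the same steps (comma removal, coercion, apogee check) and also counts how many rows are lost, a step the cell above does not perform.

```python
import pandas as pd

# Invented rows: two valid orbits, one unparseable perigee, one inverted orbit
toy = pd.DataFrame({
    "Perigee (km)": ["35,779", "550", "n/a", "700"],
    "Apogee (km)": ["35,793", "560", "600", "650"],
})

for col in ["Perigee (km)", "Apogee (km)"]:
    cleaned = toy[col].astype(str).str.replace(",", "", regex=False)
    toy[col] = pd.to_numeric(cleaned, errors="coerce")  # "n/a" becomes NaN

before = len(toy)
toy = toy.dropna(subset=["Perigee (km)", "Apogee (km)"])

# An orbit's apogee (farthest point) can never be below its perigee (closest point)
valid = toy[toy["Apogee (km)"] >= toy["Perigee (km)"]].copy()

print(f"Dropped {before - len(valid)} of {before} rows")
```

Logging the number of dropped rows makes it easier to notice if a future database release introduces a formatting change that silently removes many satellites.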

### **Stage 2.1: Categorical Normalization (User Hierarchy)**
**The Logic:** In the UCS schema, the order of user types (e.g., *Government/Commercial* vs. *Commercial/Government*) indicates the primary stakeholder. 

**Standardization:**
* We preserve this specific order to maintain the distinction between primary, secondary, and tertiary users.
* This ensures that ownership-based risk assessments reflect the actual operational hierarchy of the satellite.
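
If the hierarchy needs to be queried directly, one option is to split the ordered string into positional columns. This is only a sketch: the column names `primary_user`, `secondary_user`, and `tertiary_user` are hypothetical and do not appear in the pipeline.

```python
import pandas as pd

# Example user strings in the order-sensitive "A/B/C" style described above
users = pd.Series([
    "Government/Commercial",
    "Commercial/Government",
    "Civil",
    "Military/Government/Commercial",
])

# Split on "/" and keep positional order; absent positions are left missing
roles = users.str.split("/", expand=True)
roles.columns = ["primary_user", "secondary_user", "tertiary_user"]  # hypothetical names

print(roles)
```

Because the split is positional, *Government/Commercial* and *Commercial/Government* remain distinct, which is the property the standardization rule is meant to protect.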

### **Stage 3: Metadata Decoupling (Source Preservation)**
**The Problem:** Carrying extensive "Comments" and "Source" columns creates "Wide Data" that is inefficient for large-scale physics modeling.

**The Solution:**
* **Source Archive:** Extract and save metadata into a secondary file (`ucs_dropped.csv`).
* **Relational Key:** Store `NORAD Number` in the archive as `norad_id`, the primary key that allows this context to be re-integrated later if needed.

In [5]:
# Stage 3: Consolidate all metadata and hardware specs into a single archive
# This ensures we don't lose the "Dry Mass" and "Power" data for future stories
sources = ucs_sats_messy[[
    'Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2', 
    'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Comments',
    'Detailed Purpose', ' Dry Mass (kg.) ', 'Power (watts)'
]].copy()

# Add primary keys for relational mapping
sources.insert(0, 'norad_id', ucs_sats_messy['NORAD Number'])

# Export the archive
sources = sources.sort_values(by='norad_id')
sources.to_csv('../data/clean/ucs_dropped.csv', index=False)
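
Re-integrating the archive later could look like the sketch below. The two small frames are invented stand-ins for the cleaned table and the archive; only the `norad_id` key comes from the pipeline.

```python
import pandas as pd

# Invented stand-ins for the cleaned table and the metadata archive
clean = pd.DataFrame({
    "norad_id": [25544, 43013],
    "launch_mass_kg": [450000.0, 2445.0],
})
archive = pd.DataFrame({
    "norad_id": [25544, 43013],
    "Comments": ["Crewed station", None],
})

# validate="one_to_one" raises pandas.errors.MergeError if norad_id
# is duplicated on either side, instead of silently multiplying rows
restored = clean.merge(archive, on="norad_id", how="left", validate="one_to_one")

print(restored)
```

A left join keeps every cleaned satellite even if its archive row is missing, and the `validate` check surfaces duplicate keys early.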

In [6]:
# Stage 3.1: Final purge of all archived and non-essential columns
# Reuse the archive's own column list (minus the key) so the two steps cannot drift apart
archived_columns = sources.columns.drop('norad_id').tolist()

ucs_sats_messy.drop(columns=archived_columns, inplace=True)

### **Verification: Intermediate Pipeline Audit**
**Objective:** Confirm the successful execution of both the initial feature selection and the final metadata purge. This audit ensures that the dataframe is "Lean" and that all critical physical fields are correctly typed before beginning the final mass imputation and feature renaming.

In [7]:
# --- Intermediate Validation Report ---

# 1. Define the full list of expected drops from Stages 1.1 and 3.1
unnamed_placeholders = [f'Unnamed: {i}' for i in range(37, 68)] + ['Unnamed: 28']
metadata_and_hardware = [
    'Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2', 
    'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Comments',
    'Detailed Purpose', ' Dry Mass (kg.) ', 'Power (watts)'
]

expected_dropped = unnamed_placeholders + metadata_and_hardware

# 2. Verify remaining columns
remaining_invalid = [col for col in expected_dropped if col in ucs_sats_messy.columns]

# 3. Physics check: both orbital altitude fields must be numeric
physics_ready = all(
    pd.api.types.is_numeric_dtype(ucs_sats_messy[col])
    for col in ['Perigee (km)', 'Apogee (km)']
)

print("--- Intermediate Validation Report ---")
print(f"Full Feature Purge:      {'✅ SUCCESS' if not remaining_invalid else f'❌ FAILED: {remaining_invalid}'}")
print(f"Physics Fields Numeric:  {'✅ SUCCESS' if physics_ready else '❌ FAILED'}")
print(f"Current Registry Count:  {len(ucs_sats_messy):,} satellites")

--- Intermediate Validation Report ---
Full Feature Purge:      ✅ SUCCESS
Physics Fields Numeric:  ✅ SUCCESS
Current Registry Count:  7,552 satellites


### **Stage 3.2: Categorical and Temporal Standardization**
**The Problem:** Mixed-case strings in orbital classifications and string-formatted dates prevent accurate grouping and time-series analysis.

**The Solution:**
* **Case Normalization:** Force `Class of Orbit` to uppercase to ensure "LEO" and "leo" are treated as a single category.
* **Temporal Conversion:** Parse `Date of Launch` into standard datetime objects to support historical trend modeling, dropping rows whose dates cannot be parsed.

In [8]:
ucs_sats_messy['Class of Orbit'] = ucs_sats_messy['Class of Orbit'].str.upper()
ucs_sats_messy['Date of Launch'] = pd.to_datetime(ucs_sats_messy['Date of Launch'], errors='coerce')
ucs_sats_messy = ucs_sats_messy.dropna(subset=['Date of Launch'])
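
The effect of the case and date normalization above can be checked on a toy frame. The sketch below uses invented values and additionally counts the unparseable dates before dropping them, which is an extra step not present in the cell above.

```python
import pandas as pd

# Invented rows with mixed-case orbit classes and one malformed date
toy = pd.DataFrame({
    "Class of Orbit": ["LEO", "leo", "LEo", "GEO"],
    "Date of Launch": ["2019-05-24", "1998-11-20", "not a date", "2015-03-02"],
})

# Collapse case variants into a single category
toy["Class of Orbit"] = toy["Class of Orbit"].str.upper()

# Parse dates; anything that cannot be parsed becomes NaT
toy["Date of Launch"] = pd.to_datetime(toy["Date of Launch"], errors="coerce")

# Count the failures before dropping them so the loss is visible
unparsed = int(toy["Date of Launch"].isna().sum())
toy = toy.dropna(subset=["Date of Launch"])

print(f"Unparseable launch dates dropped: {unparsed}")
```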

### **Stage 3.3: Numeric Sanitization (Launch Mass)**
**The Problem:** The `Launch Mass` field contains human-entered string artifacts (commas) that prevent direct conversion to numeric types.

**The Solution:**
* Implement string-replacement logic to neutralize delimiters.
* Coerce the field to a float type, preparing it for the **Stage 4 Imputation** strategy.

In [9]:
# Neutralize string delimiters and convert to numeric
ucs_sats_messy['Launch Mass (kg.)'] = ucs_sats_messy['Launch Mass (kg.)'].astype(str).str.replace(',', '', regex=False)
ucs_sats_messy['Launch Mass (kg.)'] = pd.to_numeric(ucs_sats_messy['Launch Mass (kg.)'], errors='coerce')

### **Stage 4: Addressing the Mass Transparency Gap (Active Fleet)**
**The Problem:** Even active satellite registries contain missing mass values. To calculate the total "Kinetic Fuel" in orbit, every object needs a plausible mass estimate.

**The Solution:**
* **Manual Outlier Enforcement:** Set the International Space Station (NORAD 25544) to **450,000 kg** before imputation, so that this massive outlier carries an explicit mass rather than a value inferred from its group median.
* **Grouped Median Imputation:** Fill remaining gaps using the median mass of satellites with the same **Class of Orbit** and **Purpose**.

In [10]:
ucs_sats_messy.loc[ucs_sats_messy['NORAD Number'] == 25544, 'Launch Mass (kg.)'] = 450000
medians = ucs_sats_messy.groupby(['Class of Orbit', 'Purpose'])['Launch Mass (kg.)'].transform('median')

ucs_sats_messy['Launch Mass (kg.)'] = ucs_sats_messy['Launch Mass (kg.)'].fillna(medians)
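
The grouped median fill can be traced on a small example. The sketch below uses invented masses; the `mass_imputed` flag and the overall-median fallback for groups with no observed mass are assumptions added for illustration, not part of the cell above.

```python
import numpy as np
import pandas as pd

# Invented masses; the MEO/Navigation group has no observed value at all
toy = pd.DataFrame({
    "Class of Orbit": ["LEO", "LEO", "LEO", "GEO", "GEO", "MEO"],
    "Purpose": ["Communications"] * 5 + ["Navigation"],
    "Launch Mass (kg.)": [260.0, np.nan, 300.0, 6000.0, np.nan, np.nan],
})

# Record which rows are estimates before overwriting anything (hypothetical flag)
toy["mass_imputed"] = toy["Launch Mass (kg.)"].isna()

# transform("median") returns one value per row, aligned with the original index
group_median = toy.groupby(["Class of Orbit", "Purpose"])["Launch Mass (kg.)"].transform("median")
toy["Launch Mass (kg.)"] = toy["Launch Mass (kg.)"].fillna(group_median)

# Fallback (assumption): a group with no observed mass gets the overall median
overall_median = toy["Launch Mass (kg.)"].median()
toy["Launch Mass (kg.)"] = toy["Launch Mass (kg.)"].fillna(overall_median)

print(toy)
```

Keeping a flag column makes it possible to separate measured from estimated masses in downstream totals.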

In [11]:
print(ucs_sats_messy['Launch Mass (kg.)'].isnull().value_counts())

Launch Mass (kg.)
False    7550
Name: count, dtype: int64


In [12]:
missing_count = ucs_sats_messy['Launch Mass (kg.)'].isnull().sum()
print(f"Remaining missing masses: {missing_count}")

Remaining missing masses: 0


### **Stage 5: Schema Alignment & Programmatic Efficiency**
**The Problem:** Raw UCS headers are non-standardized (containing spaces, periods, and parentheses) and do not align with the naming conventions established in the SATCAT pipeline. This inconsistency increases the risk of "silent errors" during multi-dataset joins and complicates statistical modeling.

**The Solution:** Implement a global **Renaming Schema** to transition the dataset into a strict **snake_case** format. This serves two strategic purposes:
1. **Programmatic Accessibility:** Enables dot-notation access in pandas and ensures compatibility with visualization libraries (Seaborn/Matplotlib).
2. **Relational Synchronization:** Aligns the primary key (`norad_id`) and secondary attributes with the naming conventions used in `satcat_cleanup.ipynb` for a seamless merge.

In [13]:
column_mapping = {
    'Name of Satellite, Alternate Names': 'satellite_name',
    'Current Official Name of Satellite': 'official_name',
    'Country/Org of UN Registry': 'un_registry',
    'Country of Operator/Owner': 'country_operator',
    'Operator/Owner': 'owner',
    'Users': 'users',
    'Purpose': 'purpose',
    'Class of Orbit': 'orbit_class',
    'Type of Orbit': 'orbit_type',
    'Longitude of GEO (degrees)': 'geo_longitude',
    'Perigee (km)': 'perigee_km',
    'Apogee (km)': 'apogee_km',
    'Eccentricity': 'eccentricity',
    'Inclination (degrees)': 'inclination_degrees',
    'Period (minutes)': 'period_minutes',
    'Launch Mass (kg.)': 'launch_mass_kg',
    'Date of Launch': 'launch_date',
    'Expected Lifetime (yrs.)': 'lifetime_years',
    'Contractor': 'contractor',
    'Country of Contractor': 'contractor_country',
    'Launch Site': 'launch_site',
    'Launch Vehicle': 'launch_vehicle',
    'COSPAR Number': 'cospar_id',
    'NORAD Number': 'norad_id'
}

ucs_sats_messy.rename(columns=column_mapping, inplace=True)

# Flag any header that still contains spaces, periods, or parentheses
messy_headers = [col for col in ucs_sats_messy.columns if any(ch in col for ch in ' .()')]

print("--- Schema Finalization Report ---")
print(f"Total Columns Standardized: {len(ucs_sats_messy.columns)}")
print(f"Messy Headers Remaining:    {'None (Full Clean)' if not messy_headers else messy_headers}")

--- Schema Finalization Report ---
Total Columns Standardized: 24
Messy Headers Remaining:    None (Full Clean)
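
A stricter header check could test each name against a snake_case pattern, which would also catch headers that were never mapped but contain no spaces or parentheses. This is a sketch on an empty toy frame; the regular expression is one reasonable definition of snake_case, not a rule taken from the SATCAT pipeline.

```python
import re
import pandas as pd

# One possible definition of snake_case: lowercase start, then lowercase/digits/underscores
SNAKE_CASE = re.compile(r"[a-z][a-z0-9_]*")

# Toy frame: two headers are deliberately left out of the mapping
toy = pd.DataFrame(columns=["NORAD Number", "Launch Mass (kg.)", "Operator/Owner", "Eccentricity"])
column_mapping = {
    "NORAD Number": "norad_id",
    "Launch Mass (kg.)": "launch_mass_kg",
}
toy = toy.rename(columns=column_mapping)

# fullmatch requires the entire header to fit the pattern
non_conforming = [c for c in toy.columns if not SNAKE_CASE.fullmatch(c)]

print(non_conforming)
```

Here `Operator/Owner` and `Eccentricity` are reported even though neither contains a space or parenthesis.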


In [14]:
output_path = '../data/clean/ucs_cleaned.csv'
ucs_sats_messy.to_csv(output_path, index=False)

### **Status: Pipeline Execution & Final Audit**
**Objective:** Confirm the normalization of the active population. 

Having addressed the "Mass Transparency Gap" for the active fleet through the **ISS Exception** and **Grouped Median Imputation**, this dataset is now the high-fidelity mass reference for our global collision models.

In [15]:
total_rows = len(ucs_sats_messy)
comm_count = ucs_sats_messy[ucs_sats_messy['users'].str.contains('Commercial', na=False)].shape[0]
usa_count = ucs_sats_messy[ucs_sats_messy['country_operator'] == 'USA'].shape[0]

report = f"""
### **UCS Pipeline Completion Report**
| Metric | Result | Context |
| :--- | :--- | :--- |
| **Total Active Registry** | {total_rows:,} | Final validated objects |
| **Commercial Dominance** | {comm_count/total_rows:.1%} | {comm_count:,} satellites |
| **US Operations** | {usa_count/total_rows:.1%} | {usa_count:,} satellites |
| **Temporal Range** | {ucs_sats_messy['launch_date'].min().year} - {ucs_sats_messy['launch_date'].max().year} | ~50 years of data |

💾 File Saved: {output_path}
"""
display(Markdown(report))


### **UCS Pipeline Completion Report**
| Metric | Result | Context |
| :--- | :--- | :--- |
| **Total Active Registry** | 7,550 | Final validated objects |
| **Commercial Dominance** | 82.9% | 6,260 satellites |
| **US Operations** | 68.4% | 5,163 satellites |
| **Temporal Range** | 1974 - 2023 | ~50 years of data |

💾 File Saved: ../data/clean/ucs_cleaned.csv
