# UCS Data Pipeline: Standardization & Normalization

**Dataset:** Union of Concerned Scientists (UCS) Satellite Database  
**Objective:** Prepare active satellite registry data for merger with SATCAT.

### **The Engineering Challenge**
The UCS database is human-maintained, leading to significant inconsistencies in categorical fields. To make this data machine-readable for our "Kessler Syndrome" analysis, we must implement a strict cleaning pipeline:
1.  **Ingestion & Sanitization:** Load raw data and neutralize whitespace/character artifacts.
2.  **Normalization:** Standardize "Country of Operator" and "Users" to ensure categorical consistency.
3.  **Physical Validation:** Enforce orbital mechanics constraints (e.g., Apogee vs. Perigee).
4.  **Mass Imputation:** Address missing values using the "ISS Exception" and grouped median fills.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import Markdown, display

### **Stage 1: Ingestion & String Sanitization**
**The Problem:** Raw human-maintained data often contains hidden whitespace and character artifacts (e.g., " USA " vs "USA"), which causes silent failures during categorical grouping.

**The Solution:** Use a **lambda-based stripping** operation to strictly trim whitespace from all text-based columns and headers, ensuring a clean baseline for the pipeline.
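
As a minimal sketch of the failure mode, the toy frame below (the `country` column and its values are made up for illustration, not taken from the UCS file) shows how untrimmed values split one category into several, and how `.str.strip()` collapses them again.

```python
import pandas as pd

# Toy data: the same country entered three ways (illustrative values only)
toy = pd.DataFrame({"country": [" USA ", "USA", "USA "]})

# Before stripping, pandas treats each spelling as its own category
groups_before = toy["country"].nunique()

# The same .str.strip() call the pipeline applies to every text column
toy["country"] = toy["country"].str.strip()
groups_after = toy["country"].nunique()

print(f"Distinct categories before: {groups_before}, after: {groups_after}")
```

No error is raised in the "before" state, which is why this kind of split tends to go unnoticed until group counts look wrong.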

In [2]:
ucs_sats_messy = pd.read_csv('../data/original/UCS-Satellite-Database 5-1-2023.csv')
text_cols = ucs_sats_messy.select_dtypes(['object']).columns

ucs_sats_messy[text_cols] = ucs_sats_messy[text_cols].apply(lambda x: x.str.strip())
ucs_sats_messy.columns = ucs_sats_messy.columns.str.strip()

### **Stage 1.1: Strategic Feature Selection**
**The Problem:** The raw UCS export contains numerous unpopulated placeholders (e.g., `Unnamed` columns) created by formatting artifacts in the original Excel file. These "Ghost Columns" inflate memory usage without adding information.

**The Solution:** Implement a **dynamic filter** to identify and drop all columns matching the `Unnamed` pattern, effectively sanitizing the dataframe structure.
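
A cautious variant, sketched here on a toy frame, is to confirm that an `Unnamed` column is actually empty before dropping it. The column names and the "stray note" value are assumptions for illustration; the pipeline cell itself drops every `Unnamed` match without this check.

```python
import numpy as np
import pandas as pd

# Toy frame: one real column, one empty artifact, one "Unnamed" column holding data
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Unnamed: 1": [np.nan, np.nan],
    "Unnamed: 2": [np.nan, "stray note"],
})

unnamed = [c for c in toy.columns if "Unnamed" in c]

# A candidate is safe to drop only if every value in it is missing
empty_unnamed = [c for c in unnamed if toy[c].isna().all()]
populated_unnamed = [c for c in unnamed if c not in empty_unnamed]

toy = toy.drop(columns=empty_unnamed)
print("Dropped:", empty_unnamed)
print("Needs review:", populated_unnamed)
```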

In [3]:
unnamed_columns_dropped = [col for col in ucs_sats_messy.columns if 'Unnamed' in col]

if unnamed_columns_dropped:
    ucs_sats_messy.drop(columns=unnamed_columns_dropped, inplace=True)
    print(f"Dropped {len(unnamed_columns_dropped)} artifact columns (e.g., {unnamed_columns_dropped[0]}).")
else:
    print("No artifact columns found.")

Dropped 32 artifact columns (e.g., Unnamed: 28).


### **Stage 2: Enforcing Orbital Mechanics**
**The Problem:** Observational errors can result in physically impossible trajectories (Apogee < Perigee), and raw data often contains formatting artifacts (commas) that prevent numeric analysis.

**The Solution:**
* **Sanitize Numerics:** Remove string delimiters (commas) from `Perigee` and `Apogee`.
* **Physical Validation:** Implement a logical filter to ensure **Apogee (km) >= Perigee (km)**.
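
The two steps can be sketched on a three-row toy frame (values are illustrative). One detail worth seeing in isolation: a comparison against `NaN` evaluates to `False`, so rows with missing altitudes are never flagged as violations and have to be counted separately.

```python
import pandas as pd

# Toy data: a typo row, a comma-formatted row, and an unparseable perigee
toy = pd.DataFrame({
    "Perigee (km)": ["493", "35,786", "n/a"],
    "Apogee (km)": ["49", "35,790", "600"],
})

for col in ["Perigee (km)", "Apogee (km)"]:
    cleaned = toy[col].astype(str).str.replace(",", "", regex=False)
    toy[col] = pd.to_numeric(cleaned, errors="coerce")  # "n/a" becomes NaN

# Rows that cannot be checked at all
missing = toy[["Perigee (km)", "Apogee (km)"]].isna().any(axis=1)

# Rows that are physically impossible (NaN comparisons yield False)
violations = toy["Apogee (km)"] < toy["Perigee (km)"]

print(f"Missing: {missing.sum()}, violations: {violations.sum()}")
```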

In [4]:
# Sanitize Numerics (Remove commas from strings)
ucs_sats_messy['Perigee (km)'] = ucs_sats_messy['Perigee (km)'].astype(str).str.replace(',', '', regex=False)
ucs_sats_messy['Apogee (km)'] = ucs_sats_messy['Apogee (km)'].astype(str).str.replace(',', '', regex=False)

# Convert to Float (Coerce errors to NaN)
ucs_sats_messy['Perigee (km)'] = pd.to_numeric(ucs_sats_messy['Perigee (km)'], errors='coerce')
ucs_sats_messy['Apogee (km)'] = pd.to_numeric(ucs_sats_messy['Apogee (km)'], errors='coerce')

# Drop missing values (We can't check physics if numbers are missing)
ucs_sats_messy.dropna(subset=['Perigee (km)', 'Apogee (km)'], inplace=True)

print("--- PRE-PATCH DIAGNOSTIC ---")
impossible_orbits = ucs_sats_messy[ucs_sats_messy['Apogee (km)'] < ucs_sats_messy['Perigee (km)']]
print(f"Satellites Violating Physics: {len(impossible_orbits)}")

if not impossible_orbits.empty:
    print("Violations Found:")
    print(impossible_orbits[['Name of Satellite, Alternate Names', 'Apogee (km)', 'Perigee (km)']].head(5))

# Fix known typo for Yaogan 35-5-1 (49.0 -> 499.0)
print("\n... Applying Manual Patch for Yaogan 35-5-1 ...\n")
typo_mask = ucs_sats_messy['Name of Satellite, Alternate Names'] == 'Yaogan 35-5-1'
ucs_sats_messy.loc[typo_mask, 'Apogee (km)'] = 499.0

print("--- POST-PATCH DIAGNOSTIC ---")
impossible_orbits_after = ucs_sats_messy[ucs_sats_messy['Apogee (km)'] < ucs_sats_messy['Perigee (km)']]
print(f"Satellites Violating Physics: {len(impossible_orbits_after)}")

if impossible_orbits_after.empty:
    print("✅ SUCCESS: All physics violations resolved.")

# Only keeps valid rows. Since we fixed the error, we lose 0 satellites here.
ucs_sats_messy = ucs_sats_messy[ucs_sats_messy['Apogee (km)'] >= ucs_sats_messy['Perigee (km)']]

print(f"\nTotal Satellites Retained: {len(ucs_sats_messy)}")

--- PRE-PATCH DIAGNOSTIC ---
Satellites Violating Physics: 1
Violations Found:
     Name of Satellite, Alternate Names  Apogee (km)  Perigee (km)
7473                      Yaogan 35-5-1         49.0         493.0

... Applying Manual Patch for Yaogan 35-5-1 ...

--- POST-PATCH DIAGNOSTIC ---
Satellites Violating Physics: 0
✅ SUCCESS: All physics violations resolved.

Total Satellites Retained: 7553


### **Stage 3: Metadata Decoupling (Source Preservation)**
**The Problem:** Carrying extensive "Comments" and "Source" columns creates "Wide Data" that is inefficient for large-scale physics modeling.

**The Solution:**
* **Source Archive:** Extract and save metadata into a secondary file (`ucs_dropped.csv`).
* **Relational Key:** Retain the `norad_id` as a primary key to allow for future re-integration of this context if needed.
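
Re-integration could look roughly like the sketch below, which uses two small stand-in frames instead of the real files. It assumes `norad_id` is unique in both tables; the second ID and the comment strings are made up for illustration.

```python
import pandas as pd

# Stand-ins for the cleaned dataset and the archived metadata file
clean = pd.DataFrame({
    "norad_id": [25544, 43013],
    "launch_mass_kg": [450000.0, 22.0],
})
archive = pd.DataFrame({
    "norad_id": [43013, 25544],
    "Comments": ["note b", "note a"],
})

# validate="one_to_one" raises a MergeError if norad_id repeats on either side
restored = clean.merge(archive, on="norad_id", how="left", validate="one_to_one")
print(restored)
```

The `validate` argument is a cheap guard: if the uniqueness assumption does not hold, the merge fails loudly instead of silently duplicating rows.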

In [5]:
# We define this list so we never accidentally drop something we didn't save
archived_columns = [
    'Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2', 
    'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Comments'
]

# We grab exactly the columns defined above
sources = ucs_sats_messy[archived_columns].copy()

# We insert the 'norad_id' now so it matches the future clean dataset
sources.insert(0, 'norad_id', ucs_sats_messy['NORAD Number'])

# Save metadata into a secondary file, ucs_dropped.csv
sources = sources.sort_values(by='norad_id')
sources.to_csv('../data/clean/ucs_dropped.csv', index=False)
print(f"Archived {sources.shape[1]} columns to 'ucs_dropped.csv'")

# Use the archived_columns variable defined earlier so we drop only what we've already exported.
ucs_sats_messy.drop(columns=archived_columns, inplace=True)

print(f"Dropped {len(archived_columns)} columns from active memory.")

Archived 10 columns to 'ucs_dropped.csv'
Dropped 9 columns from active memory.


### **Verification: Intermediate Pipeline Audit**
**Objective:** Confirm the successful execution of both the initial feature selection and the final metadata purge. This audit ensures that the dataframe is "Lean" and that all critical physical fields are correctly typed before beginning the final mass imputation and feature renaming.

In [6]:
# Re-list the archived metadata columns from the previous step so the check is explicit;
# for "Unnamed" columns, dynamically check whether any remain
remaining_unnamed = [col for col in ucs_sats_messy.columns if 'Unnamed' in col]

metadata_dropped_check = [
    'Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2', 
    'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Comments'
]

remaining_metadata = [col for col in metadata_dropped_check if col in ucs_sats_messy.columns]

# Physics Check
physics_ready = pd.api.types.is_numeric_dtype(ucs_sats_messy['Perigee (km)'])

print(f"--- Intermediate Validation Report ---")
print(f"Ghost Columns Remaining: {'None (Clean)' if not remaining_unnamed else f'❌ Found: {remaining_unnamed}'}")
print(f"Metadata Purge:          {'✅ SUCCESS' if not remaining_metadata else f'❌ FAILED: {remaining_metadata}'}")
print(f"Physics Fields Numeric:  {'✅ SUCCESS' if physics_ready else '❌ FAILED'}")
print(f"Current Registry Count:  {len(ucs_sats_messy):,} satellites")

--- Intermediate Validation Report ---
Ghost Columns Remaining: None (Clean)
Metadata Purge:          ✅ SUCCESS
Physics Fields Numeric:  ✅ SUCCESS
Current Registry Count:  7,553 satellites


### **Stage 3.1: Categorical and Temporal Standardization**
**The Problem:** Mixed-case strings in orbital classifications and string-formatted dates prevent accurate grouping and time-series analysis.

**The Solution:**
* **Case Normalization:** Force `Class of Orbit` to uppercase to ensure "LEO" and "leo" are treated as a single category.
* **Temporal Conversion:** Parse `Date of Launch` into standard datetime objects to support historical trend modeling; rows whose date cannot be parsed are dropped.
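
A small sketch of both conversions on toy data (the dates and labels are illustrative). It also counts how many rows would be lost to unparseable dates before dropping them, which is an extra diagnostic not present in the pipeline cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "Class of Orbit": ["LEO", "leo", "GEO"],
    "Date of Launch": ["2019-12-11", "not a date", "2021-05-15"],
})

# Case normalization: "leo" and "LEO" become one category
toy["Class of Orbit"] = toy["Class of Orbit"].str.upper()

# Temporal conversion: unparseable strings become NaT instead of raising
toy["Date of Launch"] = pd.to_datetime(toy["Date of Launch"], errors="coerce")

# Count the casualties before dropping them
n_unparsed = int(toy["Date of Launch"].isna().sum())
toy = toy.dropna(subset=["Date of Launch"])

print(f"Dropped {n_unparsed} row(s) with unparseable launch dates.")
```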

In [7]:
ucs_sats_messy['Class of Orbit'] = ucs_sats_messy['Class of Orbit'].str.upper()
ucs_sats_messy['Date of Launch'] = pd.to_datetime(ucs_sats_messy['Date of Launch'], errors='coerce')
ucs_sats_messy = ucs_sats_messy.dropna(subset=['Date of Launch'])

### **Stage 3.2: Numeric Sanitization (Launch Mass)**
**The Problem:** `Launch Mass` and the other numeric physics fields (mass, power, period, lifetime, and orbital elements) contain human-entered string artifacts (commas) that prevent direct conversion to numeric types.

**The Solution:**
* **Sanitize:** Implement string-replacement logic to neutralize delimiters.
* **Convert:** Coerce the field to a float type, preparing it for the **Stage 4 Imputation** strategy.
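
Because `errors='coerce'` turns anything unparseable into `NaN` without warning, it can help to list the values that were lost. The sketch below does this for a toy series (the values are made up); unlike the pipeline cell it skips `.astype(str)`, so genuinely missing entries stay distinguishable from coerced ones.

```python
import pandas as pd

# Toy column: two clean values, one malformed entry, one genuinely missing
raw = pd.Series(["1,250", "22", "5,000+", None], name="Launch Mass (kg.)")

cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Present in the raw column but NaN after coercion: information was lost here
lost = raw[raw.notna() & cleaned.isna()]

print(f"Values lost to coercion: {lost.tolist()}")
```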

In [8]:
# The list of columns that must be numeric
columns_to_sanitize = [
    'Launch Mass (kg.)', 
    'Dry Mass (kg.)', 
    'Power (watts)',
    'Period (minutes)',
    'Expected Lifetime (yrs.)',
    'Perigee (km)',
    'Apogee (km)',
    'Eccentricity',
    'Inclination (degrees)',
    'Longitude of GEO (degrees)'
]

print(f"Sanitizing {len(columns_to_sanitize)} physics columns...")

# THE CLEANING LOOP
for col in columns_to_sanitize:
    if col in ucs_sats_messy.columns:
        # Force to string, remove commas, and coerce to float
        ucs_sats_messy[col] = ucs_sats_messy[col].astype(str).str.replace(',', '', regex=False)
        ucs_sats_messy[col] = pd.to_numeric(ucs_sats_messy[col], errors='coerce')

print("Numeric Sanitization Complete.")

# IMMEDIATE VERIFICATION (Diagnostic)
print(f"\n{'Column (Original Name)':<30} | {'Current Type':<15} | {'Sample Value'}")
print("-" * 75)

for col in columns_to_sanitize:
    if col in ucs_sats_messy.columns:
        dtype = str(ucs_sats_messy[col].dtype)
        # Get a sample value (first non-null)
        sample = ucs_sats_messy[col].dropna().iloc[0] if not ucs_sats_messy[col].dropna().empty else "Empty"
        
        # VISUAL ALARM: If it's an object, mark it with '!!!'
        # We want to see 'float64' or 'int64'. 'object' is a failure.
        status_marker = "!!!" if 'object' in dtype else ""
        
        print(f"{col:<30} | {dtype:<15} | {sample} {status_marker}")
    else:
        print(f"{col:<30} | NOT FOUND")

Sanitizing 10 physics columns...
Numeric Sanitization Complete.

Column (Original Name)         | Current Type    | Sample Value
---------------------------------------------------------------------------
Launch Mass (kg.)              | float64         | 22.0 
Dry Mass (kg.)                 | float64         | 4.0 
Power (watts)                  | float64         | 4.5 
Period (minutes)               | float64         | 96.08 
Expected Lifetime (yrs.)       | float64         | 0.5 
Perigee (km)                   | float64         | 566.0 
Apogee (km)                    | float64         | 576.0 
Eccentricity                   | float64         | 0.00151 
Inclination (degrees)          | float64         | 36.9 
Longitude of GEO (degrees)     | float64         | 0.0 


### **Stage 4: Addressing the Physics Transparency Gap (Imputation)**
**The Problem:** Critical physical properties (`Launch Mass`, `Dry Mass`, `Power`) are missing for significant portions of the registry. Deleting these rows would hide risk; leaving them empty breaks kinetic modeling.

**The Solution:**
* **The "White Whale" Exception:** Manually set the **ISS** mass (450,000 kg) before imputing, so that this unique outlier is never filled with a group median.
* **Grouped Median Imputation:** Fill `Launch Mass` and `Power` using the median of satellites with similar **Orbit** and **Purpose**.
* **Physics-Informed Ratio:** Derive `Dry Mass` by calculating the typical *Dry-to-Wet Ratio* for each orbit class and applying it to the satellite's launch mass.
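
The arithmetic behind the last two bullets can be traced on a five-row toy frame. The short column names and all numbers below are illustrative assumptions, chosen so the medians are easy to check by hand.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "orbit":       ["LEO", "LEO", "LEO", "GEO", "GEO"],
    "purpose":     ["Comms", "Comms", "Comms", "Comms", "Comms"],
    "launch_mass": [100.0, 300.0, np.nan, 4000.0, np.nan],
    "dry_mass":    [80.0, np.nan, np.nan, 2000.0, np.nan],
})

# Step 1: fill launch mass with the median of the same (orbit, purpose) group
group_median = toy.groupby(["orbit", "purpose"])["launch_mass"].transform("median")
toy["launch_mass"] = toy["launch_mass"].fillna(group_median)

# Step 2: median dry/launch ratio per orbit, using rows where dry mass is known
ratio = toy["dry_mass"] / toy["launch_mass"]
orbit_ratio = ratio.groupby(toy["orbit"]).transform("median")

# Step 3: estimate missing dry mass as launch mass * orbit-level ratio
toy["dry_mass"] = toy["dry_mass"].fillna(toy["launch_mass"] * orbit_ratio)

print(toy)
```

Here the LEO ratio is 0.8 and the GEO ratio is 0.5, so the 300 kg LEO satellite receives an estimated dry mass of about 240 kg rather than a flat group median. That scaling with the satellite's own launch mass is the reason for using a ratio.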

In [9]:
# The ISS Exception (The "White Whale")
# We manually set the station mass first because it is a unique outlier.
ucs_sats_messy.loc[ucs_sats_messy['NORAD Number'] == 25544, 'Launch Mass (kg.)'] = 450000

# Impute LAUNCH MASS & POWER (Grouped Medians)
# Logic: Satellites with the same mission (Purpose) in the same region (Orbit) usually share chassis types.
print("Imputing Launch Mass & Power via Grouped Medians...")
fill_cols = ['Launch Mass (kg.)', 'Power (watts)']

for col in fill_cols:
    medians = ucs_sats_messy.groupby(['Class of Orbit', 'Purpose'])[col].transform('median')
    ucs_sats_messy[col] = ucs_sats_messy[col].fillna(medians)

# Broad Fallback (Orbit Only)
orbit_medians = ucs_sats_messy.groupby('Class of Orbit')['Power (watts)'].transform('median')
ucs_sats_messy['Power (watts)'] = ucs_sats_messy['Power (watts)'].fillna(orbit_medians)

# Global Fallback (The "Safety Net" for any remaining weird orbits)
global_median = ucs_sats_messy['Power (watts)'].median()
ucs_sats_messy['Power (watts)'] = ucs_sats_messy['Power (watts)'].fillna(global_median)

# Impute DRY MASS (Ratio-Derived)
# Logic: We can't use simple medians for Dry Mass because a small sat shouldn't get a large median.
# Instead, calculate the Dry/Wet ratio and apply it to the specific satellite's Launch Mass.
print("Imputing Dry Mass via Orbit-Specific Mass Ratios...")

# Calculate ratios for known data (Mass Ratio = Dry Mass / Launch Mass)
ucs_sats_messy['mass_ratio'] = ucs_sats_messy['Dry Mass (kg.)'] / ucs_sats_messy['Launch Mass (kg.)']

# Get median ratio per Orbit Class (e.g., GEO sats are usually ~60% dry mass)
ratio_medians = ucs_sats_messy.groupby('Class of Orbit')['mass_ratio'].transform('median')
ucs_sats_messy['mass_ratio'] = ucs_sats_messy['mass_ratio'].fillna(ratio_medians)

# Apply Ratio (Launch Mass * Ratio = Dry Mass)
estimated_dry_mass = ucs_sats_messy['Launch Mass (kg.)'] * ucs_sats_messy['mass_ratio']
ucs_sats_messy['Dry Mass (kg.)'] = ucs_sats_messy['Dry Mass (kg.)'].fillna(estimated_dry_mass)

# Cleanup
ucs_sats_messy.drop(columns=['mass_ratio'], inplace=True)

# Final Physics Audit
print("\n--- Physics Gap Audit (Remaining Missing Values) ---")
print(f"Launch Mass: {ucs_sats_messy['Launch Mass (kg.)'].isnull().sum()}")
print(f"Dry Mass:    {ucs_sats_messy['Dry Mass (kg.)'].isnull().sum()}")
print(f"Power:       {ucs_sats_messy['Power (watts)'].isnull().sum()}")

Imputing Launch Mass & Power via Grouped Medians...
Imputing Dry Mass via Orbit-Specific Mass Ratios...

--- Physics Gap Audit (Remaining Missing Values) ---
Launch Mass: 0
Dry Mass:    0
Power:       0


### **Stage 5: Schema Alignment & Programmatic Efficiency**
**The Problem:** Raw UCS headers are non-standardized (containing spaces, periods, and parentheses). This inconsistency increases the risk of "silent errors" during multi-dataset joins.

**The Solution:** Implement a global **Renaming Schema** to transition the dataset into a strict **snake_case** format. This aligns the primary key (`norad_id`) with the naming conventions used in the SATCAT pipeline for a seamless future merge.
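
For columns that are not covered by a hand-written mapping, a generic converter can suggest a name. The helper below, `to_snake_case`, is hypothetical and not part of the pipeline; it is a sketch of one common approach using only the standard library.

```python
import re

def to_snake_case(header: str) -> str:
    """Hypothetical helper: collapse non-alphanumeric runs to '_' and lowercase."""
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip())
    return slug.strip("_").lower()

examples = ["Launch Mass (kg.)", "Class of Orbit", "Country/Org of UN Registry"]
candidates = {header: to_snake_case(header) for header in examples}

for original, suggestion in candidates.items():
    print(f"{original!r} -> {suggestion!r}")
```

The suggestions do not always match a curated schema (`Class of Orbit` becomes `class_of_orbit`, not `orbit_class`), so this works best as a fallback or as a check for unmapped headers rather than a replacement for the explicit mapping.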

In [10]:
# Create the rename mapping for the column names (the schema)
column_mapping = {
    'Name of Satellite, Alternate Names': 'satellite_name',
    'Current Official Name of Satellite': 'official_name',
    'Country/Org of UN Registry': 'un_registry',
    'Country of Operator/Owner': 'country_operator',
    'Operator/Owner': 'owner',
    'Users': 'users',
    'Purpose': 'purpose',
    'Class of Orbit': 'orbit_class',
    'Type of Orbit': 'orbit_type',
    'Longitude of GEO (degrees)': 'geo_longitude',
    'Perigee (km)': 'perigee_km',
    'Apogee (km)': 'apogee_km',
    'Eccentricity': 'eccentricity',
    'Inclination (degrees)': 'inclination_degrees',
    'Period (minutes)': 'period_minutes',
    'Launch Mass (kg.)': 'launch_mass_kg',
    'Date of Launch': 'launch_date',
    'Expected Lifetime (yrs.)': 'lifetime_years',
    'Contractor': 'contractor',
    'Country of Contractor': 'contractor_country',
    'Launch Site': 'launch_site',
    'Launch Vehicle': 'launch_vehicle',
    'COSPAR Number': 'cospar_id',
    'NORAD Number': 'norad_id',
    'Detailed Purpose': 'detailed_purpose',
    'Dry Mass (kg.)': 'dry_mass_kg',
    'Power (watts)': 'power_watts'
}

# Apply the Rename
ucs_sats_messy.rename(columns=column_mapping, inplace=True)

# Verification: Check for any remaining messy headers
# We look for spaces or parentheses, which indicate a column we missed.
messy_headers = [col for col in ucs_sats_messy.columns if ' ' in col or '(' in col]

print(f"--- Schema Finalization Report ---")
print(f"Total Columns Standardized: {len(ucs_sats_messy.columns)}")
print(f"Messy Headers Remaining:    {'None (Full Clean)' if not messy_headers else messy_headers}")

--- Schema Finalization Report ---
Total Columns Standardized: 27
Messy Headers Remaining:    None (Full Clean)


### **Stage 6: Feature Engineering (Boolean Flags & Mission Standardization)**
**The Problem:** The `users` column contains complex multi-stakeholder strings (e.g., "Government/Commercial/Military"), and the `purpose` column contains inconsistent terminology (e.g., "Earth Science" vs. "Earth Observation").

**The Solution:**
* **Boolean Flags:** Decompose the `users` column into binary indicators (`is_commercial`, `is_government`, `is_military`, `is_civil`) to enable precise sector-based querying.
* **Mission Standardization:** Map diverse mission descriptions to a controlled vocabulary (e.g., mapping "Surveillance" and "Meteorological" to **"Earth Observation"**).
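
To illustrate what the flags enable, the sketch below builds them for three made-up rows and then runs a sector query ("civil, with no government involvement"). The satellite names and `users` strings are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "satellite_name": ["A", "B", "C"],
    "users": ["Civil", "Government/Civil", "Commercial/Military"],
})

# One 0/1 column per sector, based on a case-insensitive substring match
for sector in ["Commercial", "Government", "Military", "Civil"]:
    toy[f"is_{sector.lower()}"] = (
        toy["users"].str.contains(sector, case=False, na=False).astype(int)
    )

# Query: civil satellites with no government involvement
civil_only = toy[(toy["is_civil"] == 1) & (toy["is_government"] == 0)]
print(civil_only["satellite_name"].tolist())
```

Because the match is on substrings, a multi-stakeholder string sets several flags at once, so the per-sector counts can sum to more than the number of satellites.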

In [11]:
# Create User Boolean Flags (The "Democratization" Columns)
# These flags allow queries like: "Show me Civil satellites with NO Government involvement"
ucs_sats_messy['is_commercial'] = ucs_sats_messy['users'].str.contains('Commercial', case=False, na=False).astype(int)
ucs_sats_messy['is_government'] = ucs_sats_messy['users'].str.contains('Government', case=False, na=False).astype(int)
ucs_sats_messy['is_military'] = ucs_sats_messy['users'].str.contains('Military', case=False, na=False).astype(int)
ucs_sats_messy['is_civil'] = ucs_sats_messy['users'].str.contains('Civil', case=False, na=False).astype(int)

# Standardize Primary Purpose (The "Mission")
def standardize_purpose(text):
    if pd.isna(text) or text == 'Unknown':
        return 'Unknown'
    
    # Take the first primary term if there are multiple (e.g. "Comms/Nav")
    primary = text.split('/')[0].strip()
    
    mapping = {
        'Earth Science': 'Earth Observation',
        'Meteorological': 'Earth Observation',
        'Surveillance': 'Earth Observation',
        'Earth': 'Earth Observation',
        'Earth/Space Observation': 'Earth Observation',
        'Space Observation': 'Space Science',
        'Technology Demonstration': 'Technology Development',
        'Mission Extension Technology': 'Technology Development',
        'Platform': 'Technology Development',
        'Satellite Positioning': 'Navigation',
        'Navigation': 'Navigation',
        'Communications': 'Communications',
        'Space Science': 'Space Science',
        'Educational': 'Educational'
    }
    return mapping.get(primary, primary)

ucs_sats_messy['primary_purpose'] = ucs_sats_messy['purpose'].apply(standardize_purpose)

# Logical Reordering (Move primary_purpose next to purpose for easy checking)
cols = list(ucs_sats_messy.columns)
cols.remove('primary_purpose')
target_index = cols.index('purpose')
cols.insert(target_index + 1, 'primary_purpose')
ucs_sats_messy = ucs_sats_messy[cols]

print("\n--- Mission Standardization Audit (Diff View) ---")
# Show only rows where the purpose was actually changed/cleaned
columns_to_show = ['satellite_name', 'purpose', 'primary_purpose']
diff_view = ucs_sats_messy[ucs_sats_messy['purpose'] != ucs_sats_messy['primary_purpose']][columns_to_show]

if not diff_view.empty:
    print(f"Standardized {len(diff_view)} complex mission labels.")
    display(diff_view.head(5))
else:
    print("No complex labels found (Data was already clean).")


--- Mission Standardization Audit (Diff View) ---
Standardized 324 complex mission labels.


| | satellite_name | purpose | primary_purpose |
| :--- | :--- | :--- | :--- |
| 12 | ADLER-2 | Earth Science | Earth Observation |
| 51 | ALE-2 (Astro Live Experiences-2) | Earth Science | Earth Observation |
| 75 | ANDESITE Mule (Ad-Hoc Network Demonstration fo... | Space Science/Technology Demonstration | Space Science |
| 83 | AprizeSat 1 (LatinSat-C) | Communications/Technology Development | Communications |
| 84 | AprizeSat 10 (exactView-13) | Communications/Maritime Tracking | Communications |


### **Stage 7: Pipeline Serialization & Executive Summary**
**Objective:** Finalize the active population for export.

We have successfully addressed two critical data gaps:
1. **The "Mass Transparency Gap":** Addressed via the **ISS Exception**, **Grouped Median Imputation**, and **Physics-Informed Ratios**, giving collision models a complete mass field, with imputed estimates wherever the source was blank.
2. **The "Metadata Consistency Gap":** Addressed via **Boolean Sector Flags** (e.g., `is_military`) and **Mission Standardization**, transforming raw text into machine-readable categories.

**Outcome:** This dataset is now normalized, validated, and ready for export.
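
One caveat when the export is read back later: CSV does not store dtypes, so `launch_date` returns as plain text unless it is parsed again. The round trip below is a sketch using an in-memory buffer and a single illustrative row instead of the real file.

```python
import io
import pandas as pd

# Stand-in for the cleaned frame (one illustrative row)
clean = pd.DataFrame({
    "norad_id": [25544],
    "launch_date": pd.to_datetime(["1998-11-20"]),
    "is_civil": [1],
})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Read it back, asking pandas to re-parse the date column
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["launch_date"])

print(reloaded.dtypes)
```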

In [12]:
# Define Output Path
output_path = '../data/clean/ucs_cleaned.csv'

# Calculate Sector Metrics (The "User" Flags)
total_rows = len(ucs_sats_messy)
comm_count = ucs_sats_messy['is_commercial'].sum()
mil_count  = ucs_sats_messy['is_military'].sum()
gov_count  = ucs_sats_messy['is_government'].sum()
civ_count  = ucs_sats_messy['is_civil'].sum()

# Calculate Mission Metrics (The Top 3 Purposes)
top_missions = ucs_sats_messy['primary_purpose'].value_counts().head(3)
m1_n, m1_c = top_missions.index[0], top_missions.values[0]
m2_n, m2_c = top_missions.index[1], top_missions.values[1]
m3_n, m3_c = top_missions.index[2], top_missions.values[2]

# Calculate Data Quality (Physics Completeness)
# We expect nearly 100% because of our Imputation Engine
launch_mass_cov = ucs_sats_messy['launch_mass_kg'].notna().sum()
dry_mass_cov = ucs_sats_messy['dry_mass_kg'].notna().sum()
power_cov = ucs_sats_messy['power_watts'].notna().sum()

# Generate Comprehensive Report
report = f"""
### **UCS Pipeline Completion Report: Platinum Edition**
**Total Active Registry:** {total_rows:,} satellites

#### **1. Sector Composition**
| Sector | Count | Share |
| :--- | :--- | :--- |
| **Commercial** | {comm_count:,} | {comm_count/total_rows:.1%} |
| **Military** | {mil_count:,} | {mil_count/total_rows:.1%} |
| **Government** | {gov_count:,} | {gov_count/total_rows:.1%} |
| **Civil** | {civ_count:,} | {civ_count/total_rows:.1%} |

#### **2. Primary Mission Breakdown**
| Top Mission | Count | Share |
| :--- | :--- | :--- |
| **1. {m1_n}** | {m1_c:,} | {m1_c/total_rows:.1%} |
| **2. {m2_n}** | {m2_c:,} | {m2_c/total_rows:.1%} |
| **3. {m3_n}** | {m3_c:,} | {m3_c/total_rows:.1%} |

#### **3. Data Quality Audit**
| Feature | Available Data | Completeness | Status |
| :--- | :--- | :--- | :--- |
| **Launch Mass** | {launch_mass_cov:,} | **{launch_mass_cov/total_rows:.1%}** | ✅ **Imputed (Primary)** |
| **Dry Mass** | {dry_mass_cov:,} | **{dry_mass_cov/total_rows:.1%}** | ✅ **Imputed (Derived)** |
| **Power** | {power_cov:,} | **{power_cov/total_rows:.1%}** | ✅ **Imputed (Median)** |

💾 **File Saved:** `{output_path}`
"""

display(Markdown(report))
ucs_sats_messy.to_csv(output_path, index=False)


### **UCS Pipeline Completion Report: Platinum Edition**
**Total Active Registry:** 7,551 satellites

#### **1. Sector Composition**
| Sector | Count | Share |
| :--- | :--- | :--- |
| **Commercial** | 6,260 | 82.9% |
| **Military** | 613 | 8.1% |
| **Government** | 762 | 10.1% |
| **Civil** | 219 | 2.9% |

#### **2. Primary Mission Breakdown**
| Top Mission | Count | Share |
| :--- | :--- | :--- |
| **1. Communications** | 5,523 | 73.1% |
| **2. Earth Observation** | 1,310 | 17.3% |
| **3. Technology Development** | 441 | 5.8% |

#### **3. Data Quality Audit**
| Feature | Available Data | Completeness | Status |
| :--- | :--- | :--- | :--- |
| **Launch Mass** | 7,551 | **100.0%** | ✅ **Imputed (Primary)** |
| **Dry Mass** | 7,551 | **100.0%** | ✅ **Imputed (Derived)** |
| **Power** | 7,551 | **100.0%** | ✅ **Imputed (Median)** |

💾 **File Saved:** `../data/clean/ucs_cleaned.csv`
