# UCS Data Pipeline: Standardization & Normalization

**Dataset:** Union of Concerned Scientists (UCS) Satellite Database  
**Objective:** Prepare active satellite registry data for merger with SATCAT.

### **The Engineering Challenge**
The UCS database is human-maintained, leading to significant inconsistencies in categorical fields. To make this data machine-readable for our "Kessler Syndrome" analysis, we must implement a strict cleaning pipeline:
1.  **Ingestion:** Load raw CSV with proper encoding handling.
2.  **Normalization:** Standardize "Country of Operator" (e.g., "USA" vs "United States") and "Users" (e.g., "Com/Mil" vs "Commercial/Military").
3.  **Date Parsing:** Convert "Date of Launch" to datetime objects for time-series analysis.
4.  **Validation:** Ensure primary keys (COSPAR ID) are unique and valid.
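
The validation step in the list above can be sketched roughly as follows. This is a minimal illustration, not part of the pipeline below: the helper name `check_cospar_ids`, the sample IDs, and the assumed ID shape (four-digit launch year, three-digit launch number, one to three uppercase letters, as in `2019-089H`) are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed shape of a COSPAR ID: YYYY-NNN followed by 1-3 uppercase letters.
COSPAR_PATTERN = re.compile(r"^\d{4}-\d{3}[A-Z]{1,3}$")


def check_cospar_ids(ids):
    """Return (malformed, duplicated) lists for an iterable of COSPAR IDs."""
    cleaned = [str(i).strip() for i in ids]
    # IDs that do not match the assumed pattern.
    malformed = [i for i in cleaned if not COSPAR_PATTERN.match(i)]
    # IDs that occur more than once, so they cannot serve as a primary key.
    counts = Counter(cleaned)
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return malformed, duplicated


# Hypothetical sample: one repeated ID and one ID with a two-digit launch number.
sample_ids = ["2019-089H", "2023-001DC", "2017-036L", "2017-036L", "2016-25E"]
malformed, duplicated = check_cospar_ids(sample_ids)
print("Malformed:", malformed)
print("Duplicated:", duplicated)
```

On the real data the same helper could be called with the values of the `COSPAR Number` column.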

In [1]:
import pandas as pd

In [2]:
ucs_sats_messy = pd.read_csv('../data/original/UCS-Satellite-Database 5-1-2023.csv')

ucs_sats_messy.head(10)

Unnamed: 0,"Name of Satellite, Alternate Names",Current Official Name of Satellite,Country/Org of UN Registry,Country of Operator/Owner,Operator/Owner,Users,Purpose,Detailed Purpose,Class of Orbit,Type of Orbit,...,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67
0,1HOPSAT-TD (1st-generation High Optical Perfor...,1HOPSAT-TD,NR,USA,Hera Systems,Commercial,Earth Observation,Infrared Imaging,LEO,Non-Polar Inclined,...,,,,,,,,,,
1,AAC AIS-Sat1 (Kelpie 1),AAC AIS-Sat1Â (Kelpie 1),United Kingdom,United Kingdom,AAC Clyde Space,Commercial,Earth Observation,Automatic Identification System (AIS),LEO,Sun-Synchronous,...,,,,,,,,,,
2,Aalto-1,Aalto-1,Finland,Finland,Aalto University,Civil,Technology Development,,LEO,Sun-Synchronous,...,,,,,,,,,,
3,AAt-4,AAt-4,Denmark,Denmark,University of Aalborg,Civil,Earth Observation,Automatic Identification System (AIS),LEO,Sun-Synchronous,...,,,,,,,,,,
4,"ABS-2 (Koreasat-8, ST-3)",ABS-2,NR,Multinational,Asia Broadcast Satellite Ltd.,Commercial,Communications,,GEO,,...,,,,,,,,,,
5,ABS-2A,ABS-2A,NR,Multinational,Asia Broadcast Satellite Ltd.,Commercial,Communications,,GEO,,...,,,,,,,,,,
6,ABS-3A,ABS-3A,NR,Multinational,Asia Broadcast Satellite Ltd.,Commercial,Communications,,GEO,,...,,,,,,,,,,
7,"ABS-4 (ABS-2i, MBSat, Mobile Broadcasting Sate...",ABS-4,NR,Multinational,Asia Broadcast Satellite Ltd.,Commercial,Communications,,GEO,,...,,,,,,,,,,
8,"ABS-6 (ABS-1, LMI-1, Lockheed Martin-Intersput...",ABS-6,NR,Multinational,Asia Broadcast Satellite Ltd.,Commercial,Communications,,GEO,,...,,,,,,,,,,
9,Adelis-Sampson 1,Adelis-Sampson 1,NR,Israel,Asher Space Research Institute at Technion/Isr...,Government,Technology Development,,LEO,Sun-Synchronous,...,,,,,,,,,,


In [3]:
print(ucs_sats_messy.columns)

Index(['Name of Satellite, Alternate Names',
       'Current Official Name of Satellite', 'Country/Org of UN Registry',
       'Country of Operator/Owner', 'Operator/Owner', 'Users', 'Purpose',
       'Detailed Purpose', 'Class of Orbit', 'Type of Orbit',
       'Longitude of GEO (degrees)', 'Perigee (km)', 'Apogee (km)',
       'Eccentricity', 'Inclination (degrees)', 'Period (minutes)',
       'Launch Mass (kg.)', ' Dry Mass (kg.) ', 'Power (watts)',
       'Date of Launch', 'Expected Lifetime (yrs.)', 'Contractor',
       'Country of Contractor', 'Launch Site', 'Launch Vehicle',
       'COSPAR Number', 'NORAD Number', 'Comments', 'Unnamed: 28',
       'Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2',
       'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Unnamed: 37',
       'Unnamed: 38', 'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41',
       'Unnamed: 42', 'Unnamed: 43', 'Unnamed: 44', 'Unnamed: 45',
       'Unnamed: 46', 'Unnamed: 47', 'Unnamed: 48', 'Unnamed: 49',


In [4]:
ucs_sats_messy.shape

(7560, 68)

In [5]:
ucs_sats_messy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7560 entries, 0 to 7559
Data columns (total 68 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Name of Satellite, Alternate Names  7560 non-null   object 
 1   Current Official Name of Satellite  7560 non-null   object 
 2   Country/Org of UN Registry          7559 non-null   object 
 3   Country of Operator/Owner           7560 non-null   object 
 4   Operator/Owner                      7560 non-null   object 
 5   Users                               7560 non-null   object 
 6   Purpose                             7560 non-null   object 
 7   Detailed Purpose                    1254 non-null   object 
 8   Class of Orbit                      7560 non-null   object 
 9   Type of Orbit                       6909 non-null   object 
 10  Longitude of GEO (degrees)          7557 non-null   float64
 11  Perigee (km)                        7553 no

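Strip leading and trailing whitespace from all object-dtype (text) columns using a lambda. This is done near the top so that every later cleaning step sees consistent values; note the trailing spaces in entries such as 'Commercial ' in the output below.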

In [6]:
print(f"Users: \n{ucs_sats_messy['Users'].unique()}\n")

Users: 
['Commercial' 'Civil' 'Government' 'Military' 'Military/Commercial'
 'Government/Military' 'Military/Government' 'Government/Civil'
 'Military/Civil' 'Commercial/Civil' 'Civil/Commercial'
 'Government/Commercial' 'Commercial/Government'
 'Government/Commercial/Military' 'Civil/Government' 'Civil/Military'
 'Commercial ' 'Commercial/Military' 'Government ' 'Military ']



In [7]:
text_cols = ucs_sats_messy.select_dtypes(['object']).columns
ucs_sats_messy[text_cols] = ucs_sats_messy[text_cols].apply(lambda x: x.str.strip())

We need to drop columns that are of no use to us. Start with the unnamed columns, which are almost entirely empty. I assume these are placeholders for future data, but they are useless to us at this point. Dry mass and power are dropped in the same step.

In [8]:
ucs_sats_messy.drop(columns=['Unnamed: 28', 'Unnamed: 37',
       'Unnamed: 38', 'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41',
       'Unnamed: 42', 'Unnamed: 43', 'Unnamed: 44', 'Unnamed: 45',
       'Unnamed: 46', 'Unnamed: 47', 'Unnamed: 48', 'Unnamed: 49',
       'Unnamed: 50', 'Unnamed: 51', 'Unnamed: 52', 'Unnamed: 53',
       'Unnamed: 54', 'Unnamed: 55', 'Unnamed: 56', 'Unnamed: 57',
       'Unnamed: 58', 'Unnamed: 59', 'Unnamed: 60', 'Unnamed: 61',
       'Unnamed: 62', 'Unnamed: 63', 'Unnamed: 64', 'Unnamed: 65',
       'Unnamed: 66', 'Unnamed: 67',' Dry Mass (kg.) ', 'Power (watts)'], inplace=True)
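
As an alternative to typing out every placeholder column, the names could be selected by prefix. The sketch below is only an illustration: `placeholder_columns` is a hypothetical helper, and it assumes every placeholder column name starts with `Unnamed:`, which matches the column listing shown earlier.

```python
def placeholder_columns(columns, prefix="Unnamed:"):
    """Return the column names that start with the placeholder prefix."""
    return [c for c in columns if str(c).startswith(prefix)]


# Small stand-in for ucs_sats_messy.columns.
sample_columns = ["Comments", "Unnamed: 28", "Source", "Unnamed: 37"]
to_drop = placeholder_columns(sample_columns)
print(to_drop)

# Possible use on the real frame (not executed here):
# extra = [' Dry Mass (kg.) ', 'Power (watts)']
# ucs_sats_messy.drop(columns=placeholder_columns(ucs_sats_messy.columns) + extra, inplace=True)
```

The explicit list used in the cell above has the advantage of failing loudly if a future release of the database renames a column; the prefix approach is shorter but silently follows whatever columns exist.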

Clean up Perigee and Apogee: strip the ',' thousands separators and convert the object dtype to a numeric dtype (which ends up being float64). Drop rows with missing or invalid values. At least one row appears to have an invalid apogee of less than 100 km, which is not possible given that every satellite's perigee is greater than 150 km and a satellite's apogee can never be lower than its perigee.

In [9]:
ucs_sats_messy['Perigee (km)'] = ucs_sats_messy['Perigee (km)'].astype(str).str.replace(',', '', regex=False)
ucs_sats_messy['Apogee (km)'] = ucs_sats_messy['Apogee (km)'].astype(str).str.replace(',', '', regex=False)

ucs_sats_messy['Perigee (km)'] = pd.to_numeric(ucs_sats_messy['Perigee (km)'], errors='coerce')
ucs_sats_messy['Apogee (km)'] = pd.to_numeric(ucs_sats_messy['Apogee (km)'], errors='coerce')

ucs_sats_messy.dropna(subset=['Perigee (km)', 'Apogee (km)'], inplace=True)

ucs_sats_messy = ucs_sats_messy[ucs_sats_messy['Apogee (km)'] >= ucs_sats_messy['Perigee (km)']]

Government/Commercial may seem to be the same thing as Commercial/Government, but it is not: the order of the listing matters. Apparent duplicates in the original data that were caused by leading/trailing whitespace have already been cleaned up.

The order reads as Primary Users/Secondary Users/Tertiary Users.
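
Because the order carries meaning, one way to make it explicit is to split the field into separate primary, secondary, and tertiary values. This is a sketch only and is not applied in this notebook; `split_users` is a hypothetical helper, and it assumes `/` is the only separator and that no entry lists more than three user types.

```python
def split_users(users):
    """Split an ordered 'Users' value into (primary, secondary, tertiary)."""
    parts = [p.strip() for p in str(users).split("/") if p.strip()]
    parts = parts[:3]
    # Pad with None so every result has exactly three positions.
    parts = parts + [None] * (3 - len(parts))
    return tuple(parts)


print(split_users("Government/Commercial/Military"))
print(split_users("Commercial/Government"))
print(split_users("Commercial "))
```

With this split, 'Government/Commercial' and 'Commercial/Government' stay distinguishable by their primary user while still being easy to filter on any single user type.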

I would like to drop the source columns from the main CSV, but I want to keep a usable copy of this data in case I need it in the future. Output the source data to a new CSV, with the NORAD ID added as a primary key for later comparison and cross-referencing.

In [10]:
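# .copy() so that the insert() calls below modify an independent frame, not a view of ucs_sats_messy
sources = ucs_sats_messy[['Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2', 'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Comments']].copy()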

sources

Unnamed: 0,Source Used for Orbital Data,Source,Source.1,Source.2,Source.3,Source.4,Source.5,Source.6,Comments
0,JMSatcat/3_20,https://spaceflightnow.com/2019/12/11/indias-5...,https://www.herasys.com/,,,,,,Pathfinder for planned earth observation const...
1,JMSatcat/9_23,https://www.aac-clyde.space/articles/aac-clyde...,,,,,,,Provide AIS information to Orbcomm.
2,JMSatcat/10_17,https://directory.eoportal.org/web/eoportal/sa...,,http://www.planet4589.org/space/log/satcat.txt,,,,,Technology development and education.
3,Space50,http://spaceflightnow.com/2016/04/26/soyuz-bla...,,http://space50.org/objekt.php?mot=2016-025E&ja...,,,,,Carries AIS system.
4,ZARYA,http://www.absatellite.net/satellite-fleet/?sa...,,http://www.zarya.info/Diaries/Launches/Launche...,http://www.absatellite.net/2010/10/13/asia-bro...,http://www.spacenews.com/article/satellite-tel...,,,"32 C-band, 51 Ku-band, and 6 Ka-band transpond..."
...,...,...,...,...,...,...,...,...,...
7555,www.spacedebris.net 12/12,http://www.spaceflightnow.com/news/n1201/09lon...,,https://spacenews.com/china-launches-five-comm...,,,,,Can acquire high-resolution data through remot...
7556,Space50,http://spaceflightnow.com/2016/05/31/long-marc...,,https://www.planet4589.org/space/log/satcat.txt,http://space50.org/objekt.php?mot=2016-033A&ja...,,,,Hyperspectral imaging
7557,ZARYA,https://spaceflightnow.com/2020/07/25/china-la...,,http://www.lib.cas.cz/space.40/2011/079A.HTM,,,,,Land survey satellite.
7558,JMSatcat/1_22,https://spaceflightnow.com/2021/11/15/japanese...,,https://www.planet4589.org/space/log/satcat.txt,,,,,Thought to be for intelligence gathering.


In [11]:
norad_data = ucs_sats_messy['NORAD Number']
purpose_data = ucs_sats_messy['Detailed Purpose']

# Name the key column 'norad_id' here rather than 'NORAD Number' so it already matches
# the renamed column in the main frame (see the rename at the end of cleanup) and merges cleanly later.

sources.insert(0, 'norad_id', norad_data)
sources.insert(10, 'Detailed Purpose', purpose_data)

sources = sources.sort_values(by='norad_id')
sources.to_csv('./../data/clean/ucs_dropped.csv', index=False)

In [12]:
ucs_sats_messy.drop(columns=['Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2', 'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Comments', 'Detailed Purpose'], inplace=True)
sources.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7552 entries, 73 to 5449
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   norad_id                      7552 non-null   int64 
 1   Source Used for Orbital Data  6634 non-null   object
 2   Source                        3284 non-null   object
 3   Source.1                      725 non-null    object
 4   Source.2                      1832 non-null   object
 5   Source.3                      1126 non-null   object
 6   Source.4                      729 non-null    object
 7   Source.5                      553 non-null    object
 8   Source.6                      504 non-null    object
 9   Comments                      2084 non-null   object
 10  Detailed Purpose              1248 non-null   object
dtypes: int64(1), object(10)
memory usage: 708.0+ KB


In [13]:
ucs_sats_messy.head()

Unnamed: 0,"Name of Satellite, Alternate Names",Current Official Name of Satellite,Country/Org of UN Registry,Country of Operator/Owner,Operator/Owner,Users,Purpose,Class of Orbit,Type of Orbit,Longitude of GEO (degrees),...,Period (minutes),Launch Mass (kg.),Date of Launch,Expected Lifetime (yrs.),Contractor,Country of Contractor,Launch Site,Launch Vehicle,COSPAR Number,NORAD Number
0,1HOPSAT-TD (1st-generation High Optical Perfor...,1HOPSAT-TD,NR,USA,Hera Systems,Commercial,Earth Observation,LEO,Non-Polar Inclined,0.0,...,96.08,22,12/11/2019,0.5,Hera Systems,USA,Satish Dhawan Space Centre,PSLV,2019-089H,44859
1,AAC AIS-Sat1 (Kelpie 1),AAC AIS-Sat1Â (Kelpie 1),United Kingdom,United Kingdom,AAC Clyde Space,Commercial,Earth Observation,LEO,Sun-Synchronous,0.0,...,95.0,4,1/3/2023,,AAC Clyde Space,Sweden/UK/USA/Netherlands,Cape Canaveral,Falcon 9,2023-001DC,55107
2,Aalto-1,Aalto-1,Finland,Finland,Aalto University,Civil,Technology Development,LEO,Sun-Synchronous,0.0,...,94.7,5,6/23/2017,2.0,Aalto University,Finland,Satish Dhawan Space Centre,PSLV,2017-036L,42775
3,AAt-4,AAt-4,Denmark,Denmark,University of Aalborg,Civil,Earth Observation,LEO,Sun-Synchronous,0.0,...,95.9,1,4/25/2016,,University of Aalborg,Denmark,Guiana Space Center,Soyuz-2.1a,2016-025E,41460
4,"ABS-2 (Koreasat-8, ST-3)",ABS-2,NR,Multinational,Asia Broadcast Satellite Ltd.,Commercial,Communications,GEO,,75.0,...,1436.03,6330,2/6/2014,15.0,Space Systems/Loral,USA,Guiana Space Center,Ariane 5 ECA,2014-006A,39508


In [14]:
ucs_sats_messy['Class of Orbit'].unique()

array(['LEO', 'GEO', 'Elliptical', 'MEO', 'LEo'], dtype=object)

In [15]:
ucs_sats_messy['Class of Orbit'] = ucs_sats_messy['Class of Orbit'].str.upper()
ucs_sats_messy['Class of Orbit'].unique()

array(['LEO', 'GEO', 'ELLIPTICAL', 'MEO'], dtype=object)

In [16]:
ucs_sats_messy['Date of Launch'] = pd.to_datetime(ucs_sats_messy['Date of Launch'], errors='coerce')
ucs_sats_messy = ucs_sats_messy.dropna(subset=['Date of Launch'])
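
The cell above lets pandas infer the date format. Since the sample rows show month/day/year values such as 12/11/2019, an explicit format is one way to rule out a day-first misreading and to see how many rows would be lost. This is a sketch on made-up values; it assumes every valid date in the column follows `%m/%d/%Y`, which has not been verified against the full file.

```python
import pandas as pd

# Two dates in the style seen in the sample rows, plus one deliberately bad value.
raw_dates = pd.Series(["12/11/2019", "1/3/2023", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Counting NaT values before dropping shows how many rows dropna() would remove.
n_unparsed = int(parsed.isna().sum())
print(parsed)
print(f"Rows that would be dropped: {n_unparsed}")
```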

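Fix launch mass: strip the ',' thousands separators and convert to numeric. Values that cannot be parsed become NaN and are imputed further down.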

In [17]:
ucs_sats_messy['Launch Mass (kg.)'] = ucs_sats_messy['Launch Mass (kg.)'].astype(str).str.replace(',', '', regex=False)
ucs_sats_messy['Launch Mass (kg.)'] = pd.to_numeric(ucs_sats_messy['Launch Mass (kg.)'], errors='coerce')
ucs_sats_messy['Launch Mass (kg.)'].describe()

count     7307.000000
mean       626.845080
std       1386.869602
min          1.000000
25%        148.000000
50%        260.000000
75%        280.000000
max      22500.000000
Name: Launch Mass (kg.), dtype: float64

Manual Correction: The ISS is the largest object in orbit, but its mass often fluctuates in datasets or is missing. We manually enforce its known mass (~450,000 kg) to ensure accurate outlier analysis.

In [18]:
ucs_sats_messy.loc[ucs_sats_messy['NORAD Number'] == 25544, 'Launch Mass (kg.)'] = 450000

In [19]:
medians = ucs_sats_messy.groupby(['Class of Orbit', 'Purpose'])['Launch Mass (kg.)'].transform('median')

ucs_sats_messy['Launch Mass (kg.)'] = ucs_sats_messy['Launch Mass (kg.)'].fillna(medians)
ucs_sats_messy['Launch Mass (kg.)'].isnull().value_counts()

Launch Mass (kg.)
False    7550
Name: count, dtype: int64

In [20]:
missing_count = ucs_sats_messy['Launch Mass (kg.)'].isnull().sum()
print(f"Remaining missing masses: {missing_count}")

Remaining missing masses: 0
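
The group-median imputation above can be illustrated on a toy frame. The data and the shortened column names here are invented for the example. The sketch also adds a global-median fallback for groups where no mass is known at all; that fallback is an assumption of this example and was not needed in the run above, where zero values remained missing.

```python
import pandas as pd

# Toy data: the MEO/Nav group has no known mass at all.
df = pd.DataFrame({
    "orbit_class": ["LEO", "LEO", "LEO", "GEO", "GEO", "MEO"],
    "purpose": ["Comms", "Comms", "Comms", "Comms", "Comms", "Nav"],
    "mass": [100.0, 300.0, None, 5000.0, None, None],
})

# transform('median') returns one value per row, aligned with the original index,
# so it can be passed straight to fillna().
group_median = df.groupby(["orbit_class", "purpose"])["mass"].transform("median")
df["mass"] = df["mass"].fillna(group_median)

# Groups with no known mass still have NaN; fall back to the overall median.
df["mass"] = df["mass"].fillna(df["mass"].median())
print(df)
```

Grouping by orbit class and purpose keeps a missing GEO communications satellite from being filled with a CubeSat-sized value, which a single global median would do.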


The column names in this dataset are long and contain spaces, periods, and parentheses (e.g., Launch Mass (kg.)), which makes them tedious to type. Let's clean them up. I could have done this earlier, but honestly I didn't realize how simple it would be.

In [21]:
column_mapping = {
    'Name of Satellite, Alternate Names': 'satellite_name',
    'Current Official Name of Satellite': 'official_name',
    'Country/Org of UN Registry': 'un_registry',
    'Country of Operator/Owner': 'country_operator',
    'Operator/Owner': 'owner',
    'Users': 'users',
    'Purpose': 'purpose',
    'Class of Orbit': 'orbit_class',
    'Type of Orbit': 'orbit_type',
    'Longitude of GEO (degrees)': 'geo_longitude',
    'Perigee (km)': 'perigee_km',
    'Apogee (km)': 'apogee_km',
    'Eccentricity': 'eccentricity',
    'Inclination (degrees)': 'inclination_degrees',
    'Period (minutes)': 'period_minutes',
    'Launch Mass (kg.)': 'launch_mass_kg',
    'Date of Launch': 'launch_date',
    'Expected Lifetime (yrs.)': 'lifetime_years',
    'Contractor': 'contractor',
    'Country of Contractor': 'contractor_country',
    'Launch Site': 'launch_site',
    'Launch Vehicle': 'launch_vehicle',
    'COSPAR Number': 'cospar_id',
    'NORAD Number': 'norad_id',
    'Comments': 'comments',
}

ucs_sats_messy.rename(columns=column_mapping, inplace=True)

print(ucs_sats_messy.columns)

Index(['satellite_name', 'official_name', 'un_registry', 'country_operator',
       'owner', 'users', 'purpose', 'orbit_class', 'orbit_type',
       'geo_longitude', 'perigee_km', 'apogee_km', 'eccentricity',
       'inclination_degrees', 'period_minutes', 'launch_mass_kg',
       'launch_date', 'lifetime_years', 'contractor', 'contractor_country',
       'launch_site', 'launch_vehicle', 'cospar_id', 'norad_id'],
      dtype='object')


Print a short summary and save the cleaned data to a new CSV for use in later steps.

In [22]:
total_rows = len(ucs_sats_messy)
commercial_count = ucs_sats_messy[ucs_sats_messy['users'].str.contains('Commercial', na=False)].shape[0]
usa_count = ucs_sats_messy[ucs_sats_messy['country_operator'] == 'USA'].shape[0]

print(f"✅ UCS Pipeline Complete.")
print(f"   - Total Active Satellites: {total_rows:,}")
print(f"   - Commercial Sector: {commercial_count:,} ({commercial_count/total_rows:.1%})")
print(f"   - US Operated: {usa_count:,} ({usa_count/total_rows:.1%})")
print(f"   - Date Range: {ucs_sats_messy['launch_date'].min().date()} to {ucs_sats_messy['launch_date'].max().date()}")

output_path = '../data/clean/ucs_cleaned.csv'
ucs_sats_messy.to_csv(output_path, index=False)
print(f"\n💾 File Saved: {output_path}")

✅ UCS Pipeline Complete.
   - Total Active Satellites: 7,550
   - Commercial Sector: 6,260 (82.9%)
   - US Operated: 5,163 (68.4%)
   - Date Range: 1974-11-15 to 2023-04-27

💾 File Saved: ../data/clean/ucs_cleaned.csv
