# Milestone 2

In [2]:
import pandas as pd
import utils.preprocessing as pp

## Data Acquisition 

We are using historical [Two-Line Element](https://en.wikipedia.org/wiki/Two-line_element_set) sets and satellite catalog information from the [space-track.org](https://www.space-track.org/) API. space-track offers both a current state catalog and the ability to query historical state catalogs.

For our project, we go back two years and pull the state record on a cadence of 15 days. Due to orbital perturbations, tactical space operations, and the nature of the TLE, state encodings for even stable objects change over time. This makes multiple state representations of the same object useful, since snapshots taken 15 days apart are unlikely to be duplicates of one another. As a step zero, we pulled our snapshots from the space-track API and saved them as .zip files. We also pulled both the current and the decayed satellite catalogs and saved these as .zip files as well.
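
The snapshot cadence described above could be generated along these lines. This is only a sketch: `snapshot_dates` and the example end date are hypothetical, and the actual space-track requests live in the DataAcquisition notebook rather than here.

```python
from datetime import date, timedelta


def snapshot_dates(end: date, years: int = 2, cadence_days: int = 15) -> list[date]:
    """Return dates from roughly `years` before `end` up to `end`, every `cadence_days`."""
    start = end - timedelta(days=365 * years)  # assumption: a year is treated as 365 days
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(days=cadence_days)
    return dates


# Hypothetical end date, chosen only for illustration.
dates = snapshot_dates(date(2025, 3, 15))
print(len(dates), dates[0], dates[-1])
```

Each date would then be used as the epoch for one historical query.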


*To see the process described above, visit the [DataAcquisition.ipynb](https://github.com/tedinspace/space-object-identification/blob/main/src/DataAcquisition.ipynb) notebook.*


In [3]:
satCat_current = pd.DataFrame(pp.loadUnprocessedSatCat_current("../data/satcat.zip")).set_index("NORAD_CAT_ID")
satCat_decayed = pd.DataFrame(pp.loadUnprocessedSatCat_decayed("../data/satcat.zip")).set_index("NORAD_CAT_ID")
snapshot_states = pp.loadUnprocessedStateSnapshots("../data/snapshots.zip")

## Data Description

Now that we have the snapshots of our historical states saved locally, we can create a pandas dataframe that contains the following merged information:

1. state information - in TLE form

2. satellite catalog (satcat) information - descriptive information about a satellite

3. calculated orbital regime - the type of orbit the state describes (needs to be computed)


#### State Data: Two-Line Element Sets

[AEHF 4](https://www.n2yo.com/satellite/?s=43651)

<table>
<tbody>
    <tr>
        <td>Line 1</td>
        <td>1 43651U 18079A   25074.81403196 -.00000276  00000-0  00000-0 0  9990</td>
    </tr>
     <tr>
        <td>Line 2</td>
        <td>2 43651   1.5535 338.6932 0052106   7.7813  52.6795  1.00274975 23705</td>
    </tr>
</tbody>
</table>

<h4>Line 1</h4>

<table>
  <tbody>
     <tr> 
      <td>Relevant Info (<span style="color:green;"><strong>Y</strong></span>/<span style="color:yellow;"><em><strong>M</strong></em></span>/<span style="color:red;">N</span>)</td>
      <td>Value</td>
      <td>Name</td>
      <td>Description</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>1</td>
      <td>Line Number</td>
      <td>Identifies what line you are reading</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>43651</td>
      <td>Catalog Number</td>
      <td>A unique identifier for the satellite</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>U</td>
      <td>Classification</td>
      <td>Indicates the object is unclassified</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>18079A</td>
      <td>International Designator</td>
      <td>A unique identifier for the launch (year, launch number, piece of the launch)</td>
    </tr>
    <tr> 
      <td style="background-color:yellow;"><em><strong>M</strong></em></td>
      <td>25074.81403196</td>
      <td>Epoch Time</td>
      <td>The time when the TLE was valid (in YYDDD.DDDDDDDD format)</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>-0.00000276</td>
      <td>First Derivative of the Mean Motion</td>
      <td>Rate of change of the mean motion; indicates how quickly the orbit is changing</td>
    </tr>
    <tr> 
      <td style="background-color:yellow;"><em><strong>M</strong></em></td>
      <td>00000-0</td>
      <td>Second Derivative of Mean Motion (decimal point assumed)</td>
      <td>Shows the acceleration of the satellite's orbit, which is usually close to zero</td>
    </tr>
    <tr> 
      <td style="background-color:yellow;"><em><strong>M</strong></em></td>
      <td>00000-0</td>
      <td>B-STAR </td>
      <td>The drag term, or radiation pressure coefficient (decimal point assumed)</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>0</td>
      <td>Ephem Type</td>
      <td>Always zero</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>999</td>
      <td>ElSet Number</td>
      <td>Element set number. Incremented when a new TLE is generated for this object</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>0</td>
      <td>CheckSum</td>
      <td>Checksum (modulo 10)</td>
    </tr>
  </tbody>
</table>
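
Two of the Line 1 encodings above are easy to misread: the epoch (`YYDDD.DDDDDDDD`) and the "decimal point assumed" fields. The sketch below decodes both. The function names are hypothetical and this is not necessarily how `utils.preprocessing` does it; the two-digit year pivot (57 and above means 19xx) is the usual TLE convention.

```python
from datetime import datetime, timedelta, timezone


def decode_epoch(epoch_field: str) -> datetime:
    """Convert a TLE epoch string (YYDDD.DDDDDDDD) to a UTC datetime."""
    yy = int(epoch_field[:2])
    # Usual TLE convention: 57-99 -> 1957-1999, 00-56 -> 2000-2056.
    year = 1900 + yy if yy >= 57 else 2000 + yy
    day_of_year = float(epoch_field[2:])
    # Day 1.0 is 00:00 UTC on January 1, so subtract one day.
    return datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(days=day_of_year - 1)


def decode_assumed_decimal(field: str) -> float:
    """Decode fields such as '-28241-1' (-0.28241e-1) or '00000-0' (0.0)."""
    field = field.strip()
    sign = -1.0 if field.startswith("-") else 1.0
    field = field.lstrip("+-")
    mantissa, exponent = field[:-2], field[-2:]
    return sign * float("0." + mantissa) * 10.0 ** int(exponent)


epoch = decode_epoch("25074.81403196")      # AEHF 4 epoch from the table above
bstar = decode_assumed_decimal("-28241-1")  # B-STAR of WESTFORD NEEDLES
print(epoch.isoformat(), bstar)
```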


<h4>Line 2</h4>

<table>
  <tbody>
     <tr> 
      <td>Relevant Info (<span style="color:green;"><strong>Y</strong></span>/<span style="color:yellow;"><em><strong>M</strong></em></span>/<span style="color:red;">N</span>)</td>
      <td>Value</td>
      <td>Name</td>
      <td>Description</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>2</td>
      <td>Line Number</td>
      <td>Identifies the second line of the TLE</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>43651</td>
      <td>Catalog Number</td>
      <td>The same unique identifier as in Line 1</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>1.5535</td>
      <td>Inclination</td>
      <td>The angle between the satellite's orbit and the equator, in degrees</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>338.6932</td>
      <td>Right Ascension of Ascending Node (RAAN)</td>
      <td>The angle from the vernal equinox to the ascending node of the orbit, in degrees</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>0052106</td>
      <td>Eccentricity</td>
      <td>The shape of the orbit; how much the orbit deviates from a perfect circle (leading decimal point assumed, so 0052106 means 0.0052106)</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>7.7813</td>
      <td>Argument of Perigee</td>
      <td>The angle between the ascending node and the orbit's point of closest approach to Earth, in degrees</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>52.6795</td>
      <td>Mean Anomaly</td>
      <td>The fraction of an orbit's period that has elapsed since the satellite last passed perigee, expressed as an angle in degrees</td>
    </tr>
    <tr> 
      <td style="background-color:green;"><strong>Y</strong></td>
      <td>1.00274975</td>
      <td>Mean Motion</td>
      <td>The number of orbits the satellite completes per day</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>2370</td>
      <td>Revolution Number at Epoch</td>
      <td>The number of orbits completed at the time of the epoch</td>
    </tr>
    <tr> 
      <td style="background-color:red;">N</td>
      <td>5</td>
      <td>CheckSum</td>
      <td>Checksum (modulo 10)</td>
    </tr>
  </tbody>
</table>
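
The modulo-10 checksum in the last column of each line can be recomputed as a quick integrity check: every digit adds its value, every minus sign adds 1, and all other characters add 0. A minimal sketch (the helper name `tle_checksum` is hypothetical):

```python
def tle_checksum(line: str) -> int:
    """Recompute the TLE checksum from every character except the last one."""
    body = line.rstrip()[:-1]  # drop the stored checksum digit
    total = 0
    for ch in body:
        if ch.isdigit():
            total += int(ch)
        elif ch == "-":
            total += 1
    return total % 10


# AEHF 4 lines from the example above.
line1 = "1 43651U 18079A   25074.81403196 -.00000276  00000-0  00000-0 0  9990"
line2 = "2 43651   1.5535 338.6932 0052106   7.7813  52.6795  1.00274975 23705"

for line in (line1, line2):
    stored = int(line.rstrip()[-1])
    print(stored, tle_checksum(line))
```

A mismatch between the stored and recomputed digit would indicate a corrupted line.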



In [4]:
print(snapshot_states[0])
print(snapshot_states[1])
print(snapshot_states[2])

0 WESTFORD NEEDLES
1  2532U 63014AL  23106.78082352 -.00000062  00000-0 -28241-1 0  9991
2  2532  87.3257 248.7362 0379925 109.4647 254.7657  8.66892838300135
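
Note that in the second line printed above, the mean motion and the revolution number run together (`8.66892838300135`), so splitting on whitespace would not recover the fields. TLE fields are defined by fixed column positions instead. The sketch below slices Line 2 by column; `parse_tle_line2` and the `REV_AT_EPOCH` key are hypothetical names, and the real extraction in `utils.preprocessing` may differ.

```python
def parse_tle_line2(line: str) -> dict:
    """Slice TLE line 2 by its fixed column positions (0-based Python slices)."""
    return {
        "NUMBER": int(line[2:7]),            # catalog number, columns 3-7
        "INCL": float(line[8:16]),           # inclination [deg], columns 9-16
        "RAAN": float(line[17:25]),          # RAAN [deg], columns 18-25
        "ECC": float("0." + line[26:33]),    # eccentricity, decimal point assumed
        "ARG_PER": float(line[34:42]),       # argument of perigee [deg]
        "MEAN_ANOM": float(line[43:51]),     # mean anomaly [deg]
        "MEAN_MOTION": float(line[52:63]),   # mean motion [rev/day], columns 53-63
        "REV_AT_EPOCH": int(line[63:68]),    # revolution number, columns 64-68
    }


westford_line2 = "2  2532  87.3257 248.7362 0379925 109.4647 254.7657  8.66892838300135"
fields = parse_tle_line2(westford_line2)
print(fields)
```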


#### Satellite Catalog Information

<table>
  <tbody>
    <tr>
      <td>Relevant Info (<span style="color:green;"><strong>Y</strong></span>/<span style="color:yellow;"><em><strong>M</strong></em></span>/<span style="color:red;">N</span>)</td>
      <td>Name</td>
      <td>Description</td>
    </tr>
    <tr>
       <td style="background-color:green;"><strong>Y</strong></td>
      <td>NORAD_CAT_ID</td>
      <td>identifying number; referred to as an RSO (resident space object) number</td>
    </tr>
    <tr>
     <td style="background-color:red;">N</td>
      <td>INTLDES</td>
      <td>International Designator</td>
    </tr>
    <tr>
     <td style="background-color:green;"><strong>Y</strong></td>
      <td>OBJECT_TYPE</td>
      <td>Type of object (e.g., PAYLOAD, ROCKET BODY, DEBRIS)</td>
    </tr>
    <tr><td style="background-color:green;"><strong>Y</strong></td>
      <td>SATNAME</td>
      <td>Name of satellite</td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>COUNTRY</td>
      <td>Country of origin</td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>LAUNCH</td>
      <td>Launch date</td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>SITE</td>
      <td>Launch site</td>
    </tr>
    <tr><td style="background-color:yellow;"><em><strong>M</strong></em></td>
      <td>DECAY</td>
      <td>Date it decayed; null if still in orbit</td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>PERIOD</td>
      <td>Static state information to be disregarded </td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>INCLINATION</td>
      <td>Static state information to be disregarded </td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>APOGEE</td>
      <td>Static state information to be disregarded </td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>PERIGEE</td>
      <td>Static state information to be disregarded </td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>COMMENT</td>
      <td>usually empty</td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>COMMENTCODE</td>
      <td>usually empty</td>
    </tr>
     <tr><td style="background-color:red;">N</td>
      <td>RCSVALUE</td>
      <td>no longer filled out</td>
    </tr>
    <tr><td style="background-color:yellow;"><em><strong>M</strong></em></td>
      <td>RCS_SIZE</td>
      <td>vague descriptive size of object</td>
    </tr>
     <tr><td style="background-color:red;">N</td>
      <td>FILE</td>
      <td></td>
    </tr>
     <tr><td style="background-color:red;">N</td>
      <td>LAUNCH_YEAR</td>
      <td>year of launch</td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>LAUNCH_NUM</td>
      <td>launch number within the launch year</td>
    </tr>
    <tr><td style="background-color:red;">N</td>
      <td>LAUNCH_PIECE</td>
      <td>portion of launch</td>
    </tr>
     <tr><td style="background-color:red;">N</td>
      <td>CURRENT</td>
      <td>Doesn't encode useful information</td>
    </tr>
     <tr><td style="background-color:red;">N</td>
      <td>OBJECT_NAME</td>
      <td>redundant</td>
    </tr>
     <tr><td style="background-color:red;">N</td>
      <td>OBJECT_ID</td>
      <td>redundant</td>
    </tr>
  </tbody>
</table>


In [5]:
satCat_current.head(5)

NORAD_CAT_ID,INTLDES,OBJECT_TYPE,SATNAME,COUNTRY,LAUNCH,SITE,DECAY,PERIOD,INCLINATION,APOGEE,...,RCSVALUE,RCS_SIZE,FILE,LAUNCH_YEAR,LAUNCH_NUM,LAUNCH_PIECE,CURRENT,OBJECT_NAME,OBJECT_ID,OBJECT_NUMBER
5,1958-002B,PAYLOAD,VANGUARD 1,US,1958-03-17,AFETR,,132.61,34.26,3822,...,0,SMALL,9072,1958,2,B,Y,VANGUARD 1,1958-002B,5
11,1959-001A,PAYLOAD,VANGUARD 2,US,1959-02-17,AFETR,,121.08,32.88,2904,...,0,MEDIUM,9071,1959,1,A,Y,VANGUARD 2,1959-001A,11
12,1959-001B,ROCKET BODY,VANGUARD R/B,US,1959-02-17,AFETR,,125.48,32.9,3295,...,0,MEDIUM,9064,1959,1,B,Y,VANGUARD R/B,1959-001B,12
16,1958-002A,ROCKET BODY,VANGUARD R/B,US,1958-03-17,AFETR,,137.2,34.27,4216,...,0,MEDIUM,9067,1958,2,A,Y,VANGUARD R/B,1958-002A,16
20,1959-007A,PAYLOAD,VANGUARD 3,US,1959-09-18,AFETR,,124.08,33.34,3218,...,0,MEDIUM,9073,1959,7,A,Y,VANGUARD 3,1959-007A,20


## Pre-processing Data

#### Merging Datasets and Creating Regime Labels

The code below does the following:

1. Performs deduplication: no copies of the same state

2. Checks for catalog information; if the object doesn't have catalog information we can't tell what type of object it is

3. Extracts relevant information from TLE and merges it with catalog information 

4. Additional Computation

- semi-major axis: half the longest diameter of the orbital ellipse, i.e. the distance from its center to its farthest point

- apogee / perigee: the highest and lowest altitudes above Earth's surface reached during an orbit

- regime: while we might not use this in the model, it will be useful for examining data imbalances. Example: is most debris in a particular regime?

5. Puts all accumulated information in a single data frame
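
The additional computation in step 4 follows from Kepler's third law: with mean motion $n$ in rad/s, the semi-major axis is $a = (\mu / n^2)^{1/3}$, and the apogee and perigee altitudes are $a(1 \pm e) - R_\oplus$. The sketch below is an illustration rather than the code in `utils.preprocessing`; the function name and both constants are assumptions (standard textbook values).

```python
import math

MU_EARTH_KM3_S2 = 398600.4418  # assumed gravitational parameter of Earth
EARTH_RADIUS_KM = 6378.137     # assumed equatorial radius of Earth
SECONDS_PER_DAY = 86400.0


def orbit_size(mean_motion_rev_per_day: float, ecc: float) -> tuple[float, float, float]:
    """Return (semi-major axis, apogee altitude, perigee altitude) in km."""
    n_rad_s = mean_motion_rev_per_day * 2.0 * math.pi / SECONDS_PER_DAY
    sma = (MU_EARTH_KM3_S2 / n_rad_s**2) ** (1.0 / 3.0)
    apogee_alt = sma * (1.0 + ecc) - EARTH_RADIUS_KM
    perigee_alt = sma * (1.0 - ecc) - EARTH_RADIUS_KM
    return sma, apogee_alt, perigee_alt


# WESTFORD NEEDLES state: mean motion and eccentricity taken from its TLE.
sma, apogee, perigee = orbit_size(8.66892838, 0.0379925)
print(round(sma, 1), round(apogee, 1), round(perigee, 1))
```

For this state the result agrees with the `SMA_KM`, `APOGEE_KM`, and `PERIGEE_KM` values in the first row of the data frame to within a fraction of a kilometer; small differences are expected if the constants differ.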

In [15]:
aggregatedData, uncataloged_states, nDuplicatesFound = pp.deduplicateAndMergeDataSources(snapshot_states, satCat_current, satCat_decayed)

print(f"{len(uncataloged_states)} objects not in catalog (removing these states)")
print(f"{nDuplicatesFound} duplicate states found")

# Print any uncataloged states that are not just placeholders awaiting assignment
for rso in uncataloged_states:
    if "TBA - TO BE ASSIGNED" not in uncataloged_states[rso][0][0]:
        print(uncataloged_states[rso])

df = pd.DataFrame(aggregatedData)
df.head(5)

602 objects not in catalog (removing these states)
0 duplicate states found


Unnamed: 0,NUMBER,NAME,TYPE,RCS,IS_CURRENT,REGIME,EPOCH,INCL,RAAN,ECC,ARG_PER,MEAN_ANOM,MEAN_MOTION,SMA_KM,APOGEE_KM,PERIGEE_KM,MEAN_MOTION_1ST_DER,LINE1,LINE2
0,2532,WESTFORD NEEDLES,DEBRIS,MEDIUM,1,LEO,23106.780824,87.3257,248.7362,0.037992,109.4647,254.7657,8.668928,10009.788427,4011.985313,3251.39154,-6.2e-07,1 2532U 63014AL 23106.78082352 -.00000062 0...,2 2532 87.3257 248.7362 0379925 109.4647 254...
1,5595,COSMOS 252 DEB *,DEBRIS,SMALL,0,LEO,23114.02302,62.6827,77.9198,0.001771,92.997,267.3207,16.115735,6620.667329,254.290545,230.844114,0.05398805,1 05595U 68097CW 23114.02301989 .05398805 7...,2 05595 62.6827 77.9198 0017707 92.9970 267...
2,24138,PEGASUS DEB,DEBRIS,SMALL,0,LEO,23104.095834,81.4235,10.5713,0.001267,231.5153,128.5002,16.299754,6570.742818,200.965978,184.319658,0.1893418,1 24138U 94029GM 23104.09583383 .18934183 -9...,2 24138 81.4235 10.5713 0012667 231.5153 128...
3,26423,CZ-4 DEB,DEBRIS,SMALL,0,LEO,23106.622209,97.6514,158.7202,0.002042,279.5145,80.3845,16.382686,6548.549371,183.822163,157.076578,0.2758836,1 26423U 99057JU 23106.62220913 .27588362 2...,2 26423 97.6514 158.7202 0020421 279.5145 80...
4,26919,COSMOS 1217 DEB,DEBRIS,MEDIUM,1,HEO,23103.864714,72.6846,288.1195,0.52035,248.7902,49.4824,2.04878,26186.149125,33434.022297,6182.075954,3.97e-06,1 26919U 80085H 23103.86471375 .00000397 0...,2 26919 72.6846 288.1195 5203504 248.7902 49...


In [14]:
df['REGIME'].value_counts()

REGIME
LEO      24121
HEO       1951
GEO       1717
MEO        401
OTHER       83
Name: count, dtype: int64

### Missing Data: 

Missing data may arise due to a range of factors, such as human error (e.g., intentional non-response to survey questions), malfunctioning electrical sensors, or other causes. When data is missing, a significant amount of valuable information can be lost. Investigate the extent and pattern of missing data. Determine the nature of missingness (Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR); these are CS1090a concepts) and apply the most suitable technique to address it. Options include data deletion, mean/mode imputation, or more advanced methods like multiple imputation or k-NN imputation. Justify your choice based on the dataset's characteristics.

1. some states don't have catalog information (they haven't been assigned yet)

2. some objects don't have RCS / size description (we filled these in as NA)

3. there are some weird orbits that fall into the "OTHER" category; should we leave these alone or loosen our regime definitions so that more orbits are assigned a regime?
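
One way to probe the nature of the RCS missingness is to compare the missing fraction across object types: if it varies strongly with `TYPE`, the data is unlikely to be missing completely at random. The sketch below uses a small made-up table (the rows are illustrative, not project data) with the `TYPE` and `RCS` column names from the merged data frame.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "TYPE": ["DEBRIS", "DEBRIS", "DEBRIS", "PAYLOAD", "PAYLOAD", "ROCKET BODY"],
        "RCS": ["SMALL", None, None, "LARGE", "MEDIUM", "LARGE"],
    }
)

# Overall fraction of states without a size description.
missing_overall = toy["RCS"].isna().mean()

# Fraction missing within each object type.
missing_by_type = toy["RCS"].isna().groupby(toy["TYPE"]).mean()

print(missing_overall)
print(missing_by_type)
```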

### Data Imbalance:

We should check:

1. what portion of the data is debris, rocket bodies, or payloads

2. whether that ratio is consistent across regimes

3. whether RCS/size missingness is consistent across regimes
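
The first two checks could be sketched with `value_counts` and `pd.crosstab`. The rows below are made up for illustration; only the `TYPE` and `REGIME` column names come from the merged data frame.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "REGIME": ["LEO", "LEO", "LEO", "LEO", "GEO", "GEO"],
        "TYPE": ["DEBRIS", "DEBRIS", "DEBRIS", "PAYLOAD", "PAYLOAD", "ROCKET BODY"],
    }
)

# Check 1: overall share of each object type.
type_share = toy["TYPE"].value_counts(normalize=True)

# Check 2: share of each object type within each regime (rows sum to 1).
share_by_regime = pd.crosstab(toy["REGIME"], toy["TYPE"], normalize="index")

print(type_share)
print(share_by_regime)
```

If the rows of `share_by_regime` differ substantially, the class ratio is not consistent across regimes.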

Imbalanced data is a common issue in classification problems when one class has significantly fewer samples than the others. When dealing with imbalanced data, machine learning models may learn to favor the majority class and make predictions that prioritize accuracy for that class. This can result in unsatisfactory performance for the minority class and reduced overall model effectiveness.

Assess the class distribution in your dataset, especially for classification tasks. If a significant imbalance is present, consider resampling techniques (oversampling minorities or undersampling majorities) or applying synthetic data generation methods like SMOTE to achieve a balanced dataset (another CS1090a topic).
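
As a minimal illustration of the resampling idea, the sketch below performs plain random oversampling with pandas, which is a simpler stand-in for SMOTE rather than SMOTE itself. The rows are made up, and in practice this would be applied to the training split only.

```python
import pandas as pd

# Made-up, deliberately imbalanced rows for illustration only.
toy = pd.DataFrame(
    {
        "TYPE": ["DEBRIS"] * 4 + ["PAYLOAD", "ROCKET BODY"],
        "INCL": [87.3, 62.7, 81.4, 97.7, 1.6, 72.7],
    }
)

# Resample every class (with replacement) up to the size of the largest class.
target = toy["TYPE"].value_counts().max()
balanced = pd.concat(
    [group.sample(n=target, replace=True, random_state=0) for _, group in toy.groupby("TYPE")],
    ignore_index=True,
)

print(balanced["TYPE"].value_counts())
```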

### Feature Scaling:

Scaling the data is a crucial step in improving model performance and avoiding bias, as well as enhancing interpretability. When features are not appropriately scaled, those with larger scales can potentially dominate the analysis and result in biased conclusions. Standardize or normalize numerical features to ensure equal weighting in analytical models. Choose the most appropriate scaling method (e.g., Min-Max normalization, Z-score standardization) based on your data distribution and the models you plan to use.

1. we should scale numerical features

2. we should one-hot encode categorical variables
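
Both steps could look roughly like this with pandas alone. The rows are made up, the choice of Z-score standardization is an assumption rather than a decision the project has made, and in practice the mean and standard deviation would be computed on the training split only.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the merged data frame.
toy = pd.DataFrame(
    {
        "INCL": [87.3, 62.7, 1.6],
        "SMA_KM": [10009.8, 6620.7, 42164.0],
        "REGIME": ["LEO", "LEO", "GEO"],
    }
)

numeric_cols = ["INCL", "SMA_KM"]

# Z-score standardization: zero mean and unit (population) standard deviation.
scaled = toy.copy()
scaled[numeric_cols] = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std(ddof=0)

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(scaled, columns=["REGIME"], dtype=int)

print(encoded)
```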