# Milestone 2

In [None]:
# imports
from utils import spacetrack

## Data Acquisition 

We are using historical [Two-Line Element](https://en.wikipedia.org/wiki/Two-line_element_set) (TLE) sets and satellite catalog information from the [space-track.org](https://www.space-track.org/) API, which requires a registered account (email and password) to access. Every day, space-track.org updates its state catalog to reflect the latest positions of satellites and to add states for any new objects. Luckily, space-track.org keeps a historical record of its state catalogs, allowing us to look back in time at the state history of any satellite that was in orbit at the time of inquiry.

For our project, we want to go back one year and pull the state record on a cadence of 12 hours. Due to orbital perturbations, tactical space operations, and the nature of the TLE, state encodings for even stable objects will change over time. Successive snapshots of the same object therefore usually carry new information rather than redundant copies, which makes multiple state representations per object useful. That said, not every object is updated as often as every 12 hours, so we have included deduplication logic to prevent duplicate states for the same object.
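
As a rough illustration of the sampling schedule described above, the sketch below generates the snapshot timestamps for a one-year lookback at a 12-hour cadence. The function name `snapshot_times` and the choice of 365 days for "one year" are assumptions made for this example and are not taken from the acquisition notebook.

```python
from datetime import datetime, timedelta, timezone


def snapshot_times(end, days=365, step_hours=12):
    """Return evenly spaced timestamps from `end - days` up to `end`, inclusive."""
    step = timedelta(hours=step_hours)
    current = end - timedelta(days=days)
    times = []
    while current <= end:
        times.append(current)
        current += step
    return times


# Hypothetical end date, used only to make the example reproducible.
end = datetime(2025, 1, 1, tzinfo=timezone.utc)
times = snapshot_times(end)
print(len(times), times[0].isoformat(), times[-1].isoformat())
```

Each timestamp would then correspond to one catalog query, and therefore to one saved snapshot file.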

The collection process and initial processing of our data were quite intensive, so the code exists in another notebook. However, here is an outline of the initial data collection process:

1. Establish a connection to the space-track API.
2. Collect and save state catalogs to local disk, where each file represents one timestamped snapshot of the state catalog. (*Note: This gave us the ability to experiment with our final dataframe object without having to make redundant, time-expensive calls to the API.*) Once downloaded, we saved these files in .zip form so they could be uploaded to GitHub.
3. Run code that loads all snapshots into memory, performs state deduplication, and resaves the data locally into a single, usable file. The resulting object is what we are using for our Milestone 2 analysis below.
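
The deduplication step can be sketched with pandas as follows. This is a minimal illustration, not the code from the acquisition notebook: it assumes each snapshot is already a DataFrame and that a state is uniquely identified by an object identifier plus a TLE epoch. The column names `NORAD_CAT_ID` and `EPOCH` and the sample values are assumptions for the example.

```python
import pandas as pd


def deduplicate_states(snapshots, id_col="NORAD_CAT_ID", epoch_col="EPOCH"):
    """Stack snapshot DataFrames and keep one row per (object, epoch) pair."""
    combined = pd.concat(snapshots, ignore_index=True)
    # An object that was not updated between two snapshots appears with the
    # same epoch in both, so the repeated row is dropped here.
    deduped = combined.drop_duplicates(subset=[id_col, epoch_col], keep="first")
    return deduped.sort_values([id_col, epoch_col]).reset_index(drop=True)


# Two made-up snapshots: object 43013 was not updated between them.
snap_a = pd.DataFrame({
    "NORAD_CAT_ID": [25544, 43013],
    "EPOCH": ["2024-01-01T00:00:00", "2024-01-01T03:00:00"],
    "MEAN_MOTION": [15.50, 14.20],
})
snap_b = pd.DataFrame({
    "NORAD_CAT_ID": [25544, 43013],
    "EPOCH": ["2024-01-01T11:00:00", "2024-01-01T03:00:00"],
    "MEAN_MOTION": [15.51, 14.20],
})
states = deduplicate_states([snap_a, snap_b])
print(states)
```

Keying on the epoch rather than on the full row avoids treating small formatting differences in other fields as distinct states, but whether that is appropriate depends on how the real catalog encodes updates.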


*To see the process described above, visit the DataAcquisition.ipynb notebook.*

TODO: add link to NB

INSTRUCTIONS: https://edstem.org/us/courses/74185/lessons/130679/slides/736235  

## Data Description: Loading and Understanding

Load: Start a new Jupyter Notebook, import necessary Python libraries (e.g., pandas, numpy, sklearn), and load your dataset. 

Understand: Examine the dataset. Ensure you understand what different columns/rows represent or the image/text intricacies. 
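
A first pass at loading and inspecting a table might look like the sketch below. The data is a small made-up CSV held in memory so that the example runs on its own. The column names are assumptions, and the real dataset would be read from the file produced by the acquisition notebook instead.

```python
import io

import pandas as pd

# Made-up rows standing in for the real state file.
csv_text = """NORAD_CAT_ID,EPOCH,MEAN_MOTION,ECCENTRICITY
25544,2024-01-01 00:00:00,15.50,0.0005
25544,2024-01-01 12:00:00,15.50,0.0006
43013,2024-01-01 00:00:00,14.20,0.0012
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["EPOCH"])

# Collect a few facts that help interpret rows and columns.
overview = {
    "n_rows": len(df),
    "n_columns": df.shape[1],
    "dtypes": df.dtypes.astype(str).to_dict(),
    "n_unique_objects": df["NORAD_CAT_ID"].nunique(),
}
print(overview)
print(df.describe())
```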

## Pre-Process
Preprocess: Propose or perform basic dataset cleaning to make it suitable for analysis, visualization, and modeling which you will pursue in later milestones. Document each step in your Jupyter Notebook to justify the preprocessing decisions made. Reference the next section for details on what comprehensive data cleaning and preprocessing should include. 



Missing Data: 

Missing data may arise due to a range of factors, such as human error (e.g., intentional non-response to survey questions), malfunctioning electrical sensors, or other causes. When data is missing, a significant amount of valuable information can be lost. Investigate the extent and pattern of missing data. Determine the nature of the missingness: Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR), all of which are CS1090a concepts. Then apply the most suitable technique to address it. Options include data deletion, mean/mode imputation, or more advanced methods like multiple imputation or k-NN imputation. Justify your choice based on the dataset's characteristics.
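
One way to quantify the extent of missingness before choosing a strategy is sketched below, followed by a simple median imputation as a baseline. The helper name `missing_summary`, the column names, and the values are made up for illustration. Median imputation is only one of the options listed above and is not necessarily the right choice for this dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and fraction of missing values per column, worst columns first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return summary.sort_values("frac_missing", ascending=False)


# Made-up example with gaps in two numeric columns.
example = pd.DataFrame({
    "mean_motion": [15.5, np.nan, 14.2, 1.0],
    "eccentricity": [0.001, 0.002, np.nan, np.nan],
    "object_type": ["PAYLOAD", "DEBRIS", "DEBRIS", "PAYLOAD"],
})
summary = missing_summary(example)
print(summary)

# Baseline: fill each numeric column with its own median.
imputed = example.fillna(example.median(numeric_only=True))
print(imputed)
```

Comparing the missing fraction across groups (for example, by object type) is a quick check on whether the data looks closer to MCAR or depends on an observed variable.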

Data Imbalance:

Imbalanced data is a common issue in classification problems when one class has significantly fewer samples than the other. When dealing with imbalanced data, machine learning models may learn to favor the majority class and make predictions that prioritize accuracy for that class. This can result in unsatisfactory performance for the minority class and reduced overall model effectiveness.

Assess the class distribution in your dataset, especially for classification tasks. If a significant imbalance is present, consider resampling techniques (oversampling minority classes or undersampling majority classes) or synthetic data generation methods like SMOTE to achieve a balanced dataset. These techniques are also covered in CS1090a.
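
The sketch below checks the class distribution and then applies plain random oversampling with pandas. It is a simple stand-in for the resampling options mentioned above, not SMOTE. The label column `OBJECT_TYPE`, the helper name `oversample_minorities`, and the values are assumptions for the example.

```python
import pandas as pd


def oversample_minorities(df, label_col, random_state=0):
    """Resample smaller classes with replacement until all match the largest."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # keep every original row
        extra = target - len(group)
        if extra > 0:
            parts.append(group.sample(n=extra, replace=True, random_state=random_state))
    return pd.concat(parts, ignore_index=True)


# Made-up catalog with a 6 / 3 / 1 class split.
catalog = pd.DataFrame({
    "OBJECT_TYPE": ["DEBRIS"] * 6 + ["PAYLOAD"] * 3 + ["ROCKET BODY"],
    "MEAN_MOTION": [14.1, 14.3, 13.9, 15.0, 14.8, 14.2, 15.5, 1.0, 2.0, 13.5],
})
class_share = catalog["OBJECT_TYPE"].value_counts(normalize=True)
print(class_share)

balanced = oversample_minorities(catalog, "OBJECT_TYPE")
print(balanced["OBJECT_TYPE"].value_counts())
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into validation or test data.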

Feature Scaling:

Scaling the data is a crucial step in improving model performance and avoiding bias, as well as enhancing interpretability. When features are not appropriately scaled, those with larger scales can potentially dominate the analysis and result in biased conclusions. Standardize or normalize numerical features to ensure equal weighting in analytical models. Choose the most appropriate scaling method (e.g., Min-Max normalization, Z-score standardization) based on your data distribution and the models you plan to use.
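
The two scaling methods named above can be written out directly with NumPy, as in the sketch below. The function names and sample values are illustrative. In practice a library scaler fitted on the training split would likely be used instead, so that test data does not influence the scaling parameters.

```python
import numpy as np


def min_max_scale(x):
    """Rescale each column to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    low = x.min(axis=0)
    span = x.max(axis=0) - low
    span = np.where(span == 0, 1.0, span)  # avoid dividing by zero for constant columns
    return (x - low) / span


def z_score(x):
    """Center each column at 0 and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns map to all zeros
    return (x - x.mean(axis=0)) / std


# Made-up features on very different scales (e.g. revolutions per day vs. eccentricity).
features = np.array([
    [15.5, 0.001],
    [14.2, 0.002],
    [1.0, 0.700],
])
scaled = min_max_scale(features)
standardized = z_score(features)
print(scaled)
print(standardized)
```

Min-max scaling preserves the shape of each distribution but is sensitive to outliers, while z-score standardization is the more common default for models that assume roughly centered inputs.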