## Predicting Satellite Congestion Risk: Neural Network Approach

As in the previous part, this notebook explores how to predict satellite congestion risk with machine learning, using a dataset of satellite parameters to build a model that classifies satellites into congestion risk categories (Low, Medium, High). This time, however, we take a different approach: instead of a traditional statistical machine learning method, we use a neural network to make the predictions.

In [None]:
# https://www.kaggle.com/datasets/karnikakapoor/satellite-orbital-catalog

### 1. Setting Up Our Environment and Loading Data

Before we start, we need to import the necessary libraries. These are like tools in our data science toolkit. We'll be using `pandas` for data manipulation, `sklearn` (Scikit-learn) for machine learning tasks, and `tensorflow` for neural network models.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OrdinalEncoder
from sklearn.utils import class_weight
import tensorflow as tf

Next, we'll load our dataset. The data comes from a Kaggle dataset containing an orbital catalog of satellites. We'll load it into a `pandas DataFrame`, which is a table-like data structure that's very common and useful for data analysis in Python.

After loading, we'll display the first few rows (`df.head()`) to get a quick peek at the data and use `df.info()` to see a summary of the columns, their data types, and whether they contain any missing values. This helps us understand what we're working with.


In [None]:
# download kaggle dataset from google drive, and import as a pandas dataframe
df = pd.read_csv('https://drive.google.com/uc?export=download&id=1i4FdBT71ale29-1ido9Q0HNeNzOZ6lFN')

# display some info about the dataframe
display(df.head())
df.info()

Unnamed: 0,norad_id,name,object_type,satellite_constellation,altitude_km,altitude_category,orbital_band,congestion_risk,inclination,eccentricity,launch_year_estimate,days_in_orbit_estimate,orbit_lifetime_category,mean_motion,epoch,data_source,snapshot_date,country,last_seen
0,900,CALSPHERE 1,PAYLOAD,Other,976.868247,Low LEO,LEO-Polar,LOW,90.2215,0.00271,2023,0,<1yr,13.763481,2025-12-03 11:44:40.165728,celestrak,2025-12-03,US,2025-12-03
1,902,CALSPHERE 2,PAYLOAD,Other,1061.675587,Mid LEO,LEO-Polar,LOW,90.2363,0.002044,2023,0,<1yr,13.528815,2025-12-03 06:12:53.330976,celestrak,2025-12-03,US,2025-12-03
2,1361,LCS 1,PAYLOAD,Other,2787.874819,High LEO,MEO,LOW,32.1427,0.001343,2023,0,<1yr,9.893094,2025-12-03 11:26:30.164064,celestrak,2025-12-03,US,2025-12-03
3,1512,TEMPSAT 1,PAYLOAD,Other,1133.286101,Mid LEO,LEO-Polar,HIGH,89.9888,0.007142,2023,0,<1yr,13.335811,2025-12-03 09:48:38.369088,celestrak,2025-12-03,US,2025-12-03
4,1520,CALSPHERE 4A,PAYLOAD,Other,1123.330697,Mid LEO,LEO-Polar,HIGH,89.9092,0.006823,2023,0,<1yr,13.362367,2025-12-03 09:46:39.199296,celestrak,2025-12-03,US,2025-12-03


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13610 entries, 0 to 13609
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   norad_id                 13610 non-null  int64  
 1   name                     13610 non-null  object 
 2   object_type              13610 non-null  object 
 3   satellite_constellation  13610 non-null  object 
 4   altitude_km              13610 non-null  float64
 5   altitude_category        13610 non-null  object 
 6   orbital_band             13610 non-null  object 
 7   congestion_risk          13610 non-null  object 
 8   inclination              13610 non-null  float64
 9   eccentricity             13610 non-null  float64
 10  launch_year_estimate     13610 non-null  int64  
 11  days_in_orbit_estimate   13610 non-null  int64  
 12  orbit_lifetime_category  13610 non-null  object 
 13  mean_motion              13610 non-null  float64
 14  epoch                 

### 2. Preparing the Data for Machine Learning: The Preprocessing Stage

Raw data, as interesting as it may be, is rarely ready for a machine learning model right out of the box. This is where 'data preprocessing' comes in – cleaning, transforming, and structuring our data so that our model can learn from it effectively.

Here's what we're doing in this section:

1.  **Defining Features (X) and Target (y)**:
    *   In supervised machine learning, we're trying to predict a specific outcome. This outcome is called our **target variable** (often denoted as `y`). In our case, `y` is `'congestion_risk'`, which tells us how risky a satellite's orbit is.
    *   All the other pieces of information that we use to predict the target are called **features** (often denoted as `X`; the code below uses lowercase `x`). We take all columns *except* `'congestion_risk'` and assign them to `x`.

2.  **Excluding Irrelevant Columns**:
    *   Not all columns in our dataset are useful for prediction. Some are just identifiers or descriptive text that a machine learning model won't understand or benefit from (e.g., `norad_id`, `name`). We create a list `exclude_columns` to remove these from our features (`x`).

3.  **Identifying Categorical Data**:
    *   Many machine learning models work best with numbers. However, our dataset contains 'categorical' data – columns with distinct categories rather than continuous numerical values (e.g., `'object_type'`, `'altitude_category'`). We identify these in `categorical_cols`.

4.  **Ordinal Encoding: Turning Categories into Numbers**:
    *   Since our model needs numerical input, we need to convert these categorical columns. We use `OrdinalEncoder` from `sklearn.preprocessing`. This encoder assigns an integer to each unique category in a column. By default the categories are sorted, so string labels are numbered alphabetically: for example, 'HIGH' becomes 0, 'LOW' becomes 1, and 'MEDIUM' becomes 2. The integers are identifiers and do not by themselves express a risk ordering.
    *   We `fit_transform` this encoder on our categorical columns in `x_processed`. This means the encoder first 'learns' all the unique categories and then 'transforms' them into their corresponding numerical representations.
    *   We also `fit_transform` an `OrdinalEncoder` on our target variable `y` (`congestion_risk`). This gives the integer class labels that the model's predictions are compared against during training and evaluation.

5.  **Splitting Data: Training and Testing Sets**:
    *   To evaluate how well our model performs on *unseen* data, we split our dataset into two parts: a **training set** and a **testing set**. Think of it like studying for an exam: you learn the material (training) and then take the exam with new questions (testing) to see what you've truly learned.
    *   `train_test_split` from `sklearn.model_selection` does this for us. We allocate 80% of the data for training (`x_train`, `y_train`) and 20% for testing (`x_test`, `y_test`).
    *   `random_state=42` ensures that if we run this code again, the split will be exactly the same, making our results reproducible.
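    *   Finally, we apply the *same* `encoder` (already fitted on the full target column) to transform `y_train` and `y_test` into their numerical versions (`y_train_encoded`, `y_test_encoded`). Fitting on the full column only teaches the encoder the label-to-integer mapping, which guarantees that every label has a code in both sets. For transformations that learn statistics from feature values (scaling, for example), the usual rule is stricter: fit on the training data only, then apply to the test data.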

In [None]:
# separate features (x) and target (y)
x = df.drop('congestion_risk', axis=1)
y = df['congestion_risk']

# exclude descriptive columns not used for training data
exclude_columns = ['norad_id', 'name', 'epoch', 'data_source', 'snapshot_date', 'last_seen']
categorical_cols = ['object_type', 'satellite_constellation', 'altitude_category', 'orbital_band', 'orbit_lifetime_category', 'country']

# drop excluded columns
x_processed = x.drop(columns=exclude_columns, errors='ignore')

# use ordinal encoding to change categorical features to numerical for training
# (a dedicated encoder, so its fitted categories are not overwritten by the target encoder below)
feature_encoder = OrdinalEncoder()
x_processed[categorical_cols] = feature_encoder.fit_transform(x_processed[categorical_cols])

# fit a separate encoder on the entire target variable 'y' to ensure all possible labels are learned
encoder = OrdinalEncoder()
y_encoded_full = encoder.fit_transform(y.values.reshape(-1, 1))
num_classes = len(encoder.categories_[0]) # get the number of unique classes

# split data into training and testing sets with the processed data
x_train, x_test, y_train, y_test = train_test_split(x_processed, y, test_size=0.2, random_state=42)

# now transform y_train and y_test using the fitted encoder
y_train_encoded = encoder.transform(y_train.values.reshape(-1, 1)).flatten()
y_test_encoded = encoder.transform(y_test.values.reshape(-1, 1)).flatten()
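To make the encoding step concrete, here is a minimal sketch of how `OrdinalEncoder` numbers string labels and how to map the numbers back. The `toy_labels` array and the `toy_` names are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy labels for illustration only; the notebook itself encodes df['congestion_risk'].
toy_labels = np.array(["LOW", "HIGH", "MEDIUM", "LOW"]).reshape(-1, 1)

toy_encoder = OrdinalEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels).flatten()

# Categories are stored in sorted order; each code is an index into that list.
print(toy_encoder.categories_[0])
print(toy_codes)

# inverse_transform maps integer codes back to the original strings,
# which is handy when reading model predictions later on.
decoded = toy_encoder.inverse_transform(toy_codes.reshape(-1, 1)).flatten()
print(decoded)
```

Because the codes follow alphabetical order, it is worth checking `categories_` before interpreting a predicted class index as a particular risk level.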

### 3. Building Our Predictive Brain: The Neural Network Model

Now that our data is squeaky clean and ready, it's time to build the 'brain' that will learn from this data: our machine learning model. For this task, we're using a type of model called a **Fully Connected Neural Network** (also known as a Dense Neural Network), built with `tensorflow` and `keras`. Neural networks are inspired by the human brain and are excellent at finding complex patterns in data.

Let's break down the model's construction:

1.  **`tf.keras.models.Sequential`**: This is like stacking layers on top of each other to build our network. Data flows from one layer to the next.

2.  **`tf.keras.layers.Dense(128, activation='relu', input_shape=(x_train.shape[1],))`**: This is the first 'hidden' layer; its `input_shape` argument also tells the model what input to expect.
    *   `Dense` means every neuron in this layer is connected to every neuron in the previous layer (or the input).
    *   `128` is the number of 'neurons' or units in this layer. More neurons can capture more complex patterns.
    *   `activation='relu'` stands for Rectified Linear Unit. It's a common activation function that helps the network learn non-linear relationships. Think of it as a switch that turns a neuron 'on' or 'off' based on the input it receives.
    *   `input_shape=(x_train.shape[1],)` tells the model how many features it should expect in each input sample (which is the number of columns in our `x_train`).

3.  **`tf.keras.layers.Dropout(0.5)`**: This is a regularization technique. During training, it randomly 'turns off' 50% of the outputs of the previous layer; at prediction time nothing is dropped. Why? To prevent the model from becoming too reliant on any single neuron and to make it more robust. This helps combat **overfitting**, where the model memorizes the training data instead of learning general patterns.

4.  **`tf.keras.layers.Dense(64, activation='relu')`**: Another hidden layer, similar to the first, but with fewer neurons (`64`).

5.  **`tf.keras.layers.Dropout(0.3)`**: Another dropout layer, this time turning off 30% of neurons.

6.  **`tf.keras.layers.Dense(num_classes, activation='softmax')`**: This is our **output layer**.
    *   `num_classes` is the number of unique congestion risk categories we are trying to predict (e.g., 'Low', 'Medium', 'High').
    *   `activation='softmax'` is used for multi-class classification problems. It outputs a probability distribution over the `num_classes`, meaning it tells us the likelihood that a satellite belongs to each risk category. The sum of these probabilities will be 1.
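The building blocks described above can be sketched in plain NumPy. This is a simplified illustration of what a dense layer, ReLU, dropout, and softmax compute, not the Keras implementation; the tiny layer sizes, random weights, and helper names (`relu`, `dropout`, `softmax`) are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # ReLU keeps positive values and replaces negative values with zero.
    return np.maximum(z, 0.0)

def dropout(a, rate, rng):
    # Inverted dropout: zero out roughly a fraction `rate` of the activations
    # and scale the survivors by 1 / (1 - rate) so the expected sum is unchanged.
    keep_mask = rng.random(a.shape) >= rate
    return a * keep_mask / (1.0 - rate)

def softmax(z):
    # Subtract the row maximum for numerical stability, then normalise to sum to 1.
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical tiny network: 4 features -> 8 hidden units -> 3 classes.
n_features, n_hidden, n_classes = 4, 8, 3
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

x_batch = rng.normal(size=(2, n_features))  # two made-up samples

hidden = relu(x_batch @ W1 + b1)                   # dense layer + ReLU
hidden_train = dropout(hidden, rate=0.5, rng=rng)  # applied during training only
probs = softmax(hidden @ W2 + b2)                  # prediction path: no dropout

print(probs)
print(probs.sum(axis=1))  # each row of probabilities sums to 1
```

Each row of `probs` is one sample's probability distribution over the classes; the predicted class is the index of the largest entry.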

### Compiling the Model: Setting Up for Learning

After defining the network's structure, we need to 'compile' it. This step configures the learning process:

*   **`optimizer='adam'`**: The optimizer is like the engine of our learning process. 'Adam' is a very popular and effective algorithm that adjusts the internal weights of the neural network to minimize errors during training.
*   **`loss='sparse_categorical_crossentropy'`**: The 'loss function' measures how far off our model's predictions are from the true values. For multi-class classification with integer-encoded labels, `sparse_categorical_crossentropy` is the appropriate choice. The optimizer tries to minimize this loss.
*   **`metrics=['accuracy']`**: Metrics are what we use to monitor the training process and evaluate the model's performance. 'Accuracy' is straightforward: it tells us the proportion of correctly predicted instances.
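As a worked example of the loss and metric above, the following sketch computes sparse categorical cross-entropy and accuracy by hand for two made-up predictions. The function name, the clipping constant, and the example probabilities are assumptions for illustration; Keras performs the equivalent computation internally.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    # For each sample, pick the probability assigned to the true class
    # and take its negative log. Clipping avoids log(0).
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.log(np.clip(picked, eps, 1.0))

# Integer-encoded true classes and hypothetical softmax outputs.
y_true = np.array([0, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

losses = sparse_categorical_crossentropy(y_true, probs)
mean_loss = losses.mean()

# Accuracy: fraction of samples whose most probable class is the true class.
accuracy = (probs.argmax(axis=1) == y_true).mean()

print(losses, mean_loss, accuracy)
```

The per-sample losses here are -ln(0.7) ≈ 0.357 and -ln(0.6) ≈ 0.511: the more probability the model puts on the correct class, the smaller the loss. Both predictions are correct, so accuracy is 1.0 even though the loss is not zero.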

### Training and Evaluating the Model: The Learning Begins!

Finally, we train our model using the `cnn_model.fit()` method (despite the `cnn` in the variable name, this is a dense network, not a convolutional one). This is where the model 'learns' from the training data:

*   **`x_train`, `y_train_encoded`**: Our training features and their corresponding encoded target labels.
*   **`epochs=10`**: An epoch is one complete pass through the entire training dataset. We're telling the model to iterate over the data 10 times.
*   **`batch_size=32`**: Instead of feeding all data at once (which can be memory-intensive), the model processes data in smaller 'batches' (here, 32 samples at a time) and updates its weights after each batch.
*   **`validation_split=0.2`**: During training, we reserve a small portion (20%) of the training data as a 'validation set'. The model does not learn from this data directly, but we use it to monitor performance during training. This helps us catch overfitting early.
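The arithmetic behind these settings can be checked with a few lines of Python. This is a rough sketch that assumes the 13,610 rows reported by `df.info()`, that `train_test_split` rounds the test share up, and that Keras holds out the final 20% of the rows passed to `fit` (rounding the training share down); the variable names are illustrative.

```python
import math

n_rows = 13610            # row count reported by df.info()
test_size = 0.2           # share passed to train_test_split
validation_split = 0.2    # share passed to fit()
batch_size = 32

# Rows set aside for the final test, and rows handed to fit().
n_test = math.ceil(n_rows * test_size)
n_train_full = n_rows - n_test

# Of the rows handed to fit(), the last 20% are held out for validation.
n_fit = math.floor(n_train_full * (1 - validation_split))
n_val = n_train_full - n_fit

# One weight update per batch; the last batch may be smaller than batch_size.
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_test, n_train_full, n_fit, n_val, steps_per_epoch)
```

Under these assumptions the model takes 273 weight updates per epoch, which is consistent with the `273/273` progress counter in the training output.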

After training, we `evaluate` the model on our completely unseen `x_test` and `y_test_encoded` data to get a final, unbiased measure of its performance. The `accuracyCNN` value tells us how well our neural network performed in predicting the congestion risk for satellites it had never seen before.

In [None]:
# define a fully connected (dense) neural network model
cnn_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(x_train.shape[1],)), # input layer with number of features
    tf.keras.layers.Dropout(0.5), # dropout layer to prevent overfitting
    tf.keras.layers.Dense(64, activation='relu'), # hidden layer
    tf.keras.layers.Dropout(0.3), # dropout layer to prevent overfitting
    tf.keras.layers.Dense(num_classes, activation='softmax') # output layer with number of classes
])

# compile the model
cnn_model.compile(optimizer='adam', # popular and efficient algorithm
              loss='sparse_categorical_crossentropy', # sparse categorical cross-entropy is appropriate for multi-class classification problems
              metrics=['accuracy'])

# fit the model
cnn_model.fit(x_train, y_train_encoded, epochs=10, batch_size=32, validation_split=0.2)

# evaluate the model
lossCNN, accuracyCNN = cnn_model.evaluate(x_test, y_test_encoded)
print('CNN Test accuracy:', accuracyCNN)

Epoch 1/10
273/273 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.6967 - loss: 170.0155 - val_accuracy: 0.8246 - val_loss: 15.7885
Epoch 2/10
273/273 ━━━━━━━━━━━━━━━━━━━━ 3s 10ms/step - accuracy: 0.6907 - loss: 35.1784 - val_accuracy: 0.8251 - val_loss: 0.7830
Epoch 3/10
273/273 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7800 - loss: 6.5387 - val_accuracy: 0.8251 - val_loss: 0.6160
Epoch 4/10
273/273 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8109 - loss: 2.3573 - val_accuracy: 0.8251 - val_loss: 0.5904
Epoch 5/10
273/273 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8165 - loss: 1.7016 - val_accuracy: 0.8251 - val_loss: 0.5419
Epoch 6/10
273/273 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.8190 - loss: 1.3572 - val_accuracy: 0.8251 - val_loss: 0.5328
Epoch 7/10
273/273

### 4. Conclusion: Neural Networks for Satellite Congestion

In this journey, we've successfully built and trained a fully connected neural network to predict satellite congestion risk. We started from raw data, meticulously preprocessed it, and then designed a neural network architecture capable of learning complex patterns.

Our neural network achieved an accuracy of approximately **0.817 (81.7%)** on unseen test data, so it places most satellites in the correct congestion risk category. It's worth noting, though, that the Random Forest Classifier from the previous part performed even better. A likely reason is that this is tabular data in which the risk categories are largely separated by thresholds on individual features, and tree-based models such as Random Forests excel at making decisions based on such clear splits.
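A single accuracy figure can hide how the model treats each risk category, especially when one category is much more common than the others. The sketch below shows one way to look at per-class recall from softmax outputs. The labels and probabilities are made up to illustrate the effect, and `per_class_recall` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    # confusion[t, p] counts samples of true class t predicted as class p.
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    support = confusion.sum(axis=1)
    # Recall per class = correct predictions / number of true samples of that class.
    recall = np.divide(np.diag(confusion), support,
                       out=np.zeros(n_classes), where=support > 0)
    return confusion, recall

# Made-up, imbalanced example: 8 samples of class 1, one each of classes 0 and 2,
# and a model that always puts most probability on class 1.
y_true = np.array([1] * 8 + [0] + [2])
probs = np.tile([0.1, 0.8, 0.1], (10, 1))

y_pred = probs.argmax(axis=1)          # most probable class per sample
accuracy = (y_pred == y_true).mean()
confusion, recall = per_class_recall(y_true, y_pred, n_classes=3)

print(accuracy)
print(confusion)
print(recall)
```

In this toy case accuracy is 0.8 even though two of the three classes are never predicted. For the notebook's model, the same check could be run on `cnn_model.predict(x_test).argmax(axis=1)` against `y_test_encoded`.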