- Loading the Dataset: The CSV file (thermoracleTrain.csv) is loaded into a Pandas DataFrame named df.
- The dataset contains 60,000 rows and 11 columns; the Prediction column, which takes values 0 to 9, is the target.
- Dataset Statistical Summary: The mean of every column (except Prediction) is close to zero, indicating the data is already scaled. Two categorical columns are present: Magnetic Field Strength and Radiation Levels.
- Rows where the Prediction column is null are dropped from the dataset.
- Remaining missing values in the numerical columns are filled with the respective column medians (see the sketch below).
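A minimal pandas sketch of these loading and cleaning steps:

```python
import pandas as pd

# Load the training data into a DataFrame named df, as described above
df = pd.read_csv("thermoracleTrain.csv")

# Drop rows where the target itself is missing
df = df.dropna(subset=["Prediction"])

# Fill remaining missing values in the numerical columns with column medians
num_cols = df.select_dtypes(include="number").columns.drop("Prediction")
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
```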
- The value counts of the target classes show that the data is fairly balanced.
- The categorical columns, Magnetic Field Strength and Radiation Levels, take values Category_1 to Category_20, which are mapped to the integers 1 to 20 (treating them as ordinal levels), converting both columns to numerical form.
- K-Nearest Neighbours imputation is then used to fill the missing values in these two columns (sketch below).
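A sketch of the mapping and imputation; the use of scikit-learn's KNNImputer and the choice of 5 neighbours are assumptions:

```python
import pandas as pd
from sklearn.impute import KNNImputer

cat_cols = ["Magnetic Field Strength", "Radiation Levels"]

# Map "Category_1" ... "Category_20" to the ordinal levels 1 ... 20
for col in cat_cols:
    df[col] = df[col].str.extract(r"Category_(\d+)", expand=False).astype(float)

# KNN imputation estimates the missing levels from the most similar rows
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```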
- Distributions of all features are plotted to check for skewness and to understand their overall shape.
- As part of the bivariate analysis, pairwise correlations of all features are plotted in a correlation matrix.
- Surface Temperature and Mineral Abundance are seen to have strong positive correlations with the target column.
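These univariate and bivariate plots might be produced along the following lines (seaborn/matplotlib assumed):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate view: one histogram per feature to inspect skewness
df.drop(columns=["Prediction"]).hist(bins=50, figsize=(14, 10))
plt.tight_layout()
plt.show()

# Bivariate view: pairwise correlation heatmap
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```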
- Rows with absolute z-scores below 3 are retained and the rest dropped, leaving 55,778 rows.
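A sketch of the z-score filter, assuming it is applied jointly across the feature columns:

```python
import numpy as np
from scipy import stats

# Keep a row only if every numerical feature has |z-score| < 3
feature_cols = df.columns.drop("Prediction")
z = np.abs(stats.zscore(df[feature_cols]))
df = df[(z < 3).all(axis=1)]
```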
- In the ensemble-based noise detection method, a list of classifiers is chosen to predict class labels on the dataset. These models are not pre-trained; rather, they are instantiated and used within a cross-validation framework to generate an out-of-fold prediction for each instance. The ensemble method then computes how often each instance is misclassified across all the classifiers: the more classifiers that disagree with the recorded label, the higher the noise score for that instance, suggesting it may be mislabeled or noisy.
- The top 20 percent of instances by noise score are removed, leaving 50,186 rows (see the sketch below).
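One way to implement this filter; the particular classifiers and the 5-fold scheme are assumptions, not the exact choices used here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X = df.drop(columns=["Prediction"]).values
y = df["Prediction"].values

# Placeholder list of classifiers for the ensemble
classifiers = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
]

# Noise score = fraction of classifiers whose out-of-fold prediction
# disagrees with the recorded label
noise_score = np.zeros(len(y))
for clf in classifiers:
    preds = cross_val_predict(clf, X, y, cv=5)
    noise_score += (preds != y)
noise_score /= len(classifiers)

# Keep the 80% of instances with the lowest noise scores
keep = np.argsort(noise_score)[: int(0.80 * len(y))]
X, y = X[keep], y[keep]
```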
We developed a custom neural network architecture for the classification task; its details are as follows.
This model uses a deep learning approach with residual connections to classify cosmic objects.
- Each residual block maintains a skip connection that allows gradients to flow through the network more efficiently.
- Uses L2 regularization to prevent overfitting.
- Implements BatchNormalization to stabilize learning.
- Uses LeakyReLU activation for better gradient flow.
- Includes dropout for additional regularization.
- Input Layer: Accepts 15 astronomical features.
- Initial Feature Extraction: 256 neurons with regularization.
- Residual Blocks: Four blocks with decreasing dimensions (256 → 192 → 128 → 64).
- Output Layer: 10 classification categories using softmax activation.
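A Keras sketch consistent with this description; the dropout rate, the L2 coefficient, and the exact shortcut projection are assumptions beyond what is stated above:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def residual_block(x, units, l2=1e-4, dropout=0.3):
    """Dense block with BatchNorm, LeakyReLU, Dropout and a skip connection."""
    shortcut = x
    x = layers.Dense(units, kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(dropout)(x)
    # Project the shortcut when its width differs from the block output
    if shortcut.shape[-1] != units:
        shortcut = layers.Dense(units)(shortcut)
    return layers.Add()([x, shortcut])

inputs = keras.Input(shape=(15,))                    # 15 astronomical features
x = layers.Dense(256, kernel_regularizer=regularizers.l2(1e-4))(inputs)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)
for units in (256, 192, 128, 64):                    # four residual blocks
    x = residual_block(x, units)
outputs = layers.Dense(10, activation="softmax")(x)  # 10 output classes
model = keras.Model(inputs, outputs)
```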
✅ SGD Optimizer: Uses momentum and Nesterov acceleration.
✅ Adaptive Learning Rate: Reduces learning rate when improvement plateaus.
✅ Early Stopping: Prevents overfitting by monitoring validation accuracy.
✅ Data Augmentation: Applies random noise to training data for better generalization.
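The training configuration might look like the following; the hyperparameter values (learning rate, momentum, patience, noise scale) and the variable names X_train, y_train, X_val, y_val are assumptions:

```python
import numpy as np
from tensorflow import keras

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # Reduce the learning rate when validation accuracy plateaus
    keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy", factor=0.5, patience=5),
    # Stop early (restoring the best weights) if validation accuracy stalls
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=15,
                                  restore_best_weights=True),
]

# Augmentation: small Gaussian noise added to the training features
X_train_noisy = X_train + np.random.normal(0.0, 0.05, size=X_train.shape)

history = model.fit(X_train_noisy, y_train,
                    validation_data=(X_val, y_val),
                    epochs=200, batch_size=256, callbacks=callbacks)
```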
This architecture combines modern deep learning techniques like residual connections, regularization, and normalization to create a robust classifier for cosmic objects while minimizing overfitting.
This deep learning model demonstrates strong performance across all classes with an overall accuracy of 95.85%.
📌 Overall Accuracy: 96.50% across all 10 planet classifications
📌 Average Precision: 96.36% (macro)
📌 Average Recall: 96.33%
📌 Average F1-Score: 96.34%
- Best Performing Class: Class 1 with 99.00% F1-score (99.60% recall, 98.42% precision)
- Lowest Performing Class: Class 9 with 91.99% F1-score (92.09% recall, 91.89% precision)
- Most Balanced Classes: Classes 0, 2, and 6 show nearly equal precision and recall
- Confusion Patterns: Class 9 has the most misclassifications, particularly with Class 4
- False Positives: Most prevalent in Classes 8 and 9
- False Negatives: Most common in Class 5 (8% of samples)
✅ The model maintains consistent performance across all classes with minimal variance in metrics.
✅ No significant bias toward any particular class, despite slight variations in support counts.
✅ Weighted averages closely match macro averages, indicating good class balance.
This balanced performance suggests that the model has successfully learned distinctive features for each planetary classification while maintaining generalizability. 🚀
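These metrics can be reproduced with scikit-learn's standard reports; X_test and y_test are hypothetical hold-out arrays:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_pred = np.argmax(model.predict(X_test), axis=1)

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_test, y_pred, digits=4))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
```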
- The training accuracy consistently increases, reaching around 95%.
- The validation accuracy surpasses training accuracy early and stabilizes around 96%.
- The training loss starts high but steadily decreases, indicating effective learning.
- The validation loss also decreases but remains slightly higher than training loss.
- Validation accuracy does not drop significantly, and validation loss does not increase at later epochs.
- The small gap between training and validation loss suggests good generalization.
- The model continues improving without drastic fluctuations, suggesting stable training.
- If validation loss starts increasing after more epochs, early stopping can be applied to prevent overfitting.
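The curves discussed above can be plotted directly from the Keras History object returned by model.fit:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(history.history["accuracy"], label="train")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_title("Accuracy per epoch")
ax1.legend()

ax2.plot(history.history["loss"], label="train")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_title("Loss per epoch")
ax2.legend()

plt.show()
```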