Question 3: Experimental Investigation

This question investigates experimentally how class imbalance affects model performance.
The Kepler dataset contains far more non-confirmed detections than confirmed exoplanets
(roughly three to one in the test split used below), which may bias classifiers towards
the majority class. The experiment tests whether applying class weighting improves the
detection of confirmed exoplanets.
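
To make the imbalance concrete, the short sketch below computes the class shares and the accuracy that a trivial "always predict non-confirmed" classifier would achieve. The counts are the test-set supports reported in the classification reports later in this notebook; the training split is assumed, not confirmed, to have a similar ratio.

```python
# Test-set support per class, copied from the classification reports below.
# Assumption: the training split has a similar class ratio.
support = {"non-confirmed (0)": 1454, "confirmed (1)": 459}

total = sum(support.values())
class_share = {label: count / total for label, count in support.items()}

# Accuracy of a classifier that always predicts the majority class.
majority_baseline_accuracy = max(support.values()) / total

for label, share in class_share.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Because such a trivial classifier already reaches a high accuracy while detecting no planets at all, accuracy alone is a weak measure here; per-class recall and precision are more informative.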


In [3]:
import sys
from importlib import reload

import pandas as pd
import kagglehub

# Remove any previous clone so the cell can be re-run cleanly
!rm -rf /content/up2115556-machine-learning-and-neural-network-coursework-2
# Clone the coursework repository that contains the helper functions
!git clone https://github.com/up2115556/up2115556-machine-learning-and-neural-network-coursework-2.git
sys.path.insert(0, "/content/up2115556-machine-learning-and-neural-network-coursework-2")

# Import the helpers module and reload it so the freshly cloned version is used
import helpers.functions as funcs
reload(funcs)
prepare_kepler_data = funcs.prepare_kepler_data
make_train_test_split = funcs.make_train_test_split

# Download the Kepler dataset
path = kagglehub.dataset_download("nasa/kepler-exoplanet-search-results")
df = pd.read_csv(f"{path}/cumulative.csv")

# Prepare features and labels, then split into train and test sets
X, y = prepare_kepler_data(df)
X_train_scaled, X_test_scaled, y_train, y_test, scaler = make_train_test_split(X, y)

Cloning into 'up2115556-machine-learning-and-neural-network-coursework-2'...
remote: Enumerating objects: 60, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 60 (delta 19), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (60/60), 26.05 KiB | 4.34 MiB/s, done.
Resolving deltas: 100% (19/19), done.
Using Colab cache for faster access to the 'kepler-exoplanet-search-results' dataset.


Experimental Setup

A Logistic Regression model is used as a baseline and compared against the same model
trained with class weighting applied. All preprocessing steps, including feature
standardisation and the train-test split, are kept identical to ensure a controlled
experiment. The only modification is the introduction of class weighting to penalise
misclassification of confirmed exoplanets more heavily.
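
As a rough illustration of what `class_weight="balanced"` does, scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python. The helper name `balanced_class_weights` is hypothetical, and the counts are the test-set supports from this notebook, used only as a stand-in because the training-set counts are not printed.

```python
def balanced_class_weights(class_counts):
    """Return weight_c = n_samples / (n_classes * count_c) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Illustrative counts only (test-set support); the model itself derives its
# weights from y_train, whose counts are assumed to be in a similar ratio.
illustrative_counts = {0: 1454, 1: 459}
weights = balanced_class_weights(illustrative_counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
print(f"minority/majority weight ratio: {weights[1] / weights[0]:.2f}")
```

Under these assumed counts, an error on a confirmed exoplanet contributes roughly three times as much to the loss as an error on a non-confirmed detection, which is what pushes the decision boundary towards higher recall for the minority class.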


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# baseline logistic regression model
baseline_model = LogisticRegression(max_iter=500)
baseline_model.fit(X_train_scaled, y_train)
y_pred_base = baseline_model.predict(X_test_scaled)

print("baseline model")
print(confusion_matrix(y_test, y_pred_base))
print(classification_report(y_test, y_pred_base))

# logistic regression with class weighting
weighted_model = LogisticRegression(
    max_iter=500,
    class_weight="balanced")
weighted_model.fit(X_train_scaled, y_train)
y_pred_weighted = weighted_model.predict(X_test_scaled)

print("\nweighted model")
print(confusion_matrix(y_test, y_pred_weighted))
print(classification_report(y_test, y_pred_weighted))


baseline model
[[1347  107]
 [  72  387]]
              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94      1454
         1.0       0.78      0.84      0.81       459

    accuracy                           0.91      1913
   macro avg       0.87      0.88      0.87      1913
weighted avg       0.91      0.91      0.91      1913


weighted model
[[1271  183]
 [  24  435]]
              precision    recall  f1-score   support

         0.0       0.98      0.87      0.92      1454
         1.0       0.70      0.95      0.81       459

    accuracy                           0.89      1913
   macro avg       0.84      0.91      0.87      1913
weighted avg       0.91      0.89      0.90      1913



Results and Discussion

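Applying class weighting raises recall for confirmed exoplanets from 0.84 to 0.95,
cutting false negatives from 72 to 24 out of the 459 confirmed planets in the test set.
This improvement comes at the cost of precision, which falls from 0.78 to 0.70 as false
positives rise from 107 to 183; the same increase lowers specificity (recall for class 0)
from 0.93 to 0.87. Overall accuracy drops slightly, from 0.91 to 0.89, while the F1-score
for the confirmed class is unchanged at 0.81. These results demonstrate a clear trade-off
between sensitivity and specificity when addressing class imbalance in binary
classification.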
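
The per-class figures can be checked directly from the two confusion matrices printed above. The sketch below assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels, so each matrix reads `[[TN, FP], [FN, TP]]`; the helper name `binary_metrics` is introduced here purely for illustration.

```python
def binary_metrics(matrix):
    """Derive recall, precision and specificity from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    return {
        "recall": tp / (tp + fn),        # sensitivity for confirmed planets
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),   # recall for the non-confirmed class
        "false_negatives": fn,
        "false_positives": fp,
    }


# Confusion matrices copied from the cell output above.
baseline = binary_metrics([[1347, 107], [72, 387]])
weighted = binary_metrics([[1271, 183], [24, 435]])

for name in ("recall", "precision", "specificity"):
    print(f"{name:12s} baseline={baseline[name]:.2f}  weighted={weighted[name]:.2f}")
```

The values agree with the classification reports to two decimal places, which confirms that the gain in recall and the losses in precision and specificity all follow from the same shift of 48 detections from false negative to true positive and 76 from true negative to false positive.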


Conclusion

This experiment highlights the importance of addressing class imbalance in exoplanet
classification tasks. Class weighting improves sensitivity to confirmed exoplanets,
which may be desirable in scientific discovery contexts where missing true positives
is costly. However, the increase in false positives underscores the need to balance
recall and precision based on application requirements.
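
One possible way to make "application requirements" explicit is the F-beta score, which weights recall beta times as heavily as precision. This metric was not part of the experiment above; the sketch below simply applies the standard formula to the reported confusion-matrix counts, with the function name `f_beta` chosen for illustration.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from counts; beta > 1 favours recall, beta < 1 precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts for the confirmed class, taken from the confusion matrices above.
baseline_counts = {"tp": 387, "fp": 107, "fn": 72}
weighted_counts = {"tp": 435, "fp": 183, "fn": 24}

for beta in (0.5, 1.0, 2.0):
    base_score = f_beta(beta=beta, **baseline_counts)
    weighted_score = f_beta(beta=beta, **weighted_counts)
    print(f"beta={beta}: baseline={base_score:.3f}  weighted={weighted_score:.3f}")
```

With beta = 2 the weighted model scores higher, whereas with beta = 0.5 the baseline does, so the preferred model depends on how costly a missed planet is relative to a false alarm.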
