This notebook explores the [NASA JPL Asteroid Dataset](https://www.kaggle.com/sakhawat18/asteroid-dataset) and uses machine learning techniques to build a model that predicts whether asteroids are potentially hazardous. The notebook covers the following aspects of machine learning:
1. Data Exploration
2. Data Wrangling
3. Data Preprocessing
4. ML Model Developing
5. Conclusion

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Import data to dataframe
data = pd.read_csv('/kaggle/input/asteroid-dataset/dataset.csv')

# 1. Data Exploration

In this section we shall explore the columns of the dataframe and analyse them accordingly.

### Basic Column Definition from the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi)
* SPK-ID: Object primary SPK-ID
* Object ID: Object internal database ID
* Object fullname: Object full name/designation
* pdes: Object primary designation
* name: Object IAU name
* NEO: Near-Earth Object (NEO) flag
* PHA: Potentially Hazardous Asteroid (PHA) flag
* H: Absolute magnitude parameter
* Diameter: Object diameter (from equivalent sphere), in km
* Albedo: Geometric albedo
* Diameter_sigma: 1-sigma uncertainty in object diameter, in km
* Orbit_id: Orbit solution ID
* Epoch: Epoch of osculation in modified Julian day form
* Equinox: Equinox of reference frame
* e: Eccentricity
* a: Semi-major axis, in au
* q: Perihelion distance, in au
* i: Inclination; angle with respect to the x-y ecliptic plane
* tp: Time of perihelion passage, in TDB
* moid_ld: Earth Minimum Orbit Intersection Distance, in lunar distances (LD); the 'moid' column holds the same quantity in au

In [None]:
pd.set_option('display.max_columns', 500)
data.head()

In [None]:
data.columns

In [None]:
data.describe()

### Analyse columns

Based on the description of the data above, many features have missing values. Before imputing or eliminating them, we first need to understand the kind of data each feature holds.


In [None]:
data.shape

In [None]:
# 1. Identifier columns: number of unique values in each
print(data['id'].nunique())
print(data['spkid'].nunique())
print(data['full_name'].nunique())
print(data['pdes'].nunique())

No missing values exist in the asteroid identifying columns.

Now analyse all columns of the 'object' datatype

In [None]:
# Potentially hazardous asteroids
data['pha'].value_counts(normalize=True)

In [None]:
# Near Earth Object
data['neo'].value_counts(normalize=True)

In [None]:
# Asteroid orbit ID
print(data['orbit_id'].unique())
print(data['orbit_id'].nunique())

In [None]:
# Comet Designation prefix
print(data['prefix'].unique())
print(data['prefix'].nunique())

In [None]:
# Equinox reference
print(data['equinox'].unique())
print(data['equinox'].nunique())

In [None]:
# Orbit classification
print(data['class'].unique())
print(data['class'].nunique())

Columns 'id', 'spkid' and 'full_name' are unique for each row. The 'full_name' values are split across columns 'pdes' and 'name', so those two columns can be removed since they add nothing to the analysis. The 'id' column has alphanumeric values whereas 'spkid' does not, so 'id' can be removed as well.

Columns 'prefix' and 'equinox' have only one value so they can be eliminated as well.

In [None]:
data1 = data.drop(['id', 'pdes', 'name', 'prefix', 'equinox'], axis='columns', inplace=False)

# 2. Data Wrangling

### Analyse missing values

Most columns have almost no missing values. The 'sigma' columns seem to be missing values for the same number of rows. Although the 'name' column has 97% missing values, it is paired with 'pdes' to make a full name.

Columns 'diameter', 'albedo' and 'diameter_sigma' have 85% missing values. Since these values cannot be measured or derived, these columns can be removed.

Columns 'pha', 'moid' and those with the 'sigma' prefix are missing values in the same rows. Since this affects only 2% of the data, we can remove these entries.
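
The percentages quoted above are not computed in any cell shown here. A minimal sketch of how they could be checked, using a small made-up frame (`toy` and its values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data1`; the values are invented for illustration only.
toy = pd.DataFrame({
    "pha": ["N", "Y", None, "N"],
    "diameter": [np.nan, np.nan, np.nan, 1.2],
    "H": [16.7, 20.1, 18.3, 22.0],
})

# Fraction of missing values per column, largest first.
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)

# For the rows where 'pha' is missing, check which other columns are missing too.
rows_missing_pha = toy["pha"].isna()
print(toy.loc[rows_missing_pha].isna().all())
```

On the real data, the equivalent call would be `data1.isna().mean()`.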

In [None]:
asteroid_df = data1[data1['pha'].notna()]
asteroid_df = asteroid_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis= 'columns')

There are a few values missing in column 'H' - absolute magnitude. This can be determined using albedo and diameter but since those columns no longer exist, we can remove the rows with missing 'H' values.

In [None]:
asteroid_df = asteroid_df[asteroid_df['H'].notna()]

Remove the remaining rows with missing values using column 'sigma_ad', since it seems to have the most missing values.

In [None]:
asteroid_df = asteroid_df[asteroid_df['sigma_ad'].notna()]
asteroid_df = asteroid_df[asteroid_df['ma'].notna()] # Remove row with the one missing value for 'ma'

### Columns Data Types

Certain column types need to be changed before the machine learning models can use them, since most models cannot process text (object) columns directly. <br> 
Convert columns 'neo', 'pha' and 'class' to categorical variables.

In [None]:
asteroid_df['neo'] = asteroid_df['neo'].astype('category')
asteroid_df['pha'] = asteroid_df['pha'].astype('category')
asteroid_df['class'] = asteroid_df['class'].astype('category')

These categories can be further analysed to understand their distribution by answering questions pertinent to their features.

In [None]:
# What percent of asteroids are near earth objects?

asteroid_df['neo'].value_counts(normalize=True)*100

In [None]:
# Of the near earth objects, what percent of them are potentially hazardous asteroids?

asteroid_df[asteroid_df['neo']=='Y']['pha'].value_counts(normalize=True)*100

In [None]:
# What percent of asteroids in the dataset are potentially hazardous asteroids?

asteroid_df['pha'].value_counts(normalize=True)*100

In [None]:
# Of the potentially hazardous asteroids, what percent of them are near earth objects?

asteroid_df[asteroid_df['pha']=='Y']['neo'].value_counts(normalize=True)*100

In [None]:
# What is the distribution of the orbit classification?

asteroid_df['class'].value_counts(normalize=True)*100

In [None]:
# How many orbit IDs exist?

asteroid_df['orbit_id'].nunique()

In this dataset, 99.7% of the asteroids are non-hazardous. All the potentially hazardous asteroids are near-Earth objects (neo). On the other hand, only 9% of the near-Earth objects are hazardous.
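
The relationship between the two flags can also be read from a single contingency table. A sketch using `pd.crosstab` on invented values (the toy counts do not reflect the dataset):

```python
import pandas as pd

# Invented flags for eight objects; not taken from the dataset.
toy = pd.DataFrame({
    "neo": ["Y", "Y", "Y", "N", "N", "N", "N", "Y"],
    "pha": ["Y", "N", "N", "N", "N", "N", "N", "N"],
})

# Row-normalised table: each row gives the share of 'pha' values within one 'neo' group.
table = pd.crosstab(toy["neo"], toy["pha"], normalize="index")
print(table)
```

On the real data, the same call would be `pd.crosstab(asteroid_df['neo'], asteroid_df['pha'], normalize='index')`.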

Our focus is to predict if an asteroid is potentially hazardous. 

# 3. Data Preprocessing

Before creating machine learning models, it is important to make sure the data being provided isn't unwieldy. For example, the 'orbit_id' feature has 525 unique categories to identify the asteroid's orbit solution. We can reduce this number by grouping the rarely occurring orbit IDs.

In [None]:
# orbit_id values that have fewer than 10 occurrences
orbits = asteroid_df['orbit_id'].value_counts().loc[lambda x: x<10].index.to_list()

In [None]:
len(orbits)

There are 331 orbit IDs that occur fewer than 10 times. We can relabel all of them as 'other', which reduces the number of categories without dropping any rows.

In [None]:
asteroid_df.loc[asteroid_df['orbit_id'].isin(orbits), 'orbit_id'] = 'other'

The data needs to be normalised before it is used to train models, so that all the numeric features are on the same scale. For this we use a min-max scaler.

In [None]:
# Reset the index
asteroid_df = asteroid_df.reset_index(drop=True)

In [None]:
# Create a subset of only numerical columns to scale
subset_df = asteroid_df[asteroid_df.columns[~asteroid_df.columns.isin(['spkid', 'full_name', 'neo', 'pha', 'orbit_id', 'class'])]]

In [None]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaled_df = scaler.fit_transform(subset_df)
scaled_df = pd.DataFrame(scaled_df, columns=subset_df.columns)
asteroid_df = pd.concat([asteroid_df[['spkid', 'full_name', 'neo', 'pha', 'orbit_id', 'class']],scaled_df], axis=1)
scaled_df.head()
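
Min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min). The cell above fits the scaler on the full dataset. A common alternative is to compute the minimum and maximum on the training rows only and reuse them for the test rows, so that no test-set information enters preprocessing. A NumPy sketch of that variant with made-up numbers (this is not what the notebook does):

```python
import numpy as np

# Invented training and test rows with two features each.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training rows only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
scale = col_max - col_min  # assumes no constant columns (scale would be 0)

train_scaled = (train - col_min) / scale
test_scaled = (test - col_min) / scale  # may fall outside [0, 1]
print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1], which is expected with this approach.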

Convert the categorical columns 'neo' and 'class' and the object column 'orbit_id' into one-hot encoded variables.

In [None]:
# 1. Create one-hot encoding columns using get_dummies
asteroid_df1 = pd.get_dummies(asteroid_df, columns=['neo', 'class', 'orbit_id'])
asteroid_df1.head()
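
To make the effect of `get_dummies` concrete, here is a small sketch on an invented two-column frame: the encoded column is replaced by one indicator column per category, and the remaining columns are kept unchanged.

```python
import pandas as pd

# Invented rows; only the 'neo' column is encoded.
toy = pd.DataFrame({"neo": ["Y", "N", "N"], "H": [20.1, 16.7, 18.3]})

encoded = pd.get_dummies(toy, columns=["neo"])
print(encoded.columns.tolist())
print(encoded)
```

Applied to 'orbit_id', this produces one column per remaining orbit ID, which is why the rare IDs were grouped beforehand.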

# 4. ML Model Developing

Now that the data is ready to be modeled, there are a wide range of algorithms that can be put to use. The goal is to predict if an asteroid is potentially hazardous or not. For this classification problem, we can use the following algorithms.

1. Logistic Regression
2. Random Forest
3. Light Gradient Boosting

The best performing model can then be selected as a winner to conduct reliable predictions.

Before developing the models, we need to create the train and test sets. Remove 'spkid' and 'full_name' since they are not required for modelling. The feature 'pha' is used only as the label.

In [None]:
from sklearn.model_selection import train_test_split

X = asteroid_df1.drop(['spkid', 'full_name', 'pha'], axis=1)
y = asteroid_df1['pha']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1501)
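
With so few positive examples, a purely random split can leave the two partitions with slightly different class ratios. `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both. The notebook does not use it; the following is a sketch on synthetic labels (`X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["Y"] * 10 + ["N"] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy
)

# Number of positives in each partition.
print((y_tr == "Y").sum(), (y_te == "Y").sum())
```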

The data is highly imbalanced, with over 99% of the data belonging to the negative class. This could sway the models into predicting only the negative class for any input. For this reason, it's best to oversample the positive class so that both classes have an equal number of samples. This is achieved using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html) from the imbalanced-learn library.

In [None]:
print("Before OverSampling, counts of label 'N': {}".format(sum(y_train == 'N'))) 
print("Before OverSampling, counts of label 'Y': {} \n".format(sum(y_train == 'Y'))) 
  
# import SMOTE module from imblearn library 
from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state = 12) 
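# fit_resample is the current method name; the older fit_sample alias is gone from recent imbalanced-learn releases
x_train_res, y_train_res = sm.fit_resample(x_train, y_train.to_numpy())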
  
print("After OverSampling, counts of label 'N': {}".format(sum(y_train_res == 'N'))) 
print("After OverSampling, counts of label 'Y': {}".format(sum(y_train_res == 'Y'))) 
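
SMOTE creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The following NumPy sketch illustrates that idea only; `smote_like` is a hypothetical helper, not imbalanced-learn's implementation, and the points are invented.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(minority))       # pick a minority sample
        nn = neighbours[i, rng.integers(k)]   # pick one of its k nearest neighbours
        gap = rng.random()                    # position along the segment, in [0, 1)
        synthetic[j] = minority[i] + gap * (minority[nn] - minority[i])
    return synthetic

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority points, oversampling is applied to the training set only, as in the cell above, and the test set is left untouched.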

Create a function to calculate the metrics of each model.

In [None]:
def metricCalculation(y_test, pred):
    
    precision_metric = metrics.precision_score(y_test, pred, average = "macro")
    recall_metric = metrics.recall_score(y_test, pred, average = "macro")
    accuracy_metric = metrics.accuracy_score(y_test, pred)
    f1_metric = metrics.f1_score(y_test, pred, average = "macro")
    print('Precision metric:',round(precision_metric, 2))
    print('Recall Metric:',round(recall_metric, 2))
    print('Accuracy Metric:',round(accuracy_metric, 4))
    print('F1 score:',round(f1_metric, 2))

### 1. Logistic Regression

Logistic Regression will be the baseline model for the dataset. Using the metrics from this model, we can compare metrics from the other models and tune them to achieve better values.

In [None]:
# Import the model
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logisticRegr = LogisticRegression(max_iter= 10000) # create object for the class

# Fit to train model with features and labels
logisticRegr.fit(x_train_res, y_train_res)

# Predict for test set
lr_pred = logisticRegr.predict(x_test)

In [None]:
# Calculate metrics
metricCalculation(y_test, lr_pred)

In [None]:
# Print confusion matrix
print(metrics.confusion_matrix(y_test, lr_pred))

Although the accuracy and recall of the model are high, the precision and F1 score paint a different picture. The low precision and F1 score indicate that the model does not separate the two classes well. The confusion matrix shows a high number of false positives. Logistic regression is not powerful enough to predict the nature of the asteroids reliably.
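
Since `metricCalculation` uses macro averaging, each class counts equally regardless of its size, which is how accuracy can stay high while precision drops. A worked example with an invented confusion matrix (these counts are illustrative and are not the notebook's results):

```python
# Invented counts: many true negatives, few positives, some false positives.
tn, fp, fn, tp = 9900, 80, 1, 19

accuracy = (tp + tn) / (tn + fp + fn + tp)

# Per-class precision and recall.
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)
recall_pos = tp / (tp + fn)
recall_neg = tn / (tn + fp)

# Macro average: unweighted mean over the two classes.
macro_precision = (precision_pos + precision_neg) / 2
macro_recall = (recall_pos + recall_neg) / 2

print(round(accuracy, 4), round(macro_precision, 2), round(macro_recall, 2))
```

Here the false positives outnumber the true positives, so the positive-class precision is low and pulls the macro precision down, while accuracy is dominated by the large negative class.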

### 2. Random Forest

Random Forest is known to reduce the risk of overfitting, and with the help of its ensemble method it could be a better classifier than logistic regression.

In [None]:
# Import the model
from sklearn.ensemble import RandomForestClassifier

# Instantiate model with 150 decision trees
rf = RandomForestClassifier(n_estimators = 150, random_state = 1551)

# Train the model on training data
rf.fit(x_train_res, y_train_res)

# Predict for test set
rf_pred = rf.predict(x_test)

In [None]:
# Calculate metrics
metricCalculation(y_test, rf_pred)

In [None]:
# Confusion matrix
print(metrics.confusion_matrix(y_test, rf_pred))

The random forest classifier has a higher F1 score and precision than the logistic regression, suggesting that it is a better model for identifying the nature of an asteroid. Using this model, we can identify the most important features that help in determining the type of asteroid.

In [None]:
feature_imp = pd.DataFrame(rf.feature_importances_,index=x_train_res.columns, columns = ['Importance']).sort_values(by='Importance', ascending=False)

In [None]:
# Top 10 important variables
feature_imp[0:10]

Based on the Random Forest model, the most important feature is the Earth Minimum Orbit Intersection Distance (moid_ld), followed by whether the object in question is a near-Earth object (neo) or not.

In [None]:
# 10 least important features
feature_imp[-10:]

In [None]:
feature_imp[-50:].index

Upon further exploring the least important features, it can be seen that the orbit IDs do not contribute much to the model. The dataset can be modified by eliminating the orbit_id feature completely. Similarly, we can also eliminate the features 'sigma_ma' and 'sigma_tp', which have 0 importance.

To do this, create a new one-hot encoded dataset and drop the orbit_id column.

In [None]:
asteroid_df2 = pd.get_dummies(asteroid_df, columns=['neo', 'class'])
asteroid_df2.drop(['orbit_id','sigma_ma', 'sigma_tp'], axis='columns', inplace=True)
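
The dropped column names above are written out by hand. As a sketch, low-importance features could instead be selected from the importance table with a threshold. The frame `toy_imp` and its values are invented, and `threshold` is a hypothetical cut-off:

```python
import pandas as pd

# Toy importance table shaped like `feature_imp`; the values are invented.
toy_imp = pd.DataFrame(
    {"Importance": [0.55, 0.30, 0.15, 0.0, 0.0]},
    index=["moid", "neo_Y", "H", "sigma_ma", "sigma_tp"],
)

threshold = 0.0  # hypothetical cut-off: keep only features with importance above it
to_drop = toy_imp.index[toy_imp["Importance"] <= threshold].tolist()
print(to_drop)
```

Note that the names in the importance table refer to the one-hot encoded columns, so a list built this way applies to the encoded frame rather than to the original categorical columns.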

In [None]:
# Create train test splits 

X1 = asteroid_df2.drop(['spkid', 'full_name', 'pha'], axis=1)
y1 = asteroid_df2['pha']

x_train1, x_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.30, random_state=1501)

In [None]:
# Create equal balance of classes using SMOTE

sm = SMOTE(random_state = 12) 
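# fit_resample is the current method name; the older fit_sample alias is gone from recent imbalanced-learn releases
x_train_res1, y_train_res1 = sm.fit_resample(x_train1, y_train1.to_numpy())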
  
print("After OverSampling, counts of label 'N': {}".format(sum(y_train_res1 == 'N'))) 
print("After OverSampling, counts of label 'Y': {}".format(sum(y_train_res1 == 'Y'))) 

In [None]:
# Instantiate model with 150 decision trees
rf = RandomForestClassifier(n_estimators = 150, random_state = 1551)

# Train the model on training data
rf.fit(x_train_res1, y_train_res1)

# Predict for test set
rf_pred1 = rf.predict(x_test1)

In [None]:
# Calculate metrics
metricCalculation(y_test1, rf_pred1)

In [None]:
# Confusion matrix for the model trained on the trimmed features
print(metrics.confusion_matrix(y_test1, rf_pred1))

As can be seen, the model performance has improved with fewer false positives and false negatives. Thus, Random Forest can be used as a reliable model to predict the nature of the asteroid.

### 3. Light Gradient Boosting

A Gradient Boosting Model (GBM) is often preferred for prediction since it combines the principles of gradient descent with decision trees. We might expect GBM to outperform Random Forest. Due to the large size of the data, we choose the Light Gradient Boosting model (LightGBM), which is known for its fast training when compared to XGBoost.

This algorithm will be used on 2 datasets: one with all the features and one without the least important features as identified by the Random Forest model.

### 3.1 LGBM with entire data set

For LGBM, convert the labels into numeric values by substituting Y with 1 and N with 0.

In [None]:
# Copy the resampled training labels so the originals are left unmodified
y_train_res_2 = np.array(y_train_res, dtype=object)

In [None]:
# Encode labels: 'Y' -> 1, 'N' -> 0

y_train_res_2 = (y_train_res_2 == 'Y').astype(int)

In [None]:
# Use label encoding to encode test labels 
y_test_2 = y_test.cat.codes
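
`cat.codes` assigns integers following the order of the categories, which pandas sorts for string labels, so 'N' becomes 0 and 'Y' becomes 1. That matches the encoding applied to the training labels. A small check on invented labels:

```python
import pandas as pd

# Invented labels with the same two values as 'pha'.
labels = pd.Series(["N", "Y", "N"], dtype="category")

# Code -> category mapping, then the codes themselves.
mapping = dict(enumerate(labels.cat.categories))
codes = labels.cat.codes.tolist()
print(mapping)
print(codes)
```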

In [None]:
# Load the training dataset along with the label to LGBM
import lightgbm as lgb 

train_data=lgb.Dataset(x_train_res,label=y_train_res_2)

Set parameters before fitting the model. After experimenting with a few learning rates, it was found that a rate of 0.01 yielded the highest precision, although all other metrics remained almost the same.

In [None]:
#setting parameters for lightgbm

param = {'num_leaves': 150, # number of leaves per tree
         'nrounds': 350,
         'max_depth': 25, # depth of tree
         'learning_rate': 0.01, # learning rate
         'max_bin': 500 # max number of bins to bucket the feature values.
        }
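
The dictionary above does not set `objective`, so LightGBM falls back to its default, `regression`, and the model's outputs are regression scores rather than class probabilities. A variant that requests binary classification might look like the following. It is a sketch only: it was not used to produce the results reported in this notebook and would likely change them.

```python
# Hypothetical variant of `param`; not the configuration used in this notebook.
param_binary = {
    'objective': 'binary',    # log-loss objective; predictions become P(label == 1)
    'num_leaves': 150,        # number of leaves per tree
    'num_iterations': 350,    # canonical name for the 'nrounds' alias
    'max_depth': 25,          # depth of tree
    'learning_rate': 0.01,    # learning rate
    'max_bin': 500,           # max number of bins to bucket the feature values
}
```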

In [None]:
# Train the model 

lgbm = lgb.train(param, train_data)
lgbm_pred = lgbm.predict(x_test)

# Threshold the raw scores at 0.5 to get 0/1 labels
# (no 'objective' is set in `param`, so LightGBM uses its default regression
# objective and these are regression outputs rather than probabilities)
lgbm_pred = (lgbm_pred >= 0.5).astype(int)

In [None]:
# Calculate metrics
metricCalculation(y_test_2, lgbm_pred)

In [None]:
# Confusion Matrix
print(metrics.confusion_matrix(y_test_2, lgbm_pred))

The Light Gradient Boosting Model with the entire dataset has higher metric values than Logistic Regression but slightly lower values than Random Forest. The confusion matrix shows that the LGBM has more mislabeled values than the Random Forest model. Now we try training the model on the data set trimmed according to the Random Forest feature importances.

### 3.2 LGBM with trimmed data

Convert the labels into numeric values by substituting Y with 1 and N with 0.

In [None]:
# Copy the resampled training labels so the originals are left unmodified
y_train_res_3 = np.array(y_train_res1, dtype=object)

# Encode labels: 'Y' -> 1, 'N' -> 0
y_train_res_3 = (y_train_res_3 == 'Y').astype(int)
        
# Use label encoding to encode test labels 
y_test_3 = y_test1.cat.codes

In [None]:
# Load the training dataset along with the label to LGBM
import lightgbm as lgb 

train_data_1=lgb.Dataset(x_train_res1,label=y_train_res_3)

We use the same parameters for the model.

In [None]:
# Train the model 

lgbm_1 = lgb.train(param, train_data_1)
lgbm_pred_1 = lgbm_1.predict(x_test1)

# Threshold the raw scores at 0.5 to get 0/1 labels
# (no 'objective' is set in `param`, so LightGBM uses its default regression
# objective and these are regression outputs rather than probabilities)
lgbm_pred_1 = (lgbm_pred_1 >= 0.5).astype(int)

In [None]:
# Calculate metrics
metricCalculation(y_test_3, lgbm_pred_1)

In [None]:
# Confusion Matrix
print(metrics.confusion_matrix(y_test_3, lgbm_pred_1))

The model with the trimmed data is a definite improvement, although it isn't as good as the prediction by Random Forest, since it still has more mislabeled asteroids. LGBM ranks second in model performance.

# 5. Conclusion

The table below summarises the performance of the models. Precision, recall and F1 score are macro-averaged over the two classes.


| Model | Accuracy | Precision | Recall | F1 Score |
|---:|:-------------|:-----------|:------|:------|
| Random Forest (trimmed features) | 99.99% | 0.98 | 1.00 | 0.99 |
| Random Forest | 99.98% | 0.97 | 0.99 | 0.98 |
| Light Gradient Boosting (trimmed features) | 99.98% | 0.96 | 0.99 | 0.98 |
| Light Gradient Boosting | 99.98% | 0.96 | 0.99 | 0.98 |
| Logistic Regression | 99.11% | 0.60 | 0.98 | 0.67 |

The Random Forest model trained on the trimmed feature set outperforms the other models on the performance metrics. Light Gradient Boosting performed well too, but not as well as Random Forest, even after trying different parameters. Logistic Regression was used as a baseline to compare the other models against, and although its accuracy was good, its precision and F1 score were weak.

Thus a tuned Random Forest model would be best to predict the hazardous nature of the asteroids. 