# INTRODUCTION

### Data
The data describes near-Earth asteroids and comes from NeoWs.
NeoWs (Near Earth Object Web Service) is a RESTful web service for near-Earth asteroid information. With NeoWs a user can search for asteroids based on their closest approach date to Earth, look up a specific asteroid by its NASA JPL small body id, and browse the overall data-set.

Acknowledgements
Data-set: All the data comes from NASA JPL (http://neo.jpl.nasa.gov/). The API is maintained by the SpaceRocks Team: David Greenfield, Arezu Sarvestani, Jason English and Peter Baunach.
 
### Tasks
Based on the information in the dataset, we want to perform two tasks:
- Develop a **model that predicts whether an asteroid is hazardous or not**
- Identify which **features are most relevant to the classification** in point 1

To tackle these two tasks I'll use the methods listed below and report the results from the method with the best performance on unseen data:
- Logistic Regression
- Decision Tree
- Random Forest
- SVM
- XGBoost
 

# Libraries & Data Import

In [None]:
# Programming
import pandas as pd
import numpy as np

# Machine Learning | sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, plot_importance
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Other
import missingno as msno
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('/kaggle/input/nasa-asteroids-classification/nasa.csv',
                 parse_dates=['Close Approach Date', 'Orbit Determination Date', 'Epoch Date Close Approach'])

In [None]:
# Fixing seed for reproducibility
seed = 1234

# PRE-PROCESSING

# Step 1: Data Inspection

In [None]:
print(df.shape)
df.head()

In [None]:
df.info()

In [None]:
df.describe()

# Step 2: Check Missing Values

In [None]:
# Check for missing values
print(df.isnull().sum())

# Visually inspect missing values
msno.matrix(df)

# Step 3: Imputing

This step is not necessary for this specific dataset, as there are no missing values.
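
If a later version of the dataset did contain gaps, an imputation step could be placed ahead of scaling. The following is a minimal sketch on a made-up array (`X_demo` and the step names are illustrative and not part of this notebook's data), using median imputation as one possible strategy:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two missing entries (illustrative values only)
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 12.0],
                   [3.0, np.nan],
                   [5.0, 14.0]])

# Impute each column with its median, then standardise
impute_scale = Pipeline([('Imputing', SimpleImputer(strategy='median')),
                         ('Scaling', StandardScaler())])
X_clean = impute_scale.fit_transform(X_demo)

print(np.isnan(X_clean).any())
```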

# Step 4: Dimensionality Reduction

In [None]:
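# Checking visually for feature correlation (numeric and boolean columns only,
# since date and text columns cannot be correlated)
sns.set(rc={'figure.figsize':(30,20)})
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), vmin=-1, vmax=1, cmap="Spectral", annot=True)
plt.show()
plt.close()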

In [None]:
# Dropping completely correlated features and datetime features
df = df.drop(['Est Dia in M(min)', 'Est Dia in M(max)', 'Est Dia in Miles(min)', 'Est Dia in Miles(max)', 'Est Dia in Feet(min)', 'Est Dia in Feet(max)', 'Est Dia in KM(max)',
              'Relative Velocity km per hr', 'Miles per hour',
              'Miss Dist.(Astronomical)', 'Miss Dist.(lunar)', 'Miss Dist.(miles)',
              'Semi Major Axis',
              'Neo Reference ID', 'Name',
              'Close Approach Date', 'Epoch Date Close Approach', 'Orbit Determination Date'],axis=1)

# Plotting feature correlation with reduced dataset (numeric and boolean columns only)
sns.set(rc={'figure.figsize':(30,20)})
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), vmin=-1, vmax=1, cmap="Spectral", annot=True)
plt.show()
plt.close()
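
Reading perfectly correlated pairs off a large heatmap is error-prone, so the same information can also be listed programmatically. This is a minimal sketch on a toy frame; the helper name `correlated_pairs`, the threshold and the `demo` columns are assumptions for illustration, not part of the original analysis:

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.999):
    """List column pairs whose absolute Pearson correlation reaches the threshold."""
    corr = frame.select_dtypes(include=['number', 'bool']).corr().abs()
    # Keep only the upper triangle so each pair is reported once
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy data: 'miles' is 'km' in another unit, 'other' is unrelated
demo = pd.DataFrame({'km': [0.1, 0.5, 1.2, 3.0]})
demo['miles'] = demo['km'] * 0.621371
demo['other'] = [4.0, 1.0, 3.0, 2.0]

print(correlated_pairs(demo))
```

On the real data, calling `correlated_pairs(df)` before the drop should surface the unit-converted diameter, velocity and miss-distance columns.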

# Step 5: Categorical Feature Encoding

In [None]:
# Encoding the target variable
l_enc = LabelEncoder()
df['hazardous'] = l_enc.fit_transform(df.Hazardous) 
print('Hazardous == True -> 1')
print('Hazardous == False -> 0\n')

# Checking if the other categorical features need to be encoded
print(df['Orbiting Body'].unique())
print(df['Equinox'].unique())
print('\n')
# Removing them since there is only a single value that is identical across all observations
df = df.drop(['Orbiting Body', 'Equinox', 'Hazardous'], axis=1)

# Check after all the changes
print(df.info())
df.head()
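
The single-value check above can be generalised so that constant columns are found without inspecting each one by hand. A small sketch, assuming a hypothetical helper `constant_columns` and placeholder values in the toy frame:

```python
import pandas as pd

def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns
            if frame[col].nunique(dropna=False) <= 1]

# Toy frame with placeholder values (not taken from the dataset)
demo = pd.DataFrame({'Orbiting Body': ['x', 'x', 'x'],
                     'Equinox': ['y', 'y', 'y'],
                     'Absolute Magnitude': [21.6, 21.3, 20.3]})

print(constant_columns(demo))
```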

# Step 6: Train/Test Split

In [None]:
# Creating the Features/Label split as numpy arrays
features = df.drop('hazardous', axis=1).values
label = df.hazardous.values

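# Creating the train/test split
# Note: test_size=0.8 holds out 80% of the rows for testing, so the models train on the remaining 20%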
training_features, test_features, training_label, test_label = train_test_split(features, label,
                                                                                test_size=0.8,
                                                                                stratify=label,
                                                                                random_state=seed)

# Getting feature labels for future plotting
feature_names = df.drop('hazardous', axis=1).columns.tolist()
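
To see what `stratify` does in the split above, the share of the positive class can be compared across the two parts. The sketch below uses synthetic labels (the 16% positive rate and all `*_demo` names are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary labels (illustrative only)
rng = np.random.default_rng(1234)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.16).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.8,
                                          stratify=y_demo,
                                          random_state=1234)

def positive_share(labels):
    """Fraction of observations labelled 1."""
    return float(np.mean(labels))

# With stratification the two shares differ only by rounding
print(positive_share(y_tr), positive_share(y_te))
```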

# MODELS

# Logistic Regression

In [None]:
# Creating the pipeline
logreg_pipe = Pipeline([('Scaling', StandardScaler()),
                        ('LogReg', LogisticRegression())])

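# Creating hyperparameter options
# C must be strictly positive, so the range runs from 0.1 to 9.9 instead of starting at 0.
# Solver/penalty combinations that scikit-learn does not support fail to fit,
# are scored as NaN by GridSearchCV and are therefore never selected.
logreg_params = {'LogReg__C': np.arange(1, 100) / 10,
                 'LogReg__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                 'LogReg__penalty': ['l1', 'l2', 'elasticnet', 'none'],
                 'LogReg__random_state': [seed]}

# GridSearchCV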
logreg_grid = GridSearchCV(estimator=logreg_pipe, param_grid=logreg_params,
                           scoring='accuracy', cv=5)
logreg_grid.fit(training_features, training_label)
logreg_opt_param = logreg_grid.best_params_
logreg_best_score = (logreg_grid.best_score_*100).round(2)
logreg_best_est = logreg_grid.best_estimator_

# Score on holdout data (built-in round() works for both NumPy and Python floats)
logreg_holdout_score = round(logreg_grid.score(test_features, test_label)*100, 2)

print('Optimal Hyperparameters:')
print(logreg_opt_param)
print('Optimal Estimator:')
print(logreg_best_est)
print('\n')
# best_score_ is the mean cross-validated accuracy on the training split
print('Cross-Validated Training Accuracy {}'.format(logreg_best_score))
print('Testing Accuracy {}'.format(logreg_holdout_score))
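
Not every solver supports every penalty, so a single dictionary grid spends many fits on combinations that cannot run. One alternative is to pass `GridSearchCV` a list of dictionaries, each holding only compatible options. This is a sketch on synthetic data with a deliberately small `C` grid; the `demo_*` names and the grouping of solvers are my own choices, not the configuration used for the reported results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1234)

demo_pipe = Pipeline([('Scaling', StandardScaler()),
                      ('LogReg', LogisticRegression(max_iter=1000, random_state=1234))])

c_values = [0.1, 1.0, 10.0]
demo_params = [
    # Solvers that only support the l2 penalty
    {'LogReg__solver': ['newton-cg', 'lbfgs', 'sag'],
     'LogReg__penalty': ['l2'],
     'LogReg__C': c_values},
    # Solvers that support both l1 and l2
    {'LogReg__solver': ['liblinear', 'saga'],
     'LogReg__penalty': ['l1', 'l2'],
     'LogReg__C': c_values},
]

demo_grid = GridSearchCV(estimator=demo_pipe, param_grid=demo_params,
                         scoring='accuracy', cv=5)
demo_grid.fit(X_demo, y_demo)

# Count candidates whose fits failed (scored as NaN)
failed = int(np.isnan(demo_grid.cv_results_['mean_test_score']).sum())
print(demo_grid.best_params_, failed)
```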

# Decision Tree

In [None]:
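# Creating hyperparameter options
# max_depth must be at least 1 and a fractional min_samples_leaf must be greater than 0,
# so both ranges start above zero (0.05 to 0.95 for min_samples_leaf)
dectree_params = {'max_depth': np.arange(1, 20, 1),
                  'criterion': ['gini', 'entropy'],
                  'min_samples_leaf': np.arange(1, 20) / 20,
                  'random_state': [seed]}

# GridSearchCV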
dectree_grid = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=dectree_params,
                            scoring='accuracy', cv=5)
dectree_grid.fit(training_features, training_label)
dectree_opt_param = dectree_grid.best_params_
dectree_best_score = (dectree_grid.best_score_*100).round(2)
dectree_best_est = dectree_grid.best_estimator_
dectree_feat_imp = dectree_best_est.feature_importances_

# Score on holdout data (built-in round() works for both NumPy and Python floats)
dectree_holdout_score = round(dectree_grid.score(test_features, test_label)*100, 2)

print('Optimal Hyperparameters:')
print(dectree_opt_param)
print('Optimal Estimator:')
print(dectree_best_est)
print('\n')
# best_score_ is the mean cross-validated accuracy on the training split
print('Cross-Validated Training Accuracy {}'.format(dectree_best_score))
print('Testing Accuracy {}'.format(dectree_holdout_score))

In [None]:
sns.set(rc={'figure.figsize':(20,10)})

# Plotting the optimal tree
plt.subplot(1, 2, 1)
plot_tree(dectree_best_est,
          feature_names=feature_names,  
          class_names=['Non-Hazardous [0]', 'Hazardous [1]'],
          filled=True)

# Plotting feature importance
plt.subplot(1, 2, 2)
plt.barh(feature_names, dectree_feat_imp)

plt.show()
plt.close()

# SVM


In [None]:
# Creating the pipeline
svm_pipe = Pipeline([('Scaling', StandardScaler()),
                     ('SVM', SVC())])

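# Creating hyperparameter options
# C must be strictly positive, so the range runs from 0.1 to 19.9 instead of starting at 0.
# The 'precomputed' kernel is left out: it expects a square kernel matrix rather than
# raw features, so every fit with it would fail.
svm_params = {'SVM__C': np.arange(1, 200) / 10,
              'SVM__gamma': [0.001, 0.01, 0.1, 1, 2, 5],
              'SVM__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'SVM__random_state': [seed]}

# GridSearchCV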
svm_grid = GridSearchCV(estimator=svm_pipe, param_grid=svm_params,
                        scoring='accuracy', cv=5)
svm_grid.fit(training_features, training_label)
svm_opt_param = svm_grid.best_params_
svm_best_score = (svm_grid.best_score_*100).round(2)
svm_best_est = svm_grid.best_estimator_

# Score on holdout data (built-in round() works for both NumPy and Python floats)
svm_holdout_score = round(svm_grid.score(test_features, test_label)*100, 2)

print('Optimal Hyperparameters:')
print(svm_opt_param)
print('Optimal Estimator:')
print(svm_best_est)
print('\n')
# best_score_ is the mean cross-validated accuracy on the training split
print('Cross-Validated Training Accuracy {}'.format(svm_best_score))
print('Testing Accuracy {}'.format(svm_holdout_score))

# Random Forest

In [None]:
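# Creating hyperparameter options
# max_depth and n_estimators must be at least 1 and a fractional min_samples_leaf
# must be greater than 0, so all three ranges start above zero
rf_params = {'max_depth': np.arange(1, 20, 1),
             'criterion': ['gini', 'entropy'],
             'min_samples_leaf': np.arange(1, 20) / 20,
             'random_state': [seed],
             'n_estimators': np.arange(1, 10, 1)}

# GridSearchCV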
rf_grid = GridSearchCV(estimator=RandomForestClassifier(), param_grid=rf_params,
                       scoring='accuracy', cv=5)
rf_grid.fit(training_features, training_label)
rf_opt_param = rf_grid.best_params_
rf_best_score = (rf_grid.best_score_*100).round(2)
rf_best_est = rf_grid.best_estimator_
rf_feat_imp = rf_best_est.feature_importances_

# Score on holdout data (built-in round() works for both NumPy and Python floats)
rf_holdout_score = round(rf_grid.score(test_features, test_label)*100, 2)

print('Optimal Hyperparameters:')
print(rf_opt_param)
print('Optimal Estimator:')
print(rf_best_est)
print('\n')
# best_score_ is the mean cross-validated accuracy on the training split
print('Cross-Validated Training Accuracy {}'.format(rf_best_score))
print('Testing Accuracy {}'.format(rf_holdout_score))

In [None]:
# Plotting feature importance
plt.barh(feature_names, rf_feat_imp)
plt.show()
plt.close()

# XGBoost

In [None]:
# Creating hyperparameter options
xgb_params = {'max_depth': np.arange(0, 5, 1),
              'objective': ['binary:logistic'],
              'random_state': [seed],
              'alpha': [0, 0.01, 0.1, 1],
              'lambda': [0, 0.01, 0.1, 1],
              'subsample': [0.25, 0.5, 0.75],
              'colsample_bytree': [0.25, 0.5, 0.75],
              'eval_metric': ['logloss']}

# GridSearchCV
xgb_grid = GridSearchCV(estimator=XGBClassifier(), param_grid=xgb_params,
                        scoring='accuracy', cv=5)
xgb_grid.fit(training_features, training_label)
xgb_opt_param = xgb_grid.best_params_
xgb_best_score = (xgb_grid.best_score_*100).round(2)
xgb_best_est = xgb_grid.best_estimator_
xgb_feat_imp = xgb_best_est.feature_importances_

# Score on holdout data (built-in round() works for both NumPy and Python floats)
xgb_holdout_score = round(xgb_grid.score(test_features, test_label)*100, 2)

print('Optimal Hyperparameters:')
print(xgb_opt_param)
print('Optimal Estimator:')
print(xgb_best_est)
print('\n')
# best_score_ is the mean cross-validated accuracy on the training split
print('Cross-Validated Training Accuracy {}'.format(xgb_best_score))
print('Testing Accuracy {}'.format(xgb_holdout_score))

In [None]:
# Plotting feature importance
plt.barh(feature_names, xgb_feat_imp)

In [None]:
# Alternative plotting with XGBoost library built-in feature importance plot function
plot_importance(xgb_best_est)

# CONCLUSIONS

### Model Performance on Unseen Data
| Model | Accuracy (%) |
| --- | --- |
| Logistic Regression | 95.47 |
| Decision Tree | 99.44 |
| SVM | 94.99 |
| Random Forest | 99.47 |
| XGBoost | **99.49** |

As shown above, **XGBoost** achieved the best performance on unseen data and is therefore the best model for this classification problem among those tested. Nonetheless, the difference among the three tree-based models (Decision Tree / Random Forest / XGBoost) is almost negligible: 0.05 percentage points separate the best from the worst of them.
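
The comparison can also be assembled in code rather than by hand. The sketch below hard-codes the accuracies reported in the table; in the notebook the values would come from the `*_holdout_score` variables. The `summary` frame and the gap column are illustrative additions:

```python
import pandas as pd

# Holdout accuracies as reported in the table above
holdout_scores = {'Logistic Regression': 95.47,
                  'Decision Tree': 99.44,
                  'SVM': 94.99,
                  'Random Forest': 99.47,
                  'XGBoost': 99.49}

summary = (pd.Series(holdout_scores, name='Accuracy (%)')
             .sort_values(ascending=False)
             .to_frame())

# Distance of each model from the best one, in percentage points
summary['Gap to best (pp)'] = (summary['Accuracy (%)'].max()
                               - summary['Accuracy (%)']).round(2)

best_model = summary.index[0]
print(summary)
```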

In a scenario where the model were to be deployed to production, it would be worth examining the trade-off between accuracy and resource consumption. Given the small performance difference, an argument can be made that the best of the three tree-based models is the one that consumes the fewest resources.
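
A first, rough way to compare resource consumption is to time fitting and prediction. The sketch below does this on synthetic data; `GradientBoostingClassifier` stands in for XGBoost to keep the example within scikit-learn, the hyperparameters are arbitrary, and `time_model` is a hypothetical helper. Wall-clock time is only one facet of resource use (memory is not measured here):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=2000, n_features=15, random_state=1234)

candidates = {
    'Decision Tree': DecisionTreeClassifier(random_state=1234),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=1234),
    # Stand-in for XGBoost in this sketch
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=50, random_state=1234),
}

def time_model(model, X, y):
    """Return (fit_seconds, predict_seconds) for a single run."""
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

timings = {name: time_model(model, X_demo, y_demo)
           for name, model in candidates.items()}

for name, (fit_s, pred_s) in timings.items():
    print(f'{name}: fit {fit_s:.4f}s, predict {pred_s:.4f}s')
```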

### Feature Importance
Looking at the best estimator found for each tree-based model, a pattern emerges in the features they rely on.

Found in all three best models (100%):
- Minimum Orbit Intersection
- Absolute Magnitude

Found in one of the three best models (33%):
- Orbit ID
- Perihelion Distance
- Est Dia in KM(min). This is the feature I kept to capture size. Because the size features are perfectly correlated (same measurement, different units), any of them would yield the same result, as long as only one is included in the training dataset.
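
The overlap between models can be computed instead of read off the bar charts. The sketch below uses hypothetical importance values for two models; in the notebook, `feature_names` together with `dectree_feat_imp`, `rf_feat_imp` and `xgb_feat_imp` would take their place. The helper `top_features` and the cut-off `k` are assumptions:

```python
from collections import Counter
import numpy as np

def top_features(names, importances, k=2):
    """Names of the k features with the highest importance."""
    order = np.argsort(importances)[::-1][:k]
    return [names[i] for i in order]

demo_names = ['Minimum Orbit Intersection', 'Absolute Magnitude',
              'Orbit ID', 'Perihelion Distance']

# Hypothetical importances, NOT the values obtained in this notebook
demo_importances = {
    'Decision Tree': np.array([0.55, 0.40, 0.05, 0.00]),
    'Random Forest': np.array([0.50, 0.35, 0.00, 0.15]),
}

# Count in how many models each feature makes the top k
counts = Counter(name
                 for importances in demo_importances.values()
                 for name in top_features(demo_names, importances))

shared = {name for name, n in counts.items() if n == len(demo_importances)}
print(shared)
```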


### Overfitting
- All models show some degree of overfitting: the accuracy obtained on the training split (the mean cross-validated score reported by `GridSearchCV` as `best_score_`) is higher than the accuracy on the test split. The differences are small, however: less than 1 percentage point for every model except SVM. I therefore do not consider overfitting a significant issue for the selected model.
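
The gap discussed above can be made explicit by placing the resubstitution accuracy, the cross-validated accuracy and the holdout accuracy side by side. This is a sketch on synthetic data with an arbitrary shallow tree; none of the numbers it prints relate to the asteroid dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and the same split proportions as in the notebook
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.8,
                                          stratify=y_demo, random_state=1234)

model = DecisionTreeClassifier(max_depth=3, random_state=1234)

# Mean accuracy over 5 folds of the training split (what best_score_ reports)
cv_accuracy = cross_val_score(model, X_tr, y_tr, scoring='accuracy', cv=5).mean() * 100

# Accuracy on the data the model was fitted on, and on the held-out data
model.fit(X_tr, y_tr)
train_accuracy = model.score(X_tr, y_tr) * 100
holdout_accuracy = model.score(X_te, y_te) * 100

gap = cv_accuracy - holdout_accuracy
print(f'train {train_accuracy:.2f} | cv {cv_accuracy:.2f} | '
      f'holdout {holdout_accuracy:.2f} | cv - holdout {gap:.2f} pp')
```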
