## Content
The data is about Asteroids - NeoWs.
NeoWs (Near Earth Object Web Service) is a RESTful web service for near earth Asteroid information. With NeoWs a user can: search for Asteroids based on their closest approach date to Earth, lookup a specific Asteroid with its NASA JPL small body id, as well as browse the overall data-set.

## Acknowledgements
Data-set: All the data is from the (http://neo.jpl.nasa.gov/). This API is maintained by SpaceRocks Team: David Greenfield, Arezu Sarvestani, Jason English and Peter Baunach.

## Tasks
Given this dataset, we think about three tasks:
1. Develop a model that predicts if an asteroid is going to be hazardous (or not!)
2. Identify the features responsible for claiming an asteroid to be hazardous
3. Identify clusters of asteroids and reveal its characteristics

## Approach
To deal with it, we will approach with different methods, and in the end, compare which performed the best on the data :) . Some of them are listed below:
- Decision tree
- Random forest
- SVM
- XGBoosting
- K-Means
- PCA
- and others...

## Libraries imported

In [1]:
import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, plot_importance
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# seed fixing
SEED = 42

In [1]:
df = pd.read_csv('../input/nasa-asteroids-classification/nasa.csv', parse_dates=['Close Approach Date', 'Orbit Determination Date', 'Epoch Date Close Approach'])
df

In [1]:
df.dtypes

## Exploratory Data Analysis

In [1]:
df.info()

In [1]:
df.describe()

## Checking null values

In [1]:
import missingno as msno

msno.matrix(df)

In [1]:
## Dimensionality reduction

In [1]:
sns.set(rc={'figure.figsize':(30,20)})

mask = np.triu(df.corr())

sns.heatmap(df.corr(), vmin=-1, vmax=1, cmap="viridis", mask = mask)
plt.show()
plt.close()

In [1]:
df = df.drop(['Est Dia in M(min)', 'Est Dia in M(max)', 'Est Dia in Miles(min)', 'Est Dia in Miles(max)', 'Est Dia in Feet(min)', 'Est Dia in Feet(max)', 'Est Dia in KM(max)',
              'Relative Velocity km per hr', 'Miles per hour',
              'Miss Dist.(Astronomical)', 'Miss Dist.(lunar)', 'Miss Dist.(miles)',
              'Semi Major Axis',
              'Neo Reference ID', 'Name',
              'Close Approach Date', 'Epoch Date Close Approach', 'Orbit Determination Date'],axis=1)


mask = np.triu(df.corr())

sns.heatmap(df.corr(), vmin=-1, vmax=1, cmap="viridis", mask = mask, annot=True)
plt.show()
plt.close()

## Categorical Feature Encoding

In [1]:
encoder = LabelEncoder()

df['hazardous'] = encoder.fit_transform(df.Hazardous)

# Dropping these categorical features since they are repeated among all observations
df = df.drop(['Orbiting Body', 'Equinox', 'Hazardous'], axis = 1)

## Train-test split

In [1]:
features = df.drop('hazardous', axis = 1).values
target = df['hazardous'].values

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.7, stratify = target, random_state = SEED)

In [1]:
# Creating the Features/Label split as numpy arrays
features = df.drop('hazardous', axis=1).values
label = df.hazardous.values

# Creating the test/train split
training_features, test_features, training_label, test_label = train_test_split(features, label,
                                                                                test_size=0.8,
                                                                                stratify=label,
                                                                                random_state=SEED)

df_graph = df.copy()
feature_names = df_graph.drop('hazardous', axis=1).columns.tolist()

In [1]:
feature_names

## Models

## Decision Tree

In [1]:
hyperparameters_decision_tree = {'max_depth'        : np.arange(0, 25, 1),
                                 'criterion'        : ['gini', 'entropy'],
                                 'min_samples_leaf' : np.arange(0, 1, 0.05),
                                  'random_state'    : [SEED]}

# Applying GridSearchCV
decision_tree_grid = GridSearchCV(estimator = DecisionTreeClassifier(),
                                  param_grid = hyperparameters_decision_tree,
                                  scoring = 'accuracy',
                                  cv = 10)

decision_tree_grid.fit(X_train, y_train)
decision_tree_opt = decision_tree_grid.best_params_

decision_tree_score = (decision_tree_grid.best_score_*100).round(2)
decision_tree_est = decision_tree_grid.best_estimator_
decision_tree_features = decision_tree_est.feature_importances_

# Score on holdout data
decision_tree_holdout_score = (decision_tree_grid.score(test_features, test_label)*100).round(2)

print('Optimal Hyperparameters:')
print(decision_tree_opt)
print('Optimal Estimator:')
print(decision_tree_est)

print('Training Accuracy {}'.format(decision_tree_score))
print('Testing Accuracy {}'.format(decision_tree_holdout_score))

In [1]:
# Plotting the optimal tree
plt.subplot(1, 2, 1)
plot_tree(decision_tree_est,
          feature_names = feature_names,  
          class_names = ['Non-Hazardous [0]', 'Hazardous [1]'],
          filled = True)

# Plotting feature importnace
plt.subplot(1, 2, 2)
plt.barh(feature_names, decision_tree_features)

plt.show()
plt.close

## Random Forest

In [1]:
rf_params = {'max_depth': np.arange(0, 30, 1),
             'criterion': ['gini', 'entropy'],
             'min_samples_leaf': np.arange(0, 1, 0.05),
             'random_state': [SEED],
             'n_estimators': np.arange(0, 20, 1)}


rf_grid = GridSearchCV(estimator=RandomForestClassifier(), param_grid=rf_params,
                       scoring='accuracy', cv=5)
rf_grid.fit(X_train, y_train)
rf_opt_param = rf_grid.best_params_
rf_best_score = (rf_grid.best_score_*100).round(2)
rf_best_est = rf_grid.best_estimator_
rf_feat_imp = rf_best_est.feature_importances_

rf_holdout_score = (rf_grid.score(X_test, y_test)*100).round(2)

print('Optimal Hyperparameters:')
print(rf_opt_param)
print('Optimal Estimator:')
print(rf_best_est)
print('Training Accuracy {}'.format(rf_best_score))
print('Testing Accuracy {}'.format(rf_holdout_score))

In [1]:
# Plotting feature importance
plt.barh(feature_names, rf_feat_imp)
plt.show()
plt.close

## Support Vector Machines

In [1]:
svm_pipe = Pipeline([('Scaling', StandardScaler()),
                     ('SVM', SVC())])

svm_params = {'SVM__C': np.arange(0, 20, 0.1),
              'SVM__gamma': [0.01, 0.1, 1, 2, 5],        
              'SVM__kernel': ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed'],
              'SVM__random_state': [SEED]}

svm_grid = GridSearchCV(estimator=svm_pipe, param_grid=svm_params,
                        scoring='accuracy', cv=5)
svm_grid.fit(X_train, y_train)
svm_opt_param = svm_grid.best_params_
svm_best_score = (svm_grid.best_score_*100).round(2)
svm_best_est = svm_grid.best_estimator_

svm_holdout_score = (svm_grid.score(X_test, y_test)*100).round(2)

print('Optimal Hyperparameters:')
print(svm_opt_param)
print('Optimal Estimator:')
print(svm_best_est)

print('Training Accuracy {}'.format(svm_best_score))
print('Testing Accuracy {}'.format(svm_holdout_score))

## XGBoosting

In [1]:
# Creating hyperparameter options
xgb_params = {'max_depth': np.arange(0, 5, 1),
              'objective': ['binary:logistic'],
              'random_state': [SEED],
              'alpha': [0, 0.01, 0.1, 1],
              'lambda': [0, 0.01, 0.1, 1],
              'subsample': [0.25, 0.5, 0.75],
              'colsample_bytree': [0.25, 0.5, 0.75],
              'eval_metric': ['logloss']}

# GridSearcCV
xgb_grid = GridSearchCV(estimator=XGBClassifier(), param_grid=xgb_params,
                            scoring='accuracy', cv=5)
xgb_grid.fit(X_train, y_train)
xgb_opt_param = xgb_grid.best_params_
xgb_best_score = (xgb_grid.best_score_*100).round(2)
xgb_best_est = xgb_grid.best_estimator_
xgb_feat_imp = xgb_best_est.feature_importances_

# Score on holdout data
xgb_holdout_score = (xgb_grid.score(X_test, y_test)*100).round(2)

print('Optimal Hyperparameters:')
print(xgb_opt_param)
print('Optimal Estimator:')
print(xgb_best_est)
print('\n')
print('Training Accuracy {}'.format(xgb_best_score))
print('Testing Accuracy {}'.format(xgb_holdout_score))

In [1]:
# Plotting feature importance
plt.barh(feature_names, xgb_feat_imp)

## KMeans and PCA - Approach to identifying asteroids groups

In [1]:
scaler = StandardScaler()
df = df.drop(['hazardous', 'Orbit ID'], axis = 1)
df_scaled = scaler.fit_transform(df)
df_scaled

In [1]:
from sklearn.cluster import KMeans

scores_1 = []

range_values = range(1, 20)

for i in range_values:
    kmeans = KMeans(n_clusters= i)
    kmeans.fit(df_scaled)
    scores_1.append(kmeans.inertia_)

plt.figure(figsize=(16,9)) 
plt.plot(scores_1, 'bx-')
plt.title('Optimal cluster number')
plt.xlabel('Clusters')
plt.ylabel('scores') 
plt.show()

In [1]:
kmeans = KMeans(4)
kmeans.fit(df_scaled)
labels = kmeans.labels_

kmeans.cluster_centers_.shape

In [1]:
cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns = [df.columns])

cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = [df.columns])
cluster_centers

Simply, according to our KMeans approach, the asteroids can be subgrouped in 4 clusters.
- **Cluster 0**: higher magnitude, lowest `Est Dia in KM` and highest orbit uncertainty
- **Cluster 1**: mean magnitude, highest relative velocity (km/sec), highest Miss Dist (kilometers) and lowest orbit uncertainty, most inclinated and most `mean motion` value.
- **Cluster 2**: Highest eccentricity and greater orbital period
- **Cluster 3**: Highest magnitude and lowest relative velocity, smaller eccentricity and least inclination among other clusters.

In [1]:
from sklearn.decomposition import PCA

# Obtain the principal components 
pca = PCA(n_components=2)
principal_comp = pca.fit_transform(df_scaled)
principal_comp

In [1]:
# Create a dataframe with the two components
pca_df = pd.DataFrame(data = principal_comp, columns =['pca1','pca2'])
pca_df.head()

# Concatenate the clusters labels to the dataframe
pca_df = pd.concat([pca_df,pd.DataFrame({'cluster':labels})], axis = 1)
pca_df.head()


ax = sns.scatterplot(x = pca_df['pca1'], y=pca_df['pca2'], hue = "cluster", data = pca_df)
ax.set(xlabel = f"PC0 {pca.explained_variance_ratio_[0]*100:.2f}%", ylabel=f"PC1 {pca.explained_variance_ratio_[1]*100:.2f}%")
plt.show()

## Conclusion
### Model Performance on Unseen Data

|Model |Accuracy  |
--- | --- 
|Decision Tree|99,55%|
|Random Forest|98,93%|
|Support Vector Machine|94,39%|
|XGBoosting|99,57%|


Clearly, using decision tree provided the best performance on unseen data, of course, considering its simplicity and fast execution, and thus is the best model for this classifcation problem out of those tested.

### Feature Importance
Looking into each tree based model we can see a pattern on feature importance of the best estimator found for each model:

- Minimum Orbit Intersection
- Absolute Magnitude
- Est Dia in Km