# Asteroid Impacts Classification

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation
7. Experimentation / Improvements

# 1. Problem Definition

How can we use various Python-based machine learning models and the given parameters to predict whether an asteroid is hazardous?

# 2. Data

Data from: https://www.kaggle.com/shrushtijoshi/asteroid-impacts

## Context

An asteroid's orbit is computed by finding the elliptical path about the sun that best fits the available observations of the object. That is, the object's computed path about the sun is adjusted until the predictions of where the asteroid should have appeared in the sky at several observed times match the positions where the object was actually observed to be at those same times. As more and more observations are used to further improve an object's orbit, we become more and more confident in our knowledge of where the object will be in the future.
When the discovery of a new near Earth asteroid is announced by the Minor Planet Center, Sentry automatically prioritizes the object for an impact risk analysis. If the prioritization analysis indicates that the asteroid cannot pass near the Earth or that its orbit is very well determined, the computationally intensive nonlinear search for potential impacts is not pursued. If, on the other hand, a search is deemed necessary then the object is added to a queue of objects awaiting analysis. Its position in the queue is determined by the estimated likelihood that potential impacts may be found.

## Content

Sentry is a highly automated collision monitoring system that continually scans the most current asteroid catalog for possibilities of future impact with Earth over the next 100 years. This dataset includes the Sentry system's list of possible asteroid impacts with Earth and their probability, in addition to a list of all known near Earth asteroids and their characteristics.

## Acknowledgements

The asteroid orbit and impact risk data was collected by NASA's Near Earth Object Program at the Jet Propulsion Laboratory (California Institute of Technology).

# 3. Evaluation

As this is a classification problem, we will use classification metrics to evaluate the model.
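
For reference, the main classification metrics can be computed from the counts of true positives, false positives, and false negatives. The sketch below uses a hypothetical helper, `precision_recall_f1`, with made-up counts; it is not part of the original analysis.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Hypothetical helper for illustration; a zero denominator yields 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: 8 hazardous asteroids found, 2 false alarms, 4 missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are reasonably high.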

# 4. Features

## Input / Features

    Object Name - Asteroid Name
    Epoch (TDB) - Epoch
    Orbit Axis (AU) - Orbit Axis
    Orbit Eccentricity - Eccentricity
    Orbit Inclination (deg) - Inclination
    Perihelion Argument (deg) - Perihelion Argument
    Node Longitude (deg) - Node Longitude
    Mean Anomoly (deg) - Mean Anomaly
    Perihelion Distance (AU) - Perihelion Distance
    Aphelion Distance (AU) - Aphelion Distance
    Orbital Period (yr) - Orbital Period
    Minimum Orbit Intersection Distance (AU) - Minimum Orbit Intersection Distance
    Orbital Reference - Orbital Reference
    Asteroid Magnitude - Asteroid Magnitude
    Classification - Asteroid Classification
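
Several of these features are linked by standard two-body relations, so they carry overlapping information. Assuming `Orbit Axis (AU)` is the semi-major axis a and e is the eccentricity, the perihelion distance is q = a(1 - e), the aphelion distance is Q = a(1 + e), and Kepler's third law gives the period in years as P = a^(3/2) for a body orbiting the Sun. A small sketch, using a hypothetical helper and made-up orbital elements rather than rows from the dataset:

```python
import math


def derived_orbit_quantities(axis_au, eccentricity):
    """Return (perihelion AU, aphelion AU, period yr) for an elliptical orbit.

    Hypothetical helper: assumes a heliocentric two-body orbit, with the
    semi-major axis in AU and 0 <= eccentricity < 1.
    """
    perihelion = axis_au * (1 - eccentricity)
    aphelion = axis_au * (1 + eccentricity)
    period_yr = math.sqrt(axis_au ** 3)  # Kepler's third law (AU, years)
    return perihelion, aphelion, period_yr


# Made-up elements for illustration only.
q, Q, P = derived_orbit_quantities(axis_au=1.5, eccentricity=0.2)
print(q, Q, P)
```

Strong correlations among these columns are therefore expected rather than surprising.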

## Output / Label
    Hazardous - Hazard

## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading the dataset

In [None]:
# Local
# df = pd.read_csv('orbits - orbits.csv')

# Kaggle
df = pd.read_csv('/kaggle/input/asteroid-impacts/orbits - orbits.csv')
df.head()

## Data Exploration

In [None]:
df

In [None]:
df.info()

In [None]:
df.isnull().sum()

Since only one row has missing data, we will drop it.

In [None]:
df = df.dropna()

In [None]:
df.isnull().sum()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Hazardous Count')
sns.countplot(data=df, x ='Hazardous');

As the labels are imbalanced, we will use F1 scores to evaluate the model.
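
To see why accuracy alone can mislead on imbalanced labels, consider a classifier that always predicts the majority class. The sketch below uses made-up labels (95 non-hazardous, 5 hazardous), not the dataset's actual proportions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
# zero_division=0 makes F1 evaluate to 0 without a warning
# when no positives are predicted.
f1 = f1_score(y_true, y_majority, zero_division=0)
print(f'accuracy={acc:.2f}, f1={f1:.2f}')
```

The trivial predictor scores high on accuracy while never finding a single positive, which is exactly what F1 exposes.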

We will drop the Object Name column, since it only identifies the asteroid.

In [None]:
df = df.drop('Object Name', axis=1)

In [None]:
df

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(data=pd.get_dummies(df).corr(), annot=True);

In [None]:
sns.pairplot(data=df, hue='Hazardous')

In [None]:
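# numeric_only=True (pandas 1.5+) skips text columns such as Classification,
# which otherwise raise an error in pandas 2.x
df.corr(numeric_only=True)['Hazardous'].sort_values()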

In [None]:
plt.figure(figsize=(20,10))
plt.title('Classification Count colored by Hazardous')
sns.countplot(data=df, x='Classification', hue='Hazardous');

# 5. Modelling

In [None]:
X = df.drop('Hazardous', axis=1)
y = df['Hazardous']
X = pd.get_dummies(X, drop_first=True)
y = pd.get_dummies(y, drop_first=True)
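
`pd.get_dummies` one-hot encodes the categorical columns, and `drop_first=True` drops one level per column, since that level is implied by the others. A tiny sketch on a made-up frame (the values here are only illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    'Orbit Eccentricity': [0.1, 0.5, 0.3],
    'Classification': ['Amor', 'Apollo', 'Aten'],
})

encoded = pd.get_dummies(demo, drop_first=True)
# 'Amor' sorts first among the levels, so its indicator column is dropped.
print(encoded.columns.tolist())
```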

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
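
Because the classes are imbalanced, one option worth considering is passing `stratify` to `train_test_split`, so that both splits keep roughly the same share of hazardous asteroids. A minimal sketch on made-up data (the split above does not stratify):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 10% positives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare the positive rate in the train and test splits.
print(y_tr.mean(), y_te.mean())
```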

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
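
Note that the cross-validation later in this notebook runs on `X_train` after it has been scaled as a whole, so every validation fold has already influenced the scaler. One way to avoid this, sketched here on synthetic data from `make_classification` (an assumption for illustration, not the asteroid dataset), is to put the scaler and the model in a pipeline so the scaler is refit inside every fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# The scaler is fit on the training part of each fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
f1_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='f1')
print(f1_scores.mean())
```

For a standard scaler the effect is usually small, but the pipeline keeps the validation estimate honest.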

## Model Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

## Baseline Model Scores

In [None]:
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        y_preds = model.predict(X_test)
        print(name)
        print(classification_report(y_test, y_preds))
        print('\n')
        model_scores[name] = model.score(X_test,y_test)
        model_recall[name] = recall_score(y_test, y_preds)
        model_f1[name] = f1_score(y_test, y_preds)
        model_precision[name] = precision_score(y_test, y_preds)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
    model_recall = pd.DataFrame(model_recall, index=['Recall']).transpose()
    model_recall = model_recall.sort_values('Recall')
    model_f1 = pd.DataFrame(model_f1, index=['F1']).transpose()
    model_f1 = model_f1.sort_values('F1')
    model_precision = pd.DataFrame(model_precision, index=['Precision']).transpose()
    model_precision = model_precision.sort_values('Precision')
        
    return model_scores, model_recall, model_f1, model_precision

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'XGBRFClassifier': XGBRFClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'LGBMClassifier':LGBMClassifier(),
         'CatBoostClassifier': CatBoostClassifier(verbose=0)}

In [None]:
model_scores, model_recall, model_f1, model_precision = fit_and_score(models, X_train, X_test, y_train, y_test)

## Baseline Evaluation using Cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
def get_baseline_cv_scores(models, X, y, cv=5):
    # Mean cross-validated score per model, keyed by model name
    model_scores = {}
    model_recall = {}
    model_f1 = {}
    model_precision = {}
    
    for name, model in models.items():
        
        print(name)
        cv_accuracy = cross_val_score(model,X,y,cv=cv,
                             scoring='accuracy')
        print(f'Cross Validation accuracy Scores: {cv_accuracy}')
        print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')

        cv_precision = cross_val_score(model,X,y,cv=cv,
                             scoring='precision')
        print(f'Cross Validation precision Scores: {cv_precision}')
        print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')

        cv_recall = cross_val_score(model,X,y,cv=cv,
                             scoring='recall')
        print(f'Cross Validation recall Scores: {cv_recall}')
        print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')

        cv_f1 = cross_val_score(model,X,y,cv=cv,
                             scoring='f1')
        print(f'Cross Validation f1 Scores: {cv_f1}')
        print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}') 
        print('\n')

        # Store each mean under the metric it actually measures
        model_scores[name] = cv_accuracy.mean()
        model_precision[name] = cv_precision.mean()
        model_recall[name] = cv_recall.mean()
        model_f1[name] = cv_f1.mean()
    
    return model_scores, model_recall, model_f1, model_precision

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'XGBRFClassifier': XGBRFClassifier(objective='binary:logistic',eval_metric=['logloss']),
          'LGBMClassifier':LGBMClassifier(),
         'CatBoostClassifier': CatBoostClassifier(verbose=0)}

In [None]:
model_scores, model_recall, model_f1, model_precision = get_baseline_cv_scores(models, X_train, y_train, cv=5)

In [None]:
model_f1 = pd.DataFrame(model_f1, index=['F1'])

In [None]:
model_f1.transpose().sort_values('F1')

We will go with the AdaBoostClassifier to build our model.

# 6. Model Evaluation

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay

In [None]:
model = AdaBoostClassifier()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

## Classification Report

In [None]:
print(classification_report(y_test, y_preds))

## Confusion Matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator (1.0+) replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

## ROC Curve

In [None]:
# plot_roc_curve was likewise removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(model, X_test, y_test)
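
The area under the ROC curve summarizes the plot as a single number. A minimal sketch, on synthetic data from `make_classification` rather than the asteroid dataset, using `roc_auc_score` with the predicted probability of the positive class:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = AdaBoostClassifier(random_state=42).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of the positive class.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f'ROC AUC: {auc:.3f}')
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.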

## Evaluation using Cross-Validation

In [None]:
def get_cv_score(model, X, y, cv=5):
    # Cross-validate one model on four metrics and return their means
    
    cv_accuracy = cross_val_score(model,X,y,cv=cv,
                         scoring='accuracy')
    print(f'Cross Validation accuracy Scores: {cv_accuracy}')
    print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')
    
    cv_precision = cross_val_score(model,X,y,cv=cv,
                         scoring='precision')
    print(f'Cross Validation precision Scores: {cv_precision}')
    print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')
    
    cv_recall = cross_val_score(model,X,y,cv=cv,
                         scoring='recall')
    print(f'Cross Validation recall Scores: {cv_recall}')
    print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')
    
    cv_f1 = cross_val_score(model,X,y,cv=cv,
                         scoring='f1')
    print(f'Cross Validation f1 Scores: {cv_f1}')
    print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')   
    
    # The f1 column must use the F1 scores, not the recall scores
    cv_metrics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
                         'Precision': cv_precision.mean(),
                         'Recall': cv_recall.mean(),
                         'f1': cv_f1.mean()},index=[0])
    
    return cv_metrics

In [None]:
cv_metrics = get_cv_score(model, X_train, y_train, cv=5)

In [None]:
cv_metrics

## Feature Importances

In [None]:
feat_importances = model.feature_importances_

In [None]:
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns.values)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Feature Importances')
plt.xticks(rotation=90)
sns.barplot(data=feat_importances.sort_values(0).T);