# Orbit Classification Prediction 

This notebook is a workflow for predicting orbit classification with various Python-based machine learning models.

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation


# 1. Problem Definition

How can we use various Python-based machine learning models and the given orbital parameters to predict the orbit classification of an object?

# 2. Data

Data from: https://www.kaggle.com/brsdincer/orbitclassification

# 3. Evaluation

As this is a classification problem, we will use classification metrics to evaluate the models.

# 4. Features

## Inputs / Features

1. a (AU) -- Semi-major axis of the orbit in AU
2. e -- Eccentricity of the orbit
3. i (deg) -- Inclination of the orbit with respect to the ecliptic plane and the equinox of J2000 (J2000-Ecliptic) in degrees
4. w (deg) -- Argument of perihelion (J2000-Ecliptic) in degrees
5. Node (deg) -- Longitude of the ascending node (J2000-Ecliptic) in degrees
6. M (deg) -- Mean anomaly at epoch in degrees
7. q (AU) -- Perihelion distance of the orbit in AU
8. Q (AU) -- Aphelion distance of the orbit in AU
9. P (yr) -- Orbital period in Julian years
10. H (mag) -- Absolute V-magnitude
11. MOID (AU) -- Minimum orbit intersection distance (the minimum distance between the osculating orbits of the NEO and the Earth)
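
The orbital elements listed above are not all independent. As a minimal sketch, assuming a heliocentric two-body orbit and an object of negligible mass, `q`, `Q`, and `P` follow from `a` and `e` as `q = a * (1 - e)`, `Q = a * (1 + e)`, and approximately `P = a ** 1.5` (with `a` in AU and `P` in years). The function name `derived_elements` and the sample values are hypothetical and are not taken from the dataset.

```python
def derived_elements(a_au: float, e: float) -> dict:
    """Derive q, Q and P from the semi-major axis (AU) and eccentricity.

    Assumes an elliptical heliocentric orbit (0 <= e < 1) and a body of
    negligible mass, so Kepler's third law reduces to P = a ** 1.5.
    """
    return {
        "q_au": a_au * (1.0 - e),   # perihelion distance
        "Q_au": a_au * (1.0 + e),   # aphelion distance
        "P_yr": a_au ** 1.5,        # orbital period in years (approximate)
    }


# Hypothetical example values, chosen only for illustration
example = derived_elements(a_au=1.5, e=0.4)
print(example)
```

One consequence is that these columns can be expected to correlate strongly with each other.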

## Outputs / labels

12. class -- Object classification

CLASS:

AMO*
APO
APO*
ATE
ATE*
IEO*

## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Local
# df = pd.read_csv("classast - pha.csv")
# Kaggle
df = pd.read_csv('/kaggle/input/orbitclassification/classast - pha.csv')
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

## Data Exploration

In [None]:
df

In [None]:
df.describe()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of the classes')
sns.countplot(data=df,x='class');

We are dealing with an imbalanced dataset, so we will score the models using the F1 score.
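
To illustrate why accuracy alone can mislead on imbalanced classes, here is a small sketch on made-up labels (not the real data): a classifier that always predicts the majority class still scores 90% accuracy, while the macro-averaged F1 score exposes the missed minority class. The choice of macro averaging is an assumption; the notebook does not say which F1 variant is intended.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 9 majority-class samples and 1 minority-class sample
y_true = ["APO"] * 9 + ["ATE"]
# A degenerate classifier that always predicts the majority class
y_pred = ["APO"] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
```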

In [None]:
plt.figure(figsize=(20,20))
plt.title('Heatmap')
sns.heatmap(data=pd.get_dummies(df).corr(), annot=True)

In [None]:
# pairplot creates its own figure, so the title is set afterwards on that figure
sns.pairplot(data=df, hue='class')
plt.suptitle('Pairplot', y=1.02)

In [None]:
plt.figure(figsize=(20,10))
plt.title('e vs q (AU)')
sns.scatterplot(data=df, x='e', y='q (AU)', hue='class', s=150)

# 5. Modelling

In [None]:
X = df.drop('class', axis=1)
y = df['class']
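
Depending on the installed XGBoost version, `XGBClassifier` may reject string class labels such as `APO*` and expect integers from `0` to `n_classes - 1`. If that happens, one possible workaround is scikit-learn's `LabelEncoder`, sketched here on a hypothetical handful of labels rather than the real `y`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of class labels, using names from the class list above
labels = ["APO*", "AMO*", "ATE", "APO", "APO*"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)          # strings -> integer codes
decoded = encoder.inverse_transform(encoded)     # integer codes -> strings

print(list(encoder.classes_))
print(list(encoded))
```

Keeping the fitted encoder around lets predictions be mapped back to the original class names with `inverse_transform`.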

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
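
Because the classes are imbalanced, a plain random split can leave rare classes under-represented in the test set. The following is a minimal sketch, on synthetic data with assumed class proportions, of how the `stratify` argument of `train_test_split` preserves the class ratio in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 80 samples of one class, 20 of another
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array(["APO*"] * 80 + ["ATE*"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

test_share = np.mean(y_te == "ATE*")
print(f"Minority share in test split: {test_share:.2f}")
```

Stratification requires at least two samples per class, so a class with a single member would need to be handled separately.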

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
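
Fitting the scaler on the training split only, as above, keeps test-set statistics out of the model. The same idea can be carried into cross-validation by wrapping the scaler and the model in a pipeline, so the scaler is refit inside each fold. Here is a sketch on synthetic data; the dataset and the choice of `LogisticRegression` are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the orbit data
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=42
)

# The scaler is fitted on the training part of every fold, never on the held-out part
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X_toy, y_toy, cv=5, scoring="f1_macro")

print(scores.mean())
```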

## Model Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier

## Baseline Model

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        model_scores[name] = model.score(X_test,y_test)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
        
    return model_scores

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(),
          'XGBRFClassifier': XGBRFClassifier()}

In [None]:
baseline_model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores.sort_values('Score')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores.sort_values('Score').T)
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

We did not expect the models to reach this level of accuracy.
We will now carry the following models forward to cross-validated tuning, with the F1 score as the target metric (baseline accuracy shown):
* AdaBoostClassifier 	0.986667
* RandomForestClassifier 	0.988571
* XGBRFClassifier 	0.988571
* XGBClassifier 	0.996190
* DecisionTreeClassifier 	0.998095
* GradientBoostingClassifier 	0.998095

## Hyperparameter Tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score

In [None]:
def randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_rs_scores = {}
    model_rs_best_param = {}
    
    for name, model in models.items():
        rs_model = RandomizedSearchCV(model,
                                      param_distributions=params[name],
                                      cv=5,
                                      n_iter=20,
                                      n_jobs=-1,
                                      verbose=2)
        rs_model.fit(X_train,y_train)
        model_rs_scores[name] = rs_model.score(X_test,y_test)
        model_rs_best_param[name] = rs_model.best_params_
        
    return model_rs_scores, model_rs_best_param
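
`RandomizedSearchCV` uses the estimator's default scorer when `scoring` is not set, which for scikit-learn classifiers is accuracy. If the search should optimise F1 instead, one option is `scoring='f1_macro'`. Here is a minimal sketch on synthetic data, with a deliberately small, hypothetical parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the orbit data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions={"criterion": ["gini", "entropy"],
                         "max_depth": [None, 3, 5, 10]},
    n_iter=5,
    cv=5,
    scoring="f1_macro",   # rank candidates by macro-averaged F1 instead of accuracy
    random_state=42,      # makes the sampled candidates reproducible
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

With `scoring` set, `search.score(X, y)` also returns macro F1 rather than accuracy.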

## RS model 1

In [None]:
models = {'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(),
          'XGBRFClassifier': XGBRFClassifier()}

params = {'DecisionTreeClassifier': {'criterion': ['gini', 'entropy'],
                                      'max_depth': [None, 3,5,10,20,50],
                                      'max_leaf_nodes': [None, 3,5,10,20,50],
                                      'ccp_alpha' : [0.0,0.001,0.01,0.1,1]
                                      },
          'RandomForestClassifier': {'n_estimators': [20,50,100,200,400],
                                     'criterion': ['gini', 'entropy'],
                                     'max_depth': [None, 2,10,50,100],
                                     'ccp_alpha': [0.1,0.01,0.001]},
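          # 'algorithm' is left at its default: 'SAMME.R' is no longer accepted by recent scikit-learn releases
          'AdaBoostClassifier': {'n_estimators': [20,50,100,200,400],
                                'learning_rate': [0.001,0.01,0.1,1.0]},
          # 'deviance' was renamed to 'log_loss' (scikit-learn >= 1.1); 'exponential' only supports binary targets
          'GradientBoostingClassifier' : {'loss': ['log_loss'],
                                          'learning_rate': [0.001,0.01,0.1,1.0],
                                          'n_estimators': [20,50,100,200,400],
                                          # 'mse' was renamed to 'squared_error' (scikit-learn >= 1.0)
                                          'criterion': ['friedman_mse', 'squared_error'],
                                          'max_depth' : [2,3,6,10,20],
                                          'ccp_alpha' : [0.0,0.001,0.01,0.1,1]
                                          },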
          'XGBClassifier': {'eta': [0.001,0.01,0.1,1.0],
                           'max_depth': [3,5,10,15],
                           'gamma':[0,2,5,10,100,300]},
          
          'XGBRFClassifier': {'eta': [0.001,0.01,0.1,1.0],
                           'max_depth': [3,5,10,15],
                           'gamma':[0,2,5,10,100,300]},
         }

In [None]:
model_rs_scores_1, model_rs_best_param_1 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_rs_scores_1 = pd.DataFrame(model_rs_scores_1, index=['Score']).transpose()
model_rs_scores_1.sort_values('Score')

In [None]:
model_rs_best_param_1

# 6. Model Evaluation

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score

## XGBClassifier

In [None]:
model = XGBClassifier(eta=1.0,gamma=0,max_depth=3)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

### Classification Report

In [None]:
print(classification_report(y_test,y_preds))

### Confusion Matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

### Calculate evaluation metrics using cross-validation

In [None]:
cv_accuracy = cross_val_score(model,X,y,cv=5,
                         scoring='accuracy')


In [None]:
print(f'Cross Validation accuracy Scores: {cv_accuracy}')
print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')

### Feature Importance

In [None]:
feat_importance = model.feature_importances_

In [None]:
# Use the feature columns directly rather than assuming 'class' is the last column of df
feat_importance = pd.DataFrame(feat_importance, index=X.columns).sort_values(0)

In [None]:
feat_importance

In [None]:
plt.figure(figsize=(20,10))
plt.title('Feature Importances')
sns.barplot(data=feat_importance.T);