<h1>Introduction</h1>
<p>Welcome! In this notebook I'm going to analyze asteroid data and implement a machine learning classifier to predict whether an asteroid is hazardous.</p>
<h3>My main objectives for this project are:</h3>
<ul>
    <li>Applying exploratory data analysis to get some insights about the dataset</li>
    <li>Getting the data into better shape through transformation and feature engineering, to help build better models</li>
    <li>Building and tuning an XGBClassifier to predict the Hazardous label</li>
</ul>

<h2>Importing Libraries</h2>
<p>Let's start by importing the packages we are going to need.</p>

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
import seaborn as sns

# Meeting the data
<p>Let's open the data and see what we have.</p>

In [None]:
#Opening the data
data = pd.read_csv('../input/nasa-asteroids-classification/nasa.csv')

In [None]:
#Let's look at the shape of the data so we know what we are dealing with
data.shape

In [None]:
#Let's look at its first rows
data.head(10)

In [None]:
# Split the data into features X and label y, dropping identifier columns (name and IDs) from the features
X = data.copy()
X.drop(columns=['Neo Reference ID', 'Name', 'Orbit ID', 'Hazardous'], inplace=True)
y = data['Hazardous'].astype(int)

# EDA
<p>Exploratory Data Analysis</p>

<p>Let's create a heatmap. This plot shows the correlation between the different numerical features.</p>

In [None]:
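#Only numeric columns can be correlated; recent pandas versions raise an error on text columns
correlation = X.select_dtypes(include='number').corr()

f, ax = plt.subplots(figsize=(14,12))
ax.set_title('Correlation of numerical attributes', size=16)
sns.heatmap(correlation, ax=ax)
plt.show()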

<h4>Observations</h4>
<p>Let's focus on the lighter parts of the graph:</p>
<ol>
    <li>The Estimated Diameter columns are highly correlated because they express the same quantity (min and max bounds, in several units)</li>
    <li>The Relative Velocity columns are highly correlated because they express the same quantity in different units</li>
    <li>The Miss Distance columns are highly correlated because they express the same quantity in different units</li>
</ol>
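
<p>Reading groups of redundant columns off a heatmap can be error-prone, so here is a minimal sketch of how the same observation could be made programmatically. The helper name <code>highly_correlated_pairs</code>, the 0.95 threshold and the toy columns are assumptions for illustration only; they are not part of the original analysis.</p>

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.95):
    """Return (col_a, col_b, |corr|) for numeric column pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(a, b, float(v)) for (a, b), v in strong.items()]


# Toy data (hypothetical): one distance in two units, plus an unrelated column
toy = pd.DataFrame({
    "dist_km": [1.0, 2.0, 3.0, 4.0, 5.0],
    "dist_miles": [0.621, 1.243, 1.864, 2.485, 3.107],
    "other": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

<p>On the real data, calling the helper on <code>X</code> should list the unit-duplicated columns discussed above, which could then be dropped as in the next cells.</p>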

In [None]:
#We can see there are 8 columns indicating Min and Max values of the Estimated Diameter of asteroids
#We are going to create a new column with the Mean value of KM(min) and KM(max) and then eliminate the rest
X['avg_dia'] = X[['Est Dia in KM(min)', 'Est Dia in KM(max)']].mean(axis=1)
X.drop(columns=['Est Dia in KM(min)', 'Est Dia in KM(max)', 'Est Dia in M(min)',
               'Est Dia in M(max)', 'Est Dia in Miles(min)', 'Est Dia in Miles(max)',
               'Est Dia in Feet(min)', 'Est Dia in Feet(max)'], inplace=True)

In [None]:
#There are 3 columns indicating Relative Velocity
#We are going to just leave the Relative Velocity km per hr
X.drop(columns=['Relative Velocity km per sec', 'Miles per hour'], inplace=True)

In [None]:
#There are 4 columns indicating Miss Distance
#We are going to keep only the Miss Distance in kilometers
X.drop(columns=['Miss Dist.(Astronomical)', 'Miss Dist.(lunar)', 'Miss Dist.(miles)'], inplace=True)

In [None]:
#Let's see the variability of the categorical columns
cat_columns = X.select_dtypes(include=['object']).columns
#We don't count these 2 columns because they are dates; we will process them later
cat_columns = cat_columns.drop(['Close Approach Date', 'Orbit Determination Date'])
for col in cat_columns:
    print(X[col].value_counts(ascending=True, normalize=True))

<h4>Observations</h4>
<p>We can see that both "Orbiting Body" and "Equinox" have only one possible value, so we are going to eliminate them.</p>

In [None]:
#Eliminating Orbiting Body and Equinox columns
X.drop(columns=cat_columns, inplace=True)
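
<p>The cell above drops every remaining categorical column, which works here because both of them turned out to be constant. A more explicit check could look like the sketch below. The helper name <code>constant_columns</code> and the toy values are hypothetical.</p>

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns holding a single distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Toy frame with illustrative values, not the real dataset
toy = pd.DataFrame({
    "Orbiting Body": ["Earth", "Earth", "Earth"],
    "Equinox": ["J2000", "J2000", "J2000"],
    "some_numeric_feature": [21.6, 21.3, 20.3],
})
to_drop = constant_columns(toy)
print(to_drop)
toy = toy.drop(columns=to_drop)
```

<p>Dropping by this rule rather than by dtype would keep a categorical column that does carry information, if one ever appeared in the data.</p>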

# Missing Data
<ul>
    <li>Let's check whether there are any missing values</li>
</ul>

In [None]:
X.isnull().sum()

<p>Luckily we don't have any missing values, so we can proceed with modeling.</p>

# Preprocessing + Pipeline
<p>First, let's split the data into train and test dataframes.</p>
<p>Steps:</p>
<ol>
    <li>Extract year, month and day from the date columns so we can use them as numerical features</li>
    <li>Add Year, Month and Day for each date column to the dataset</li>
    <li>Eliminate date columns from the dataset</li>
    <li>Fit the model</li>
</ol>
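
<p>The date-handling steps above can be sketched on a toy frame before wrapping them in a transformer. The three sample values are made up for illustration; the last one is deliberately invalid to show what <code>errors="coerce"</code> does.</p>

```python
import pandas as pd

# Toy column (hypothetical values); the last entry cannot be parsed
toy = pd.DataFrame({"Close Approach Date": ["1995-01-01", "2016-09-08", "not a date"]})

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(toy["Close Approach Date"], errors="coerce", format="%Y-%m-%d")

# Add numeric day, month and year columns, then drop the original text column
toy["Close Approach Date day"] = parsed.dt.day
toy["Close Approach Date month"] = parsed.dt.month
toy["Close Approach Date year"] = parsed.dt.year
toy = toy.drop(columns=["Close Approach Date"])
print(toy)
```

<p>Rows whose date could not be parsed end up with NaN in all three new columns, so it is worth checking how many there are before fitting a model.</p>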

In [None]:
#Import Pipeline
from sklearn.pipeline import Pipeline
#Import model and GridSearch for Hyperparameter Optimization
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [None]:
#Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)

In [None]:
#Import the base classes for custom transformers
from sklearn.base import BaseEstimator, TransformerMixin

#Define Date pre-processor class
class DateProcessor(BaseEstimator, TransformerMixin):
    """Replace each date column with numeric day, month and year columns."""

    def fit(self, X, y=None):
        #Nothing to learn: the transformation is stateless
        return self

    def transform(self, df):
        dateCols = ['Close Approach Date', 'Orbit Determination Date']
        new_df = df.copy()
        for col in dateCols:
            #Values that cannot be parsed with this format become NaT (NaN in the new columns)
            dates = pd.to_datetime(new_df[col], errors="coerce", format="%Y-%m-%d")
            new_df[col + " day"] = dates.dt.day
            new_df[col + " month"] = dates.dt.month
            new_df[col + " year"] = dates.dt.year

        return new_df.drop(columns=dateCols)

In [None]:
#Defining the pipeline
#Hyperparameters kept for reference (not applied to the estimator below):
#    objective='binary:logistic', nthread=4, seed=42, learning_rate=0.2,
#    max_depth=3, n_estimators=65, tree_method='gpu_hist', verbosity=2
estimator = XGBClassifier(random_state=42)
model_pipeline = Pipeline(steps=[
    ('process_dates', DateProcessor()),
    ('XGBoost', estimator)
])
model_pipeline.fit(X_train, y_train)

In [None]:
#Mean accuracy on the test set
model_pipeline.score(X_test, y_test)

In [None]:
#Let's compute the accuracy explicitly from the predictions
from sklearn.metrics import accuracy_score

#predict already returns class labels, so no rounding is needed
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
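
<p>GridSearchCV is imported above but never used, so here is a minimal sketch of what the tuning step could look like. To keep the example runnable without the dataset or xgboost, it assumes synthetic data from <code>make_classification</code> and uses scikit-learn's <code>GradientBoostingClassifier</code> as a stand-in estimator; the grid values are illustrative guesses, not tuned results.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption); the notebook would use X_train, y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=7)

# Stand-in estimator (assumption): swap in XGBClassifier(random_state=42),
# and add the DateProcessor step, to tune the real pipeline
demo_pipeline = Pipeline(steps=[
    ("XGBoost", GradientBoostingClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "XGBoost__learning_rate": [0.1, 0.2],
    "XGBoost__max_depth": [2, 3],
    "XGBoost__n_estimators": [30, 65],
}

# 3-fold cross-validated search over all 8 combinations
search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Test accuracy: %.2f%%" % (search.score(X_te, y_te) * 100.0))
```

<p>Because the search wraps the whole pipeline, any preprocessing step is refit inside each fold, which avoids leaking information from the validation fold into training.</p>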

# End
Thanks for going all the way down through my notebook! I hope you were able to get something useful from it. Feel free to ask questions and use my code.