<img src="https://drive.google.com/uc?id=1-d7H1l1lJ28_sLcd9Vvh_N-yro7CJZcZ" style="Width:1000px">

# 🚀 Modelling Near Earth Objects

This is a data problem that was part of the 2022/23 assessed coursework. Obviously the coursework will be different this year, but this gives you a sense of what is expected.

You are a consultant in scientific machine learning, and your first client is **NASA**! They have had a lot of success with their recent **Double Asteroid Redirection Test**, or **DART** mission: now, they have an increased budget and they want to spend it on some predictive machine learning. This is where you come into play: you will help **NASA** train a machine learning algorithm that can determine whether or not a near-Earth asteroid is <span style="color:teal">hazardous</span> (dangerous). <br>

### What NASA wants you to do is the following:

<p>❇️ <strong style="color:blue">Load the training data</strong> (<code>asteroid_training.csv</code>). This contains your <code>y</code> target (<code>hazardous</code>), and all the other potential features <code>X</code> that you can choose to use to predict <code>y</code>.</p>
<p>❇️ NASA says you can <strong style="color:blue">prepare the data</strong> any way you feel is best (you have complete freedom here). Save the final prepared data as a variable named <code>X_train_prep</code> in your notebook. Note that this needs to be a Pandas DataFrame (used in the code test).</p>
<p>❇️ In order <strong>to be fair</strong> when assessing your work against others, NASA informs you that for any algorithm where a <strong style="color:blue">random state</strong> is needed you should use <code>random_state=42</code>. Note that this does not necessarily imply that you need an operation that requires a <code>random_state</code>: but if you do, use 42 so you can be fairly assessed against other bidders!</p>
<p>❇️ Once your data is prepared, NASA wants you to <strong style="color:blue">train a predictive machine learning algorithm</strong>. But they give you <span style="color:red"><strong>further constraints</strong></span>:</p>
<ol>
    <li>Your algorithm can only be a <strong style="color:purple">linear model</strong> (so don't use any fancy RandomForest, KNN or XGBoost: NASA is too old school for that)</li>
    <li>NASA will evaluate your model against metrics that determine the <strong style="color:purple">overall performance of the model at any threshold</strong></li>
</ol>
<p>❇️ Using your <code>final_model</code>, <strong style="color:blue">estimate</strong> the performance of your algorithm on unseen data and save this in a variable called <code>predicted_performance</code></p>
<p>❇️ Using your <code>final_model</code>, <strong style="color:blue">predict</strong> whether or not the unknown asteroids contained in the file <code>unknown_asteroids.csv</code> are hazardous. Your prediction should be an array of <code>0</code> (not hazardous) and <code>1</code> (hazardous)</p>
<p>❇️ <strong style="color:blue">Save your prediction to file</strong> (see pandas's <code>to_csv</code>) making sure to <strong>NOT INCLUDE THE INDEX</strong>. The filename should be <code>predictions.csv</code></p>
<br>

#### NASA will judge your work on the following criteria:
<p>✅ How clean and easy to read your code is, and how well structured your notebook is: this includes using markdown cells to explain your decisions if needed (don't justify all basic decisions though: the code needs to speak for itself)</p>
<p>✅ The overall performance of your algorithm at predicting whether a near-Earth object is hazardous <strong>on an unseen dataset</strong> (see note above in point <code>2</code>).</p>
<p>✅ Whether or not you have demonstrated through code that your solution follows the best practices of data science.</p>
<br>

This is a pretty open-ended exercise, so be imaginative. Don't forget to do some solid EDA to understand your data. Also, always use `random_state=42` for the test to assess you fairly. And don't forget all the best practices you have learned in the course!
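
As a starting point for the EDA, here is a minimal sketch of a few first checks (class balance, missing values, duplicates, correlation with the target). It runs on a made-up toy frame, not on the real data: apart from `hazardous`, the column names and values are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for asteroid_training.csv.
# Only the `hazardous` column name comes from the brief.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "est_diameter": rng.lognormal(mean=0.0, sigma=1.0, size=200),
    "relative_velocity": rng.normal(loc=50_000, scale=10_000, size=200),
    "hazardous": rng.choice([0, 1], size=200, p=[0.85, 0.15]),
})
# Knock out a few values so the missing-value check has something to find.
toy.loc[toy.sample(10, random_state=42).index, "est_diameter"] = np.nan

# Share of each class: an imbalanced target affects which metrics make sense.
class_balance = toy["hazardous"].value_counts(normalize=True)

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Number of fully duplicated rows.
duplicate_rows = int(toy.duplicated().sum())

# Linear correlation of each feature with the target.
target_corr = toy.corr()["hazardous"].drop("hazardous")

print(class_balance, missing_share, duplicate_rows, target_corr, sep="\n\n")
```

On the real dataset you would run the same kind of checks on `data` and let what you find drive your preparation choices.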

### Opening the data

In [None]:
from nbta.utils import download_data
download_data(id='10aqsyytz1F2qky7CTS5LOyGy99fQOPGC')

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv('raw_data/asteroid_training.csv')
unknown_objects = pd.read_csv('raw_data/unknown_asteroids.csv')


# Your work

Now you can explore the dataset and train your model as you see fit.
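
To make the expected variables concrete, here is a rough sketch of one possible workflow on synthetic data. It is not the reference solution. The choices of `LogisticRegression` as the linear model, standard scaling as the preparation step, and cross-validated ROC AUC as the threshold-independent metric are all assumptions, and the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asteroid features and the `hazardous` target.
X_raw, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_clusters_per_class=1,
    weights=[0.85, 0.15], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_raw.shape[1])]  # hypothetical names

# Keep the prepared data as a DataFrame, as the brief requires.
scaler = StandardScaler()
X_train_prep = pd.DataFrame(scaler.fit_transform(X_raw), columns=feature_names)

# A linear classifier (assumed choice), with the required random state.
final_model = LogisticRegression(max_iter=1000, random_state=42)

# Estimate performance on unseen data with stratified cross-validation.
# ROC AUC summarises ranking quality across all decision thresholds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(final_model, X_train_prep, y, cv=cv, scoring="roc_auc")
predicted_performance = scores.mean()

# Fit on the full prepared training set before predicting on new data.
final_model.fit(X_train_prep, y)
print(f"Estimated ROC AUC: {predicted_performance:.3f}")
```

Note that scaling before cross-validating lets the folds share a little information. Wrapping the scaler and model in a scikit-learn `Pipeline` avoids this, at the cost of having to build `X_train_prep` separately for the code test.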

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells

# Prediction

Now that you have a fully trained algorithm, it is time to predict how hazardous unknown asteroids are. Use your trained algorithm to predict a label for the samples in `unknown_objects`. Then, save your predictions (use a `pd.Series` for ease) into a file named `predictions.csv`. Make sure to use `index=False` when you do this to not save the index as a data column in your file. Then, test your code below!
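
The saving step can be sketched as follows, again on synthetic data. The toy frames, the column names `a`, `b`, `c`, and the `hazardous` header on the saved series are assumptions for illustration, and the sketch writes to a temporary folder so it does not overwrite your real `predictions.csv`.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: in the notebook these would be your prepared training
# data and `unknown_objects` after the SAME preparation steps.
rng = np.random.default_rng(42)
X_train_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_train_toy = (X_train_toy["a"] + X_train_toy["b"] > 0).astype(int)
X_unknown_toy = pd.DataFrame(rng.normal(size=(10, 3)), columns=["a", "b", "c"])

model = LogisticRegression(random_state=42).fit(X_train_toy, y_train_toy)

# predict() returns hard 0/1 labels; wrap them in a Series for easy saving.
predictions = pd.Series(model.predict(X_unknown_toy), name="hazardous")

# In the notebook you would pass "predictions.csv" directly.
out_path = Path(tempfile.mkdtemp()) / "predictions.csv"
predictions.to_csv(out_path, index=False)  # index=False: no extra index column

# Read the file back to confirm it holds a single column of labels.
saved = pd.read_csv(out_path)
print(saved.shape)
```

Reading the file back is a cheap sanity check that the index was not written as a data column.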

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells

# 🏁  Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('new_data_prediction',
    predicted_performance=predicted_performance,
    X_train=X_train_prep
)
result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.