# Predicting Diameter and Potential Harm of Asteroids using Machine Learning

**Authors** :
Colin Campbell (c_c953), Jake Worden (jrw294), Leah Lewis (lrl68) and Ryan Wakabayashi (rjw102)

This uses the Asteroid dataset:  https://www.kaggle.com/basu369victor/prediction-of-asteroid-diameter

# Part 1 : Predicting Diameter

## Setup

Load Python packages for data preparation and analysis.


**Abstract** :  [  ]

## Introduction

 

 

## Problem Statement 
Question: How can machine learning be used to predict the diameter of asteroids and classify them as potentially hazardous?
* Predict asteroid diameter from Asteroid_Updated.csv (Kaggle).
* Predict whether an asteroid is potentially hazardous to Earth (the `pha` column).

* Success measures:
	* 5- to 10-fold cross-validation scores for all models
	* Regression models: R^2 score
	* Classification models: Precision, Recall, ROC/AUC

* Targets: R^2 above 0.85 for the regression models (based on results reported on Kaggle) and at least 80% precision and recall for the classification models (a modest goal, given how little data the imbalanced classes provide).
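
The success measures above could be computed along the following lines. This is a minimal sketch on synthetic data from scikit-learn's `make_regression` and `make_classification`. The models (`LinearRegression`, `LogisticRegression`), the 5 folds, and the 90/10 class split are illustrative assumptions, not the models or settings evaluated in this project.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate

# Synthetic stand-ins for the asteroid features and targets (assumption).
X_reg, y_reg = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_clf, y_clf = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Regression: R^2 on each of 5 cross-validation folds.
r2_scores = cross_val_score(LinearRegression(), X_reg, y_reg, cv=5, scoring="r2")

# Classification: precision, recall and ROC AUC on each fold.
# For classifiers, an integer cv uses stratified folds, so the rare
# class is represented in every fold.
clf_scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_clf,
    y_clf,
    cv=5,
    scoring=["precision", "recall", "roc_auc"],
)

print("mean R^2:", r2_scores.mean())
for metric in ("precision", "recall", "roc_auc"):
    print(f"mean {metric}:", clf_scores[f"test_{metric}"].mean())
```

Reporting precision and recall separately matters for the imbalanced hazardous class, where plain accuracy can look high even if no hazardous asteroid is ever identified.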

### Related Work

**Link to other work:** [Asteroid Diameter Estimators with added difficulty](https://www.kaggle.com/liamkesatoran/asteroid-diameter-estimators-with-added-difficulty)

## Data Management 
- Describe how you evaluated your solution
- What evaluation metrics did you use?
- Describe a baseline system
- How much did your system outperform the baseline?
- Were there other systems evaluated on the same dataset? How did your system do in comparison to theirs?
- Show graphs/tables with results
- Error analysis
- Suggestions for future improvements

Description of the dataset (dimensions, names of variables with their description)

### Data Gathering


#### *Motivation*
This database was acquired from "Solar System Dynamics" at the Jet Propulsion Laboratory, California Institute of Technology, on behalf of NASA. It contains information on the orbits, physical characteristics, and discovery circumstances of most known natural bodies in our solar system.


#### *Composition*
	
| Feature | Description | Dtype | Null |
| ------- | ----------------- | ------ | :------: |
| name | Object name | object | 817747 |
| a | Semi-major axis(au) | float64 | 2 |
| e | Eccentricity | float64 | 0 |
| i | Inclination with respect to the x-y ecliptic plane (deg) | float64 | 0 |
| om | Longitude of the ascending node | float64 | 0 |
| w | Argument of perihelion | float64 | 0 |
| q | Perihelion distance(au) | float64 | 0 |
| ad | Aphelion distance(au) | float64 | 6 |
| per_y | Orbital period (years) | float64 | 1 |
| data_arc | Data arc-span(d) | float64 | 15474 |
| condition_code | Orbit condition code | object | 867 |
| n_obs_used | Number of observations used | int64 | 0 |
| H | Absolute magnitude parameter | float64 | 2689 |
| neo | Near Earth Object | object | 6 |
| pha | Potentially Hazardous Asteroid | object | 16442 |
| diameter | Diameter of asteroid (km) | object | 702078 |
| extent | Object bi/tri-axial ellipsoid dimensions (km) | object | 839696 |
| albedo | Geometric albedo | float64 | 703305 |
| rot_per | Rotation Period(h) | float64 | 820918 |
| GM | Standard gravitational parameter, Product of mass and gravitational constant | float64 | 839700 |
| BV | Color index B-V magnitude difference | float64 | 838693 |
| UB | Color index U-B magnitude difference | float64 | 838735 |
| IR | Color index I-R magnitude difference | float64 | 839713 |
| spec_B | Spectral taxonomic type(SMASSII) | object | 838048 |
| spec_T | Spectral taxonomic type(Tholen) | object | 838734 |
| G | Magnitude slope parameter | float64 | 839595 |
| moid | Earth minimum orbit intersection distance(au) | float64 | 16442 |
| class | Asteroid orbit class | object | 0 |
| n | Mean motion(deg/d) | float64 | 2 |
| per | Orbital period(d) | float64 | 6 |
| ma | Mean anomaly(deg) | float64 | 8 |

* Shape: (839714 , 31)
* Memory usage: 198.6+ MB

**Dataset found here:** [Asteroid_Updated.csv](https://www.kaggle.com/basu369victor/prediction-of-asteroid-diameter?select=Asteroid_Updated.csv)

### Data Pre-processing, Cleaning, Labeling, and Maintenance 

- Read in the .csv and inspected it with .head() and .info()
- Counted the null values in each column; any column with more than 700,000 nulls was dropped, except the target, diameter
- Any remaining column containing only nulls was dropped
- Any remaining row containing a null was dropped

In [None]:
import pandas as pd

# Load the Kaggle export and summarize column dtypes and non-null counts
df = pd.read_csv("Asteroid_Updated.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 839714 entries, 0 to 839713
Data columns (total 31 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   name            21967 non-null   object 
 1   a               839712 non-null  float64
 2   e               839714 non-null  float64
 3   i               839714 non-null  float64
 4   om              839714 non-null  float64
 5   w               839714 non-null  float64
 6   q               839714 non-null  float64
 7   ad              839708 non-null  float64
 8   per_y           839713 non-null  float64
 9   data_arc        824240 non-null  float64
 10  condition_code  838847 non-null  object 
 11  n_obs_used      839714 non-null  int64  
 12  H               837025 non-null  float64
 13  neo             839708 non-null  object 
 14  pha             823272 non-null  object 
 15  diameter        137636 non-null  object 
 16  extent          18 non-null      object 
 17  albedo    

Print the sum of null values to determine which columns had a high percentage of null values.

In [None]:
print(df.shape)
print(df.isnull().sum())

(839714, 31)
name              817747
a                      2
e                      0
i                      0
om                     0
w                      0
q                      0
ad                     6
per_y                  1
data_arc           15474
condition_code       867
n_obs_used             0
H                   2689
neo                    6
pha                16442
diameter          702078
extent            839696
albedo            703305
rot_per           820918
GM                839700
BV                838693
UB                838735
IR                839713
spec_B            838048
spec_T            838734
G                 839595
moid               16442
class                  0
n                      2
per                    6
ma                     8
dtype: int64


Drop the columns with a high number of null values, keeping diameter since it is the target.

In [None]:
columns = ['name', 'extent', 'albedo', 'rot_per', 'GM', 'BV', 'G', 'UB', 'IR', 'spec_B', 'spec_T']
df = df.drop(columns=columns)
print(df.shape)
print(df.isnull().sum())

(839714, 20)
a                      2
e                      0
i                      0
om                     0
w                      0
q                      0
ad                     6
per_y                  1
data_arc           15474
condition_code       867
n_obs_used             0
H                   2689
neo                    6
pha                16442
diameter          702078
moid               16442
class                  0
n                      2
per                    6
ma                     8
dtype: int64
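
The hardcoded column list above could instead be derived from the null-count threshold. The following is a minimal sketch on a made-up toy frame; the helper name `drop_sparse_columns` and its `max_nulls` and `keep` parameters are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_nulls, keep=()):
    """Drop columns with more than max_nulls missing values, except those in keep."""
    null_counts = frame.isnull().sum()
    sparse = [
        col
        for col in frame.columns
        if null_counts[col] > max_nulls and col not in keep
    ]
    return frame.drop(columns=sparse)


# Toy frame standing in for Asteroid_Updated.csv (values are made up).
toy = pd.DataFrame(
    {
        "a": [2.7, 2.8, 3.1, 2.4],
        "GM": [np.nan, np.nan, np.nan, 62.6],
        "diameter": [939.4, np.nan, np.nan, np.nan],
    }
)

# With a threshold of 2 nulls, GM is dropped while the sparse target is kept.
cleaned = drop_sparse_columns(toy, max_nulls=2, keep=("diameter",))
print(list(cleaned.columns))  # ['a', 'diameter']
```

Applied to the freshly loaded frame as `drop_sparse_columns(df, max_nulls=700_000, keep=("diameter",))`, this should select the same eleven columns as the manual list, given the null counts printed earlier.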


After running into issues with incorrect datatypes, we converted every column to a numeric dtype, coercing any value that could not be parsed to NaN. Columns that became entirely NaN (neo, pha, and class) were then dropped, followed by every row that still contained a NaN.

We then printed the sum of nulls to check that the dataframe had no null values.

In [None]:
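# Coerce every column to numeric; values that cannot be parsed become NaN
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
# astype returns a new Series, so assign it back for the cast to take effect
df['diameter'] = df['diameter'].astype(float)
# Drop columns that are entirely NaN, then rows with any NaN
df.dropna(how='all', axis=1, inplace=True)
df.dropna(how='any', axis=0, inplace=True)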

print(df.shape)
print(df.isnull().sum())
print(df.dtypes)

(136759, 17)
a                 0
e                 0
i                 0
om                0
w                 0
q                 0
ad                0
per_y             0
data_arc          0
condition_code    0
n_obs_used        0
H                 0
diameter          0
moid              0
n                 0
per               0
ma                0
dtype: int64
a                 float64
e                 float64
i                 float64
om                float64
w                 float64
q                 float64
ad                float64
per_y             float64
data_arc          float64
condition_code    float64
n_obs_used          int64
H                 float64
diameter          float64
moid              float64
n                 float64
per               float64
ma                float64
dtype: object
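
To make the effect of the coercion step easier to see, here is a small sketch on a toy frame. The values, including the unparseable `"bad"` entry, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (made-up values) with the same kinds of columns as the dataset.
toy = pd.DataFrame(
    {
        "H": [3.3, 4.1, 5.2],
        "diameter": ["939.4", "bad", "246.6"],  # object column, one unparseable entry
        "neo": ["N", "N", "Y"],  # purely categorical flags
    }
)

# Unparseable strings become NaN instead of raising an error.
numeric = toy.apply(lambda col: pd.to_numeric(col, errors="coerce"))
print(numeric.isnull().sum())

# Columns that are entirely NaN are dropped first, then any row with a NaN.
numeric = numeric.dropna(how="all", axis=1).dropna(how="any", axis=0)
print(numeric)
```

The categorical `neo` column disappears because none of its values parse as numbers. The same mechanism explains why the cleaned frame has 17 columns rather than 20. If those flags are needed later, for example as the classification target, they would have to be encoded before this step.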


### Exploratory Data Analysis 

What data acquisition, cleaning, and processing tools have you used, and why?

* Describe the methods you explored (usually algorithms, or data cleaning or wrangling approaches). 
* Justify your methods in terms of the problem statement. 
* What did you consider but *not* use? In particular, be sure to include every method you tried, even if it didn't "work". 

## Machine Learning Approaches

In this section, you could describe the methods you used in your analysis. For example, if you are doing classification, you could introduce methods like logistic regression, discriminant analysis, and support vector machines. You don't have to write formulas if you don't want to; it is fine to describe the methods in words. This section is essentially a description of the methodologies you used to analyze your data (up to 2 pages).
Describe the choice of machine learning tool. Refer to related work, if applicable.

* Evaluate a primary model and in addition a "baseline" model. 
  * The baseline is typically the simplest model that's applicable to that data problem
    * Naive Bayes for classification
    * K-means on raw feature data for clustering.
* Evaluate a state-of-the-art model
  * Research GitHub, Papers with Code, Kaggle, and similar.
  * If not applicable, talk to the instructor.  
  
**Hint:** The goal is to have some sort of baseline evaluation by the Nov 11th checkpoint, to establish a scale by which to measure your project's performance. Compare the performance of your baseline model and primary model and explain the differences.
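
A baseline-versus-primary comparison for the regression task might look like the sketch below. It runs on synthetic data from `make_regression`; the choice of `DummyRegressor` as the baseline and `RandomForestRegressor` as the primary model is an assumption made for illustration, not the pair of models chosen in this project.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned asteroid features and diameter target.
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)

models = {
    "baseline (predict the mean)": DummyRegressor(strategy="mean"),
    "primary (random forest)": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Score both models with the same folds and metric so the numbers are comparable.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {results[name]:.3f}")
```

A mean-predicting baseline scores an R^2 near zero by construction, so it gives a floor against which any gain from the primary model can be read directly.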

**This is where all the methods you have tried go, including state-of-the-art if any.**

### Describe the ML methods that you used and the reasons for their choice. 
What family of machine learning algorithms are you using, and why?
* Supervised or Unsupervised?
* Regression or classification?

### Justify ML algorithms in terms of the problem itself and the methods you want to use. 
* How did you employ them? 
* What features worked well and what didn't?
* Provide documentation for integration  

### Tools and Infrastructure Tried and Not Used

Describe any tools and infrastructure that you tried and ended up not using.
What was the problem?
Describe the infrastructure used.

## Experiments

Give a detailed summary of the results of your work.

 * Setup - Here is where you specify the exact performance measures you used.  
   * Describe the data used in experiment for presenting dataset: Datasheets for Dataset template 
   * Describe your accuracy or quality measure, and your performance (runtime or throughput) measure. 
   
 * Please use visualizations whenever possible. Include links to interactive visualizations if you built them. 
 
 * You can also submit a separated notebook as an appendix to your report if that makes the visualization/interaction task easier. 
   * It would be reasonable to submit your report as a notebook, but please make sure it runs on one of the two standard environments, and that you include any required files. 

## Conclusion
In this section give a high-level summary of your results. If the reader only reads one section of the report, this one should be it, and it should be self-contained.  You can refer back to the Experiments Section for elaborations. This section should be less than a page. In particular emphasize any results that were surprising.

## References
List the references cited in your project.

## Appendix

Explain the contributions of each member to the project. Include all supporting materials, e.g., additional figures/tables, Python code technical derivations.

## Determine Feature Selection

In [None]:
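from sklearn.feature_selection import SelectKBest, f_regression

# Assumption: features are every cleaned column except the regression target
x = df.drop(columns=['diameter'])
y = df['diameter']

# Univariate F-test of each feature against the target, keeping the 10 best.
# f_regression is used because diameter is continuous; the default
# score function, f_classif, expects class labels.
anova = SelectKBest(score_func=f_regression, k=10)
anova.fit(x, y)

# Print the F-score of each feature; higher scores indicate stronger association
for i in range(len(x.columns)):
    print(f'{x.columns[i]}: {anova.scores_[i]}')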