# Applying Machine Learning to Exoplanet Candidate Classification

### Data produced by the NASA Exoplanet Archive: http://exoplanetarchive.ipac.caltech.edu

Data used for the prediction model:
* koi_period:     Orbital Period [days]
* koi_time0bk:    Transit Epoch [BKJD]
* koi_impact:     Impact Parameter
* koi_duration:   Transit Duration [hrs]
* koi_depth:      Transit Depth [ppm]
* koi_prad:       Planetary Radius [Earth radii]
* koi_teq:        Equilibrium Temperature [K]
* koi_insol:      Insolation Flux [Earth flux]
* koi_model_snr:  Transit Signal-to-Noise
* koi_steff:      Stellar Effective Temperature [K]
* koi_slogg:      Stellar Surface Gravity [log10(cm/s**2)]
* koi_srad:       Stellar Radius [Solar radii]

In [37]:
#initialize machine learning tool H2O.ai
import h2o
import pandas as pd
h2o.init(nthreads = 1, max_mem_size=8)

Checking whether there is an H2O instance running at http://localhost:54321. connected.


| Property | Value |
|---|---|
| H2O cluster uptime: | 1 min 47 secs |
| H2O cluster version: | 3.10.4.8 |
| H2O cluster version age: | 1 month and 1 day |
| H2O cluster name: | H2O_from_python_root_oftlh2 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 7.095 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 1 |
| H2O cluster status: | locked, healthy |
| H2O connection url: | http://localhost:54321 |


## Training Data: Already Classified Exoplanets

Exoplanets already classified as "CONFIRMED" or "FALSE POSITIVE" are used to train the machine learning algorithm.

In [21]:
exoplanets_csv = "data/exoplanet_results.csv"
data = h2o.import_file(exoplanets_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [22]:
# Convert the disposition column to a categorical factor (binary response) and list its levels
data['koi_disposition'] = data['koi_disposition'].asfactor()
data['koi_disposition'].levels()

[['CONFIRMED', 'FALSE POSITIVE']]

We split the data into three sets: a training set, a validation set, and a test set. The training set is used to actually train the model. The validation set is used to select the best model for the data set. The test set is used to estimate the accuracy of the final model.

In [23]:
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

In [24]:
# Note that the splits do not match the 70/15/15 ratios exactly: split_frame
# assigns rows probabilistically, and the seed only makes the split reproducible
print(train.nrow)
print(valid.nrow)
print(test.nrow)

5150
1066
1100
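The uneven row counts are what a probabilistic split would produce. The sketch below is not H2O's implementation; it is a minimal pure-Python illustration, assuming each row is assigned to a split by drawing one uniform random number and comparing it with the cumulative ratios. The function name `probabilistic_split` is hypothetical, and 7316 is simply the sum of the three row counts printed above.

```python
import random


def probabilistic_split(n_rows, ratios, seed):
    """Assign each row index to a split by drawing one uniform number per row."""
    rng = random.Random(seed)
    # Cumulative cut points, e.g. roughly [0.7, 0.85] for ratios [0.7, 0.15].
    cuts = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        cuts.append(total)
    # One extra split collects the remainder (here the last 15%).
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        # Count how many cut points u has passed; that count is the split index.
        index = sum(u >= cut for cut in cuts)
        splits[index].append(row)
    return splits


train_rows, valid_rows, test_rows = probabilistic_split(7316, [0.7, 0.15], seed=1)
print(len(train_rows), len(valid_rows), len(test_rows))
```

The three sizes always sum to the total number of rows, but each one only approximates its target ratio. Re-running with the same seed reproduces the same assignment.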


In [25]:
y = 'koi_disposition'
x = list(data.columns)

In [26]:
#remove columns not used for prediction model
x.remove(y)  #remove the response
x.remove('rowid') 
x.remove('kepid')
x.remove('kepoi_name')
x.remove('kepler_name')

## First Model: Generalized Logistic Regression Model
We first run a test using one of the simplest machine learning methods, the generalized linear model. Because this is a binary classification problem, we use logistic regression as the classification method. This gives us a test case and some preliminary results; we can implement more advanced techniques later.
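To make the idea concrete, here is a minimal sketch of how a logistic regression turns a weighted sum of features into a class probability. It does not reproduce H2O's estimator; the function names, the weights, and the 0.5 default threshold are assumptions chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, intercept):
    """Probability of the positive class for one row of features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Return 1 (positive class) when the probability reaches the threshold."""
    return 1 if probability >= threshold else 0


# Hypothetical weights and feature values, for illustration only.
p = predict_probability([1.0, 2.0], weights=[0.5, -0.25], intercept=0.0)
print(p, classify(p))
```

Training amounts to choosing the weights and intercept that best fit the labelled data; the classification threshold can then be tuned separately.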

In [27]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [28]:
#Simple GLM
glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

In [29]:
glm_fit1.train(x=x, y=y, training_frame=train)


glm Model Build progress: |███████████████████████████████████████████████| 100%


### Testing Model with Test Set

In [30]:
glm_perf1 = glm_fit1.model_performance(test)

In [31]:
print(glm_perf1.confusion_matrix())

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.467997156851: 


| Actual \ Predicted | CONFIRMED | FALSE POSITIVE | Error | Rate |
|---|---|---|---|---|
| CONFIRMED | 254.0 | 92.0 | 0.2659 | (92.0/346.0) |
| FALSE POSITIVE | 83.0 | 671.0 | 0.1101 | (83.0/754.0) |
| Total | 337.0 | 763.0 | 0.1591 | (175.0/1100.0) |





This confusion matrix shows the results of running the model on the test set specified above. The rows represent the actual classes, while the columns represent the model's predictions. The model correctly classifies about 73% of confirmed exoplanets (254 of 346) and about 89% of false positives (671 of 754), for an overall error rate of about 16% (175 of 1,100). Although these are rough estimates from a single split, they give some indication of the model's performance.
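The per-class rates can be recomputed directly from the counts in the confusion matrix. The snippet below is a small sketch that copies the four counts by hand; the helper names are hypothetical and not part of H2O.

```python
# Counts copied from the confusion matrix above (outer key = actual, inner key = predicted).
confusion = {
    "CONFIRMED": {"CONFIRMED": 254, "FALSE POSITIVE": 92},
    "FALSE POSITIVE": {"CONFIRMED": 83, "FALSE POSITIVE": 671},
}


def per_class_accuracy(matrix):
    """Fraction of each actual class that was predicted correctly."""
    return {
        actual: row[actual] / float(sum(row.values()))
        for actual, row in matrix.items()
    }


def overall_accuracy(matrix):
    """Fraction of all rows that were predicted correctly."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / float(total)


rates = per_class_accuracy(confusion)
for label in sorted(rates):
    print("%s: %.4f" % (label, rates[label]))
print("overall: %.4f" % overall_accuracy(confusion))
```

Each per-class accuracy is one minus the corresponding value in the Error column.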

In [32]:
# Retrieve test set AUC
print(glm_perf1.auc())

0.90631276736


AUC: The expected true positive rate if the ranking is split just before a uniformly drawn random negative. 
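An equivalent way to read this definition is that AUC is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; they are not taken from the exoplanet data, and this is not how H2O computes the metric internally.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one of the four positive/negative pairs is ranked the wrong way round.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking, and 1.0 means every positive is scored above every negative.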

From h2o.ai training manual: 

> For logistic regression (i.e.  binomial classification) models, look for AUC
> * Too close to 0.5:  model doesnt predict well
> * Too close to 1:  model predicts too well

An AUC of about 0.906 is fairly close to 1, so by this guideline the model may predict too well. Most likely, one or more of the variables used in the prediction model are tied too closely to the CONFIRMED or FALSE POSITIVE classification. It is unclear which variables these are.
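One possible way to look for such variables is to rank the model's standardized coefficients by absolute value. The sketch below assumes the coefficients are available as a dictionary of name to value, which is what `glm_fit1.coef_norm()` is expected to return in H2O, and that the intercept is stored under the key `"Intercept"`. The helper `rank_by_magnitude` is hypothetical, and the numbers in `example` are invented for illustration; they are not results from the fitted model.

```python
def rank_by_magnitude(coefficients, skip=("Intercept",)):
    """Sort (name, value) pairs by absolute coefficient size, largest first."""
    items = [
        (name, value)
        for name, value in coefficients.items()
        if name not in skip
    ]
    return sorted(items, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical values for illustration only; not taken from the fitted model.
example = {
    "Intercept": -0.4,
    "koi_depth": 0.2,
    "koi_model_snr": -1.5,
    "koi_prad": 0.9,
}
for name, value in rank_by_magnitude(example):
    print("%s: %+.2f" % (name, value))

# With a live H2O session this might be called as (assumption):
# rank_by_magnitude(glm_fit1.coef_norm())
```

A large standardized coefficient only suggests that a variable is influential; it does not by itself show that the variable leaks the label.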

### Apply Prediction Model to Exoplanet Candidates

In [33]:
#import candidate data file
candidate_csv = "data/candidates.csv"
candidateData = h2o.import_file(candidate_csv)


Parse progress: |█████████████████████████████████████████████████████████| 100%


In [34]:
# Run the GLM prediction model on the candidate data
glmpredict = glm_fit1.predict(candidateData)

glm prediction progress: |████████████████████████████████████████████████| 100%


In [35]:
candidateCombine = candidateData.cbind(glmpredict)
h2o.export_file(candidateCombine, "data/candidate_predict.csv")

Export File progress: |███████████████████████████████████████████████████| 100%


In [57]:
candidatePredictData = pd.read_csv('data/candidate_predict.csv').sort_values('CONFIRMED')

#### Most likely to be false positives

In [54]:
pd.DataFrame(candidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[0:5]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 1422 | 5131276 | FALSE POSITIVE | 0.0 | 1.0 |
| 1839 | 9790965 | FALSE POSITIVE | 0.0 | 1.0 |
| 1962 | 9025662 | FALSE POSITIVE | 0.0 | 1.0 |
| 32 | 10287723 | FALSE POSITIVE | 0.0 | 1.0 |
| 1266 | 6790592 | FALSE POSITIVE | 0.0 | 1.0 |


#### Most likely to be confirmed exoplanets

In [72]:
pd.DataFrame(candidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[:-6:-1]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 1158 | 5384713 | CONFIRMED | 0.909577 | 0.090423 |
| 109 | 9427402 | CONFIRMED | 0.908379 | 0.091621 |
| 302 | 2161536 | CONFIRMED | 0.905637 | 0.094363 |
| 1511 | 10747162 | CONFIRMED | 0.905629 | 0.094371 |
| 1986 | 7868967 | CONFIRMED | 0.899582 | 0.100418 |


In [19]:
h2o.cluster().shutdown(prompt=False)

H2O session _sid_8437 closed.
