# Applying Machine Learning to Exoplanet Candidate Classification

### Data produced by the NASA Exoplanet Archive  http://exoplanetarchive.ipac.caltech.edu

Data used for prediction model:
* koi_period:     Orbital Period [days]
* koi_time0bk:    Transit Epoch [BKJD]
* koi_impact:     Impact Parameter
* koi_duration:   Transit Duration [hrs]
* koi_depth:      Transit Depth [ppm]
* koi_prad:       Planetary Radius [Earth radii]
* koi_teq:        Equilibrium Temperature [K]
* koi_insol:      Insolation Flux [Earth flux]
* koi_model_snr:  Transit Signal-to-Noise
* koi_steff:      Stellar Effective Temperature [K]
* koi_slogg:      Stellar Surface Gravity [log10(cm/s**2)]
* koi_srad:       Stellar Radius [Solar radii]

In [1]:
#initialize machine learning tool H2O.ai
import h2o
import pandas as pd
h2o.init(nthreads = 1, max_mem_size=8)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp62qYQe
  JVM stdout: /tmp/tmp62qYQe/h2o_root_started_from_python.out
  JVM stderr: /tmp/tmp62qYQe/h2o_root_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| | |
|---|---|
| H2O cluster uptime: | 02 secs |
| H2O cluster version: | 3.10.4.8 |
| H2O cluster version age: | 1 month and 1 day |
| H2O cluster name: | H2O_from_python_root_8b08d9 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 7.111 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 1 |
| H2O cluster status: | accepting new members, healthy |
| H2O connection url: | http://127.0.0.1:54321 |


## Training Data: Already Classified Exoplanets

Objects of interest that are already classified as "CONFIRMED" or "FALSE POSITIVE" are used to train the machine learning algorithm.

In [2]:
exoplanets_csv = "data/exoplanet_results.csv"
data = h2o.import_file(exoplanets_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [3]:
# Convert the disposition column to a categorical factor (binary response) and list its levels
data['koi_disposition'] = data['koi_disposition'].asfactor()
data['koi_disposition'].levels()

[['CONFIRMED', 'FALSE POSITIVE']]

We split the data into three sets: a training set, a validation set, and a test set. The training set is used to actually train the model. The validation set is used to select the best model for the data set. The test set is used to estimate the accuracy of the model.

In [4]:
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

In [5]:
# Note that the split sizes are only approximately 70/15/15: split_frame assigns
# rows probabilistically, and the seed makes the split reproducible.
print(train.nrow)
print(valid.nrow)
print(test.nrow)

5150
1066
1100
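
The row counts above are close to, but not exactly, 70/15/15 of the 7316 rows. A rough pure-Python sketch of probabilistic splitting shows why; `probabilistic_split` is a hypothetical helper written for illustration and is not H2O's implementation, so its counts will not match the ones above.

```python
import random

def probabilistic_split(n_rows, ratios, seed=1):
    """Assign each row index to a split by drawing a uniform random number.

    `ratios` lists the fractions for all but the last split; the last
    split receives whatever remains. Illustrative only, not H2O's code.
    """
    rng = random.Random(seed)
    n_splits = len(ratios) + 1
    splits = [[] for _ in range(n_splits)]
    for row in range(n_rows):
        draw = rng.random()
        cumulative = 0.0
        target = n_splits - 1  # default: the remainder split
        for i, ratio in enumerate(ratios):
            cumulative += ratio
            if draw < cumulative:
                target = i
                break
        splits[target].append(row)
    return splits

# 7316 = 5150 + 1066 + 1100, the total number of rows in the splits above
train_idx, valid_idx, test_idx = probabilistic_split(7316, [0.7, 0.15])
print("train=%d valid=%d test=%d" % (len(train_idx), len(valid_idx), len(test_idx)))
```

Because every row is assigned independently, the split sizes fluctuate around the requested ratios, while a fixed seed reproduces the same assignment on every run.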


In [6]:
y = 'koi_disposition'
x = list(data.columns)

In [7]:
#remove columns not used for prediction model
x.remove(y)  #remove the response
x.remove('rowid') 
x.remove('kepid')
x.remove('kepoi_name')
x.remove('kepler_name')

## First Model: Generalized Logistic Regression Model
We first run a test using one of the simplest machine learning methods, the generalized linear model. Because this is a binary classification problem, we use logistic regression as the classification method. This gives us a test case and some preliminary results. We can implement more advanced techniques later.
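
As a minimal sketch of what a binomial GLM computes, the snippet below maps a linear combination of features to a probability with the logistic function. The coefficient and feature values are hypothetical and chosen only for illustration; they are not fitted values from this notebook.

```python
import math

def logistic(z):
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of one of the two classes under a logistic regression model."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return logistic(z)

# Hypothetical coefficients and feature values (not taken from the fitted model)
coefficients = {"koi_depth": 0.002, "koi_model_snr": -0.01}
intercept = -0.5
features = {"koi_depth": 500.0, "koi_model_snr": 30.0}

p = predict_probability(features, coefficients, intercept)
print("predicted probability: %.4f" % p)
```

The class label is then obtained by comparing this probability with a threshold.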

In [8]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [9]:
#Simple GLM
glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

In [10]:
glm_fit1.train(x=x, y=y, training_frame=train)


glm Model Build progress: |███████████████████████████████████████████████| 100%
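
The model above is trained without the validation set created earlier. One way to use that set, sketched here under the assumption that `x`, `y`, `train`, and `valid` are the objects defined in the previous cells, is to pass it as `validation_frame` and read back the validation AUC. The function name and the `model_id` value are hypothetical.

```python
def fit_glm_with_validation(x, y, train, valid, model_id='glm_fit_valid'):
    """Train a binomial GLM and report its AUC on the validation frame."""
    # Imported lazily so the function can be defined before H2O is started.
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    model = H2OGeneralizedLinearEstimator(family='binomial', model_id=model_id)
    model.train(x=x, y=y, training_frame=train, validation_frame=valid)
    return model, model.auc(valid=True)

# Example call inside this notebook (requires a running H2O cluster):
# glm_valid, valid_auc = fit_glm_with_validation(x, y, train, valid)
```

Comparing validation AUC across candidate models keeps the test set untouched until the final evaluation.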


### Testing Model with Test Set

In [11]:
glm_perf1 = glm_fit1.model_performance(test)

In [12]:
print(glm_perf1.confusion_matrix())

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.467997156851: 


| | CONFIRMED | FALSE POSITIVE | Error | Rate |
|---|---|---|---|---|
| CONFIRMED | 254.0 | 92.0 | 0.2659 | (92.0/346.0) |
| FALSE POSITIVE | 83.0 | 671.0 | 0.1101 | (83.0/754.0) |
| Total | 337.0 | 763.0 | 0.1591 | (175.0/1100.0) |





This confusion matrix shows the result of running the model on the test set defined above. The rows are the actual dispositions and the columns are the model's predictions. The model correctly classifies about 73% of confirmed exoplanets (254 of 346) and about 89% of false positives (671 of 754). These figures come from a single test split, so they give only a rough indication of the model's performance.
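
The per-class rates can be recomputed directly from the counts in the confusion matrix. The short sketch below does this in plain Python; the helper names are hypothetical.

```python
# Counts copied from the GLM confusion matrix above: rows are actual classes,
# inner keys are predicted classes.
confusion = {
    "CONFIRMED": {"CONFIRMED": 254, "FALSE POSITIVE": 92},
    "FALSE POSITIVE": {"CONFIRMED": 83, "FALSE POSITIVE": 671},
}

def per_class_recall(matrix):
    """Fraction of each actual class that was predicted correctly."""
    return {actual: float(row[actual]) / sum(row.values())
            for actual, row in matrix.items()}

def overall_accuracy(matrix):
    """Fraction of all rows that were predicted correctly."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return float(correct) / total

recall = per_class_recall(confusion)
accuracy = overall_accuracy(confusion)
print("CONFIRMED recall:      %.4f" % recall["CONFIRMED"])       # 0.7341
print("FALSE POSITIVE recall: %.4f" % recall["FALSE POSITIVE"])  # 0.8899
print("Overall accuracy:      %.4f" % accuracy)                  # 0.8409
```

Each recall is one minus the corresponding value in the Error column.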

In [13]:
# Retrieve test set AUC
print(glm_perf1.auc())

0.90631276736


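AUC (area under the ROC curve): the expected true positive rate if the ranking is split just before a uniformly drawn random negative. Equivalently, it is the probability that a randomly chosen positive example is ranked above a randomly chosen negative example.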
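
To make the definition concrete, AUC can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses hypothetical toy scores, not values from this notebook.

```python
def pairwise_auc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical toy scores
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
toy_auc = pairwise_auc(positives, negatives)
print("AUC: %.4f" % toy_auc)  # 8 of 9 pairs are ordered correctly: 0.8889
```

A value of 0.5 corresponds to random ranking, and 1.0 means every positive is ranked above every negative.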

From h2o.ai training manual: 

> For logistic regression (i.e.  binomial classification) models, look for AUC
> * Too close to 0.5:  model doesnt predict well
> * Too close to 1:  model predicts too well

By this guideline, the model may predict too well. One likely explanation is that one or more of the predictor variables is tied too closely to the CONFIRMED or FALSE POSITIVE disposition. It is unclear which variables these are.
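
One rough way to look for such variables is to score each feature on its own: a feature that separates the two classes almost perfectly by itself deserves a closer look. The sketch below is an assumption about how such a screen could be done, not part of the original analysis; the column names and values are hypothetical toy data.

```python
def feature_auc(values, labels):
    """AUC of a single feature used directly as a score (labels are 0 or 1)."""
    pos = [v for v, label in zip(values, labels) if label == 1]
    neg = [v for v, label in zip(values, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / float(len(pos) * len(neg))

def screen_features(columns, labels, threshold=0.95):
    """Return features whose single-feature separation meets the threshold."""
    flagged = {}
    for name, values in columns.items():
        auc = feature_auc(values, labels)
        separation = max(auc, 1.0 - auc)  # direction of the feature is irrelevant
        if separation >= threshold:
            flagged[name] = separation
    return flagged

# Hypothetical toy data: one feature that gives the label away, one that does not
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "leaky_flag": [0.9, 0.8, 0.95, 0.1, 0.2, 0.05],
    "noisy_feature": [0.5, 0.2, 0.7, 0.6, 0.3, 0.4],
}
flagged = screen_features(columns, labels)
print(flagged)
```

A single-feature screen only catches variables that leak on their own, so it complements rather than replaces inspecting the fitted model.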

### Apply Prediction Model to Exoplanet Candidates

In [14]:
#import candidate data file
candidate_csv = "data/candidates.csv"
candidateData = h2o.import_file(candidate_csv)


Parse progress: |█████████████████████████████████████████████████████████| 100%


In [15]:
#run GLM
glmpredict = glm_fit1.predict(candidateData)

glm prediction progress: |████████████████████████████████████████████████| 100%


In [16]:
candidateCombine = candidateData.cbind(glmpredict)
h2o.export_file(candidateCombine, "data/candidate_predict.csv")

Export File progress: |███████████████████████████████████████████████████| 100%


In [17]:
candidatePredictData = pd.read_csv('data/candidate_predict.csv').sort_values('CONFIRMED')

#### Most likely to be false positives

In [18]:
pd.DataFrame(candidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[0:5]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 1422 | 5131276 | FALSE POSITIVE | 0.0 | 1.0 |
| 1839 | 9790965 | FALSE POSITIVE | 0.0 | 1.0 |
| 1962 | 9025662 | FALSE POSITIVE | 0.0 | 1.0 |
| 32 | 10287723 | FALSE POSITIVE | 0.0 | 1.0 |
| 1266 | 6790592 | FALSE POSITIVE | 0.0 | 1.0 |


#### Most likely to be confirmed exoplanets

In [19]:
pd.DataFrame(candidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[:-6:-1]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 1158 | 5384713 | CONFIRMED | 0.909577 | 0.090423 |
| 109 | 9427402 | CONFIRMED | 0.908379 | 0.091621 |
| 302 | 2161536 | CONFIRMED | 0.905637 | 0.094363 |
| 1511 | 10747162 | CONFIRMED | 0.905629 | 0.094371 |
| 1986 | 7868967 | CONFIRMED | 0.899582 | 0.100418 |
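
The two listings rely on the predictions being sorted in ascending order of the `CONFIRMED` probability: `[0:5]` takes the five lowest rows, and `[:-6:-1]` walks backwards from the end to take the five highest in descending order. A small sketch with a hypothetical list shows the slicing behaviour.

```python
# Hypothetical probabilities, already sorted in ascending order
scores = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.90]

lowest_five = scores[0:5]      # first five items, ascending
highest_five = scores[:-6:-1]  # last five items, read from the end backwards

print(lowest_five)
print(highest_five)
```

With pandas, sorting with `sort_values('CONFIRMED', ascending=False)` and calling `head(5)` would give the same top rows and may be easier to read.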


### Model 2: Random Forest Algorithm

In [20]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [21]:
rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1)
rf_fit1.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [22]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1)
rf_fit2.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [23]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)
print(rf_perf1.confusion_matrix())
print(rf_perf2.confusion_matrix())

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.524022988677: 


| | CONFIRMED | FALSE POSITIVE | Error | Rate |
|---|---|---|---|---|
| CONFIRMED | 309.0 | 37.0 | 0.1069 | (37.0/346.0) |
| FALSE POSITIVE | 47.0 | 707.0 | 0.0623 | (47.0/754.0) |
| Total | 356.0 | 744.0 | 0.0764 | (84.0/1100.0) |



Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.496279334426: 


| | CONFIRMED | FALSE POSITIVE | Error | Rate |
|---|---|---|---|---|
| CONFIRMED | 303.0 | 43.0 | 0.1243 | (43.0/346.0) |
| FALSE POSITIVE | 43.0 | 711.0 | 0.057 | (43.0/754.0) |
| Total | 346.0 | 754.0 | 0.0782 | (86.0/1100.0) |
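
Each confusion matrix in this notebook is reported at the threshold that maximizes F1 ("max f1 @ threshold"), which is why the threshold differs from model to model. A minimal sketch of that selection, using hypothetical toy scores and labels rather than H2O's internal procedure, could look like this:

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when every score >= threshold is predicted as the positive class."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(scores, labels):
    """Try every observed score as a threshold and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Hypothetical toy scores and labels
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

best = best_f1_threshold(scores, labels)
print("best threshold: %.1f" % best)                                # 0.4
print("F1 at best: %.4f" % f1_at_threshold(scores, labels, best))   # 0.8571
```

Because the threshold is tuned per model, the error rates in different confusion matrices are not measured at the same operating point; AUC is the threshold-independent comparison.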





In [24]:
# Retrieve test set AUC
print(rf_perf1.auc())
print(rf_perf2.auc())

0.974442664173
0.97604874197


In [25]:
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=data)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [26]:
print(rf_fit3.auc(xval=True))

0.974657178809
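
The model above is trained on the full data set with `nfolds=5`, so the reported AUC is a cross-validated estimate rather than a single test-split value. The sketch below illustrates the general k-fold idea in plain Python; it assumes a simple modulo fold assignment, which may differ from the assignment H2O uses by default, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, n_folds=5):
    """Yield (train_rows, held_out_rows) pairs, one per fold.

    Rows are assigned to folds by index modulo n_folds (an assumption made
    for illustration; H2O may assign folds differently).
    """
    folds = [[] for _ in range(n_folds)]
    for row in range(n_rows):
        folds[row % n_folds].append(row)
    for held_out in range(n_folds):
        train_rows = [r for i, fold in enumerate(folds)
                      if i != held_out for r in fold]
        yield train_rows, folds[held_out]

# Each row is held out exactly once across the five folds
pairs = list(k_fold_indices(12, n_folds=5))
for train_rows, held_out_rows in pairs:
    print("train on %d rows, evaluate on %s" % (len(train_rows), held_out_rows))
```

Each model is evaluated only on rows it did not see during training, and the fold results are then combined into one cross-validated metric.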


### Apply Prediction Model to Exoplanet Candidates

In [27]:
rfpredict = rf_fit1.predict(candidateData)

drf prediction progress: |████████████████████████████████████████████████| 100%


In [28]:
rfcandidateCombine = candidateData.cbind(rfpredict)
h2o.export_file(rfcandidateCombine, "data/candidate_predict_rf.csv")

Export File progress: |███████████████████████████████████████████████████| 100%


In [29]:
rfcandidatePredictData = pd.read_csv('data/candidate_predict_rf.csv').sort_values('CONFIRMED')

#### Most likely to be false positives

In [30]:
pd.DataFrame(rfcandidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[0:10]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 1527 | 9026248 | FALSE POSITIVE | 0.0 | 1.0 |
| 1483 | 8741367 | FALSE POSITIVE | 0.0 | 1.0 |
| 1146 | 5881307 | FALSE POSITIVE | 0.0 | 1.0 |
| 1477 | 10552263 | FALSE POSITIVE | 0.0 | 1.0 |
| 1250 | 5903301 | FALSE POSITIVE | 0.0 | 1.0 |
| 646 | 7303287 | FALSE POSITIVE | 0.0 | 1.0 |
| 1896 | 9098590 | FALSE POSITIVE | 0.0 | 1.0 |
| 1473 | 9466486 | FALSE POSITIVE | 0.0 | 1.0 |
| 1601 | 11860294 | FALSE POSITIVE | 0.0 | 1.0 |
| 2085 | 7695445 | FALSE POSITIVE | 0.0 | 1.0 |


#### Most likely to be confirmed exoplanets

In [31]:
pd.DataFrame(rfcandidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[:-11:-1]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 343 | 5005121 | CONFIRMED | 0.997559 | 0.002441 |
| 381 | 11852982 | CONFIRMED | 0.991005 | 0.008995 |
| 684 | 10155434 | CONFIRMED | 0.98919 | 0.01081 |
| 712 | 9266431 | CONFIRMED | 0.989068 | 0.010932 |
| 350 | 11192235 | CONFIRMED | 0.988824 | 0.011176 |
| 464 | 7749773 | CONFIRMED | 0.982855 | 0.017145 |
| 669 | 5978361 | CONFIRMED | 0.982461 | 0.017539 |
| 16 | 7135852 | CONFIRMED | 0.977588 | 0.022412 |
| 549 | 6291837 | CONFIRMED | 0.97646 | 0.02354 |
| 306 | 8410415 | CONFIRMED | 0.975744 | 0.024256 |


### Model 3: Gradient Boosting Machine

In [32]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [33]:
# Initialize and train the GBM estimator:

gbm_fit1 = H2OGradientBoostingEstimator(model_id='gbm_fit1', seed=1)
gbm_fit1.train(x=x, y=y, training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [34]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500, seed=1)
gbm_fit2.train(x=x, y=y, training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [35]:
gbm_perf1 = gbm_fit1.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
print(gbm_perf1.confusion_matrix())
print(gbm_perf2.confusion_matrix())

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431513158048: 


| | CONFIRMED | FALSE POSITIVE | Error | Rate |
|---|---|---|---|---|
| CONFIRMED | 293.0 | 53.0 | 0.1532 | (53.0/346.0) |
| FALSE POSITIVE | 43.0 | 711.0 | 0.057 | (43.0/754.0) |
| Total | 336.0 | 764.0 | 0.0873 | (96.0/1100.0) |



Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.498293114556: 


| | CONFIRMED | FALSE POSITIVE | Error | Rate |
|---|---|---|---|---|
| CONFIRMED | 306.0 | 40.0 | 0.1156 | (40.0/346.0) |
| FALSE POSITIVE | 40.0 | 714.0 | 0.0531 | (40.0/754.0) |
| Total | 346.0 | 754.0 | 0.0727 | (80.0/1100.0) |





In [36]:
# Retrieve test set AUC
print(gbm_perf1.auc())
print(gbm_perf2.auc())

0.97362620935
0.978592017908


### Apply Prediction Model to Exoplanet Candidates

In [37]:
gbmpredict = gbm_fit1.predict(candidateData)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [38]:
gbmcandidateCombine = candidateData.cbind(gbmpredict)
h2o.export_file(gbmcandidateCombine, "data/candidate_predict_gbm.csv")

Export File progress: |███████████████████████████████████████████████████| 100%


In [39]:
gbmcandidatePredictData = pd.read_csv('data/candidate_predict_gbm.csv').sort_values('CONFIRMED')

#### Most likely to be false positives

In [40]:
pd.DataFrame(gbmcandidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[0:10]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 2120 | 5858919 | FALSE POSITIVE | 0.00356 | 0.99644 |
| 1286 | 9655129 | FALSE POSITIVE | 0.003914 | 0.996086 |
| 1850 | 5388229 | FALSE POSITIVE | 0.004524 | 0.995476 |
| 1740 | 5716244 | FALSE POSITIVE | 0.00466 | 0.99534 |
| 1665 | 4255944 | FALSE POSITIVE | 0.004888 | 0.995112 |
| 1585 | 7458743 | FALSE POSITIVE | 0.004979 | 0.995021 |
| 1798 | 11495989 | FALSE POSITIVE | 0.005548 | 0.994452 |
| 1948 | 4862924 | FALSE POSITIVE | 0.005708 | 0.994292 |
| 1504 | 8868481 | FALSE POSITIVE | 0.006043 | 0.993957 |
| 1473 | 9466486 | FALSE POSITIVE | 0.006083 | 0.993917 |


#### Most likely to be confirmed exoplanets

In [41]:
pd.DataFrame(gbmcandidatePredictData, columns = ['kepid', 'predict','CONFIRMED','FALSE POSITIVE'])[:-11:-1]

| | kepid | predict | CONFIRMED | FALSE POSITIVE |
|---|---|---|---|---|
| 218 | 3629967 | CONFIRMED | 0.955072 | 0.044928 |
| 464 | 7749773 | CONFIRMED | 0.953844 | 0.046156 |
| 221 | 3448130 | CONFIRMED | 0.953196 | 0.046804 |
| 543 | 4815520 | CONFIRMED | 0.949934 | 0.050066 |
| 185 | 6049190 | CONFIRMED | 0.949532 | 0.050468 |
| 656 | 9957627 | CONFIRMED | 0.949176 | 0.050824 |
| 332 | 9472000 | CONFIRMED | 0.947182 | 0.052818 |
| 261 | 7761918 | CONFIRMED | 0.946757 | 0.053243 |
| 616 | 11394027 | CONFIRMED | 0.945315 | 0.054685 |
| 344 | 9003401 | CONFIRMED | 0.944689 | 0.055311 |


In [43]:
h2o.cluster().shutdown(prompt=False)

H2O session _sid_a646 closed.
