# Applying Machine Learning to Exoplanet False Positives

### This file was produced by the NASA Exoplanet Archive  http://exoplanetarchive.ipac.caltech.edu
### Wed Jun 21 13:19:21 2017

#### COLUMN koi_disposition: Exoplanet Archive Disposition
#### COLUMN koi_period:     Orbital Period [days]
#### COLUMN koi_time0bk:    Transit Epoch [BKJD]
#### COLUMN koi_impact:     Impact Parameter
#### COLUMN koi_duration:   Transit Duration [hrs]
#### COLUMN koi_depth:      Transit Depth [ppm]
#### COLUMN koi_prad:       Planetary Radius [Earth radii]
#### COLUMN koi_teq:        Equilibrium Temperature [K]
#### COLUMN koi_insol:      Insolation Flux [Earth flux]
#### COLUMN koi_model_snr:  Transit Signal-to-Noise
#### COLUMN koi_tce_plnt_num: TCE Planet Number
#### COLUMN koi_tce_delivname: TCE Delivery
#### COLUMN koi_steff:      Stellar Effective Temperature [K]
#### COLUMN koi_slogg:      Stellar Surface Gravity [log10(cm/s**2)]
#### COLUMN koi_srad:       Stellar Radius [Solar radii]
#### COLUMN ra:             RA [decimal degrees]
#### COLUMN dec:            Dec [decimal degrees]
#### COLUMN koi_kepmag:     Kepler-band [mag]

In [1]:
import h2o
# Start (or connect to) a local H2O cluster restricted to a single thread
h2o.init(nthreads=1, max_mem_size=8)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpWZ6_mq
  JVM stdout: /tmp/tmpWZ6_mq/h2o_root_started_from_python.out
  JVM stderr: /tmp/tmpWZ6_mq/h2o_root_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 03 secs
H2O cluster version: 3.10.4.8
H2O cluster version age: 1 month
H2O cluster name: H2O_from_python_root_xbshgh
H2O cluster total nodes: 1
H2O cluster free memory: 7.111 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 1
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321


In [2]:
exoplanets_csv = "data/exoplanet_results.csv"
data = h2o.import_file(exoplanets_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [3]:
data.shape

(7316, 20)

In [4]:
data['is_exoplanet'] = data['is_exoplanet'].asfactor()
data['is_exoplanet'].levels()

[['0', '1']]

In [5]:
# 70% training, 15% validation; the remaining ~15% becomes the test set
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)

train = splits[0]
valid = splits[1]
test = splits[2]

In [6]:
# Row counts of the three splits
print(train.nrow)
print(valid.nrow)
print(test.nrow)

5150
1066
1100
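
`split_frame` assigns rows to splits at random, so the requested ratios are only met approximately. The following plain-Python check, which uses only the row counts printed above, shows how close the realized fractions come to 0.70 / 0.15 / 0.15. The variable names are illustrative and not part of the original notebook.

```python
from __future__ import division, print_function

# Row counts printed by the cell above
n_train, n_valid, n_test = 5150, 1066, 1100
n_total = n_train + n_valid + n_test  # should match data.nrow (7316)

# Ratios passed to split_frame; the last split receives the remainder
requested = {"train": 0.70, "valid": 0.15, "test": 0.15}
realized = {
    "train": n_train / n_total,
    "valid": n_valid / n_total,
    "test": n_test / n_total,
}

for name in ("train", "valid", "test"):
    print("%-5s requested %.2f  realized %.4f"
          % (name, requested[name], realized[name]))
```

The three counts add up to the 7316 rows reported by `data.shape`, and each realized fraction is within about half a percentage point of the requested one.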


In [7]:
y = 'is_exoplanet'
x = list(data.columns)

In [8]:
x.remove(y)        # remove the response
x.remove('rowid')  # remove the row id
x.remove('kepid')  # remove the Kepler ID (an identifier, not a predictor)
x

[u'koi_period',
 u'koi_time0bk',
 u'koi_impact',
 u'koi_duration',
 u'koi_depth',
 u'koi_prad',
 u'koi_teq',
 u'koi_insol',
 u'koi_model_snr',
 u'koi_tce_plnt_num',
 u'koi_tce_delivname',
 u'koi_steff',
 u'koi_slogg',
 u'koi_srad',
 u'ra',
 u'dec',
 u'koi_kepmag']
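
The same predictor list can be built in a single step by filtering out the response and the two identifier columns. The sketch below uses a plain list as a stand-in for `data.columns`; the position of `is_exoplanet` in that list is an assumption, and `excluded` is an illustrative name.

```python
# Stand-in for list(data.columns); the column order is assumed from the output above
columns = [
    'rowid', 'kepid', 'koi_period', 'koi_time0bk', 'koi_impact',
    'koi_duration', 'koi_depth', 'koi_prad', 'koi_teq', 'koi_insol',
    'koi_model_snr', 'koi_tce_plnt_num', 'koi_tce_delivname', 'koi_steff',
    'koi_slogg', 'koi_srad', 'ra', 'dec', 'koi_kepmag', 'is_exoplanet',
]

y = 'is_exoplanet'
excluded = {y, 'rowid', 'kepid'}  # response plus identifier columns

# Keep every column that is neither the response nor an identifier
x = [col for col in columns if col not in excluded]
print(len(x))
```

With a live frame, `columns` would simply be `list(data.columns)`. The comprehension leaves the original list untouched, which can be convenient if the full column list is needed again later.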

In [9]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [10]:
glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

In [11]:
glm_fit1.train(x=x, y=y, training_frame=train)


glm Model Build progress: |███████████████████████████████████████████████| 100%


In [12]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [13]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

In [14]:
# Full test-set performance reports for both models
print(glm_perf1)
print(glm_perf2)


ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.0940519022348
RMSE: 0.306678825866
LogLoss: 0.298286469598
Null degrees of freedom: 1099
Residual degrees of freedom: 1081
Null deviance: 1369.91752662
Residual deviance: 656.230233115
AIC: 694.230233115
AUC: 0.937075481823
Gini: 0.874150963647
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.442927585345: 


| Actual / Predicted | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 662.0 | 92.0 | 0.122 | (92.0/754.0) |
| 1 | 47.0 | 299.0 | 0.1358 | (47.0/346.0) |
| Total | 709.0 | 391.0 | 0.1264 | (139.0/1100.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.4429276 | 0.8113976 | 214.0 |
| max f2 | 0.1726848 | 0.8712513 | 304.0 |
| max f0point5 | 0.5566905 | 0.8078335 | 173.0 |
| max accuracy | 0.5330501 | 0.8754545 | 180.0 |
| max precision | 0.9952107 | 1.0 | 0.0 |
| max recall | 0.0127836 | 1.0 | 383.0 |
| max specificity | 0.9952107 | 1.0 | 0.0 |
| max absolute_mcc | 0.4429276 | 0.7199487 | 214.0 |
| max min_per_class_accuracy | 0.4255224 | 0.8687003 | 218.0 |


Gains/Lift Table: Avg response rate: 31.45 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 0.9553765 | 2.6011561 | 2.6011561 | 0.8181818 | 0.8181818 | 0.0260116 | 0.0260116 | 160.1156069 | 160.1156069 |
| 2 | 0.02 | 0.9352967 | 2.8901734 | 2.7456647 | 0.9090909 | 0.8636364 | 0.0289017 | 0.0549133 | 189.0173410 | 174.5664740 |
| 3 | 0.03 | 0.9108972 | 2.8901734 | 2.7938343 | 0.9090909 | 0.8787879 | 0.0289017 | 0.0838150 | 189.0173410 | 179.3834297 |
| 4 | 0.04 | 0.9004826 | 3.1791908 | 2.8901734 | 1.0 | 0.9090909 | 0.0317919 | 0.1156069 | 217.9190751 | 189.0173410 |
| 5 | 0.05 | 0.8920335 | 2.6011561 | 2.8323699 | 0.8181818 | 0.8909091 | 0.0260116 | 0.1416185 | 160.1156069 | 183.2369942 |
| 6 | 0.1 | 0.8203426 | 2.8323699 | 2.8323699 | 0.8909091 | 0.8909091 | 0.1416185 | 0.2832370 | 183.2369942 | 183.2369942 |
| 7 | 0.15 | 0.7692012 | 2.7745665 | 2.8131021 | 0.8727273 | 0.8848485 | 0.1387283 | 0.4219653 | 177.4566474 | 181.3102119 |
| 8 | 0.2 | 0.7073039 | 2.6589595 | 2.7745665 | 0.8363636 | 0.8727273 | 0.1329480 | 0.5549133 | 165.8959538 | 177.4566474 |
| 9 | 0.3 | 0.5403936 | 2.1676301 | 2.5722543 | 0.6818182 | 0.8090909 | 0.2167630 | 0.7716763 | 116.7630058 | 157.2254335 |
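
The `max f1` and `max absolute_mcc` values reported for the first model can be reproduced by hand from its confusion matrix, since both are attained at the same threshold (0.4429) at which the matrix is printed. The sketch below applies the standard textbook definitions to the four counts. It is a cross-check written for this walkthrough, not H2O's internal code.

```python
from __future__ import division, print_function
import math

# Counts from the first model's confusion matrix (rows = actual, columns = predicted)
tn, fp = 662, 92    # actual class 0
fn, tp = 47, 299    # actual class 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print("precision %.4f  recall %.4f" % (precision, recall))
print("F1 %.7f  accuracy %.7f  MCC %.7f" % (f1, accuracy, mcc))
```

F1 and MCC agree with the table (0.8113976 and 0.7199487). The accuracy computed here is slightly below the reported `max accuracy` of 0.8754545 because that maximum occurs at a different threshold (0.5330501).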





ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.0924844340845
RMSE: 0.304112535231
LogLoss: 0.293350494209
Null degrees of freedom: 1099
Residual degrees of freedom: 1080
Null deviance: 1369.91752662
Residual deviance: 645.371087259
AIC: 685.371087259
AUC: 0.938302080618
Gini: 0.876604161236
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.459917896577: 


| Actual / Predicted | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 672.0 | 82.0 | 0.1088 | (82.0/754.0) |
| 1 | 50.0 | 296.0 | 0.1445 | (50.0/346.0) |
| Total | 722.0 | 378.0 | 0.12 | (132.0/1100.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.4599179 | 0.8176796 | 205.0 |
| max f2 | 0.1986326 | 0.8721844 | 292.0 |
| max f0point5 | 0.5612604 | 0.8130329 | 175.0 |
| max accuracy | 0.4657436 | 0.88 | 203.0 |
| max precision | 0.9959224 | 1.0 | 0.0 |
| max recall | 0.0045545 | 1.0 | 390.0 |
| max specificity | 0.9959224 | 1.0 | 0.0 |
| max absolute_mcc | 0.4599179 | 0.7300918 | 205.0 |
| max min_per_class_accuracy | 0.4254614 | 0.8728324 | 215.0 |


Gains/Lift Table: Avg response rate: 31.45 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 0.9613985 | 2.6011561 | 2.6011561 | 0.8181818 | 0.8181818 | 0.0260116 | 0.0260116 | 160.1156069 | 160.1156069 |
| 2 | 0.02 | 0.9432196 | 2.8901734 | 2.7456647 | 0.9090909 | 0.8636364 | 0.0289017 | 0.0549133 | 189.0173410 | 174.5664740 |
| 3 | 0.03 | 0.9202747 | 3.1791908 | 2.8901734 | 1.0 | 0.9090909 | 0.0317919 | 0.0867052 | 217.9190751 | 189.0173410 |
| 4 | 0.04 | 0.9111951 | 2.8901734 | 2.8901734 | 0.9090909 | 0.9090909 | 0.0289017 | 0.1156069 | 189.0173410 | 189.0173410 |
| 5 | 0.05 | 0.9054832 | 2.8901734 | 2.8901734 | 0.9090909 | 0.9090909 | 0.0289017 | 0.1445087 | 189.0173410 | 189.0173410 |
| 6 | 0.1 | 0.8395551 | 2.7167630 | 2.8034682 | 0.8545455 | 0.8818182 | 0.1358382 | 0.2803468 | 171.6763006 | 180.3468208 |
| 7 | 0.15 | 0.7903935 | 2.7745665 | 2.7938343 | 0.8727273 | 0.8787879 | 0.1387283 | 0.4190751 | 177.4566474 | 179.3834297 |
| 8 | 0.2 | 0.7267433 | 2.6589595 | 2.7601156 | 0.8363636 | 0.8681818 | 0.1329480 | 0.5520231 | 165.8959538 | 176.0115607 |
| 9 | 0.3 | 0.5520553 | 2.2543353 | 2.5915222 | 0.7090909 | 0.8151515 | 0.2254335 | 0.7774566 | 125.4335260 | 159.1522158 |
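
The columns of the Gains/Lift table are related by simple ratios, which the sketch below works through for group 3 of the second model. It assumes that each of the first five groups holds 11 rows (1% of the 1100 test rows). That group size is inferred from the table, where the response rates are multiples of 1/11, and is not stated in the output.

```python
from __future__ import division, print_function

n_rows, n_pos = 1100, 346        # test rows and actual positives (confusion matrix totals)
avg_response = n_pos / n_rows    # reported as "Avg response rate: 31.45 %"

group_rows = 11                  # assumed: 1% of 1100 rows per group
group_pos = 11                   # group 3 of the second model has response_rate 1.0

response_rate = group_pos / group_rows   # share of positives inside the group
lift = response_rate / avg_response      # how much better than picking rows at random
capture_rate = group_pos / n_pos         # share of all positives found in this group
gain = 100 * (lift - 1)                  # lift expressed as a percentage improvement

print("lift %.7f  capture_rate %.7f  gain %.7f" % (lift, capture_rate, gain))
```

These values match the table row for group 3 (lift 3.1791908, capture rate 0.0317919, gain 217.9190751). A lift near 3 in the top groups means the model's highest-scoring candidates are confirmed about three times as often as the 31.45% base rate.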






In [15]:
# Retrieve test set AUC
print(glm_perf1.auc())
print(glm_perf2.auc())

0.937075481823
0.938302080618
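
Several of the scalar metrics in the reports above are tied to one another by simple identities: Gini = 2·AUC − 1, RMSE = √MSE, LogLoss = residual deviance / (2N) for a binomial model, and AIC = residual deviance + 2k, where k is the number of estimated parameters. The sketch below checks these against the printed numbers. It is a consistency check on the output, with `reported` and `derived` as illustrative names.

```python
from __future__ import division, print_function
import math

n_test = 1100  # rows in the test frame

# Values copied from the two printed performance reports
reported = {
    "glm_fit1": {"mse": 0.0940519022348, "rmse": 0.306678825866,
                 "logloss": 0.298286469598, "resid_dev": 656.230233115,
                 "aic": 694.230233115, "auc": 0.937075481823,
                 "gini": 0.874150963647},
    "glm_fit2": {"mse": 0.0924844340845, "rmse": 0.304112535231,
                 "logloss": 0.293350494209, "resid_dev": 645.371087259,
                 "aic": 685.371087259, "auc": 0.938302080618,
                 "gini": 0.876604161236},
}

derived = {}
for name in sorted(reported):
    m = reported[name]
    derived[name] = {
        "gini": 2 * m["auc"] - 1,
        "rmse": math.sqrt(m["mse"]),
        "logloss": m["resid_dev"] / (2 * n_test),
        # k implied by AIC = residual deviance + 2k
        "n_params": (m["aic"] - m["resid_dev"]) / 2,
    }
    print(name, derived[name])
```

The implied parameter counts (19 and 20) agree with the difference between null and residual degrees of freedom plus one for the intercept, so the lambda-search model retained one more coefficient than the default fit.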


In [16]:
# Training and validation AUC of the lambda-search model, for comparison with the test AUC
print(glm_fit2.auc(train=True))
print(glm_fit2.auc(valid=True))

0.933090149336
0.940118343195


In [64]:
candidate_csv = "data/candidates.csv"
candidateData = h2o.import_file(candidate_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [65]:
glmpredict = glm_fit2.predict(candidateData)

glm prediction progress: |████████████████████████████████████████████████| 100%


In [58]:
glmpredict.describe()

Rows:2248
Cols:3




| | predict | p0 | p1 |
|---|---|---|---|
| type | enum | real | real |
| mins | | 0.00200461660718 | 0.0 |
| mean | | 0.540992150735 | 0.459007849265 |
| maxs | | 1.0 | 0.997995383393 |
| sigma | | 0.326781910368 | 0.326781910368 |
| zeros | | 0 | 1 |
| missing | 0 | 0 | 0 |
| 0 | 1 | 0.598028562065 | 0.401971437935 |
| 1 | 1 | 0.465241847369 | 0.534758152631 |
| 2 | 1 | 0.326275602618 | 0.673724397382 |
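
Two properties of the prediction frame can be read off the three sample rows. First, `p0` and `p1` are complementary probabilities. Second, the `predict` label is not a 0.5 cut on `p1`: the first row has `p1` of about 0.40 yet is labeled 1. H2O's documentation describes binomial labels as being assigned with a threshold chosen to maximize F1 rather than a fixed 0.5. The threshold actually used here is not shown in the output, so the value 0.40 below is a hypothetical stand-in, picked only because it is consistent with these three rows.

```python
from __future__ import print_function

# (p0, p1) for the three sample rows shown above
rows = [
    (0.598028562065, 0.401971437935),
    (0.465241847369, 0.534758152631),
    (0.326275602618, 0.673724397382),
]

# p0 and p1 should sum to 1 in every row
sums = [p0 + p1 for p0, p1 in rows]

threshold = 0.40  # hypothetical; the real threshold is not printed in the notebook
labels = [1 if p1 >= threshold else 0 for _, p1 in rows]
labels_half = [1 if p1 >= 0.5 else 0 for _, p1 in rows]

print("sums:", sums)
print("labels at assumed threshold:", labels)
print("labels at 0.5:", labels_half)
```

A plain 0.5 cut would label the first row 0, which disagrees with the `predict` column. When ranking candidates for follow-up, it is therefore safer to work with `p1` directly than with the label alone.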


In [72]:
#header = ["Kepler ID", "IsExoplanet", "Exoplanet Probability"]
#table  = []
#for i in range(50):
    #table.append([int(candidateData[i,"kepid"]), glmpredict[i,"predict"], glmpredict[i,"p1"]])
#h2o.display.H2ODisplay(table, header)

In [73]:
candidateCombine = candidateData.cbind(glmpredict)
candidateCombine.describe()
h2o.export_file(candidateCombine, "data/candidate_predict.csv")

Rows:2248
Cols:23




Unnamed: 0,rowid,kepid,koi_disposition,koi_period,koi_time0bk,koi_impact,koi_duration,koi_depth,koi_prad,koi_teq,koi_insol,koi_model_snr,koi_tce_plnt_num,koi_tce_delivname,koi_steff,koi_slogg,koi_srad,ra,dec,koi_kepmag,predict,p0,p1
type,int,int,enum,real,real,real,real,real,real,int,real,real,int,enum,int,real,real,real,real,real,enum,real,real
mins,38.0,1026957.0,,0.259819659,129.7295,0.0,0.052,0.0,0.22,25.0,0.0,0.0,1.0,,2661.0,0.114,0.109,280.31488,36.74361,7.748,,0.00200461660718,0.0
mean,4953.38300712,7793188.69929,,130.530637572,170.09527256,0.537167871854,4.82746141459,1864.28773455,15.931610984,882.267734554,5356.90518298,45.5952402746,1.29481889042,,5639.76659039,4.33148100686,1.56631121281,291.789594786,43.9494900085,14.3381481317,,0.540992150735,0.459007849265
maxs,9562.0,12885212.0,,129995.7784,657.26877,64.5159,44.35,363130.0,13333.5,13184.0,7165673.12,6788.8,7.0,,10894.0,5.364,152.969,301.6618,52.220341,17.305,,1.0,0.997995383393
sigma,2609.69043662,2645966.94509,,2744.15611153,73.0627218262,1.99001464449,4.41385284965,12502.0940207,316.93728348,665.610849735,155887.724415,231.866563431,0.706922342861,,693.971279584,0.390856809511,5.8751460049,4.82784636236,3.6034967054,1.31090857203,,0.326781910368,0.326781910368
zeros,0,0,,0,0,3,0,1,0,0,1,1,0,,0,0,0,0,0,0,,0,1
missing,0,0,0,0,0,63,0,63,63,63,62,63,67,67,63,63,63,0,0,0,0,0,0
0,38.0,11138155.0,CANDIDATE,4.959319244,172.2585292,0.831,2.22739,9802.0,12.21,1103.0,349.4,696.5,1.0,q1_q17_dr25_tce,5712.0,4.359,1.082,292.16705,48.727589,15.263,1,0.598028562065,0.401971437935
1,59.0,11818800.0,CANDIDATE,40.4195037,173.56469,0.911,3.362,6256.0,7.51,467.0,11.29,36.9,1.0,q1_q17_dr25_tce,5446.0,4.507,0.781,294.31686,50.080231,15.487,1,0.465241847369,0.534758152631
2,63.0,11918099.0,CANDIDATE,7.2406612,137.75545,1.198,0.558,556.4,19.45,734.0,68.63,13.7,2.0,q1_q17_dr25_tce,5005.0,4.595,0.765,293.83331,50.23035,15.334,1,0.326275602618,0.673724397382


Export File progress: |███████████████████████████████████████████████████| 100%


In [74]:
h2o.cluster().shutdown(prompt=False)

H2O session _sid_a4c9 closed.
