# Machine Learning Engineering Nanodegree
## Capstone Project
### Yeshaswini Mohan
### August 12, 2018
## Exoplanet Hunting in Deep Space

This project involves creating a supervised learning model that predicts whether a star system has a planet. There are two datasets: exoTrain.csv and exoTest.csv. The data from exoTrain.csv is used to train the model, and exoTest.csv is used to test the trained model.

The exoTrain.csv file has 5087 observations with 37 confirmed exoplanets and the exoTest.csv file has 570 observations with 5 confirmed exoplanets. 

The algorithm chosen for this project is a DecisionTreeClassifier. The quality of the classification is measured with an accuracy score and an F-beta score (beta = 0.5). However, it would also be interesting to see which systems the classifier predicts to have planets and to compare them with the actual labels.

Lastly, in addition to the DecisionTreeClassifier, XGBoost and LightGBM models will be trained so that their results can be compared with those of the DecisionTreeClassifier.

## Preprocessing
### Downloading and Splitting Training Data 

The data from exoTrain.csv will be loaded and split into a training and testing set.

The first column of the dataset is the label, which is passed to train_test_split as the Y values. A label of 2 marks a system with exoplanets, and a label of 1 marks a system without exoplanets. There are therefore 37 systems labeled 2, and the rest are labeled 1.

In [1]:
#Importing necessary libraries for splitting data into training and testing sets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

#Importing exoTrain.csv
data = pd.read_csv("exoTrain.csv")

"""
The data from both datasets have their label array as their first column.
The dataset needs to be separated into the label and data inputs for the train_test_split function.
"""
X= data.iloc[:,1:]
Y = data.iloc[:,0]

#Splitting data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.05, random_state=28)
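
With only 37 positive labels among 5087 rows, a 5% testing set can end up with very few exoplanet systems. The sketch below shows one possible way to keep the label ratio similar in both parts by passing `stratify` to `train_test_split`. It runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical names, not the notebook's variables), so it illustrates the option without reproducing the project's split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same imbalance as exoTrain.csv:
# 5087 rows, 37 of them labeled 2 (exoplanet), the rest labeled 1.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(5087, 4))
y_demo = np.ones(5087, dtype=int)
y_demo[:37] = 2

# stratify=y_demo keeps the ratio of labels roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=28, stratify=y_demo
)

print('Rows in testing part:', len(y_te))
print('Exoplanet rows in training part:', int((y_tr == 2).sum()))
print('Exoplanet rows in testing part:', int((y_te == 2).sum()))
```

In the notebook itself, the equivalent change would be adding `stratify=Y` to the existing call; the scores reported below were obtained without it.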



## Implementation and Refinement

In this portion, a baseline model is created for each algorithm. The accuracy and F-beta score (beta = 0.5) are calculated for both the training and testing data obtained above.

A tuned model is then created, with GridSearchCV used to search for parameters that give a better score.

To check this model, all of the X values from exoTrain.csv are passed to it and the predicted Y labels are compared with the original Y values. A list is generated showing which systems the tuned model predicts to have exoplanets.

This will be done for all three algorithms. 
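
Because the same fit, predict, and score steps are repeated for every model, they could be collected in one helper. The following is a minimal sketch under that assumption: `evaluate` is a hypothetical function that does not appear in the notebook, and it is demonstrated on synthetic data from `make_classification` rather than on the exoplanet data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, y_train, X_test, y_test, beta=0.5):
    """Fit the model, then return accuracy and weighted F-beta on both splits."""
    model.fit(X_train, y_train)
    scores = {}
    for name, X_part, y_part in [('train', X_train, y_train),
                                 ('test', X_test, y_test)]:
        pred = model.predict(X_part)
        scores[name + '_accuracy'] = accuracy_score(y_part, pred)
        scores[name + '_fbeta'] = fbeta_score(
            y_part, pred, beta=beta, average='weighted'
        )
    return scores


# Synthetic demonstration data (not the exoplanet data)
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scores = evaluate(DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te)
for key, value in scores.items():
    print(key, value)
```

Any estimator with `fit` and `predict` methods, including `XGBClassifier` and `LGBMClassifier`, could be passed in the same way.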

### Decision Tree Classifier 

#### Baseline Model

In [2]:
#Importing DecisionTreeClassifier and scoring metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import fbeta_score

#Baseline model being used
model10 = DecisionTreeClassifier()
mod10 = model10.fit(X_train, y_train)

#Finding the train and test predicted output values 
y_train_pred10 = mod10.predict(X_train)
y_test_pred10 = mod10.predict(X_test)

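#Training and testing accuracy scores
train_accuracy10 = accuracy_score(y_train, y_train_pred10)
test_accuracy10 = accuracy_score(y_test, y_test_pred10)

#Training and testing fbeta scores (beta=0.5 weights precision more than recall, so this is not the f1score)
#With no average argument, fbeta_score uses average='binary' and pos_label=1,
#so these two scores describe label 1 (systems without exoplanets)
train_fbe10 = fbeta_score(y_train, y_train_pred10, beta=0.5)
test_fbe10 = fbeta_score(y_test, y_test_pred10, beta=0.5)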

#Printing scores
print('The training accuracy is', train_accuracy10)
print('The testing accuracy is', test_accuracy10)
print('The training fbeta is', train_fbe10)
print('The testing fbeta is', test_fbe10)

The training accuracy is 1.0
The testing accuracy is 0.988235294118
The training fbeta is 1.0
The testing fbeta is 0.997624703088
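
The choice of `average` and `pos_label` changes what an F-beta score says about the exoplanet class. The toy example below (made-up labels, not project data) compares three settings for a classifier that never predicts an exoplanet. It assumes scikit-learn 0.22 or later, where `fbeta_score` accepts `zero_division`.

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Toy labels: 95 stars without a planet (1) and 5 with a planet (2)
y_true = np.array([1] * 95 + [2] * 5)
# A classifier that never predicts an exoplanet
y_pred = np.ones(100, dtype=int)

# Default settings: average='binary', pos_label=1, so label 1 is scored
f_default = fbeta_score(y_true, y_pred, beta=0.5)

# Weighted average over both labels, dominated by the 95 label-1 rows
f_weighted = fbeta_score(
    y_true, y_pred, beta=0.5, average='weighted', zero_division=0
)

# Score for the exoplanet label alone
f_exo = fbeta_score(y_true, y_pred, beta=0.5, pos_label=2, zero_division=0)

print('default (label 1):', f_default)
print('weighted:', f_weighted)
print('exoplanet label only:', f_exo)
```

The first two scores stay above 0.9 even though no exoplanet is found, while the score for label 2 alone is 0. Reporting the label-2 score next to the others is one way to make that visible.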


#### Finding Best Parameters

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn import datasets

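parameters = {'min_samples_split':[2,3,4,5,6,7,8,9,10,11,12,13,14,15],
              'min_samples_leaf':[1,2,3,4,5,6,7,8,9,10]}

#Note: this search is fitted on scikit-learn's iris dataset rather than on X_train and y_train,
#so the returned parameters are not tuned to the exoplanet data
iris = datasets.load_iris()
gsv = GridSearchCV(model10,parameters,cv=10)
gsv.fit(iris.data, iris.target)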
gsv.best_params_

{'min_samples_leaf': 3, 'min_samples_split': 2}
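
The search above is fitted on the iris data. As a hedged alternative, the sketch below runs the same kind of search on the data being modeled, with stratified folds and an F-beta scorer for label 2. It uses a synthetic imbalanced dataset as a stand-in for `X_train` and `y_train`; the names `X_demo`, `y_demo`, `scorer`, and `search`, the smaller grid, and the 5-fold setting are all assumptions made for illustration. It also assumes scikit-learn 0.22 or later for `zero_division`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for X_train / y_train (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=0
)
y_demo = y_demo + 1  # relabel 0/1 as 1/2 to mirror the exoplanet labels

# Score each candidate by F-beta on label 2 instead of by accuracy
scorer = make_scorer(fbeta_score, beta=0.5, pos_label=2, zero_division=0)

# A reduced grid to keep the example quick
param_grid = {'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 3, 5]}

# Stratified folds keep some label-2 rows in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, scoring=scorer, cv=cv
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

In the notebook, the corresponding call would be `gsv.fit(X_train, y_train)`. The parameters it returns may differ from the ones used in the next cell.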

#### Training Model

In [4]:
model1 = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=2)

#Model trained with training data
mod1 = model1.fit(X_train, y_train)

# Making predictions
y_train_pred1 = mod1.predict(X_train)
y_test_pred1 = mod1.predict(X_test)

# Calculating accuracies
train_accuracy1 = accuracy_score(y_train, y_train_pred1)
test_accuracy1 = accuracy_score(y_test, y_test_pred1)
train_fbe1 = fbeta_score(y_train, y_train_pred1, average='weighted', beta=0.5)
test_fbe1 = fbeta_score(y_test, y_test_pred1, average='weighted', beta=0.5)

#Printing scores
print('The training accuracy is', train_accuracy1)
print('The testing accuracy is', test_accuracy1)
print('The training fbeta is', train_fbe1)
print('The testing fbeta is', test_fbe1)

The training accuracy is 0.997723509934
The testing accuracy is 1.0
The training fbeta is 0.997584767332
The testing fbeta is 1.0


#### Testing model with Training Data

In [5]:
#Predicted labels for all of exoTrain.csv (training and held-out rows together)
Y_pred1 = mod1.predict(X)

#Accuracy score and fbeta score for the predictions on all of exoTrain.csv
train_accuracy_TestX1 = accuracy_score(Y, Y_pred1)
test_fbe_TestX1 = fbeta_score(Y, Y_pred1,average='weighted', beta=0.5)

#Printing scores
print('The testing accuracy is', train_accuracy_TestX1)
print('The testing fbeta is', test_fbe_TestX1)

The testing accuracy is 0.997837625319
The testing fbeta is 0.997705677837


#### List of Predicted Exoplanet Systems

In [6]:
#Printing list of stars with exoplanets
print('The stars from the list that are predicted to have exoplanets are:')

for i in range(len(Y_pred1)):
    if Y_pred1[i]==2:
        print(i)

The stars from the list that are predicted to have exoplanets are:
0
1
2
3
4
5
6
7
8
10
11
15
16
17
18
19
20
21
22
24
26
28
30
31
33
34
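
To compare a list like the one above with the actual labels, the predicted indices can be separated into correct detections, false alarms, and missed systems. The helper below is a minimal sketch of that comparison; `compare_with_labels` is a hypothetical name, and the example call uses toy labels rather than project data.

```python
import numpy as np


def compare_with_labels(y_true, y_pred, positive=2):
    """Split row indices into hits, false alarms, and misses for one label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    predicted = y_pred == positive
    actual = y_true == positive
    return {
        'correct': np.flatnonzero(predicted & actual).tolist(),
        'false_alarm': np.flatnonzero(predicted & ~actual).tolist(),
        'missed': np.flatnonzero(~predicted & actual).tolist(),
    }


# Toy example: rows 0, 1, and 2 truly host a planet
result = compare_with_labels([2, 2, 2, 1, 1, 1], [2, 2, 1, 1, 2, 1])
print(result)
# {'correct': [0, 1], 'false_alarm': [4], 'missed': [2]}
```

With the notebook's variables, the call would be `compare_with_labels(Y, Y_pred1)`.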


### XGBoost

#### Baseline Model

In [7]:
#Import XGBoost
from xgboost import XGBClassifier
model20 = XGBClassifier()

#Fitting training data to baseline model
mod20 = model20.fit(X_train,y_train)

#Using trained model to predict training and testing data
y_train_pred20 = mod20.predict(X_train)
y_test_pred20 = mod20.predict(X_test)

#Obtaining accuracy score
train_accuracy20 = accuracy_score(y_train, y_train_pred20)
test_accuracy20 = accuracy_score(y_test, y_test_pred20)

#Obtaining fbeta score
train_fbe20 = fbeta_score(y_train, y_train_pred20, average='weighted', beta=0.5)
test_fbe20 = fbeta_score(y_test, y_test_pred20, average='weighted', beta=0.5)

#Printing scores
print('The training accuracy is', train_accuracy20)
print('The testing accuracy is', test_accuracy20)
print('The training fbeta is', train_fbe20)
print('The testing fbeta is', test_fbe20)

The training accuracy is 0.999793046358
The testing accuracy is 1.0
The training fbeta is 0.999792159263
The testing fbeta is 1.0


#### Finding Best Parameters

In [8]:
#Baseline model parameters
paramxg = model20.get_params()
print(paramxg)

{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'n_jobs': 1, 'nthread': None, 'objective': 'binary:logistic', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': None, 'silent': True, 'subsample': 1}


In [9]:
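#Getting best parameters for XGBClassifier
paramxg2 = {'max_depth': [1,2,3,4,5,6,7], 
           'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5]}

#Note: this search is fitted on scikit-learn's iris dataset rather than on X_train and y_train,
#so the returned parameters are not tuned to the exoplanet data
iris = datasets.load_iris()
gsv = GridSearchCV(model20,paramxg2,cv=2)
gsv.fit(iris.data, iris.target)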
gsv.best_params_

{'learning_rate': 0.1, 'max_depth': 1}

#### Training Model

In [10]:
#Training XGBoost model with obtained parameters
model2= XGBClassifier(learning_rate=0.1, max_depth = 1)

#Fitting training data to trained model
mod2 = model2.fit(X_train,y_train)

#Using trained model to predict training and testing data
y_train_pred2 = mod2.predict(X_train)
y_test_pred2 = mod2.predict(X_test)

#Obtaining accuracy score
train_accuracy2 = accuracy_score(y_train, y_train_pred2)
test_accuracy2 = accuracy_score(y_test, y_test_pred2)

#Obtaining fbeta score
train_fbe2 = fbeta_score(y_train, y_train_pred2, average='weighted', beta=0.5)
test_fbe2 = fbeta_score(y_test, y_test_pred2, average='weighted', beta=0.5)

#Printing scores
print('The training accuracy is', train_accuracy2)
print('The testing accuracy is', test_accuracy2)
print('The training fbeta is', train_fbe2)
print('The testing fbeta is', test_fbe2)

The training accuracy is 0.992342715232
The testing accuracy is 1.0
The training fbeta is 0.986254470741
The testing fbeta is 1.0




#### Testing model with Training Data

In [11]:
#Predicted labels for all of exoTrain.csv (training and held-out rows together)
Y_pred2 = mod2.predict(X)

#Accuracy score and fbeta score for the predictions on all of exoTrain.csv
train_accuracy_TestX2 = accuracy_score(Y, Y_pred2)
test_fbe_TestX2 = fbeta_score(Y, Y_pred2,average='weighted', beta=0.5)

#Printing scores
print('The testing accuracy is', train_accuracy_TestX2)
print('The testing fbeta is', test_fbe_TestX2)

The testing accuracy is 0.992726557893
The testing fbeta is 0.986941711426




#### List of Predicted Exoplanet Systems

In [12]:
#Printing list of stars with exoplanets
print('The stars from the list that are predicted to have exoplanets are:')

for i in range(len(Y_pred2)):
    if Y_pred2[i]==2:
        print(i)

The stars from the list that are predicted to have exoplanets are:
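
The empty list above, together with the 0.9927 accuracy reported for the full exoTrain.csv data, is consistent with a model that assigns label 1 to every star. The short calculation below works this out from the counts given in the introduction (5087 rows with 37 exoplanets, and 570 rows with 5 exoplanets). The variable names are illustrative only.

```python
# Counts quoted in the introduction
n_train_rows, n_train_exo = 5087, 37
n_test_rows, n_test_exo = 570, 5

# Accuracy of a classifier that labels every star 1 (no exoplanet)
baseline_train = (n_train_rows - n_train_exo) / n_train_rows
baseline_test = (n_test_rows - n_test_exo) / n_test_rows

print(round(baseline_train, 6))  # 0.992727
print(round(baseline_test, 6))   # 0.991228
```

Any model has to beat these baselines before its accuracy says something about finding exoplanets, which is why the list of predicted systems is a useful check alongside the scores.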


### LightGBM

#### Baseline Model

In [13]:
#Import LightGBM
import lightgbm as lgb

model30 = lgb.LGBMClassifier()

mod30 = model30.fit(X_train,y_train)

#Using trained model to predict training and testing data
y_train_pred30 = mod30.predict(X_train)
y_test_pred30 = mod30.predict(X_test)

#Obtaining accuracy score
train_accuracy30 = accuracy_score(y_train, y_train_pred30)
test_accuracy30 = accuracy_score(y_test, y_test_pred30)

#Obtaining fbeta score
train_fbe30 = fbeta_score(y_train, y_train_pred30, average='weighted', beta=0.5)
test_fbe30 = fbeta_score(y_test, y_test_pred30, average='weighted', beta=0.5)

#Printing scores
print('The training accuracy is', train_accuracy30)
print('The testing accuracy is', test_accuracy30)
print('The training fbeta is', train_fbe30)
print('The testing fbeta is', test_fbe30)

The training accuracy is 1.0
The testing accuracy is 1.0
The training fbeta is 1.0
The testing fbeta is 1.0


#### Finding Best Parameters

In [14]:
#Baseline model parameters
paramlgbm = mod30.get_params()
print(paramlgbm)

{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}


In [15]:
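#Getting best parameters for LGBMClassifier
paramlgbm2 = {'learning_rate': [0.01, 0.1, 0.03, 0.3, 0.05, 0.5],
             'num_leaves': [5,10,20,30,40,50]}

#Note: this search is fitted on scikit-learn's iris dataset rather than on X_train and y_train,
#so the returned parameters are not tuned to the exoplanet data
iris = datasets.load_iris()
gsv = GridSearchCV(mod30,paramlgbm2,cv=10)
gsv.fit(iris.data, iris.target)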
gsv.best_params_

{'learning_rate': 0.5, 'num_leaves': 5}

#### Training Model

In [16]:
model3 = lgb.LGBMClassifier(learning_rate = 0.5, num_leaves = 5)

mod3 = model3.fit(X_train,y_train)

#Using trained model to predict training and testing data
y_train_pred3 = mod3.predict(X_train)
y_test_pred3 = mod3.predict(X_test)

    
#Obtaining accuracy score
train_accuracy3 = accuracy_score(y_train, y_train_pred3)
test_accuracy3 = accuracy_score(y_test, y_test_pred3)

#Obtaining fbeta score
train_fbe3 = fbeta_score(y_train, y_train_pred3, average='weighted', beta=0.5)
test_fbe3 = fbeta_score(y_test, y_test_pred3, average='weighted', beta=0.5)

#Printing scores
print('The training accuracy is', train_accuracy3)
print('The testing accuracy is', test_accuracy3)
print('The training fbeta is', train_fbe3)
print('The testing fbeta is', test_fbe3)

The training accuracy is 0.984271523179
The testing accuracy is 0.996078431373
The training fbeta is 0.986030825161
The testing fbeta is 0.999213217939




#### Testing model with Training Data

In [17]:
#Predicted labels for all of exoTrain.csv (training and held-out rows together)
Y_pred3 = mod3.predict(X)

#Accuracy score and fbeta score for the predictions on all of exoTrain.csv
train_accuracy_TestX3 = accuracy_score(Y, Y_pred3)
test_fbe_TestX3 = fbeta_score(Y, Y_pred3,average='weighted', beta=0.5)

#Printing scores
print('The testing accuracy is', train_accuracy_TestX3)
print('The testing fbeta is', test_fbe_TestX3)

The testing accuracy is 0.984863377236
The testing fbeta is 0.986678111086


#### List of Predicted Exoplanet Systems

In [18]:
#Printing list of stars with exoplanets
print('The stars from the list that are predicted to have exoplanets are:')

for i in range(len(Y_pred3)):
    if Y_pred3[i]==2:
        print(i)

The stars from the list that are predicted to have exoplanets are:
1
9
22
25
33
239
501
519
586
798
893
1010
1078
1157
1188
1210
1236
1639
1660
1780
1827
1866
2016
2101
2106
2164
2385
2405
2443
2488
2678
2689
3274
3326
3462
3560
3669
3872
3890
3978
3989
4119
4122
4579
4618
4709
4859
4914
5017
5049


## Testing Trained Models

The trained models are now evaluated on exoTest.csv, which was not used for training. This is done for all three algorithms.

### Downloading and Splitting Testing Data

The data from exoTest.csv will be loaded and separated into labels and inputs. It is not split any further, because all of exoTest.csv serves as testing data.

As with the exoTrain data, the first column of the dataset is the label. A label of 2 marks a system with exoplanets, and a label of 1 marks a system without exoplanets. There are therefore 5 systems labeled 2, and the rest are labeled 1.

In [19]:
#Importing exoTest.csv
data_Test = pd.read_csv("exoTest.csv")

"""
As before, the dataset needs to be separated into the label and data inputs. No train_test_split is applied to exoTest.csv.
"""
new_X = data_Test.iloc[:,1:]
new_Y= data_Test.iloc[:,0]

## Predicting Using Trained Models

To test each model, the new_X values obtained above are passed to it and the predicted Y labels are compared with the original new_Y values. A list is generated showing which systems the trained model predicts to have exoplanets.

This will be done for all three algorithms. 

### DecisionTreeClassifier

#### Using Trained Model to Predict Testing Data

In [20]:
#Predicted label value from model trained on training set
new_Y_pred1 = mod1.predict(new_X)

#Accuracy score and fbeta score for predicted value from test dataset
train_accuracy_Test1 = accuracy_score(new_Y, new_Y_pred1)
test_fbe_Test1 = fbeta_score(new_Y, new_Y_pred1,average='weighted', beta=0.5)

#Printing scores
print('The testing accuracy is', train_accuracy_Test1)
print('The testing fbeta is', test_fbe_Test1)

The testing accuracy is 0.985964912281
The testing fbeta is 0.98790437869


#### List of Predicted Exoplanet Systems

In [21]:
#Printing list of stars with exoplanets
print('The stars from the list that are predicted to have exoplanets are:')

for i in range(len(new_Y_pred1)):
    if new_Y_pred1[i]==2:
        print(i)

The stars from the list that are predicted to have exoplanets are:
0
1
25
124
246
264
366


### XGBoost

#### Using Trained Model to Predict Testing Data

In [22]:
#Predicted label value from model trained on training set
new_Y_pred2 = mod2.predict(new_X)

#Accuracy score and fbeta score for predicted value from test dataset
train_accuracy_Test2 = accuracy_score(new_Y, new_Y_pred2)
test_fbe_Test2 = fbeta_score(new_Y, new_Y_pred2,average='weighted', beta=0.5)

#Printing scores
print('The testing accuracy is', train_accuracy_Test2)
print('The testing fbeta is', test_fbe_Test2)

The testing accuracy is 0.991228070175
The testing fbeta is 0.984259858786




#### List of Predicted Exoplanet Systems

In [23]:
#Printing list of stars with exoplanets

print('The stars from the list that are predicted to have exoplanets are:')

for i in range(len(new_Y_pred2)):
    if new_Y_pred2[i]==2:
        print(i)

The stars from the list that are predicted to have exoplanets are:


### LightGBM

#### Using Trained Model to Predict Testing Data

In [24]:
#Predicted label value from model trained on training set
new_Y_pred3 = mod3.predict(new_X)

#Accuracy score and fbeta score for predicted value from test dataset
train_accuracy_Test3 = accuracy_score(new_Y, new_Y_pred3)
test_fbe_Test3 = fbeta_score(new_Y, new_Y_pred3,average='weighted', beta=0.5)

#Printing scores
print('The testing accuracy is', train_accuracy_Test3)
print('The testing fbeta is', test_fbe_Test3)

The testing accuracy is 0.980701754386
The testing fbeta is 0.982092327593


#### List of Predicted Exoplanet Systems

In [25]:
#Printing list of stars with exoplanets

print('The stars from the list that are predicted to have exoplanets are:')

for i in range(len(new_Y_pred3)):
    if new_Y_pred3[i]==2:
        print(i)

The stars from the list that are predicted to have exoplanets are:
165
178
232
249
262
545
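
A confusion matrix gives the counts behind lists like the ones above and makes the precision and recall for label 2 explicit. The sketch below uses toy labels rather than the exoTest.csv results, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = no exoplanet, 2 = exoplanet
y_true = np.array([2, 2, 1, 1, 1, 1, 1, 1])
y_pred = np.array([2, 1, 1, 1, 2, 1, 1, 1])

# Rows are true labels and columns are predicted labels, in the order [1, 2]
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
tn, fp, fn, tp = cm.ravel()

# Precision and recall for the exoplanet label, guarding against empty denominators
precision_exo = tp / (tp + fp) if (tp + fp) else 0.0
recall_exo = tp / (tp + fn) if (tp + fn) else 0.0

print(cm)
print('precision for label 2:', precision_exo)
print('recall for label 2:', recall_exo)
```

With the notebook's variables, `confusion_matrix(new_Y, new_Y_pred1, labels=[1, 2])` would give the corresponding counts for the decision tree, and likewise for the other two models.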
