# Lab Notebook 22: Discovery of exoplanets

We will investigate a problem that is very common in astrophysics: the detection of exoplanets. Exoplanets are planets that orbit a star other than our Sun. Planets do not shine on their own the way stars do, which makes detecting them quite tricky. A common technique is the transit method, which monitors the brightness of a star over time. If a planet orbits the star and its orbit crosses our line of sight, the star's brightness dips periodically, each time the planet passes in front of it.

The data we use here is derived from observations made by the NASA Kepler space telescope.

**Training set (exoTrain.csv):**

5087 rows or observations  
3198 columns or features  
Column 1 is the label vector, columns 2 - 3198 are the flux values over time

37 confirmed exoplanet-stars and 5050 non-exoplanet-stars

**Test set (exoTest.csv):**

570 rows or observations  
3198 columns or features  
Column 1 is the label vector, columns 2 - 3198 are the flux values over time

5 confirmed exoplanet-stars and 565 non-exoplanet-stars

We will now preprocess the data and then train a convolutional neural network to find these exoplanets. This is a hard problem because exoplanets are so rare, which makes the dataset very imbalanced.
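To get a feeling for how severe the imbalance is, here is a quick back-of-the-envelope check using the class counts listed above. It is only a sketch: the helper name `majority_baseline` is made up for illustration. It shows the accuracy and recall of a "classifier" that always answers "no exoplanet".

```python
def majority_baseline(n_pos, n_neg):
    """Accuracy and recall when every star is predicted to be a non-exoplanet star."""
    total = n_pos + n_neg
    accuracy = n_neg / total   # all negatives are right, all positives are wrong
    recall = 0.0               # no exoplanet star is ever found
    return accuracy, recall


# Class counts taken from the dataset description above
train_acc, train_rec = majority_baseline(n_pos=37, n_neg=5050)
test_acc, test_rec = majority_baseline(n_pos=5, n_neg=565)

print(f"train: accuracy={train_acc:.4f}, recall={train_rec}")
print(f"test:  accuracy={test_acc:.4f}, recall={test_rec}")
```

Both accuracies come out above 99% even though not a single planet is detected, which is worth keeping in mind when judging the models later.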

```python
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
import keras

from keras.models import Sequential  # the model is built adding layers one after the other
from keras.layers import Dense  # fully connected layers: every output talks to every input
from keras.layers import Dropout  # for regularization

from sklearn import metrics as sk_met
from sklearn.metrics import confusion_matrix
```

## Data acquisition

First, read the datasets into pandas DataFrames and check out the summary statistics via describe.

Reformat the data into Xtrain, ytrain, Xtest, ytest arrays. Note that the class labels here are 1 and 2, so it is best to subtract 1 to get the more conventional labels 0 and 1.

Plot the flux intensity over time for one example of an exoplanet star (e.g. index 0) and a non-exoplanet star (e.g. index 100). You should see a qualitative difference.
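One possible way to do the reformatting is sketched below. The helper `to_arrays` and the tiny demo frame (including its column names) are invented for illustration; the function only assumes that the first column holds the label, as described above. The commented lines at the end indicate how it could be used with the real files.

```python
import numpy as np
import pandas as pd


def to_arrays(df):
    """Split a frame whose first column is the label (1 or 2) into X and y."""
    y = df.iloc[:, 0].to_numpy().astype(int) - 1   # labels 1/2 -> 0/1
    X = df.iloc[:, 1:].to_numpy(dtype=float)       # remaining columns: flux values
    return X, y


# Tiny made-up frame standing in for exoTrain.csv (column names are hypothetical)
demo = pd.DataFrame(
    [[2, 0.5, -0.1, 0.3], [1, 0.0, 0.2, -0.4]],
    columns=["label", "f1", "f2", "f3"],
)
X_demo, y_demo = to_arrays(demo)
print(X_demo.shape, y_demo)

# With the real data (assuming the files sit next to the notebook):
# train = pd.read_csv("exoTrain.csv")
# train.describe()
# Xtrain, ytrain = to_arrays(train)
# plt.plot(Xtrain[0]); plt.plot(Xtrain[100])
```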

## Data preparation for CNN via FFT, scaling and filtering

In one of the last exercises, we used a convolutional neural network (CNN) to classify the MNIST dataset. The superior performance of this approach comes from the fact that the filters applied to the input gather and aggregate information from regions rather than just looking at individual values or pixels. This can also be useful for our problem; however, we first have to transform the data to make it better suited to the classification task.

The signature we are looking for is a periodic reduction of the light intensity, signaling the transit of an exoplanet. It would be beneficial if we could use this knowledge to extract the relevant information from the raw data. Looking at the Fourier transform of the data is therefore a natural first step.
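To illustrate why the Fourier transform helps, here is a toy example with a synthetic light curve. All numbers (length, period, dip width, dip depth, noise level) are invented for the demo and are not taken from the Kepler data. A periodic dip in the time domain shows up as a strong peak in the magnitude spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, period, width, depth = 3000, 100, 20, 0.01    # made-up demo values

t = np.arange(n)
flux = 1.0 + rng.normal(0.0, 0.001, size=n)      # flat star plus a little noise
flux[(t % period) < width] -= depth              # periodic transit-like dips

# Remove the mean so the zero-frequency bin does not dominate the spectrum
spectrum = np.abs(np.fft.rfft(flux - flux.mean()))
peak = int(np.argmax(spectrum[1:]) + 1)          # strongest bin, skipping bin 0

print("strongest frequency bin:", peak)
print("number of transits in the series:", n // period)
```

The strongest bin should coincide with the number of dips in the series, which is exactly the kind of compact signature a classifier can pick up.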

1. Apply a **fast Fourier transform** (from scipy.fftpack) to Xtrain and Xtest. The result is complex, so take the absolute value. Since the flux is real-valued, the magnitude spectrum is symmetric around zero frequency, so it is enough to keep only the first half of the spectrum.
2. Use *normalize* from sklearn.preprocessing to rescale each observation to a vector of length 1. This reduces large values.
3. Use *gaussian_filter* (from scipy.ndimage) with sigma=10 to reduce the noise.
4. Lastly, use MinMaxScaler from sklearn.preprocessing to keep the data between 0 and 1.

Show a plot of the data for the same two stars you looked at above after all this processing and filtering.
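The four steps above could be chained roughly as follows. This is a sketch under a few assumptions rather than the reference solution: it uses `gaussian_filter1d` with `axis=1` instead of `gaussian_filter`, so that smoothing runs along the frequency axis only and does not mix neighboring stars, and it fits the MinMaxScaler on the training data only. The helper names and the random stand-in data are made up.

```python
import numpy as np
from scipy.fftpack import fft
from scipy.ndimage import gaussian_filter1d
from sklearn.preprocessing import MinMaxScaler, normalize


def spectrum_half(X):
    """FFT magnitude along the time axis, keeping the first half of the bins."""
    n_keep = (X.shape[1] + 1) // 2           # 3197 time steps -> 1599 bins
    return np.abs(fft(X, axis=1))[:, :n_keep]


def preprocess(X_train, X_test, sigma=10):
    processed = []
    for X in (X_train, X_test):
        X = spectrum_half(X)
        X = normalize(X)                                # each row gets unit L2 norm
        X = gaussian_filter1d(X, sigma=sigma, axis=1)   # smooth along frequency only
        processed.append(X)
    scaler = MinMaxScaler().fit(processed[0])           # assumption: fit on training data only
    return scaler.transform(processed[0]), scaler.transform(processed[1])


# Random stand-in data with the same number of time steps as the real flux series
rng = np.random.default_rng(0)
Xtr, Xte = preprocess(rng.normal(size=(20, 3197)), rng.normal(size=(5, 3197)))
print(Xtr.shape, Xte.shape)
```

Because the scaler is fitted on the training set, test values can fall slightly outside the interval from 0 to 1.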

## Train a CNN

Use train_test_split to split off 20% of the training data as a validation set (the test data stays unchanged).

In order to use a CNN, we first have to reshape the data: the training set needs the dimensions (4069, 1599, 1), and similarly for the validation and test sets.
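A minimal sketch of the split and the reshape, using random stand-in data with only 8 columns to keep it light (the processed data would have 1599). The `stratify=y` and `random_state=0` arguments are additions not asked for in the text; stratifying simply makes sure the few exoplanet stars are spread over both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((5087, 8))            # stand-in for the processed training features
y = np.zeros(5087, dtype=int)
y[:37] = 1                           # 37 exoplanet stars, as in the training set

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Conv1D expects (samples, steps, channels), so add a trailing channel axis
X_tr = X_tr[..., np.newaxis]
X_val = X_val[..., np.newaxis]
print(X_tr.shape, X_val.shape)
```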

Since the dataset is so imbalanced, we will try to use the class_weight parameter to give more weight to the rare class. First, compute the "balanced" class weights. You can use the utility at https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html or write your own. See also https://www.tensorflow.org/tutorials/structured_data/imbalanced_data for what the class_weight dictionary needs to look like.
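If you write your own, the "balanced" heuristic is n_samples / (n_classes * count_per_class). A small sketch follows; the function name is made up, and it assumes the labels are the integers 0, 1, ... with every class present. The example uses the class counts of the full training set, whereas in the exercise you would pass the labels of your actual training split.

```python
import numpy as np


def balanced_class_weights(y):
    """Return {class_index: n_samples / (n_classes * class_count)}."""
    counts = np.bincount(y)                      # samples per class
    weights = len(y) / (len(counts) * counts)
    return {int(c): float(w) for c, w in enumerate(weights)}


y = np.array([0] * 5050 + [1] * 37)              # class counts of the full training set
class_weight = balanced_class_weights(y)
print(class_weight)
```

The result is a plain dictionary mapping class index to weight, which is the form expected by the class_weight argument of model.fit().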

**Proposed structure of the CNN:**

1. A 1D convolution layer (Conv1D) with 16 filters, input shape matching the dataset, kernel_size=3, activation='relu', kernel_regularizer='l2', padding='same'.
2. A max pooling layer (MaxPooling1D) with pool_size=2 and strides=2
3. A 30% dropout layer
4. A flatten operation
5. A dense layer with 32 neurons
6. A 50% dropout layer
7. A final output layer with output dimension 2 and sigmoid activation for classification. The appropriate loss function is then *keras.losses.SparseCategoricalCrossentropy()*

Use the Adam optimizer with learning rate = 0.01.
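The proposed structure could be written in Keras roughly like this. Treat it as a sketch: the function name `build_cnn` is made up, the activation of the 32-neuron dense layer is not specified above (relu is assumed here), and tracking accuracy as a metric is also an addition.

```python
import keras
from keras import layers


def build_cnn(n_features=1599):
    model = keras.Sequential([
        keras.Input(shape=(n_features, 1)),
        layers.Conv1D(16, kernel_size=3, activation="relu",
                      kernel_regularizer="l2", padding="same"),
        layers.MaxPooling1D(pool_size=2, strides=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),      # assumption: activation not given in the text
        layers.Dropout(0.5),
        layers.Dense(2, activation="sigmoid"),    # two outputs with sigmoid, as proposed above
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.01),
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model


model = build_cnn()
model.summary()
```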


**Try training the model for 20 epochs.** Remember to pass the "validation_data" and the "class_weight" to model.fit().

Plot the training and validation loss and accuracy versus epochs. Have you trained enough?
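A possible plotting helper is sketched below. It takes a dictionary shaped like the `.history` attribute of the object returned by model.fit() and assumes the model was compiled with metrics=["accuracy"], so that the keys "accuracy" and "val_accuracy" exist. The numbers in the demo dictionary are made up and only show the call.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot training and validation curves for loss and accuracy side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for key, ax in zip(("loss", "accuracy"), axes):
        ax.plot(history[key], label="training")
        ax.plot(history["val_" + key], label="validation")
        ax.set_xlabel("epoch")
        ax.set_ylabel(key)
        ax.legend()
    fig.tight_layout()
    return fig


# Made-up values standing in for a real training history
fake_history = {
    "loss": [0.9, 0.6, 0.5],
    "val_loss": [1.0, 0.7, 0.65],
    "accuracy": [0.70, 0.85, 0.90],
    "val_accuracy": [0.65, 0.80, 0.84],
}
fig = plot_history(fake_history)
```

If the validation loss is still falling at the last epoch, more epochs may help; if it rises while the training loss keeps falling, the model is starting to overfit.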

Inspect the confusion matrix for the CNN on the test data. Note that in keras, calling model.predict gives you the output probabilities, not the class label. You need to compute that from the probabilities.
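Turning probabilities into class labels amounts to picking the column with the larger value. A small sketch with made-up numbers standing in for the output of model.predict:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up stand-in for model.predict(X_test): one row per star, one column per class
probs = np.array([
    [0.9, 0.2],
    [0.3, 0.8],
    [0.6, 0.5],
    [0.1, 0.7],
])
y_true = np.array([0, 1, 1, 1])

y_pred = probs.argmax(axis=1)        # index of the larger output = predicted class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(y_pred)
print(cm)                            # rows: true class, columns: predicted class
```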

Compute accuracy, precision, and recall metrics for train and test data. 
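The three metrics can be collected with scikit-learn as sketched here. The helper `summarize` and the prediction vector are invented for illustration: the example imagines a classifier that finds 3 of the 5 test-set planets while raising 10 false alarms.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score


def summarize(y_true, y_pred):
    """Accuracy, precision and recall for the positive (exoplanet) class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }


y_test = np.array([1] * 5 + [0] * 565)   # class counts of the test set
y_pred = np.zeros_like(y_test)
y_pred[:3] = 1                            # hypothetical: 3 of 5 planets found
y_pred[5:15] = 1                          # hypothetical: 10 false alarms

scores = summarize(y_test, y_pred)
print(scores)
```

In this made-up case the accuracy stays close to 98% while precision is only 3/13 and recall 3/5, so accuracy alone says little about how well planets are found.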

**Questions:**

1. Which metric, in your opinion, is most useful for the present problem? Do you have a useful classifier?
2. Re-train the network a few times, and comment on the results.

## Train an RNN

Our time series data is in principle amenable to treatment by an RNN. To this end, we **return to the original, unprocessed data** without the FFT and filtering. 

Again use train_test_split to split off 20% of the training data as a validation set (the test data stays unchanged). Then reshape the features so that X_train, for instance, has the shape (4069, 3197, 1).

Let's build a sequential network with:

1) a SimpleRNN layer of 16 or 32 neurons
2) a dropout layer with 20%
3) a final dense layer with sigmoid activation as before for classification

Everything else can be the same as the CNN, including the class weights.
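A sketch of this network in Keras, mirroring the CNN setup (Adam with learning rate 0.01 and sparse categorical cross-entropy). The function name `build_rnn` is made up, and 16 units is just one of the two suggested sizes.

```python
import keras
from keras import layers


def build_rnn(n_steps=3197, units=16):
    model = keras.Sequential([
        keras.Input(shape=(n_steps, 1)),          # one flux value per time step
        layers.SimpleRNN(units),                  # returns only the last hidden state
        layers.Dropout(0.2),
        layers.Dense(2, activation="sigmoid"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.01),
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model


rnn = build_rnn()
rnn.summary()
```

Keep in mind that the recurrent layer has to step through all 3197 time points for every star, so each epoch will likely take noticeably longer than for the CNN.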

**Try training the RNN for 10 epochs.**

Plot the training and validation loss and accuracy again.

Then inspect the confusion matrix.

Finally, compute the final scores. How is the RNN doing in our classification task? Is the RNN or the CNN better?