# Exoplanet Detection Technical Report:

### Objective:

Design a neural network to detect planets passing in front of their stars by analyzing observed brightness. 

#### Answer the question:
Out of all the stars visible in the night sky (thought to be about 5,000), how many have detectable planets?
<br>

### Data Collection:

The measured levels of light, known as "light curves", used in this project were observed by the Kepler Spacecraft. <br>
Kepler was designed to measure these transit events in an effort to estimate how many Earth-like planets there are in the Milky Way.<br>
The data collected by Kepler is a time series of the magnitude of light measured from stars, sampled at about 30-minute intervals.

Transit Detection Method: <br>
Advantages <br>
> Can be easily applied to many stars compared to other exoplanet detection methods.

Disadvantages <br>
> Only a small fraction of planets happen to have an orbital plane that puts them between their star and our observation point. <br>
> This method is known to have a high false positive rate.


Model: <br>
The model chosen for this problem is a one-dimensional convolutional neural network. <br>
Convolutional neural networks (CNNs) are often used in image processing and pattern detection problems where the order of the input matters. Transit events create sharp dips in the brightness of their stars. This CNN will learn to recognize these patterns and classify each star system as containing at least one exoplanet or not.
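
To illustrate why a convolutional layer suits this problem, here is a minimal sketch of a single filter sliding along a light curve. The light curve is synthetic, and the kernel is written by hand purely for illustration; in the actual model the kernel weights would be learned during training, and the function name `dip_response` is hypothetical.

```python
import numpy as np

def dip_response(flux, kernel):
    """Slide a kernel along the series (cross-correlation, as CNN layers do)."""
    return np.correlate(flux, kernel, mode="valid")

# Synthetic, flat light curve with one box-shaped dip (illustrative values only).
flux = np.zeros(50)
flux[20:25] = -1.0

# Hand-written kernel shaped like the dip; a trained CNN would learn its own weights.
kernel = np.full(5, -1.0)

response = dip_response(flux, kernel)
# Index of the window that best matches the dip shape.
peak_index = int(np.argmax(response))
```

The response is largest where the window lines up with the dip, which is the kind of localized feature the later layers can then use for classification.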

### Description of data
The data used in this project is from the third campaign of Kepler's first mission.
Each campaign lasted up to 90 days, after which the spacecraft performed a rolling maneuver to optimize the alignment of its solar panels.

The magnitude of light is the output of the light-sensitive instrument, measured in electrons per second. NASA then adjusted it to account for systematic error from the spacecraft and for changes in the background light as Kepler trailed the Earth on its journey around the Sun.

### Data gathering and cleaning process:
To obtain the data, I used wget scripts made for NASA's Bulk Data API. These wget scripts are contained in batch files and are run from the terminal. The Kepler Archive contains a large amount of data, so to keep computation manageable, the data used in this project was limited to small portions of some of Kepler's missions. Even so, the data downloaded from NASA for this project took up about 250 gigabytes of disk space.<br>

#### Extracting
The light curves are contained in .fit files along with a lot of other data, including the image of the star and other readings from the spacecraft's instruments.
The relevant data was extracted by looping through each .fit file and saving the corrected light flux levels in a separate file.

#### Handling null values
The light curves contained some "one-off" single missing values throughout the time series as well as longer "strings" of missing data that could last as long as 20 hours.
The "one-off" missing values were filled with the mean of the two nearest recorded light levels. This kind of mean imputation was used so that the filled values would not stand out as patterns for the neural network to detect. It also follows the logic that the magnitude of light between two points in time lies between the two known values unless some external factor acts on it. Absent evidence to the contrary, we assume there is no such external factor for these missing values.<br>
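
A minimal sketch of this neighbour-mean fill, assuming each light curve is held as a NumPy array with missing samples stored as `NaN` (the storage format and the function name `fill_isolated_gaps` are assumptions, not taken from the project code):

```python
import numpy as np

def fill_isolated_gaps(flux):
    """Replace each single missing sample with the mean of its two neighbours."""
    filled = np.asarray(flux, dtype=float).copy()
    missing = np.isnan(filled)
    for i in range(1, len(filled) - 1):
        # Only fill a gap of length one: both neighbours must be recorded values.
        if missing[i] and not missing[i - 1] and not missing[i + 1]:
            filled[i] = 0.5 * (filled[i - 1] + filled[i + 1])
    return filled

# Illustrative values: one isolated gap and one longer gap.
example = fill_isolated_gaps([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
```

In this sketch the isolated gap is filled, while the two-sample gap is left untouched for separate handling.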

The longer "strings" of missing data were filled in with the next non-null value in the time series. This has the potential to create "skips" in the data that could be recognized by the neural network and skew its results if there is some correlation between these skips and whether the star is classified as having a planet or not. To account for this, the light curves were randomized across time so that the missing values would be less likely to occur in some systematic pattern.
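
Filling a gap with the next non-null value is often called a backward fill. The sketch below shows one way to do it, again assuming `NaN` marks missing samples; the function name `backfill_gaps` is hypothetical, and leaving trailing gaps unfilled is a choice made for this example only.

```python
import numpy as np

def backfill_gaps(flux):
    """Fill each run of missing samples with the next recorded value."""
    filled = np.asarray(flux, dtype=float).copy()
    next_valid = np.nan  # no later value is known yet when scanning from the end
    for i in range(len(filled) - 1, -1, -1):
        if np.isnan(filled[i]):
            filled[i] = next_valid
        else:
            next_valid = filled[i]
    return filled

# Illustrative values: a two-sample gap and a trailing gap with no later value.
example_backfill = backfill_gaps([1.0, np.nan, np.nan, 4.0, np.nan])
```

The jump from 1.0 to 4.0 in the filled series is the kind of "skip" described above.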

### Data Exploration:
This is a good example of a light curve with several very obvious transits.
You can see the sharp decreases in brightness when the planet passes in front of the star.
<img src="assets/transit_light_curve.png">

A relatively common feature in these light curves is stellar flares. <br>
At first glance they look like outliers, but they are most likely massive explosions that greatly increase the brightness of the star for a short time.
<img src="assets/solar_flare.png">

### Preprocessing
Scaling:
To compare the magnitude of light from different stars, the flux levels were normalized for each individual star. The resulting values represent how far a particular light level is from the average brightness of that individual star. This preserves transit events while allowing stars of different brightness to be compared.
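
The exact normalization formula is not stated here, so the sketch below assumes a per-star z-score (subtract the star's mean flux, divide by its standard deviation). The function name `normalize_flux` is hypothetical.

```python
import numpy as np

def normalize_flux(flux):
    """Scale one star's light curve to zero mean and unit variance (assumed scheme)."""
    flux = np.asarray(flux, dtype=float)
    mean = np.nanmean(flux)  # NaN-aware, in case gaps have not been filled yet
    std = np.nanstd(flux)
    if std == 0:
        # A perfectly flat curve carries no signal; avoid dividing by zero.
        return np.zeros_like(flux)
    return (flux - mean) / std

# Two illustrative stars with the same relative dip but very different brightness.
bright = normalize_flux([1000.0, 1000.0, 900.0, 1000.0])
faint = normalize_flux([10.0, 10.0, 9.0, 10.0])
```

After scaling, the two example curves are numerically the same, which is what makes stars of different brightness comparable.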

Setting the time interval to be the same for all datasets: <br>
To compare data retrieved from different Kepler missions, all datasets were limited to the size of the smallest dataset, which was about 66 days long. The disadvantage is that this is likely to exclude transit events for planets with an orbital period of more than 66 days. This is a relatively short window: the shortest orbital period in our solar system is Mercury's, at 88 days. However, the transit detection method is more sensitive to planets close to their star, and most of the planets that have been discovered this way have an orbital period within this window.
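
As a rough sketch of this step, the code below converts a time span to a sample count at the approximate 30-minute cadence and trims a set of light curves to the length of the shortest one. Both function names are hypothetical, and keeping the start of each curve (rather than some other window) is an assumption.

```python
import numpy as np

CADENCE_MINUTES = 30  # approximate sampling interval stated in this report

def samples_in_days(days, cadence_minutes=CADENCE_MINUTES):
    """Number of samples covering the given span at a fixed cadence."""
    return int(days * 24 * 60 // cadence_minutes)

def truncate_to_shortest(curves):
    """Trim every light curve to the shortest length and stack into a 2D array."""
    n = min(len(c) for c in curves)
    return np.stack([np.asarray(c, dtype=float)[:n] for c in curves])

window = samples_in_days(66)  # 66 days * 48 samples per day = 3168
stacked = truncate_to_shortest([np.arange(5), np.arange(3), np.arange(4)])
```

Stacking the trimmed curves gives a single array with one row per star, which is a convenient input shape for a neural network.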

### Modeling:
The topology of this CNN is fairly simple: 3 convolutional layers, 3 pooling layers, and 2 hidden layers. The patterns that transits make aren't complicated images that require many layers of neurons to interpret, and the more complicated the model becomes, the less well it generalizes to unseen data, especially data from different campaigns of the Kepler mission or light curves measured by other telescopes.
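
The framework and layer sizes are not given here, so the following is only a framework-free NumPy sketch of a forward pass through a network with this shape. The kernel width, channel counts, pooling size, hidden-layer widths, ReLU and sigmoid activations, and the random untrained weights are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d(x, weights, bias):
    """Valid 1D convolution. x: (length, in_ch); weights: (width, in_ch, out_ch)."""
    width = weights.shape[0]
    n_out = x.shape[0] - width + 1
    return np.stack([
        np.tensordot(x[i:i + width], weights, axes=([0, 1], [0, 1])) + bias
        for i in range(n_out)
    ])

def max_pool1d(x, size):
    """Non-overlapping max pooling along the time axis."""
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).max(axis=1)

def init_params(n_timesteps, kernel_width=5, channels=(8, 16, 32),
                pool=4, hidden=(64, 32)):
    """Random weights for 3 conv layers, 2 hidden layers and 1 output unit."""
    params = {"conv": [], "dense": []}
    length, in_ch = n_timesteps, 1
    for out_ch in channels:
        w = rng.normal(0.0, 0.1, (kernel_width, in_ch, out_ch))
        params["conv"].append((w, np.zeros(out_ch)))
        length = (length - kernel_width + 1) // pool  # size after conv + pool
        in_ch = out_ch
    n_in = length * in_ch
    for n_out in tuple(hidden) + (1,):
        params["dense"].append((rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)))
        n_in = n_out
    return params

def forward(flux, params, pool=4):
    """Return the (untrained) probability that the light curve contains a transit."""
    x = np.asarray(flux, dtype=float).reshape(-1, 1)
    for w, b in params["conv"]:
        x = max_pool1d(relu(conv1d(x, w, b)), pool)  # conv -> ReLU -> pool
    x = x.reshape(-1)  # flatten for the dense layers
    *hidden_layers, (w_out, b_out) = params["dense"]
    for w, b in hidden_layers:
        x = relu(x @ w + b)
    return float(sigmoid(x @ w_out + b_out)[0])

params = init_params(n_timesteps=400)
probability = forward(rng.normal(size=400), params)
```

With untrained weights the output is meaningless as a prediction; the point is only to show how the three convolution and pooling stages shrink the series before the two hidden layers and the single sigmoid output.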

### Conclusions:

The model classified the hold-out training set with 85% accuracy, compared to a baseline of 31%. When applied to unseen data without confirmed planets mixed in, it classified planets up to 15 times better than chance.
The model predicted 98 planets, of which 12 were confirmed planets. There were 40 confirmed planetary systems in the unseen dataset.
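
For reference, the counts above can be turned into precision and recall measured against the confirmed planets. This is plain arithmetic on the reported numbers; note that predictions without a confirmed planet are not necessarily wrong, since they may be undiscovered candidates, so the precision here is best read as a lower bound.

```python
predicted = 98        # systems the model flagged as having a planet
true_positives = 12   # flagged systems that are confirmed planets
confirmed_total = 40  # confirmed planetary systems in the unseen dataset

# Fraction of the model's predictions that are confirmed planets.
precision = true_positives / predicted
# Fraction of the confirmed planetary systems that the model recovered.
recall = true_positives / confirmed_total

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```

That works out to a precision of about 12% and a recall of 30% on the unseen dataset.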