# Exoplanet
<!-- Author: Xiaorong Yan -->
<!-- Date:  2021/12/15-->

The goal of this project is to apply the skills I learned in my data science class to a problem I find interesting, in this case one related to astronomy. Since I major in CS and minor in astronomy, this is my first project that combines my knowledge of both fields. The dataset comes from NASA's Exoplanet Archive; details are below.
I found it through https://github.com/awesomedata/awesome-public-datasets.

The exoplanet data is available from the NASA Exoplanet Archive:
https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=TOI

To use up-to-date data, you can either:
a) download it from the link above, or
b) use NASA's API (refer to https://exoplanetarchive.ipac.caltech.edu/docs/program_interfaces.html)
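
For option b), here is a minimal sketch of how the TOI table might be pulled through the archive's TAP interface instead of a manual download. The endpoint URL, the table name `toi`, and the helper names `build_toi_query_url` and `fetch_toi` are my assumptions, not something this notebook uses; check them against the program interfaces page before relying on them.

```python
from urllib.parse import urlencode

import pandas as pd

# Assumed endpoint; verify against the archive's program interfaces page.
TAP_SYNC_URL = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"


def build_toi_query_url(columns="*", table="toi"):
    """Build a TAP request URL that returns the chosen columns as CSV.

    The default table name "toi" is an assumption.
    """
    params = {"query": f"select {columns} from {table}", "format": "csv"}
    return f"{TAP_SYNC_URL}?{urlencode(params)}"


def fetch_toi(columns="*"):
    """Download the current TOI table into a dataframe (needs network access)."""
    return pd.read_csv(build_toi_query_url(columns))


# Building the URL does not touch the network; fetch_toi() does.
url = build_toi_query_url("toi,tid,tfopwg_disp")
print(url)
```

Calling `fetch_toi()` would then replace reading a local CSV file, at the cost of the notebook no longer being reproducible against a fixed snapshot.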

According to the NASA Exoplanet Archive's front page, there are a total of 4877 confirmed exoplanets, 173 of which came from TESS[1], which also has 4708 candidates (as of December 2021). If we can confirm or rule out candidates from TESS, we will have more exoplanets to study, and may even confirm more Earth-like exoplanets. In the data science class, we used sklearn to train models on a dataset and predict results. Examining the dataset further confirmed that using what I have learned to classify candidates is indeed doable. So here we go.




[1] Transit Surveys. Launched in April 2018, TESS is surveying the sky for two years to find transiting exoplanets around the brightest stars near Earth.


In [26]:
# importing libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier


# display every column when printing a dataframe
pd.set_option('display.max_columns', None)



The downloaded CSV files are not cleanly formatted: the first n rows hold column-name descriptions rather than data, so a plain pd.read_csv() call does not parse them properly. We therefore skip those rows with the "skiprows" parameter of read_csv().

In [2]:
df_TOI = pd.read_csv('TOI_202112.csv', skiprows = 69)
df_TOI.head()

Unnamed: 0,toi,tid,tfopwg_disp,rastr,ra,decstr,dec,st_pmra,st_pmraerr1,st_pmraerr2,st_pmralim,st_pmdec,st_pmdecerr1,st_pmdecerr2,st_pmdeclim,pl_tranmid,pl_tranmiderr1,pl_tranmiderr2,pl_tranmidlim,pl_orbper,pl_orbpererr1,pl_orbpererr2,pl_orbperlim,pl_trandurh,pl_trandurherr1,pl_trandurherr2,pl_trandurhlim,pl_trandep,pl_trandeperr1,pl_trandeperr2,pl_trandeplim,pl_rade,pl_radeerr1,pl_radeerr2,pl_radelim,pl_insol,pl_insolerr1,pl_insolerr2,pl_insollim,pl_eqt,pl_eqterr1,pl_eqterr2,pl_eqtlim,st_tmag,st_tmagerr1,st_tmagerr2,st_tmaglim,st_dist,st_disterr1,st_disterr2,st_distlim,st_teff,st_tefferr1,st_tefferr2,st_tefflim,st_logg,st_loggerr1,st_loggerr2,st_logglim,st_rad,st_raderr1,st_raderr2,st_radlim,toi_created,rowupdate
0,1000.01,50365310,FP,07h29m25.85s,112.357708,-12d41m45.46s,-12.69596,-5.964,0.085,-0.085,0.0,-0.076,0.072,-0.072,0.0,2459230.0,0.001657,-0.001657,0,2.171348,0.000264,-0.000264,0,2.01722,0.319588,-0.319588,0,656.886099,37.77821,-37.77821,0,5.818163,1.910546,-1.910546,0,22601.948581,,,,3127.204052,,,,9.604,0.013,-0.013,0,485.735,11.9515,-11.9515,0,10249.0,264.7,-264.7,0,4.19,0.07,-0.07,0,2.16986,0.072573,-0.072573,0,2019-07-24 15:58:33,2021-10-29 12:59:15
1,1001.01,88863718,PC,08h10m19.31s,122.580465,-05d30m49.87s,-5.513852,-4.956,0.102,-0.102,0.0,-15.555,0.072,-0.072,0.0,2459250.0,0.001925,-0.001925,0,1.931671,8e-06,-8e-06,0,3.18,0.173,-0.173,0,1030.0,207.83,-207.83,0,10.3168,3.21459,-3.21459,0,42432.8,,,,3998.0,,,,9.42344,0.006,-0.006,0,295.862,5.91,-5.91,0,7070.0,126.4,-126.4,0,4.03,0.09,-0.09,0,2.01,0.09,-0.09,0,2019-07-24 15:58:33,2021-10-29 12:59:15
2,1002.01,124709665,FP,06h58m54.47s,104.726966,-10d34m49.64s,-10.580455,-1.462,0.206,-0.206,0.0,-2.249,0.206,-0.206,0.0,2459202.0,0.001161,-0.001161,0,1.867588,0.000152,-0.000152,0,2.211864,0.094625,-0.094625,0,1657.147109,69.07734,-69.07734,0,36.432872,21.315702,-21.315702,0,20641.445701,,,,3057.065736,,,,9.299501,0.058,-0.058,0,943.109,106.333,-106.333,0,8924.0,124.0,-124.0,0,,,,0,5.73255,,,0,2019-07-24 15:58:33,2021-10-29 12:59:15
3,1003.01,106997505,FP,07h22m14.39s,110.559945,-25d12m25.26s,-25.207017,-0.939,0.041,-0.041,0.0,1.64,0.055,-0.055,0.0,2458493.0,0.00535,-0.00535,0,2.74323,0.00108,-0.00108,0,3.167,0.642,-0.642,0,383.41,0.781988,-0.781988,0,,,,0,1177.36,,,,1631.0,,,,9.3003,0.037,-0.037,0,7728.17,1899.57,-1899.57,0,5388.5,567.0,-567.0,0,4.15,1.64,-1.64,0,,,,0,2019-07-24 15:58:33,2021-10-29 12:59:15
4,1004.01,238597883,FP,08h08m42.77s,122.178195,-48d48m10.12s,-48.802811,-4.496,0.069,-0.069,0.0,9.347,0.062,-0.062,0.0,2459230.0,0.002365,-0.002365,0,3.577575,0.000669,-0.000669,0,2.934708,0.343917,-0.343917,0,501.602877,35.86739,-35.86739,0,5.050111,1.345575,-1.345575,0,8092.969136,,,,2419.060447,,,,9.1355,0.006,-0.006,0,356.437,4.6175,-4.6175,0,9219.0,171.1,-171.1,0,4.14,0.07,-0.07,0,2.1504,0.060467,-0.060467,0,2019-07-24 15:58:33,2021-10-29 12:59:15
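
The hard-coded `skiprows = 69` depends on the exact number of description lines in this particular download. As a sketch of an alternative, assuming those description lines all start with `#`, the `comment` parameter of `read_csv()` skips them without counting. The sample text below is a made-up stand-in, not the real file.

```python
import io

import pandas as pd

# Tiny stand-in for a downloaded file: '#' description lines, then the CSV itself.
sample_csv = """# Example description block (made-up stand-in)
# COLUMN toi: TESS Object of Interest
# COLUMN tfopwg_disp: (description line)
#
toi,tfopwg_disp
1000.01,FP
1001.01,PC
"""

# Lines that begin with '#' are dropped entirely, so the header row is found automatically.
df_sample = pd.read_csv(io.StringIO(sample_csv), comment="#")
print(df_sample)
```

One caveat: `comment="#"` also truncates any data line at a `#` character, so this only suits files whose values never contain one.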


The PS dataset contains all confirmed exoplanets, including those found by methods other than TESS. The TOI (TESS Object of Interest) dataframe contains all the candidates collected from TESS. The two dataframes use different column formats and include columns that we do not need for this project. Therefore, we need to tidy up the data and keep only what is useful.
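
As one possible way to start tidying, the sketch below drops the uncertainty (`...err1`, `...err2`) and limit (`...lim`) columns by matching on their suffixes. Treating those columns as unneeded is my assumption for illustration; the project may well want to keep some of them. The demo frame reuses a few real TOI column names with values from the preview above.

```python
import pandas as pd

# Small stand-in frame with a handful of TOI columns.
df_demo = pd.DataFrame({
    "toi": [1000.01, 1001.01],
    "tfopwg_disp": ["FP", "PC"],
    "pl_orbper": [2.171348, 1.931671],
    "pl_orbpererr1": [0.000264, 8e-06],
    "pl_orbpererr2": [-0.000264, -8e-06],
    "pl_orbperlim": [0, 0],
})

# Flag names ending in err1, err2, or lim (non-capturing group avoids a pandas warning).
is_aux = df_demo.columns.str.contains(r"(?:err1|err2|lim)$")

# Keep only the columns that were not flagged.
df_core = df_demo.loc[:, ~is_aux]
print(list(df_core.columns))
```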

In [9]:
# The first step is to inspect the columns in the dataframe so we can decide which to keep and give them more descriptive names.
print(df_TOI.shape)
df_TOI.columns

(4708, 65)


Index(['toi', 'tid', 'tfopwg_disp', 'rastr', 'ra', 'decstr', 'dec', 'st_pmra',
       'st_pmraerr1', 'st_pmraerr2', 'st_pmralim', 'st_pmdec', 'st_pmdecerr1',
       'st_pmdecerr2', 'st_pmdeclim', 'pl_tranmid', 'pl_tranmiderr1',
       'pl_tranmiderr2', 'pl_tranmidlim', 'pl_orbper', 'pl_orbpererr1',
       'pl_orbpererr2', 'pl_orbperlim', 'pl_trandurh', 'pl_trandurherr1',
       'pl_trandurherr2', 'pl_trandurhlim', 'pl_trandep', 'pl_trandeperr1',
       'pl_trandeperr2', 'pl_trandeplim', 'pl_rade', 'pl_radeerr1',
       'pl_radeerr2', 'pl_radelim', 'pl_insol', 'pl_insolerr1', 'pl_insolerr2',
       'pl_insollim', 'pl_eqt', 'pl_eqterr1', 'pl_eqterr2', 'pl_eqtlim',
       'st_tmag', 'st_tmagerr1', 'st_tmagerr2', 'st_tmaglim', 'st_dist',
       'st_disterr1', 'st_disterr2', 'st_distlim', 'st_teff', 'st_tefferr1',
       'st_tefferr2', 'st_tefflim', 'st_logg', 'st_loggerr1', 'st_loggerr2',
       'st_logglim', 'st_rad', 'st_raderr1', 'st_raderr2', 'st_radlim',
       'toi_created', 'rowupda

In [25]:
# column description from the website, https://exoplanetarchive.ipac.caltech.edu/docs/API_TOI_columns.html
dscpt = requests.get('https://exoplanetarchive.ipac.caltech.edu/docs/API_TOI_columns.html').text
soup = BeautifulSoup(dscpt, 'html.parser')
table = soup.find_all('table')
for t in table:
    print(t.text)




Database
        Column Name
Table Label
Description
 Uncertainties Column
        (positive +)
        (negative -)
Limit Column


toi†
TESS Object of Interest
 A number used to identify and track a TESS Object of Interest (TOI). 
        TOIs are identified and numbered by the TESS Project. A TOI name has 
        an integer and a decimal part of the format TOI-NNNNN.DD. The integer 
        part designates the target star; the two-digit decimal part identifies 
        a unique transiting object associated with that star. 
 
 


toipfx
TESS Object of Interest Prefix
The integer portion of the TOI Identifier, designating the target star. (See toi description above.)
 
 


tid†
TESS Input Catalog ID

        Target identification number, as listed in the TESS Input Catalog (TIC).
  
 
 


ctoi_alias
Community TESS Object of Interest Alias

    A number used to identify and track a Community-identified TESS Object of Interest (CTOI). A CTOI name has an integer and a decimal part, whe
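
Printing the raw table text is hard to scan. As a sketch, the same page could be parsed into a dictionary that maps each database column name to its table label. The inline HTML is a simplified stand-in I wrote for illustration; the real page's markup may differ, and `column_labels` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the documentation table (real markup may differ).
sample_html = """
<table>
  <tr><th>Database Column Name</th><th>Table Label</th><th>Description</th></tr>
  <tr><td>toi†</td><td>TESS Object of Interest</td><td>description text</td></tr>
  <tr><td>tid†</td><td>TESS Input Catalog ID</td><td>description text</td></tr>
</table>
"""


def column_labels(html):
    """Map each database column name to its table label."""
    soup = BeautifulSoup(html, "html.parser")
    labels = {}
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Header rows use <th> cells, so they yield no <td> text and are skipped.
        if len(cells) >= 2:
            # Strip the dagger marker that follows some column names.
            labels[cells[0].rstrip("†")] = cells[1]
    return labels


labels = column_labels(sample_html)
print(labels)
```

A dictionary in this shape can be passed directly as the `columns` argument of `DataFrame.rename()`.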

In [None]:
# Renaming the columns using the table labels from the column description page.
# rename() returns a new dataframe, so assign the result back.
df_TOI = df_TOI.rename(columns={
    'toi': 'TESS Object of Interest',
    'tid': 'TESS Input Catalog ID',
    # Columns below still need descriptive labels:
    # 'tfopwg_disp', 'rastr', 'ra', 'decstr', 'dec', 'st_pmra',
    # 'st_pmraerr1', 'st_pmraerr2', 'st_pmralim', 'st_pmdec', 'st_pmdecerr1',
    # 'st_pmdecerr2', 'st_pmdeclim', 'pl_tranmid', 'pl_tranmiderr1',
    # 'pl_tranmiderr2', 'pl_tranmidlim', 'pl_orbper', 'pl_orbpererr1',
    # 'pl_orbpererr2', 'pl_orbperlim', 'pl_trandurh', 'pl_trandurherr1',
    # 'pl_trandurherr2', 'pl_trandurhlim', 'pl_trandep', 'pl_trandeperr1',
    # 'pl_trandeperr2', 'pl_trandeplim', 'pl_rade', 'pl_radeerr1',
    # 'pl_radeerr2', 'pl_radelim', 'pl_insol', 'pl_insolerr1', 'pl_insolerr2',
    # 'pl_insollim', 'pl_eqt', 'pl_eqterr1', 'pl_eqterr2', 'pl_eqtlim',
    # 'st_tmag', 'st_tmagerr1', 'st_tmagerr2', 'st_tmaglim', 'st_dist',
    # 'st_disterr1', 'st_disterr2', 'st_distlim', 'st_teff', 'st_tefferr1',
    # 'st_tefferr2', 'st_tefflim', 'st_logg', 'st_loggerr1', 'st_loggerr2',
    # 'st_logglim', 'st_rad', 'st_raderr1', 'st_raderr2', 'st_radlim',
    # 'toi_created', 'rowupdate'
})