### Goals in this Notebook:
Create new files in which the data is:
> Clean, without null values. <br>
> Labeled by whether the star has a detectable planet (1) or not (0). <br>
> Set to the same time frame. <br>

### Imports:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import time
import numpy as np
import random

%matplotlib inline

### Read in the Files:

In [None]:
# These are the confirmed planet stars from the first download session
c_planets = pd.read_csv('../clean_planet_data/extracted_planets_1_again.csv')

# These are the confirmed planet stars from the second download session
c_planets_2 = pd.read_csv('../clean_planet_data/extracted_confirmed_planets_2_again.csv')

c4_kep = pd.read_csv('../clean_planet_data/extracted_kep_c4_7700_backup.csv')

In [None]:
# drop the last line of c4_kep because it only downloaded halfway before being stopped
c4_kep.drop(index=7713, inplace = True)

# Start Munging:

### Check for Duplicate Entries of Each Star:
If the value is greater than 1, there are duplicates.
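
As a complement to checking only the maximum count, the sketch below lists which stars are duplicated and how many extra rows they contribute. The toy table and its `flux_0` column are made up for illustration; only the `star_name` column name comes from the files above.

```python
import pandas as pd

# Toy stand-in for one of the light-curve tables (values are invented).
toy = pd.DataFrame({
    'star_name': ['A', 'B', 'A', 'C', 'B', 'A'],
    'flux_0': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

counts = toy['star_name'].value_counts()

# Stars that appear more than once, with their row counts
duplicated_stars = counts[counts > 1]

# Number of rows beyond the first occurrence of each star
n_extra_rows = int(toy.duplicated(subset='star_name').sum())

print(duplicated_stars)
print(n_extra_rows)
```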

In [None]:
c_planets['star_name'].value_counts().max()

In [None]:
c_planets_2['star_name'].value_counts().max()

In [None]:
c4_kep['star_name'].value_counts().max()

In [None]:
# combine planet sets before randomly selecting?
#     cut the bigger one down to the size of the smaller one to keep the number of columns the same

# doing this would show duplicates across the two files rather than looking at them individually
#     this could explain why I saw a decrease in the number of stars with planets after introducing this

### Randomly Select Lightcurves from the Duplicates in Confirmed Planets Set:
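
One possible alternative to a row-by-row loop, sketched here on toy data: concatenate the two sets first (the option raised in the notes above) and let pandas draw one row per star. `toy_1`, `toy_2`, `flux_0` and the `random_state` value are placeholders, and `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Toy stand-ins for the two confirmed-planet tables (values are invented).
toy_1 = pd.DataFrame({'star_name': ['A', 'A', 'B'], 'flux_0': [1.0, 2.0, 3.0]})
toy_2 = pd.DataFrame({'star_name': ['B', 'C', 'C'], 'flux_0': [4.0, 5.0, 6.0]})

# Combine first so that a star present in both files is treated as one group
combined = pd.concat([toy_1, toy_2], ignore_index=True)

# Draw exactly one random row per star
one_per_star = (
    combined.groupby('star_name')
    .sample(n=1, random_state=0)   # random_state only makes the draw repeatable
    .reset_index(drop=True)
)

print(one_per_star)
```

Because the draw happens after concatenation, star `B` contributes a single row even though it appears in both toy tables.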

In [None]:
randomized_planets = pd.DataFrame(columns = c_planets.columns)

i = 0

# Randomly select one row per star and add it to the randomized_planets df
for df in [c_planets, c_planets_2]:
    for star in df['star_name'].unique():
        
        # Print out some feedback to show progress
        i += 1
        if i % 100 == 0:
            print(i)

        checking = df[df['star_name'] == star] # select all rows whose stars have the same name
        rand_select = random.choice(checking.index.tolist()) # randomly select one of the index labels that actually exist for this star
        rand = checking[checking.index == rand_select] # pick out that row
        randomized_planets = pd.concat([randomized_planets, rand]) # add it to the new df

# Reset the index
randomized_planets.reset_index(drop = True, inplace = True)    

In [None]:
# make sure this is exhaustive
#     count uniques before and after to be sure

Make sure it worked: <br>
Is there only one row per star?
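
A minimal sketch of the "count uniques before and after" check, run on toy frames. The helper name `check_one_row_per_star` and the toy data are hypothetical; only the `star_name` column comes from the notebook.

```python
import pandas as pd

def check_one_row_per_star(sources, selected, name_col='star_name'):
    """Compare the stars in the source tables with the stars that were selected."""
    expected = set()
    for df in sources:
        expected |= set(df[name_col])
    found = selected[name_col]
    return {
        'missing': expected - set(found),                      # stars that were lost
        'max_rows_per_star': int(found.value_counts().max()),  # 1 means no duplicates
    }

# Toy data: star B sits in both source tables and was selected once from each
src_1 = pd.DataFrame({'star_name': ['A', 'A', 'B']})
src_2 = pd.DataFrame({'star_name': ['B', 'C']})
selected = pd.DataFrame({'star_name': ['A', 'B', 'B']})

report = check_one_row_per_star([src_1, src_2], selected)
print(report)
```

In this toy case the report flags both problems: `C` is missing, and `B` still has two rows because it was drawn once per file.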

In [None]:
randomized_planets['star_name'].value_counts().max()

# Dealing with Nulls:

### Calculate Isolated Missing Values:
Fill isolated ('one-off') missing values with the mean of the two neighboring values. Runs of two or more consecutive nulls are left untouched.
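
The same rule can be sketched without nested loops by shifting the table left and right. This is an illustration under the assumption that `flux` holds only the numeric flux columns (no star name or label columns); `fill_isolated_nulls` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def fill_isolated_nulls(flux):
    """Fill a NaN with the mean of its left and right neighbors when both exist."""
    left = flux.shift(1, axis=1)     # value one column to the left
    right = flux.shift(-1, axis=1)   # value one column to the right
    fill = (left + right) / 2        # NaN wherever either neighbor is missing
    # Keep existing values; use the neighbor mean only where the cell is null
    return flux.where(flux.notna(), fill)

# Toy flux table (values are invented)
flux = pd.DataFrame([
    [1.0, np.nan, 3.0, 4.0],        # isolated gap: gets filled
    [np.nan, 2.0, np.nan, np.nan],  # edge gap and run of nulls: stay null
    [5.0, np.nan, np.nan, 8.0],     # run of two nulls: stays null
])

filled = fill_isolated_nulls(flux)
print(filled)
```

The neighbor means are computed from the original values, so a filled cell is never used to fill another cell.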

In [None]:
is_null = c4_kep.isnull()

for i in range(c4_kep.shape[0]): # for each row
    
    # Print out some feedback to show progress
    if i % 50 == 0:
        print(i)
        
    for j in range(c4_kep.shape[1]-1): # stop one column short so j+1 stays in range
        if is_null.iloc[i, j]:
            if j > 2: # start at column 3 so the left neighbor (j-1) is never one of the first two columns
                if not (is_null.iloc[i, j-1] or is_null.iloc[i, j+1]): # both neighbors must be present
                    c4_kep.iloc[i, j] = np.mean([c4_kep.iloc[i, j-1], c4_kep.iloc[i, j+1]])

In [None]:
# Do this for c4_kep and confirmed planets

### Closing Gaps in Data:
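
Closing gaps here means shifting each row's non-null values to the left so the remaining nulls collect at the end of the row. A minimal sketch on toy data, assuming `flux` contains only numeric flux columns; `close_gaps` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def close_gaps(flux):
    """Left-pack the non-null values of every row; shorter rows are padded with NaN."""
    rows = [row[~np.isnan(row)].tolist() for row in flux.to_numpy(dtype=float)]
    # pandas pads ragged lists with NaN on the right
    return pd.DataFrame(rows, index=flux.index)

# Toy flux table (values are invented)
flux = pd.DataFrame([
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, 2.0],
    [4.0, 5.0, 6.0],
])

squished = close_gaps(flux)
print(squished)
```

After this step the column positions no longer correspond to the original time stamps, so the original column names are deliberately not carried over.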

In [None]:
# Shifting values to fill nulls
#     the column names will no longer be relevant
is_null = c4_kep.isnull()
df_squished = pd.DataFrame()

for i in range(c4_kep.shape[0]):
    if i % 50 == 0:
        print(i)
    
    k = 0 # reset the df_squished column index to 0 for each new row
    for j in range(c4_kep.shape[1]):
        if is_null.iloc[i, j] == False: # if this cell is not null
            df_squished.loc[i, k] = c4_kep.iloc[i, j]
            k += 1
#             else: # if cell is null
#                 count how long the missing string is and save that info

In [None]:
# Do this for c4_kep and confirmed planets

# Assign Labels

### Assign Labels to Stars with Planets:
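
The matching rule used in this section is: look for the light curve's star name inside the catalog's 'Alternative star names', and fall back to 'Star name' when the alternative names are null. A compact sketch of that rule on toy data follows; the star names and the `label` column are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy catalog with the two name columns used by the real table
catalog = pd.DataFrame({
    'Star name': ['Kepler-1', 'Kepler-2', 'Kepler-3'],
    'Alternative star names': ['KIC 111, KOI-1', np.nan, 'KIC 333'],
})
light_curves = pd.DataFrame({'star_name': ['KIC 111', 'Kepler-2', 'KIC 999']})

# Use the alternative names when present, otherwise the primary star name
names = catalog['Alternative star names'].fillna(catalog['Star name']).fillna('')

# 1 if the star name occurs as a substring of any catalog entry, else 0
light_curves['label'] = [
    int(names.str.contains(star, regex=False).any())
    for star in light_curves['star_name']
]

print(light_curves)
```

Substring matching can over-match: a name such as `KIC 11` would also be found inside `KIC 111`, so exact matching on split name lists may be safer if the identifiers share prefixes.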

In [None]:
# Bring in a table that lists all confirmed planets with their star names and other info
all_confirmed = pd.read_csv('../clean_planet_data/all_planets_list.csv')

In [None]:
# Assign labels to c4_kep
not_found = 0
stars_to_drop = []

for j in range(len(c4_kep)): # for every light curve
    if j % 250 == 0:
        print(j)

    count = 0
    for i in range(len(all_confirmed)): # look through each star name in the list of all confirmed planets
        try:
            if all_confirmed.loc[i, 'Alternative star names'].find(c4_kep.iloc[j, 0]) != -1:
                count += 1
                print(c4_kep.iloc[j, 0], ' found @ index: ', j, 'orbital period: ', all_confirmed.loc[i, 'Orbital period [days]'])
                c4_kep.loc[j, '1'] = 1
    
        except AttributeError: # if the alternative star names value is null
            try:
                if all_confirmed.loc[i, 'Star name'].find(c4_kep.iloc[j, 0]) != -1:
                    count += 1
                    print(c4_kep.iloc[j, 0], ' found @ index: ', j, 'on 2nd level of loop', 'orbital period: ', all_confirmed.loc[i, 'Orbital period [days]'])
                    c4_kep.loc[j, '1'] = 1
                    
            except AttributeError: # if this is null too, keep going. There are few of these cases in the set
                continue
                
    if count == 0:
        not_found += 1

In [None]:
# change label on confirmed stars with no planets under the timeframe we're looking at
#     add a 'detectable' column?

# drop stars from confirmed planets that cannot be found? how many are there?, can I get this data somewhere else?

### Make Detectable Planets Label for Confirmed Planets Set:
Label should only be positive if the planet has a detectable orbital period.
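
A rough sketch of how such a label could be derived, on toy data. The threshold `MAX_DETECTABLE_PERIOD_DAYS` is a placeholder value, not a figure from this notebook, and the toy periods are invented; the column names match the catalog read in above.

```python
import pandas as pd

MAX_DETECTABLE_PERIOD_DAYS = 40.0  # assumed placeholder; set to the real observation window

# Toy catalog: one row per planet, periods stored as messy strings
catalog = pd.DataFrame({
    'Star name': ['S1', 'S1', 'S2', 'S3'],
    'Orbital period [days]': ['3.5', '400', '120.0', '~12'],
})

# Coerce to numbers; anything unparseable (such as '~12') becomes NaN
period = pd.to_numeric(catalog['Orbital period [days]'], errors='coerce')

# Shortest known period per star, then compare with the threshold
shortest = period.groupby(catalog['Star name']).min()
detectable = (shortest < MAX_DETECTABLE_PERIOD_DAYS).astype(int)

print(detectable)
```

Stars whose periods cannot be parsed end up with label 0 in this sketch; counting them separately would show how much information the cleaning step discards.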

In [None]:
# clean all_confirmed planets orbital period column to be usable (numeric and no weird symbols)

# for each confirmed planet star
#     search for it in all_confirmed
#     if there is no planet with that star name with an orbit less than the detectable period
#         drop it

#     if it can't be found in all_confirmed
#         drop it and tally how many of these there are

### Set the Light Curves to the Same Time Frame:
That way there are no nulls and we can compare all the light curves from all datasets.
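
One way this could look, assuming every table has already had its gaps closed so that each row's non-null values sit at the left: trim all tables to the length of the shortest light curve and give them shared column names. `trim_to_common_length` and the `flux_k` names are hypothetical.

```python
import numpy as np
import pandas as pd

def trim_to_common_length(frames):
    """Cut every table to the shortest light curve found in any of them."""
    n = min(int(f.notna().sum(axis=1).min()) for f in frames)
    cols = [f'flux_{k}' for k in range(n)]  # consistent feature names
    trimmed = []
    for f in frames:
        t = f.iloc[:, :n].copy()
        t.columns = cols
        trimmed.append(t)
    return trimmed

# Toy left-packed tables (values are invented)
a = pd.DataFrame([[1.0, 2.0, 3.0, np.nan], [1.0, 2.0, 3.0, 4.0]])
b = pd.DataFrame([[5.0, 6.0, np.nan], [7.0, 8.0, 9.0]])

a_trim, b_trim = trim_to_common_length([a, b])
print(a_trim)
print(b_trim)
```

A single very short light curve shrinks every table, so it may be worth dropping the shortest rows before trimming.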

In [None]:
# max row length should be the number of nonmissing values in the shortest clean light curve
# make sure the feature names are consistent and usable

### Save to a New File:

In [None]:
# df_squished.to_csv('../clean_planet_data/clean_labeled_kep_c4.csv', index=False)

In [None]:
# least_null_planets.to_csv('../clean_planet_data/clean_labeled_planets.csv', index=False)

### Done!