Data comes from [The Extrasolar Planet Encyclopedia](http://exoplanet.eu/). Thanks to Ilya Marchenko for sharing this dataset on [Kaggle](https://www.kaggle.com/ilyamarchenko/full-exoplanet-catalog?select=exoplanet_confirm_and_candidates.csv).

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

In [None]:
exo_full_dataset = pd.read_csv('/content/drive/My Drive/exoplanets.csv')
exo_full_dataset.head()

In [None]:
exo = exo_full_dataset.loc[:, ['radius', 'mass', 'planet_status', 'orbital_period', 'star_distance']] 

In [None]:
print("\nUnique values\n",exo.nunique())
print("\nNull values\n\n", exo.isna().sum())

## Create dummy example data

We'll first demonstrate each technique on the simple DataFrame created below, then on the more realistic CSV file.

In [None]:
import numpy as np

alien_species = {
    "alien_height": [80, 63, 70, 93, np.nan],
    "alien_age": [12, np.nan, 87, 415, 892],
    "home_planet": ["Mars", "Jupiter", "Europa", "Mars", "Europa"],
}

alien_df = pd.DataFrame(alien_species)
alien_df.head()

## Imputation
First the simple DataFrame.

### `alien_df` Example

In [None]:
from sklearn.impute import SimpleImputer
features = alien_df.loc[:, ["alien_height", "alien_age"]]
print(features.head(), "\n")
# SimpleImputer's default strategy fills each missing value with its column's mean
imp = SimpleImputer()
imp.fit(features)
imputed = imp.transform(features)

# the rest of this code block reformats the data so it prints in a readable way. don't sweat it!
# scikit-learn returns a plain NumPy array, which has no column headers, so add them back like so:
imputed_alien_df = pd.DataFrame(imputed, columns=features.columns)
print(imputed_alien_df.head())
# adding back the categorical data
imputed_alien_df["home_planet"] = alien_df["home_planet"]
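
To see what the imputer actually filled in, here is a minimal, self-contained sketch. It rebuilds the two numeric columns of `alien_df` and spells out `strategy="mean"`, which is `SimpleImputer`'s default. The names `column_means` and `filled` are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy values as the numeric columns of alien_df.
features = pd.DataFrame({
    "alien_height": [80, 63, 70, 93, np.nan],
    "alien_age": [12, np.nan, 87, 415, 892],
})

# strategy="mean" is the default; it is written out here for clarity.
imp = SimpleImputer(strategy="mean")
filled = pd.DataFrame(
    imp.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# pandas skips NaN when averaging, so these are the means of the observed values.
column_means = features.mean()

print(column_means, "\n")
print(filled)
```

The missing `alien_height` becomes 76.5 and the missing `alien_age` becomes 351.5, the means of the four observed values in each column.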

Now let's perform imputation on the exoplanets dataset.

### Exoplanets Example

In [None]:
exo_numbers = exo.loc[:, ['radius', 'mass', 'orbital_period', 'star_distance']]
print(exo.head())
print("\nNull values\n\n", exo.isna().sum(), "\n")
imp = SimpleImputer()
imp.fit(exo_numbers)
imputed = imp.transform(exo_numbers)

# reformatting the imputed data below
imputed_exo_df = pd.DataFrame(imputed, columns=exo_numbers.columns)
# adding back in the categorical data
imputed_exo_df["planet_status"] = exo["planet_status"]
print(imputed_exo_df.head())
print("\nNull values\n\n", imputed_exo_df.isna().sum())

## One-Hot Encoding
First the `alien_df` data.

### `alien_df` Example

In [None]:
enc_alien_df = pd.get_dummies(imputed_alien_df)

print(imputed_alien_df.head(), "\n")
print(enc_alien_df.head())
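
As a rough, self-contained sketch of what `pd.get_dummies` does here: each distinct value in a text column becomes its own indicator column, while numeric columns pass through unchanged. The frame `toy` and the `columns=` argument are illustrative choices, not part of the cells above.

```python
import pandas as pd

# A small stand-in for imputed_alien_df: one numeric and one categorical column.
toy = pd.DataFrame({
    "alien_height": [80.0, 63.0, 70.0, 93.0, 76.5],
    "home_planet": ["Mars", "Jupiter", "Europa", "Mars", "Europa"],
})

# Naming the column explicitly; without `columns=`, get_dummies encodes
# every object/category column it finds.
encoded = pd.get_dummies(toy, columns=["home_planet"])

dummy_cols = [c for c in encoded.columns if c.startswith("home_planet_")]
print(encoded.columns.tolist())
# Cast to int so the indicators print as 0/1 regardless of pandas version.
print(encoded[dummy_cols].astype(int))
```

Every row has exactly one indicator set, which is where the name "one-hot" comes from.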

Now the exoplanet dataset.

### Exoplanets Example

In [None]:
print(imputed_exo_df.loc[:, "planet_status"].unique())
imputed_exo_df.head()

In [None]:
enc_exo_df = pd.get_dummies(imputed_exo_df)
print(imputed_exo_df.head(), "\n")
print(enc_exo_df.head())

## Further Practice
If you want to sharpen your skills, try feature scaling the exoplanet data below. Then you'd have a wholly preprocessed dataset!

(One-hot encoded columns contain only 0s and 1s, so they're already on a small, bounded scale and you don't need to worry about scaling them.)
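
If you'd like a reminder of the general shape before trying the exoplanet data, here is a minimal sketch on the toy alien numbers. It assumes standardization with `StandardScaler` as the scaling method; other scalers such as `MinMaxScaler` follow the same fit/transform pattern. The frame `numeric` is a hand-typed copy of the imputed toy values.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Imputed numeric columns from the toy alien data (no missing values left).
numeric = pd.DataFrame({
    "alien_height": [80.0, 63.0, 70.0, 93.0, 76.5],
    "alien_age": [12.0, 351.5, 87.0, 415.0, 892.0],
})

# StandardScaler subtracts each column's mean and divides by its
# (population) standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)

print(scaled)
# ddof=0 matches the population standard deviation that StandardScaler uses.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

When you split the data with `train_test_split`, it is common practice to fit the scaler on the training split only and then use it to transform both splits.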

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# remember how this goes?