# Data Preprocessing Assignment

#### NASA has asked for your help in determining which features of an [exoplanet](https://en.wikipedia.org/wiki/Exoplanet#:~:text=An%20exoplanet%20or%20extrasolar%20planet,of%20detection%20occurred%20in%201992) have the largest impact on predicting the [stellar magnitude](https://earthsky.org/astronomy-essentials/what-is-stellar-magnitude) of the celestial body. But before you can get down to the machine learning, you'll first need to make the data usable!

First you'll need to load the necessary modules and libraries. The ones loaded here should be everything you'll need.

In [None]:
!pip install category_encoders

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler

from category_encoders.leave_one_out import LeaveOneOutEncoder

Then you'll need to load the dataset into a pandas DataFrame.

In [2]:
df = pd.read_csv('nasa_exoplanets.csv')

Let's see what the data looks like.

In [3]:
df.head()

Unnamed: 0,name,light_years_from_earth,planet_mass,stellar_magnitude,discovery_date,planet_type,planet_radius,orbital_radius,orbital_period,eccentricity,solar_system_name,planet_discovery_method,planet_orbital_inclination,planet_density,right_ascension,declination,host_temperature,host_mass,host_radius
0,11 Comae Berenices b,305.0,19.4 Jupiters,4.74,2007,Gas Giant,1.08 x Jupiter,1.29 AU,326 days,0.23,11 Com,Radial Velocity,,,12h20m43.03s,+17d47m34.3s,4742.0,2.7,19.0
1,11 Ursae Minoris b,410.0,14.74 Jupiters,5.016,2009,Gas Giant,1.09 x Jupiter,1.53 AU,1.4 years,0.08,11 UMi,Radial Velocity,,,15h17m05.89s,+71d49m26.0s,4213.0,2.78,29.79
2,14 Andromedae b,247.0,4.8 Jupiters,5.227,2008,Gas Giant,1.15 x Jupiter,0.83 AU,185.8 days,0.0,14 And,Radial Velocity,,,23h31m17.42s,+39d14m10.3s,4813.0,2.2,11.0
3,14 Herculis b,59.0,4.66 Jupiters,6.61,2002,Gas Giant,1.15 x Jupiter,2.93 AU,4.9 years,0.37,14 Her,Radial Velocity,,,16h10m24.31s,+43d49m03.5s,5338.0,0.9,0.93
4,16 Cygni B b,69.0,1.78 Jupiters,6.25,1996,Gas Giant,1.2 x Jupiter,1.66 AU,2.2 years,0.68,16 Cyg B,Radial Velocity,,,19h41m51.97s,+50d31m03.1s,5750.0,1.08,1.13


One of the NASA scientists on the project thinks that the host star's radius might affect how large the planets that form around it can be. But looking at the graph below, you can see there might be an issue with using it in an ML model.

In [None]:
fig, ax = plt.subplots(1)
sns.distplot(df['host_radius'])

What sort of preprocessing could you use to fix the data issue?

In [None]:
# Replace missing host radii with the column mean (NaNs are skipped when the mean is computed)
df['host_radius_fixed'] = df['host_radius'].fillna(value=df['host_radius'].mean())
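
As a quick sanity check on the imputation step above, here is a minimal sketch on a toy column. The values and the `toy` frame are made up for illustration and are not taken from the NASA file. It counts the missing entries, fills them with the column mean, and confirms nothing is left missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['host_radius']; values are illustrative.
toy = pd.DataFrame({"host_radius": [19.0, 29.79, np.nan, 0.93, np.nan, 1.13]})

# Count the missing entries before imputing.
n_missing = toy["host_radius"].isna().sum()

# Mean imputation: pandas skips NaN when computing the mean.
col_mean = toy["host_radius"].mean()
toy["host_radius_fixed"] = toy["host_radius"].fillna(col_mean)

print("missing before:", n_missing)
print("missing after:", toy["host_radius_fixed"].isna().sum())
```

Mean imputation keeps every row but pulls the filled values toward the centre of the distribution, which is worth keeping in mind when reading the next plot.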

In [None]:
fig, ax = plt.subplots(1)
sns.distplot(df['host_radius_fixed'])

Now that you've fixed the first issue with the data, what do you notice? Could there be a problem with the range of the data? If so, how would you process it to make it more useful?

In [None]:
#Instantiate some object here that would be useful
scaler = MinMaxScaler()

In [None]:
# MinMaxScaler expects a 2D input; ravel() flattens the (n, 1) result back into a single column
df['host_radius_fixed'] = scaler.fit_transform(df[['host_radius_fixed']]).ravel()
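
To make the scaling step concrete, the sketch below applies the min-max formula by hand, x' = (x - min) / (max - min), which is what `MinMaxScaler` computes with its default `feature_range` of (0, 1). The `radius` series is a hypothetical stand-in for the imputed column, not the real data.

```python
import pandas as pd

# Hypothetical values standing in for the imputed host radius column.
radius = pd.Series([0.93, 1.13, 11.0, 19.0, 29.79], name="host_radius_fixed")

# Min-max scaling: subtract the minimum, then divide by the range.
scaled = (radius - radius.min()) / (radius.max() - radius.min())

print(scaled.round(3).tolist())
print("min:", scaled.min(), "max:", scaled.max())
```

The smallest value maps to 0 and the largest to 1, so a single very large radius compresses everything else toward 0. Scaling changes the range, not the shape of the distribution.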

In [None]:
fig, ax = plt.subplots()
sns.distplot(df['host_radius_fixed'])

Another scientist has posited that the TYPE of the planet could have a large impact on its stellar magnitude. But the data has been saved as a string variable, and ML models can only take numeric data. What should you do? Hint: planet_type is considered to have *LOW CARDINALITY* for this exercise.

In [None]:
df['planet_type'].value_counts()

In [None]:
df = pd.get_dummies(df, columns=['planet_type'])
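
For a feel of what `pd.get_dummies` does to a low-cardinality column, here is a small sketch on a hypothetical three-row frame. "Gas Giant" appears in the data above; the other label and the planet names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame; only "Gas Giant" is taken from the real data.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "planet_type": ["Gas Giant", "Super Earth", "Gas Giant"],
})

# One-hot encoding: the original column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["planet_type"])

print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own column named `planet_type_<value>`, which is why this approach suits columns with only a handful of distinct values.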

Let's see what the data looks like now that you've transformed planet_type into something a model can ingest.

In [None]:
df.head()

Another scientist believes that HOW a planet was discovered could give some clues about the stellar magnitude. Perhaps only really bright planets can be identified using a specific method? But again, the variable was saved in a string format. What method could you use to make it numeric? Hint: For the purposes of this exercise, planet_discovery_method is being treated as *HIGH CARDINALITY*.

In [None]:
df.columns

In [None]:
# Instantiate some object here that would be useful
loo = LeaveOneOutEncoder()

In [None]:
df['planet_discovery_method_transformed'] = loo.fit_transform(df['planet_discovery_method'], df['stellar_magnitude'])
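
The idea behind leave-one-out target encoding can be sketched by hand in pandas: each row's category is replaced by the mean target of the *other* rows in the same category. The toy frame below is hypothetical ("Transit" and all magnitudes are illustrative), and `LeaveOneOutEncoder` may treat edge cases differently, so read this as an approximation of the concept rather than the library's exact behaviour.

```python
import pandas as pd

# Hypothetical toy data; only "Radial Velocity" is taken from the real dataset.
toy = pd.DataFrame({
    "planet_discovery_method": [
        "Radial Velocity", "Radial Velocity", "Radial Velocity",
        "Transit", "Transit",
    ],
    "stellar_magnitude": [4.0, 5.0, 6.0, 10.0, 12.0],
})

# Per-row sum and count of the target within each category.
grouped = toy.groupby("planet_discovery_method")["stellar_magnitude"]
group_sum = grouped.transform("sum")
group_count = grouped.transform("count")

# Leave-one-out mean: remove the row's own target before averaging.
# A category with a single row would divide by zero here and needs a fallback.
toy["method_loo"] = (group_sum - toy["stellar_magnitude"]) / (group_count - 1)

print(toy)
```

Because each row's own target is excluded, rows in the same category receive slightly different encodings, which limits how much the target leaks into the feature.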

In [None]:
df.head()

Let's check to make sure it worked.

In [None]:
df[['planet_discovery_method', 'planet_discovery_method_transformed']].head(20)

#### With that we're ready to start our machine learning process! Now we just need to wait for the grant money to come in...