<a href="https://colab.research.google.com/github/rohansiddam/Python-Journey/blob/main/020%20-%20Lesson%2020%20(Hunting%20Exoplanets%20In%20Space).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 20: Hunting Exoplanets In Space


### Teacher-Student Activities

In the previous class, we learnt what the Fourier transformation is, why to apply it to a dataset and how to apply it.

Also, we transformed both the `exo_train_df` and `exo_test_df` DataFrames by creating the `fast_fourier_transform()` function and then by applying it vertically on the DataFrame using the `apply()` function.

In this class, we will learn how to synthesize (or manufacture) artificial data points in a dataset by applying an oversampling technique. In classification problems such as this one, the data is often highly imbalanced.

**Imbalanced Dataset:**

In a highly imbalanced dataset, the number of data points for one class is very high compared to another class. The class having the largest number of data points is called the **majority class** whereas the class having the smallest number of data points is called the **minority class**.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C20/APT_C20_slide_2.png"/>


In the case of the exoplanets dataset, class `1` is the majority class because most of the stars in the dataset do not have a planet.


The `exo_train_df` dataset has a total of `5087` stars, of which only `37` have a planet and the remaining `5050` don't. The percentage of stars having a planet is
$\frac{37 \times 100}{5087} = 0.727$
% which is very low. Hence, the training dataset is highly imbalanced.


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C20/APT_C20_slide_3.png"/>


The test dataset is also highly imbalanced because out of `570` data points, it contains only `5` stars labelled as class `2`.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C20/APT_C20_slide_4.png"/>



So, the percentage of class `2` data points is
$\frac{5 \times 100}{570} = 0.877$
% which is also very low. Thus, the test dataset is also highly imbalanced.
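
The two percentages above can be reproduced with a few lines of plain Python. This is only a sketch built from the counts quoted in the text; the helper name `minority_percentage` is made up for illustration.

```python
# Class counts quoted in the text above.
train_total, train_planets = 5087, 37
test_total, test_planets = 570, 5

def minority_percentage(minority_count, total_count):
    """Return the minority class share of a dataset as a percentage."""
    return minority_count * 100 / total_count

train_pct = minority_percentage(train_planets, train_total)
test_pct = minority_percentage(test_planets, test_total)

print(f"Training set: {train_pct:.3f}% of stars have a planet")
print(f"Test set:     {test_pct:.3f}% of stars have a planet")
```

Both values come out below 1%, which is why both datasets are called highly imbalanced.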

Oversampling:
---

- The major problem with imbalanced data is that a prediction model tends to be biased in favour of the majority class when making predictions. Recall that when we deployed the Random Forest Classifier model, it labelled every star in the test dataset as `1` even though the test dataset contains `5` stars belonging to class `2`.

- An oversampling technique synthesizes the artificial data points for the minority class data to balance a highly imbalanced dataset.
- An oversampling technique is required to remove the bias in favour of the majority class in a dataset.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C20/APT_C20_slide_5_new.gif"/>

Hence, using an oversampling technique, we can artificially synthesize the minority class data in a training dataset so that both the classes have equal representation in the dataset.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C20/APT_C20_slide_6.png"/>

**Note:** The oversampling technique is applied only to the training dataset. It is never applied to the test dataset.

Let's run all the code cells that we have already covered in the previous classes and begin this class from the **Activity 1: Oversampling For Classification Problems - SMOTE** section. You should also run the code cells until the first activity.

---

#### Loading The Datasets
Create a Pandas DataFrame every time you start the Jupyter notebook.

Dataset links (don't click on them):

1. Train dataset

  https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv

2. Test dataset

  https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv

In [None]:
# Load both the training and test datasets.
import numpy as np
import pandas as pd

exo_train_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv')
exo_test_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv')

In [None]:
# The shapes of the 'exo_train_df' and 'exo_test_df' DataFrames.
print(exo_train_df.shape)
exo_test_df.shape

(5087, 3198)


(570, 3198)

In the previous classes, we have already checked that the datasets don't have any missing values. So, we can skip that part.

---

#### Data Normalisation

After creating a DataFrame and inspecting data for the missing values, we can normalise the data.

$$x_{norm} = \frac{x_p - x_{mean}}{x_{max} - x_{min}}$$

In [None]:
# Function for mean normalisation.
def mean_normalise(series):
  norm_series = (series - series.mean()) / (series.max() - series.min())
  return norm_series
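
As a quick check of the formula, here is a small sketch that applies the same `mean_normalise()` logic to a made-up set of five brightness readings. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def mean_normalise(series):
    # Same formula as in the lesson: (x - mean) / (max - min).
    return (series - series.mean()) / (series.max() - series.min())

# Hypothetical brightness readings for a single star (made-up numbers).
flux = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
norm_flux = mean_normalise(flux)

print(norm_flux)                          # normalised readings
print(norm_flux.mean())                   # mean after normalisation
print(norm_flux.max() - norm_flux.min())  # range after normalisation
```

After mean normalisation, the readings are centred on zero and the gap between the largest and smallest value is one.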

In [None]:
# Applying the 'mean_normalise()' function horizontally on the training DataFrame.
norm_train_df = exo_train_df.iloc[:, 1:].apply(mean_normalise, axis=1)
norm_train_df.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,0.053834,0.047391,0.00651,-0.023699,-0.031772,-0.08641,-0.093128,-0.068161,-0.05765,-0.109164,...,-0.056482,-0.071934,-0.071934,0.009738,0.024779,0.052993,0.018843,0.033024,-0.003127,-0.031759
1,-0.050411,-0.042317,-0.081922,-0.052351,-0.115212,-0.104794,-0.126816,-0.124861,-0.122681,-0.105708,...,0.006648,-0.039721,-0.039721,-0.027988,0.004116,0.013124,-0.006847,0.02226,0.03755,0.043849
2,0.243983,0.245509,0.235186,0.227365,0.208538,0.212981,0.212283,0.222467,0.199285,0.221536,...,-0.037161,0.002382,0.002382,-0.017715,-0.013523,-0.001456,-0.009299,-0.017259,-0.036384,-0.048782
3,0.518501,0.551177,0.480659,0.474051,0.504754,0.496863,0.511941,0.494687,0.496425,0.513506,...,0.016215,0.001435,0.001435,0.054324,0.038636,-0.012562,-0.006456,-0.019827,-0.019889,0.029163
4,-0.399904,-0.401872,-0.404199,-0.395473,-0.381734,-0.373293,-0.36007,-0.368986,-0.356861,-0.350022,...,-0.212262,-0.141752,-0.141752,-0.125499,-0.157156,-0.155246,-0.141038,-0.135528,-0.145458,-0.18159


In [None]:
# Inserting the 'LABEL' column to the 'norm_train_df' DataFrame.
norm_train_df.insert(loc=0, column='LABEL', value=exo_train_df['LABEL'])
norm_train_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,0.053834,0.047391,0.00651,-0.023699,-0.031772,-0.08641,-0.093128,-0.068161,-0.05765,...,-0.056482,-0.071934,-0.071934,0.009738,0.024779,0.052993,0.018843,0.033024,-0.003127,-0.031759
1,2,-0.050411,-0.042317,-0.081922,-0.052351,-0.115212,-0.104794,-0.126816,-0.124861,-0.122681,...,0.006648,-0.039721,-0.039721,-0.027988,0.004116,0.013124,-0.006847,0.02226,0.03755,0.043849
2,2,0.243983,0.245509,0.235186,0.227365,0.208538,0.212981,0.212283,0.222467,0.199285,...,-0.037161,0.002382,0.002382,-0.017715,-0.013523,-0.001456,-0.009299,-0.017259,-0.036384,-0.048782
3,2,0.518501,0.551177,0.480659,0.474051,0.504754,0.496863,0.511941,0.494687,0.496425,...,0.016215,0.001435,0.001435,0.054324,0.038636,-0.012562,-0.006456,-0.019827,-0.019889,0.029163
4,2,-0.399904,-0.401872,-0.404199,-0.395473,-0.381734,-0.373293,-0.36007,-0.368986,-0.356861,...,-0.212262,-0.141752,-0.141752,-0.125499,-0.157156,-0.155246,-0.141038,-0.135528,-0.145458,-0.18159


In [None]:
# Applying the 'mean_normalise()' function on the testing DataFrame.
norm_test_df = exo_test_df.iloc[:, 1:].apply(mean_normalise, axis=1)
norm_test_df.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,0.273347,0.228221,0.196676,0.110003,0.10413,0.08869,0.040926,0.014337,0.013534,-0.052079,...,0.031635,0.042578,0.031451,-0.005393,0.028904,0.102708,0.071576,0.080408,0.616438,0.130742
1,0.394038,0.39148,0.39268,0.390974,0.388955,0.386673,0.38634,0.382364,0.381035,0.374634,...,-0.047311,-0.075404,-0.092643,-0.118456,-0.134109,-0.150638,-0.164944,-0.171944,-0.166961,-0.14879
2,0.64815,0.627582,0.591444,0.519002,0.466046,0.385214,0.340496,0.281192,0.162553,0.11926,...,0.018179,-0.034769,-0.032201,-0.041117,-0.057967,-0.128412,-0.067972,-0.119374,-0.023437,0.027941
3,-0.232813,-0.233212,-0.238944,-0.235869,-0.208281,-0.220224,-0.222214,-0.208586,-0.197319,-0.188186,...,0.056186,0.047254,0.047254,0.039873,0.021893,0.025227,0.025075,-0.017912,-0.059585,-0.04674
4,-0.006994,0.003426,0.006382,0.00761,0.003316,-0.000167,0.010016,-0.009471,0.008195,0.016842,...,-0.006247,-0.016795,-0.001531,0.001095,-0.004439,-0.027127,-0.025421,-0.016852,-0.020089,0.002564


In [None]:
# Inserting the 'LABEL' column to the 'norm_test_df' DataFrame.
norm_test_df.insert(loc=0, column='LABEL', value=exo_test_df['LABEL'])
norm_test_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,0.273347,0.228221,0.196676,0.110003,0.10413,0.08869,0.040926,0.014337,0.013534,...,0.031635,0.042578,0.031451,-0.005393,0.028904,0.102708,0.071576,0.080408,0.616438,0.130742
1,2,0.394038,0.39148,0.39268,0.390974,0.388955,0.386673,0.38634,0.382364,0.381035,...,-0.047311,-0.075404,-0.092643,-0.118456,-0.134109,-0.150638,-0.164944,-0.171944,-0.166961,-0.14879
2,2,0.64815,0.627582,0.591444,0.519002,0.466046,0.385214,0.340496,0.281192,0.162553,...,0.018179,-0.034769,-0.032201,-0.041117,-0.057967,-0.128412,-0.067972,-0.119374,-0.023437,0.027941
3,2,-0.232813,-0.233212,-0.238944,-0.235869,-0.208281,-0.220224,-0.222214,-0.208586,-0.197319,...,0.056186,0.047254,0.047254,0.039873,0.021893,0.025227,0.025075,-0.017912,-0.059585,-0.04674
4,2,-0.006994,0.003426,0.006382,0.00761,0.003316,-0.000167,0.010016,-0.009471,0.008195,...,-0.006247,-0.016795,-0.001531,0.001095,-0.004439,-0.027127,-0.025421,-0.016852,-0.020089,0.002564


---

#### Transpose Of A DataFrame


In [None]:
# Transpose the 'exo_train_df' DataFrame using the 'T' attribute.
exo_train_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5077,5078,5079,5080,5081,5082,5083,5084,5085,5086
LABEL,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.00,...,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00
FLUX.1,93.85,-38.88,532.64,326.52,-1107.21,211.10,9.34,238.77,-103.54,-265.91,...,125.57,7.45,475.61,-46.63,299.41,-91.91,989.75,273.39,3.82,323.28
FLUX.2,83.81,-33.83,535.92,347.39,-1112.59,163.57,49.96,262.16,-118.97,-318.59,...,78.69,10.02,395.50,-55.39,302.77,-92.97,891.01,278.00,2.09,306.36
FLUX.3,20.10,-58.54,513.73,302.35,-1118.95,179.16,33.30,277.80,-108.93,-335.66,...,98.29,6.87,423.61,-64.88,278.68,-78.76,908.53,261.73,-3.29,293.16
FLUX.4,-26.98,-40.09,496.92,298.13,-1095.10,187.82,9.63,190.16,-72.25,-450.47,...,91.16,-2.82,376.36,-88.75,263.48,-97.33,851.83,236.99,-2.88,287.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
FLUX.3193,92.54,0.76,5.06,-12.67,-438.54,19.27,-0.44,95.30,4.53,3162.53,...,210.09,8.75,163.02,28.82,-74.95,151.75,-136.16,-3.47,-1.50,-25.33
FLUX.3194,39.32,-11.70,-11.80,-8.77,-399.71,-43.90,10.90,48.86,21.95,3398.28,...,3.80,-10.69,86.29,-20.12,-46.29,-24.45,38.03,65.73,-4.65,-41.31
FLUX.3195,61.42,6.46,-28.91,-17.31,-384.65,-41.63,-11.77,-10.62,26.94,3648.34,...,16.33,-9.54,13.06,-14.41,-3.08,-17.00,100.28,88.42,-14.55,-16.72
FLUX.3196,5.08,16.00,-70.02,-17.35,-411.79,-52.90,-9.25,-112.02,34.08,3671.97,...,27.35,-2.48,161.22,-43.35,-28.43,3.23,-45.64,79.07,-6.41,-14.09


---

#### Fast Fourier Transformation

Applying the Fourier Transformation on the datasets.

In [None]:
# Create a function and name it 'fast_fourier_transform()' to apply Fast Fourier Transformation on the DataFrames.
import numpy as np

def fast_fourier_transform(star):
  fft_star = np.fft.fft(star, n=len(star))
  return np.abs(fft_star)

In [None]:
# Get a frequency array/series for both the training and test datasets.
freq = np.fft.fftfreq(len(exo_train_df.iloc[0, 1:]))
freq

array([ 0.        ,  0.00031279,  0.00062559, ..., -0.00093838,
       -0.00062559, -0.00031279])
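
To see where the numbers in the `freq` array come from, the sketch below calls `np.fft.fftfreq()` on a tiny case and then on `3197` readings, the number of `FLUX` columns. It assumes the default sample spacing of `1`, so the frequencies are in cycles per reading.

```python
import numpy as np

# A tiny case first: 8 samples give 8 frequency bins.
small_freq = np.fft.fftfreq(8)
print(small_freq)  # starts at 0, rises in steps of 1/8, then wraps to negative values

# The lesson's case: 3197 FLUX readings per star.
n_readings = 3197
freq = np.fft.fftfreq(n_readings)
print(freq[1], 1 / n_readings)  # spacing between neighbouring bins
print(freq[-1])                 # last bin mirrors the second one with a minus sign
```

The spacing between neighbouring bins is `1 / 3197`, which matches the `0.00031279` shown in the output above.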

This time we will apply the `fast_fourier_transform()` function vertically. So, we will first transpose the DataFrame, then apply the function to each column and finally transpose the result back.

**Note:** We don't want to transform the `LABEL` values. We want to transform the `FLUX` values only.
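
The transpose, apply, transpose sequence described above amounts to running the FFT separately on each row (star). The sketch below illustrates this on a small random array that stands in for the `FLUX` columns; the array and its size are hypothetical, and the `axis=1` call is shown only as an equivalent NumPy route, not as the method used in this lesson.

```python
import numpy as np

def fast_fourier_transform(star):
    # Same function as in the lesson: magnitude of the FFT of one star.
    return np.abs(np.fft.fft(star, n=len(star)))

# Hypothetical stand-in for the FLUX columns: 3 stars, 8 readings each.
rng = np.random.default_rng(0)
flux_rows = rng.normal(size=(3, 8))

# Star by star, which is what the transposed apply() call effectively does.
per_star = np.array([fast_fourier_transform(row) for row in flux_rows])

# All stars at once: axis=1 runs the FFT along each row.
all_at_once = np.abs(np.fft.fft(flux_rows, axis=1))

print(np.allclose(per_star, all_at_once))
```

Both routes give the same magnitudes, so the choice between them is mostly a matter of readability and speed.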

In [None]:
# Apply the 'fast_fourier_transform()' function on the transposed 'norm_train_df' DataFrame.
x_fft_train_T = norm_train_df.iloc[:, 1:].T.apply(fast_fourier_transform)
x_fft_train = x_fft_train_T.T
x_fft_train.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,6.004706e-15,2.273248,35.722883,33.978236,128.816503,18.835061,16.101795,10.722037,18.488302,16.400976,...,1.920231,16.400976,18.488302,10.722037,16.101795,18.835061,128.816503,33.978236,35.722883,2.273248
1,4.0943e-15,30.299298,36.918808,38.376852,22.149931,33.282191,27.734204,11.862346,14.001221,14.221386,...,11.833992,14.221386,14.001221,11.862346,27.734204,33.282191,22.149931,38.376852,36.918808,30.299298
2,3.742374e-15,66.80987,19.498262,170.26881,48.413391,88.178733,57.407061,38.684283,10.503268,46.482585,...,22.681374,46.482585,10.503268,38.684283,57.407061,88.178733,48.413391,170.26881,19.498262,66.80987
3,8.024386e-15,19.36972,52.151962,108.097894,100.659024,269.416639,77.435861,71.256558,54.895479,33.335462,...,26.711804,33.335462,54.895479,71.256558,77.435861,269.416639,100.659024,108.097894,52.151962,19.36972
4,4.881195e-15,113.576655,51.382781,146.597215,148.627668,103.842855,116.738548,28.957862,36.451207,69.375686,...,4.923027,69.375686,36.451207,28.957862,116.738548,103.842855,148.627668,146.597215,51.382781,113.576655
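
Two patterns are visible in the output above: the first column is practically zero, and the remaining columns mirror each other (`FLUX.2` equals `FLUX.3197`, `FLUX.3` equals `FLUX.3196` and so on). The sketch below reproduces both on a made-up sine wave. The signal, its length and its frequency are assumptions chosen for illustration.

```python
import numpy as np

n = 64
t = np.arange(n)
# Hypothetical signal: a sine wave completing exactly 5 cycles, plus a constant offset.
signal = 3.0 + np.sin(2 * np.pi * 5 * t / n)

# Subtracting the mean mimics the effect of mean normalisation on the first FFT bin.
centred = signal - signal.mean()
magnitude = np.abs(np.fft.fft(centred))

print(magnitude[0])                                     # practically zero
print(np.allclose(magnitude[1:], magnitude[1:][::-1]))  # mirror symmetry
print(np.argmax(magnitude[: n // 2]))                   # index of the strongest bin in the first half
```

The first bin holds the sum of the readings, which is zero once the mean has been subtracted. The mirror symmetry holds for any real-valued input, so the second half of the columns repeats the information in the first half.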


In [None]:
# Applying the 'fast_fourier_transform()' function on the transposed 'norm_test_df' DataFrame.
x_fft_test_T = norm_test_df.iloc[:, 1:].T.apply(fast_fourier_transform)
x_fft_test = x_fft_test_T.T
x_fft_test.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,8.950904e-16,23.184733,40.545472,17.994173,13.772498,16.594294,17.532422,8.094149,8.045917,9.549431,...,7.981027,9.549431,8.045917,8.094149,17.532422,16.594294,13.772498,17.994173,40.545472,23.184733
1,8.965352e-15,135.637702,42.680618,28.01766,17.672923,16.09955,62.999485,27.005547,26.735149,29.708628,...,9.883983,29.708628,26.735149,27.005547,62.999485,16.09955,17.672923,28.01766,42.680618,135.637702
2,1.369225e-14,101.62462,26.553454,11.634754,11.720122,46.153088,23.677302,22.208643,14.393021,3.278532,...,7.742408,3.278532,14.393021,22.208643,23.677302,46.153088,11.720122,11.634754,26.553454,101.62462
3,2.373386e-15,37.305651,20.537365,5.108229,16.309293,20.286675,18.969927,6.010526,8.76304,6.370417,...,16.668561,6.370417,8.76304,6.010526,18.969927,20.286675,16.309293,5.108229,20.537365,37.305651
4,6.684428e-15,7.138386,11.941614,12.808132,27.841397,39.681676,17.985758,30.233859,14.800046,8.50153,...,5.167375,8.50153,14.800046,30.233859,17.985758,39.681676,27.841397,12.808132,11.941614,7.138386


Our prediction model should now be able to recognise the different frequency patterns of different stars and, hopefully, classify the stars correctly.

---

#### Activity 1: Oversampling For Classification Problems - SMOTE^

Three commonly used methods to synthesize artificial data points for a classification problem are:

1. Random oversampling

2. SMOTE

3. ADASYN
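
For comparison, the first method in the list, random oversampling, simply repeats existing minority class rows until the classes are balanced. The sketch below shows the idea on a made-up toy dataset using NumPy only. It is not the SMOTE method used in this lesson, and all the array names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical toy data: 6 majority rows (class 1) and 2 minority rows (class 2).
X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([1, 1, 1, 1, 1, 1, 2, 2])

minority_idx = np.flatnonzero(y == 2)
majority_count = int(np.sum(y == 1))

# Draw extra minority rows (with replacement) until both classes match in size.
n_extra = majority_count - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])

print(np.sum(y_balanced == 1), np.sum(y_balanced == 2))
```

Random oversampling only copies rows that already exist, whereas SMOTE creates new minority class points.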

We will apply the SMOTE method to synthesize the artificial data points in the training dataset. The term SMOTE stands for Synthetic Minority Over-Sampling Technique. How the SMOTE technique works is beyond the scope of this course, but we will learn how to apply it to synthesize the artificial data points for a minority class.



<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C20/APT_C20_slide_7.png"/>

Before applying the SMOTE method, let's retrieve the `LABEL` data from the training and test DataFrames.

In [None]:
# Student Action: Get the 'y_train' and 'y_test' series from the 'norm_train_df' and 'norm_test_df' DataFrames respectively.
y_train = norm_train_df['LABEL']
y_test = norm_test_df['LABEL']

To apply the `SMOTE` method, we have to follow these steps:

1. From the `imblearn.over_sampling` module, import the `SMOTE` class.

2. Then, call `SMOTE()` with `sampling_strategy=1` as an input to create a `SMOTE` object. The `sampling_strategy=1` denotes that after resampling the dataset, the data points for both the majority and minority class should be in equal numbers. In this case, class `1` has `5050` data points, so class `2` should also have `5050` data points.

3. Apply the `fit_resample()` function of the `SMOTE` object to synthesize data for the minority class.

In [None]:
# Teacher Action: Apply the 'SMOTE()' function to balance the training data.
# Import the 'SMOTE' class from the 'imblearn.over_sampling' module.
from imblearn.over_sampling import SMOTE

# Call 'SMOTE()' with 'sampling_strategy=1' as input and store the object in the 'sm' variable.
sm = SMOTE(sampling_strategy=1)
# Call the 'fit_resample()' function with the 'x_fft_train' and 'y_train' datasets as inputs.
x_fft_train_res, y_fft_train_res = sm.fit_resample(x_fft_train, y_train)

In the code above,

1. We are storing the object returned by `SMOTE(sampling_strategy=1)` in the `sm` variable.

2. Then, we are resampling the feature and target values using the `fit_resample()` function, which adds artificial data points for the minority class, and storing the results in the `x_fft_train_res` and `y_fft_train_res` variables, respectively.

Let's check the shapes of the resampled datasets.

In [None]:
# Student Action: Check the shapes of the 'x_fft_train_res' and 'y_fft_train_res' datasets.
x_fft_train_res.shape

(10100, 3197)

In [None]:
y_fft_train_res.shape

(10100,)

In [None]:
x_fft_train.shape

(5087, 3197)
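
The shapes above are what `sampling_strategy=1` should produce. The arithmetic can be checked with a short sketch that uses only the class counts quoted earlier in this lesson; the variable names are made up for illustration.

```python
majority_count = 5050   # class 1 stars in the training set (from the text)
minority_count = 37     # class 2 stars in the training set (from the text)
sampling_strategy = 1   # desired minority / majority ratio after resampling

# With a ratio of 1, the minority class is raised to the size of the majority class.
target_minority = int(sampling_strategy * majority_count)
synthetic_needed = target_minority - minority_count
total_after = majority_count + target_minority

print(synthetic_needed)  # artificial class 2 rows that have to be created
print(total_after)       # rows in the resampled training set
```

So the resampled training set contains `5013` artificial class `2` rows in addition to the `5087` original rows.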

We now have `10100` data points for the training dataset containing `5050` class `1` values and `5050` class `2` values.

Let's verify it by using the `value_counts()` function.

---

#### Activity 2: The `value_counts()` Function^^

The `value_counts()` function returns a Series containing the counts of the unique values in a Series (for example, a DataFrame column). The resulting Series is arranged in descending order of counts. To apply `value_counts()` to a Series, use the following syntax:

**Syntax:** `Series.value_counts()`

In [None]:
# Student Action: Find the number of occurrences of class '1' and class '2' values in 'y_fft_train_res'.
y_fft_train_res.value_counts()

2    5050
1    5050
Name: LABEL, dtype: int64

As you can see, both the classes, i.e., `1` and `2`, appear an equal number of times in the `y_fft_train_res` Series.
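
Here is a small sketch of `value_counts()` on a made-up Series of four labels. The `normalize=True` option, which returns shares instead of counts, is shown as an optional extra and is not used elsewhere in this lesson.

```python
import pandas as pd

# Hypothetical toy labels: three stars without a planet and one with a planet.
labels = pd.Series([1, 1, 1, 2], name="LABEL")

counts = labels.value_counts()                # number of rows per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

The most frequent class appears first in the result, which makes it easy to spot the majority class at a glance.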

Now, let's deploy the Random Forest Classifier prediction model again to see if the prediction model is able to identify the stars having a planet in the test dataset.

---

#### Activity 3: Importing The Required Libraries

Now, import the `RandomForestClassifier` class from the `sklearn.ensemble` module. Also, import the `confusion_matrix` and `classification_report` functions from the `sklearn.metrics` module.

In [None]:
# Student Action: Import the required modules from the 'sklearn.ensemble' and 'sklearn.metrics' libraries.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,classification_report

---

#### Activity 4: Applying The RandomForestClassifier Model

Now that we have processed the data to make our prediction model a little more robust, let's once again deploy the Random Forest Classifier model to see if it is able to detect the stars having a planet.

In [None]:
# Student Action: Deploy the Random Forest Classifier prediction model.
rf = RandomForestClassifier(n_jobs = -1, n_estimators = 50)
rf.fit(x_fft_train_res,y_fft_train_res)

RandomForestClassifier(n_estimators=50, n_jobs=-1)

In [None]:
rf.score(x_fft_train_res,y_fft_train_res)

1.0

In [None]:
y_pred = rf.predict(x_fft_test)

In [None]:
y_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

Let's quickly make the confusion matrix and classification report to test the efficacy of the model.

---

#### Activity 5: The Confusion Matrix & Classification Report

Now create the confusion matrix and classification report for the model deployed to see if the model is able to detect the stars having a planet.

In [None]:
# Student Action: Create the confusion matrix using the 'y_test' and 'y_pred' values as inputs.
confusion_matrix(y_test,y_pred)

array([[565,   0],
       [  5,   0]])

As you can see, the value in the second row and the second column is `0` which means the Random Forest Classifier model has failed to detect class `2` values. Thus, it failed to detect the stars having a planet.

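Hence, no star is ever predicted as class `2`, so the precision for class `2` is undefined (a division by zero) and is reported as `0`, while the recall is `0` and the f1-score is reported as `0`. Let's verify it by printing the classification report.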

In [None]:
# Student Action: Print the classification report using the 'y_test' and 'y_pred' values as inputs.
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.99      1.00      1.00       565
           2       0.00      0.00      0.00         5

    accuracy                           0.99       570
   macro avg       0.50      0.50      0.50       570
weighted avg       0.98      0.99      0.99       570
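
The numbers in the report above can be traced back to the confusion matrix. The sketch below copies the matrix from the earlier output and treats class `2` as the positive class; the variable names are chosen for illustration only.

```python
import numpy as np

# Confusion matrix from the lesson output: rows are actual classes (1, 2),
# columns are predicted classes (1, 2).
cm = np.array([[565, 0],
               [5, 0]])

true_negative, false_positive = cm[0]  # stars that actually belong to class 1
false_negative, true_positive = cm[1]  # stars that actually belong to class 2

accuracy = (true_negative + true_positive) / cm.sum()
recall = true_positive / (true_positive + false_negative)
predicted_positive = true_positive + false_positive

print(f"accuracy = {accuracy:.2f}")
print(f"recall for class 2 = {recall:.2f}")
print("precision is 0/0" if predicted_positive == 0 else "precision is defined")
```

The accuracy looks excellent even though not a single planet was found, which is why accuracy alone is a misleading measure on a highly imbalanced dataset.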





So, even after processing the data with normalisation, Fast Fourier Transformation and oversampling, the Random Forest Classifier prediction model has failed to detect the stars having a planet. One of the possible reasons for this failure could be its inability to form the right decision trees (recall that a random forest is a collection of decision trees). This suggests that we may have to process the data further or apply a different prediction model.

Let's deploy the **XGBoost Classifier** model to see if it can detect the stars having a planet. If it successfully detects the class `2` values, then the XGBoost Classifier model is a more appropriate model for making predictions here than the Random Forest Classifier model. If not, then we will have to further process the data and then deploy the classification models again.

---

#### Activity 6: The XGBoost Classifier Model^^^

**How to deploy the XGBoost Classifier model?**

1. Import the `xgboost` library with `xg` as an alias.
2. Use the `XGBClassifier()` function of the `xgboost` library to initiate the model.
3. Call the `fit()` function with `x_fft_train_res` and `y_fft_train_res` to deploy the model.
4.  Call the `predict()` function on `x_fft_test` data to get the predicted values.

You can read about the XGBoost Python package by clicking on the link provided in the **Activities** section under the title **XGBoost Python Package**.



**CAUTION:** The XGBoost Classifier is a computationally heavy model. It can need a lot of RAM and CPU time (and, optionally, a GPU) to run, so it will take some time to learn the feature variables from the training data and then make predictions on the test data. Hence, use it ONLY if all the other lightweight (requiring less RAM and CPU) prediction models fail.

In [None]:
# Teacher Action: Deploy the XGBoost Classifier model to detect the stars having a planet.
import xgboost as xg
# Call the 'XGBClassifier()' function and store it in the 'model' variable.
model = xg.XGBClassifier()
# Call the 'fit()' function with the resampled 'x_fft_train_res' and 'y_fft_train_res' datasets as inputs.
model.fit(x_fft_train_res,y_fft_train_res)
# Make predictions on test data by calling the 'predict()' function with 'x_fft_test' data as input.
y_pred_2 = model.predict(x_fft_test)
# Display the predicted values.
y_pred_2

array([2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
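
Depending on the installed `xgboost` version, the `fit()` function may expect the class labels to be consecutive integers starting at `0` and may raise an error for the labels `1` and `2`. This is an assumption about newer releases and not something shown in this lesson. If that happens, one possible workaround is to remap the labels before training and map the predictions back afterwards. The sketch below shows only the remapping step with NumPy on made-up labels; `pred_encoded` is a hypothetical stand-in for the output of `predict()`.

```python
import numpy as np

# Hypothetical handful of labels in the lesson's encoding (1 = no planet, 2 = planet).
y_example = np.array([2, 2, 1, 1, 1])

# np.unique returns the sorted class values and, for each label, its index in that list.
classes, y_encoded = np.unique(y_example, return_inverse=True)
print(classes)    # original class values
print(y_encoded)  # labels remapped to 0 and 1

# Pretend these came back from model.predict() in the 0/1 encoding.
pred_encoded = np.array([1, 0, 0, 0, 1])
pred_original = classes[pred_encoded]
print(pred_original)  # predictions mapped back to 1 and 2
```

With this remapping, the confusion matrix and classification report can still be built with the original `1` and `2` labels.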

Now that we have got the predicted values, let's create a confusion matrix to check if the XGBoost Classifier model has detected any class `2` values in the test dataset.

In [None]:
# Student Action: Create the confusion matrix using the 'y_test' and 'y_pred_2' values as inputs.
confusion_matrix(y_test,y_pred_2)

array([[565,   0],
       [  2,   3]])

As you can see, the value in the second row and the second column is greater than `0`. Hence, the XGBoost Classifier prediction model has successfully detected a few stars belonging to class `2`. Finally, we can breathe a sigh of relief. However, it has also classified `2` stars as class `1` which should have been classified as class `2`. Nonetheless, this is a great achievement because out of `570` stars in the test dataset, only `5` have a planet, and detecting them is like finding a needle in a haystack. So, we should be happy about finding `3` of them.


Now, let's compute the precision, recall and f1-scores to test the efficacy of the XGBoost Classifier model. As a rough rule of thumb, an f1-score greater than `0.5` for the minority class suggests a reasonably good classification model.

In [None]:
# Student Action: Print the classification report using the 'y_test' and 'y_pred_2' values as inputs.
print(classification_report(y_test,y_pred_2))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00       565
           2       1.00      0.60      0.75         5

    accuracy                           1.00       570
   macro avg       1.00      0.80      0.87       570
weighted avg       1.00      1.00      1.00       570

As you can see, the precision (`1.00`), recall (`0.60`) and f1-score (`0.75`) for the class `2` values are quite high. **The closer they are to ONE, the better is the classification model.**
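
The class `2` scores can be derived by hand from the XGBoost confusion matrix shown earlier. The sketch below copies that matrix and applies the standard definitions of precision, recall and f1-score; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix from the XGBoost output: rows are actual classes (1, 2),
# columns are predicted classes (1, 2).
cm = np.array([[565, 0],
               [2, 3]])

true_positive = cm[1, 1]   # planets correctly detected
false_positive = cm[0, 1]  # stars wrongly flagged as having a planet
false_negative = cm[1, 0]  # planets that were missed

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1_score = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1-score  = {f1_score:.2f}")
```

A precision of `1.00` means every star flagged as class `2` really has a planet, while a recall of `0.60` means `3` of the `5` planets were found.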

This is not the best classification model, but it is a fairly good one. So, we don't have to further process the data. The three data processing activities, i.e., mean normalisation, Fourier Transformation and Oversampling are good enough for this problem statement wherein we hunt the exoplanets in space.



---

### **Project**

You can now attempt the **Applied Tech. Capstone Project 5 - Predicting A Pulsar Star** on your own.


**Applied Tech. Capstone Project 5 - Predicting A Pulsar Star:** https://colab.research.google.com/drive/11R6mh2c4cIscoD2NIcU69fiO6aWdgJJC

---