
# Lesson 18: Hunting Exoplanets In Space - Data Normalisation

### Teacher-Student Activities

In the last class, we evaluated the Random Forest Classifier model using a **confusion matrix**. We also calculated the precision, recall and f1-score values, which turned out to be undefined. Based on these three parameters, we concluded that our model needs a lot of improvement: instead of classifying the stars having a planet as `2`, it labelled every star as `1`.

In this lesson, we will process the data before deploying a prediction model so that it can learn the properties of the different stars through the training dataset.

Now, there is no single right approach to data processing. It is an iterative process that comes through experience and domain knowledge. (*The term 'domain' means a particular field of industry or academics*). For example, if you are a banker, then you would have knowledge of the finance field. Similarly, if you are an astrophysicist, then you would have technical knowledge of astronomy, quantum mechanics, optics etc. So, data should be processed based on the knowledge of the respective field before deploying a prediction model.

In this class, we will perform the following data processing exercises:

1. Data Normalisation

2. Fast Fourier Transformation

3. Oversampling

Finally, we will check whether the Random Forest Classifier model provides the expected results. If not, then we will deploy a much stronger prediction model called XGBoost Classifier. Generally, it doesn't require a lot of data pre-processing, but it requires heavy computation resources such as plenty of RAM, a GPU and a processor with at least 4 cores. Thanks to the Google Colab notebook, we have decent computation power to run the XGBoost Classifier algorithm.

Hence, the XGBoost Classifier model must be used only when the data processing is not helping a prediction model in making accurate predictions.

Let's run all the code cells that we have already covered in the previous classes and begin this class from the **Activity 1: Data Normalisation** section. You should also run the code cells up to the first activity.


---

#### Loading The Datasets

Create a Pandas DataFrame every time you start the Jupyter notebook.

Dataset links (don't click on them):

1. Train dataset

   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv

2. Test dataset

   https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv

In [None]:
# Load both the training and test datasets.
import numpy as np
import pandas as pd


exo_train_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv')
exo_test_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv')

In [None]:
# The shapes of the 'exo_train_df' and 'exo_test_df' DataFrames.
print(exo_train_df.shape)
exo_test_df.shape

(5087, 3198)


(570, 3198)

In the previous classes, we have already checked that the datasets don't have any missing values. So, we can skip that part.

---

#### Activity 1: Data Normalisation^^^

After creating a DataFrame and inspecting data for the missing values, we can normalise the data.

**What is data normalisation?**

Data normalisation is a process of standardising data. It brings every single data-point on a uniform scale. Let us try to understand this with the help of an example:



If you look at both the `exo_train_df` and `exo_test_df` DataFrames, they contain highly varying `FLUX` values.

You can get the data summary using the `describe()` function to look at the variation of the data.

**Syntax:** `dataframe_name.describe()`

In [None]:
# Teacher Action: Get the data description by calling the 'describe()' function.
exo_train_df.describe()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
count,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,...,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0,5087.0
mean,1.007273,144.5054,128.5778,147.1348,156.1512,156.1477,146.9646,116.838,114.4983,122.8639,...,348.5578,495.6476,671.1211,746.879,693.7372,655.3031,-494.784966,-544.594264,-440.2391,-300.536399
std,0.084982,21506.69,21797.17,21913.09,22233.66,23084.48,24105.67,24141.09,22906.91,21026.81,...,28647.86,35518.76,43499.63,49813.75,50871.03,53399.79,17844.46952,17722.339334,16273.406292,14459.795577
min,1.0,-227856.3,-315440.8,-284001.8,-234006.9,-423195.6,-597552.1,-672404.6,-579013.6,-397388.2,...,-324048.0,-304554.0,-293314.0,-283842.0,-328821.4,-502889.4,-775322.0,-732006.0,-700992.0,-643170.0
25%,1.0,-42.34,-39.52,-38.505,-35.05,-31.955,-33.38,-28.13,-27.84,-26.835,...,-17.6,-19.485,-17.57,-20.76,-22.26,-24.405,-26.76,-24.065,-21.135,-19.82
50%,1.0,-0.71,-0.89,-0.74,-0.4,-0.61,-1.03,-0.87,-0.66,-0.56,...,2.6,2.68,3.05,3.59,3.23,3.5,-0.68,0.36,0.9,1.43
75%,1.0,48.255,44.285,42.325,39.765,39.75,35.14,34.06,31.7,30.455,...,22.11,22.35,26.395,29.09,27.8,30.855,18.175,18.77,19.465,20.28
max,2.0,1439240.0,1453319.0,1468429.0,1495750.0,1510937.0,1508152.0,1465743.0,1416827.0,1342888.0,...,1779338.0,2379227.0,2992070.0,3434973.0,3481220.0,3616292.0,288607.5,215972.0,207590.0,211302.0


As you can see, the values in the `FLUX.1` column range between `-227,856.3` (minimum `FLUX.1` value) and `1,439,240` (maximum `FLUX.1` value).

**Note:** `-2.278563e+05` is equivalent to $-2.278563\times10^5$ and `1.439240e+06` is equivalent to $1.439240\times10^6$.

In the `FLUX.1` column, the difference in the maximum and minimum values, i.e.,

$$1,439,240 - (-227,856.3) = 1,667,096.3$$


is huge because of the $10^5$ and $10^6$ scales.
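The note on scientific notation and the range calculation can be checked with a few lines of Python. This is only a small sketch: the two numbers are copied by hand from the `describe()` output, and the variable names `flux1_min`, `flux1_max` and `flux1_range` are made up for illustration.

```python
# Minimum and maximum FLUX.1 values, written in scientific notation.
flux1_min = float("-2.278563e+05")
flux1_max = float("1.439240e+06")

# Scientific notation and plain decimal notation describe the same numbers.
print(flux1_min == -227856.3)
print(flux1_max == 1439240.0)

# Range of the column = maximum value - minimum value.
flux1_range = flux1_max - flux1_min
print(round(flux1_range, 1))
```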

Similarly, the data in all the other `FLUX` columns also vary a lot because they lie on a huge scale. Big figures are less readable. For example, `122` (one hundred twenty two) is more readable than `1,439,240` (one million, four hundred thirty nine thousand, two hundred forty).

<b><font color=red>The data normalisation process lowers the scale and brings all the data-points on the same scale.</font></b>

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C18/whj-orginal-normalised+data-apt-c18-01.png" height=400/>

**Why must data be normalised?**

Many machine learning models are quite sensitive to the scale of data. They tend to give more importance to the larger values while learning the properties of data. Hence, it becomes crucial for us to remove this bias by bringing all the data-points down to the same scale.
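As a rough illustration of this bias, the sketch below measures the distance between two made-up stars. The numbers, the names `star_a` and `star_b`, and the choice of a Euclidean distance (the kind of calculation used by distance-based models) are all assumptions for illustration; this is not the model used in this course.

```python
import numpy as np

# Two hypothetical stars, each described by [large-scale value, small-scale value].
star_a = np.array([1_000_000.0, 1.0])
star_b = np.array([1_000_500.0, 9.0])

# Euclidean distance between the two stars on the raw scale.
diff = star_b - star_a
raw_distance = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance contributed by each feature.
contribution = diff ** 2 / np.sum(diff ** 2)

print(raw_distance)
print(contribution)
```

Here the first feature accounts for more than 99% of the squared distance, even though it barely changed relative to its own size, while the second feature grew ninefold.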


**How to normalise data?**

There are various methods of data normalisation. We will cover all of them throughout this course whenever we need them. For the time being, we will use the *mean normalisation* method. Let's understand the *mean normalisation* method.

Consider a series of numbers having the values

$$x_1, x_2, x_3, ... , x_N$$

where $N$ is the total number of values in a series.

Let

- $x_{mean}$ denote the mean (or average) value of a series

- $x_{min}$ denote the minimum value in a series and

- $x_{max}$ denote the maximum value in a series

The normalised value in a series is calculated as

$$x_{norm} = \frac{x_p - x_{mean}}{x_{max} - x_{min}}$$

where

$$x_p = x_1, x_2, x_3, ..., x_N$$

So after normalisation, the new values in the series will be

$$\left(\frac{x_1 - x_{mean}}{x_{max} - x_{min}}\right), \left(\frac{x_2 - x_{mean}}{x_{max} - x_{min}}\right), \left(\frac{x_3 - x_{mean}}{x_{max} - x_{min}}\right), ..., \left(\frac{x_N - x_{mean}}{x_{max} - x_{min}}\right)$$


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C18/whj-xmin-xmax-apt-c18-01.png" height=400/>








For example, consider the series $[5, 192, 20019, 12, 209]$.

- The average value of the series is $x_{mean} = 4087.4$

- The minimum value in the series is $x_{min} = 5$

- The maximum value in the series is $x_{max} = 20019$

So, after normalisation, the new series would have the following numbers.

$$\left[ \left( \frac{5 - 4087.4}{20019 - 5} \right), \left( \frac{192 - 4087.4}{20019 - 5} \right), \left( \frac{20019 - 4087.4}{20019 - 5} \right), \left( \frac{12 - 4087.4}{20019 - 5} \right), \left( \frac{209 - 4087.4}{20019 - 5} \right) \right]$$

$$\Rightarrow \left[-0.203977, -0.194634, 0.796023, -0.203627, -0.193784 \right]$$

or

$$\left[-\frac{203,977}{1,000,000}, -\frac{194,634}{1,000,000}, \frac{796,023}{1,000,000}, -\frac{203,627}{1,000,000}, -\frac{193,784}{1,000,000} \right]$$

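As you can see, after normalisation all the new values are on the same scale: every value lies between $-1$ and $1$ because the difference between a value and the mean can never exceed $x_{max} - x_{min}$. Rounded to six decimal places, each value can be written as a multiple of $\frac{1}{1,000,000}$ or $10^{-6}$.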

So, now let's create a function which normalises data in a series. It should take a Pandas series as an input and should return a normalised series as an output.





In [None]:
exo_train_df['FLUX.1'].mean()

144.50544525260383

In [None]:
# Student Action: Create a function to normalise a Pandas series using the mean normalisation method.
def mean_normalize(series):
  # Subtract the mean from every value and divide by the range (max - min).
  normalized_series = (series - series.mean()) / (series.max() - series.min())
  return normalized_series

Now, let's test the `mean_normalize()` function on the $[5, 192, 20019, 12, 209]$ series. If we get the desired output, then it means the function is working correctly.

In [None]:
# Student Action: Test the 'mean_normalize()' function on the '[5, 192, 20019, 12, 209]' series.
list1 = [5, 192, 20019, 12, 209]
series1 = pd.Series(list1)
print(series1)

0        5
1      192
2    20019
3       12
4      209
dtype: int64


In [None]:
(5 + 192 + 20019 + 12 + 209)/ 5

4087.4

In [None]:
(5 - 4087.4) / (20019 - 5)

-0.20397721594883583

In [None]:
mean_normalize(series1)

0   -0.203977
1   -0.194634
2    0.796023
3   -0.203627
4   -0.193784
dtype: float64

Now, let's apply the `mean_normalize()` function on the `exo_train_df` DataFrame to normalise only the `FLUX` values.

Using the `iloc[]` indexer, we will first exclude the `LABEL` column from the DataFrame and then apply the `mean_normalize()` function on the remaining columns using the `apply()` function.

**The `apply()` function:**
- It is used to apply a function to each row or column in the DataFrame.

**Syntax:** `dataframe.apply(function_name, axis)`

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C18/whj-func+name-dataframe-apt-c18-01.png" height=400/>

**Note:** Whenever you apply a function, say `function_name()`, on a DataFrame using the `apply()` function, pass only the name of the function without the parentheses (i.e., `function_name`). That's why the syntax is `dataframe.apply(function_name, axis)`.

A DataFrame has two axes (axes is plural of axis).

- The first axis is the vertical axis. It is represented as `axis = 0`

- The second axis is the horizontal axis. It is represented as `axis = 1`

The DataFrame axes define whether an operation needs to be applied row-wise or column-wise. Refer to the image shown below.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-16/dataframe-axes.png' width=600>

1. If `axis = 0`, then it means the function needs to be applied **vertically**. In other words, the function will be applied on all the rows but only **one column at a time.** So, on the `exo_train_df` DataFrame, if the `mean_normalize()` function is applied **vertically**, then it will be applied in the following order:

    - `exo_train_df.iloc[:, 1]`, i.e., all the rows and the `FLUX.1` column at a time.
    
    - `exo_train_df.iloc[:, 2]`, i.e., all the rows and the `FLUX.2` column at a time.

    - `exo_train_df.iloc[:, 3]`, i.e., all the rows and the `FLUX.3` column at a time.

    ...

    - `exo_train_df.iloc[:, 3197]`, i.e., all the rows and the `FLUX.3197` column at a time.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C18/whj-xaxis-0-apt-c18.gif" height=400/>

2. If `axis = 1`, then it means the function needs to be applied **horizontally**. This means the function will be applied on all the columns but only **one row at a time.** So, on the `exo_train_df` DataFrame, if the `mean_normalize()` function is applied **horizontally**, then it will be applied in the following order:

    - `exo_train_df.iloc[0, :]`, i.e., the first row and all the columns at a time.
    
    - `exo_train_df.iloc[1, :]`, i.e., the second row and all the columns at a time.

    - `exo_train_df.iloc[2, :]`, i.e., the third row and all the columns at a time.

      ...

    - `exo_train_df.iloc[5086, :]`, i.e., the last row and all the columns at a time.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C18/whj-xaxis-1-apt-c18.gif"  height=400/>
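The difference between the two axes can be seen on a tiny made-up DataFrame. In this sketch, `toy_df` and the helper `value_range()` are hypothetical names used only for illustration; they are not part of the exoplanet dataset.

```python
import pandas as pd

# A small made-up DataFrame: 2 rows (stars) and 3 columns (flux readings).
toy_df = pd.DataFrame({"FLUX.1": [1, 10], "FLUX.2": [2, 20], "FLUX.3": [3, 30]})

# Hypothetical helper: difference between the largest and smallest value.
def value_range(series):
    return series.max() - series.min()

# axis = 0: the function receives one column at a time (all the rows).
col_ranges = toy_df.apply(value_range, axis=0)
print(col_ranges)

# axis = 1: the function receives one row at a time (all the columns).
row_ranges = toy_df.apply(value_range, axis=1)
print(row_ranges)
```

With `axis=0` the result has one entry per column, whereas with `axis=1` it has one entry per row.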

We will apply the `mean_normalize()` function **horizontally** on the `exo_train_df` DataFrame to normalise the `FLUX` values of one star at a time.


In [None]:
# Teacher Action: Apply the 'mean_normalize' function horizontally on the training DataFrame.
norm_train_df = exo_train_df.iloc[:, 1:].apply(mean_normalize, axis=1)
# After applying the 'mean_normalize' function on the 'exo_train_df' DataFrame, let's print the first 5 rows of the new DataFrame.
norm_train_df.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,0.053834,0.047391,0.00651,-0.023699,-0.031772,-0.08641,-0.093128,-0.068161,-0.05765,-0.109164,...,-0.056482,-0.071934,-0.071934,0.009738,0.024779,0.052993,0.018843,0.033024,-0.003127,-0.031759
1,-0.050411,-0.042317,-0.081922,-0.052351,-0.115212,-0.104794,-0.126816,-0.124861,-0.122681,-0.105708,...,0.006648,-0.039721,-0.039721,-0.027988,0.004116,0.013124,-0.006847,0.02226,0.03755,0.043849
2,0.243983,0.245509,0.235186,0.227365,0.208538,0.212981,0.212283,0.222467,0.199285,0.221536,...,-0.037161,0.002382,0.002382,-0.017715,-0.013523,-0.001456,-0.009299,-0.017259,-0.036384,-0.048782
3,0.518501,0.551177,0.480659,0.474051,0.504754,0.496863,0.511941,0.494687,0.496425,0.513506,...,0.016215,0.001435,0.001435,0.054324,0.038636,-0.012562,-0.006456,-0.019827,-0.019889,0.029163
4,-0.399904,-0.401872,-0.404199,-0.395473,-0.381734,-0.373293,-0.36007,-0.368986,-0.356861,-0.350022,...,-0.212262,-0.141752,-0.141752,-0.125499,-0.157156,-0.155246,-0.141038,-0.135528,-0.145458,-0.18159


In [None]:
exo_train_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


You can see that all the data-points are on the same scale after mean normalisation. Notice that, as intended, we didn't normalise the `LABEL` data.
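One way to confirm that row-wise mean normalisation behaves as expected is to check two properties on a small made-up DataFrame: every row should end up with a mean of (almost exactly) zero, and every value should lie between -1 and 1. The sketch below restates the formula in a helper called `mean_normalize_sketch()` so that it runs on its own; that name, `toy_df` and the flux values are assumptions for illustration.

```python
import pandas as pd

def mean_normalize_sketch(series):
    # Same formula as above: (x - mean) / (max - min).
    return (series - series.mean()) / (series.max() - series.min())

# Made-up flux values for two stars on very different scales.
toy_df = pd.DataFrame(
    [[5.0, 192.0, 20019.0, 12.0, 209.0],
     [0.5, 0.1, 0.9, 0.3, 0.7]],
    columns=["FLUX.1", "FLUX.2", "FLUX.3", "FLUX.4", "FLUX.5"],
)

# Normalise one row (star) at a time.
norm_toy_df = toy_df.apply(mean_normalize_sketch, axis=1)

# Mean of each normalised row; it should be zero up to rounding error.
row_means = norm_toy_df.mean(axis=1)
print(row_means.abs().max() < 1e-12)

# Largest absolute value in the normalised DataFrame; it should not exceed 1.
largest_value = norm_toy_df.abs().max().max()
print(largest_value <= 1.0)
```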

Now, let's insert the `LABEL` column to the `norm_train_df` DataFrame to get the full DataFrame with the normalised `FLUX` values.

We can obtain the `LABEL` column from the `exo_train_df` DataFrame using the `exo_train_df['LABEL']` expression.

To insert a column in a DataFrame, use the `insert()` function.

**Syntax:** `dataframe.insert(loc=column_index, column=column_name, value=some_pandas_series)`

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C18/whj-index-coloumn-apt-c18-01.png" />

It takes three inputs.

- The first input should be the desired column index of the new column after its insertion.

- The second input should be the desired column name.

- The third input should be the values of the new column.
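The three inputs can be tried out on a small made-up DataFrame first. In this sketch, `toy_df` and `labels` are hypothetical names and values chosen only for illustration.

```python
import pandas as pd

# Made-up DataFrame holding only flux columns.
toy_df = pd.DataFrame({"FLUX.1": [0.1, 0.2], "FLUX.2": [0.3, 0.4]})

# Made-up labels, one per row.
labels = pd.Series([2, 1])

# Insert the labels as the first column (column index 0).
toy_df.insert(loc=0, column="LABEL", value=labels)

print(toy_df.columns.tolist())
```

Note that `insert()` changes the DataFrame in place and returns `None`, so its result should not be assigned back to a variable. Running the same `insert()` call a second time raises a `ValueError` because the column already exists.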





In [None]:
# Teacher Action: Apply the 'insert()' function to add the 'LABEL' column to the 'norm_train_df' DataFrame.
norm_train_df.insert(loc = 0, column = 'LABEL', value = exo_train_df['LABEL'])
# After inserting the 'LABEL' column to the 'norm_train_df' DataFrame, print its first five rows.
norm_train_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,0.053834,0.047391,0.00651,-0.023699,-0.031772,-0.08641,-0.093128,-0.068161,-0.05765,...,-0.056482,-0.071934,-0.071934,0.009738,0.024779,0.052993,0.018843,0.033024,-0.003127,-0.031759
1,2,-0.050411,-0.042317,-0.081922,-0.052351,-0.115212,-0.104794,-0.126816,-0.124861,-0.122681,...,0.006648,-0.039721,-0.039721,-0.027988,0.004116,0.013124,-0.006847,0.02226,0.03755,0.043849
2,2,0.243983,0.245509,0.235186,0.227365,0.208538,0.212981,0.212283,0.222467,0.199285,...,-0.037161,0.002382,0.002382,-0.017715,-0.013523,-0.001456,-0.009299,-0.017259,-0.036384,-0.048782
3,2,0.518501,0.551177,0.480659,0.474051,0.504754,0.496863,0.511941,0.494687,0.496425,...,0.016215,0.001435,0.001435,0.054324,0.038636,-0.012562,-0.006456,-0.019827,-0.019889,0.029163
4,2,-0.399904,-0.401872,-0.404199,-0.395473,-0.381734,-0.373293,-0.36007,-0.368986,-0.356861,...,-0.212262,-0.141752,-0.141752,-0.125499,-0.157156,-0.155246,-0.141038,-0.135528,-0.145458,-0.18159


Now, it's your turn to normalise the `FLUX` values in the `exo_test_df` DataFrame using the `mean_normalize()` function. Make sure that you apply the function horizontally to normalise the `FLUX` values of one star at a time.

In [None]:
# Student Action: Apply the 'mean_normalize()' function on the testing DataFrame. Store the new DataFrame in the 'norm_test_df' variable.
norm_test_df = exo_test_df.iloc[:, 1:].apply(mean_normalize, axis=1)
# After applying the 'mean_normalize()' function on the 'exo_test_df' DataFrame, print the first 5 rows of the new DataFrame.
norm_test_df.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,0.273347,0.228221,0.196676,0.110003,0.10413,0.08869,0.040926,0.014337,0.013534,-0.052079,...,0.031635,0.042578,0.031451,-0.005393,0.028904,0.102708,0.071576,0.080408,0.616438,0.130742
1,0.394038,0.39148,0.39268,0.390974,0.388955,0.386673,0.38634,0.382364,0.381035,0.374634,...,-0.047311,-0.075404,-0.092643,-0.118456,-0.134109,-0.150638,-0.164944,-0.171944,-0.166961,-0.14879
2,0.64815,0.627582,0.591444,0.519002,0.466046,0.385214,0.340496,0.281192,0.162553,0.11926,...,0.018179,-0.034769,-0.032201,-0.041117,-0.057967,-0.128412,-0.067972,-0.119374,-0.023437,0.027941
3,-0.232813,-0.233212,-0.238944,-0.235869,-0.208281,-0.220224,-0.222214,-0.208586,-0.197319,-0.188186,...,0.056186,0.047254,0.047254,0.039873,0.021893,0.025227,0.025075,-0.017912,-0.059585,-0.04674
4,-0.006994,0.003426,0.006382,0.00761,0.003316,-0.000167,0.010016,-0.009471,0.008195,0.016842,...,-0.006247,-0.016795,-0.001531,0.001095,-0.004439,-0.027127,-0.025421,-0.016852,-0.020089,0.002564


In [None]:
exo_test_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,...,14.52,19.29,14.44,-1.62,13.33,45.5,31.93,35.78,269.43,57.72
1,2,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,...,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,...,17.82,-51.66,-48.29,-59.99,-82.1,-174.54,-95.23,-162.68,-36.79,30.63
3,2,-826.0,-827.31,-846.12,-836.03,-745.5,-784.69,-791.22,-746.5,-709.53,...,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,2,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.9,-45.2,-5.04,...,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84


Now, insert the `LABEL` column into the `norm_test_df` DataFrame with the corresponding `LABEL` values.

In [None]:
# Student Action: Apply the 'insert()' function to add the 'LABEL' column to the 'norm_test_df' DataFrame.
norm_test_df.insert(loc = 0, column = 'LABEL', value = exo_test_df['LABEL'])
# After inserting the 'LABEL' column to the 'norm_test_df' DataFrame, print its first five rows.
norm_test_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,0.273347,0.228221,0.196676,0.110003,0.10413,0.08869,0.040926,0.014337,0.013534,...,0.031635,0.042578,0.031451,-0.005393,0.028904,0.102708,0.071576,0.080408,0.616438,0.130742
1,2,0.394038,0.39148,0.39268,0.390974,0.388955,0.386673,0.38634,0.382364,0.381035,...,-0.047311,-0.075404,-0.092643,-0.118456,-0.134109,-0.150638,-0.164944,-0.171944,-0.166961,-0.14879
2,2,0.64815,0.627582,0.591444,0.519002,0.466046,0.385214,0.340496,0.281192,0.162553,...,0.018179,-0.034769,-0.032201,-0.041117,-0.057967,-0.128412,-0.067972,-0.119374,-0.023437,0.027941
3,2,-0.232813,-0.233212,-0.238944,-0.235869,-0.208281,-0.220224,-0.222214,-0.208586,-0.197319,...,0.056186,0.047254,0.047254,0.039873,0.021893,0.025227,0.025075,-0.017912,-0.059585,-0.04674
4,2,-0.006994,0.003426,0.006382,0.00761,0.003316,-0.000167,0.010016,-0.009471,0.008195,...,-0.006247,-0.016795,-0.001531,0.001095,-0.004439,-0.027127,-0.025421,-0.016852,-0.020089,0.002564


---

#### Activity 2: Transpose Of A DataFrame^

After the data normalisation, we need to apply Fast Fourier Transformation (which we will learn in the next class). To apply Fast Fourier Transformation, we need to follow three steps.

1. Interchange rows and columns with each other so that columns become rows and rows become columns.

2. Apply the `fft()` function on the DataFrame to apply Fast Fourier Transformation. This we will study in the next class.

3. Interchange rows and columns again.



So, now we will first learn how to interchange the rows and columns. The process of interchanging rows and columns is called **transpose**. It's a very simple process.

**For Example:**

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C18/whj-block-transpose-apt-c18-01.png" height=400/>

To transpose a DataFrame, you can use the `T` attribute.

**Syntax:** `dataframe.T`
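Before transposing the full dataset, the effect can be seen on a tiny made-up DataFrame. In this sketch, `toy_df` and `transposed_df` are hypothetical names; the three rows simply reuse a few values for illustration.

```python
import pandas as pd

# A small made-up DataFrame with 3 rows (stars) and 2 columns.
toy_df = pd.DataFrame({"LABEL": [2, 1, 1], "FLUX.1": [93.85, -38.88, 532.64]})
print(toy_df.shape)

# Interchange the rows and columns.
transposed_df = toy_df.T
print(transposed_df.shape)

# The old column names are now the row labels.
print(transposed_df.index.tolist())
```

After the transpose, each column mixes integer labels with decimal flux values, so pandas stores the whole column as decimals. This is a likely reason why the `LABEL` row of a transposed DataFrame is displayed as `2.00` and `1.00` rather than `2` and `1`.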

In [None]:
# Student Action: Transpose the 'exo_train_df' using the 'T' attribute.
exo_train_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5077,5078,5079,5080,5081,5082,5083,5084,5085,5086
LABEL,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.00,...,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00
FLUX.1,93.85,-38.88,532.64,326.52,-1107.21,211.10,9.34,238.77,-103.54,-265.91,...,125.57,7.45,475.61,-46.63,299.41,-91.91,989.75,273.39,3.82,323.28
FLUX.2,83.81,-33.83,535.92,347.39,-1112.59,163.57,49.96,262.16,-118.97,-318.59,...,78.69,10.02,395.50,-55.39,302.77,-92.97,891.01,278.00,2.09,306.36
FLUX.3,20.10,-58.54,513.73,302.35,-1118.95,179.16,33.30,277.80,-108.93,-335.66,...,98.29,6.87,423.61,-64.88,278.68,-78.76,908.53,261.73,-3.29,293.16
FLUX.4,-26.98,-40.09,496.92,298.13,-1095.10,187.82,9.63,190.16,-72.25,-450.47,...,91.16,-2.82,376.36,-88.75,263.48,-97.33,851.83,236.99,-2.88,287.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
FLUX.3193,92.54,0.76,5.06,-12.67,-438.54,19.27,-0.44,95.30,4.53,3162.53,...,210.09,8.75,163.02,28.82,-74.95,151.75,-136.16,-3.47,-1.50,-25.33
FLUX.3194,39.32,-11.70,-11.80,-8.77,-399.71,-43.90,10.90,48.86,21.95,3398.28,...,3.80,-10.69,86.29,-20.12,-46.29,-24.45,38.03,65.73,-4.65,-41.31
FLUX.3195,61.42,6.46,-28.91,-17.31,-384.65,-41.63,-11.77,-10.62,26.94,3648.34,...,16.33,-9.54,13.06,-14.41,-3.08,-17.00,100.28,88.42,-14.55,-16.72
FLUX.3196,5.08,16.00,-70.02,-17.35,-411.79,-52.90,-9.25,-112.02,34.08,3671.97,...,27.35,-2.48,161.22,-43.35,-28.43,3.23,-45.64,79.07,-6.41,-14.09


In [None]:
result = pd.concat([exo_train_df,exo_test_df])
print(result.shape)

(5657, 3198)


In [None]:
exo_train_df.shape

(5087, 3198)

In [None]:
exo_test_df.shape

(570, 3198)

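As you can see in the output of `exo_train_df.T`, the rows have become columns and the columns have become rows: each `FLUX` column is now a row and each star is now a column.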

In the next class, we will apply Fast Fourier Transformation on the DataFrame.

---

### **Project**

You can now attempt the **Applied Tech.  Project 18 - Data Normalisation** on your own.


**Applied Tech.  Project 18 - Data Normalisation:** https://colab.research.google.com/drive/1n8xEPUsQUXNyXkrYn7WJKa9qFagtUWTW?usp=sharing

---