# **Feature Engineering**

Feature scaling is a technique for bringing the independent features in a dataset onto a comparable scale. It is performed during data pre-processing to handle features whose values vary widely in magnitude. Without feature scaling, many machine learning algorithms treat numerically larger values as more important and smaller values as less important, regardless of the units involved. For example, an algorithm sees 10 m and 10 cm as the same value, 10, even though the two lengths differ by a factor of 100. This article covers several techniques for performing feature scaling.
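To make the effect of scale concrete, here is a minimal sketch with two hypothetical houses described by lot area and number of rooms. The feature values and the divisors in `scale` are assumptions chosen only for illustration; they do not come from the dataset used later.

```python
import numpy as np

# Two hypothetical houses: [lot area in square feet, number of rooms]
a = np.array([8450.0, 3.0])
b = np.array([8550.0, 9.0])

# Unscaled Euclidean distance: the lot-area difference (100) swamps
# the room difference (6), although the room counts differ far more
# in relative terms.
raw_distance = np.linalg.norm(a - b)

# Divide each feature by an assumed typical range so both features
# live on a comparable scale before measuring distance.
scale = np.array([10_000.0, 10.0])
scaled_distance = np.linalg.norm(a / scale - b / scale)

print(raw_distance)     # about 100.18, almost entirely lot area
print(scaled_distance)  # about 0.60, almost entirely room count
```

Distance-based methods such as k-nearest neighbours would behave very differently on the raw and the scaled versions of these two points.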

## 1. Absolute Maximum Scaling

This method takes two steps:

1. Find the maximum absolute value among all entries of a given column.
2. Divide each entry of the column by this maximum value.


$$
X_{\text{scaled}} = \frac{X_i}{\max(|X|)}
$$

After these two steps, every entry of the column lies in the range -1 to 1. This method is not used often, however, because it is very sensitive to outliers, and outliers are common in real-world data.
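The outlier sensitivity can be seen in a small sketch. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical column of values, chosen only for illustration
values = np.array([10.0, -20.0, 30.0, 40.0])

# Divide every entry by the largest absolute value (40 here)
scaled = values / np.max(np.abs(values))
print(scaled)  # values: 0.25, -0.5, 0.75, 1.0

# Append a single outlier and rescale: the divisor becomes 4000
with_outlier = np.append(values, 4000.0)
scaled_with_outlier = with_outlier / np.max(np.abs(with_outlier))
print(scaled_with_outlier)  # values: 0.0025, -0.005, 0.0075, 0.01, 1.0
```

One extreme entry squeezes all the other values into a narrow band around zero, which is the behaviour described above.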

For demonstration we will use the dataset `SampleFile.csv` in the `../data/` directory. It is a simplified version of the original house price prediction dataset that keeps only two of its columns. The first ten rows of the data are shown below:

In [6]:
import pandas as pd
df = pd.read_csv('../data/SampleFile.csv')
df.head(10)

| | LotArea | MSSubClass |
|---|---|---|
| 0 | 8450 | 60 |
| 1 | 9600 | 20 |
| 2 | 11250 | 60 |
| 3 | 9550 | 70 |
| 4 | 14260 | 60 |
| 5 | 14115 | 50 |
| 6 | 10084 | 20 |
| 7 | 10382 | 60 |
| 8 | 6120 | 50 |
| 9 | 7420 | 190 |


In [7]:
len(df)

1460

Now let's apply the first method, absolute maximum scaling. First, we find the maximum absolute value of each column.

In [8]:
max_vals = df.abs().max()  # largest absolute value of each column
max_vals

LotArea       215245
MSSubClass       190
dtype: int64

In [9]:
print(df / max_vals)  # divide each column by its own maximum

       LotArea  MSSubClass
0     0.039258    0.315789
1     0.044600    0.105263
2     0.052266    0.315789
3     0.044368    0.368421
4     0.066250    0.315789
...        ...         ...
1455  0.036781    0.315789
1456  0.061209    0.105263
1457  0.042008    0.368421
1458  0.045144    0.105263
1459  0.046166    0.105263

[1460 rows x 2 columns]
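scikit-learn also provides `MaxAbsScaler`, which divides each column by its own maximum absolute value. The sketch below applies it to a few rows copied from the table above; the small `sample` frame is an assumption that stands in for the full CSV file, so the numbers will differ from those for the complete dataset.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# A few rows from the table above, standing in for the full dataset
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSSubClass": [60, 20, 60],
})

# MaxAbsScaler learns max(|x|) per column and divides by it
scaler = MaxAbsScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)

# The same computation written out by hand for comparison
manual = sample / sample.abs().max()

print(scaled)
print(manual)
```

Both frames should agree, with the largest entry of each column mapped to 1.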


## 2. Min-Max Scaling

This method takes two steps:

1. Find the minimum and maximum values of the column.
2. Subtract the minimum value from each entry and divide the result by the difference between the maximum and the minimum value.

$$
X_{\text{scaled}} = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}
$$

Because it relies on the maximum and minimum values, this method is also sensitive to outliers. After the two steps above, however, every value lies in the range 0 to 1.

In [10]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, 
                         columns=df.columns)
scaled_df.head()

| | LotArea | MSSubClass |
|---|---|---|
| 0 | 0.03342 | 0.235294 |
| 1 | 0.038795 | 0.0 |
| 2 | 0.046507 | 0.235294 |
| 3 | 0.038561 | 0.294118 |
| 4 | 0.060576 | 0.235294 |
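The min-max formula can also be written out directly in pandas, which makes the two steps explicit. This is a minimal sketch on the first five rows shown earlier; the `sample` frame is an assumption standing in for the full file, so its minimum and maximum (and therefore the scaled values) differ from the full-dataset output above.

```python
import pandas as pd

# First five rows of the data, standing in for the full dataset
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Step 1: minimum and maximum of each column
col_min = sample.min()
col_max = sample.max()

# Step 2: subtract the minimum, divide by the range
scaled = (sample - col_min) / (col_max - col_min)

print(scaled)
```

In each column the smallest entry maps to 0 and the largest to 1, with everything else in between.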


## 3. Normalization

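This method is similar to min-max scaling, but instead of the minimum value we subtract the mean of the column from each entry, and then divide the result by the difference between the maximum and the minimum value. This variant is often called mean normalization.

$$
X_{\text{scaled}} = \frac{X_i - X_{\text{mean}}}{X_{\max} - X_{\min}}
$$

Note that scikit-learn's `Normalizer`, used in the cell below, does not compute this formula. It rescales each row (sample) independently to unit norm (L2 by default), which is why both values in every output row are positive and the `LotArea` entries are close to 1.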

In [11]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0  0.999975    0.007100
1  0.999998    0.002083
2  0.999986    0.005333
3  0.999973    0.007330
4  0.999991    0.004208
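Since `Normalizer` works row by row, the mean-based formula in this section has to be computed another way. The sketch below writes it out in pandas and, for comparison, also reproduces the row-wise unit-norm scaling by hand. The `sample` frame holds the first five rows shown earlier and is an assumption standing in for the full file.

```python
import pandas as pd

# First five rows of the data, standing in for the full dataset
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Mean normalization: subtract the column mean, divide by the column range
mean_normalized = (sample - sample.mean()) / (sample.max() - sample.min())
print(mean_normalized)

# Row-wise L2 scaling: divide each row by its own Euclidean length
row_norms = (sample ** 2).sum(axis=1) ** 0.5
row_unit = sample.div(row_norms, axis=0)
print(row_unit)
```

The first row of `row_unit` comes out at roughly 0.999975 and 0.007100, matching the first row of the `Normalizer` output above, because row-wise scaling depends only on the values within that row.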


## 4. Standardization

This method of scaling is based on the mean and the standard deviation of the data.

1. Calculate the mean and the standard deviation of the column we would like to scale.
2. Subtract the mean from each entry and divide the result by the standard deviation.

The result has a mean of zero and a standard deviation of 1. Standardization does not change the shape of the distribution, so the scaled data is normally distributed only if the original data was.

$$
X_{\text{scaled}} = \frac{X_i - X_{\text{mean}}}{\sigma}
$$

In [12]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0 -0.207142    0.073375
1 -0.091886   -0.872563
2  0.073480    0.073375
3 -0.096897    0.309859
4  0.375148    0.073375
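The standardization formula can be checked by hand in pandas. One detail to watch: `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `std()` defaults to the sample standard deviation (`ddof=1`). The sketch below uses the first five rows shown earlier; the `sample` frame is an assumption standing in for the full file, so the values differ from the full-dataset output above.

```python
import pandas as pd

# First five rows of the data, standing in for the full dataset
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Subtract the column mean and divide by the population standard
# deviation (ddof=0), the convention StandardScaler follows
standardized = (sample - sample.mean()) / sample.std(ddof=0)

print(standardized)
print(standardized.mean())       # each column close to 0
print(standardized.std(ddof=0))  # each column close to 1
```

With `ddof=1` instead, the scaled columns would have a population standard deviation slightly below 1, a difference that shrinks as the number of rows grows.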
