## Feature Scaling

The attributes have very different scales, which makes it difficult to compare them directly, for example to see which of them has the lowest variance. Let's scale them.

Two common approaches to bringing different features onto the same scale are normalization and standardization. In scikit-learn, normalization is implemented by `MinMaxScaler` and standardization by `StandardScaler`.

Normalization usually refers to rescaling each feature to the range [0, 1], which is a special case of min-max scaling. **Normalization is useful when the values are needed within a bounded interval.**
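
As a minimal sketch of the min-max formula x' = (x - min) / (max - min), the snippet below rescales a small made-up column by hand and compares the result with `MinMaxScaler`. The array `x` is a hypothetical toy example, not taken from the bean dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature column (one feature, four samples)
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# Min-max formula applied per column: x' = (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# MinMaxScaler uses the same formula with its default feature_range=(0, 1)
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())               # the values 0, 0.25, 0.5 and 1
print(np.allclose(manual, scaled))  # both approaches agree
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.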

Standardization centers each feature column at mean 0 and rescales it to a standard deviation of 1, so that the columns have the same mean and variance as a standard normal distribution (the shape of each distribution is unchanged). Unlike normalization, **standardization does not squeeze the values into a fixed range, so it keeps useful information about outliers, and it is generally less sensitive to them than min-max scaling, whose range is set entirely by the minimum and maximum values**.
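
The outlier claim can be illustrated with a small experiment. The sketch below uses synthetic data (10,000 standard-normal values plus one artificial outlier at 20; these numbers are assumptions chosen for illustration only) and compares how much room the regular values keep after each kind of scaling. It also checks the standardization formula z = (x - mean) / std by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data: many "regular" values plus a single artificial outlier
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=10_000)
x = np.append(inliers, 20.0).reshape(-1, 1)

# Standardization by hand (np.std uses the population formula, ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

z = StandardScaler().fit_transform(x)  # same result as z_manual
m = MinMaxScaler().fit_transform(x)    # outlier is mapped to exactly 1.0

# Look only at the regular values (everything except the last row)
z_in, m_in = z[:-1, 0], m[:-1, 0]
print("standardized inlier std:", z_in.std())
print("min-max inlier span:", m_in.max() - m_in.min())
```

In this setup the regular values keep a spread close to 1 after standardization, while min-max scaling confines them to well under half of the [0, 1] interval because the single outlier defines the maximum. A much more extreme outlier would distort both scalers, since the mean and standard deviation are also affected by extreme values.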

### When to use MinMaxScaler or StandardScaler? 
MinMaxScaler is useful when the data has a bounded range or when the distribution is not Gaussian. For example, in image processing, pixel values typically lie in the range 0-255. Scaling these values with MinMaxScaler ensures that they fall within a fixed range and contribute equally to the analysis. Similarly, when dealing with non-Gaussian distributions such as a power-law distribution, MinMaxScaler can be used to map the values to the range 0 to 1.

StandardScaler is useful when the data is approximately Gaussian or when the algorithm benefits from standardized features. For example, in regularized linear regression (such as ridge or lasso), the penalty depends on the scale of each feature, so the features are usually standardized to ensure that they contribute equally. Similarly, distance-based clustering algorithms such as KMeans are dominated by the features with the largest values, so StandardScaler can be used to ensure that all features contribute equally to the analysis.
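
To make "contribute equally" concrete, here is a small sketch of how scale affects Euclidean distance, which is what KMeans relies on. The three rows are hypothetical values loosely styled after an area-like feature and a ratio-like feature; they are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: column 0 is in the tens of thousands, column 1 lies in [0, 1]
X = np.array([
    [30000.0, 0.98],
    [30500.0, 0.60],  # close to row 0 in column 0, far in column 1
    [60000.0, 0.97],  # far from row 0 in column 0, close in column 1
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On the raw data the large-scale column decides everything
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# After standardization both columns have comparable influence
Xs = StandardScaler().fit_transform(X)
std_01, std_02 = dist(Xs[0], Xs[1]), dist(Xs[0], Xs[2])

print("raw distances:         ", raw_01, raw_02)
print("standardized distances:", std_01, std_02)
```

On the raw values, row 1 looks far closer to row 0 than row 2 does, purely because of the first column. After standardization the two distances are of similar size, so the difference in the second column is no longer ignored.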

In [None]:
# LIBRARIES

import numpy as np
import pandas as pd

from sklearn.feature_selection import mutual_info_classif

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

%matplotlib inline

In [None]:
df = pd.read_excel("DryBeanDataset/Dry_Bean_Dataset.xlsx")

In [None]:
df.describe()

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
corr = sns.heatmap(df.corr(numeric_only=True).round(3), annot=True, linewidths=.5, ax=ax);
fig.tight_layout()

It seems that the attributes are all real-valued and positive.

In [None]:
df.hist(figsize=(20,15), bins=50);

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Work on a copy so the original DataFrame stays unscaled
df_minmax = df.copy()

mmscaler = MinMaxScaler()
# Numeric feature columns to scale ('AspectRation' is spelled as in the dataset)
features = ['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter',
            'Extent', 'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2', 'ShapeFactor3', 'ShapeFactor4']

# Rescale each feature column to the range [0, 1]
df_minmax[features] = mmscaler.fit_transform(df_minmax[features])

In [None]:
df_minmax.describe()

In [None]:
df_minmax.hist(figsize=(20,15), bins=50);

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
corr = sns.heatmap(df_minmax.corr(numeric_only=True).round(3), annot=True, linewidths=.5, ax=ax);
fig.tight_layout()

In [None]:
from sklearn.preprocessing import StandardScaler

# Work on a copy so the original DataFrame stays unscaled
df_standard = df.copy()

sc = StandardScaler()
# Same numeric feature columns as in the MinMaxScaler cell
features = ['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter',
            'Extent', 'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2', 'ShapeFactor3', 'ShapeFactor4']

# Center each feature column at mean 0 and scale it to unit standard deviation
df_standard[features] = sc.fit_transform(df_standard[features])

In [None]:
df_standard.describe()
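
When reading the `describe()` output, the `std` row may not be exactly 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`) by default. The sketch below shows the effect on a tiny hypothetical column, where the gap is easy to see; with thousands of rows the difference becomes negligible.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with only four rows, to exaggerate the effect
toy = pd.DataFrame({"a": [2.0, 4.0, 6.0, 10.0]})
toy[["a"]] = StandardScaler().fit_transform(toy[["a"]])

n = len(toy)
pop_std = toy["a"].std(ddof=0)  # population std, the one StandardScaler normalizes to 1
sample_std = toy["a"].std()     # pandas default (ddof=1), also used by describe()

print(pop_std)                           # 1 up to floating-point error
print(sample_std, np.sqrt(n / (n - 1)))  # both equal sqrt(n / (n - 1))
```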

In [None]:
df_standard.hist(figsize=(20,15), bins=50);

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
corr = sns.heatmap(df_standard.corr(numeric_only=True).round(3), annot=True, linewidths=.5, ax=ax);
fig.tight_layout()
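
Note that the correlation heatmaps look the same before and after scaling. Pearson correlation is unchanged when a column is shifted and multiplied by a positive constant, which is all that both scalers do. A minimal check on synthetic data (the column names `a`, `b`, `c` and the values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic columns with very different scales
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base * 1000 + 5000,                # large scale
    "b": 0.5 * base + rng.normal(size=200), # correlated with "a"
    "c": rng.uniform(size=200),             # unrelated, already in [0, 1]
})

corr_raw = toy.corr()
corr_mm = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns).corr()
corr_std = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns).corr()

# Both comparisons should report that the matrices match
print(np.allclose(corr_raw, corr_mm), np.allclose(corr_raw, corr_std))
```

Scaling therefore changes the histograms' axes and the `describe()` statistics, but not the linear relationships between features.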