# Bringing features onto the same scale

Credits
- Sebastian Raschka: (http://sebastianraschka.com/Articles/2014_about_feature_scaling.html)
- Jason Brownlee: (http://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/)

# Standardization
The result of **standardization** (or **Z-score normalization**) is that each feature is rescaled to have the mean and standard deviation of a standard normal distribution (the shape of the distribution itself is not changed), with

$\mu = 0$ and $\sigma = 1$

where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean; standard scores (also called ***z*** scores) of the samples are calculated as follows:

\begin{equation} z = \frac{x - \mu}{\sigma}\end{equation} 

Standardizing the features so that they are centered around 0 with a standard deviation of 1 is important when we compare measurements that have different units. It is also a common requirement for many machine learning algorithms, for example those that rely on distance measures or on gradient descent.
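
As a minimal sketch of the formula above, the z-scores can be computed by hand with NumPy and compared with `StandardScaler`. The array `toy` is made-up illustration data, not part of the diabetes dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

# z = (x - mu) / sigma, computed column by column
mu = toy.mean(axis=0)
sigma = toy.std(axis=0)  # population standard deviation (ddof=0)
z_manual = (toy - mu) / sigma

# StandardScaler also divides by the population standard deviation
z_sklearn = StandardScaler().fit_transform(toy)

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```

After the transformation, both columns have mean 0 and standard deviation 1, even though the raw values differ by two orders of magnitude.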


In [None]:
from sklearn.preprocessing import StandardScaler
import pandas
import numpy

In [None]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# the file has no header row, so pass the column names explicitly
dataframe = pandas.read_csv('./Datasets/pima-indians-diabetes.txt', header=None, names=names)
array = dataframe.values

In [None]:
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

In [None]:
print(X[0:5,:])

In [None]:
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

In [None]:
# summarize transformed data
print(rescaledX[0:5,:])

# Min-Max scaling

Min-max scaling is often simply called "normalization", which is a common cause of ambiguity.  
In this approach, the data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded range, in contrast to standardization, is that we will end up with smaller standard deviations, which can suppress the effect of outliers.

A Min-Max scaling is typically done via the following equation:

\begin{equation} X_{norm} = \frac{X - X_{min}}{X_{max}-X_{min}} \end{equation}
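
The equation can be sketched directly in NumPy and checked against `MinMaxScaler`. The array `toy` below is hypothetical illustration data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [5.0, 500.0]])

# X_norm = (X - X_min) / (X_max - X_min), computed column by column
x_min = toy.min(axis=0)
x_max = toy.max(axis=0)
x_norm = (toy - x_min) / (x_max - x_min)

# MinMaxScaler with its default feature_range=(0, 1) applies the same formula
x_sklearn = MinMaxScaler().fit_transform(toy)

print(x_norm)
print(np.allclose(x_norm, x_sklearn))
```

Each column now runs from exactly 0 (its minimum) to exactly 1 (its maximum).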

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

In [None]:
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

# Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales. It applies both to algorithms that weight input values, such as neural networks, and to algorithms that use distance measures, such as K-Nearest Neighbors.

In [None]:
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer

In [None]:
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

In [None]:
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

Each row is now normalized to length 1.
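
A small sketch on made-up data (the array `toy` is hypothetical) shows how to check this: after `Normalizer`, which uses the L2 norm by default, every row has Euclidean length 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: rows 0 and 2 point in the same direction
toy = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [6.0, 8.0]])

# Each row is divided by its own Euclidean (L2) length
unit_rows = Normalizer().fit_transform(toy)

# Length of every row after normalization
row_lengths = np.linalg.norm(unit_rows, axis=1)

print(unit_rows)
print(row_lengths)
```

Rows `[3, 4]` and `[6, 8]` map to the same unit vector, because normalizing a row keeps its direction and discards its magnitude.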

# Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.
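
As a minimal sketch of the probability case, the example below thresholds a made-up array of probabilities (`probs` is hypothetical) at 0.5. Values exactly equal to the threshold become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities
probs = np.array([[0.1, 0.5, 0.9],
                  [0.7, 0.3, 0.5]])

# Values strictly above 0.5 become 1, everything else becomes 0
crisp = Binarizer(threshold=0.5).fit_transform(probs)

print(crisp)
```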

In [None]:
# binarization
from sklearn.preprocessing import Binarizer

In [None]:
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

In [None]:
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

# Let's visualize the effect


In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('./Datasets/wine.data', header=None,usecols=[0,1,2])

df.columns=['Class label', 'Alcohol', 'Malic acid']

df.head()

### Apply Standardization

In [None]:
from sklearn import preprocessing

std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])

In [None]:
print('Mean after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_std[:,0].mean(), df_std[:,1].mean()))
print('\nStandard deviation after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_std[:,0].std(), df_std[:,1].std()))

### MinMaxScaler

In [None]:
minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])

In [None]:
print('Min-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_minmax[:,0].min(), df_minmax[:,1].min()))
print('\nMax-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_minmax[:,0].max(), df_minmax[:,1].max()))

#### Plotting

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

def plot():
    plt.figure(figsize=(8,6))

    plt.scatter(df['Alcohol'], df['Malic acid'], 
            color='green', label='input scale', alpha=0.5)

    # raw string so the LaTeX backslashes are not treated as escape sequences
    plt.scatter(df_std[:,0], df_std[:,1], color='red', 
            label=r'Standardized [$N  (\mu=0, \; \sigma=1)$]', alpha=0.3)

    plt.scatter(df_minmax[:,0], df_minmax[:,1], 
            color='blue', label='min-max scaled [min=0, max=1]', alpha=0.3)

    plt.title('Alcohol and Malic Acid content of the wine dataset')
    plt.xlabel('Alcohol')
    plt.ylabel('Malic Acid')
    plt.legend(loc='upper left')
    plt.grid()
    
    plt.tight_layout()

plot()
plt.show()

# Standardization or Normalization?


<img src="Images/Stay_Tuned.jpg" width="50%">