Scaling refers to putting the values of different features into the same range, or onto the same scale, so that no variable dominates the others. Most datasets contain features that vary widely in magnitude, unit, and range. If the features are left unscaled, many machine learning algorithms consider only their magnitudes and ignore their units, so features with large values can outweigh the rest.

Feature scaling should not be confused with scalability, which is about handling huge amounts of data and performing many computations in a cost-effective and time-saving way. Scalability improves productivity and modularity, minimizes human involvement, and reduces cost.

There are two types of scaling that you may want to consider for your data: normalization and standardization. Both can be achieved using the scikit-learn library.

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.
A value is normalized as follows:
        y = (x - min) / (max - min)
Here, min and max are the minimum and maximum values of the feature (column) that x belongs to.
The most commonly used object for normalizing a dataset with scikit-learn is MinMaxScaler.
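As a minimal sketch of the formula above, the following compares a hand-written min-max normalization with MinMaxScaler. The five salary values are copied from the first rows of the dataset used later in this notebook; the variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Five salary values from the dataset, shaped as a single column.
salary = np.array([[10000.0], [450000.0], [78000.0], [20000.0], [37000.0]])

# Manual min-max normalization: y = (x - min) / (max - min), per column.
col_min = salary.min(axis=0)
col_max = salary.max(axis=0)
manual = (salary - col_min) / (col_max - col_min)

# The same rescaling done by scikit-learn.
scaled = MinMaxScaler().fit_transform(salary)

# Both approaches should agree up to floating-point rounding.
print(np.allclose(manual, scaled))
```

The smallest salary maps to 0 and the largest to 1; every other value falls in between in proportion to its distance from the minimum.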

Standardizing a dataset involves rescaling the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. It is sometimes loosely referred to as "whitening", although whitening in the strict sense also removes correlations between features. Standardization can be thought of as subtracting the mean value (centering the data) and then dividing by the standard deviation. It assumes that your observations fit a Gaussian distribution (bell curve) with a well-behaved mean and standard deviation. You can still standardize your data if this expectation is not met, but you may not get reliable results.
A value is standardized as follows:
        y = (x - mean) / standard_deviation
        mean = sum(x) / count(x)
        standard_deviation = sqrt( sum( (x - mean)^2 ) / count(x) )

The most commonly used object for standardizing a dataset with scikit-learn is StandardScaler.
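The standardization formula can be checked in the same way. This sketch reuses five salary values from the dataset shown later and compares a manual calculation with StandardScaler. Note that the formula above divides by count(x), which corresponds to NumPy's default `ddof=0`; StandardScaler uses the same convention.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five salary values from the dataset, shaped as a single column.
salary = np.array([[10000.0], [450000.0], [78000.0], [20000.0], [37000.0]])

# Manual standardization: y = (x - mean) / standard_deviation.
mean = salary.mean(axis=0)
std = salary.std(axis=0)  # ddof=0, i.e. divide by count(x)
manual = (salary - mean) / std

# The same rescaling done by scikit-learn.
scaled = StandardScaler().fit_transform(salary)

print(np.allclose(manual, scaled))
# After standardization the column has mean 0 and standard deviation 1.
print(np.isclose(scaled.mean(), 0.0), np.isclose(scaled.std(), 1.0))
```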

Let us look at both scaling techniques.

In [1]:
# Import NumPy under the conventional alias 'np'
import numpy as np

In [2]:
# pyplot is a module in matplotlib
import matplotlib.pyplot as plt

In [3]:
import pandas as pd

In [21]:
#Import the MinMax scaler from sklearn
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Import StandardScaler from sklearn
from sklearn.preprocessing import StandardScaler

In [5]:
# Load the CSV file into a DataFrame named 'dataset'
dataset = pd.read_csv('Scaling_Data.csv')

In [6]:
dataset

Unnamed: 0,User,Age,Gender,Salary,Purchase
0,151234,23,M,10000,N
1,151235,43,F,450000,Y
2,151236,54,M,78000,Y
3,151237,22,M,20000,N
4,151238,19,M,37000,N
5,151239,16,M,49000,Y
6,151240,64,F,21000,N
7,151241,83,F,80000,N
8,151242,25,M,35000,N
9,151243,64,M,90000,Y


In [7]:
dataset.head()


Unnamed: 0,User,Age,Gender,Salary,Purchase
0,151234,23,M,10000,N
1,151235,43,F,450000,Y
2,151236,54,M,78000,Y
3,151237,22,M,20000,N
4,151238,19,M,37000,N


In [19]:
# Split the dataset into the features (X) and the dependent variable (y)
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, -1:].values

In [9]:
print(X)


[[151234 23 'M' 10000]
 [151235 43 'F' 450000]
 [151236 54 'M' 78000]
 [151237 22 'M' 20000]
 [151238 19 'M' 37000]
 [151239 16 'M' 49000]
 [151240 64 'F' 21000]
 [151241 83 'F' 80000]
 [151242 25 'M' 35000]
 [151243 64 'M' 90000]
 [151244 33 'M' 45000]
 [151245 46 'F' 55000]
 [151246 34 'F' 65000]
 [151247 41 'M' 20000]
 [151248 39 'F' 78000]
 [151249 54 'M' 100000]
 [151250 23 'F' 9000]
 [151251 57 'M' 80000]
 [151252 85 'F' 90000]
 [151253 24 'F' 40000]
 [151254 55 'M' 120000]
 [151255 66 'M' 150000]
 [151256 38 'M' 50000]]


In [10]:
print(y)

[['N']
 ['Y']
 ['Y']
 ['N']
 ['N']
 ['Y']
 ['N']
 ['N']
 ['N']
 ['Y']
 ['N']
 ['N']
 ['Y']
 ['N']
 ['Y']
 ['Y']
 ['N']
 ['Y']
 ['Y']
 ['N']
 ['Y']
 ['Y']
 ['N']]


Split the data into training and test sets. This split will be used with the standard scaler.

In [11]:
# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

Instantiate the standard scaler and assign it to the variable 'sc'.

In [12]:
sc=StandardScaler()

In [13]:
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

We need to scale only the columns that hold numerical data.
In our dataset, the column at index 3, 'Salary', contains the numerical data with the largest spread.
Let us scale the salaries. This is why X_train is sliced at column index 3.
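Slicing by column index works, but when a DataFrame is available the numeric columns can also be selected by name. The sketch below uses scikit-learn's ColumnTransformer on four rows copied from the dataset. Scaling both 'Age' and 'Salary' is an assumption made for illustration; the notebook itself scales only 'Salary'.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Four rows copied from the dataset, for illustration only.
df = pd.DataFrame({
    "User": [151234, 151235, 151236, 151237],
    "Age": [23, 43, 54, 22],
    "Gender": ["M", "F", "M", "M"],
    "Salary": [10000, 450000, 78000, 20000],
})

# Scale only the named numeric columns; all other columns pass through unchanged.
ct = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["Age", "Salary"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)

# Column order in the result: scaled Age, scaled Salary, then User and Gender.
print(out)
```

The identifier column 'User' is numeric but carries no magnitude information, so it is deliberately left out of the scaled columns here.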

fit : Fits the scaler on the available training data. The training data is used to estimate the mean and standard deviation of the selected column.
transform : Applies the fitted scaling, y = (x - mean) / standard_deviation, to the data. The standardized data can then be used to train your model.
fit_transform : Performs both steps in one call. Here it is applied to column index 3 of the training set.
X_train[:, 3:] : Assigning the result of fit_transform back to this slice overwrites the original Salary values in X_train with the scaled values, which are used from this point on.
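To make the split between the two steps concrete, the sketch below calls fit and transform separately on the training salaries (copied from the training split printed later in this notebook) and checks that the result matches a single fit_transform call. The attributes `mean_` and `scale_` hold the statistics that fit estimated.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Training salaries, copied from the training split shown in this notebook.
train_salary = np.array(
    [20000, 50000, 9000, 35000, 21000, 80000, 37000, 78000, 49000,
     90000, 90000, 80000, 40000, 20000, 10000, 100000, 65000],
    dtype=float,
).reshape(-1, 1)

sc = StandardScaler()
sc.fit(train_salary)        # estimate and store the mean and standard deviation
print(sc.mean_, sc.scale_)  # the learned statistics

# transform applies (x - mean) / standard_deviation using the stored values.
step_by_step = sc.transform(train_salary)

# fit_transform does both steps in one call and gives the same result.
one_call = StandardScaler().fit_transform(train_salary)
print(np.allclose(step_by_step, one_call))
```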

In [15]:
X_train[:5]

array([[151247, 41, 'M', -1.0613009845228898],
       [151256, 38, 'M', -0.04769892065271404],
       [151250, 23, 'F', -1.4329550746086208],
       [151242, 25, 'M', -0.5544999525878019],
       [151240, 64, 'F', -1.0275142490605507]], dtype=object)

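Here we see that the column at index 3, i.e. Salary, has been scaled. The values are now centered on 0 with a standard deviation of 1, so they can be negative and are not limited to the range 0 to 1.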

In [17]:
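# Caution: fit_transform re-estimates the mean and standard deviation from the test set.
# To reuse the statistics learned from the training set, call sc.transform(X_test[:, 3:]) instead.
X_test[:, 3:] = sc.fit_transform(X_test[:, 3:])

The cell above scales the test set, but because it calls fit_transform the scaler is refit on the test data, so the test set is not scaled with the training statistics. The output below reflects this refit.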

In [18]:
X_test[:5]

array([[151245, 46, 'F', -0.6804469389595195],
       [151244, 33, 'M', -0.7523251367369335],
       [151255, 66, 'M', 0.002395939925913871],
       [151248, 39, 'F', -0.5151270840714671],
       [151254, 55, 'M', -0.21323865340632822]], dtype=object)
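The usual practice is to fit the scaler on the training data only and then call transform on the test data, so that both sets share the same mean and standard deviation. A minimal sketch of that pattern follows; the salary values are copied from the training and test splits in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Training and test salaries, copied from the splits shown in this notebook.
train_salary = np.array(
    [20000, 50000, 9000, 35000, 21000, 80000, 37000, 78000, 49000,
     90000, 90000, 80000, 40000, 20000, 10000, 100000, 65000],
    dtype=float,
).reshape(-1, 1)
test_salary = np.array(
    [55000, 45000, 150000, 78000, 120000, 450000], dtype=float
).reshape(-1, 1)

sc = StandardScaler()
train_scaled = sc.fit_transform(train_salary)  # statistics come from the training data only
test_scaled = sc.transform(test_salary)        # reuse them; do not refit on the test data

print(test_scaled.ravel())
```

With this approach the salary of 450000 in the test set ends up many standard deviations above the training mean, which correctly signals that it is an unusual value relative to the training data.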

Let us now apply the MinMax scaler and look at the data. Before that, we need to split the features and the dependent variable into a new training and test set. We cannot reuse the old sets because the standard scaler has already been applied to them.

In [23]:
# Split the data
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [24]:
scaler = MinMaxScaler()

The cell above instantiates the MinMax scaler and assigns it to the variable 'scaler'.

In [26]:
X_train1

array([[151247, 41, 'M', 20000],
       [151256, 38, 'M', 50000],
       [151250, 23, 'F', 9000],
       [151242, 25, 'M', 35000],
       [151240, 64, 'F', 21000],
       [151251, 57, 'M', 80000],
       [151238, 19, 'M', 37000],
       [151236, 54, 'M', 78000],
       [151239, 16, 'M', 49000],
       [151252, 85, 'F', 90000],
       [151243, 64, 'M', 90000],
       [151241, 83, 'F', 80000],
       [151253, 24, 'F', 40000],
       [151237, 22, 'M', 20000],
       [151234, 23, 'M', 10000],
       [151249, 54, 'M', 100000],
       [151246, 34, 'F', 65000]], dtype=object)

In [27]:
X_train1[:, 3:] = scaler.fit_transform(X_train1[:, 3:])

In [28]:
X_train1

array([[151247, 41, 'M', 0.12087912087912088],
       [151256, 38, 'M', 0.4505494505494505],
       [151250, 23, 'F', 0.0],
       [151242, 25, 'M', 0.2857142857142857],
       [151240, 64, 'F', 0.13186813186813184],
       [151251, 57, 'M', 0.7802197802197802],
       [151238, 19, 'M', 0.3076923076923077],
       [151236, 54, 'M', 0.7582417582417582],
       [151239, 16, 'M', 0.43956043956043955],
       [151252, 85, 'F', 0.8901098901098901],
       [151243, 64, 'M', 0.8901098901098901],
       [151241, 83, 'F', 0.7802197802197802],
       [151253, 24, 'F', 0.34065934065934067],
       [151237, 22, 'M', 0.12087912087912088],
       [151234, 23, 'M', 0.010989010989010992],
       [151249, 54, 'M', 0.9999999999999999],
       [151246, 34, 'F', 0.6153846153846154]], dtype=object)

In [29]:
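# Caution: fit_transform re-estimates the minimum and maximum from the test set.
# To reuse the range learned from the training set, call scaler.transform(X_test1[:, 3:]) instead.
X_test1[:, 3:] = scaler.fit_transform(X_test1[:, 3:])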

In [30]:
X_test1

array([[151245, 46, 'F', 0.024691358024691343],
       [151244, 33, 'M', 0.0],
       [151255, 66, 'M', 0.25925925925925924],
       [151248, 39, 'F', 0.08148148148148147],
       [151254, 55, 'M', 0.18518518518518517],
       [151235, 43, 'F', 1.0]], dtype=object)
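If the MinMax scaler is fit on the training data only and then applied to the test data with transform, test values outside the training range are mapped outside 0 to 1. The sketch below illustrates this with the salary values copied from the two splits in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training and test salaries, copied from the splits shown in this notebook.
train_salary = np.array(
    [20000, 50000, 9000, 35000, 21000, 80000, 37000, 78000, 49000,
     90000, 90000, 80000, 40000, 20000, 10000, 100000, 65000],
    dtype=float,
).reshape(-1, 1)
test_salary = np.array(
    [55000, 45000, 150000, 78000, 120000, 450000], dtype=float
).reshape(-1, 1)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_salary)  # learns the training min and max
test_scaled = scaler.transform(test_salary)        # reuses them for the test data

print(scaler.data_min_, scaler.data_max_)
print(test_scaled.ravel())
```

Test salaries above the training maximum of 100000 are mapped to values greater than 1. This is expected behaviour and shows that the test data contains values the scaler never saw during fitting.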

In [None]:
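We see here that the scaled values differ from those produced by the standard scaler. MinMaxScaler maps the values into the range 0 to 1, whereas StandardScaler centers them on 0 with a standard deviation of 1, so different scalers do not produce the same values.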

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them.
It often can, but not always. A good tip is to create rescaled copies of your dataset and race them against each other
using your test harness and a handful of algorithms you want to spot check.
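One way to race rescaled copies against each other is sketched below. It uses synthetic data from make_classification rather than the notebook's dataset, a k-nearest-neighbors classifier, and 5-fold cross-validation; all of these are assumptions chosen only to keep the example self-contained. Placing the scaler inside a pipeline means it is refit on the training portion of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data for illustration only; one feature is inflated to a much larger scale.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X[:, 0] *= 1000.0

# The same algorithm with no scaling, normalization, and standardization.
candidates = {
    "raw": make_pipeline(KNeighborsClassifier()),
    "minmax": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "standard": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean cross-validated accuracy for each variant.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Which variant scores best depends on the data and the algorithm, so the comparison should be repeated for each algorithm you want to spot check.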

Data rescaling is an important part of data preparation before applying machine learning algorithms.
Normalization is a reasonable choice when the features do not follow a normal distribution or when an algorithm expects inputs in a bounded range.
Standardization is a common default, particularly when the features are roughly Gaussian, but neither technique is guaranteed
to improve performance, so compare both on your own data.