# Standardisation

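**Standardisation** (or **z-score normalization**) is a procedure that rescales each feature so that it has mean $\mu = 0$ and standard deviation $\sigma = 1$, the same mean and standard deviation as a *standard normal distribution*. It shifts and scales the values but does not change the shape of their distribution, so a skewed feature stays skewed.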
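Written out for one feature, the rescaling is the usual z-score formula. The symbols $x_i$, $z_i$ and $n$ are notation introduced here for illustration, and $\sigma$ is taken to be the population standard deviation (dividing by $n$), which is what `np.std` computes by default:

$$
z_i = \frac{x_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}
$$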

In [2]:
import os, re
import numpy as np

In [4]:
data_set_path = re.sub('Algorithm', 'Dataset/insurance.csv', os.getcwd())

In [6]:
data_set = np.loadtxt(data_set_path, delimiter=',', skiprows=1, usecols=[1,2])

In [7]:
data_set.shape

(63, 2)

In [8]:
np.min(data_set, axis = 0)

array([0., 0.])

In [9]:
np.max(data_set, axis = 0)

array([124. , 422.2])

The `data_set` above has two columns with the following ranges:
1. column 1: (0, 124)
2. column 2: (0, 422.2)

In [10]:
def z_score(x):
    # np.std defaults to ddof=0, i.e. the population standard deviation
    return (x - np.mean(x)) / np.std(x)
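As a quick sanity check, here is a small worked example on made-up numbers (the array `x` is an assumption for illustration, not part of the insurance data). Its mean is 5 and its population standard deviation is 2, so the z-scores can be checked by hand:

```python
import numpy as np

def z_score(x):
    # Same definition as in the cell above
    return (x - np.mean(x)) / np.std(x)

# Hypothetical toy data: mean 5, population standard deviation 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = z_score(x)

print(z)         # each value is (x - 5) / 2
print(z.mean())  # approximately 0
print(z.std())   # approximately 1
```

For instance, the first value becomes (2 - 5) / 2 = -1.5 and the last becomes (9 - 5) / 2 = 2.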

In [11]:
def standardise(data_set):
    # Standardise every column in place: the array passed in is overwritten
    for j in range(data_set.shape[1]):
        data_set[:, j] = z_score(data_set[:, j])

    return None

In [12]:
standardise(data_set)

In [13]:
np.min(data_set, axis = 0)

array([-0.9887287 , -1.13338762])

In [14]:
np.max(data_set, axis = 0)

array([4.36397304, 3.74011686])

After standardisation, the two columns have the following ranges:
1. column 1: (-0.99, 4.36)
2. column 2: (-1.13, 3.74)
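The ranges alone do not show that each column now has mean 0 and standard deviation 1. The sketch below checks this on a made-up array, using a vectorised variant that returns a new array instead of overwriting its input. The name `standardise_columns`, the `toy` data, and the choice to leave constant columns unscaled are assumptions for illustration, not part of the original notebook:

```python
import numpy as np

def standardise_columns(data):
    # Work on a float copy so integer input is not truncated or modified
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)      # per-column mean
    sigma = data.std(axis=0)    # per-column population std (ddof=0)
    # Assumption: a constant column (sigma == 0) is only centred, not scaled
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (data - mu) / sigma

# Hypothetical integer-valued data with two columns on different scales
toy = np.array([[0, 0], [10, 100], [20, 200], [30, 500]])
scaled = standardise_columns(toy)

print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]
print(toy[3])               # the input array is left unchanged
```

Returning a new array avoids having to reload the data before trying another scaling method, at the cost of holding two copies in memory.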

# Using StandardScaler

In [15]:
from sklearn.preprocessing import StandardScaler

In [16]:
# reload the dataset, because standardise() overwrote it in place
data_set = np.loadtxt(data_set_path, delimiter=',', skiprows=1, usecols=[1,2])

In [18]:
data_set_scaled = StandardScaler().fit_transform(data_set)

In [19]:
np.min(data_set_scaled, axis = 0)

array([-0.9887287 , -1.13338762])

In [20]:
np.max(data_set_scaled, axis = 0)

array([4.36397304, 3.74011686])
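The minimum and maximum match the manual result above. A minimal sketch of why, on made-up data (the arrays `train` and `new` are assumptions for illustration): `StandardScaler` stores the per-column mean and standard deviation in `mean_` and `scale_` when fitted, and `transform` reuses them, so the same statistics can be applied to data that was not seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column means are 15 and 200
train = np.array([[0.0, 0.0], [10.0, 100.0], [20.0, 200.0], [30.0, 500.0]])
new = np.array([[15.0, 200.0]])

scaler = StandardScaler().fit(train)

# Manual z-score with the population standard deviation (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(scaler.transform(train), manual))

# Statistics learned from the training data
print(scaler.mean_)
print(scaler.scale_)

# New rows are scaled with the stored training statistics
new_scaled = scaler.transform(new)
print(new_scaled)
```

Because `new` sits exactly at the training means, both of its scaled values come out as 0.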