# Overview
Data normalization and scaling are techniques that adjust the values of a dataset so that its features are on comparable scales. This is important because many statistical and machine learning algorithms, particularly those based on distances or gradient descent, are sensitive to the scale of the features, and some also assume that the data is approximately normally distributed. If these assumptions are not met, the algorithms may produce biased or inaccurate results. Note that linear rescaling changes the location and spread of a feature, not the shape of its distribution.

Normalizing or scaling transforms the data into a consistent and interpretable form that is suitable for further analysis and modeling. It is an important step in the data science process because it helps to ensure the validity and accuracy of the results obtained later.


In this module, we will cover the following topics:

I. Data normalization: Transforming the values of a dataset so that they are in a specific range, usually between 0 and 1.
II. Data scaling: Transforming the numerical values of a dataset to a specific range.

# Learning Objectives
In this module, the learners will:

* Apply normalization and scaling techniques to datasets
* Choose the appropriate technique between normalization and scaling, depending on the dataset
* Identify datasets that need to be scaled and/or normalized before proceeding with analysis

Let's get started!

## What is data normalization?
Data normalization is the process of transforming the values of a dataset so that they are in a specific range, usually between 0 and 1. Data normalization is often used when the scale of the features in a dataset varies significantly.
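
The paragraph above does not name a specific method for mapping values into the range 0 to 1. One common choice is min-max normalization, which replaces each value x with (x - min) / (max - min). The sketch below assumes that method; the sample values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample feature (values chosen only for illustration)
data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: (x - min) / (max - min)
data_min, data_max = data.min(), data.max()
normalized_data = (data - data_min) / (data_max - data_min)

print(normalized_data)
```

The smallest value maps to 0, the largest to 1, and the rest fall in between in proportion to their position in the original range. scikit-learn's `MinMaxScaler` applies the same transformation to each column of a 2D array by default.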

## Why is it important?
Data normalization is important in the context of data processing and analysis because it helps to ensure that the features of a dataset are on a similar scale and have similar properties. For example, consider a dataset that has features that represent different units of measurement, such as height in inches and weight in pounds. If the scale of these features is not normalized, the results of a machine learning algorithm may be biased towards the feature with the larger scale, since it will have a greater influence on the outcome.

Normalization helps to mitigate this issue by transforming the values of the features so that they are on a similar scale, usually between 0 and 1. This makes it possible to compare the features in a consistent and interpretable way and also enables us to apply algorithms that assume our data is on a consistent scale.

The effect is strongest when the ranges differ by orders of magnitude. For example, if one feature ranges over thousands while another ranges between 0 and 1, normalizing prevents the first feature from dominating the analysis simply because of its units, so that the influence of each feature reflects the information it carries rather than its numeric range.
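
To make the height and weight example concrete, the following sketch compares Euclidean distances before and after rescaling. The four data points are hypothetical, and min-max normalization to the range 0 to 1 is assumed as the rescaling method.

```python
import numpy as np

# Hypothetical people: columns are height (inches) and weight (pounds)
people = np.array([
    [60.0, 150.0],  # A
    [72.0, 155.0],  # B: much taller than A, similar weight
    [61.0, 170.0],  # C: similar height to A, somewhat heavier
    [66.0, 250.0],  # D: included to widen the weight range
])

def distances_from_first(points):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(points[1:] - points[0], axis=1)

# Distances on the raw values are dominated by the weight column
raw_distances = distances_from_first(people)

# Min-max normalize each column to [0, 1], then recompute the distances
col_min = people.min(axis=0)
col_max = people.max(axis=0)
normalized_people = (people - col_min) / (col_max - col_min)
normalized_distances = distances_from_first(normalized_people)

labels = ["B", "C", "D"]
print("Nearest to A (raw):       ", labels[np.argmin(raw_distances)])
print("Nearest to A (normalized):", labels[np.argmin(normalized_distances)])
```

On the raw values, B is the nearest neighbor of A even though B is a foot taller, because a difference of a few pounds outweighs a difference of many inches. After rescaling, C becomes the nearest neighbor, since both features now contribute on the same footing.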

## Z-score
Z-score normalization, also known as standardization, is a technique for transforming data by subtracting the mean and dividing by the standard deviation of the data. This results in a new dataset with a mean of 0 and a standard deviation of 1. Here's an example of how to apply z-score normalization in Python:

In [2]:
from sklearn.preprocessing import StandardScaler
import numpy as np

# Create a sample dataset
data = np.array([1, 2, 3, 4, 5])

# Apply z-score normalization
# (fit_transform expects a 2D array, so the 1D data is reshaped into a single column)
scaler = StandardScaler()
normalized_data = scaler.fit_transform(data.reshape(-1, 1))

print(normalized_data)

[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]


As we can see, the values of the data have been transformed to their z-scores, with a mean of 0 and a standard deviation of 1. We can now use these normalized values for further analysis or modeling.
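
The same z-scores can be computed by hand, which makes the formula z = (x - mean) / std explicit. This is a minimal sketch using only NumPy. It uses the population standard deviation (`ddof=0`), which is the estimator `StandardScaler` uses; with the sample standard deviation (`ddof=1`) the values would differ slightly.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population mean and standard deviation (ddof=0, matching StandardScaler)
mean = data.mean()
std = data.std(ddof=0)

# z-score: subtract the mean, then divide by the standard deviation
z_scores = (data - mean) / std

print(z_scores)
print("mean:", z_scores.mean(), "std:", z_scores.std())
```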

## Unit vector
Unit vector normalization, also known as vector normalization or vector scaling, is a technique used to transform a vector into a unit vector, which has a magnitude of 1. This is achieved by dividing each component of the vector by its length or magnitude.

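Unit vector normalization is important for data wrangling because it enables us to compare vectors by their direction, regardless of their magnitude. Note that it rescales each sample (row) rather than each feature (column), so on its own it does not put features on a common scale. In machine learning, it is often used as a preprocessing step when similarity between samples depends on the angle between them, for example cosine similarity on text feature vectors, and it is sometimes combined with distance-based algorithms such as K-nearest neighbors (KNN) and support vector machines (SVM).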

Here is an example of how to apply unit vector normalization in Python:

In [3]:
from sklearn.preprocessing import Normalizer
import numpy as np

# Create a sample dataset
data = np.array([[1, 2], [3, 4]])

# Apply unit vector normalization
normalizer = Normalizer()
normalized_data = normalizer.fit_transform(data)

print(normalized_data)

[[0.4472136  0.89442719]
 [0.6        0.8       ]]


As we can see, each row of the data has been transformed into a unit vector (i.e., a vector with a length of 1). Each row of the 'normalized_data' array is a unit vector because the 'Normalizer' class in scikit-learn applies L2 normalization to each sample in the input data by default.

L2 normalization (also known as Euclidean normalization) involves dividing each element of a vector by its L2 norm or magnitude. The L2 norm of a vector x with n elements is defined as the square root of the sum of the squared values of its elements. By dividing each element of a vector by its L2 norm, we can transform the vector into a unit vector with a length or magnitude of 1. That is, we are scaling the vector so that it points in the same direction but has a magnitude of 1. We can now use these normalized values for further analysis or modeling.
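
The L2 calculation described above can be reproduced directly with NumPy. This is a minimal sketch that assumes no row is all zeros (which would cause a division by zero); the variable names are chosen for illustration.

```python
import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0]])

# L2 norm of each row: square root of the sum of squared elements.
# keepdims=True keeps the result as a column so it broadcasts across each row.
row_norms = np.linalg.norm(data, axis=1, keepdims=True)

# Divide every element by the norm of its row
unit_vectors = data / row_norms

print(unit_vectors)
print(np.linalg.norm(unit_vectors, axis=1))  # length of each row after scaling
```

The first row has norm sqrt(1 + 4) = sqrt(5) and the second has norm sqrt(9 + 16) = 5, so the second row becomes [0.6, 0.8], and both rows end up with length 1.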

## Mean subtraction
Mean subtraction, also known as mean centering, is a technique for transforming data by subtracting the mean from each data point. This results in a new dataset with a mean of 0, while the spread of the data is left unchanged.

Here's an example of how to apply mean subtraction in Python:

In [4]:
import numpy as np

# Create a sample dataset
data = np.array([1, 2, 3, 4, 5])

# Calculate the mean of the data
mean = np.mean(data)

# Apply mean subtraction
normalized_data = data - mean

print(normalized_data)

[-2. -1.  0.  1.  2.]


As we can see, the values of the data have been transformed to have a mean of 0. We can now use these normalized values for further analysis or modeling.
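
For a dataset with several features, the mean is usually subtracted from each column separately. The following sketch illustrates this with a small hypothetical 2D array, where rows are samples and columns are features.

```python
import numpy as np

# Hypothetical dataset: rows are samples, columns are features
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Mean of each column (feature)
column_means = data.mean(axis=0)

# Broadcasting subtracts each column's mean from every row
centered_data = data - column_means

print(centered_data)
print(centered_data.mean(axis=0))  # each column now has mean 0
```

Passing `axis=0` is what makes the mean per column; without it, `np.mean` would return a single mean over all values, and the individual columns would generally not be centered at 0.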

## Quantile
Quantile normalization is a technique for transforming data so that every column follows the same reference distribution. A common choice of reference distribution is the mean of the sorted values across columns. Each value is then replaced with the value from the reference distribution that has the same rank within its column.

Here's an example of how to apply quantile normalization in Python:

In [5]:
import numpy as np
from scipy.stats import rankdata

# Create a sample dataset (rows are observations, columns are features)
data = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

# Sort the values of each column
sorted_data = np.sort(data, axis=0)

# Create the reference distribution by averaging the sorted values across columns
reference_distribution = np.mean(sorted_data, axis=1)

# Compute the zero-based rank of each value within its column
# (method="ordinal" gives distinct integer ranks even when values are tied)
ranks = rankdata(data, method="ordinal", axis=0).astype(int) - 1

# Replace each value with the reference value that has the same rank
quantile_normalized_data = reference_distribution[ranks]

print("Original data:\n",data,"\n")
print("Reference distribution:\n",reference_distribution,"\n")
print("Normalized data:\n",quantile_normalized_data)

Original data:
 [[10 20 30]
 [40 50 60]
 [70 80 90]] 

Reference distribution:
 [20. 50. 80.] 

Normalized data:
 [[20. 20. 20.]
 [50. 50. 50.]
 [80. 80. 80.]]


As we can see, the distribution of the data has been transformed to be identical to the reference distribution. We can now use these normalized values for further analysis or modeling. Note that this method works best for datasets with a similar distribution across columns and that it can be computationally expensive for large datasets.
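
The sample dataset above has columns that are already sorted and equally spaced, which hides how values are reassigned by rank. The sketch below wraps one common formulation in a helper and applies it to unsorted columns. It assumes that the reference distribution is the mean of the sorted values across columns, that there are no missing values, and that ties can be broken by position. The function name `quantile_normalize` and the data are hypothetical.

```python
import numpy as np

def quantile_normalize(data):
    """Quantile-normalize the columns of a 2D array (illustrative helper).

    Assumes no missing values; tied values are ranked by position.
    """
    data = np.asarray(data, dtype=float)
    # Reference distribution: mean of the sorted values across columns
    reference = np.sort(data, axis=0).mean(axis=1)
    # Zero-based rank of each value within its column
    ranks = np.argsort(np.argsort(data, axis=0), axis=0)
    # Look up the reference value for each rank
    return reference[ranks]

# Hypothetical dataset with differently distributed, unsorted columns
data = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.5, 6.0],
    [4.0, 2.0, 8.0],
])

normalized = quantile_normalize(data)
print(normalized)
```

After the transformation, every column contains the same set of values (the reference distribution), arranged according to the original ordering within that column. For example, the largest value in each column is replaced by the mean of the three column maxima.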