# Feature Scaling: Scale Numerical Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).


## Overview

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This tutorial covers normalization and standardization techniques for scaling numerical data prior to modeling.

## Learning Objectives

- Understand why data scaling is a recommended pre-processing step for many machine learning algorithms
- Learn how data scaling can be achieved through normalizing or standardizing real-valued input and output variables
- Apply standardization and normalization techniques to improve predictive modeling algorithm performance

### Tasks to complete

- Load and examine the diabetes dataset
- Apply MinMaxScaler transformation
- Apply StandardScaler transformation
- Compare model performance with different scaling approaches

## Prerequisites

- Basic understanding of Python programming
- Familiarity with NumPy and scikit-learn libraries
- Knowledge of basic statistical and machine learning concepts

## Get Started

To start, we install the required packages and import the necessary libraries.

### Install required packages

In [None]:
%pip install matplotlib numpy pandas scikit-learn

### Import necessary libraries

In [None]:
from matplotlib import pyplot as plt
from numpy import asarray, mean, std
from pandas import DataFrame, read_csv
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler


## Numerical Data Scaling Methods

* **Normalization** scales each input variable separately to the range 0-1, which is the range for floating-point values where we have the most precision.
* **Standardization** scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, shifting the distribution to have a mean of zero and a standard deviation of one.

### Data Normalization

Normalization is a rescaling of the data from the original range so that all values are within the
new range of 0 and 1. Normalization requires that you know or are able to accurately estimate
the minimum and maximum observable values. You may be able to estimate these values from
your available data.

In [None]:
# example of a normalization

# define data
data = asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])
print(data)

# define min max scaler
# Transform features by scaling each feature to a given range.
# This estimator scales and translates each feature individually such
# that it is in the given range on the training set, e.g. between
# zero and one.
scaler = MinMaxScaler()

# Fit to data, then transform it.
scaled = scaler.fit_transform(data)
print(scaled)
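
To make the rescaling concrete, here is a minimal sketch that computes the min-max formula `(x - min) / (max - min)` per column with NumPy and checks it against `MinMaxScaler`. It reuses the small example array from above; the variable names `col_min`, `col_max`, `manual`, and `sk_scaled` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# same small example array as in the cell above
data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])

# column-wise minimum and maximum observed in the data
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# apply (x - min) / (max - min) to each column separately
manual = (data - col_min) / (col_max - col_min)

# compare against scikit-learn's result with the default 0-1 range
sk_scaled = MinMaxScaler().fit_transform(data)
print(np.allclose(manual, sk_scaled))
```

If the two arrays agree, the check prints `True`, which suggests the scaler applies this formula to each column independently.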

### Data Standardization

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed
values is 0 and the standard deviation is 1. Subtracting the mean value is also called centering
the data. Like normalization, standardization can be useful, and even
required by some machine learning algorithms, when your data has input values with differing
scales. Standardization assumes that your observations fit a Gaussian distribution (bell curve)
with a well-behaved mean and standard deviation. You can still standardize your data if this
expectation is not met, but you may not get reliable results.

In [None]:
# example of a standardization

# define data
data = asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])
print(data)

# define standard scaler
# Standardize features by removing the mean and scaling to unit variance.
scaler = StandardScaler()

# Fit to data, then transform it.
scaled = scaler.fit_transform(data)
print(scaled)
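
As a rough check of what the scaler computes, the sketch below standardizes the same example array by hand with `(x - mean) / std` and compares the result with `StandardScaler`. The names `col_mean`, `col_std`, `manual`, and `sk_scaled` are illustrative. It uses NumPy's default population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# same small example array as in the cell above
data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])

# column-wise mean and population standard deviation (ddof=0)
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# subtract the mean (centering), then divide by the standard deviation
manual = (data - col_mean) / col_std

# compare against scikit-learn's result
sk_scaled = StandardScaler().fit_transform(data)
print(np.allclose(manual, sk_scaled))

# each column should now have mean close to 0 and standard deviation close to 1
print(manual.mean(axis=0), manual.std(axis=0))
```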

## Diabetes Dataset

The dataset classifies patient data as either an onset of diabetes within five years or not.

```
Number of Instances: 768
Number of Attributes: 8 plus class 
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

* Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
* Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

In [None]:
# load and summarize the diabetes dataset

# load the dataset
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)
print(dataset.head())

# summarize the shape of the dataset
print(dataset.shape)

# summarize each variable
print(dataset.describe())

This confirms the 8 input variables, one output variable, and 768 rows of data. The statistical
summary of the input variables shows that each variable has a very different scale. This makes
it a good dataset for exploring data scaling methods.

We can create a histogram for each input variable. The plots confirm that each input variable
has a different scale.


In [None]:
# histograms of the variables
# DataFrame.hist returns an array of Axes, one per variable
axes = dataset.hist(xlabelsize=4, ylabelsize=4)
for ax in axes.ravel():
    ax.title.set_size(4)

# show the plot
plt.show()

Next, let's fit and evaluate a machine learning model on the raw dataset. We will use
a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated
stratified k-fold cross-validation.

In [None]:
# evaluate knn on the raw diabetes dataset

# load the dataset
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)
data = dataset.values

# separate into input and output columns
X, y = data[:, :-1], data[:, -1]

# ensure inputs are floats and output is an integer label
X = X.astype("float32")
y = LabelEncoder().fit_transform(y.astype("str"))

# define and configure the model
model = KNeighborsClassifier()

# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# report model performance
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

In this case we can see that the model achieved a mean classification accuracy of about
71.7 percent.
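
One likely reason scaling matters for KNN is that it relies on distances between rows, so a variable with a large numeric range can dominate the distance. The sketch below illustrates this with two made-up rows and made-up feature ranges; none of these numbers come from the diabetes dataset, and `contribution_share` is a hypothetical helper written for this illustration.

```python
import numpy as np

# two hypothetical rows: a large-scale feature and a small-scale feature
a = np.array([100.0, 0.2])
b = np.array([110.0, 2.0])


def contribution_share(u, v):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared_diff = (u - v) ** 2
    return squared_diff / squared_diff.sum()


# on the raw values, the large-scale feature accounts for most of the distance
raw_share = contribution_share(a, b)
print(raw_share)

# hypothetical feature ranges (minimum assumed to be 0) used to rescale to 0-1
ranges = np.array([800.0, 2.5])
scaled_share = contribution_share(a / ranges, b / ranges)
print(scaled_share)
```

After rescaling, the second feature, which differs far more relative to its own range, accounts for most of the distance.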

## `MinMaxScaler` Transform

We can apply the `MinMaxScaler` to the diabetes dataset directly to normalize the input variables.
We will use the default configuration and scale values to the range 0 to 1. First, a `MinMaxScaler`
instance is defined with default hyperparameters. Once defined, we can call the `fit_transform()`
function and pass it our dataset to create a transformed version of our dataset.
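
The default range of 0 to 1 is controlled by the `feature_range` argument of `MinMaxScaler`. As a small sketch on the toy array used earlier (not the diabetes data), the following compares the default range with an alternative range of -1 to 1. The choice of `(-1, 1)` is only an example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data with two columns on very different scales
data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])

# default configuration: each column is scaled to the range 0 to 1
default_scaled = MinMaxScaler().fit_transform(data)
print(default_scaled.min(axis=0), default_scaled.max(axis=0))

# alternative configuration: each column is scaled to the range -1 to 1
custom_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)
print(custom_scaled.min(axis=0), custom_scaled.max(axis=0))
```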

### Summary of each input variable

In [None]:
# visualize a minmax scaler transform of the diabetes dataset

# load the dataset
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# retrieve just the numeric input values
data = dataset.values[:, :-1]

# perform a min-max scaler transform of the dataset
trans = MinMaxScaler()
data = trans.fit_transform(data)

# convert the array back to a dataframe
dataset = DataFrame(data)

# summarize
print(dataset.describe())

We can see that the
distributions have been adjusted and that the minimum and maximum values for each variable
are now 0.0 and 1.0 respectively.

### Histogram plots of the variables

In [None]:
# histograms of the variables
# DataFrame.hist returns an array of Axes, one per variable
axes = dataset.hist(xlabelsize=4, ylabelsize=4)
for ax in axes.ravel():
    ax.title.set_size(4)

# show the plot
plt.show()

Histogram plots of the variables are created, although the distributions don't look much
different from their original distributions seen in the previous section. We can confirm that the
minimum and maximum values are zero and one respectively, as we expected.

### Model evaluation

Next, let's evaluate the same KNN model as the previous section, but in this case, on a
MinMaxScaler transform of the dataset.

In [None]:
# evaluate knn on the diabetes dataset with minmax scaler transform

# load the dataset
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)
data = dataset.values

# separate into input and output columns
X, y = data[:, :-1], data[:, -1]

# ensure inputs are floats and output is an integer label
X = X.astype("float32")
y = LabelEncoder().fit_transform(y.astype("str"))

# define the pipeline
trans = MinMaxScaler()
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[("t", trans), ("m", model)])

# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# report pipeline performance
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

We can see that the `MinMaxScaler` transform results in a lift in
performance from 71.7 percent accuracy without the transform to about 73.9 percent with the
transform.
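
The `Pipeline` matters here because `cross_val_score` refits it on each training fold, so the scaler learns its minimum and maximum from training rows only. The sketch below shows the same idea by hand on a single train/test split of synthetic data. The random data and the 70/30 split are assumptions for illustration, not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic data standing in for a real dataset
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

# hold out 30 percent of the rows as a test set
X_train, X_test = train_test_split(X, test_size=0.3, random_state=1)

# learn the minimum and maximum from the training rows only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# reuse the training minimum and maximum on the held-out rows
X_test_scaled = scaler.transform(X_test)

# training columns span exactly 0 to 1; test columns may fall slightly outside
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

Calling `fit_transform()` on the full dataset before splitting would let information from the held-out rows influence the scaling, which can make the accuracy estimate optimistic.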

## `StandardScaler` Transform

We can apply the `StandardScaler` to the diabetes dataset directly to standardize the input
variables. We will use the default configuration, which subtracts the mean to center each variable
on 0.0 and divides by the standard deviation to give a standard deviation of 1.0. First, a
`StandardScaler` instance is defined with default hyperparameters. Once defined, we can call
the `fit_transform()` function and pass it our dataset to create a transformed version of our
dataset.
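
After fitting, a `StandardScaler` stores the statistics it learned in its `mean_` and `scale_` attributes, and `inverse_transform()` maps scaled values back to the original units. The following sketch shows both on the toy array used earlier rather than on the diabetes data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data with two columns on very different scales
data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07], [4, 0.1]])

# fit the scaler, then inspect the per-column statistics it learned
scaler = StandardScaler().fit(data)
print(scaler.mean_)   # per-column mean
print(scaler.scale_)  # per-column standard deviation

# transform, then map the scaled values back to the original units
scaled = scaler.transform(data)
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, data))
```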

### Summary of each input variable

In [None]:
# visualize a standard scaler transform of the diabetes dataset

# load the dataset
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)

# retrieve just the numeric input values
data = dataset.values[:, :-1]

# perform a standard scaler transform of the dataset
trans = StandardScaler()
data = trans.fit_transform(data)

# convert the array back to a dataframe
dataset = DataFrame(data)

# summarize
print(dataset.describe())

We can see that the
distributions have been adjusted and that the mean is a very small number close to zero and
the standard deviation is very close to 1.0 for each variable.

### Histogram plots of the variables

In [None]:
# histograms of the variables
# DataFrame.hist returns an array of Axes, one per variable
axes = dataset.hist(xlabelsize=4, ylabelsize=4)
for ax in axes.ravel():
    ax.title.set_size(4)

# show the plot
plt.show()

Histogram plots of the variables are created, although the distributions don't look much
different from their original distributions seen in the previous section other than their scale on
the x-axis. We can see that the center of mass for each distribution is centered on zero, which is
more obvious for some variables than others.

### Model evaluation

Next, let's evaluate the same KNN model as the previous section, but in this case, on a
StandardScaler transform of the dataset. The complete example is listed below.

In [None]:
# evaluate knn on the diabetes dataset with standard scaler transform

# load the dataset
dataset = read_csv("../../Data/pima-indians-diabetes.csv", header=None)
data = dataset.values

# separate into input and output columns
X, y = data[:, :-1], data[:, -1]

# ensure inputs are floats and output is an integer label
X = X.astype("float32")
y = LabelEncoder().fit_transform(y.astype("str"))

# define the pipeline
trans = StandardScaler()
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[("t", trans), ("m", model)])

# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

# report pipeline performance
print("Accuracy: %.3f (%.3f)" % (mean(n_scores), std(n_scores)))

We can see that the StandardScaler transform results in a lift in
performance from 71.7 percent accuracy without the transform to about 74.1 percent with the
transform, slightly higher than the result using the MinMaxScaler that achieved 73.9 percent.
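
The three evaluations can also be run side by side in one loop. The sketch below is one way to structure that comparison. To stay self-contained it uses a synthetic dataset from `make_classification` with one column deliberately inflated, so its scores will differ from the diabetes results reported above. The dataset, the smaller cross-validation setup, and the names `candidates` and `results` are all assumptions for illustration.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# synthetic stand-in data; one column is inflated to mimic differing scales
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
X[:, 0] *= 1000.0

# scaling approaches to compare; None means no scaling step
candidates = {"raw": None, "minmax": MinMaxScaler(), "standard": StandardScaler()}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
results = {}
for name, trans in candidates.items():
    # build a pipeline with an optional scaling step before the model
    steps = [("m", KNeighborsClassifier())]
    if trans is not None:
        steps.insert(0, ("t", trans))
    scores = cross_val_score(Pipeline(steps=steps), X, y, scoring="accuracy", cv=cv)
    results[name] = (mean(scores), std(scores))
    print("%s: %.3f (%.3f)" % (name, results[name][0], results[name][1]))
```

Swapping the synthetic `X` and `y` for the arrays loaded from the diabetes CSV would apply the same comparison to the tutorial's dataset.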

## Conclusion

Data scaling techniques like normalization and standardization can improve the performance of models that are sensitive to the scale of their inputs. In this tutorial, KNN accuracy rose from about 71.7% without scaling to about 73.9% with MinMaxScaler and about 74.1% with StandardScaler. The choice between scaling methods depends on your specific data and model requirements.

## Clean up

Remember to:
- Delete any downloaded datasets
- Shut down your SageMaker notebook instance when finished
- Remove any unnecessary resources

