# Preprocessing - Scale Transforms
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

In [1]:
!pip install azureml-dataprep



In [2]:
import azureml.dataprep as dprep

DataPrep provides a number of transformation functions to scale numeric columns.

## Min-Max Scaler

The min-max scaler linearly rescales all values in a column to a desired range (typically [0, 1]). This technique is also commonly known as feature scaling or unity-based normalization.
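
As a reference for what the scaler computes, the standard min-max formula can be written with the parameter names that appear later in this notebook (`data_min`, `data_max`, `range_min`, `range_max`). This is the textbook definition; that DataPrep applies it unchanged is an assumption here, not something confirmed from the library source:

$$
x' = \mathrm{range\_min} + \frac{x - \mathrm{data\_min}}{\mathrm{data\_max} - \mathrm{data\_min}} \cdot (\mathrm{range\_max} - \mathrm{range\_min})
$$

With the default range [0, 1], this reduces to $x' = (x - \mathrm{data\_min}) / (\mathrm{data\_max} - \mathrm{data\_min})$.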

First, we will load a dataset containing information about baseball players. We will only keep a few columns.

In [3]:
df = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/baseball-players.csv')
df = df.keep_columns(columns=['playerID', 'weight', 'height'])
df = df.to_number(columns=['weight', 'height'])
df.head(25)

Unnamed: 0,playerID,weight,height
0,aardsda01,205,75
1,aaronha01,180,72
2,aaronto01,190,75
3,aasedo01,190,75
4,abadan01,184,73
5,abadfe01,220,73
6,abadijo01,192,72
7,abbated01,170,71
8,abbeybe01,175,71
9,abbeych01,169,68


Using `get_profile()`, we can see the shape of the numeric columns, such as the min, max, count, and number of error values.

In [4]:
df.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
playerID,FieldType.STRING,aardsda01,zwilldu01,18589.0,0.0,18589.0,0.0,0.0,0.0,,,,,,,,,,,,,,
weight,FieldType.DECIMAL,65,320,18589.0,0.0,18589.0,0.0,872.0,0.0,129.299,160.0,159.997,169.998,183.553,196.655,222.46,245.242,279.792,185.563,20.9983,440.928,0.674785,1.173
height,FieldType.DECIMAL,43,83,18589.0,0.0,18589.0,0.0,809.0,0.0,62.9458,68.9976,68.9972,70.3849,72.0,73.9975,76.0,78.3518,80.6296,72.2353,2.59899,6.75476,-0.150168,0.965635


To apply min-max scaling, we call `min_max_scale` on the dataflow and specify the column name. This triggers a full data scan over the column to determine the min and max values, and then performs the scaling. Note that the min and max values of the column are captured and preserved at this point, so if the same dataflow steps are applied to a different dataset, the min-max scaler must be re-executed to pick up that dataset's min and max.

In [5]:
df2 = df.min_max_scale(column='weight')
df2.head(25)

Unnamed: 0,playerID,weight,height
0,aardsda01,0.54902,75
1,aaronha01,0.45098,72
2,aaronto01,0.490196,75
3,aasedo01,0.490196,75
4,abadan01,0.466667,73
5,abadfe01,0.607843,73
6,abadijo01,0.498039,72
7,abbated01,0.411765,71
8,abbeybe01,0.431373,71
9,abbeych01,0.407843,68


Looking at the data profile, we can see that the weight column is now scaled: the min is 0 and the max is 1. Error values and missing values from the source column are preserved.

In [6]:
df2.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
playerID,FieldType.STRING,aardsda01,zwilldu01,18589.0,0.0,18589.0,0.0,0.0,0.0,,,,,,,,,,,,,,
weight,FieldType.DECIMAL,0,1,18589.0,0.0,18589.0,0.0,872.0,0.0,0.252152,0.372547,0.372535,0.411756,0.464912,0.516333,0.616861,0.706902,0.842321,0.472795,0.0823462,0.0067809,0.674785,1.173
height,FieldType.DECIMAL,43,83,18589.0,0.0,18589.0,0.0,809.0,0.0,62.9458,68.9976,68.9972,70.3849,72.0,73.9975,76.0,78.3518,80.6296,72.2353,2.59899,6.75476,-0.150168,0.965635


We can also specify a custom range for the scaling. Instead of [0, 1], let's choose [-10, 10].

In [7]:
df3 = df.min_max_scale(column='weight', range_min=-10, range_max=10)
df3.head(10)

Unnamed: 0,playerID,weight,height
0,aardsda01,0.980392,75.0
1,aaronha01,-0.980392,72.0
2,aaronto01,-0.196078,75.0
3,aasedo01,-0.196078,75.0
4,abadan01,-0.666667,73.0
5,abadfe01,2.156863,73.0
6,abadijo01,-0.039216,72.0
7,abbated01,-1.764706,71.0
8,abbeybe01,-1.372549,71.0
9,abbeych01,-1.843137,68.0
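
The scaled values shown so far can be checked by hand. The sketch below is a plain-Python illustration, not the DataPrep implementation: the helper `min_max_scale_value` is a hypothetical name, and the min (65) and max (320) are taken from the profile of the weight column shown earlier.

```python
def min_max_scale_value(x, data_min, data_max, range_min=0.0, range_max=1.0):
    """Linearly map x from [data_min, data_max] onto [range_min, range_max].

    Hypothetical helper for illustration; assumes the standard min-max formula.
    """
    fraction = (x - data_min) / (data_max - data_min)
    return range_min + fraction * (range_max - range_min)


# Min and Max of the weight column, as reported by get_profile().
WEIGHT_MIN, WEIGHT_MAX = 65, 320

# Weights of the first three players in the dataset.
weights = [205, 180, 190]

# Default range [0, 1].
default_scaled = [
    round(min_max_scale_value(w, WEIGHT_MIN, WEIGHT_MAX), 6) for w in weights
]

# Custom range [-10, 10].
custom_scaled = [
    round(min_max_scale_value(w, WEIGHT_MIN, WEIGHT_MAX, -10, 10), 6)
    for w in weights
]

print(default_scaled)
print(custom_scaled)
```

The two printed lists should agree with the first three rows of the [0, 1] and [-10, 10] outputs above, up to rounding.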


In some cases, we may want to manually provide the min and max of the data in the source column. For example, we may want to avoid a full data scan because the dataset is large and we already know the min and max. We can pass the known values to the `min_max_scale` function, and the column will be scaled using them. Let's say we want to scale the height column so that 50 inches (`data_min`) becomes 0 (`range_min`). Because we supply only `data_min`, the data is still scanned to find `data_max`, which becomes 1 (`range_max`).

In [8]:
df4 = df.min_max_scale(column='height', data_min=50)
df4.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
playerID,FieldType.STRING,aardsda01,zwilldu01,18589.0,0.0,18589.0,0.0,0.0,0.0,,,,,,,,,,,,,,
weight,FieldType.DECIMAL,65,320,18589.0,0.0,18589.0,0.0,872.0,0.0,129.299,160.0,159.997,169.998,183.553,196.655,222.46,245.242,279.792,185.563,20.9983,440.928,0.674785,1.173
height,FieldType.DECIMAL,-0.212121,1,18589.0,0.0,18589.0,0.0,809.0,0.0,0.392298,0.575684,0.575672,0.617726,0.666667,0.727198,0.787879,0.859144,0.928102,0.673796,0.0787573,0.00620272,-0.150168,0.965635
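
The negative min in the height profile follows from the override: any height below the supplied `data_min` of 50 maps below `range_min`. The sketch below reproduces the arithmetic in plain Python. The helper name is hypothetical, and the true min (43) and max (83) come from the earlier profile of the unscaled column.

```python
def min_max_scale_value(x, data_min, data_max, range_min=0.0, range_max=1.0):
    """Hypothetical helper: standard min-max formula, no clipping applied."""
    fraction = (x - data_min) / (data_max - data_min)
    return range_min + fraction * (range_max - range_min)


# True Min and Max of the height column, from the original profile.
HEIGHT_MIN, HEIGHT_MAX = 43, 83

# The value passed as data_min in the cell above.
DATA_MIN_OVERRIDE = 50

# The shortest height is below the override, so it lands below 0.
scaled_min = min_max_scale_value(HEIGHT_MIN, DATA_MIN_OVERRIDE, HEIGHT_MAX)

# The tallest height equals the scanned data_max, so it maps to 1.
scaled_max = min_max_scale_value(HEIGHT_MAX, DATA_MIN_OVERRIDE, HEIGHT_MAX)

# The median height of 72 inches.
scaled_median = min_max_scale_value(72, DATA_MIN_OVERRIDE, HEIGHT_MAX)

print(round(scaled_min, 6), round(scaled_max, 6), round(scaled_median, 6))
```

These correspond to the Min, Max, and 50% Quantile entries of the height row in the profile above. If out-of-range results are undesirable, they would need to be clipped in a separate step; the sketch applies no clipping.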


### Using a Min-Max Scaler Builder

For more flexibility when constructing the arguments for the min-max scaling, we can use a min-max scaler builder.

In [9]:
builder = df.builders.min_max_scale(column='weight')
builder

MinMaxScalerBuilder
    column: 'weight'
    range_min: 0
    range_max: 1
    data_min: None
    data_max: None

Calling `builder.learn()` will trigger a full data scan to see what `data_min` and `data_max` are. We can choose whether to use these values, or set custom values.

In [10]:
builder.learn()
builder

MinMaxScalerBuilder
    column: 'weight'
    range_min: 0
    range_max: 1
    data_min: 65.0
    data_max: 320.0

If we want to provide custom values for any of the arguments, we can simply update the builder object.

In [11]:
builder.range_max = 10
builder.data_min = 50
builder

MinMaxScalerBuilder
    column: 'weight'
    range_min: 0
    range_max: 10
    data_min: 50
    data_max: 320.0

When we are satisfied with the arguments, we call `builder.to_dataflow()` to get the result. Note that the min and max values of the source column are preserved by the builder at this point. To get the true `data_min` and `data_max` values again, we need to set those arguments on the builder to `None` and then call `builder.learn()` again.

In [12]:
df5 = builder.to_dataflow()
df5.head(25)

Unnamed: 0,playerID,weight,height
0,aardsda01,5.74074,75
1,aaronha01,4.81481,72
2,aaronto01,5.18519,75
3,aasedo01,5.18519,75
4,abadan01,4.96296,73
5,abadfe01,6.2963,73
6,abadijo01,5.25926,72
7,abbated01,4.44444,71
8,abbeybe01,4.62963,71
9,abbeych01,4.40741,68
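
To make the learn-then-override workflow concrete, here is a minimal plain-Python stand-in for the builder. The class `MinMaxScalerSketch` and its `transform` method are hypothetical and are not part of DataPrep. The behavior that `learn()` only fills in arguments that are still `None` is an assumption inferred from the description above, not confirmed from the library.

```python
class MinMaxScalerSketch:
    """Hypothetical stand-in for the builder; not the DataPrep implementation."""

    def __init__(self, range_min=0, range_max=1, data_min=None, data_max=None):
        self.range_min = range_min
        self.range_max = range_max
        self.data_min = data_min
        self.data_max = data_max

    def learn(self, values):
        # Assumption: only arguments that are still None are learned from data.
        if self.data_min is None:
            self.data_min = min(values)
        if self.data_max is None:
            self.data_max = max(values)
        return self

    def transform(self, values):
        span = self.data_max - self.data_min
        scale = self.range_max - self.range_min
        return [self.range_min + (v - self.data_min) / span * scale for v in values]


# Three weights from the dataset plus the column's min (65) and max (320).
weights = [205, 180, 190, 65, 320]

sketch = MinMaxScalerSketch()
sketch.learn(weights)
learned = (sketch.data_min, sketch.data_max)

# Override two arguments, as in the builder cells above.
sketch.range_max = 10
sketch.data_min = 50
scaled = [round(v, 5) for v in sketch.transform(weights[:3])]
print(scaled)

# Calling learn() again keeps the override...
sketch.learn(weights)
kept = sketch.data_min

# ...until the argument is reset to None and relearned.
sketch.data_min = None
sketch.learn(weights)
relearned = sketch.data_min
print(learned, kept, relearned)
```

The scaled values should match the first three rows of the `df5` output above, and resetting `data_min` to `None` before relearning restores the true minimum of 65.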
