# Quantile Transformation
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.

DataPrep can perform quantile transformation on a numeric column. This transformation maps the data to a normal or uniform distribution. When the transformation is applied, values that fall beyond the learnt boundaries are simply clipped to those boundaries.
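To make the idea concrete, here is a minimal pure-Python sketch of a uniform quantile transform. It is not DataPrep's implementation: the helper names `learn_quantiles` and `to_uniform`, the linear interpolation between boundaries, and the toy sample are all assumptions made for illustration. It learns evenly spaced quantile boundaries from a sample, then maps a value to [0, 1] and clips anything outside the learnt range.

```python
from bisect import bisect_right


def learn_quantiles(values, quantiles_count):
    """Return `quantiles_count` evenly spaced quantile boundaries (needs >= 2)."""
    ordered = sorted(values)
    n = len(ordered)
    boundaries = []
    for i in range(quantiles_count):
        # Fractional position of the i-th boundary in the sorted sample.
        pos = i * (n - 1) / (quantiles_count - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring sample values.
        boundaries.append(ordered[lo] + frac * (ordered[hi] - ordered[lo]))
    return boundaries


def to_uniform(x, boundaries):
    """Map x to [0, 1] by interpolating between the learnt boundaries."""
    # Values outside the learnt range are clipped to the nearest boundary.
    x = min(max(x, boundaries[0]), boundaries[-1])
    k = len(boundaries) - 1
    # Index of the interval [boundaries[i], boundaries[i + 1]] containing x.
    i = min(bisect_right(boundaries, x) - 1, k - 1)
    left, right = boundaries[i], boundaries[i + 1]
    frac = 0.0 if right == left else (x - left) / (right - left)
    return (i + frac) / k


# Toy sample (hypothetical data, not the census file used in this notebook).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
boundaries = learn_quantiles(sample, quantiles_count=5)
print(boundaries)
print(to_uniform(4, boundaries), to_uniform(100, boundaries))
```

In this sketch a value far above the learnt maximum, such as 100, lands on the same output as the maximum itself, which is the clipping behaviour described above.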

Let's load a sample of the median income of California households in different suburbs from the 1990 census data. From the data profile, we can see that the minimum and maximum values are 0.9946 and 15 respectively.

In [1]:
!pip install azureml-dataprep



In [2]:
import azureml.dataprep as dprep

df = dprep.read_csv(path='./data/median_income.csv').set_column_types(type_conversions={
    'median_income': dprep.TypeConverter(dprep.FieldType.DECIMAL)
})
df.get_profile()

| | Type | Min | Max | Count | Missing Count | Error Count | Lower Quartile | Median | Upper Quartile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| median_income | FieldType.DECIMAL | 0.9946 | 15.0 | 250.0 | 0.0 | 0.0 | 2.6907 | 3.6307 | 4.77335 | 2.026679 | 4.007843 |


Let's now apply quantile transformation to `median_income` and see how that affects the data. We will apply it twice: once to map the data to a Uniform(0, 1) distribution, and once to map it to a Normal(0, 1) distribution.

From the data profile, we can see that the min and max of the uniform median income are 0 and 1, so every value lies within [0, 1], and that the mean and standard deviation of the normal median income are close to 0 and 1 respectively.

*Note: for the normal distribution, the values at the ends are clipped, because the 0th and 100th percentiles map to -Inf and Inf respectively.*
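The note on clipping can be illustrated with a small sketch using Python's standard library. This is an assumption-laden illustration, not DataPrep's code: the function name `uniform_to_normal` and the margin `EPS = 1e-7` are hypothetical, and the margin DataPrep actually uses is not stated here. The standard normal inverse CDF is unbounded at probabilities 0 and 1 (the stdlib `NormalDist.inv_cdf` rejects those inputs), so the probability is first pulled slightly inside the open interval (0, 1).

```python
from statistics import NormalDist

# Hypothetical clipping margin, chosen only for illustration.
EPS = 1e-7


def uniform_to_normal(p, eps=EPS):
    """Map a probability p in [0, 1] to a standard normal value."""
    # Keep p strictly inside (0, 1) so the inverse CDF stays finite.
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(p)


low = uniform_to_normal(0.0)   # would be -Inf without clipping
mid = uniform_to_normal(0.5)   # centre of the distribution
high = uniform_to_normal(1.0)  # would be +Inf without clipping
print(low, mid, high)
```

With a margin like this, the extreme outputs are large but finite and symmetric around zero, which matches the shape of the min and max reported for the normal column in the profile.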

In [3]:
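# First pass: no output_distribution is given; per the text above, this maps to Uniform(0, 1).
df = df.quantile_transform(source_column='median_income', new_column='median_income_uniform', quantiles_count=5)
# Second pass: map the same source column to Normal(0, 1).
df = df.quantile_transform(source_column='median_income', new_column='median_income_normal',
                           quantiles_count=5, output_distribution="Normal")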
df.get_profile()

| | Type | Min | Max | Count | Missing Count | Error Count | Lower Quartile | Median | Upper Quartile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| median_income | FieldType.DECIMAL | 0.9946 | 15.0 | 250.0 | 0.0 | 0.0 | 2.6907 | 3.6307 | 4.77335 | 2.026679 | 4.007843 |
| median_income_normal | FieldType.DECIMAL | -7.941345 | 7.941444 | 250.0 | 0.0 | 0.0 | -0.67159 | -0.000337 | 0.66781 | 1.021506 | -0.060922 |
| median_income_uniform | FieldType.DECIMAL | 0.0 | 1.0 | 250.0 | 0.0 | 0.0 | 0.250934 | 0.499866 | 0.747861 | 0.25283 | 0.484762 |


Let's now save the dataflow so that we can load it later in the operationalization notebook.

In [4]:
from tempfile import mkdtemp
from os import path

tmp_dir = mkdtemp()
package_path = path.join(tmp_dir, 'quantile_transform.dprep')
package = dprep.Package(arg=df)
package.save(package_path)
print('Package saved to: "{}"'.format(package_path))

Package saved to: "/tmp/tmp29cvg68a/quantile_transform.dprep"
