# Impute missing values
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

DataPrep can impute missing values in specified columns. In this example, we impute the missing _Latitude_ and _Longitude_ values in the input data.

In [2]:
!pip install azureml-dataprep


In [3]:
import azureml.dataprep as dprep

In [4]:
# loading input data
df = dprep.read_csv('data/crime0-10.csv')
df = df.keep_columns(['ID', 'Arrest', 'Latitude', 'Longitude'])
df = df.to_number(['Latitude', 'Longitude'])
df.head(10)

| | ID | Arrest | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 10140490 | False | 41.973309 | -87.800175 |
| 1 | 10139776 | False | 42.008124 | -87.65955 |
| 2 | 10140270 | False | | |
| 3 | 10139885 | False | 41.902152 | -87.754883 |
| 4 | 10140379 | False | 41.88561 | -87.657009 |
| 5 | 10140868 | False | 41.679311 | -87.644545 |
| 6 | 10139762 | False | 41.825501 | -87.690578 |
| 7 | 10139722 | True | 41.857828 | -87.715029 |
| 8 | 10139774 | False | 41.9701 | -87.669324 |
| 9 | 10139697 | False | 41.78758 | -87.685233 |


The third record in the input data is missing both _Latitude_ and _Longitude_. To impute those values, we can use `ImputeMissingValuesBuilder` to learn a fixed program that fills each column with either a calculated `MIN`, `MAX`, or `MEAN` value or a `CUSTOM` value. When `group_by_columns` is specified, `MIN`, `MAX`, and `MEAN` are calculated per group, and missing values are imputed group by group.
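To make the per-group behaviour concrete, here is a minimal pure-Python sketch of group-wise imputation. It does not use DataPrep and is not how the builder is implemented; `impute_by_group` and its parameters are hypothetical names chosen for illustration, and the sample rows are toy values rather than crime data.

```python
from statistics import mean


def impute_by_group(rows, group_key, column, strategy="MEAN", custom_value=None):
    """Return copies of rows with None in `column` filled in per group.

    Hypothetical helper for illustration only; not part of azureml.dataprep.
    """
    reducers = {"MIN": min, "MAX": max, "MEAN": mean}

    # Collect the observed (non-missing) values of each group.
    observed = {}
    for row in rows:
        if row[column] is not None:
            observed.setdefault(row[group_key], []).append(row[column])

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[column] is None:
            if strategy == "CUSTOM":
                new_row[column] = custom_value
            else:
                values = observed.get(new_row[group_key])
                # A group with no observed values has nothing to aggregate,
                # so the gap is left as-is in this sketch.
                if values:
                    new_row[column] = reducers[strategy](values)
        filled.append(new_row)
    return filled


# Toy rows, not taken from the crime data set.
rows = [
    {"group": "a", "x": 10.0},
    {"group": "a", "x": 20.0},
    {"group": "a", "x": None},
    {"group": "b", "x": 50.0},
    {"group": "b", "x": None},
]

by_mean = impute_by_group(rows, "group", "x", strategy="MEAN")
by_custom = impute_by_group(rows, "group", "x", strategy="CUSTOM", custom_value=42)
print([r["x"] for r in by_mean])    # [10.0, 20.0, 15.0, 50.0, 50.0]
print([r["x"] for r in by_custom])  # [10.0, 20.0, 42, 50.0, 42]
```

With `MEAN`, each gap receives the mean of its own group, so the two groups end up with different fill values. With `CUSTOM`, the same constant is used everywhere and grouping has no effect.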

First, let us quickly check the `MEAN` value of the _Latitude_ column.

In [5]:
df_mean = df.summarize(group_by_columns=['Arrest'],
                       summary_columns=[dprep.SummaryColumnsValue(column_id='Latitude',
                                                                 summary_column_name='Latitude_MEAN',
                                                                 summary_function=dprep.SummaryFunction.MEAN)])
df_mean = df_mean.filter(dprep.col('Arrest') == 'false')
df_mean.head(1)

| | Arrest | Latitude_MEAN |
|---|---|---|
| 0 | False | 41.878961 |
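As a quick cross-check, this group mean can be reproduced by hand from the preview of the input data. The sketch below assumes the ten previewed rows are the entire file and uses the displayed, rounded values, so it approximates what `summarize` computes rather than replaying it.

```python
from statistics import mean

# Latitude of the Arrest == False rows shown in the input preview
# (row 2 is missing; row 7 belongs to the Arrest == True group).
latitudes = [41.973309, 42.008124, 41.902152, 41.88561,
             41.679311, 41.825501, 41.9701, 41.78758]

latitude_mean = round(mean(latitudes), 6)
print(latitude_mean)  # 41.878961
```

The missing row contributes nothing to the average, which is why the mean is taken over eight values rather than nine.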


The `MEAN` value of _Latitude_ looks good, so we will impute _Latitude_ with it. For _Longitude_, we will impute a custom value of `42`, based on external knowledge.

In [6]:
# impute with MEAN
impute_mean = dprep.ImputeColumnArguments(column_id='Latitude',
                                          impute_function=dprep.ReplaceValueFunction.MEAN)
# impute with custom value 42
impute_custom = dprep.ImputeColumnArguments(column_id='Longitude',
                                            custom_impute_value=42)
# get instance of ImputeMissingValuesBuilder
impute_builder = df.builders.impute_missing_values(impute_columns=[impute_mean, impute_custom],
                                                   group_by_columns=['Arrest'])
# call learn() to learn a fixed program to impute missing values
impute_builder.learn()
# call to_dataflow() to get a dataflow with impute step added
df_imputed = impute_builder.to_dataflow()

In [7]:
# check impute result
df_imputed.head(10)

| | ID | Arrest | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 10140490 | False | 41.973309 | -87.800175 |
| 1 | 10139776 | False | 42.008124 | -87.65955 |
| 2 | 10140270 | False | 41.878961 | 42.0 |
| 3 | 10139885 | False | 41.902152 | -87.754883 |
| 4 | 10140379 | False | 41.88561 | -87.657009 |
| 5 | 10140868 | False | 41.679311 | -87.644545 |
| 6 | 10139762 | False | 41.825501 | -87.690578 |
| 7 | 10139722 | True | 41.857828 | -87.715029 |
| 8 | 10139774 | False | 41.9701 | -87.669324 |
| 9 | 10139697 | False | 41.78758 | -87.685233 |


As the result above shows, the missing _Latitude_ has been imputed with the `MEAN` value of the `Arrest=='false'` group, and the missing _Longitude_ has been imputed with `42`.