# Assertions
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

Frequently, the data we work with while cleaning and preparing is only a subset of the data we will need to handle in production. It is also common to work on a snapshot of a live dataset that is continuously updated and augmented.

In these cases, some of the assumptions we make during cleaning might turn out to be false. A column that originally contained only numbers within a certain range might contain a wider range of values in later executions. Such violated assumptions often result in either broken pipelines or bad data.

AzureML DataPrep supports creating assertions on data, which are evaluated as the pipeline is executed. These assertions let us verify that our assumptions about the data continue to hold and, when they do not, handle the failures cleanly.
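
To make the idea concrete, here is a minimal plain-Python sketch of an assertion that is evaluated as records flow through a pipeline. It is an analogy only and does not use or reproduce DataPrep's implementation: `ErrorValue`, `check_values`, the `score` column, and the sample rows are all hypothetical names and data made up for illustration. The point it shows is that a failing value is replaced by an error marker that remembers the original value, so the pipeline keeps running and the failure can be inspected later.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class ErrorValue:
    """Hypothetical marker standing in for a value that failed an assertion."""
    error_code: str
    original_value: Any


def check_values(
    records: Iterable[Dict[str, Any]],
    column: str,
    predicate: Callable[[Any], bool],
    error_code: str,
) -> Iterator[Dict[str, Any]]:
    """Yield every record, swapping a failing value for an ErrorValue."""
    for record in records:
        value = record[column]
        if predicate(value):
            yield record
        else:
            # Keep the row, but mark the offending cell instead of raising.
            yield {**record, column: ErrorValue(error_code, value)}


# Made-up rows for illustration only.
rows = [{"score": 10}, {"score": 250}, {"score": 42}]
checked = list(check_values(rows, "score", lambda v: 0 <= v <= 100, "ScoreOutOfRange"))

# Count the cells that were turned into error markers.
error_count = sum(isinstance(r["score"], ErrorValue) for r in checked)
print(error_count)
```

Marking the cell rather than raising an exception is the design choice that allows a single bad record to be handled after the fact instead of breaking the whole run.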

To demonstrate, we will load a dataset and then add some assertions based on what we can see in the first few rows.

In [1]:
import sys
sys.version

'3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) \n[GCC 7.2.0]'

In [2]:
import azureml

In [3]:
from azureml.dataprep import smart_read_file

df = smart_read_file('./data/crime0-10.csv')
df.get_profile()


Column,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
ID,FieldType.DECIMAL,1.01397e+07,1.01409e+07,10.0,0.0,10.0,0.0,0.0,0.0,10139700.0,10139700.0,10139700.0,10139800.0,10139800.0,10140400.0,10140900.0,10140900.0,10140900.0,10140100.0,409.806,167941.0,0.688352,-1.15364
Case Number,FieldType.STRING,HY329177,HY330421,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Date,FieldType.STRING,07/05/2015 10:10:00 PM,07/05/2015 11:50:00 PM,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Block,FieldType.STRING,011XX W MORSE AVE,121XX S FRONT AVE,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
IUCR,FieldType.DECIMAL,460,1811,10.0,0.0,10.0,0.0,0.0,0.0,460.0,473.0,460.0,610.0,975.0,1320.0,1811.0,1811.0,1811.0,1008.7,435.056,189273.0,0.27388,-1.23243
Primary Type,FieldType.STRING,ARSON,THEFT,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Description,FieldType.STRING,$500 AND UNDER,TO VEHICLE,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Location Description,FieldType.STRING,ALLEY,VEHICLE NON-COMMERCIAL,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Arrest,FieldType.BOOLEAN,False,True,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Domestic,FieldType.BOOLEAN,False,True,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,


This dataset contains Latitude and Longitude columns. By definition, latitude is constrained to the range [-90, 90] and longitude to the range [-180, 180]. We can assert that this is indeed the case so that any records that come through with invalid values are detected.

In [4]:
from azureml.dataprep import f_and, value

df = df.assert_value('Latitude', f_and(value <= 90, value >= -90), error_code='InvalidLatitude')
df = df.assert_value('Longitude', f_and(value <= 180, value >= -180), error_code='InvalidLongitude')
df.keep_columns(['Latitude', 'Longitude']).get_profile()

Column,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
Latitude,FieldType.DECIMAL,41.679311,42.008124,10.0,0.0,10.0,0.0,1.0,0.0,41.679311,41.722619,41.679311,41.816021,41.88561,41.970902,42.008124,42.008124,42.008124,41.876613,0.103645,0.010742,-0.450413,-1.027551
Longitude,FieldType.DECIMAL,-87.800175,-87.644545,10.0,0.0,10.0,0.0,1.0,0.0,-87.800175,-87.782058,-87.800175,-87.724992,-87.685233,-87.658915,-87.644545,-87.644545,-87.644545,-87.69737,0.051257,0.002627,-0.817565,-0.812919
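
The range checks above combine two comparisons with `f_and`. As a loose plain-Python analogy (not DataPrep's expression API), the same kind of composition can be sketched with small predicate functions. The helper names `all_of`, `at_most`, and `at_least` are hypothetical, and treating non-numeric input as a failure is an assumption of this sketch rather than documented library behavior.

```python
from typing import Any, Callable

Predicate = Callable[[Any], bool]


def all_of(*predicates: Predicate) -> Predicate:
    """Hypothetical logical AND: passes only if every predicate passes."""
    return lambda v: all(p(v) for p in predicates)


def at_most(limit: float) -> Predicate:
    # Assumption of this sketch: non-numeric input counts as a failure.
    return lambda v: isinstance(v, (int, float)) and v <= limit


def at_least(limit: float) -> Predicate:
    # Same assumption as at_most: non-numeric input fails the check.
    return lambda v: isinstance(v, (int, float)) and v >= limit


# Same bounds as the assertions in the cell above.
valid_latitude = all_of(at_most(90), at_least(-90))
valid_longitude = all_of(at_most(180), at_least(-180))

print(valid_latitude(41.88), valid_latitude(120.0), valid_longitude(-87.69))
```

Building each check from small reusable pieces keeps the bounds for every column in one readable line, which is the same benefit the `f_and(value <= ..., value >= ...)` form offers.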


Any assertion failures are represented as Errors in the resulting dataset. From the profile above, you can see that the Error Count for both of these columns is 1. We can use a filter to retrieve the error and see what value caused the assertion to fail.

In [5]:
from azureml.dataprep import col

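# Keep only the rows whose Latitude value failed the assertion.
error_df = df.filter(col('Latitude').is_error())
# Pull the first failing value and print the original value it replaced.
error = error_df.head(10)['Latitude'][0]
print(error.originalValue)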

None


Our assertion failed because we were not removing missing values from our data. At this point, we have two options: we can go back and edit our code to avoid this error in the first place or we can resolve it now. In this case, we will just filter these out.

In [6]:
from azureml.dataprep import f_not

# Keep only the rows whose Latitude value did not fail the assertion.
clean_df = df.filter(f_not(col('Latitude').is_error()))
clean_df.get_profile()

Column,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
ID,FieldType.DECIMAL,1.01397e+07,1.01409e+07,9.0,0.0,9.0,0.0,0.0,0.0,10139700.0,10139700.0,10139700.0,10139800.0,10139800.0,10140400.0,10140900.0,10140900.0,10140900.0,10140000.0,427.717,182942.0,0.804034,-1.11933
Case Number,FieldType.STRING,HY329177,HY330421,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Date,FieldType.STRING,07/05/2015 10:10:00 PM,07/05/2015 11:50:00 PM,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Block,FieldType.STRING,011XX W MORSE AVE,118XX S PEORIA ST,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
IUCR,FieldType.DECIMAL,460,1811,9.0,0.0,9.0,0.0,0.0,0.0,460.0,520.0,460.0,767.5,1020.0,1320.0,1811.0,1811.0,1811.0,1066.78,418.313,174986.0,0.186197,-1.17969
Primary Type,FieldType.STRING,ARSON,THEFT,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Description,FieldType.STRING,$500 AND UNDER,TO VEHICLE,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Location Description,FieldType.STRING,ALLEY,VEHICLE NON-COMMERCIAL,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Arrest,FieldType.BOOLEAN,False,True,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Domestic,FieldType.BOOLEAN,False,True,9.0,0.0,9.0,0.0,0.0,0.0,,,,,,,,,,,,,,
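
The other option mentioned earlier, avoiding the error in the first place, amounts to removing records with missing coordinates before the range checks run. The sketch below illustrates that ordering in plain Python rather than with DataPrep calls, since the source does not show which DataPrep step would be used. The records, `has_coordinates`, and `in_range` are made up for illustration.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]

# Made-up records for illustration; None stands for a missing coordinate.
records: List[Record] = [
    {"Latitude": 41.9, "Longitude": -87.7},
    {"Latitude": None, "Longitude": None},
    {"Latitude": 41.7, "Longitude": -87.6},
]


def has_coordinates(record: Record) -> bool:
    """True when both coordinates are present."""
    return record["Latitude"] is not None and record["Longitude"] is not None


def in_range(value: float, low: float, high: float) -> bool:
    return low <= value <= high


# Step 1: drop rows with missing coordinates before any range check runs.
present = [r for r in records if has_coordinates(r)]

# Step 2: the range checks now only ever see real numbers.
failures = [
    r for r in present
    if not (in_range(r["Latitude"], -90, 90) and in_range(r["Longitude"], -180, 180))
]

print(len(present), len(failures))
```

With this ordering, a failed assertion always means a value that is present but out of range, whereas filtering afterwards, as done in the cell above, removes missing and out-of-range values together.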
