# Automating data quality monitoring - Constraints Distributional Measures

Suppose you are training a model to predict whether a loan for a certain member of LendingClub will be accepted or not.

Here we have sampled 1000 members' data to illustrate how the constraints can be used to automatically enforce data quality.

After training an ML model and deploying it to production, we sometimes notice a significant degradation in performance. This is often a consequence of drift between the data used to train the model and the batches of data the model has to score in production.

'Distributional Measures Constraints' can be used to run data quality checks automatically in a data pipeline. Manually inspecting every incoming batch of data is often too expensive, so automated checks save a lot of time and effort. If any of the defined constraints fail, we may need to take some action before passing the data to the model for prediction.

In [1]:
!pip install seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from whylogs import get_or_create_session
from whylogs.util.protobuf import message_to_json
from whylogs.core.statistics.constraints import DatasetConstraints
import warnings
warnings.filterwarnings('ignore')

# create session
session = get_or_create_session()

WARN: Missing config


Read the data for the example. The data contains features about 1000 LendingClub members.

In [3]:
data = pd.read_csv('data/lending_club_1000.csv')
data.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,90671227,,4800.0,4800.0,4800.0,36 months,13.49,162.87,C,C2,...,,,Cash,N,,,,,,
1,90060135,,21600.0,21600.0,21600.0,60 months,9.49,453.54,B,B2,...,,,Cash,N,,,,,,
2,90501423,,24200.0,24200.0,24200.0,36 months,9.49,775.09,B,B2,...,,,Cash,N,,,,,,
3,90186302,,3600.0,3600.0,3600.0,36 months,11.49,118.7,B,B5,...,,,Cash,N,,,,,,
4,90805192,,8000.0,8000.0,8000.0,36 months,10.49,259.99,B,B3,...,,,Cash,N,,,,,,


In [4]:
# strip leading and trailing whitespace from the feature 'term' (missing values stay NaN)
data['term'] = data['term'].str.strip()

Let's suppose that the first 80% of the data (800 rows) was used for training the ML model, and the last 20% is a new, unseen batch of data for which the model needs to make predictions.

In [5]:
# split the data according to the hypothetical situation
data_train = data.iloc[:800, :]
data_test = data.iloc[800:, :]

We are going to log the 'data_train' dataframe, to obtain a DatasetProfile for the training data. This profile will later be used to check for drift with the production data. 

In [6]:
profile = session.log_dataframe(data_train, "test.data")

Now, let's also log the 'data_test' dataframe, to obtain the target DatasetProfile for the comparison and for executing the data checks.

In [7]:
test_profile = session.log_dataframe(data_test, "test.data")

In the following subsections of this notebook, we present some simple and efficient ways to check for data quality and drift.

For visual inspection of the data we are going to use whylogs' NotebookProfileViewer. The assumption is that visually exploring the production data each time a new batch arrives is too expensive, but for the purpose of this notebook and for better understanding of the constraints, we are going to include visual representation.

### Define the NotebookProfileViewer and set the target and reference profiles

In [8]:
from whylogs.viz import NotebookProfileViewer

In [9]:
visualization = NotebookProfileViewer()  # create the profile viewer
# set the training data profile as the reference profile, and the production data profile as the target profile
# we want to use the training data features' distributions as the reference for incoming data
visualization.set_profiles(target_profile=test_profile, reference_profile=profile) 

## Distributional Measure: Chi-Squared test p-value

To compare two discrete distributions, we can use the Chi-Squared test or the KL-Divergence. We first illustrate how to use the 'columnChiSquaredTestPValueGreaterThanConstraint', which constrains the Chi-Squared test's p-value to be greater than some value (the default is 0.05).

In [10]:
from whylogs.core.statistics.constraints import columnChiSquaredTestPValueGreaterThanConstraint

In [11]:
visualization.distribution_chart(feature_names="grade")

From the bar charts, we can see that the distribution of the feature "grade" differs somewhat between the training and testing data sets. However, visual inspection does not quantify whether the difference is statistically significant, and it cannot be used to implement data checks automatically.

We are going to define a 'columnChiSquaredTestPValueGreaterThanConstraint' constraint which will use the values of the feature 'grade' from the training data set as a reference distribution, and we will set the p-value threshold to 0.05.

This constraint can then be applied to the profile of the production data set, to check whether the Chi-Squared test p-value for the feature 'grade' in the production data, computed with reference to the distribution of the same feature in the training data set, is greater than 0.05. If it is, we fail to reject the hypothesis that the two samples come from the same distribution, so no significant drift is detected for this feature.

In [12]:
# define Chi-Squared test p-value greater than constraint
grade_chi_squared_p_value_greater_than = columnChiSquaredTestPValueGreaterThanConstraint(
    reference_distribution = data_train['grade'].values,
    p_value=0.05
)

<b>Note: It is not always necessary to pass every value in a feature as the reference distribution in the 'Distributional Measures Constraints'. You can pass values generated from a theoretical distribution, or a random sample created from a reference feature's values.</b>
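To make the test behind this constraint more concrete, here is a minimal standard-library sketch of Pearson's chi-squared test on two lists of category labels. It illustrates the statistical idea only and is not whylogs' implementation, which works on profile summaries rather than raw values. The names `chi2_sf` and `chi_squared_p_value` and the example data are hypothetical, and the sketch assumes every category in the target also appears in the reference.

```python
import math
from collections import Counter


def chi2_sf(statistic, df):
    """P(X >= statistic) for a chi-squared variable with df >= 1 degrees of freedom."""
    if statistic <= 0:
        return 1.0
    y = statistic / 2.0
    # Start from a closed form: Q(1, y) for even df, Q(1/2, y) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_squared_p_value(reference, target):
    """Chi-squared statistic and p-value of target against reference frequencies."""
    ref_counts, tgt_counts = Counter(reference), Counter(target)
    if len(ref_counts) < 2:
        raise ValueError("need at least two reference categories")
    n_ref, n_tgt = len(reference), len(target)
    statistic = 0.0
    for category, ref_count in ref_counts.items():
        expected = ref_count * n_tgt / n_ref  # count expected if there is no drift
        observed = tgt_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic, chi2_sf(statistic, len(ref_counts) - 1)


# Made-up grades: the target batch has far more "C" and fewer "A" loans.
reference = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
target = ["A"] * 20 + ["B"] * 30 + ["C"] * 50
stat, p = chi_squared_p_value(reference, target)
print(f"statistic={stat:.2f}, p-value={p:.3g}, passes={p > 0.05}")
```

A constraint with a 0.05 threshold would fail for this made-up batch, because the p-value falls well below the threshold.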

## Distributional Measure: KL-Divergence

### KL-Divergence constraint: Discrete distribution

Another constraint that can be used to detect data drift is the KL-Divergence constraint. This constraint can be applied to two discrete or two continuous distributions. In this subsection we illustrate both cases.

In [13]:
from whylogs.core.statistics.constraints import columnKLDivergenceLessThanConstraint

Let's plot the distribution of the feature 'term'. From the bar charts we can see that the frequencies of the feature's two categories, '36 months' and '60 months', have changed drastically in the testing data set compared to the training data set.

In [14]:
visualization.distribution_chart(feature_names="term")

To check this in a data pipeline, we can define a 'columnKLDivergenceLessThanConstraint' using the values of the feature 'term' from the training data set as a reference distribution. We also need to define some threshold, above which we can consider the drift to be significant. In this case, let's assume that we did some research and came to the conclusion that the optimal threshold value is 0.5.

In [15]:
term_kl_divergence_less_than = columnKLDivergenceLessThanConstraint(
    reference_distribution = data_train['term'].values, 
    threshold = 0.5
)
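As a rough illustration of the quantity this constraint bounds, the sketch below computes the KL-Divergence between two lists of category labels in plain Python. The function name `kl_divergence`, the direction of the divergence (target relative to reference), the use of natural logarithms, and the example frequencies are all assumptions made for illustration; whylogs estimates the divergence from profile summaries, so its numbers may differ.

```python
import math
from collections import Counter


def kl_divergence(target, reference, eps=1e-10):
    """D_KL(target || reference) in nats for two lists of category labels."""
    tgt_counts, ref_counts = Counter(target), Counter(reference)
    n_tgt, n_ref = len(target), len(reference)
    divergence = 0.0
    for category, count in tgt_counts.items():
        p = count / n_tgt
        # eps keeps the ratio finite when a category is absent from the reference
        q = max(ref_counts.get(category, 0) / n_ref, eps)
        divergence += p * math.log(p / q)
    return divergence


# Made-up frequencies: '60 months' goes from 25% of loans to 90% of loans.
train_term = ["36 months"] * 75 + ["60 months"] * 25
test_term = ["36 months"] * 10 + ["60 months"] * 90

drift = kl_divergence(test_term, train_term)
print(f"KL divergence = {drift:.3f}, passes threshold 0.5: {drift < 0.5}")
```

Identical distributions give a divergence of 0, and the value grows as the target frequencies move away from the reference frequencies.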

### KL-Divergence constraint: Continuous distribution

In [16]:
visualization.double_histogram(feature_names="int_rate")

From this figure we can see that the distributions of the feature 'int_rate' in the training and the production data sets are somewhat similar.

We can define a KL-Divergence constraint for continuous distributions in the same way as for discrete ones. First, we specify that the reference distribution is the values of 'int_rate' in the training data set. Then, after some careful consideration (hypothetically), we have determined that the drift should be flagged as significant if the KL-Divergence is larger than 2.5.

In [17]:
int_rate_kl_divergence_less_than = columnKLDivergenceLessThanConstraint(
    reference_distribution = data_train['int_rate'].values,
    threshold = 2.5,
)
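For continuous values, one simple way to approximate the KL-Divergence is to bin both samples on a shared set of histogram edges and compare the bin frequencies. The sketch below does this in plain Python. It is only an assumption-laden illustration (the function name `histogram_kl_divergence`, the choice of 10 equal-width bins, the smoothing constant, and the example values are hypothetical), and it is not how whylogs computes the value from its profiles.

```python
import math


def histogram_kl_divergence(target, reference, n_bins=10, eps=1e-9):
    """Approximate D_KL(target || reference) using shared equal-width bins."""
    lo = min(min(target), min(reference))
    hi = max(max(target), max(reference))
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_frequencies(values):
        counts = [0] * n_bins
        for v in values:
            index = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum
            counts[index] += 1
        return [c / len(values) for c in counts]

    p = bin_frequencies(target)
    q = bin_frequencies(reference)
    # eps avoids division by zero for bins that are empty in the reference
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


# Made-up interest rates: the second batch is shifted upwards by 5 points.
train_rates = [5.0 + 0.1 * i for i in range(100)]
same_rates = list(train_rates)
shifted_rates = [rate + 5.0 for rate in train_rates]

kl_same = histogram_kl_divergence(same_rates, train_rates)
kl_shifted = histogram_kl_divergence(shifted_rates, train_rates)
print(f"same batch: {kl_same:.3f}, shifted batch: {kl_shifted:.3f}")
```

The result depends on the number of bins and on the smoothing constant, which is one reason a threshold such as 2.5 has to be chosen with the specific estimator in mind.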

## Distributional Measure: KS test p-value

Just like we had the Chi-Squared test p-value constraint for discrete features, we also have the KS test p-value constraint, which can be applied to test if two continuous features come from the same distribution.

This is another constraint that can be used, instead of the KL-Divergence for continuous distributions. You are free to choose whichever suits your use case best.

In [18]:
from whylogs.core.statistics.constraints import parametrizedKSTestPValueGreaterThanConstraint

In [19]:
visualization.double_histogram(feature_names="loan_amnt")

The distributions of 'loan_amnt' seem a bit different in the two data sets.

We can define the 'parametrizedKSTestPValueGreaterThanConstraint' in a similar way to all of the other distributional measures constraints. We set the reference_distribution to the values of the feature 'loan_amnt' in the training data set, and we set the p-value threshold to 0.05.

In [20]:
loan_amnt_ks_test_p_value_greater_than = parametrizedKSTestPValueGreaterThanConstraint(
    reference_distribution = data_train['loan_amnt'].values,
    p_value = 0.05,
)
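The idea behind the KS test can be sketched in a few lines of standard-library Python: the statistic D is the largest gap between the two empirical cumulative distribution functions, and the p-value is derived from D and the sample sizes. The function name `ks_two_sample` and the example values are hypothetical, and the p-value here uses only the leading term of the asymptotic Kolmogorov series, so it is a rough approximation rather than the value whylogs or a statistics library would report.

```python
import math
from bisect import bisect_right


def ks_two_sample(reference, target):
    """Return (D, approximate p-value) for the two-sample KS test."""
    ref, tgt = sorted(reference), sorted(target)
    n, m = len(ref), len(tgt)
    # D is the largest gap between the two empirical CDFs;
    # the gap can only change at an observed value.
    d = max(
        abs(bisect_right(ref, x) / n - bisect_right(tgt, x) / m)
        for x in ref + tgt
    )
    # Leading term of the asymptotic Kolmogorov series, capped at 1.
    lam = d * math.sqrt(n * m / (n + m))
    p_value = min(1.0, 2.0 * math.exp(-2.0 * lam * lam))
    return d, p_value


# Made-up loan amounts: the second batch is shifted upwards by 15000.
train_amounts = [1000 * i for i in range(1, 41)]
shifted_amounts = [amount + 15000 for amount in train_amounts]

d_same, p_same = ks_two_sample(train_amounts, train_amounts)
d_shift, p_shift = ks_two_sample(train_amounts, shifted_amounts)
print(f"same batch:    D={d_same:.3f}, passes={p_same > 0.05}")
print(f"shifted batch: D={d_shift:.3f}, passes={p_shift > 0.05}")
```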

## Single Feature Distributional Measure: Entropy

Another use case we might be interested in is deciding which features provide value, or which do not have a large amount of uncertainty in their values. For this purpose we can use the 'approximateEntropyBetweenConstraint'.

In [21]:
from whylogs.core.statistics.constraints import approximateEntropyBetweenConstraint

In [22]:
visualization.distribution_chart(feature_names="debt_settlement_flag")

Here we see that the distribution of the feature 'debt_settlement_flag' is very similar in both data sets.

To define the constraint, we can calculate the entropy on the training data, or take some a priori value for the entropy which we believe is the ground truth, and then define the bounds within which we expect this value to vary across data sets.

We define the 'approximateEntropyBetweenConstraint' by specifying the lower and upper values of the interval of valid entropy. In this case we have chosen some reference value 0.15 and we define the bounds to be in the range (0.15-0.05, 0.15+0.05).

In [23]:
debt_settlement_flag_approx_entropy_between = approximateEntropyBetweenConstraint(
    lower_value = 0.1,
    upper_value = 0.20,
)
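To get a feel for what an entropy of about 0.15 means for a binary flag, the sketch below computes the Shannon entropy of a list of labels in plain Python. The function name `shannon_entropy`, the use of natural logarithms, and the example proportions are assumptions for illustration; whylogs approximates the entropy from profile summaries, so its value can differ from this exact computation.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of values."""
    total = len(values)
    return sum(
        -(count / total) * math.log(count / total)
        for count in Counter(values).values()
    )


# Made-up flag column: 97% 'N' and 3% 'Y'.
flags = ["N"] * 97 + ["Y"] * 3
entropy = shannon_entropy(flags)
within_bounds = 0.1 <= entropy <= 0.2
print(f"entropy = {entropy:.3f}, within [0.1, 0.2]: {within_bounds}")
```

A column with a single repeated value has entropy 0, while a perfectly balanced binary column has entropy ln 2 (about 0.69), so bounds around 0.15 describe a heavily imbalanced flag.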

We have now defined all of the constraints we are interested in. To apply them to a data set, we first create a DatasetConstraints object. Its constructor has arguments for 'value_constraints', 'summary_constraints', 'table_shape_constraints', and 'multi_column_value_constraints'. Since the distributional measures are all of type 'SummaryConstraint', we will only set the 'summary_constraints' argument.

We pass a dictionary whose keys are the feature names and whose values are the lists of constraints we want to apply to each feature. The first argument of the constructor is a properties object, but we do not need to set it explicitly, so we pass None.

In [24]:
constraints = DatasetConstraints(
    None,
    summary_constraints = {
        "grade": [grade_chi_squared_p_value_greater_than],
        "term": [term_kl_divergence_less_than],
        "int_rate": [int_rate_kl_divergence_less_than],
        "loan_amnt": [loan_amnt_ks_test_p_value_greater_than],
        "debt_settlement_flag": [debt_settlement_flag_approx_entropy_between]
    }
)

Now, what really interests us is how the constraints behave on the new unseen data that our model needs to use for prediction.

If we apply the constraints to the profile of the production data, we can see that some of them fail.

According to the report, the constraints fail for the features 'grade', 'term' and 'loan_amnt', for which we saw that the distributions were somewhat different in the two data sets. These outcomes indicate significant drift for these features in the new data set.

You can further automate data quality checks by defining what should happen when a constraint fails, as in the cases shown in this example notebook.

In [25]:
test_report = test_profile.apply_summary_constraints(constraints.summary_constraint_map)
visualization.constraints_report(constraints)
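One possible way to act on failed constraints is to stop the pipeline before the batch reaches the model. The sketch below shows the idea on a simplified, hypothetical `results` dictionary that maps each feature to a list of `(constraint_name, passed)` pairs. The exact structure of `test_report` depends on the whylogs version, so you would need to build such a dictionary from it yourself; the constraint names and the helper functions `failed_constraints` and `gate_batch` are made up for illustration.

```python
def failed_constraints(results):
    """Return {feature: [constraint names]} for every failed check."""
    failures = {}
    for feature, checks in results.items():
        names = [name for name, passed in checks if not passed]
        if names:
            failures[feature] = names
    return failures


def gate_batch(results):
    """Raise if any constraint failed, so the batch is not sent to the model."""
    failures = failed_constraints(results)
    if failures:
        raise ValueError(f"Data quality checks failed: {failures}")


# Hypothetical outcomes mirroring the report discussed above.
results = {
    "grade": [("chi_squared_p_value_greater_than", False)],
    "term": [("kl_divergence_less_than", False)],
    "int_rate": [("kl_divergence_less_than", True)],
    "loan_amnt": [("ks_test_p_value_greater_than", False)],
    "debt_settlement_flag": [("approx_entropy_between", True)],
}

failures = failed_constraints(results)
print("Features with failed constraints:", sorted(failures))
```

Instead of raising an error, `gate_batch` could just as well send an alert or route the batch to manual review, depending on how costly a wrong prediction is.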