# Advanced Constraints

This tutorial demonstrates the advanced capabilities of the **Constraint Engine**. For an introduction to constraints, refer to [Dataset & Constraints](https://github.com/ydataai/academy/blob/master/3%20-%20dataset-metadata/Metadata%20%26%20Constraints.ipynb).

The plan of this notebook is as follows:

1. **Rows versus Columns constraints**: we introduce the difference between rows constraints and columns constraints and demonstrate how they can be used together.
2. **Custom Constraints**: we show how to create your own constraints from any Python function. In addition, we show an advanced example that validates constraints with time dependencies between rows.

For more information please check the [API reference from ydata-sdk documentation](https://docs.sdk.ydata.ai/latest/api/).


## Authenticate with your YData account

In [1]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'

## Advanced constraints - transactional dataset with temporal dependency

To illustrate this tutorial, we generate a toy dataset representing the transactions on a bank account.  
Each row represents a transaction. Besides the row `index`, the dataset has only 3 columns:

1. `amount`: the amount of the transaction
2. `balance`: the current balance after the transaction is applied
3. `constant`: a constant value (for the purpose of demonstrating some constraints)

It is a very generous example: the amount can only be positive, so the balance keeps growing! The amount is generated randomly as the absolute value of a normally distributed variable.

In [2]:
import numpy as np
import pandas as pd

from ydata.dataset import Dataset

def calculate_balance(df):
    df['balance'] = df['amount'].cumsum()
    return df

n = 10 ** 3
mu, sigma = 0, 0.1
data = 100 * np.abs(np.random.normal(mu, sigma, n))
df = pd.DataFrame()
df['amount'] = data
df['constant'] = 10
df = calculate_balance(df)
df = df.reset_index()

dataset = Dataset(df)

In [3]:
dataset.head()

Unnamed: 0,index,amount,constant,balance
0,0,6.230388,10,6.230388
1,1,8.751423,10,14.98181
2,2,16.900303,10,31.882113
3,3,19.435222,10,51.317336
4,4,15.795862,10,67.113198
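
Because the amounts are drawn at random, the numbers shown in this notebook change from run to run. The sketch below is one possible way to make the toy data reproducible. It uses only NumPy and pandas; the helper name `make_transactions` and the `seed` parameter are additions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_transactions(n=1000, sigma=0.1, seed=0):
    """Build the toy transactions table with a fixed seed (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # The absolute value keeps every amount non-negative
    amount = 100 * np.abs(rng.normal(0, sigma, n))
    df = pd.DataFrame({"amount": amount, "constant": 10})
    # Running total after each transaction
    df["balance"] = df["amount"].cumsum()
    return df.reset_index()

toy = make_transactions()
print(toy.head())
```

The resulting frame can be wrapped with `Dataset(toy)` exactly as in the cell above.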


# Rows versus Columns constraints

Generally, when validating constraints on a tabular dataset, there are two different aspects:
1. the constraints that apply to the data points (e.g., in each row the column `balance` should be positive)
2. the constraints that apply to a quantity linked to the columns (e.g., the average of the column `balance` must be positive)

They are conceptually different because **rows** constraints are applied to each data point independently, while **columns** constraints are applied to an aggregation of all rows and represent a **domain** validation.

Both are useful to validate the quality of your data, and they work together seamlessly. However, their outputs differ conceptually: **rows** constraints return a mask indicating, for each row, whether the constraint holds; **columns** constraints return one boolean per column indicating whether that column satisfies the constraint.
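
To make the difference in output shape concrete, here is a minimal sketch in plain pandas. It does not use the Constraint Engine; the three-row frame and the variable names are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"balance": [5.0, -2.0, 12.0]})

# Rows-style check: one boolean per row (True = the row satisfies the rule)
row_mask = frame["balance"] > 0

# Columns-style check: a single boolean computed from an aggregate of all rows
column_ok = bool(frame["balance"].mean() > 0)

print(row_mask.tolist())  # [True, False, True]
print(column_ok)          # True
```

Note that the second row fails the rows-style check even though the column as a whole passes the aggregate check.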

In the example below, we add two columns constraints:
- the mean of the columns `amount` and `constant` must be between 7 and 10
- the standard deviation of the columns `amount` and `constant` must be between 0 and 8

In [5]:
from ydata.constraints.engine import ConstraintEngine
from ydata.constraints.columns import MeanBetween, StandardDeviationBetween

# Leveraging pre-defined Constraints
c1 = MeanBetween(lower_bound=7, upper_bound=10, columns=['amount', 'constant'])
c2 = StandardDeviationBetween(lower_bound=0, upper_bound=8, columns=['amount', 'constant'])

ce = ConstraintEngine()
ce.add_constraints([c1, c2])
ce.validate(dataset)

In the field `violation_per_constraint`, it is possible to access more information for each constraint. For instance, `StandardDeviationBetween` shows that 1 out of 2 columns violates the constraint.

In [6]:
ce.summary()

{'rows_violation_count': np.int64(0),
 'rows_violation_ratio': 0.0,
 'violation_per_constraint': {'MeanBetweenRange on columns [amount, constant]': {'column_violation_count': np.int64(0),
   'column_violation_ratio': 0.0,
   'violated_columns': [],
   'validation_time': 0.01},
  'StandardDeviationBetween on columns [amount, constant]': {'column_violation_count': np.int64(1),
   'column_violation_ratio': 0.5,
   'violated_columns': ['constant'],
   'validation_time': 0.01}}}

## Columns constraints

In this section, we provide multiple examples of columns constraints.

In [7]:
from ydata.constraints.columns import (MeanBetween, StandardDeviationBetween, QuantileBetween, UniqueValuesBetween, SumLowerThan, Constant)

We would like to make sure that the mean value of the column `amount` and of the column `constant` is between 7 and 10 (included):

In [8]:
c = MeanBetween(lower_bound=7, upper_bound=10, columns=['amount', 'constant'])
c.validate(dataset)

Unnamed: 0,amount,constant
0,True,True


We would like to make sure that the standard deviation is between 0 (excluded) and 8:

In [9]:
c = StandardDeviationBetween(lower_bound=0, upper_bound=8, columns=['amount', 'constant'])
c.validate(dataset)

Unnamed: 0,amount,constant
0,True,False


Because the column `constant` is constant, its standard deviation is exactly 0 and thus the column violates the constraint!
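
The same reasoning can be checked by hand with pandas. This is only a sketch of the comparison, not the engine's implementation; the exclusive lower bound comes from the text above, while treating the upper bound as inclusive is an assumption.

```python
import pandas as pd

constant = pd.Series([10] * 1000)
std = constant.std()

# Lower bound excluded (from the text); upper bound included (assumption)
within_bounds = bool(0 < std <= 8)

print(std)            # 0.0
print(within_bounds)  # False
```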

It is also possible to constrain any quantile to a particular interval. For instance, the following constraint checks that the 25th percentile is between 3 and 3.2:

In [10]:
c = QuantileBetween(quantile=0.25, lower_bound=3, upper_bound=3.2, columns='amount')
c.validate(dataset)

Unnamed: 0,amount
0,False
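
For context, the interval can be compared with the theoretical 25% quantile of `amount`. Since `amount` is the absolute value of a normal variable with standard deviation `100 * 0.1 = 10`, its quantile follows from the normal inverse CDF. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

sigma = 100 * 0.1  # scale of `amount` before taking the absolute value
p = 0.25

# For |X| with X ~ N(0, sigma): P(|X| <= q) = 2 * Phi(q / sigma) - 1 = p
q = sigma * NormalDist().inv_cdf((1 + p) / 2)

print(round(q, 2))  # roughly 3.19
```

The theoretical value lies inside the interval, but the interval is narrow, so a sample of 1,000 rows can plausibly land outside it, which would explain the `False` above.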


It is possible to check that the number of unique values in a column belongs to a given interval. The following example is equivalent to checking that a column is constant, as it requires the number of unique values to be between 0 (excluded) and 1:

In [11]:
c = UniqueValuesBetween(lower_bound=0, upper_bound=1, columns='constant')
c.validate(dataset)

Unnamed: 0,constant
0,True


For such a trivial example, we directly provide a constraint `Constant`:

In [12]:
c = Constant(columns='constant')
c.validate(dataset)

Unnamed: 0,constant
0,True
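
Both checks above boil down to counting distinct values per column. A rough pandas equivalent, given here only as an illustration on a made-up three-row frame, looks like this:

```python
import pandas as pd

frame = pd.DataFrame({"constant": [10, 10, 10], "amount": [6.2, 8.8, 16.9]})

# Number of distinct values in each column
distinct = frame.nunique()

# A column is constant when it holds exactly one distinct value
is_constant = distinct == 1

print(distinct.to_dict())
print(is_constant.to_dict())
```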


Finally, it is possible to check that the sum of a column is lower than a given value:

In [13]:
c = SumLowerThan(value=10001, columns='constant')
c.validate(dataset)

Unnamed: 0,constant
0,True


## Mixing Rows and Columns constraints

The Constraint Engine accepts both types of constraints:

In [16]:
from ydata.constraints.rows import GreaterThan

# Rows constraint
c1 = GreaterThan(columns=['amount', 'balance'], value=1)

# Column constraint
c2 = MeanBetween(lower_bound=7, upper_bound=10, columns=['amount', 'constant'])
c3 = StandardDeviationBetween(lower_bound=0, upper_bound=8, columns=['amount', 'constant'])

ce = ConstraintEngine()
ce.add_constraints([c1, c2, c3])
ce.validate(dataset)

In [17]:
ce.summary()

{'rows_violation_count': np.int64(66),
 'rows_violation_ratio': 0.066,
 'violation_per_constraint': {"GreaterThan(columns=['amount', 'balance'], value=1)": {'rows_violation_count': np.int64(66),
   'rows_violation_ratio': 0.066,
   'validation_time': 0.01},
  'MeanBetweenRange on columns [amount, constant]': {'column_violation_count': np.int64(0),
   'column_violation_ratio': 0.0,
   'violated_columns': [],
   'validation_time': 0.01},
  'StandardDeviationBetween on columns [amount, constant]': {'column_violation_count': np.int64(1),
   'column_violation_ratio': 0.5,
   'violated_columns': ['constant'],
   'validation_time': 0.01}}}

Remember that for **rows** constraints, the percentage refers to the number of violated rows (e.g. 6.6% for the constraint `GreaterThan`) while it refers to columns for **columns** constraints (e.g. 50% of columns are violated for the constraint `StandardDeviationBetween`).
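
The two ratios can be reproduced by hand on small boolean results. The sketch below uses plain pandas with made-up values; counting a row as violated as soon as any checked column fails is an assumption about how the engine aggregates, not something stated in this tutorial.

```python
import pandas as pd

# Outcome of a rows-style check (True = the value passes)
row_mask = pd.DataFrame({
    "amount":  [True, False, True, True],
    "balance": [True, True, True, True],
})
# Outcome of a columns-style check (True = the column passes)
column_ok = pd.Series({"amount": True, "constant": False})

# Assumption: a row is violated when at least one checked column fails
rows_violation_ratio = (~row_mask.all(axis=1)).mean()
# Share of checked columns that fail
column_violation_ratio = (~column_ok).mean()

print(rows_violation_ratio)    # 0.25
print(column_violation_ratio)  # 0.5
```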

# Custom constraints

The constraints engine comes with several pre-defined constraints to validate your dataset. 

However, in some cases, your constraints cannot be directly expressed by these default constraints. In this case, you can create your own custom constraints from any Python function or lambda.

In [18]:
from ydata.constraints.rows import CustomConstraint

## A simple example

In the following example, we create a `CustomConstraint` to validate that each value in the selected columns is strictly lower than 10. Keep in mind that this is just a toy example and that in reality, the function could be arbitrarily complex depending on your own use case:

In [19]:
c = CustomConstraint(name='Positive', columns=['constant', 'amount'], check=lambda x: x < 10)
mask = c.validate(dataset)
mask

Unnamed: 0,constant,amount
0,False,True
1,False,True
2,False,False
3,False,False
4,False,False
...,...,...
995,False,True
996,False,True
997,False,True
998,False,False


We can count the rows that satisfy the constraint. As expected, every row of `constant` violates it, while about 37% of the rows (368 out of 1,000) violate it for the column `amount`.

In [20]:
mask.sum()

constant      0
amount      632
dtype: int64
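
Since `True` in the mask means that the value passes, the sum counts passing rows and one minus the mean gives the share of violations. The following sketch shows this on a made-up four-row mask in plain pandas:

```python
import pandas as pd

mask = pd.DataFrame({
    "constant": [False, False, False, False],
    "amount":   [True, True, False, True],
})

# Rows that satisfy the check, per column
passing = mask.sum()
# Share of rows that violate the check, per column
violation_ratio = 1 - mask.mean()

print(passing.to_dict())
print(violation_ratio.to_dict())
```

Applied to the output above, `amount` has 632 passing rows out of 1,000, that is a violation ratio of 0.368.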

We can easily select the violated rows for any column. For instance, to select all violated rows for the column `amount`:

In [21]:
df[~mask['amount']]

Unnamed: 0,index,amount,constant,balance
2,2,16.900303,10,31.882113
3,3,19.435222,10,51.317336
4,4,15.795862,10,67.113198
7,7,16.880093,10,101.226684
12,12,13.390411,10,145.702538
...,...,...,...,...
979,979,13.666237,10,8305.625657
980,980,17.022473,10,8322.648130
985,985,14.463560,10,8351.245571
992,992,20.084837,10,8399.200244


## Custom constraints for series

A more complex example is the validation of time series where rows hold calculated values that depend on previous rows.  
To handle such complex yet common use cases, we provide a special syntax to refer to previous rows.

Imagine the following example: our dataset represents transactions on a bank account, and a structural constraint that **must** be respected is that the total amount at transaction **n** must be equal to the total amount at step **n-1** plus the amount of the transaction **n**.
If such a constraint is violated, the data has integrity issues. Although the example is fairly simple to understand, checking such an integrity constraint with temporal dependency can be tricky.

However, this is made simple with the `Constraint Engine` and the `CustomConstraint`, as illustrated in the following example:

In [22]:
def check_balance_with_serie(df):
    # The balance at row n must equal the previous balance plus the current amount
    return df['balance|n'] == df['balance|n-1'] + df['amount|n']

In [23]:
# The custom syntax `column|n-k` is understood as the value at row `n-k` of the column `column`, where `n` is the current row.
c = CustomConstraint(name="Balance Check with Serie", check=check_balance_with_serie,
                          available_columns=['balance|n', 'amount|n', 'balance|n-1'])

There is no restriction on the order of the temporal dependencies. For instance, `amount|n-5` would refer to the amount five rows behind the current one.

It is important to specify which columns you want to make available to your function using the parameter `available_columns`, as above. It prevents side effects and helps optimize the validation of such constraints.

In [24]:
mask = c.validate(dataset)
mask.sum()

0    1000
dtype: int64

Because our dataset was constructed precisely to respect this constraint, we obtain no violated rows, as expected.
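
For comparison, the same integrity rule can be written in plain pandas with `shift`. This is a sketch rather than the engine's implementation: the function name, the tiny ledger, the starting balance of 0 for the first row, and the use of `np.isclose` instead of strict equality are all choices made for this illustration.

```python
import numpy as np
import pandas as pd

def balance_is_consistent(df):
    """Per-row mask: balance[n] == balance[n-1] + amount[n]."""
    # Assumption: the account starts at 0, so the first row has a predecessor
    previous = df["balance"].shift(1, fill_value=0.0)
    # np.isclose tolerates floating-point rounding, unlike a strict ==
    ok = np.isclose(df["balance"], previous + df["amount"])
    return pd.Series(ok, index=df.index)

amount = pd.Series([6.0, 8.5, 17.0, 19.5])
ledger = pd.DataFrame({"amount": amount, "balance": amount.cumsum()})
print(balance_is_consistent(ledger).sum())  # 4

ledger.loc[2, "balance"] += 1.0  # corrupt one balance
print(balance_is_consistent(ledger).tolist())  # [True, True, False, False]
```

A single corrupted balance breaks two rows: the row itself, and the next one, whose check relies on the corrupted value as its previous balance.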

# Conclusion

The `Constraint Engine` is a versatile tool to validate constraints and detect issues in your dataset. Because structural constraints take two forms, we offer two different types of constraints:

1. Rows constraints that check the validity of each row,
2. Columns constraints that check the validity of a column property.

When your constraint cannot be expressed with the default `Constraint` objects that we offer, you can define a `CustomConstraint` with no restriction, using a Python function or lambda.

Finally, you can also define constraints with time dependencies between rows, which makes it possible to check the integrity of complex relationships.