In [1]:
import logging
import numpy as np
import pandas as pd

from validata.parser import Parser
from validata.validator import Validator
from validata.comparators import Comparator

%load_ext autoreload
%autoreload 2

In [7]:
# Enable to see what is going on under the hood
logging.basicConfig(level=logging.DEBUG)

## Create some data

In [8]:
n = 10

households = pd.DataFrame({
    "id": np.arange(n),
    "size": (np.random.standard_exponential(n) + 1).astype(int),
    "income_1": (np.random.standard_exponential(n) * 30e3).astype(int),
    "income_2": (np.random.standard_exponential(n) * 20e3).astype(int),
})

households

Unnamed: 0,id,size,income_1,income_2
0,0,2,19686,8285
1,1,5,108988,2910
2,2,1,8487,14445
3,3,1,24735,10088
4,4,2,11487,3897
5,5,2,79127,823
6,6,1,92632,15430
7,7,1,12646,1673
8,8,1,40850,1969
9,9,1,7091,23737


## Performing a single validation

Validata lets you perform validations on each record in the data set. In the simplest form, a validation is formulated as a single logical condition, for example:

```
income_1 < 30000
```

In this example, `income_1` is a column in the data set, `<` is a comparison operator (a `Comparator`), and `30000` is the value to compare against. The general form of a validation is therefore:

```
<column name> <comparison operator> <value>
```

The `Parser` class allows you to perform a single validation, as is demonstrated below:

In [9]:
# Create a Parser and supply it with a validation
ps = Parser("income_1 < 30000")

# Perform the validation on the households data set
ps.evaluate(households)

DEBUG:validata.parser:Processing token: income_1 < 30000 [BARE WORD]
DEBUG:validata.parser:Evaluating expression: income_1 < 30000
DEBUG:validata.evaluator:Selected columns: income_1.
DEBUG:validata.evaluator:Using comparator: LtComparator.


Unnamed: 0,income_1
0,True
1,False
2,True
3,True
4,True
5,False
6,False
7,True
8,False
9,True


As you can see, the `Parser` applied the logical validation to each row of the `households` data set. It returned `True` for all rows with `income_1` smaller than `30000` and `False` otherwise.
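
To cross-check a result like this, the same row-wise condition can be written directly in pandas. The sketch below is not part of Validata. It uses a small hand-written frame (`sample`, holding the first four `income_1` values from the table above) and relies only on the standard pandas comparison operator.

```python
import pandas as pd

# First four income_1 values from the households table above
sample = pd.DataFrame({"income_1": [19686, 108988, 8487, 24735]})

# Element-wise comparison: yields one boolean per row
expected = sample["income_1"] < 30000
print(expected.tolist())  # [True, False, True, True]
```

The booleans match the first four rows of the `Parser` output shown above.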

To see which other types of comparisons are available, use the `list()` method of the `Comparator` base class like so:

In [10]:
# Show which comparison operators are available
Comparator.list()

{'!=', '<', '<=', '==', '>', '>=', 'between', 'in', 'missing', 'not missing'}
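
For orientation, the less familiar operators in this set (`between`, `in`, `missing`, `not missing`) have close counterparts in plain pandas. The sketch below shows those counterparts only; it does not call Validata. The notebook does not state whether the bounds of Validata's `between` are inclusive, whereas `Series.between` is inclusive on both ends by default. The `heights` series is made-up illustration data.

```python
import numpy as np
import pandas as pd

# Illustrative values; the last entry is missing
heights = pd.Series([182, 172, 278, np.nan])

in_range = heights.between(140, 240)   # inclusive bounds; NaN compares as False
in_set = heights.isin([172, 182])      # membership test
is_missing = heights.isna()            # True where the value is missing
not_missing = heights.notna()          # the complement of isna()

print(in_range.tolist())     # [True, True, False, False]
print(not_missing.tolist())  # [True, True, True, False]
```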

### Combining validations

The `Parser` also allows you to combine validation checks using `and` or `or`. With `and`, both the left-hand and the right-hand condition need to be `True` for the entire validation to be `True`. With `or`, only one of the conditions needs to be `True`. Some simple examples:

In [11]:
# Both income columns need to be larger than 30000
both_incomes_high = "income_1 > 30000 and income_2 > 30000"

# Either income is high
one_income_high = "income_1 > 30000 or income_2 > 30000"

# Feel free to test using the Parser
ps = Parser(one_income_high)
ps.evaluate(households)

DEBUG:validata.parser:Processing token: income_1 > 30000 [BARE WORD]
DEBUG:validata.parser:Evaluating expression: income_1 > 30000
DEBUG:validata.evaluator:Selected columns: income_1.
DEBUG:validata.evaluator:Using comparator: GtComparator.
DEBUG:validata.parser:Processing token:  or  [OR]
DEBUG:validata.parser:Processing  or  expression.
DEBUG:validata.parser:Processing token: income_2 > 30000
DEBUG:validata.parser:Evaluating right hand side expression.
DEBUG:validata.evaluator:Selected columns: income_2.
DEBUG:validata.evaluator:Using comparator: GtComparator.


Unnamed: 0,0
0,False
1,True
2,False
3,False
4,False
5,True
6,True
7,False
8,True
9,False
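
The `and` / `or` semantics described above correspond to the element-wise `&` and `|` operators in pandas. The following is a minimal sketch of that correspondence, not Validata's implementation. The `sample` frame holds the first four rows of the incomes shown earlier.

```python
import pandas as pd

# First four rows of income_1 / income_2 from the households table
sample = pd.DataFrame({
    "income_1": [19686, 108988, 8487, 24735],
    "income_2": [8285, 2910, 14445, 10088],
})

high_1 = sample["income_1"] > 30000
high_2 = sample["income_2"] > 30000

both_high = high_1 & high_2   # "and": both sides must be True
one_high = high_1 | high_2    # "or": at least one side must be True

print(both_high.tolist())  # [False, False, False, False]
print(one_high.tolist())   # [False, True, False, False]
```

Note that in pandas each comparison must be evaluated (or parenthesised) before `&` or `|` is applied, because those operators bind more tightly than `>`.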


In addition, you can group conditions using brackets `(...)`. This removes ambiguity, especially when `and` and `or` are used together. For example:

In [23]:
# Cases where an income may be missing
extra_income = """
    (size == 1 and income_1 missing and income_2 missing) or
    (size == 2 and (income_1 missing or income_2 missing))
"""

ps = Parser(extra_income)
ps.evaluate(households)

DEBUG:validata.parser:Processing token: ( [GROUP_OPEN]
DEBUG:validata.parser:Entering nested expression.
DEBUG:validata.parser:Processing token: size == 1 [BARE WORD]
DEBUG:validata.parser:Evaluating expression: size == 1
DEBUG:validata.evaluator:Selected columns: size.
DEBUG:validata.evaluator:Using comparator: EqComparator.
DEBUG:validata.parser:Processing token:  and  [AND]
DEBUG:validata.parser:Processing  and  expression.
DEBUG:validata.parser:Processing token: income_1 missing
DEBUG:validata.parser:Evaluating right hand side expression.
DEBUG:validata.evaluator:Selected columns: income_1.
DEBUG:validata.evaluator:Using comparator: NullComparator.
DEBUG:validata.parser:Processing token:  and  [AND]
DEBUG:validata.parser:Processing  and  expression.
DEBUG:validata.parser:Processing token: income_2 missing
DEBUG:validata.parser:Evaluating right hand side expression.
DEBUG:validata.evaluator:Selected columns: income_2.
DEBUG:validata.evaluator:Using comparator: NullComparator.
DEBUG:

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
5,False
6,False
7,False
8,False
9,False


### Operators and multiple columns

In the previous examples, each condition involved only a single column. However, it is possible to select multiple columns:

```
column_1 + column_2 + ... + column_n
```

Multiple column names can be provided by concatenating them with the plus sign (`+`).

```
column_*
```

Using the wildcard sign (`*`) selects all columns that start with `column_`. When multiple columns are selected, an `Operator` is needed to aggregate them into a single column. There are two types of `Operator`s:

#### DataOperator
A `DataOperator` performs an aggregation before the data is sent to a `Comparator`; it operates on the raw data. Common examples are `mean` and `sum`, which compute the mean and the sum of the selected columns, respectively.
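
The order of operations for a `DataOperator` (select columns, aggregate, then compare) can be sketched in plain pandas as follows. This illustrates the idea behind an expression such as `sum income_* > 100000`; it is not Validata's code. The helper `select_columns` is hypothetical and uses `fnmatch` from the standard library to mimic the wildcard, and `sample` holds the first three rows of the households table.

```python
from fnmatch import fnmatch

import pandas as pd

sample = pd.DataFrame({
    "size": [2, 5, 1],
    "income_1": [19686, 108988, 8487],
    "income_2": [8285, 2910, 14445],
})


def select_columns(frame, pattern):
    """Hypothetical helper: names of the columns matching a wildcard pattern."""
    return [name for name in frame.columns if fnmatch(name, pattern)]


# 1. Select the columns, 2. aggregate the raw data, 3. compare
columns = select_columns(sample, "income_*")
total = sample[columns].sum(axis=1)
high_collective_income = total > 100000

print(columns)                          # ['income_1', 'income_2']
print(high_collective_income.tolist())  # [False, True, False]
```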

#### LogicalOperator
A `LogicalOperator` performs an aggregation after a `Comparator` is used; it aggregates the boolean output from the `Comparator`. A common example is the `any` operator, which returns `True` if any of the input columns equals `True`.
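
For a `LogicalOperator` the order is reversed: compare first, then aggregate the booleans. The pandas sketch below illustrates that order for an expression such as `any income_* missing`. It is an assumption-laden illustration rather than Validata's implementation, and the `sample` frame contains made-up missing values.

```python
import numpy as np
import pandas as pd

# Illustrative data with some missing incomes
sample = pd.DataFrame({
    "income_1": [19686.0, np.nan, 8487.0],
    "income_2": [8285.0, 2910.0, np.nan],
})

# 1. Apply the comparison to every selected column (one boolean column each)
is_missing = sample[["income_1", "income_2"]].isna()

# 2. Aggregate the booleans row by row
any_missing = is_missing.any(axis=1)   # True if at least one column is missing
all_missing = is_missing.all(axis=1)   # True only if every column is missing

print(any_missing.tolist())  # [False, True, True]
print(all_missing.tolist())  # [False, False, False]
```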

## Validator: Performing many checks

### Define a data frame with validations

In [5]:
checks_df = pd.DataFrame(
    data=[
        ["large_size", "size > 2"],
        ["income_missing", "any income_* missing"],
        ["high_collective_income", "sum income_* > 100000"]
    ],
    columns=["name", "expression"]
)
checks_df

Unnamed: 0,name,expression
0,large_size,size > 2
1,income_missing,any income_* missing
2,high_collective_income,sum income_* > 100000


### Run all validations

In [6]:
vd = Validator(checks_df)
results = vd.validate(households)

DEBUG:validata.validator:Performing validation: large_size.
DEBUG:validata.parser:Processing token: size > 2 [BARE WORD]
DEBUG:validata.parser:Evaluating expression: size > 2
DEBUG:validata.evaluator:Selected columns: size.
DEBUG:validata.evaluator:Using comparator: GtComparator.
DEBUG:validata.validator:Validated 10 rows - 30% evaluated to True.
DEBUG:validata.validator:Finished validation: large_size.
DEBUG:validata.validator:Performing validation: income_missing.
DEBUG:validata.parser:Processing token: any income_* missing [BARE WORD]
DEBUG:validata.parser:Evaluating expression: any income_* missing
DEBUG:validata.evaluator:Selected columns: income_1, income_2.
DEBUG:validata.evaluator:Using comparator: NullComparator.
DEBUG:validata.evaluator:Using operator: AnyOperator.
DEBUG:validata.validator:Validated 10 rows - 0% evaluated to True.
DEBUG:validata.validator:Finished validation: income_missing.
DEBUG:validata.validator:Performing validation: high_collective_income.
DEBUG:validat

In [7]:
results

Unnamed: 0,large_size,income_missing,high_collective_income
0,True,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,True,False,False
5,False,False,False
6,False,False,True
7,False,False,False
8,True,False,True
9,False,False,False


In [8]:
vd.get_summary()

Unnamed: 0,True %
large_size,30.0
income_missing,0.0
high_collective_income,20.0
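
The percentages in this summary can be reproduced from the boolean results table with plain pandas. The sketch below rebuilds the `results` values shown above by hand (as `sample_results`) and computes the share of `True` per check. That `get_summary()` works this way internally is an assumption; the sketch only shows that the numbers agree.

```python
import pandas as pd

# Boolean results copied from the table above
sample_results = pd.DataFrame({
    "large_size": [True, False, False, False, True,
                   False, False, False, True, False],
    "income_missing": [False] * 10,
    "high_collective_income": [False, False, False, False, False,
                               False, True, False, True, False],
})

# Share of rows that evaluated to True, as a percentage per check
true_pct = sample_results.sum() * 100 / len(sample_results)
summary = true_pct.rename("True %").to_frame()

print(summary["True %"].tolist())  # [30.0, 0.0, 20.0]
```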


In [29]:
import pandas as pd
from validata.parser import Parser

df = pd.DataFrame({
    "gender": [1, 2, 1, None],
    "height": [182, 172, 278, 176],
})

# Find missing values
ps = Parser("gender not missing")
print(ps.evaluate(df))
# Outputs [True, True, True, False]

# Find (too) extreme values for height
ps = Parser("height between 140:240")
print(ps.evaluate(df))
# Outputs [True, True, False, True]

DEBUG:validata.parser:Processing token: gender not missing [BARE WORD]
DEBUG:validata.parser:Evaluating expression: gender not missing
DEBUG:validata.evaluator:Selected columns: gender.
DEBUG:validata.evaluator:Using comparator: NotNullComparator.
DEBUG:validata.parser:Processing token: height between 140:240 [BARE WORD]
DEBUG:validata.parser:Evaluating expression: height between 140:240
DEBUG:validata.evaluator:Selected columns: height.
DEBUG:validata.evaluator:Using comparator: BetweenComparator.


   gender
0    True
1    True
2    True
3   False
   height
0    True
1    True
2   False
3    True
