## Data Validation with Voluptuous

In this notebook, we'll use [Voluptuous](https://github.com/alecthomas/voluptuous) to define schemas for our data. We can then use the exceptions raised during validation to mark, set aside, or remove unclean or invalid data.

In [None]:
import logging
import pandas as pd
from datetime import datetime
from voluptuous import Schema, Required, Range, All, ALLOW_EXTRA
from voluptuous.error import MultipleInvalid, Invalid

In [None]:
# The validation helper below reports failures through the logging module.
logger = logging.getLogger()  # root logger
logger.setLevel(logging.WARNING)

In [None]:
sales = pd.read_csv('data/sales_data.csv')

### Data Quality Check

In [None]:
sales.head()

In [None]:
sales.dtypes

## Defining our first schema

This function converts each row of the data set into a plain Python dict, so that Voluptuous can work with it, and validates every row against the schema.

In [None]:
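def validate_sales_dataframe(sales_df, field_to_validate):
    """Validate each row against the global `schema`; return the number of failing rows."""
    error_count = 0
    # Transposing first makes to_dict() yield one {column: value} dict per row, keyed by index label.
    for s_id, sale in sales_df.T.to_dict().items():
        try:
            schema(sale)
        except MultipleInvalid as e:
            # Log the row id, the value of the field under test and the validation message.
            logging.warning(f'issue with sale: {s_id} ({sale[field_to_validate]}) - {e}')
            error_count += 1
    return error_count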
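
To see what the loop in the function above iterates over, here is a minimal sketch of the row-to-dict conversion on a tiny, made-up frame. The `store` column and all values are illustrative assumptions, not taken from `sales_data.csv`.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real sales data.
toy = pd.DataFrame({'sale_amount': [10, 20], 'store': ['A', 'B']})

# Transposing turns rows into columns, so to_dict() returns
# {index_label: {column_name: value, ...}, ...}: one dict per original row.
rows = toy.T.to_dict()

for row_id, row in rows.items():
    print(row_id, row)
```

Each inner dict is the kind of record that gets passed to the schema.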

In this case we only care about validating the 'sale_amount' field.
Since the data set contains other fields that we do not validate yet,
we pass extra=ALLOW_EXTRA so that those fields are ignored rather than rejected.

In [None]:
schema = Schema({
    Required('sale_amount'): All(int, Range(min=2.50, max=1550)),
}, extra=ALLOW_EXTRA)

In [None]:
error_count = validate_sales_dataframe(sales, 'sale_amount')
print(f'Total Errors in DataFrame: {error_count}')

In [None]:
sales.shape

In [None]:
schema = Schema({
    Required('sale_amount'): All(int, Range(min=-1550, max=1550)),
}, extra=ALLOW_EXTRA)

In [None]:
error_count = validate_sales_dataframe(sales, 'sale_amount')
print(f'Total Errors in DataFrame: {error_count}')

### Now we need to ask ourselves: What is the reason for these errors?
- Do we have an improperly defined schema?
- Do we expect to have negative values in our data?
- Why do we see higher **sale_amount** values? Fraud? New products?
- What should we do with our schema and our failing data points?

## Adding a custom Validation Case

### This can be used to create common utility validators that encode specific business logic.

In this case we define a simple date format validator:

In [None]:
def ValidDate(fmt='%Y-%m-%dT%H:%M:%S'):
    return lambda v: datetime.strptime(v, fmt)

And the appropriate Schema:

In [None]:
schema = Schema({
    Required('timestamp'): All(ValidDate()),
}, extra=ALLOW_EXTRA)

Let's rerun our validation function. This time, let's validate the timestamp format:

In [None]:
error_count = validate_sales_dataframe(sales, 'timestamp')
print(f'Total Errors in DataFrame: {error_count}')

## So we have valid date structures. What about actual valid dates?

In [None]:
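def ValidDate(fmt='%Y-%m-%dT%H:%M:%S'):
    def validation_func(v):
        # A malformed string makes strptime raise ValueError, which Voluptuous reports as invalid.
        parsed = datetime.strptime(v, fmt)
        # Compare explicitly instead of using assert, which is skipped under `python -O`.
        if parsed > datetime.now():
            raise Invalid(f'date is in the future! {v}')
        return parsed
    return validation_func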

In [None]:
schema = Schema({
    Required('timestamp'): All(ValidDate()),
}, extra=ALLOW_EXTRA)

In [None]:
error_count = validate_sales_dataframe(sales, 'timestamp')
print(f'Total Errors in DataFrame: {error_count}')

## What could be the reasons for future dates?

- presales
- incomplete data with invalid details

Now we need to decide what to do with these errors.