# Great Expectations

Great Expectations helps data teams eliminate pipeline debt
through data testing, documentation, and profiling.



## Data validation with Great Expectations

### What does Great Expectations let us do?
With Great Expectations we can:

* Create human-readable expectations and let the library handle the implementation.

## Key concepts

### Expectations
Expectations are assertions about data. They are declarative, flexible, and extensible.
### Suite
A collection of Expectations.
### Data validation
Great Expectations validates data in batches, checking each batch against our suite of Expectations.
### Data Docs
Great Expectations renders Expectations as clear, simple documentation.
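
To make "declarative" concrete, the sketch below represents an expectation as plain data (a type name plus keyword arguments, the same shape that appears in the saved suite JSON) and evaluates it with a small hand-written registry. The `CHECKS` registry and the `run_expectation` helper are hypothetical names used only for illustration; this is not how Great Expectations is implemented.

```python
# Illustrative only: a toy evaluator, not Great Expectations' implementation.

def values_in_set(values, value_set):
    """Return True when every value belongs to value_set."""
    return all(v in value_set for v in values)

def values_unique(values):
    """Return True when no value is repeated."""
    return len(values) == len(set(values))

# Hypothetical registry mapping an expectation type to the function that checks it.
CHECKS = {
    "expect_column_values_to_be_in_set": values_in_set,
    "expect_column_values_to_be_unique": values_unique,
}

def run_expectation(table, expectation):
    """Evaluate one declarative expectation against a dict of columns."""
    kwargs = dict(expectation["kwargs"])
    column = kwargs.pop("column")
    check = CHECKS[expectation["expectation_type"]]
    return check(table[column], **kwargs)

table = {"sexo": ["F", "M", "F"], "dni": ["37511093", "94977718", "8627709"]}
suite = [
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "sexo", "value_set": ["M", "F"]}},
    {"expectation_type": "expect_column_values_to_be_unique",
     "kwargs": {"column": "dni"}},
]
results = [run_expectation(table, e) for e in suite]
print(results)
```

Because the suite is just data, it can be stored, reviewed, and re-run against a new batch without touching the checking code.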

In [1]:
from datetime import datetime, timezone
import pandas as pd
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.exceptions import DataContextError
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context.types.base import DatasourceConfig
from great_expectations.data_context.types.base import FilesystemStoreBackendDefaults
from great_expectations.data_context import BaseDataContext
from pathlib import Path
from os.path import abspath

2021-10-22T17:27:51-0300 - INFO - Great Expectations logging enabled at 20 level by JupyterUX module.




## Configuring Great Expectations

There are several ways to configure the Data Context:
* Using the CLI (data profiling)
* Using a YAML file
* In memory

`DataContextConfig` sets the parameters used to build the DataContext,
where:
* `DatasourceConfig` defines which engine we use to work with the data (pandas, Spark, database connectors)

In [2]:
project_path = Path(abspath('')).parent.absolute().as_posix()
project_path

'/home/naranja/Develop/diplo/nerdearla'

In [3]:
datasource = "my_pandas_datasource"
context = BaseDataContext(
                project_config=DataContextConfig(
                    config_version=2,
                    plugins_directory=f"{project_path}/plugins",
                    datasources={
                        datasource: DatasourceConfig(
                            class_name="PandasDatasource",
                            data_asset_type = {
                                "module_name": "custom_expectation",
                                "class_name": "MyCustomPandasDataset"
                            },
                        )
                    },
                    validation_operators={
                        "action_list_operator": {
                            "class_name": "ActionListValidationOperator",
                            "action_list": [
                                {
                                    "name": "store_validation_result",
                                    "action": {"class_name": "StoreValidationResultAction"},
                                },
                                {
                                    "name": "update_data_docs",
                                    "action": {"class_name": "UpdateDataDocsAction"},
                                },
                            ],
                        }
                    },
                    store_backend_defaults=FilesystemStoreBackendDefaults(
                        root_directory=project_path
                    )
                )
            )

## Creating a Suite

Suites are identified by name.

If no suite with the given name exists, a new one is created.

In [4]:
expectation_suite_name = 'nerdearla'

try:
    suite = context.get_expectation_suite(expectation_suite_name)
except DataContextError:
    suite = context.create_expectation_suite(expectation_suite_name)

## Loading data
Great Expectations can create expectations without data, but it is better to work against a sample.

In `batch_kwargs`:
 * `datasource`: the datasource defined in the context, referenced by name
 * `dataset`: the in-memory dataset
 * `expectation_suite_names`: metadata

In [5]:
df = pd.read_csv(
    f"{project_path}/data/nx_nerdearla_clean.csv",
    dtype={ "date":str,
           "fecha_nacimiento":str,
           "dni":str}
    )

batch_kwargs = {
    "datasource": "my_pandas_datasource",
    "dataset": df,
    "expectation_suite_names": expectation_suite_name
}

batch = context.get_batch(batch_kwargs, expectation_suite_name)

batch.head(5)

| | dni | date | sexo | estado_civil | fecha_nacimiento | asset_level | education_level |
|---|---|---|---|---|---|---|---|
| 0 | 37511093 | 20210625 | F | Casado | 1955-01-09 00:00:00 | {"name":"Sin especificar","id":"0"} | {"name":"Terciario","id":"3"} |
| 1 | 94977718 | 20210625 | M | Casado | 1951-04-23 00:00:00 | {"name":"Sin especificar","id":"0"} | {"name":"Primario","id":"1"} |
| 2 | 8627709 | 20210625 | F | Soltero | 1973-04-05 00:00:00 | {"name":"Sin especificar","id":"0"} | {"name":"Primario","id":"1"} |
| 3 | 37007709 | 20210625 | M | Soltero | 1988-09-23 00:00:00 | {"name":"Sin especificar","id":"0"} | {"name":"Primario","id":"1"} |
| 4 | 28704754 | 20210625 | F | Soltero | 1983-03-17 00:00:00 | {"name":"Sin especificar","id":"0"} | {"name":"Terciario","id":"3"} |


## Creating Expectations (yay 🥳)
Once the data is loaded in memory, we can create Expectations.
There are different kinds of Expectation:
* Table shape level
* Missing values, unique values, and types
* Sets and ranges
* String matching
* Datetime and JSON parsing
* Others (Aggregate functions, Multi-column, Distributional functions, FileDataAsset)

[Glossary of Expectations](https://docs.greatexpectations.io/docs/reference/glossary_of_expectations)

In [6]:
batch.expect_table_columns_to_match_ordered_list(
    column_list=[
        'dni',
        'date',
        'sexo',
        'estado_civil',
        'fecha_nacimiento',
        'asset_level',
        'education_level'
    ])

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": [
      "dni",
      "date",
      "sexo",
      "estado_civil",
      "fecha_nacimiento",
      "asset_level",
      "education_level"
    ]
  },
  "meta": {},
  "success": true
}

In [7]:
batch.expect_column_values_to_be_in_type_list (
    column='estado_civil',
    type_list= [
            "CHAR",
            "NVARCHAR",
            "STRING",
            "StringType",
            "TEXT",
            "VARCHAR",
            "str",
            "string"
        ]
    )

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}

In [8]:
batch.expect_column_values_to_be_unique(column='dni')

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 1,
    "missing_percent": 1.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}
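
The result above reports one missing `dni` and still succeeds, which suggests that missing values are left out of the uniqueness check. A minimal sketch of that behavior in plain Python, assuming `None` marks a missing value; `find_duplicates` is a hypothetical helper, not a library function.

```python
from collections import Counter

def find_duplicates(values):
    """Return the sorted values that appear more than once, ignoring None."""
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)

dnis = ["37511093", "94977718", None, "8627709"]
no_duplicates = find_duplicates(dnis)
one_duplicate = find_duplicates(dnis + ["8627709"])
print(no_duplicates, one_duplicate)
```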

In [9]:
batch.expect_column_values_to_match_regex (
    column='dni',
    regex = r'^[\d]{1,3}\.?[\d]{3,3}\.?[\d]{3,3}$'
    )

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 1,
    "missing_percent": 1.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}
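
To see which strings the pattern accepts, it can be tried outside Great Expectations with the standard `re` module. The sample values below are chosen for illustration; only the first two come from the dataset preview.

```python
import re

# Same pattern as in the expectation above.
DNI_PATTERN = re.compile(r'^[\d]{1,3}\.?[\d]{3,3}\.?[\d]{3,3}$')

samples = ["37511093", "8627709", "37.511.093", "1234", "37-511-093"]
matches = {s: bool(DNI_PATTERN.search(s)) for s in samples}
for value, ok in matches.items():
    print(f"{value!r}: {ok}")
```

The pattern accepts 7 to 9 digits with optional dots as thousands separators. Since each dot is optional on its own, a value such as `37.511093` also matches.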

In [10]:
batch.expect_column_values_to_be_dateutil_parseable(column='date')

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}

In [11]:
batch.expect_column_values_to_match_strftime_format (
    column='fecha_nacimiento',
    strftime_format = '%Y-%m-%d %H:%M:%S'
    )

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}
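
The format string uses the standard Python `strftime`/`strptime` directives, so a single value can be checked with `datetime.strptime`. In this sketch, `matches_format` is a hypothetical helper written for illustration.

```python
from datetime import datetime

def matches_format(value, strftime_format):
    """Return True when value parses with the given format string."""
    try:
        datetime.strptime(value, strftime_format)
        return True
    except ValueError:
        return False

fmt = "%Y-%m-%d %H:%M:%S"
birth_ok = matches_format("1955-01-09 00:00:00", fmt)   # fecha_nacimiento style
compact_ok = matches_format("20210625", fmt)            # date column style
compact_own = matches_format("20210625", "%Y%m%d")
print(birth_ok, compact_ok, compact_own)
```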

In [12]:
batch.expect_column_values_to_be_in_set(
    column ='sexo',
    value_set = ['M','F'],
    mostly = 0.9
    )

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}

In [13]:
batch.expect_column_values_to_be_in_set(
    column = 'estado_civil',
    value_set = ['Casado','Soltero','Divorciado','Viudo','Separado de Hecho','Novio'],
    row_condition = 'sexo == "F"',
    condition_parser = 'pandas'
)

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 55,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}
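
With `row_condition`, the expectation is evaluated only on the rows that satisfy the condition, which is why `element_count` drops from 100 to 55 above. A rough plain-Python equivalent, using a few made-up rows instead of the real dataset:

```python
# Made-up rows for illustration; the real dataset has 100 rows.
rows = [
    {"sexo": "F", "estado_civil": "Casado"},
    {"sexo": "M", "estado_civil": "Desconocido"},
    {"sexo": "F", "estado_civil": "Soltero"},
]
allowed = {"Casado", "Soltero", "Divorciado", "Viudo", "Separado de Hecho", "Novio"}

# Step 1: keep only the rows that satisfy the condition (sexo == "F").
selected = [r for r in rows if r["sexo"] == "F"]

# Step 2: check the expectation on the selected rows only.
unexpected = [r["estado_civil"] for r in selected if r["estado_civil"] not in allowed]

print(len(selected), unexpected)
```

The row with `sexo == "M"` holds a value outside the set, but it never reaches the check.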

In [14]:
batch.expect_column_values_to_be_json_parseable('education_level')

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}
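
"JSON parseable" means the string can be decoded as JSON. A minimal sketch with the standard `json` module; `is_json_parseable` is a hypothetical helper named after the expectation.

```python
import json

def is_json_parseable(value):
    """Return True when value is a string containing valid JSON."""
    try:
        json.loads(value)
        return True
    except (TypeError, ValueError):
        return False

valid = is_json_parseable('{"name":"Terciario","id":"3"}')
invalid = is_json_parseable("{name: Terciario}")
print(valid, invalid)
```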

In [15]:
batch.expect_column_values_to_match_json_schema (
    column = 'asset_level',
    json_schema = {
        "$schema" : "http://json-schema.org/draft-04/schema#",
        "type" :"object",
        "properties": {
            "name": {
            "type": "string"
            },
            "id": {
            "type": "string"
            }
        },
        "required": [
            "name",
            "id"
        ]
    },
    mostly = 0.5
)

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 39,
    "unexpected_percent": 39.0,
    "unexpected_percent_total": 39.0,
    "unexpected_percent_nonmissing": 39.0,
    "partial_unexpected_list": [
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}",
      "{\"code\":\"0\"}"
    ]
  },
  "meta": {},
  "success": true
}
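
Note that this expectation succeeds even though 39 of the 100 values do not match the schema: with `mostly = 0.5`, it appears to be enough for at least half of the values to pass. The sketch below reproduces that arithmetic, with a hand-written check standing in for real JSON Schema validation. `has_name_and_id` and `passes_mostly` are hypothetical helpers.

```python
import json

def has_name_and_id(raw):
    """Simplified stand-in for the schema: an object with string 'name' and 'id'."""
    obj = json.loads(raw)
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("id"), str)
    )

def passes_mostly(values, check, mostly):
    """Return (success, fraction_ok), requiring at least `mostly` of values to pass."""
    ok = sum(1 for v in values if check(v))
    fraction_ok = ok / len(values)
    return fraction_ok >= mostly, fraction_ok

# 61 matching and 39 non-matching values, mirroring the counts reported above.
values = ['{"name":"Sin especificar","id":"0"}'] * 61 + ['{"code":"0"}'] * 39

lenient = passes_mostly(values, has_name_and_id, mostly=0.5)
strict = passes_mostly(values, has_name_and_id, mostly=0.9)
print(lenient, strict)
```

A low `mostly` can hide a real data problem, so it is worth reading `unexpected_count` as well as `success`.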

In [16]:
batch.expect_column_values_to_calculate_age (
    column = 'fecha_nacimiento',
    min_age = 18,
    date_format = '%Y-%m-%d %H:%M:%S'
)

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 100,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "success": true
}
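
`expect_column_values_to_calculate_age` is not a built-in expectation. It comes from the `MyCustomPandasDataset` class configured in the data context, whose source is not shown here. The sketch below is only a guess at the kind of check it performs: compute an age from the birth date and compare it with `min_age`. The `age_on` helper and the fixed reference date are assumptions for illustration.

```python
from datetime import date, datetime

def age_on(birth, today):
    """Return completed years between birth and today."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

date_format = "%Y-%m-%d %H:%M:%S"
reference = date(2021, 10, 22)  # fixed so the example is reproducible

birth = datetime.strptime("1988-09-23 00:00:00", date_format).date()
age = age_on(birth, reference)
print(age, age >= 18)
```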

In [17]:
suite = batch.get_expectation_suite()
suite

2021-10-22T17:27:52-0300 - INFO - 	11 expectation(s) included in expectation_suite. result_format settings filtered.


{
  "expectations": [
    {
      "ge_cloud_id": null,
      "meta": {},
      "expectation_type": "expect_table_columns_to_match_ordered_list",
      "kwargs": {
        "column_list": [
          "dni",
          "date",
          "sexo",
          "estado_civil",
          "fecha_nacimiento",
          "asset_level",
          "education_level"
        ]
      }
    },
    {
      "ge_cloud_id": null,
      "meta": {},
      "expectation_type": "expect_column_values_to_be_in_type_list",
      "kwargs": {
        "column": "estado_civil",
        "type_list": [
          "CHAR",
          "NVARCHAR",
          "STRING",
          "StringType",
          "TEXT",
          "VARCHAR",
          "str",
          "string"
        ]
      }
    },
    {
      "ge_cloud_id": null,
      "meta": {},
      "expectation_type": "expect_column_values_to_be_unique",
      "kwargs": {
        "column": "dni"
      }
    },
    {
      "ge_cloud_id": null,
      "meta": {},
      "expectation_type"

### Saving the Suite

In [18]:
context.save_expectation_suite(suite, expectation_suite_name)

'/home/naranja/Develop/diplo/nerdearla/expectations/nerdearla.json'
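
The saved suite is a JSON file, so it can be inspected with the standard library. The sketch below parses an abbreviated, inline copy of the structure shown earlier (an `expectations` list whose items carry `expectation_type` and `kwargs`) rather than reading the real file. Any other fields in the file are omitted here.

```python
import json

# Abbreviated stand-in for the contents of expectations/nerdearla.json.
suite_json = """
{
  "expectations": [
    {"expectation_type": "expect_column_values_to_be_unique",
     "kwargs": {"column": "dni"}, "meta": {}},
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "sexo", "value_set": ["M", "F"], "mostly": 0.9}, "meta": {}}
  ]
}
"""

suite_dict = json.loads(suite_json)
summary = [(e["expectation_type"], e["kwargs"].get("column"))
           for e in suite_dict["expectations"]]
for expectation_type, column in summary:
    print(f"{column}: {expectation_type}")
```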