# Expectations & Business rules validation

The ConstraintEngine validates that datasets and databases comply with business rules. It is central to maintaining data quality because it enforces the specific conditions that data must meet to be considered accurate and reliable.

Constraint rules, such as data type checks or range limits, ensure data adheres to defined structural standards, while business rules validate that data aligns with real-world expectations and organizational requirements. 
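
The distinction can be sketched in plain Python, without the ConstraintEngine API. The first two records reuse values shown later in the `DimEmployee` preview; the third record, the `EmployeeRecord` class, and the "end date must not precede start date" rule are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmployeeRecord:
    sick_leave_hours: float
    start_date: int            # encoded as YYYYMMDD
    end_date: Optional[int]    # None while the employee is still active


def structural_ok(rec: EmployeeRecord) -> bool:
    """Constraint rule: a type and range check on a single field."""
    return isinstance(rec.sick_leave_hours, (int, float)) and rec.sick_leave_hours >= 0


def business_ok(rec: EmployeeRecord) -> bool:
    """Business rule (assumed): an end date, when present, must not precede the start date."""
    return rec.end_date is None or rec.end_date >= rec.start_date


records = [
    EmployeeRecord(30.0, 19960731, None),
    EmployeeRecord(80.0, 19980105, 20000630),
    EmployeeRecord(-5.0, 20000630, 19990101),  # hypothetical invalid row
]

# One (structural, business) pair per record
results = [(structural_ok(r), business_ok(r)) for r in records]
print(results)
```

The structural check looks at one field in isolation, while the business rule needs domain knowledge about how two fields relate to each other.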

This example briefly shows how the ConstraintEngine detects potential anomalies, helping to keep datasets consistent, reduce errors, and support data integrity and accurate decision-making.

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='c34236b4-dcdd-4955-8bc0-036407c02469', 
                             namespace='6d3db1e0-4b88-4046-a28c-7dfafeabe035')
database = datasource.dataset


In [2]:
print(database)

MultiDataset Summary

Number of tables: 19
 
               Table name  Num cols                         Num rows             Primary keys Foreign keys Notes
0              DimAccount        10  Number of rows not yet computed             [AccountKey]                   
1             DimCurrency         3  Number of rows not yet computed            [CurrencyKey]                   
2             DimCustomer        25  Number of rows not yet computed            [CustomerKey]                   
3      DimDepartmentGroup         3  Number of rows not yet computed     [DepartmentGroupKey]                   
4             DimEmployee        30  Number of rows not yet computed            [EmployeeKey]                   
5            DimGeography        10  Number of rows not yet computed           [GeographyKey]                   
6         DimOrganization         5  Number of rows not yet computed        [OrganizationKey]                   
7              DimProduct       

In [3]:
employees_table = database['DimEmployee']

In [4]:
employees_table.head()

| | EmployeeKey | ParentEmployeeKey | EmployeeNationalIDAlternateKey | ParentEmployeeNationalIDAltKey | SalesTerritoryKey | FirstName | LastName | MiddleName | NameStyle | Title | ... | PayFrequency | BaseRate | VacationHours | SickLeaveHours | CurrentFlag | SalesPersonFlag | DepartmentName | StartDate | EndDate | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18.0 | 14417807 | 446466105 | 11.0 | Guy | Gilbert | R | 0 | Production Technician - WC60 | ... | 1.0 | 12.45 | 21.0 | 30.0 | 1 | 0 | Production | 19960731.0 | | Current |
| 1 | 2 | 7.0 | 253022876 | 24756624 | 11.0 | Kevin | Brown | F | 0 | Marketing Assistant | ... | 2.0 | 13.4615 | 42.0 | 41.0 | 1 | 0 | Marketing | 19970226.0 | | Current |
| 2 | 3 | 14.0 | 509647174 | 245797967 | 11.0 | Roberto | Tamburello | | 0 | Engineering Manager | ... | 2.0 | 43.2692 | 2.0 | 21.0 | 1 | 0 | Engineering | 19971212.0 | | Current |
| 3 | 4 | 3.0 | 112457891 | 509647174 | 11.0 | Rob | Walters | | 0 | Senior Tool Designer | ... | 2.0 | 29.8462 | 48.0 | 80.0 | 1 | 0 | Tool Design | 19980105.0 | 20000630.0 | |
| 4 | 5 | 3.0 | 112457891 | 509647174 | 11.0 | Rob | Walters | | 0 | Senior Tool Designer | ... | 2.0 | 29.8462 | 48.0 | 80.0 | 1 | 0 | Tool Design | 20000630.0 | | Current |


## Constraint validation & analysis

Create the constraints to validate the *EmergencyContactPhone*, *EmailAddress* and *SickLeaveHours* columns. Fabric's ConstraintEngine includes pre-defined rules that speed up the creation of relevant validations, and it also offers *CustomConstraint*, which provides the flexibility to express even the most complex validations.

In [5]:
from ydata.constraints.engine import ConstraintEngine
from ydata.constraints.rows import GreaterThan, Positive, CustomConstraint, Regex

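# Phone numbers: the pattern is anchored at the start and expects NNN-NNN-NNNN
c1 = Regex(column='EmergencyContactPhone', regex=r'^\d{3}-\d{3}-\d{4}')
# Email addresses: local part, '@', then a domain ending in a dot and 2+ letters
c2 = Regex(column='EmailAddress', regex=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# Sick leave hours must be positive
c3 = Positive(columns='SickLeaveHours')

# Register the constraints in the engine and validate the employees table
ce = ConstraintEngine()
ce.add_constraints([c1, c2, c3])

ce.validate(employees_table)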
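
Both patterns above are anchored with `^` but have no trailing `$`. The notebook does not say whether `Regex` applies them as a prefix match or a full match, so the sketch below only illustrates how the two behaviours differ in Python's standard `re` module. The sample strings and the two helper functions are hypothetical and are not part of the ConstraintEngine API.

```python
import re

# Same patterns as in the constraints above
PHONE = r'^\d{3}-\d{3}-\d{4}'
EMAIL = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Hypothetical sample values, not taken from the dataset
phones = ['555-123-4567', '555-123-4567 ext. 9', '5551234567']
emails = ['jane.doe@example.com', 'jane.doe@example', 'jane@example.com!']


def prefix_match(pattern: str, value: str) -> bool:
    """True when the pattern matches at the start of the value."""
    return re.match(pattern, value) is not None


def full_match(pattern: str, value: str) -> bool:
    """True only when the pattern consumes the entire value."""
    return re.fullmatch(pattern, value) is not None


phone_prefix = [prefix_match(PHONE, v) for v in phones]
phone_full = [full_match(PHONE, v) for v in phones]
email_prefix = [prefix_match(EMAIL, v) for v in emails]
email_full = [full_match(EMAIL, v) for v in emails]

for label, flags in [('phone prefix', phone_prefix), ('phone full', phone_full),
                     ('email prefix', email_prefix), ('email full', email_full)]:
    print(label, flags)
```

With prefix matching, values that carry trailing characters after a valid phone number or email address still pass. If such values should be flagged, appending `$` to each pattern is one way to require a full match.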

In [6]:
outcomes = ce.summary()

In [7]:
from utils.constrains_report import generate_report

report = generate_report(outcomes)

In [8]:
# Write the HTML content to a file
with open('data_integrity_report.html', 'w') as file:
    file.write(report)

## Pipeline outputs

In [None]:
import json

profile_pipeline_output = {
    'outputs': [
        {
            'type': 'web-app',
            'storage': 'inline',
            'source': report,
        },
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(profile_pipeline_output, metadata_file)