# The Metadata and Constraints

**Metadata** objects are the pillar of YData's package features. The **Metadata** object can be shared across the different features and elements of the YData platform: *profiling*, *synthesizer* and *report*.
    
- The object helps you extract the main information from your dataset:
    - The **columns metadata**: both the variable type and the data type (numerical, categorical, etc.)
    - The **data warnings**: checks for the presence of duplicates, skewed variables, etc.
    - The **data constraints**: validation of the data against business rules. Constraints are flexible and easy to use.
    
In this notebook, we show the features and capabilities of these objects and how to combine them with other pieces of YData's package.

The dataset used to explore the Metadata and Constraints can be found on ["Kaggle - Home Loans"](https://www.kaggle.com/code/sazid28/home-loan-prediction/data).

In [5]:
import pandas as pd

from ydata.connectors import LocalConnector
from ydata.connectors.filetype import FileType
from ydata.metadata import Metadata

In [6]:
#Read a local dataset using YData LocalConnector
conn = LocalConnector()
data = conn.read_file('train.csv', file_type=FileType.CSV)

In [8]:
data.head(100)

idx,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849.0,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583.0,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000.0,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583.0,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000.0,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,LP001326,Male,No,0,Graduate,,6782.0,0.0,,360.0,,Urban,N
96,LP001327,Female,Yes,0,Graduate,No,2484.0,2302.0,137.0,360.0,1.0,Semiurban,Y
97,LP001333,Male,Yes,0,Graduate,No,1977.0,997.0,50.0,360.0,1.0,Semiurban,Y
98,LP001334,Male,Yes,0,Not Graduate,No,4188.0,0.0,115.0,180.0,1.0,Semiurban,Y


In [10]:
#Extract the Metadata from the Dataset
metadata = Metadata(data)
print(metadata)

Metadata Summary 
 
Dataset type: TABULAR
Dataset attributes: 
Number of columns: 13
% of duplicate rows: 0
Target column: 

Column detail: 
               Column    Data type Variable type
0             Loan_ID  categorical        string
1              Gender  categorical        string
2             Married  categorical        string
3          Dependents  categorical        string
4           Education  categorical        string
5       Self_Employed  categorical        string
6     ApplicantIncome    numerical         float
7   CoapplicantIncome    numerical         float
8          LoanAmount    numerical         float
9    Loan_Amount_Term  categorical         float
10     Credit_History  categorical         float
11      Property_Area  categorical        string
12        Loan_Status  categorical        string




In [12]:
#print metadata summary
print('\n\033[1mMetadata summary\033[0m')
print(metadata.summary)


Metadata summary
{'nrows': 614, 'unique_counts': {'Loan_ID': 615.8848935456124, 'Gender': 2.000030518198964, 'Married': 2.000030518198964, 'Dependents': 4.000122075278871, 'Education': 2.000030518198964, 'Self_Employed': 2.000030518198964, 'ApplicantIncome': 503.93251285042095, 'CoapplicantIncome': 286.62587550498984, 'LoanAmount': 203.31505047385306, 'Loan_Amount_Term': 10.000763017065795, 'Credit_History': 2.000030518198964, 'Property_Area': 3.000068666646041, 'Loan_Status': 2.000030518198964}, 'missing_vals': Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64, 'duplicate': 0.00021817470207423177, 'skewness': {'ApplicantIncome': 6.539513113994625, 'CoapplicantIncome': 7.491531216657306, 'LoanAmount': 2.677
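
The `unique_counts` values above are not integers (for example, `Loan_ID` is reported as roughly 615.88 for 614 rows), which suggests they are estimates rather than exact counts. As a sanity check, similar statistics can be computed exactly with plain pandas. The sketch below is an illustration only, not YData's implementation: the helper name `exact_summary` and the tiny inline sample are hypothetical stand-ins for the loan data.

```python
import pandas as pd

def exact_summary(df: pd.DataFrame) -> dict:
    """Exact counterparts of a few summary fields (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        "nrows": len(df),
        "unique_counts": df.nunique().to_dict(),     # distinct non-null values per column
        "missing_vals": df.isna().sum().to_dict(),   # missing values per column
        "duplicate": float(df.duplicated().mean()),  # share of fully duplicated rows
        "skewness": numeric.skew().to_dict(),        # sample skewness of numeric columns
    }

# Tiny illustrative sample, not the real dataset
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005", "LP001006"],
    "Gender": ["Male", "Male", None, "Female"],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0],
})
summary = exact_summary(sample)
print(summary["unique_counts"])
print(summary["missing_vals"])
```

Comparing these exact figures with the estimated ones is a quick way to judge how far the approximation drifts on a given dataset.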

In [13]:
## Setting the target variable
metadata.target='Loan_Status'

In [15]:
print(metadata.target.name)

Loan_Status


### Updating columns datatypes
The automated inference might not be correct in all cases. For that reason, we recommend always reviewing and updating the data types according to your own understanding of the data.

The update can be done by column or for a group of columns.

In this particular example, the 'Loan_ID' column has been identified as categorical rather than as an ID. The code snippet below shows how to change the data types:

In [20]:
#Changing the data type of one or more columns
print('\n\033[1mChanging one column data type\033[0m')
metadata.columns = {'Loan_ID': 'id'}

print(f"'Loan_ID': {metadata.columns['Loan_ID'].datatype.name}")

print('\n\033[1mChanging multiple columns data types\033[0m')
metadata.columns = {'Loan_ID': 'id',
                    'Dependents': 'categorical'}

print(f"'Loan_ID': {metadata.columns['Loan_ID'].datatype.name}")
print(f"'Dependents': {metadata.columns['Dependents'].datatype.name}")


Changing one column data type
'Loan_ID': ID

Changing multiple columns data types
'Loan_ID': ID
'Dependents': CATEGORICAL
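
One rough way to spot columns that deserve a manual override is to look at their cardinality. The sketch below uses plain pandas and a hypothetical helper, `suggest_overrides`; the rules and the `max_categories` threshold are assumptions chosen for illustration, and this is not how YData infers types. A non-float column whose values are all distinct is flagged as a possible `id`, and a numeric column with very few distinct values as a possible `categorical`.

```python
import pandas as pd

def suggest_overrides(df: pd.DataFrame, max_categories: int = 10) -> dict:
    """Suggest data type overrides from column cardinality (illustrative heuristic)."""
    suggestions = {}
    for name in df.columns:
        values = df[name].dropna()
        if values.empty:
            continue
        n_unique = values.nunique()
        if n_unique == len(values) and not pd.api.types.is_float_dtype(values):
            suggestions[name] = "id"           # every non-null value is distinct
        elif pd.api.types.is_numeric_dtype(values) and n_unique <= max_categories:
            suggestions[name] = "categorical"  # numeric codes with few levels
    return suggestions

# Tiny illustrative sample, not the real dataset
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005", "LP001006"],
    "Credit_History": [1.0, 1.0, 0.0, None],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0],
})
# A low threshold is used here only because the sample has four rows
overrides = suggest_overrides(sample, max_categories=2)
print(overrides)
```

The resulting dictionary has the same column-to-type shape as the one assigned to `metadata.columns` above, so the suggestions can be reviewed and then applied by hand.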


### Filtering metadata by columns
For some activities the full metadata might not be needed, and only a subset of the columns should be considered (e.g. data synthesis). The Metadata object allows users to select only the needed columns, as in the example below.

In [22]:
filtered_metadata = metadata[['Married', 'Dependents', 'Property_Area']]

print('\n\033[1mNew available metadata\033[0m')
print(filtered_metadata)


New available metadata
Metadata Summary 
 
Dataset type: TABULAR
Dataset attributes: 
Number of columns: 3
% of duplicate rows: 0
Target column: 

Column detail: 
          Column    Data type Variable type
0        Married  categorical        string
1     Dependents  categorical        string
2  Property_Area  categorical        string

Empty DataFrame
Columns: []
Index: []
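
When selecting columns by name, it helps to confirm first that every requested name actually exists in the dataset. The guard below is a minimal sketch that works on plain lists of names; the helper `split_columns` and the example lists are hypothetical and not part of the YData API.

```python
def split_columns(requested, available):
    """Split requested names into those present in `available` and those missing."""
    available_set = set(available)
    present = [c for c in requested if c in available_set]
    missing = [c for c in requested if c not in available_set]
    return present, missing

# Illustrative names; in practice `available` would come from the dataset itself
available = ["Loan_ID", "Gender", "Married", "Dependents", "Property_Area"]
requested = ["Married", "Dependents", "Property_Area", "acarbose"]

present, missing = split_columns(requested, available)
if missing:
    print(f"Columns not found: {missing}")
print(present)
```

The `present` list can then be used as the selection passed to `metadata[...]`, while `missing` points at typos or names left over from another dataset.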



## Metadata & Constraints
The constraints engine lets the user define expectations and validations for a dataset. The engine helps identify potential inconsistencies and discrepancies between records and business rules or logic.
Constraints can also be leveraged to preserve expectations and validations while building a data engineering/preprocessing flow.

Constraints can be as complex as needed. This tutorial provides a few **Constraints** examples, from default ones (`Positive`, `GreaterThan`) to custom ones.

In [28]:
from ydata.constraints.engine import ConstraintEngine
from ydata.constraints.constraint import GreaterThan, Positive, CustomConstraint

In [30]:
c1 = GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)
c2 = Positive(columns=['CoapplicantIncome'])

ce = ConstraintEngine()
ce.add_constraints([c1,c2])
ce.validate(data)

In [32]:
ce.summary()

{'violation_count': 283,
 'violation_ratio': 0.4609120521172638,
 'violation_per_constraint': {"GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)": {'violation_count': 283,
   'violation_ratio': 0.4609120521172638,
   'validation_time': (171.41376328468323,)},
  "Positive(columns=['CoapplicantIncome'])": {'violation_count': 273,
   'violation_ratio': 0.44462540716612375,
   'validation_time': (0.07076382637023926,)}}}
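
To make the numbers in the summary easier to interpret, here is a pandas-only sketch of what row-level checks like these might compute. It encodes an assumption about the semantics, not YData's implementation: a row violates a constraint when any of the listed columns fails the comparison, and missing values count as failures because comparisons with NaN evaluate to false in pandas. The helper names and the sample are hypothetical.

```python
import pandas as pd

def violations_greater_than(df, columns, value):
    """Boolean mask: True where any listed column is not strictly greater than `value`."""
    return ~(df[columns] > value).all(axis=1)

def summarize(masks):
    """Combine per-constraint masks into overall and per-constraint counts."""
    any_violation = pd.concat(masks, axis=1).any(axis=1)
    return {
        "violation_count": int(any_violation.sum()),
        "violation_ratio": float(any_violation.mean()),
        "violation_per_constraint": {
            name: int(mask.sum()) for name, mask in masks.items()
        },
    }

# Tiny illustrative sample, not the real dataset
sample = pd.DataFrame({
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 1000.0],
    "LoanAmount": [None, 128.0, 66.0, 120.0, None],
})
masks = {
    "both_gt_0": violations_greater_than(sample, ["CoapplicantIncome", "LoanAmount"], 0),
    "coapplicant_gt_0": violations_greater_than(sample, ["CoapplicantIncome"], 0),
}
report = summarize(masks)
print(report)
```

Under this reading, the overall count is the number of rows that break at least one constraint, which is consistent with the summary above, where the overall count (283) matches the count of the broader `GreaterThan` constraint.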

### Constraints integration with Metadata
The Constraints Engine can be easily integrated with your Metadata object and downstream applications such as YData synthesizers. The example below showcases how to add constraints to the Metadata and leverage them to profile the data.

In [33]:
const_metadata = Metadata(data, constraints=[c1,c2])

In [35]:
const_metadata.summary['constraints']

{'violation_count': 283,
 'violation_ratio': 0.4609120521172638,
 'violation_per_constraint': {"GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)": {'violation_count': 283,
   'violation_ratio': 0.4609120521172638,
   'validation_time': (179.3317892551422,)},
  "Positive(columns=['CoapplicantIncome'])": {'violation_count': 273,
   'violation_ratio': 0.44462540716612375,
   'validation_time': (0.08627104759216309,)}}}
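
Custom business rules follow the same idea: any function that maps a table to a boolean mask can act as a check inside a preprocessing flow. The sketch below is a pandas-only illustration. The rule itself (combined income must reach a made-up minimum) and the names `income_rule` and `split_by_rule` are assumptions invented for this example; they are not part of the YData API and do not show how `CustomConstraint` is used.

```python
import pandas as pd

def income_rule(df: pd.DataFrame, minimum: float = 3000.0) -> pd.Series:
    """True where applicant plus coapplicant income reaches `minimum` (made-up rule)."""
    total = df["ApplicantIncome"] + df["CoapplicantIncome"]
    return total >= minimum

def split_by_rule(df: pd.DataFrame, rule):
    """Return (valid_rows, invalid_rows) according to a boolean rule."""
    ok = rule(df)
    return df[ok], df[~ok]

# Tiny illustrative sample, not the real dataset
sample = pd.DataFrame({
    "ApplicantIncome": [5849.0, 1977.0, 2484.0],
    "CoapplicantIncome": [0.0, 997.0, 2302.0],
})
valid, invalid = split_by_rule(sample, income_rule)
print(len(valid), len(invalid))
```

Keeping the invalid rows in a separate frame, rather than dropping them silently, makes it possible to inspect them before deciding how the flow should treat them.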