# The Metadata and Constraints

**Metadata** objects are the pillar of YData's package features. The **Metadata** object can be shared across the different features and elements of the YData platform: *profiling*, *synthesizer* and *report*.
    
- The object helps you extract the main information from your dataset:
    - The **columns metadata**: both the variable type and the data type (numerical, categorical, etc.) of each column
    - The **data warnings**: checks for the presence of duplicates, skewed variables, etc.
    - The **data constraints**: business rules used to validate the data. Constraints are flexible and easy to use.
    
In this notebook, we show the features and capabilities of these objects and how to combine them with other pieces of YData's package offering.

The dataset used to explore the Metadata and Constraints can be found on ["Kaggle - Home Loans"](https://www.kaggle.com/code/sazid28/home-loan-prediction/data).

In [1]:
from ydata.labs import DataSources
from ydata.metadata import Metadata

In [2]:
datasource = DataSources.get(uid='{uid}', namespace='{namespace}')
data = datasource.dataset

In [3]:
data.head(100)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,LP001326,Male,No,0,Graduate,,6782,0.0,,360.0,,Urban,N
96,LP001327,Female,Yes,0,Graduate,No,2484,2302.0,137.0,360.0,1.0,Semiurban,Y
97,LP001333,Male,Yes,0,Graduate,No,1977,997.0,50.0,360.0,1.0,Semiurban,Y
98,LP001334,Male,Yes,0,Not Graduate,No,4188,0.0,115.0,180.0,1.0,Semiurban,Y


In [4]:
#Extract the Metadata from the Dataset
metadata = Metadata(data)
print(metadata)

[########################################] | 100% Completed | 101.87 ms
[########################################] | 100% Completed | 101.96 ms
[########################################] | 100% Completed | 102.49 ms
[########################################] | 100% Completed | 102.35 ms
[########################################] | 100% Completed | 594.00 ms
Metadata Summary

Dataset type: TABULAR
Dataset attributes: 
Number of columns: 13
Duplicate rows: 1
Target column: 

Column detail: 
               Column    Data type Variable type Characteristics
0             Loan_ID       string        string              id
1              Gender  categorical        string                
2             Married  categorical        string                
3          Dependents  categorical        string                
4           Education  categorical        string                
5       Self_Employed  categorical        string          

In [5]:
#print metadata summary
print('\n\033[1mMetadata summary\033[0m')
print(metadata.summary)


Metadata summary
{'nrows': 614, 'cardinality': {'Loan_ID': 614, 'Gender': 2, 'Married': 2, 'Dependents': 4, 'Education': 2, 'Self_Employed': 2, 'ApplicantIncome': 505, 'CoapplicantIncome': 287, 'LoanAmount': 203, 'Loan_Amount_Term': 10, 'Credit_History': 2, 'Property_Area': 3, 'Loan_Status': 2}, 'iscategorical': {'Loan_ID': 614.0, 'Gender': 2.0, 'Married': 2.0, 'Dependents': 4.0, 'Education': 2.0, 'Self_Employed': 2.0, 'ApplicantIncome': 505.0, 'CoapplicantIncome': 287.0, 'LoanAmount': 203.0, 'Loan_Amount_Term': 10.0, 'Credit_History': 2.0, 'Property_Area': 3.0, 'Loan_Status': 2.0}, 'missings': Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64, 'duplicates': 1, 'skewness': {'ApplicantIncome': 6.523526250899
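
To make the summary fields concrete, here is a minimal pure-Python sketch of how a row count, per-column cardinality, missing-value counts and a duplicate-row count could be derived from a small table. The `describe` helper and the sample rows are hypothetical and only illustrate the idea; this is not how the `Metadata` object computes its summary.

```python
from collections import Counter

# Tiny hypothetical sample; None marks a missing value.
rows = [
    {"Gender": "Male", "Married": "No", "ApplicantIncome": 5849},
    {"Gender": "Male", "Married": "Yes", "ApplicantIncome": 4583},
    {"Gender": None, "Married": "Yes", "ApplicantIncome": 3000},
    {"Gender": "Male", "Married": "No", "ApplicantIncome": 5849},
]


def describe(rows):
    """Hypothetical helper: compute a few summary statistics for a list of dict rows."""
    columns = list(rows[0])
    # Number of distinct non-missing values per column.
    cardinality = {c: len({r[c] for r in rows if r[c] is not None}) for c in columns}
    # Number of missing values per column.
    missings = {c: sum(r[c] is None for r in rows) for c in columns}
    # A row counts as a duplicate each time it repeats an earlier identical row.
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {
        "nrows": len(rows),
        "cardinality": cardinality,
        "missings": missings,
        "duplicates": duplicates,
    }


summary = describe(rows)
print(summary)
```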

In [6]:
## Setting the target variable
metadata.target='Loan_Status'

In [7]:
print(metadata.target.name)

Loan_Status


### Updating columns datatypes
Automated inference might not be correct in all cases. For that reason, we always recommend updating the datatypes according to your understanding of the data.

The update can be done by column or for a group of columns.

The code snippet below shows how to change the datatypes.

In [8]:
# Change the data type of a single column
print('\n\033[1mChanging one column data type\033[0m')
metadata.update_datatypes({'Dependents': 'categorical'})
print(f"'Dependents': {metadata.columns['Dependents'].datatype.name}")


Changing one column data type
'Dependents': CATEGORICAL
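
To see why automated inference can need a manual correction, consider the sketch below. It uses a deliberately naive, cardinality-based rule (the `infer_datatype` helper and its `max_categories` threshold are assumptions made for illustration, not YData's inference logic) and then applies user overrides on top of the inferred types.

```python
def infer_datatype(values, max_categories=10):
    """Naive guess: few distinct values means categorical, otherwise numerical or string."""
    present = [v for v in values if v is not None]
    if len(set(present)) <= max_categories:
        return "categorical"
    if all(isinstance(v, (int, float)) for v in present):
        return "numerical"
    return "string"


# Hypothetical column samples; None marks a missing value.
columns = {
    "Dependents": ["0", "1", "0", "3+", "2"],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, 360.0, None],
    "ApplicantIncome": list(range(2000, 2020)),
}

inferred = {name: infer_datatype(vals) for name, vals in columns.items()}

# A numeric column with few distinct values is guessed as categorical,
# so the user corrects it with an explicit override.
overrides = {"Loan_Amount_Term": "numerical"}
final = {**inferred, **overrides}

print(inferred)
print(final)
```

Here `Loan_Amount_Term` has only two distinct values in the sample, so the naive rule labels it categorical even though it is a duration; the override restores the intended type.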


### Filtering metadata by columns
For some activities the full metadata is not needed, and only a subset of the columns should be considered (e.g. data synthesis). The Metadata object lets you select only the columns you need, as in the example below.

In [9]:
filtered_metadata = metadata[['Married', 'Dependents', 'Property_Area']]

print('\n\033[1mNew available metadata\033[0m')
print(filtered_metadata)


New available metadata
Metadata Summary

Dataset type: TABULAR
Dataset attributes: 
Number of columns: 3
Duplicate rows: 1
Target column: 

Column detail: 
          Column    Data type Variable type Characteristics
0        Married  categorical        string                
1     Dependents  categorical        string                
2  Property_Area  categorical        string                

0   missings  [Married, Dependents]
1  imbalance           [Dependents]
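
Because a mistyped column name is easy to miss when filtering, it can help to compare the requested names against the dataset's columns first. The sketch below is a minimal illustration; `split_requested` is a hypothetical helper and not part of the YData API.

```python
def split_requested(requested, available):
    """Return (known, unknown) column names, preserving the requested order."""
    available_set = set(available)
    known = [name for name in requested if name in available_set]
    unknown = [name for name in requested if name not in available_set]
    return known, unknown


# A subset of the dataset's column names.
available = ["Loan_ID", "Gender", "Married", "Dependents", "Property_Area", "Loan_Status"]
# The last requested name is a deliberate typo for illustration.
requested = ["Married", "Dependents", "Property_Area", "Loan_Amount"]

known, unknown = split_requested(requested, available)
print("known:", known)
print("unknown:", unknown)
```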



## Metadata & Constraints
The constraints engine allows the user to define expectations and validations for a dataset. This engine is helpful for identifying potential inconsistencies and discrepancies between records and business rules or logic.
Constraints can also be leveraged to keep expectations and validations in place while building a data engineering/preprocessing flow.

Constraints can be as complex as needed. This tutorial provides a few **Constraints** examples, from default (`Positive`, `GreaterThan`) to custom.

In [10]:
from ydata.constraints.engine import ConstraintEngine
from ydata.constraints.rows import GreaterThan, Positive, CustomConstraint

In [11]:
c1 = GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)
c2 = Positive(columns=['CoapplicantIncome'])

ce = ConstraintEngine()
ce.add_constraints([c1, c2])
ce.validate(data)

sample size: 10538
sample size: 10538


In [12]:
ce.summary()

{'rows_violation_count': 283,
 'rows_violation_ratio': 0.4609120521172638,
 'violation_per_constraint': {"GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)": {'rows_violation_count': 283,
   'rows_violation_ratio': 0.4609120521172638,
   'validation_time': 0.03},
  "Positive(columns=['CoapplicantIncome'])": {'rows_violation_count': 273,
   'rows_violation_ratio': 0.44462540716612375,
   'validation_time': 0.02}}}
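
The structure of this summary can be illustrated with a pure-Python sketch of row-level checks. It is a simplified model under stated assumptions: `greater_than` and `summarize` are hypothetical helpers, the sample rows are made up (modeled on the values shown earlier), `Positive` is treated as "strictly greater than 0", and a missing value is treated as a violation. None of this is confirmed as the engine's actual implementation.

```python
import math

# Hypothetical sample rows; math.nan marks a missing value.
rows = [
    {"CoapplicantIncome": 0.0, "LoanAmount": math.nan},
    {"CoapplicantIncome": 1508.0, "LoanAmount": 128.0},
    {"CoapplicantIncome": 0.0, "LoanAmount": 66.0},
    {"CoapplicantIncome": 2358.0, "LoanAmount": 120.0},
    {"CoapplicantIncome": 0.0, "LoanAmount": 141.0},
    {"CoapplicantIncome": 1000.0, "LoanAmount": math.nan},
]


def greater_than(columns, value):
    """Build a row check that passes only if every listed column exceeds `value`."""
    def check(row):
        # Comparisons with NaN are False, so missing values fail the check.
        return all(row[c] > value for c in columns)
    return check


constraints = {
    "GreaterThan": greater_than(["CoapplicantIncome", "LoanAmount"], 0),
    "Positive": greater_than(["CoapplicantIncome"], 0),
}


def summarize(rows, constraints):
    """Count violating rows per constraint and across all constraints."""
    per_constraint = {}
    violating_rows = set()
    for name, check in constraints.items():
        failed = {i for i, row in enumerate(rows) if not check(row)}
        violating_rows |= failed
        per_constraint[name] = {
            "rows_violation_count": len(failed),
            "rows_violation_ratio": len(failed) / len(rows),
        }
    return {
        "rows_violation_count": len(violating_rows),
        "rows_violation_ratio": len(violating_rows) / len(rows),
        "violation_per_constraint": per_constraint,
    }


result = summarize(rows, constraints)
print(result)
```

The top-level count is the number of rows that violate at least one constraint, and each ratio is a count divided by the number of rows. In the output above, 283 violating rows out of the 614 rows reported in the metadata summary gives 283 / 614 ≈ 0.4609.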

### Constraints integration with Metadata
The constraints engine can be easily integrated with your Metadata object and downstream applications such as YData synthesizers. The example below showcases how to add constraints to the Metadata and leverage them to profile the data.

In [13]:
const_metadata = Metadata(data, constraints=[c1, c2])

[########################################] | 100% Completed | 101.47 ms
[########################################] | 100% Completed | 102.54 ms
[########################################] | 100% Completed | 104.66 ms
[########################################] | 100% Completed | 102.43 ms
[########################################] | 100% Completed | 535.00 ms
sample size: 10538
[########################################] | 100% Completed | 102.20 ms
sample size: 10538
[########################################] | 100% Completed | 102.26 ms


In [14]:
const_metadata.summary['constraints']

{'rows_violation_count': 283,
 'rows_violation_ratio': 0.4609120521172638,
 'violation_per_constraint': {"GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)": {'rows_violation_count': 283,
   'rows_violation_ratio': 0.4609120521172638,
   'validation_time': 0.14},
  "Positive(columns=['CoapplicantIncome'])": {'rows_violation_count': 273,
   'rows_violation_ratio': 0.44462540716612375,
   'validation_time': 0.14}}}