# The Metadata and Constraints

**Metadata** objects are the pillar of the [ydata-sdk](https://pypi.org/project/ydata-sdk/) package. A **Metadata** object can be shared across the different features and elements of ydata-sdk: *profiling*, *synthesizer* and *report*.

- The object helps you extract the main information from your dataset:
    - The **columns metadata**: Both Variable and Data type (numerical, categorical, etc.)
    - The **data warnings**: Checks for the presence of duplicates, variables with skewness, etc.
    - The **data constraints**: Business rules used to validate the data. Constraints are flexible and easy to use.
    
In this notebook, we show the features and capabilities of these objects and how to combine them with other parts of YData's package.

The dataset used to explore the Metadata and Constraints can be found in ["Kaggle - Home Loans"](https://www.kaggle.com/code/sazid28/home-loan-prediction/data).

## Authenticate with your YData account

In [1]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'

## Create your Dataset & Metadata

In [2]:
import pandas as pd
from ydata.dataset import Dataset

df = pd.read_csv('insert-file-path.csv')
dataset = Dataset(df)

In [3]:
dataset.head(100)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban
...,...,...,...,...,...,...,...,...,...,...,...,...
95,LP001499,Female,Yes,3+,Graduate,No,6260,0,110.0,360.0,1.0,Semiurban
96,LP001500,Male,Yes,1,Graduate,No,3333,4200,256.0,360.0,1.0,Urban
97,LP001501,Male,Yes,0,Graduate,No,3500,3250,140.0,360.0,1.0,Semiurban
98,LP001517,Male,Yes,3+,Graduate,No,9719,0,61.0,360.0,1.0,Urban


In [4]:
from ydata.metadata import Metadata
#Extract the Metadata from the Dataset
metadata = Metadata(dataset)
print(metadata)

[########################################] | 100% Completed | 107.96 ms
[########################################] | 100% Completed | 105.17 ms
[########################################] | 100% Completed | 104.99 ms
Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 12
Number of rows: 367
Duplicate rows: 1
Target column:

Column detail:
               Column    Data type Variable type Characteristics
0             Loan_ID       string        string                
1              Gender  categorical        string                
2             Married  categorical        string                
3          Dependents  categorical        string                
4           Education  categorical        string                
5       Self_Employed  categorical        string                
6     ApplicantIncome    numerical           int                
7   CoapplicantIncome    numerical          

In [5]:
# Print the metadata summary
print('\n\033[1mMetadata summary\033[0m')
print(metadata.summary)


Metadata summary
{'nrows': 367, 'cardinality': {'Loan_ID': 367, 'Gender': 2, 'Married': 2, 'Dependents': 4, 'Education': 2, 'Self_Employed': 2, 'ApplicantIncome': 314, 'CoapplicantIncome': 194, 'LoanAmount': 144, 'Loan_Amount_Term': 12, 'Credit_History': 2, 'Property_Area': 3}, 'duplicates': 1, 'missings': {'Loan_ID': np.int64(0), 'Gender': np.int64(11), 'Married': np.int64(0), 'Dependents': np.int64(10), 'Education': np.int64(0), 'Self_Employed': np.int64(23), 'ApplicantIncome': np.int64(0), 'CoapplicantIncome': np.int64(0), 'LoanAmount': np.int64(5), 'Loan_Amount_Term': np.int64(6), 'Credit_History': np.int64(29), 'Property_Area': np.int64(0)}, 'skewness': {'ApplicantIncome': np.float64(8.40683417612701), 'CoapplicantIncome': np.float64(4.239936499795931), 'LoanAmount': np.float64(2.214288133774681), 'Loan_Amount_Term': np.float64(-2.6681719962961696), 'Credit_History': np.float64(-1.7147253696648368)}, 'infinity': {'ApplicantIncome': np.int64(0), 'CoapplicantIncome': np.int
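
To make the summary keys easier to read, here is a minimal pandas sketch that computes comparable figures (row count, cardinality, duplicates, missing values, skewness) on a tiny frame. The values are hypothetical, and we are assuming, not asserting, that these pandas definitions match the ones used internally by the Metadata object.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the Home Loans data).
df = pd.DataFrame(
    {
        "Gender": ["Male", "Female", None, "Male", "Male"],
        "ApplicantIncome": [5720, 3076, 5000, 2340, 5720],
        "LoanAmount": [110.0, 126.0, None, 100.0, 110.0],
    }
)

summary = {
    # Number of rows in the frame.
    "nrows": len(df),
    # Distinct non-missing values per column.
    "cardinality": df.nunique().to_dict(),
    # Rows that are exact copies of an earlier row.
    "duplicates": int(df.duplicated().sum()),
    # Missing values per column.
    "missings": df.isna().sum().to_dict(),
    # Skewness is only defined for the numerical columns.
    "skewness": df.select_dtypes("number").skew().to_dict(),
}
print(summary)
```

In this sketch the last row repeats the first one, so `duplicates` is 1, and skewness is reported only for the two numerical columns.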

In [7]:
# Set the target variable
metadata.target = 'Dependents'

In [8]:
print(metadata.target.name)

Dependents


### Updating column data types
Automated inference is not always correct. For that reason, we recommend reviewing the inferred data types and updating them according to your own understanding of the data.

The update can be done by column or for a group of columns.

The code snippet below shows how to change the datatypes.

In [9]:
# Change the data type of a single column
print('\n\033[1mChanging one column data type\033[0m')
metadata.update_datatypes({'Dependents': 'categorical'})
print(f"'Dependents': {metadata.columns['Dependents'].datatype.name}")


Changing one column data type
'Dependents': CATEGORICAL
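
One quick way to decide whether a column such as `Dependents` should be treated as categorical rather than numerical is to check which values fail a numeric conversion. The sketch below uses plain pandas and the values visible in the dataset preview. It is an independent check, not part of the ydata-sdk API.

```python
import pandas as pd

# Values taken from the `Dependents` column shown in the dataset preview,
# plus one missing entry.
dependents = pd.Series(["0", "1", "2", "3+", None], name="Dependents")

# Coerce to numbers; anything that is not a valid number becomes NaN.
as_number = pd.to_numeric(dependents, errors="coerce")

# Non-missing values that could not be converted.
not_numeric = dependents[as_number.isna() & dependents.notna()]
print(not_numeric.tolist())  # → ['3+']
```

Because `3+` cannot be parsed as a number, keeping `Dependents` as a categorical column is the safer choice.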


### Filtering metadata by columns
For some activities (e.g. data synthesis) the full metadata is not needed, and only a subset of the columns should be considered. The Metadata object lets you select only the columns you need, as in the example below.

In [10]:
filtered_metadata = metadata[['Married', 'Dependents', 'Property_Area']]

print('\n\033[1mNew available metadata\033[0m')
print(filtered_metadata)


New available metadata
Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 3
Number of rows: 367
Duplicate rows: 1
Target column:

Column detail:
          Column    Data type Variable type Characteristics
0        Married  categorical        string                
1     Dependents  categorical        string                
2  Property_Area  categorical        string                

0   missings  [Dependents]
1  imbalance  [Dependents]
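
When the list of columns comes from another project or a configuration file, it can help to check it against the dataset's columns before filtering. The sketch below is plain Python. The column names come from the metadata summary above, and `not_a_column` is a hypothetical name added for illustration.

```python
# Column names as listed in the metadata summary.
available = [
    "Loan_ID", "Gender", "Married", "Dependents", "Education",
    "Self_Employed", "ApplicantIncome", "CoapplicantIncome",
    "LoanAmount", "Loan_Amount_Term", "Credit_History", "Property_Area",
]

# Requested selection; "not_a_column" is a hypothetical, non-existent name.
requested = ["Married", "not_a_column", "Dependents", "Property_Area"]

# Split the request into names that exist and names that do not,
# preserving the requested order.
present = [col for col in requested if col in available]
missing = [col for col in requested if col not in available]

print("Keeping:", present)
print("Not in dataset:", missing)
```

The `present` list can then be used for the selection, for example `metadata[present]`, so that any unknown names are reported explicitly.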



## Metadata & Constraints
The constraints engine lets you define expectations and validations for a dataset. It is helpful for identifying potential inconsistencies and discrepancies between records and business rules or logic.
Constraints can also be leveraged to keep expectations and validations in place while building a data engineering/preprocessing flow.

Constraints can be as complex as needed. This tutorial provides a few **Constraints** examples, from the default ones (`Positive`, `GreaterThan`) to custom ones.

In [11]:
from ydata.constraints.engine import ConstraintEngine
from ydata.constraints.rows import GreaterThan, Positive, CustomConstraint

In [12]:
c1 = GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)
c2 = Positive(columns=['CoapplicantIncome'])

ce = ConstraintEngine()
ce.add_constraints([c1, c2])
ce.validate(dataset)

In [13]:
ce.summary()

{'rows_violation_count': np.int64(158),
 'rows_violation_ratio': 0.4305177111716621,
 'violation_per_constraint': {"GreaterThan(columns=['CoapplicantIncome', 'LoanAmount'], value=0)": {'rows_violation_count': np.int64(158),
   'rows_violation_ratio': 0.4305177111716621,
   'validation_time': 0.02},
  "Positive(columns=['CoapplicantIncome'])": {'rows_violation_count': np.int64(156),
   'rows_violation_ratio': 0.4250681198910082,
   'validation_time': 0.02}}}
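
To help read this summary, the sketch below reproduces the same kind of figures with plain pandas on a few hypothetical rows. We assume that both rules mean "strictly greater than 0" and that the overall count is the number of rows violating at least one rule. Both assumptions are consistent with the output above (158 / 367 ≈ 0.4305), but neither is taken from the library's documentation, and the treatment of missing values is not covered here.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
df = pd.DataFrame(
    {
        "CoapplicantIncome": [0, 1500, 1800, 0, 2546],
        "LoanAmount": [110.0, 126.0, 0.0, 78.0, 100.0],
    }
)

# One boolean column per rule: True where the row violates the rule.
violations = pd.DataFrame(
    {
        # Violated when either column is not strictly greater than 0.
        "greater_than_0": ~(df[["CoapplicantIncome", "LoanAmount"]] > 0).all(axis=1),
        # Violated when CoapplicantIncome is not strictly positive.
        "positive": ~(df["CoapplicantIncome"] > 0),
    }
)

# Violations per rule, and rows violating at least one rule.
per_rule = {name: int(count) for name, count in violations.sum().items()}
any_rule = violations.any(axis=1)

report = {
    "rows_violation_count": int(any_rule.sum()),
    "rows_violation_ratio": float(any_rule.mean()),
    "violation_per_constraint": per_rule,
}
print(report)
```

In this sketch every row that breaks the `positive` rule also breaks `greater_than_0`, which is why the overall count equals the larger of the two per-rule counts, as it does in the summary above.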