# Data Drift Monitoring with YData Fabric

YData Fabric’s dataset comparison and metrics tracking enable users to build a continuous data drift monitoring system. Monitoring for potential drift safeguards data quality by continuously tracking shifts in data distributions.

By integrating drift detection into recurrent pipelines, **Fabric** automatically flags significant changes, enabling teams to address data issues before they impact model performance. 

Additionally, Fabric’s visual compare profiling provides clear insights into the nature and scope of drifts, helping users validate and interpret changes effectively. This combination of automated monitoring and visual profiling within YData Fabric safeguards data quality, ensuring reliable model outputs and analytics over time.

This use case shows how to build a recurrent data pipeline that validates potential data shifts in tables from a MySQL database. The dataset used in this example is a [data warehouse of the AdventureWorks DB](https://learn.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16&tabs=ssms).

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{insert-datasource-uid}', 
                             namespace='{insert-namespace-uid}')
ogdataset = datasource.dataset
# Getting the calculated Metadata to get the profile overview information in the labs
ogmetadata = datasource.metadata


In [4]:
# This cell reads only a partial set of the AdventureWorks database
# Importing YData's packages
from ydata.labs import Connectors
from ydata.metadata import Metadata
# Getting a previously created Connector
connector = Connectors.get(uid='{insert-connector-uid}',
                           namespace='{insert-namespace-uid}')

# Query only the internet sales ordered on or after November 1, 2013
latest = connector.query(
    """select *
    from FactInternetSales
    where FactInternetSales.OrderDate_ >= '2013-11-01';"""
)

latest_metadata = Metadata(latest)


## Compare the profiles of both datasets

The profile comparison feature helps detect potential distribution drifts by providing a side-by-side analysis of datasets. By comparing current data distributions with historical baselines, it is possible to quickly identify shifts in key variables, revealing changes that might impact model performance or analytics reliability. This allows data teams to monitor for drift continuously, so that evolving data characteristics are detected early and data integrity is maintained over time.
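To make the idea of comparing a current distribution against a historical baseline concrete, here is a minimal sketch of one common drift score, the Population Stability Index (PSI), for a categorical column. It is not how Fabric or the profiling report computes drift; the function name `psi`, the `eps` floor, and the sample values are assumptions made purely for illustration.

```python
import math
from collections import Counter


def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two categorical samples.

    Illustrative helper (not part of YData Fabric). Each category contributes
    (q - p) * ln(q / p), where p and q are its shares in the baseline and
    current samples. Shares are floored at `eps` to avoid log(0).
    """
    base_counts = Counter(baseline)
    curr_counts = Counter(current)
    score = 0.0
    for category in set(base_counts) | set(curr_counts):
        p = max(base_counts[category] / len(baseline), eps)
        q = max(curr_counts[category] / len(current), eps)
        score += (q - p) * math.log(q / p)
    return score


# Hypothetical values for a single categorical column
baseline_sample = ["US"] * 70 + ["CA"] * 30
current_sample = ["US"] * 40 + ["CA"] * 60

drift_score = psi(baseline_sample, current_sample)
no_drift_score = psi(baseline_sample, baseline_sample)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the data and should be tuned per use case.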

In [5]:
from ydata.profiling import ProfileReport

report = ProfileReport(ogdataset, title='All data prior to November 2013')
report2 = ProfileReport(latest, title='All data starting from November 2013')

# Compare both profiles
compare = report.compare(report2)
compare



## Calculate the Metadata difference

In [None]:
from metadata.metadata_compare import calculate_diff, get_stats_diff

summary_diff = calculate_diff(ogmetadata, latest_metadata, ogmetadata.categorical_vars)
stats_diff = get_stats_diff(ogdataset, latest)
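
The `calculate_diff` and `get_stats_diff` helpers come from a module whose source is not shown here, so their exact output is not documented in this example. As a rough illustration of the kind of per-column comparison such a step might produce, the sketch below computes the change in mean and standard deviation for numeric columns. The function `numeric_stats_diff`, its input shape (a dict of column name to list of values), and the sample numbers are all hypothetical.

```python
from statistics import mean, pstdev


def numeric_stats_diff(baseline, latest):
    """Return mean and standard deviation deltas for columns present in both inputs.

    Hypothetical stand-in for a stats-diff step; both arguments map a column
    name to a list of numeric values.
    """
    diff = {}
    for column in sorted(baseline.keys() & latest.keys()):
        base_values, latest_values = baseline[column], latest[column]
        diff[column] = {
            "mean_delta": mean(latest_values) - mean(base_values),
            "std_delta": pstdev(latest_values) - pstdev(base_values),
        }
    return diff


# Made-up numbers; the column name is only an example
example_diff = numeric_stats_diff(
    {"SalesAmount": [10.0, 20.0, 30.0]},
    {"SalesAmount": [20.0, 30.0, 40.0]},
)
```

In this made-up case the mean moves while the spread stays the same, which is the kind of signal a recurring drift check would surface.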

In [None]:
import pickle

# Persist both diffs so they can be reused outside this notebook
with open('summary_diff.pkl', 'wb') as file:
    pickle.dump(summary_diff, file)

with open('stats_diff.pkl', 'wb') as file:
    pickle.dump(stats_diff, file)
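
A later step can load the pickled diffs and flag columns whose change exceeds a threshold. The sketch below assumes, purely for illustration, that a diff is a dict mapping column names to a single numeric score. The real structure of `summary_diff` and `stats_diff` is not shown in this example, and `load_diff`, `flag_drifted`, and the threshold value are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_diff(path):
    """Load a pickled diff object. Only unpickle files from trusted sources."""
    with open(path, "rb") as file:
        return pickle.load(file)


def flag_drifted(diff, threshold):
    """Return the sorted column names whose absolute score exceeds the threshold."""
    return sorted(col for col, score in diff.items() if abs(score) > threshold)


# Round trip with made-up scores, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    diff_path = Path(tmp_dir) / "stats_diff.pkl"
    with open(diff_path, "wb") as file:
        pickle.dump({"SalesAmount": 0.42, "TaxAmt": 0.03}, file)
    flagged = flag_drifted(load_diff(diff_path), threshold=0.1)
```

Keeping the threshold as an explicit argument makes it easy to tune per column or per pipeline run.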

## Define the Pipeline outputs

In [None]:
import json

profile_pipeline_output = {
    'outputs': [
        {
            'type': 'web-app',
            'storage': 'inline',
            'source': compare.to_html(),
        },
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(profile_pipeline_output, metadata_file)
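
The file name `mlpipeline-ui-metadata.json` matches the Kubeflow Pipelines output viewer convention, which also defines an inline `markdown` output type. Assuming the pipeline engine in use renders that type (this is not confirmed by the example above), a short drift summary could be published next to the HTML report. The helper `build_ui_metadata` and its inputs are hypothetical.

```python
import json


def build_ui_metadata(report_html, flagged_columns):
    """Build a UI metadata dict with the HTML report and a markdown summary.

    Hypothetical helper; assumes the Kubeflow-style 'web-app' and 'markdown'
    inline output types are supported by the pipeline engine.
    """
    lines = ["# Drift summary", ""]
    if flagged_columns:
        lines += [f"- {column}" for column in flagged_columns]
    else:
        lines.append("No columns exceeded the drift threshold.")
    return {
        "outputs": [
            {"type": "web-app", "storage": "inline", "source": report_html},
            {"type": "markdown", "storage": "inline", "source": "\n".join(lines)},
        ]
    }


# Placeholder HTML and column name, used only to show the resulting structure
ui_metadata = build_ui_metadata("<html><body>report</body></html>", ["SalesAmount"])
serialized = json.dumps(ui_metadata)
```

In the notebook, `report_html` would be `compare.to_html()` and `serialized` would be written to `mlpipeline-ui-metadata.json` as in the cell above.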