# Lab 09 - Data Quality Monitoring

During this lab we will explore techniques for monitoring data quality in machine learning systems.

By "data quality" we refer to the characteristics of data with respect to its completeness,
correctness and reliability. Poor data quality can lead to poorly performing models and unreliable
predictions.

Examples of data quality issues include:
- missing values
- incorrect or inconsistent data entries
- presence of outliers
- impossible values
- contradictory data
- inconsistent formatting
- wrong data types
- duplicate records
- etc.

## 1. Data Validation

Data validation is the process of checking whether data conforms to predefined rules and
constraints. It is used to ensure that the data is accurate, complete, and reliable before it is
used for analysis or modeling.

Here, we focus mainly on automated data validation techniques that can be integrated into data
pipelines. This is somewhat different from the data validation that is performed at runtime in
applications to validate user inputs, web form data, etc.
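
As a rough illustration of what "predefined rules and constraints" can look like inside a pipeline,
the sketch below expresses three rules as boolean masks in plain pandas and collects the failing
rows for each rule. The table, the column names (`age`, `plan`, `start_date`, `end_date`) and the
thresholds are hypothetical and only serve the example; dedicated validation libraries offer
declarative equivalents with richer reporting.

```python
import pandas as pd

# Hypothetical toy table; column names and rules are illustrative only.
df = pd.DataFrame(
    {
        "age": [34, 51, -3, 28],
        "plan": ["basic", "premium", "basic", "gold"],
        "start_date": pd.to_datetime(
            ["2024-01-01", "2024-02-01", "2024-03-01", "2024-05-01"]
        ),
        "end_date": pd.to_datetime(
            ["2024-06-01", "2024-01-15", "2024-09-01", "2024-08-01"]
        ),
    }
)

# Each rule maps a name to a boolean Series: True means the row passes.
rules = {
    "age_in_range": df["age"].between(0, 120),
    "plan_is_known": df["plan"].isin(["basic", "premium"]),
    "start_not_after_end": df["start_date"] <= df["end_date"],
}

# Machine-readable summary: rule name -> index labels of the failing rows.
report = {name: passed[~passed].index.tolist() for name, passed in rules.items()}
print(report)
# {'age_in_range': [2], 'plan_is_known': [3], 'start_not_after_end': [1]}
```

A pipeline step could, for example, stop processing or quarantine the affected rows whenever any
list in `report` is non-empty.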

There are multiple open source libraries for data validation in Python, some of which are part of
larger commercial systems. Popular ones include:

- Great Expectations - https://github.com/great-expectations/great_expectations
- Pandera - https://pandera.readthedocs.io
- Pointblank - https://posit-dev.github.io/pointblank/
- Soda Core - https://github.com/sodadata/soda-core

Your task is to implement the following scenario:

- Use the following dataset:
- Get familiar with the dataset - gain a basic understanding of the domain the dataset represents
- Introduce some errors/disturbances into the dataset:
    - replace some values randomly with NaN
    - for a categorical attribute, replace some of the items with new values that should be
      considered incorrect
    - change some values to be outside of the expected range
    - change some values of an integer-valued attribute to have a fractional part
    - introduce some errors that are valid with respect to the domain of each individual attribute
      but invalid when the relationship between attributes is considered, e.g., a start date that
      is later than the end date
- Use one of the data validation libraries to model the data validation rules and detect the
  errors/disturbances. Here, you are in a privileged position because you know which errors were
  introduced into the dataset. Nevertheless, try to use the data exploration features of the
  library to detect the errors/disturbances before you implement the rules.
- Investigate the results. Depending on the library, you should be able to generate a
  machine-readable report and/or a human-readable report. A human-readable report may be reviewed by
  a human responsible for data quality, while a machine-readable report may be used to implement
  automated pipelines that can detect errors/disturbances in the data and take appropriate actions.
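
The error-injection step of the scenario could be sketched as follows. This is a minimal example
under stated assumptions, not a reference solution: the synthetic table, its column names, the 5%
corruption rate, the sentinel values (`"premum"`, `999`) and the helper `pick_rows` are all
hypothetical and should be adapted to the actual dataset. Each kind of error is applied to a
different column so that the injected row sets do not overwrite each other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed keeps the corruption reproducible

# Hypothetical clean table standing in for the lab dataset.
n = 200
start = pd.Timestamp("2024-01-01") + pd.to_timedelta(rng.integers(0, 100, size=n), unit="D")
clean = pd.DataFrame(
    {
        "age": rng.integers(18, 90, size=n).astype(float),  # float so it can hold NaN
        "height_cm": rng.integers(150, 200, size=n),
        "visits": rng.integers(0, 20, size=n).astype(float),  # integer semantics
        "plan": rng.choice(["basic", "premium"], size=n),
        "start_date": start,
        "end_date": start + pd.to_timedelta(rng.integers(1, 60, size=n), unit="D"),
    }
)


def pick_rows(frame, fraction=0.05):
    """Return randomly chosen index labels covering `fraction` of the rows."""
    k = max(1, int(len(frame) * fraction))
    return rng.choice(frame.index.to_numpy(), size=k, replace=False)


dirty = clean.copy()
injected = {}  # ground truth: error name -> corrupted row labels

injected["missing"] = pick_rows(dirty)
dirty.loc[injected["missing"], "age"] = np.nan  # missing values

injected["bad_category"] = pick_rows(dirty)
dirty.loc[injected["bad_category"], "plan"] = "premum"  # unknown category

injected["out_of_range"] = pick_rows(dirty)
dirty.loc[injected["out_of_range"], "height_cm"] = 999  # outside the expected range

injected["fractional"] = pick_rows(dirty)
dirty.loc[injected["fractional"], "visits"] += 0.5  # fractional part on a count

# Cross-attribute error: swap the dates so that start_date is later than end_date.
injected["date_order"] = pick_rows(dirty)
rows = injected["date_order"]
original_start = dirty.loc[rows, "start_date"].copy()
dirty.loc[rows, "start_date"] = dirty.loc[rows, "end_date"]
dirty.loc[rows, "end_date"] = original_start

print({name: len(labels) for name, labels in injected.items()})
```

Keeping the `injected` dictionary makes it possible to compare the rows flagged by the validation
library against the rows that were actually corrupted.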


## 2. Data Monitoring

Data monitoring involves continuously tracking the quality of data over time to detect any issues
that may arise. This is important because data quality can degrade for various reasons, such as
changes in data sources, data collection processes, or data processing pipelines. By monitoring
data quality, systems can identify issues and address them before they impact the operation of the
machine learning models.
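
To make "tracking over time" more concrete, the sketch below compares a newly arrived batch
against a reference batch using two simple signals: the missing-value rate and the Population
Stability Index (PSI). This is only one possible approach, written with NumPy alone; the function
names, the synthetic batches and the number of bins are assumptions made for the example, and the
monitoring tools listed in this section provide their own, more complete metrics.

```python
import numpy as np


def missing_rate(values):
    """Fraction of NaN entries in a 1-D numeric array."""
    return float(np.mean(np.isnan(values)))


def population_stability_index(reference, current, bins=10):
    """PSI between two 1-D numeric samples, with bin edges taken from the reference."""
    reference = reference[~np.isnan(reference)]
    current = current[~np.isnan(current)]
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    # Values outside the reference range are clipped into the outermost bins.
    clipped = np.clip(current, edges[0], edges[-1])
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(clipped, bins=edges)
    eps = 1e-6  # avoids log(0) and division by zero for empty bins
    ref_pct = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_pct = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)  # e.g. the training-time data
stable = rng.normal(0.0, 1.0, size=5000)     # new batch, same distribution
shifted = rng.normal(1.0, 1.0, size=5000)    # new batch, shifted mean
shifted[rng.random(5000) < 0.2] = np.nan     # ...and about 20% missing values

for name, batch in [("stable", stable), ("shifted", shifted)]:
    psi = population_stability_index(reference, batch)
    print(f"{name}: PSI={psi:.3f}, missing rate={missing_rate(batch):.2%}")
```

A common rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but
such thresholds are conventions and should be tuned to the data at hand.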

Some of the features of data monitoring may overlap with data validation.

There are multiple tools available for data monitoring. Just to name a few:

- Evidently AI - https://docs.evidentlyai.com/introduction
- WhyLabs-oss + WhyLogs
    - https://github.com/whylabs/whylabs-oss
    - https://docs.whylabs.ai/docs/
