# Lab 09 - Data Quality Monitoring

During this lab we will explore techniques for monitoring data quality in machine learning systems.

By "data quality" we mean how complete, correct, and reliable the data is. Poor data quality can
lead to poorly performing models and unreliable predictions.

Examples of data quality issues include:
- missing values
- incorrect or inconsistent data entries
- presence of outliers
- impossible values
- contradictory data
- inconsistent formatting
- wrong data types
- duplicate records
- etc.

## 1. Data Quality Validation

Data validation is the process of checking whether the data conforms to predefined rules and constraints.
It is used to ensure that the data is accurate, complete, and reliable before it is used for
analysis or modeling.
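
To make "rules and constraints" concrete, here is a minimal, library-agnostic sketch in plain Python. The rule names and the sample rows are hypothetical and chosen only for illustration; the libraries listed in this section offer much richer ways to express the same idea.

```python
from typing import Any, Callable

Row = dict[str, Any]
Rule = Callable[[Row], bool]

# Hypothetical rules for illustration: each rule returns True when the row is valid.
RULES: dict[str, Rule] = {
    "trip_distance_non_negative": lambda r: r["trip_distance"] is not None
    and r["trip_distance"] >= 0,
    "passenger_count_is_integer": lambda r: isinstance(r["passenger_count"], int),
}


def validate(rows: list[Row]) -> dict[str, list[int]]:
    """Return, for each rule, the indices of the rows that violate it."""
    return {
        name: [i for i, row in enumerate(rows) if not rule(row)]
        for name, rule in RULES.items()
    }


# Made-up sample rows: the second and third rows each break one rule.
rows = [
    {"trip_distance": 2.5, "passenger_count": 1},
    {"trip_distance": -1.0, "passenger_count": 2},
    {"trip_distance": 0.8, "passenger_count": 1.5},
]
report = validate(rows)
```

The resulting `report` maps every rule name to the offending row indices, which is the kind of machine-readable output a pipeline can act on.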

Here, we focus mainly on automated data validation techniques that can be integrated into data
pipelines. This is somewhat different from the data validation that is performed at runtime in
applications to validate user inputs, web form data, etc.

Several open-source Python libraries are available for data validation and data quality analysis;
some of them are part of larger commercial systems. Popular ones include:

- Great Expectations - https://github.com/great-expectations/great_expectations
- Pandera - https://pandera.readthedocs.io
- Pointblank - https://posit-dev.github.io/pointblank/
- Soda Core - https://github.com/sodadata/soda-
- Deequ - https://github.com/awslabs/deequ

Your task is to implement the following scenario:

- Use the [NYC Yellow Taxi Trip
  Data](https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data) from Kaggle, which is a
  subset of the original dataset provided by the [NYC Taxi & Limousine Commission
  (TLC)](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
- Familiarize yourself with the dataset - gain a basic understanding of the domain it represents to
  identify potential data quality issues.
- For the sake of the exercise, we will introduce some errors/disturbances into the dataset. This
  allows us to practice data quality validation techniques while already knowing which errors are
  present. For example:
    - replace some values randomly with NaN
    - for a categorical attribute (e.g., `store_and_fwd_flag` or `payment_type`) replace some items
      with new values that should be considered incorrect
    - change some values to be outside of the expected range, e.g., `trip_distance` is negative,
      `latitude/longitude` are outside of New York City
    - change some integer-valued attributes to have fractional parts, e.g., `passenger_count`
    - introduce some errors that are correct with respect to the domain of particular attributes but
      invalid when considering relationships between attributes, e.g., start date is later than
      end date or `total_amount` is not equal to the sum of other amounts
    - for the "cash" value of `payment_type`, set `tip_amount` greater than 0 for some items
      (normally, the tip amount is reported as `0` for cash payments)
    - add 3-4 more errors/disturbances of your choice
- Use one of the libraries mentioned above (you need to choose one) to model the data validation
  rules and detect the introduced errors/disturbances. Some of the libraries also allow you to
  obtain the severity level of the detected errors based on defined thresholds. Here, you are in a
  privileged position as you already know which errors were introduced into the dataset. However,
  try to leverage the data exploration features of the library to detect the errors/disturbances
  before implementing the rules.
  
  Implement rules that cover the errors/disturbances you have introduced into the dataset. Also
  implement rules that check basic properties of the dataset, for example:

    - ensure the dataset contains the appropriate number of columns
    - ensure the types of the attributes are correct
    - etc.

- Investigate the results. Depending on the library, you should be able to generate a
  machine-readable report and/or a human-readable report. A human-readable report can be reviewed by
  a person responsible for data quality, while a machine-readable report can be used to implement
  automated pipelines that can detect errors/disturbances in the data and take appropriate actions.
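
Because the errors are introduced deliberately, it helps to record which rows were corrupted so that the validation results can be compared against the ground truth. The following is a minimal sketch using only the standard library and a synthetic stand-in for the dataset. The function names `inject_errors` and `detect`, the 20% corruption fraction, and the two error types are assumptions made for illustration; in the lab, the detection step would be expressed as rules in the chosen validation library.

```python
import random


def inject_errors(rows, fraction=0.2, seed=42):
    """Corrupt a random subset of rows and record which rows were changed."""
    rng = random.Random(seed)  # fixed seed keeps the experiment reproducible
    corrupted = [dict(row) for row in rows]  # copy, so the clean data stays intact
    n_errors = max(1, int(len(rows) * fraction))
    injected = {"negative_distance": set(), "missing_passenger_count": set()}
    for i in rng.sample(range(len(rows)), n_errors):
        if rng.random() < 0.5:
            # Out-of-range value: make the distance strictly negative.
            corrupted[i]["trip_distance"] = -abs(corrupted[i]["trip_distance"]) - 1.0
            injected["negative_distance"].add(i)
        else:
            # Missing value.
            corrupted[i]["passenger_count"] = None
            injected["missing_passenger_count"].add(i)
    return corrupted, injected


def detect(rows):
    """Stand-in for the validation library: return violating row indices per check."""
    return {
        "negative_distance": {i for i, r in enumerate(rows) if r["trip_distance"] < 0},
        "missing_passenger_count": {
            i for i, r in enumerate(rows) if r["passenger_count"] is None
        },
    }


# Synthetic clean data (hypothetical values, not taken from the real dataset).
clean = [{"trip_distance": 1.0 + i, "passenger_count": 1 + i % 4} for i in range(50)]
dirty, injected = inject_errors(clean)
detected = detect(dirty)

for name in injected:
    missed = injected[name] - detected[name]
    false_alarms = detected[name] - injected[name]
    print(f"{name}: injected={len(injected[name])} "
          f"missed={len(missed)} false_alarms={len(false_alarms)}")
```

On the real dataset, "false alarms" are often genuine issues that were already present before any errors were injected, so they are worth inspecting rather than dismissing.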


## 2. Data Monitoring

Data monitoring involves tracking data quality over time to detect any issues that may arise. This
is important because data quality issues can result from various factors, such as changes in data
sources, data collection processes, or data processing pipelines. By monitoring data quality,
systems can identify problems early and address them before they impact the operation of machine
learning models.

Some features of data monitoring may overlap with data validation. The boundary between the
approaches can sometimes be blurry. For example, you can implement data monitoring procedures using
data validation tools within data ingestion pipelines or maintain data quality within scheduled
workflows.
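
The difference from one-off validation is the time dimension: the same metric is computed for every batch and compared against a threshold or a baseline. A minimal sketch of this idea in plain Python follows; the batch identifiers, the `tip_amount` sample values, and the 25% threshold are hypothetical, and the tools listed below provide far more complete implementations.

```python
from dataclasses import dataclass


@dataclass
class BatchStats:
    batch_id: str
    missing_rate: float


def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing (None or absent)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def monitor(batches, column, threshold):
    """Compute the metric for every batch and collect ids of batches above the threshold."""
    history, alerts = [], []
    for batch_id, rows in batches:
        stats = BatchStats(batch_id, missing_rate(rows, column))
        history.append(stats)
        if stats.missing_rate > threshold:
            alerts.append(batch_id)
    return history, alerts


# Hypothetical batches arriving on consecutive days.
batches = [
    ("day-1", [{"tip_amount": 1.0}, {"tip_amount": 0.0}, {"tip_amount": 2.5}, {"tip_amount": 0.0}]),
    ("day-2", [{"tip_amount": 1.0}, {"tip_amount": None}, {"tip_amount": None}, {"tip_amount": 0.0}]),
    ("day-3", [{"tip_amount": 3.0}, {"tip_amount": 0.0}, {"tip_amount": 1.5}, {"tip_amount": None}]),
]
history, alerts = monitor(batches, column="tip_amount", threshold=0.25)
```

Keeping the full `history` rather than only the alerts makes it possible to plot the metric over time or to derive the threshold from past batches instead of hard-coding it.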

There are multiple tools available for data monitoring. Just to name a few:

- Evidently - https://docs.evidentlyai.com/quickstart_ml
- WhyLabs-oss + WhyLogs
    - https://github.com/whylabs/whylabs-oss
    - https://docs.whylabs.ai/docs/
- Stream DaQ - https://bilpapster.github.io/stream-DaQ

Your task is to implement one of the scenarios below:

- Prepare a simple REST service (with a database - e.g., SQLite, PostgreSQL, etc.) that allows users
  to upload batches of data. You can assume the batches will be relatively small (no more than ten
  thousand rows). You may use the same dataset as in the previous section. The service should
  support uploading data batches as files, e.g., CSV or Parquet (the choice of format is up to
  you). You don't need to implement all the validation rules from the previous section; select as
  many as possible without compromising the performance of the service. Return an
  appropriate HTTP status code and, in case of an error, provide error information. Additionally,
  provide an endpoint that reports the number of rows in the database.
- Implement data stream monitoring using Stream DaQ on the following dataset:
  https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption or any
  other dataset of similar size. Get familiar with the features of the monitoring tool. You may need
  to preprocess the dataset, e.g., to create a proper timestamp column. Be creative and implement
  various quality checks. For example:
  - Are there as many readings as expected within a given time window?
  - Are there gaps in the data?
  - Are the voltage readings within the expected bounds?
  - Do the current intensity readings exceed the maximum allowed value for the electric fuse (you
    can decide what this value should be)?
  - Is the sum of energy consumption from sub-metering devices less than the total energy
    consumption? Consider that the units differ: sub-metering devices use "watt-hour of active
    energy", whereas the global active power device reports "minute-averaged active power in
    kilowatts".
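
For the REST service scenario, the core of the upload endpoint can be kept independent of the web framework. The sketch below uses only the standard library (`csv`, `sqlite3`) and returns an HTTP status code together with a response body. The function names, the two-column schema, and the choice of status codes (201, 400, 422) are assumptions made for illustration, not requirements of the lab.

```python
import csv
import io
import math
import sqlite3

# Assumed subset of the schema; a real service would cover more columns.
REQUIRED_COLUMNS = {"trip_distance", "passenger_count"}


def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips ("
        "trip_distance REAL NOT NULL, passenger_count INTEGER NOT NULL)"
    )


def ingest_batch(conn, csv_text):
    """Validate a CSV batch and store it; return (HTTP status code, response body)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return 400, {"error": "missing columns", "columns": sorted(missing)}

    rows, errors = [], []
    for line_no, record in enumerate(reader, start=2):  # line 1 is the header
        try:
            distance = float(record["trip_distance"])
            passengers = int(record["passenger_count"])  # rejects values such as "1.5"
        except (TypeError, ValueError):
            errors.append({"line": line_no, "error": "unparsable value"})
            continue
        if not math.isfinite(distance) or distance < 0 or passengers < 0:
            errors.append({"line": line_no, "error": "value out of range"})
            continue
        rows.append((distance, passengers))

    if errors:
        # Design choice in this sketch: reject the whole batch if any row is invalid.
        return 422, {"error": "validation failed", "details": errors}

    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    return 201, {"inserted": len(rows)}


def count_rows(conn):
    """Backs the endpoint that reports the number of rows in the database."""
    return conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]


conn = sqlite3.connect(":memory:")
init_db(conn)
ok = ingest_batch(conn, "trip_distance,passenger_count\n1.2,1\n3.4,2\n")
bad = ingest_batch(conn, "trip_distance,passenger_count\n-5,1\n2.0,1.5\n")
```

A web framework handler would then only need to read the uploaded file, call `ingest_batch`, and translate the returned pair into a response. Rejecting the whole batch is the simplest policy; accepting valid rows and reporting the rejected ones is an equally reasonable alternative.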
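
For the stream monitoring scenario, the unit mismatch in the last check deserves a worked example. A minute-averaged power of $P$ kilowatts corresponds to $P \cdot 1000 / 60$ watt-hours of energy consumed in that minute, which can then be compared with the sum of the sub-metering readings. The sketch below shows this conversion and a simple gap check in plain Python. It illustrates the logic only and does not use the Stream DaQ API; the function names, the 0.5 Wh tolerance, and the sample readings are assumptions.

```python
from datetime import datetime, timedelta


def energy_per_minute_wh(global_active_power_kw):
    """Convert minute-averaged power in kW to the energy used in that minute, in Wh."""
    return global_active_power_kw * 1000.0 / 60.0


def submetering_consistent(global_active_power_kw, sub_meterings_wh, tolerance_wh=0.5):
    """Check that the sub-metered energy does not exceed the total energy for the minute.

    The tolerance is a hypothetical allowance for rounding in the readings.
    """
    total_wh = energy_per_minute_wh(global_active_power_kw)
    return sum(sub_meterings_wh) <= total_wh + tolerance_wh


def find_gaps(timestamps, expected_step=timedelta(minutes=1)):
    """Return (previous, current) pairs of sorted timestamps that are too far apart."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > expected_step]


# Made-up readings: 4.216 kW is about 70.3 Wh per minute, so 18 Wh sub-metered is plausible.
plausible = submetering_consistent(4.216, [0.0, 1.0, 17.0])
# 0.3 kW is only 5 Wh per minute, so 18 Wh sub-metered is contradictory.
contradictory = submetering_consistent(0.3, [0.0, 1.0, 17.0])

timestamps = [
    datetime(2006, 12, 16, 17, 24),
    datetime(2006, 12, 16, 17, 25),
    datetime(2006, 12, 16, 17, 28),  # two readings are missing before this one
]
gaps = find_gaps(timestamps)
```

The same conversion can be applied per time window by summing both sides over the window before comparing them.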
