# Lab 10 - Data-Quality-Driven Machine Learning Systems

In this lab we will experiment with data-quality-driven machine learning systems. The main focus
is on data quality metrics and how they can be used to preserve or restore the performance of
machine learning models in the presence of data quality issues.

## 1. Machine Learning System

Your task is to create a classifier ensemble that can predict the target variable based on the
features provided in a dataset. For this lab, we will use the Human Activity Recognition Using
Smartphones dataset -
https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones. It is a
classification problem where we predict the activity performed by users wearing a smartphone (Samsung
Galaxy S II) on the waist. The possible activities are LAYING, STANDING, SITTING, WALKING,
WALKING_UPSTAIRS, WALKING_DOWNSTAIRS. The embedded accelerometer and gyroscope were used to capture
sensor signals. Overall, the dataset contains 561 input features, either raw or aggregated. It also
includes the target variable that we want to predict and a subject identifier indicating which
participant carried out the experiment. The dataset is also available in CSV format on Kaggle -
https://www.kaggle.com/datasets/uciml/human-activity-recognition-with-smartphones/data.


Use the training subset to prepare an ensemble of models that operate on diverse subsets of
features. You can use any machine learning model (e.g. a decision tree) of your choice and a
majority voting scheme to aggregate the predictions. The choice of feature subsets should ensure
robustness of the ensemble. For example, you might create models that utilize:

- only raw features
- only raw features and derivatives
- only aggregated features
- only aggregated features of particular type (mean, standard deviation, etc.)
- only raw features from accelerometer
- only raw and aggregated features from accelerometer
- only raw features from gyroscope
- only raw and aggregated features from gyroscope
- features chosen by a feature selection method
- features selected automatically by the machine learning algorithm itself
- etc.
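
As a minimal sketch of how such subsets might be organized, the snippet below groups columns by substrings of their names and keeps a registry that maps each ensemble member to the features it uses. The group names, the `GROUP_RULES` dictionary, and both helper functions are hypothetical. The substring patterns assume column names similar to those in the Kaggle CSV (e.g. `tBodyAcc-mean()-X`, `tBodyGyro-std()-Z`), so check them against the actual header before relying on them.

```python
from typing import Callable, Dict, List

# Hypothetical grouping rules: model name -> predicate over a column name.
# The substrings are an assumption about the CSV header, not a guarantee.
GROUP_RULES: Dict[str, Callable[[str], bool]] = {
    "accelerometer": lambda c: "Acc" in c and "Gyro" not in c,
    "gyroscope": lambda c: "Gyro" in c,
    "means": lambda c: "-mean()" in c,
    "stds": lambda c: "-std()" in c,
    "jerk": lambda c: "Jerk" in c,
}


def build_feature_groups(columns: List[str]) -> Dict[str, List[str]]:
    """Map each ensemble member to the list of columns it is allowed to use."""
    groups = {
        name: [c for c in columns if rule(c)] for name, rule in GROUP_RULES.items()
    }
    # Drop groups that matched nothing so no model is trained on zero features.
    return {name: cols for name, cols in groups.items() if cols}


def models_using(feature: str, groups: Dict[str, List[str]]) -> List[str]:
    """Return the names of all ensemble members that depend on `feature`."""
    return sorted(name for name, cols in groups.items() if feature in cols)


example_columns = [
    "tBodyAcc-mean()-X",
    "tBodyAcc-std()-X",
    "tBodyGyro-mean()-X",
    "tBodyAccJerk-mean()-X",
]
example_groups = build_feature_groups(example_columns)
```

Each entry can then be used to select the training columns for one ensemble member, and the same registry later tells you which members depend on a faulty feature.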

You should be able to define dozens of different models easily. However, it is important to track
which features are utilized by each component of the ensemble. Use a data validation
framework/library (e.g., the one you familiarized yourself with during previous labs) to define data
quality rules. You can start with basic rules, such as checking for missing values or values out of
bounds, and configure the library to obtain machine-readable error reports. 
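
Whichever library you choose, its report should ultimately reduce to a list of failed (feature, rule) pairs. The sketch below uses plain Python as a stand-in for such a report; it is not the API of any particular validation framework. The function names are hypothetical, and the `[-1, 1]` bounds assume that the features are normalized to that range, which you should confirm on the training data.

```python
import json
import math
from typing import Dict, List, Optional, Set

# Assumption: every feature is normalized to [-1, 1]; use per-feature bounds otherwise.
LOWER, UPPER = -1.0, 1.0


def validate_row(row: Dict[str, Optional[float]]) -> List[dict]:
    """Check one sample and return a machine-readable list of rule violations."""
    errors = []
    for feature, value in row.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            errors.append({"feature": feature, "rule": "not_missing", "value": None})
        elif not LOWER <= value <= UPPER:
            errors.append({"feature": feature, "rule": "in_bounds", "value": value})
    return errors


def failed_features(errors: List[dict]) -> Set[str]:
    """Collapse the report into the set of features that cannot be trusted."""
    return {error["feature"] for error in errors}


sample = {"acc_x": 0.2, "acc_y": float("nan"), "gyro_x": 3.5}
report = validate_row(sample)
report_json = json.dumps(report)  # serializable, e.g. for logging
```

The set returned by `failed_features` is the only piece of the report that the ensemble needs at inference time.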

During inference, if any issues are detected with a given feature, you should disable/mute the
models that utilize this feature. For example, if a problem is detected with any of the raw
accelerometer features, aggregate predictions only from models that do not use this feature.

You should also anticipate cases where all features used by your ensemble have quality issues. In
these scenarios, prepare a fallback procedure - for example, return the most common activity.
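
The muting and fallback logic described above can be sketched as follows. The ensemble layout (a dictionary from model name to a pair of feature list and prediction function) and all names are hypothetical, and the constant lambdas merely stand in for trained models.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Set, Tuple

Row = Dict[str, float]
Member = Tuple[List[str], Callable[[Row], str]]  # (features used, predict function)


def most_common_label(labels: Sequence[str]) -> str:
    """Fallback prediction: the most frequent activity in the training labels."""
    return Counter(labels).most_common(1)[0][0]


def predict_with_muting(
    row: Row,
    ensemble: Dict[str, Member],
    bad_features: Set[str],
    fallback_label: str,
) -> Tuple[str, List[str]]:
    """Majority vote over members that do not depend on any faulty feature."""
    active = [
        name
        for name, (features, _) in ensemble.items()
        if bad_features.isdisjoint(features)
    ]
    if not active:
        # Every member is muted, so fall back to the precomputed default.
        return fallback_label, active
    votes = Counter(ensemble[name][1](row) for name in active)
    # On a tie, Counter.most_common keeps the label that was counted first.
    return votes.most_common(1)[0][0], active


# Stand-in members: constant predictors instead of trained classifiers.
ensemble: Dict[str, Member] = {
    "acc_only": (["acc_x", "acc_y"], lambda row: "WALKING"),
    "gyro_only": (["gyro_x"], lambda row: "SITTING"),
    "mixed": (["acc_x", "gyro_x"], lambda row: "SITTING"),
}
fallback = most_common_label(["LAYING", "LAYING", "SITTING"])
label, used = predict_with_muting({}, ensemble, {"gyro_x"}, fallback)
```

Returning the list of active members alongside the label makes it easy to log how often the ensemble runs in a degraded state.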

Perform experiments to evaluate the performance of your ensemble under specific data quality issues.
You are required to simulate sensor failures, for example by introducing missing or out-of-bounds
values. Compare the ensemble's performance against a baseline model trained on the same dataset and
evaluated under the same data quality issues, but without any data quality handling.
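
One possible way to simulate such failures is sketched below. The helper names, the `rate` and `mode` parameters, and the out-of-bounds value `10.0` are all assumptions made for illustration; the value `10.0` only counts as out of bounds if the features are normalized to a narrower range such as `[-1, 1]`.

```python
import math
import random
from typing import Dict, List, Sequence

Row = Dict[str, float]


def inject_faults(
    rows: Sequence[Row],
    features: Sequence[str],
    rate: float,
    mode: str = "missing",
    seed: int = 0,
) -> List[Row]:
    """Return a corrupted copy of `rows`; the input rows are left untouched.

    With probability `rate`, all listed `features` of a row fail together,
    which mimics one sensor dropping out for that sample.
    """
    if mode not in ("missing", "out_of_bounds"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    bad_value = math.nan if mode == "missing" else 10.0
    corrupted = []
    for row in rows:
        new_row = dict(row)
        if rng.random() < rate:
            for feature in features:
                new_row[feature] = bad_value
        corrupted.append(new_row)
    return corrupted


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of matching labels; apply it to both ensemble and baseline."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


clean = [{"acc_x": 0.1, "gyro_x": -0.2} for _ in range(5)]
faulty = inject_faults(clean, ["gyro_x"], rate=0.5, mode="out_of_bounds")
```

Evaluating both the ensemble and the baseline on the same corrupted copy, for several values of `rate`, keeps the comparison fair.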

In [15]:
# write your code here


## 2*. Anomaly Detection

Anomaly detection is the task of identifying rare events or outliers in a dataset. It differs from
data quality validation in that its focus is not on ensuring data correctness, but on identifying
unexpected or rare patterns. Although distinct, both anomalies and data quality issues can degrade
the performance of machine learning models, but for different reasons. Data quality issues usually
concern the validity of data (e.g., missing or incorrect values), while anomaly detection focuses
on unusual patterns and outliers.

Sometimes, distinguishing between the two can be challenging. For example, consider a dust or
fine-particle sensor. If a sensor in the kitchen registers a maximum value while food is being
prepared on a frying pan, this could be considered an anomaly, yet it might also be a theoretically
correct reading (if the sensor has an upper limit on its readings range). On the other hand, if the
sensor stops providing readings due to a battery outage, this can be considered a data quality issue
rather than an anomaly in the dust sensor data stream (but depending on the domain context, it can
be interpreted as an anomaly as well).

*There is also another aspect to consider - the presence of data drift. Data drift refers to the
process by which a data distribution changes over time. It can be caused by various factors, such as
changes in the environment, modifications of the data collection process, or shifts in the
underlying data itself. Data drift can have a significant impact on the performance of machine
learning models, as it may cause them to become less accurate over time if they were trained on a
different data distribution. For now, however, we will treat data drift as a separate topic that is
out of the scope of this lab.*

There are several open-source libraries that can be used for anomaly detection. Some of them
specialize in anomaly detection for specific domains or data types. To name a few:

- Alibi Detect - https://docs.seldon.ai/alibi-detect
- PyOD - https://pyod.readthedocs.io/en/latest/
- Anomalib - https://anomalib.readthedocs.io/en/latest/

Familiarize yourself with the above libraries. Choose one of them and experiment with the dataset
used in the previous section. Simulate/introduce some anomalies into the dataset and try to detect
them.
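
Before turning to one of the libraries, it can help to have a dependency-free baseline to compare against. The sketch below is not taken from any of the libraries above: it injects a few artificial spikes into a single synthetic feature and flags them with a robust z-score based on the median and the median absolute deviation (MAD). The function names and the synthetic data are assumptions; the constant `0.6745` and the threshold `3.5` are commonly used defaults for this score, not values tuned for this dataset.

```python
import random
import statistics
from typing import List, Sequence


def robust_z_scores(values: Sequence[float]) -> List[float]:
    """Score each value by its distance from the median, scaled by the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical; this sketch then
        # gives up and reports no anomalies instead of dividing by zero.
        return [0.0 for _ in values]
    return [0.6745 * abs(v - med) / mad for v in values]


def detect_anomalies(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return the indices whose robust z-score exceeds `threshold`."""
    scores = robust_z_scores(values)
    return [i for i, score in enumerate(scores) if score > threshold]


# Synthetic stand-in for one feature column, with three injected spikes.
rng = random.Random(0)
signal = [rng.gauss(0.0, 0.1) for _ in range(200)]
injected = [10, 50, 120]
for index in injected:
    signal[index] = 5.0
flagged = detect_anomalies(signal)
```

Comparing `flagged` with `injected` gives a simple recall check that you can repeat with the detector from the library you choose.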

In [None]:
# write your code here
