# Lab 10 - Data Quality Driven Machine Learning Systems

In this lab we will experiment with data quality driven machine learning systems. The main focus
is on data quality metrics and how they can be used to maintain or restore the performance of
machine learning models in the presence of data quality issues.

## 1. Machine Learning System

Your task is to create a classifier ensemble that can predict the target variable based on the
features provided in a dataset. For this lab, we will use the Human Activity Recognition Using
Smartphones dataset -
https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones. It is a
classification problem where we predict the activity performed by users wearing a smartphone (Samsung
Galaxy S II) on the waist. The possible activities are LAYING, STANDING, SITTING, WALKING,
WALKING_UPSTAIRS, WALKING_DOWNSTAIRS. The embedded accelerometer and gyroscope were used to capture
sensor signals. Overall, the dataset contains 561 input features, either raw or aggregated. It also
includes the target variable that we want to predict and a subject identifier indicating which
participant carried out the experiment. The dataset is also available in CSV format on Kaggle -
https://www.kaggle.com/datasets/uciml/human-activity-recognition-with-smartphones/data.


You should track which features are utilised by each component of the ensemble, so that at
prediction time you can tell which components are affected by a faulty feature.
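One possible way to keep this bookkeeping is to store each trained model together with the names of the features it was fitted on. The sketch below is only an illustration on synthetic data: the `Member` class, the `build_members` and `members_without` helpers, the feature names `f0`..`f5`, and the choice of decision trees on random feature subsets are all assumptions, not requirements of the lab.

```python
from dataclasses import dataclass

import numpy as np
from sklearn.tree import DecisionTreeClassifier


@dataclass
class Member:
    """One ensemble component plus the feature names it was trained on."""
    model: DecisionTreeClassifier
    features: list


def build_members(X, y, feature_names, n_members=5, subset_size=3, seed=0):
    """Train each member on a random feature subset and remember that subset."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = np.sort(rng.choice(len(feature_names), size=subset_size, replace=False))
        model = DecisionTreeClassifier(random_state=0).fit(X[:, idx], y)
        members.append(Member(model, [feature_names[i] for i in idx]))
    return members


def members_without(members, feature):
    """Return the members that never saw the given feature."""
    return [m for m in members if feature not in m.features]


# Synthetic stand-in for the real dataset (hypothetical feature names).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
names = [f"f{i}" for i in range(6)]

members = build_members(X, y, names)
for m in members:
    print(m.features)
print(len(members_without(members, "f0")), "members do not depend on f0")
```

With the real dataset, `names` would be the 561 feature column names and the subsets would likely be much larger.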




Split the dataset into train and test subsets. Use the training subset to prepare the ensemble
according to the following rules:

- ...
- ...
-
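If you split a single DataFrame yourself, one option worth considering is a group-aware split on the `subject` column, so that no participant appears in both subsets. The lab does not prescribe this; the sketch below assumes scikit-learn's `GroupShuffleSplit` and uses a small synthetic frame with hypothetical feature columns `f0`..`f2`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 6 subjects, 20 rows each, 3 hypothetical features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(-1, 1, size=(120, 3)), columns=["f0", "f1", "f2"])
df["subject"] = np.repeat(np.arange(1, 7), 20)
df["Activity"] = rng.choice(["SITTING", "WALKING"], size=len(df))

# Hold out roughly a third of the subjects (not a third of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["subject"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

print("train subjects:", sorted(train_df["subject"].unique()))
print("test subjects: ", sorted(test_df["subject"].unique()))
```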

During prediction, validate the input data. If any issues are detected with a feature, take
appropriate action. For example, if an issue is detected with feature X, aggregate the predictions
from models that do not utilise this feature.
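The fallback described above could be sketched as follows. This is a minimal illustration on synthetic data, assuming that a valid feature value is present and lies within [-1, 1]; the range check, the `invalid_features` and `predict_one` helpers, the four feature subsets, and majority voting are all illustrative choices rather than the required design.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Each member is trained on its own feature subset (column indices).
feature_sets = [(0, 1), (1, 2), (0, 3), (2, 3)]
models = [
    DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, list(fs)], y)
    for fs in feature_sets
]


def invalid_features(row, low=-1.0, high=1.0):
    """Return indices of features that are missing or outside [low, high]."""
    bad = np.isnan(row) | (row < low) | (row > high)
    return set(np.flatnonzero(bad).tolist())


def predict_one(row):
    """Majority vote over members that do not use any invalid feature.

    Returns (label, number_of_voting_members), or (None, 0) if every
    member depends on at least one invalid feature.
    """
    bad = invalid_features(row)
    usable = [(m, fs) for m, fs in zip(models, feature_sets) if bad.isdisjoint(fs)]
    if not usable:
        return None, 0
    votes = [int(m.predict(row[list(fs)].reshape(1, -1))[0]) for m, fs in usable]
    # On a tie, argmax picks the smallest label.
    return int(np.bincount(votes).argmax()), len(usable)


clean = X[0].copy()
broken = X[0].copy()
broken[0] = np.nan  # simulate a missing reading for feature 0

print(predict_one(clean))   # all four members vote
print(predict_one(broken))  # only the members on (1, 2) and (2, 3) vote
```

Returning the number of voting members alongside the label makes it easy to log how often the ensemble had to degrade.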

Evaluate the ensemble's performance against a baseline model trained on the same dataset, but
without any data quality handling.
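A comparison along these lines might score both models on the same test set, once clean and once with a simulated quality issue. The sketch below uses synthetic data with integer labels; the stuck value `5.0`, the feature subsets, the `ensemble_predict` helper, and the single decision tree used as baseline are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(600, 4))
y = (X[:, 0] + X[:, 1] + X[:, 2] > 0).astype(int)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

# Simulated quality issue: feature 0 is stuck at an out-of-range value.
X_corrupt = X_test.copy()
X_corrupt[:, 0] = 5.0

# Baseline: one model on all features, no data quality handling.
baseline = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Ensemble: one model per feature subset, with the subset remembered.
feature_sets = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
members = [
    DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train[:, list(fs)], y_train)
    for fs in feature_sets
]


def ensemble_predict(X_new, low=-1.0, high=1.0):
    """Per row, vote only with members whose features all pass the range check."""
    preds = []
    for row in X_new:
        bad = set(np.flatnonzero(np.isnan(row) | (row < low) | (row > high)).tolist())
        usable = [(m, fs) for m, fs in zip(members, feature_sets) if bad.isdisjoint(fs)]
        if not usable:  # nothing left to trust: fall back to all members
            usable = list(zip(members, feature_sets))
        votes = [int(m.predict(row[list(fs)].reshape(1, -1))[0]) for m, fs in usable]
        preds.append(np.bincount(votes).argmax())
    return np.array(preds)


for name, data in [("clean", X_test), ("corrupted", X_corrupt)]:
    acc_baseline = accuracy_score(y_test, baseline.predict(data))
    acc_ensemble = accuracy_score(y_test, ensemble_predict(data))
    print(f"{name}: baseline={acc_baseline:.3f} ensemble={acc_ensemble:.3f}")
```

Reporting both the clean and the corrupted case shows what the data quality handling costs when nothing is wrong as well as what it gains when something is.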

    # write your code here

    import pandas as pd

    df = pd.read_csv("../../data/lab-10/train.csv")

    df

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.030400,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,STANDING
1,0.278419,-0.016411,-0.123520,-0.998245,-0.975300,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,1,STANDING
2,0.279653,-0.019467,-0.113462,-0.995380,-0.967187,-0.978944,-0.996520,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,1,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.982750,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,1,STANDING
4,0.276629,-0.016570,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.123320,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,1,STANDING
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7347,0.299665,-0.057193,-0.181233,-0.195387,0.039905,0.077078,-0.282301,0.043616,0.060410,0.210795,...,-0.880324,-0.190437,0.829718,0.206972,-0.425619,-0.791883,0.238604,0.049819,30,WALKING_UPSTAIRS
7348,0.273853,-0.007749,-0.147468,-0.235309,0.004816,0.059280,-0.322552,-0.029456,0.080585,0.117440,...,-0.680744,0.064907,0.875679,-0.879033,0.400219,-0.771840,0.252676,0.050053,30,WALKING_UPSTAIRS
7349,0.273387,-0.017011,-0.045022,-0.218218,-0.103822,0.274533,-0.304515,-0.098913,0.332584,0.043999,...,-0.304029,0.052806,-0.266724,0.864404,0.701169,-0.779133,0.249145,0.040811,30,WALKING_UPSTAIRS
7350,0.289654,-0.018843,-0.158281,-0.219139,-0.111412,0.268893,-0.310487,-0.068200,0.319473,0.101702,...,-0.344314,-0.101360,0.700740,0.936674,-0.589479,-0.785181,0.246432,0.025339,30,WALKING_UPSTAIRS


## 2*. Anomaly Detection

Anomaly detection is the task of identifying rare events or outliers in a dataset. It differs from
data quality validation in that its focus is not on ensuring data correctness, but on identifying
unexpected or rare patterns. Both anomalies and data quality issues can degrade the performance of
machine learning models, but for different reasons: data quality issues usually concern the validity
of the data (e.g., missing or incorrect values), whereas anomalies are unusual patterns or outliers.

Sometimes, distinguishing between the two can be challenging. For example, consider a dust or
fine-particle sensor. If a sensor in the kitchen registers its maximum value while food is being
fried in a pan, this could be considered an anomaly, yet it might also be a technically correct
reading (if the sensor has an upper limit on its measurement range). On the other hand, if the
sensor stops providing readings because its battery has run out, this can be considered a data
quality issue rather than an anomaly in the dust sensor data stream (although, depending on the
domain context, it can be interpreted as an anomaly as well).
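The dust sensor example can be turned into a small triage rule. The sketch below is purely illustrative: the upper limit `SENSOR_MAX`, the label names, and the decision to treat a saturated reading as a possible anomaly rather than an error are assumptions that would depend on the actual sensor and domain.

```python
import math

SENSOR_MAX = 1000.0  # hypothetical upper limit of the sensor's range


def triage(reading):
    """Label a single reading as ok, a possible anomaly, or a data quality issue."""
    if reading is None or math.isnan(reading):
        return "data_quality_issue"  # no reading at all, e.g. battery ran out
    if reading < 0 or reading > SENSOR_MAX:
        return "data_quality_issue"  # physically impossible for this sensor
    if reading == SENSOR_MAX:
        return "possible_anomaly"    # valid but saturated, e.g. frying in the kitchen
    return "ok"


stream = [12.0, 15.5, 1000.0, None, 14.2]
labels = [triage(r) for r in stream]
print(labels)
```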

*There is also another aspect to consider - the presence of data drift. Data drift refers to the
process by which a data distribution changes over time. It can be caused by various factors, such as
changes in the environment, modifications of the data collection process, or shifts in the
underlying data itself. Data drift can have a significant impact on the performance of machine
learning models, as it may cause them to become less accurate over time if they were trained on a
different data distribution. For now, however, we will treat data drift as a separate topic that is
out of the scope of this lab.*

There are several open-source libraries that can be used for anomaly detection. Some of them
specialize in anomaly detection for specific domains or data types. Just to name a few:

- Alibi Detect - https://docs.seldon.ai/alibi-detect
- PyOD - https://pyod.readthedocs.io/en/latest/
- Anomalib - https://anomalib.readthedocs.io/en/latest/

Familiarize yourself with the above libraries. Choose one of them and experiment with the dataset
used in the previous section. Simulate/introduce some anomalies into the dataset and try to detect
them.
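As a starting point, the inject-then-detect workflow might look like the sketch below. It deliberately uses scikit-learn's `IsolationForest` as a stand-in, not one of the libraries listed above, and runs on synthetic data; the way anomalies are generated (values pushed towards the edges of the range with random signs) and the contamination setting are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n_normal, n_anomalies, n_features = 500, 25, 5

# "Normal" rows cluster around zero.
normal = rng.normal(0.0, 0.1, size=(n_normal, n_features))

# Injected anomalies: every feature is pushed far from the bulk, random sign.
signs = rng.choice([-1.0, 1.0], size=(n_anomalies, n_features))
anomalies = signs * rng.uniform(0.7, 1.0, size=(n_anomalies, n_features))

X = np.vstack([normal, anomalies])
is_anomaly = np.r_[np.zeros(n_normal, dtype=bool), np.ones(n_anomalies, dtype=bool)]

# contamination is the expected share of anomalies; here it is known by construction.
detector = IsolationForest(contamination=n_anomalies / len(X), random_state=0)
flagged = detector.fit_predict(X) == -1  # fit_predict returns -1 for outliers

true_positives = (flagged & is_anomaly).sum()
recall = true_positives / is_anomaly.sum()
precision = true_positives / max(flagged.sum(), 1)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Because the injected rows are known, precision and recall can be computed directly, which is the main benefit of simulating anomalies instead of searching for them in unlabeled data.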