# Executive summary of Work Package 2

## Objectives

In this WP, you will work on a given training dataset. Your goal is to develop a fault detection model using the classification algorithms learnt in class, in order to achieve the best F1 score.

## Tasks

- Task 1: Develop a fault detection model using the unsupervised learning algorithms learnt in class, in order to achieve the best F1 score.
- Task 2: With the help of the supporting script, develop a cross-validation scheme to test the performance of the developed classification algorithms.
- Task 3: Develop a fault detection model using the classification algorithms learnt in class, in order to achieve the best F1 score.

## Deliverables

- A Jupyter notebook reporting the process and results of the above tasks


# Before starting, please:
- Fetch the most up-to-date version of the GitHub repository.
- Create a new branch with your name, based on the "main" branch, and switch to your own branch.
- Copy this notebook to the workspace of your group, and rename it to `TD_WP_2_Your name.ipynb`.
- After finishing this task, push your changes to the GitHub repository of your group.

# Task 1: Unsupervised learning approaches

## Implement the statistical testing approach for fault detection

In this exercise, we implement the statistical testing approach for fault detection. The basic idea is to fit a multi-dimensional distribution to the observation data collected under normal working conditions. Then, when a new data point arrives, we design a hypothesis test to check whether the new data point is consistent with that distribution. If it is, we conclude that the component is operating normally; otherwise, we flag the data point as a potential fault.

The benefit of this approach is that we do not need failure data to design the detection algorithm. The computational time is also short, since all we need to do is compute the pdf and compare it to a threshold.
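As a minimal sketch of the decision rule described above, the block below computes the log pdf of a Gaussian with independent features (a diagonal covariance) by hand and compares it to a threshold. The synthetic data, the helper name `diag_gaussian_logpdf` and the threshold value are assumptions made for illustration only; they are not part of the supporting scripts.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic "normal operation" data: 500 samples, 3 features (made up for illustration).
X_normal = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 2.0, 0.5], size=(500, 3))

# Per-feature mean and variance, as in a diagonal-covariance Gaussian.
mu = X_normal.mean(axis=0)
sigma2 = X_normal.var(axis=0)


def diag_gaussian_logpdf(X, mu, sigma2):
    """Log pdf of a Gaussian with independent features (hypothetical helper)."""
    X = np.atleast_2d(X)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (X - mu) ** 2 / sigma2, axis=1)


# One point at the mean and one point 10 standard deviations away in every feature.
x_typical = mu.copy()
x_far = mu + 10 * np.sqrt(sigma2)
points = np.vstack([x_typical, x_far])

log_p = diag_gaussian_logpdf(points, mu, sigma2)

# Cross-check against scipy's implementation with the same parameters.
reference = multivariate_normal(mean=mu, cov=np.diag(sigma2)).logpdf(points)

# Decision rule: predict 1 (fault) when the log pdf falls below the threshold.
log_epsilon = -50
is_fault = (log_p < log_epsilon).astype(int)
print(log_p, is_fault)
```

The point at the mean has a log pdf well above the threshold and is predicted as normal, while the distant point falls far below it and is flagged.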

In this exercise, you need to:
- Fit a multi-dimensional distribution to the training dataset (all normal samples).
- Design a fault detection algorithm based on the fitted distribution to detect faulty components.

The following block defines a few functions that you can use.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import multivariate_normal


def estimateGaussian(X):
    '''Given X, this function estimates the parameter of a multivariate Gaussian distribution.'''
    mu = np.mean(X, axis=0)
    sigma2 = np.var(X, axis=0)
    return mu, sigma2


def classify(X, distribution, log_epsilon=-50):
    '''Given X, this function classifies each sample in X based on the multivariate Gaussian distribution. 
       The decision rule is: if the log pdf is less than log_epsilon, we predict 1, as the sample is unlikely to be from the distribution, which represents normal operation.
    '''
    p = distribution.logpdf(X)
    predictions = (p < log_epsilon).astype(int)
    
    return predictions

Let us use the dataset `20240105_164214` as the training dataset, since all of its samples correspond to normal operation. We will use the dataset `20240325_155003` as the testing dataset. Let us try to predict the state of motor 1. For this, we first extract the position, temperature and voltage of motor 1 as features (you can change the features if you want).

In [2]:
import sys
sys.path.insert(0, '/Users/beatriz/Documents/GitHub/Group_3/projects/maintenance_industry_4_2024/supporting_scripts')

from utility import read_all_csvs_one_test
import pandas as pd

# Specify the path to the data directory.
base_dictionary = '../../dataset/training_data/'
dictionary_name = '20240105_164214'
path = base_dictionary + dictionary_name

# Read the data.
df_data = read_all_csvs_one_test(path, dictionary_name)

# Get the features
X_train = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']]

# We do the same to get the test dataset.
dictionary_name = '20240325_155003'
path = base_dictionary + dictionary_name

# Read the data.
df_data = read_all_csvs_one_test(path, dictionary_name)

# Get the features
X_test = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']]
y_test = df_data['data_motor_1_label']

Please design your algorithm below:

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from sklearn.metrics import accuracy_score
from utility import read_all_csvs_one_test

# Estimates the parameter of a multivariate Gaussian distribution
def estimateGaussian(X):
    mu = np.mean(X, axis=0)
    sigma2 = np.var(X, axis=0)
    return mu, sigma2

# Classifies each sample in X based on the multivariate Gaussian distribution. 
def classify(X, distribution, log_epsilon=-50):
    p = distribution.logpdf(X)
    predictions = (p < log_epsilon).astype(int)
    return predictions

# Read the training data.
base_dictionary = '../../dataset/training_data/'
dictionary_name = '20240105_164214'
path = base_dictionary + dictionary_name
df_data = read_all_csvs_one_test(path, dictionary_name)
X_train = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']]

# Construct a multivariate Gaussian distribution to represent normal operation
mu, sigma2 = estimateGaussian(X_train)
distribution = multivariate_normal(mean=mu, cov=np.diag(sigma2))

# Read the test data
dictionary_name = '20240325_155003'
path = base_dictionary + dictionary_name
df_data = read_all_csvs_one_test(path, dictionary_name)
X_test = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']]
y_test = df_data['data_motor_1_label']

# Predict the labels of the test set X_test
y_pred = classify(X_test, distribution)

# Calculate accuracy of the prediction
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.19422730006013228


**Discussions:**
- Can you please try to improve the performance of this approach?
    - For example, by normalizing the data?
    - By smoothing the data?
    - By reducing the number of features?
    - etc.
- The parameter log_epsilon defines the threshold we use for making classification. What happens if you change it?
- Could you discuss how we should get the best value for this parameter?

**Improving the performance:**

- After trying a few methods and combinations, removing the outliers and then applying standardization was the method that best improved the performance.
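One way to sketch the outlier removal and standardization steps is shown here on toy data. The helper name `mask_out_of_range`, the toy values and the short column names are assumptions for illustration; the temperature and voltage limits mirror the ones used for motor 1 in this notebook. This sketch fits the scaler on the training data only and reuses it on the test data, which keeps both sets on the same scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def mask_out_of_range(df, limits):
    """Replace values outside [low, high] with NaN, then forward-fill.

    Hypothetical helper: a leading gap is back-filled so no NaN remains.
    """
    cleaned = df.copy()  # work on a copy so the caller's frame is untouched
    for col, (low, high) in limits.items():
        cleaned[col] = cleaned[col].where(cleaned[col].between(low, high))
    return cleaned.ffill().bfill()


# Toy frames standing in for the motor features (values are made up).
train = pd.DataFrame({"temperature": [30.0, 31.0, 250.0, 32.0],
                      "voltage": [7000.0, 7100.0, 7050.0, 100.0]})
test = pd.DataFrame({"temperature": [33.0, -5.0, 34.0],
                     "voltage": [7200.0, 7150.0, 9999.0]})
limits = {"temperature": (0, 100), "voltage": (5000, 9000)}

train_clean = mask_out_of_range(train, limits)
test_clean = mask_out_of_range(test, limits)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(train_clean)
train_scaled = scaler.transform(train_clean)
test_scaled = scaler.transform(test_clean)

print(train_clean)
print(test_clean)
```

Working on `df.copy()` inside the helper also avoids pandas' `SettingWithCopyWarning`, which appears when columns are assigned on a frame that was sliced from another one.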

In [55]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from sklearn.metrics import accuracy_score
from utility import read_all_csvs_one_test
from sklearn.preprocessing import StandardScaler


# Estimates the parameter of a multivariate Gaussian distribution
def estimateGaussian(X):
    mu = np.mean(X, axis=0)
    sigma2 = np.var(X, axis=0)
    return mu, sigma2

# Classifies each sample in X based on the multivariate Gaussian distribution. 
def classify(X, distribution, log_epsilon=-50):
    p = distribution.logpdf(X)
    predictions = (p < log_epsilon).astype(int)
    return predictions

# Read the training data.
base_dictionary = '../../dataset/training_data/'
dictionary_name = '20240105_164214'
path = base_dictionary + dictionary_name
df_data = read_all_csvs_one_test(path, dictionary_name)
# Take an explicit copy so that later column assignments do not raise SettingWithCopyWarning.
X_train = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']].copy()

# Define a function to remove outliers from the data

def remove_outliers(df: pd.DataFrame):

    df['data_motor_1_temperature'] = df['data_motor_1_temperature'].where(df['data_motor_1_temperature'] <= 100, np.nan)
    df['data_motor_1_temperature'] = df['data_motor_1_temperature'].where(df['data_motor_1_temperature'] >= 0, np.nan)
    df['data_motor_1_temperature'] = df['data_motor_1_temperature'].ffill()

    df['data_motor_1_voltage'] = df['data_motor_1_voltage'].where(df['data_motor_1_voltage'] >= 5000, np.nan)
    df['data_motor_1_voltage'] = df['data_motor_1_voltage'].where(df['data_motor_1_voltage'] <= 9000, np.nan)
    df['data_motor_1_voltage'] = df['data_motor_1_voltage'].ffill()

    df['data_motor_1_position'] = df['data_motor_1_position'].where(df['data_motor_1_position'] >= 0, np.nan)
    df['data_motor_1_position'] = df['data_motor_1_position'].where(df['data_motor_1_position'] <= 1000, np.nan)
    df['data_motor_1_position'] = df['data_motor_1_position'].ffill()
    return df

# Define a function to standardize the data
def standard_data(df: pd.DataFrame):
    scaler = StandardScaler()
    normalized_data = scaler.fit_transform(df)
    df_normalized = pd.DataFrame(normalized_data)
    return df_normalized
# Remove outliers from X_train
X_train=remove_outliers(X_train)

# Standardize X_train
X_train = standard_data(X_train)

# Construct a multivariate Gaussian distribution to represent normal operation
mu, sigma2 = estimateGaussian(X_train)
distribution = multivariate_normal(mean=mu, cov=np.diag(sigma2))

# Read the test data
dictionary_name = '20240325_155003'
path = base_dictionary + dictionary_name
df_data = read_all_csvs_one_test(path, dictionary_name)
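# Take an explicit copy so that later column assignments do not raise SettingWithCopyWarning.
X_test = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']].copy()
y_test = df_data['data_motor_1_label']

# Remove outliers from X_test, then standardize it.
# Note: standard_data fits a new scaler on X_test; reusing the scaler fitted on
# X_train would keep the training and test sets on the same scale.
X_test=remove_outliers(X_test)
X_test=standard_data(X_test)

# Predict the labels of the test set X_test
y_pred = classify(X_test, distribution)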

# Calculate accuracy of the prediction
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8057726999398677



**Changing log_epsilon:**
- It changes the sensitivity of the algorithm. `classify` predicts a fault (1) when the log pdf of a sample is below `log_epsilon`.
  - If log_epsilon is set to a higher value (less negative), more samples fall below the threshold and are flagged as faults. The algorithm becomes more sensitive to faults, but the false positive rate may increase, as more normal data points are misclassified as faults.
  - If log_epsilon is set to a lower value (more negative), fewer samples fall below the threshold, so the algorithm considers a broader range of data points as normal. The false positive rate may decrease, but some actual faults might be missed.
  - In the example below, the accuracy is higher at `log_epsilon = -100` than at `log_epsilon = -2`. The two accuracies sum to 1, which is consistent with every sample being flagged at `-2` and none at `-100`, so accuracy alone can be misleading here and the F1 score should be checked as well.

See the example below.

In [74]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from sklearn.metrics import accuracy_score
from utility import read_all_csvs_one_test
from sklearn.preprocessing import StandardScaler


# Estimates the parameter of a multivariate Gaussian distribution
def estimateGaussian(X):
    mu = np.mean(X, axis=0)
    sigma2 = np.var(X, axis=0)
    return mu, sigma2

# Classifies each sample in X based on the multivariate Gaussian distribution. 
def classify(X, distribution, log_epsilon=-50):
    p = distribution.logpdf(X)
    predictions = (p < log_epsilon).astype(int)
    return predictions

# Read the training data.
base_dictionary = '../../dataset/training_data/'
dictionary_name = '20240105_164214'
path = base_dictionary + dictionary_name
df_data = read_all_csvs_one_test(path, dictionary_name)
# Take an explicit copy so that later column assignments do not raise SettingWithCopyWarning.
X_train = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']].copy()

# Define a function to remove outliers from the data

def remove_outliers(df: pd.DataFrame):

    df['data_motor_1_temperature'] = df['data_motor_1_temperature'].where(df['data_motor_1_temperature'] <= 100, np.nan)
    df['data_motor_1_temperature'] = df['data_motor_1_temperature'].where(df['data_motor_1_temperature'] >= 0, np.nan)
    df['data_motor_1_temperature'] = df['data_motor_1_temperature'].ffill()

    df['data_motor_1_voltage'] = df['data_motor_1_voltage'].where(df['data_motor_1_voltage'] >= 5000, np.nan)
    df['data_motor_1_voltage'] = df['data_motor_1_voltage'].where(df['data_motor_1_voltage'] <= 9000, np.nan)
    df['data_motor_1_voltage'] = df['data_motor_1_voltage'].ffill()

    df['data_motor_1_position'] = df['data_motor_1_position'].where(df['data_motor_1_position'] >= 0, np.nan)
    df['data_motor_1_position'] = df['data_motor_1_position'].where(df['data_motor_1_position'] <= 1000, np.nan)
    df['data_motor_1_position'] = df['data_motor_1_position'].ffill()
    return df

# Define a function to standardize the data
def standard_data(df: pd.DataFrame):
    scaler = StandardScaler()
    normalized_data = scaler.fit_transform(df)
    df_normalized = pd.DataFrame(normalized_data)
    return df_normalized
# Remove outliers from X_train
X_train=remove_outliers(X_train)

# Standardize X_train
X_train = standard_data(X_train)

# Construct a multivariate Gaussian distribution to represent normal operation
mu, sigma2 = estimateGaussian(X_train)
distribution = multivariate_normal(mean=mu, cov=np.diag(sigma2))

# Read the test data
dictionary_name = '20240325_155003'
path = base_dictionary + dictionary_name
df_data = read_all_csvs_one_test(path, dictionary_name)
# Take an explicit copy so that later column assignments do not raise SettingWithCopyWarning.
X_test = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']].copy()
y_test = df_data['data_motor_1_label']

# Remove outliers from X_test, then standardize it
X_test=remove_outliers(X_test)
X_test=standard_data(X_test)

# Predict the labels of the test set X_test with the default threshold
y_pred = classify(X_test, distribution)

log_epsilons = [-2,-100]  # Example values of log_epsilon

for log_epsilon in log_epsilons:
    y_pred = classify(X_test, distribution, log_epsilon=log_epsilon)
    accuracy = accuracy_score(y_test, y_pred)
    print("log_epsilon:", log_epsilon, "Accuracy:", accuracy)

log_epsilon: -2 Accuracy: 0.19422730006013228
log_epsilon: -100 Accuracy: 0.8057726999398677



**Finding the best value of log_epsilon:**

We can apply cross validation: we try various values of log_epsilon, evaluate the performance of the model on the validation set for each of them, and choose the value of log_epsilon that gives the best performance (for example, the best F1 score).
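The threshold search can be sketched as a simple sweep over candidate values, scored on a labelled validation set. Everything in the block below is an assumption for illustration: the data are synthetic, the candidate grid is arbitrary, and F1 is used as the selection metric because it is the target metric of this work package. A single validation split is used for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Synthetic data (made up): normal samples around 0, faulty samples shifted away.
X_train = rng.normal(0.0, 1.0, size=(500, 2))             # normal operation only
X_val = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),   # normal
                   rng.normal(6.0, 1.0, size=(50, 2))])   # faulty
y_val = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])

# Fit the distribution on the normal training data only.
distribution = multivariate_normal(mean=X_train.mean(axis=0),
                                   cov=np.diag(X_train.var(axis=0)))
log_p_val = distribution.logpdf(X_val)

# Sweep candidate thresholds and keep the one with the best validation F1.
candidates = np.linspace(-40, -2, 39)
scores = [f1_score(y_val, (log_p_val < c).astype(int), zero_division=0)
          for c in candidates]
best_idx = int(np.argmax(scores))
best_log_epsilon, best_f1 = candidates[best_idx], scores[best_idx]
print(f"best log_epsilon = {best_log_epsilon:.1f}, F1 = {best_f1:.3f}")
```

The selected threshold should then be checked on data that was not used for the selection, since a value tuned on the validation set tends to look optimistic on that same set.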



## Local outlier factor (LOF)

The local outlier factor (LOF) algorithm computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors. You can easily implement LOF in scikit-learn ([tutorial](https://www.datatechnotes.com/2020/04/anomaly-detection-with-local-outlier-factor-in-python.html)).

Please implement local outlier factor (LOF) algorithm on the dataset of `20240325_155003`. You can try first to detect the failure of motor 1 using this model. Please calculate the accuracy score of your prediction.
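A minimal sketch of LOF with scikit-learn is given below as a starting point. It runs on synthetic data rather than on the dataset `20240325_155003`; the data, `n_neighbors=20` and `contamination=0.05` are assumptions chosen for illustration and would need to be adapted to the motor features.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic data (made up): a dense normal cluster and a few isolated faulty points
# placed on a circle of radius 8 around it.
X_normal = rng.normal(0.0, 1.0, size=(300, 2))
angles = rng.uniform(0.0, 2 * np.pi, size=15)
X_fault = 8.0 * np.column_stack([np.cos(angles), np.sin(angles)])

X = np.vstack([X_normal, X_fault])
y_true = np.concatenate([np.zeros(300, dtype=int), np.ones(15, dtype=int)])

# contamination is the expected fraction of outliers; here it is simply assumed.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
raw = lof.fit_predict(X)           # +1 = inlier, -1 = outlier
y_pred = (raw == -1).astype(int)   # map to the notebook's convention: 1 = fault

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

`fit_predict` labels the same samples it was fitted on. To fit on normal data and then score new samples, `LocalOutlierFactor` can be created with `novelty=True` and used through `fit` followed by `predict`.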

# Task 2: Develop a cross validation pipeline to evaluate the performance of the model

The idea of cross validation is to split the data into k subsets, use one of them as the test set and the rest as the training set, and repeat this k times so that each subset serves as the test set once. The model is trained only on the training set and evaluated only on the test set, so the evaluation is independent of the training. This allows us to detect whether the model is overfitted.

## k-fold cross validation

Here, we use motor 1 as an example to develop a pipeline for cross validation. Below, you have a script that reads the data, extracts the features and gets the labels.

1. Use scikit-learn to split the data into training and testing sets, using a k-fold cross validation with k=5. (Hint: This is a routine task which can be answered easily by language models like ChatGPT. You can try a prompt like this: `Generate a code in python to split the data X and y into training and testing sets, using a k-fold cross validation with k=5.`)
2. Then, train a basic logistic regression model, without hyper-parameter tuning, on the training set, and use the testing set to evaluate the performance of the model (calculate accuracy, precision, recall, and F1 score).
3. Finally, train a logistic regression model, but use the entire dataset X and y as training data. Then, use the trained model to predict the labels of the same dataset (X). Compare the results with the previous step, and discuss why we should use cross validation to evaluate the performance of the model.
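The three steps above can be sketched as follows. This is an illustration on synthetic data generated with `make_classification`, not on the motor dataset; the helper name `evaluate`, the class balance and the random seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import KFold

# Synthetic, imbalanced binary problem standing in for X and y (made up).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)


def evaluate(y_true, y_pred):
    """Return the four metrics requested in the task (hypothetical helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Steps 1 and 2: 5-fold split, train on four folds, evaluate on the held-out fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(evaluate(y[test_idx], model.predict(X[test_idx])))

cv_mean = {name: float(np.mean([s[name] for s in fold_scores]))
           for name in fold_scores[0]}

# Step 3: train and evaluate on the same data (no held-out set).
full_model = LogisticRegression(max_iter=1000).fit(X, y)
resubstitution = evaluate(y, full_model.predict(X))

print("cross-validated mean:", cv_mean)
print("train = test:        ", resubstitution)
```

With a pandas DataFrame, the fold indices would be applied with `X.iloc[train_idx]` instead of `X[train_idx]`. For sensor time series, shuffled folds can place neighbouring samples of the same test run in both the training and the test fold, so splitting by test run may give a more realistic estimate.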

In [None]:
import sys
sys.path.insert(0, 'C:/Users/Zhiguo/OneDrive - CentraleSupelec/Code/Python/digital_twin_robot/projects/maintenance_industry_4_2024/supporting_scripts/WP_1')

# Import the function that is actually called below.
from utility import read_all_test_data_from_path
import pandas as pd

# Define the path to the folder 'training_data'.
base_dictionary = '../../dataset/training_data/'
# Read all the data
df_data = read_all_test_data_from_path(base_dictionary)

# Extract the features for motor 1: You should replace the features with the ones you have selected in WP1.
X = df_data[['data_motor_1_position', 'data_motor_1_temperature', 'data_motor_1_voltage']]
# Get the label
y = df_data['data_motor_1_label']


Write your discussions here:


# Task 3: Develop classification-based fault detection models

In this task, you are supposed to experiment with different classification-based fault detection models to get the best F1 score. Please use 5-fold cross-validation to calculate the F1 score. You are free to try different models, whether or not they were discussed in class. To simplify your work, you can use the models available in [scikit-learn](https://scikit-learn.org/stable/supervised_learning.html).

Please report all the models you tried, how you tuned their hyperparameters, and the corresponding F1 scores. Please note that if you would like to tune the hyperparameters, you can use the `GridSearchCV` class in scikit-learn, but you should use it only on the training dataset.
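One way to keep hyperparameter tuning restricted to the training data is to wrap `GridSearchCV` inside the outer cross-validation loop, so that each outer test fold is never seen during tuning. The block below is a sketch on synthetic data; the grid over `C`, the number of inner folds and the random seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for X and y (made up).
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.8, 0.2], random_state=1)

# The scaler is part of the pipeline, so it is refitted on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # used for tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # used for reporting

# GridSearchCV only ever sees the training part of each outer fold.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=inner_cv)
outer_f1 = cross_val_score(search, X, y, scoring="f1", cv=outer_cv)

print("F1 per outer fold:", outer_f1.round(3))
print("mean F1:", round(float(outer_f1.mean()), 3))
```

The mean of `outer_f1` is the value to report in the summary table; the selected hyperparameters can differ between outer folds, which is expected with this scheme.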

## Logistic regression

In [None]:
# Your code here:

## Summary of the results

Please add a table at the end, summarizing the results from all the models (including the unsupervised learning models). Please write a few sentences to explain which model performed best, its performance, and how you could further improve it.

| Model   | Accuracy | Precision | Recall | F1   |
|---------|----------|-----------|--------|------|
| Model 1 |   XX.X%  |   XX.X%   |  XX.X% | XX.X%|
| Model 2 |   XX.X%  |   XX.X%   |  XX.X% | XX.X%|
| Model 3 |   XX.X%  |   XX.X%   |  XX.X% | XX.X%|


