In this document, we will demonstrate how to use the provided supporting function to read all the data from one test and apply basic data preprocessing.

## Read data using the supporting function

In `utility.py`, we provide a supporting function that reads all the data (six csv files) from the folder of one test and creates an aggregated dataframe to store them. To use this function, you need to specify the path to the folder of that test. Below, we demonstrate how to read all the data from the test in folder `projects\maintenance_industry_4_2024\dataset\training_data\20240105_164214` and create a dataframe.

In [1]:
from utility import read_all_csvs_one_test
import pandas as pd

# Specify the path to the directory of the test.
base_dictionary = '../../dataset/training_data/'
dictionary_name = '20240105_164214'
path = base_dictionary + dictionary_name

# Read the data.
df_data = read_all_csvs_one_test(path, dictionary_name)
df_data.head()

Unnamed: 0,time,data_motor_1_position,data_motor_1_temperature,data_motor_1_voltage,data_motor_1_label,data_motor_2_position,data_motor_2_temperature,data_motor_2_voltage,data_motor_2_label,data_motor_3_position,...,data_motor_4_label,data_motor_5_position,data_motor_5_temperature,data_motor_5_voltage,data_motor_5_label,data_motor_6_position,data_motor_6_temperature,data_motor_6_voltage,data_motor_6_label,test_condition
0,76522.025433,86,42,7223,0,501,31,7334,0,80,...,0,619,43,7312,0,500,24,7361,0,20240105_164214
1,76522.125464,86,42,7214,0,502,31,7250,0,80,...,0,619,43,7332,0,499,24,7372,0,20240105_164214
2,76522.225432,86,42,7137,0,501,31,7234,0,79,...,0,619,43,7330,0,499,24,7356,0,20240105_164214
3,76522.325432,86,42,7135,0,501,31,7250,0,79,...,0,619,43,7319,0,499,24,7374,0,20240105_164214
4,76522.425451,86,42,7212,0,502,31,7232,0,79,...,0,619,43,7348,0,499,24,7365,0,20240105_164214
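
To give an intuition for what such an aggregation involves, here is a minimal sketch that merges per-motor tables into one wide dataframe with the column naming seen above. The helper `aggregate_motor_frames` and the toy tables are hypothetical and only illustrate the idea; the actual implementation in `utility.py` may differ.

```python
import pandas as pd


def aggregate_motor_frames(frames, test_name):
    """Merge per-motor dataframes into one wide dataframe (illustrative only).

    frames: dict mapping a motor index to a dataframe that has a 'time'
    column plus measurement columns such as 'position' or 'voltage'.
    """
    merged = None
    for motor_idx, df in sorted(frames.items()):
        # Prefix every measurement column with the motor it belongs to.
        renamed = df.rename(columns={
            col: f'data_motor_{motor_idx}_{col}'
            for col in df.columns if col != 'time'
        })
        # Join the motors on their shared time stamps.
        merged = renamed if merged is None else merged.merge(renamed, on='time')
    # Record which test the rows come from.
    merged['test_condition'] = test_name
    return merged


# Two tiny hypothetical per-motor tables sharing the same time stamps.
motor_1 = pd.DataFrame({'time': [0.0, 0.1], 'position': [86, 86], 'voltage': [7223, 7214]})
motor_2 = pd.DataFrame({'time': [0.0, 0.1], 'position': [501, 502], 'voltage': [7334, 7250]})

df_demo = aggregate_motor_frames({1: motor_1, 2: motor_2}, '20240105_164214')
print(df_demo.columns.tolist())
```

The sketch assumes that all motors are sampled at identical time stamps; if they were not, an inner merge on `time` would silently drop rows.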


## Read all the data and visualize them

Below is a demonstration of how to read all the data and visualize them.

In [2]:
from utility import read_all_test_data_from_path
import pandas as pd

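# Define the path to the folder that contains all the tests ('training_data').
# NOTE: this absolute path is machine-specific; replace it with the location of
# the dataset on your machine (e.g. the relative path '../../dataset/training_data/').
base_dictionary = '/Users/user1/Desktop/Industry_40/digital_twin_robot_group2/projects/maintenance_industry_4_2024/dataset/training_data/'
# Read all the data.
df_data = read_all_test_data_from_path(base_dictionary)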

FileNotFoundError: [WinError 3] The system cannot find the path specified: '/Users/user1/Desktop/Industry_40/digital_twin_robot_group2/projects/maintenance_industry_4_2024/dataset/training_data/'

All the data read will be stored in `df_data`. Each column name indicates which performance metric of which motor the data come from, and the last column specifies the test condition.

In [None]:
df_data.head()

Unnamed: 0,time,data_motor_1_position,data_motor_1_temperature,data_motor_1_voltage,data_motor_1_label,data_motor_2_position,data_motor_2_temperature,data_motor_2_voltage,data_motor_2_label,data_motor_3_position,...,data_motor_6_label,data_motor_4_position,data_motor_4_temperature,data_motor_4_voltage,data_motor_4_label,data_motor_5_position,data_motor_5_temperature,data_motor_5_voltage,data_motor_5_label,test_condition
0,76522.025433,86,42,7223,0,501,31,7334,0,80,...,0,825,25,7270,0,619,43,7312,0,20240105_164214
1,76522.125464,86,42,7214,0,502,31,7250,0,80,...,0,825,25,7345,0,619,43,7332,0,20240105_164214
2,76522.225432,86,42,7137,0,501,31,7234,0,79,...,0,825,25,7277,0,619,43,7330,0,20240105_164214
3,76522.325432,86,42,7135,0,501,31,7250,0,79,...,0,825,25,7263,0,619,43,7319,0,20240105_164214
4,76522.425451,86,42,7212,0,502,31,7232,0,79,...,0,824,25,7303,0,619,43,7348,0,20240105_164214
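
Because every measurement column follows the pattern `data_motor_<index>_<metric>`, the motor and the metric can be recovered from the column name. The sketch below shows one possible way to do this; `parse_column_name` and `columns_for_motor` are hypothetical helpers, not part of `utility.py`.

```python
import re

# Pattern of the measurement columns, e.g. 'data_motor_3_voltage'.
COLUMN_PATTERN = re.compile(r'^data_motor_(\d+)_(\w+)$')


def parse_column_name(name):
    """Return (motor_index, metric) for a measurement column, else None."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None  # e.g. 'time' or 'test_condition'
    return int(match.group(1)), match.group(2)


def columns_for_motor(columns, motor_idx):
    """Select the columns that belong to one motor."""
    selected = []
    for col in columns:
        parsed = parse_column_name(col)
        if parsed is not None and parsed[0] == motor_idx:
            selected.append(col)
    return selected


example_columns = ['time', 'data_motor_1_position', 'data_motor_1_label',
                   'data_motor_2_voltage', 'test_condition']
print(parse_column_name('data_motor_2_voltage'))
print(columns_for_motor(example_columns, 1))
```

With a real dataframe, `columns_for_motor(df_data.columns, 1)` could be used to extract the sub-table of a single motor.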


## Customize data preprocessing

As the graphs above show, the original data contain some noise and outliers. To apply data preprocessing during the reading process, you can write a function that preprocesses the data. This function must take as input a dataframe of the original csv file, apply the preprocessing, and return the preprocessed dataframe. You can then pass this function as an argument to the `read_all_csvs_one_test` function (the demo below passes it to `read_all_test_data_from_path`).

Below, we show a demo of removing outliers based on a pre-defined validity range. It can be seen that after the preprocessing, the outliers are removed.

In [None]:
from utility import read_all_test_data_from_path
import numpy as np
import pandas as pd


def remove_outliers(df: pd.DataFrame) -> pd.DataFrame:
    ''' # Description
    Remove outliers from the dataframe based on pre-defined valid ranges
    of temperature, voltage and position.
    Use the ffill function to replace each invalid measurement with the previous valid value.
    '''
    # Temperature: valid range [0, 200].
    df['temperature'] = df['temperature'].where(df['temperature'] <= 200, np.nan)
    df['temperature'] = df['temperature'].where(df['temperature'] >= 0, np.nan)
    df['temperature'] = df['temperature'].ffill()

    # Voltage: valid range [6000, 9000].
    df['voltage'] = df['voltage'].where(df['voltage'] >= 6000, np.nan)
    df['voltage'] = df['voltage'].where(df['voltage'] <= 9000, np.nan)
    df['voltage'] = df['voltage'].ffill()

    # Position: valid range [0, 1000].
    df['position'] = df['position'].where(df['position'] >= 0, np.nan)
    df['position'] = df['position'].where(df['position'] <= 1000, np.nan)
    df['position'] = df['position'].ffill()

    # Return the preprocessed dataframe (it is also modified in place).
    return df


base_dictionary = '../../dataset/training_data/'
df_data = read_all_test_data_from_path(base_dictionary, remove_outliers)

NotADirectoryError: [Errno 20] Not a directory: '../../dataset/training_data/Test conditions.xlsx'
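
To illustrate the effect of the validity-range approach independently of the dataset, here is a small self-contained sketch on a toy temperature series. The helper `replace_out_of_range` and the sample values are made up for illustration; the range [0, 200] is the one used for temperature in `remove_outliers`.

```python
import numpy as np
import pandas as pd


def replace_out_of_range(series, lower, upper):
    """Set values outside [lower, upper] to NaN, then forward-fill them."""
    # Series.between is inclusive on both ends by default.
    cleaned = series.where(series.between(lower, upper), np.nan)
    # Replace each invalid value with the last valid value before it.
    return cleaned.ffill()


# Toy temperature series with two obviously invalid readings (255 and -5).
temperature = pd.Series([42, 43, 255, 44, -5, 45])
cleaned = replace_out_of_range(temperature, 0, 200)
print(cleaned.tolist())  # [42.0, 43.0, 43.0, 44.0, 44.0, 45.0]
```

Note that forward-filling cannot repair an invalid value at the very start of a series, since there is no previous value to copy; such a value would remain `NaN`.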

## Other possible preprocessing

You can define your own preprocessing function to apply further preprocessing approaches, such as a moving average or low-pass/high-pass filters, to further remove the noise in the original data.
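
As one possible example, the sketch below smooths the measurements with a trailing moving average. It assumes the per-file dataframe has the same `temperature`, `voltage` and `position` columns used in `remove_outliers`; the function name `smooth_moving_average` and the window length are illustrative choices, not recommendations.

```python
import pandas as pd


def smooth_moving_average(df, window=5):
    """Return a copy of df with a trailing moving average applied to the
    measurement columns (illustrative sketch)."""
    smoothed = df.copy()
    for col in ('temperature', 'voltage', 'position'):
        if col in smoothed.columns:
            # min_periods=1 avoids NaN values at the start of the series.
            smoothed[col] = smoothed[col].rolling(window=window, min_periods=1).mean()
    return smoothed


# Toy data with a single spike in the voltage.
raw = pd.DataFrame({'voltage': [7000, 7000, 9000, 7000, 7000]})
smoothed = smooth_moving_average(raw, window=3)
print(smoothed['voltage'].round(1).tolist())
```

A longer window removes more noise but also delays and flattens genuine changes in the signal, so the window length is a trade-off that should be checked against the plots.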

## Correlation between variables

We can visualize the correlation between variables using a heatmap. Here, the purpose is to identify the features that are highly correlated with the response variable, and to remove the features that are highly correlated with each other. We have six response variables (we need one model per response variable).

In [None]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Keep only the numerical variables. Dropping the labels (commented out below) is not relevant in this case, so only 'test_condition' is removed.
# df_data_processed = df_data.drop(columns=['data_motor_1_label', 'data_motor_2_label', 'data_motor_3_label',
#                                         'data_motor_4_label', 'data_motor_5_label', 'data_motor_6_label', 'test_condition'])
df_data_processed = df_data.drop(columns=['test_condition'])

# Initialize StandardScaler
scaler = StandardScaler()

# Fit scaler to the data and transform the data
df_data_processed = pd.DataFrame(data=scaler.fit_transform(df_data_processed), columns=df_data_processed.columns)

# Compute correlation matrix
correlation_matrix = df_data_processed.corr()

# Plot correlation matrix using seaborn
plt.figure(figsize=(20, 20))  # Adjust width and height as needed
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
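
The selection idea described in this section (keep features that correlate with the response, drop features that correlate strongly with each other) could be automated along the following lines. This is only a sketch on synthetic data: the function `select_features` and both thresholds are hypothetical and would need to be tuned for the real dataset, where the response would be one of the label columns.

```python
import numpy as np
import pandas as pd


def select_features(df, response, min_corr_with_response=0.3,
                    max_corr_between_features=0.9):
    """Greedy feature selection based on absolute Pearson correlation."""
    corr = df.corr().abs()
    # Rank the candidate features by their correlation with the response.
    relevance = corr[response].drop(response).sort_values(ascending=False)
    candidates = relevance[relevance >= min_corr_with_response].index
    selected = []
    for feature in candidates:
        # Keep a feature only if it is not redundant with one already kept.
        if all(corr.loc[feature, kept] <= max_corr_between_features
               for kept in selected):
            selected.append(feature)
    return selected


# Synthetic example: x2 is almost a copy of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
df_toy = pd.DataFrame({
    'x1': x1,
    'x2': x1 + 0.01 * rng.normal(size=500),
    'x3': rng.normal(size=500),
    'y': x1 + 0.1 * rng.normal(size=500),
})
selected = select_features(df_toy, response='y')
print(selected)
```

In this toy case only one of the two nearly identical features is kept and the unrelated feature is discarded. Pearson correlation captures linear relationships only, so a low value does not prove that a feature is uninformative.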