# Component X
This notebook explores the SCANIA Component X dataset.

We start by exploring the dataset. Next, we transform the data to remove missing values and to add and remove features. Finally, we build a model that predicts the maintenance needs of Component X.

[Paper at arXiv](https://arxiv.org/abs/2401.15199)

[Dataset can be downloaded here](https://stockholmuniversity.app.box.com/s/anmg5k93pux5p6decqzzwokp9vuzmdkh)

Place the data files in a local folder (for example /data/) and set `data_dir` in the code below to that folder.
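
Before loading anything, it can help to confirm that the folder really contains the nine CSV files this notebook reads. The sketch below is only an illustration: the helper name `missing_files` and the constant `EXPECTED_FILES` are made up here, and the file names are the ones used by the loading code in this notebook.

```python
import os

# File names used by the loading code in this notebook.
EXPECTED_FILES = [
    'train_tte.csv',
    'train_specifications.csv',
    'train_operational_readouts.csv',
    'validation_labels.csv',
    'validation_specifications.csv',
    'validation_operational_readouts.csv',
    'test_labels.csv',
    'test_specifications.csv',
    'test_operational_readouts.csv',
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir.

    Hypothetical helper, not part of the dataset or of pandas.
    """
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
```

An empty list from `missing_files(data_dir)` means every file was found; otherwise the returned names show what still needs to be downloaded.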

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os


In [2]:
print(os.getcwd())
data_dir = "/workspaces/datasets/scania"
print(os.listdir(data_dir))

/workspaces/py_tf_2_17_env
['validation_labels.csv', 'test_specifications.csv', 'train_specifications.csv', 'train_tte.csv', 'test_operational_readouts.csv', 'train_operational_readouts.csv', 'validation_specifications.csv', 'validation_operational_readouts.csv', 'test_labels.csv']


In [3]:
#Read the raw data
#Train data

tteTrain = pd.read_csv(os.path.join(data_dir,'train_tte.csv'))
specificationsTrain = pd.read_csv(os.path.join(data_dir,'train_specifications.csv'))
readoutsTrain = pd.read_csv(os.path.join(data_dir,'train_operational_readouts.csv'))

#Validation data
labelsValidation = pd.read_csv(os.path.join(data_dir,'validation_labels.csv'))
specificationsValidation = pd.read_csv(os.path.join(data_dir,'validation_specifications.csv'))
readoutsValidation = pd.read_csv(os.path.join(data_dir,'validation_operational_readouts.csv'))

#Test data
specificationsTest = pd.read_csv(os.path.join(data_dir,'test_specifications.csv'))
readoutsTest = pd.read_csv(os.path.join(data_dir,'test_operational_readouts.csv'))
test_labels = pd.read_csv(os.path.join(data_dir,'test_labels.csv'))


In [None]:
#See what is in the data
tteTrain.head()

In [None]:
specificationsTrain.head()

In [None]:
readoutsTrain.head()

In [None]:
readoutsTrain.columns


In [None]:
#Check the shape of the data
readoutsTrain.shape

In [None]:
readoutsTrain.describe()

In [None]:
#Check how many vehicles are in the data
readoutsTrain['vehicle_id'].unique().shape

# Variable structure

Histogram variables use the indexing format variableid_binindex, where "variableid" is the ID of an
anonymized variable or feature and "binindex" is the bin number. As an example, the variable
with "variableid" 167 is a multi-dimensional histogram with ten bins: "167_0", "167_1", ..., "167_9".

In summary, six of the 14 variables are organized into histograms, with variable IDs "167", "272", "291", "158", "459",
and "397" and 10, 10, 11, 10, 20, and 36 bins, respectively.

The remaining eight variables,
"171_0", "666_0", "427_0", "837_0", "309_0", "835_0", "370_0", and "100_0", are numerical counters. These features are
cumulative, which makes them suitable for representing trends over time.
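
The naming scheme above can be used to split the readout columns into histograms and counters without typing every name. The following is a minimal sketch, not part of the dataset tooling: the function name `group_readout_columns` is made up, and it assumes that a variable with a single bin (only `_0`) is a counter, which matches the eight counters listed above.

```python
from collections import defaultdict


def group_readout_columns(columns):
    """Split column names of the form variableid_binindex into
    histogram variables (several bins) and counters (a single bin).

    Columns that do not follow the pattern, such as vehicle_id or
    time_step, are ignored.
    """
    bins = defaultdict(list)
    for col in columns:
        var_id, sep, bin_idx = col.rpartition('_')
        if sep and var_id.isdigit() and bin_idx.isdigit():
            bins[var_id].append((int(bin_idx), col))

    histograms, counters = {}, []
    for var_id, items in bins.items():
        # Sort numerically so that e.g. '459_10' comes after '459_9'.
        names = [name for _, name in sorted(items)]
        if len(names) > 1:
            histograms[var_id] = names
        else:
            counters.extend(names)
    return histograms, counters
```

Calling `group_readout_columns(readoutsTrain.columns)` should then give one entry per histogram variable and a flat list of counter columns, which can be compared against the bin counts quoted above.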


In [None]:
cols = [f'167_{i}' for i in range(10)] #'167_0', '167_1', ..., '167_9'
#Take the first row of the data
y = readoutsTrain[cols].iloc[0]
#Plot the data
plt.plot(y, 'o-')
plt.xlabel('Bin')
plt.ylabel('Value')
plt.title('Sample readout for variable 167, as a histogram of ten bins');





In [None]:
cols = [f'459_{i}' for i in range(20)] #'459_0', '459_1', ..., '459_19'
#Take the first row of the data
y = readoutsTrain[cols].iloc[0]
#Plot the data
plt.figure(figsize = (10,10))
plt.plot(y, 'o-')
plt.xlabel('Bin')
plt.ylabel('Value')
plt.title('Sample readout for variable 459, as a histogram of twenty bins');

In [None]:
cols =  ['171_0', '666_0', '427_0', '837_0', '309_0', '835_0', '370_0', '100_0']
vechicleId = 2
x = readoutsTrain[readoutsTrain['vehicle_id'] == vechicleId]['time_step'].to_numpy()
y = readoutsTrain[readoutsTrain['vehicle_id'] == vechicleId][cols].to_numpy()
plt.figure(figsize = (10,10))
plt.plot(x, y)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(cols)
plt.title(f'Sample readout for vehicle {vechicleId}');

# Correlation of non-histogram features in the training set
See what the correlations of the non-histogram features are, for the entire training set.

In [None]:
cols =  ['171_0', '666_0', '427_0', '837_0', '309_0', '835_0', '370_0', '100_0']

#Calculate the correlation matrix
corr = readoutsTrain[cols].corr()

#Plot the correlation matrix
plt.figure(figsize = (10,10))
ax = sns.heatmap(corr, annot=True, fmt=".2f")

#Add title
plt.title('Correlation matrix of the non-histogram variables');

In [None]:
# Check the number of missing values per column
y = readoutsTrain.isnull().sum()
plt.figure(figsize = (10,10))
plt.bar(y.index, y)
plt.xticks(rotation=90) #Rotate the column names so they stay readable
plt.xlabel('Variable')
plt.ylabel('Number of missing values')
plt.title('Number of missing values per variable in the readouts data');


# Visualize the file train_tte.csv

The file "train_tte.csv" contains the repair records of Component X collected from each vehicle, indicating
the time_to_event (tte), i.e., the replacement time for Component X during the study period. The file has 23550
rows and, besides the vehicle_id, two columns: "length_of_study_time_step" and "in_study_repair". The former is the number
of operation time steps after Component X started working. The latter is the class label: it is 1 if Component X was
repaired at the time equal to its corresponding length_of_study_time_step, and 0 if no failure or
repair event occurred during the first length_of_study_time_step of operation. Note that the "train_tte.csv" data
is imbalanced, with 21278 occurrences of label 0 and 2272 occurrences of label 1.
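
From the counts quoted above, repaired vehicles make up 2272 / 23550, or roughly 9.6%, of the training vehicles. A small helper like the one below can reproduce that figure from the data. This is only a sketch; the function name `repair_balance` is made up for illustration.

```python
import pandas as pd


def repair_balance(tte):
    """Return the count and share of each in_study_repair label."""
    counts = tte['in_study_repair'].value_counts().sort_index()
    return pd.DataFrame({'count': counts, 'share': counts / counts.sum()})
```

Running `repair_balance(tteTrain)` should show a share close to 0.096 for label 1 if the file matches the description.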

In [None]:
tteTrain.tail()

In [None]:
#Check the number of data points in the Time To Event data
tteTrain.shape

In [None]:
tteTrain.describe()

In [None]:
#Check if there are nulls in the data
tteTrain.isnull().sum()

In [None]:
#Count the number of unique vehicles
tteTrain['vehicle_id'].unique().shape

In [None]:
#Check the distribution of events: in_study_repair is 1 if a repair happened and 0 if no repair happened
tteTrain['in_study_repair'].hist()

In [None]:
plt.hist(tteTrain['length_of_study_time_step'], bins = 100);
plt.xlabel('Length of study time step')
plt.ylabel('Frequency')



# The file train_specifications.csv

The last file in the training set is called "train_specifications.csv". It contains information about the specifications of the
vehicles, such as their engine type and wheel configuration. In total, there are 23550 observations and eight categorical features
for all vehicles. The features in train_specifications.csv are anonymized; each can take categories in Cat0, Cat1, ..., Cat8.
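
To see how many of the categories each specification feature actually uses, a compact summary can be easier to read than eight histograms. The sketch below assumes the identifier column is called `vehicle_id`, as in the other files; the function name `category_summary` is made up for illustration.

```python
import pandas as pd


def category_summary(spec, id_col='vehicle_id'):
    """Summarize each categorical feature: number of distinct
    categories and the most frequent category."""
    features = spec.drop(columns=[id_col])
    return pd.DataFrame({
        'n_categories': features.nunique(),
        # mode() can return several rows on ties; keep the first one.
        'most_common': features.mode().iloc[0],
    })
```

A feature whose `n_categories` is 1 carries no information and is a candidate for removal later on.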

In [None]:
specificationsTrain.head()

In [None]:
# For each column create a subplot with the histogram of the data
plt.figure(figsize = (20,20))
for i, col in enumerate(specificationsTrain.columns[1:]):
    plt.subplot(4, 4, i+1)
    plt.hist(specificationsTrain[col])
    plt.title(col)

In [None]:
#Check for nans
specificationsTrain.isnull().sum()

In [None]:
tteTrain[tteTrain['in_study_repair'] == 1].head()

In [None]:
tteTrain.tail()

In [None]:
cols =  ['171_0', '666_0', '427_0', '837_0', '309_0', '835_0', '370_0', '100_0']
vechicleId = 22
x = readoutsTrain[readoutsTrain['vehicle_id'] == vechicleId]['time_step'].to_numpy()
y = readoutsTrain[readoutsTrain['vehicle_id'] == vechicleId][cols].to_numpy()
plt.figure(figsize = (10,10))
plt.plot(x, y)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(cols)
plt.title(f'Sample readout for vehicle {vechicleId}');

In [None]:
readoutsTrain[readoutsTrain['vehicle_id'] == vechicleId]

# Format of validation labels
The validation_labels.csv file has 5046 rows, which equals the number of vehicles that contributed to the operational data of the
validation set. It includes a column named class_label, corresponding to the class for the last readout of each vehicle.

The temporal placement of this final simulated readout is categorized into five classes,
denoted by 0, 1, 2, 3, and 4, which correspond to readouts within a time window of (more than 48), (48 to 24), (24 to 12), (12
to 6), and (6 to 0) time_step before the failure, respectively.
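
The mapping from "time steps before failure" to a class can be written as a small binning function. The sketch below is an illustration only: the function name `window_class` is made up, and because the description does not say which side of each boundary is inclusive, it assumes each window includes its upper bound and excludes its lower bound (so exactly 48 falls in class 1). It only applies to readouts of vehicles that are known to fail.

```python
import numpy as np


def window_class(time_before_failure):
    """Map time steps before failure to classes 0-4.

    Assumed windows: (48, inf) -> 0, (24, 48] -> 1, (12, 24] -> 2,
    (6, 12] -> 3, and values up to 6 -> 4.
    """
    t = np.asarray(time_before_failure, dtype=float)
    # With right=True, digitize returns i such that edges[i-1] < t <= edges[i].
    idx = np.digitize(t, [6, 12, 24, 48], right=True)
    return 4 - idx
```

For example, `window_class([100, 30, 7, 0.5])` gives classes 0, 1, 3, and 4.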

In [None]:
labelsValidation.head()

In [None]:
labelsValidation['class_label'].value_counts()

In [None]:
plt.hist(labelsValidation['class_label'])
plt.xlabel('Class label')
plt.ylabel('Count')
plt.title('Distribution of class labels in the validation data');

# Remove NaNs
We need to remove missing values and see if there are any strange values in the dataset.
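
One common way to fill gaps in per-vehicle time series is to carry the last known value forward, but only within the same vehicle. The toy example below illustrates the idea on made-up numbers (the values and the two vehicle IDs are invented for the illustration); it uses the `ffill` method of a pandas groupby.

```python
import numpy as np
import pandas as pd

# Toy data: two vehicles, one counter column with gaps.
toy = pd.DataFrame({
    'vehicle_id': [1, 1, 1, 2, 2],
    'time_step': [1.0, 2.0, 3.0, 1.0, 2.0],
    '171_0': [10.0, np.nan, 12.0, np.nan, 7.0],
})

filled = toy.copy()
# Forward fill within each vehicle; values never cross vehicle boundaries.
filled['171_0'] = toy.groupby('vehicle_id')['171_0'].ffill()
```

The gap inside vehicle 1 is filled with 10.0, while the first readout of vehicle 2 stays missing because there is no earlier value for that vehicle. Such leading gaps need a second strategy, for example a column median.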

In [None]:
#First we forward fill the last known value within each vehicle_id group.
#Remaining NaNs (e.g. readouts before a vehicle's first known value) are filled with the median of the column.
#If an entire column is NaN its median is NaN too, so any NaNs still left are filled with 0.

def fill_missing_values(df):
    df = df.copy() #Do not modify the caller's DataFrame
    value_cols = [c for c in df.columns if c != 'vehicle_id']
    df[value_cols] = df.groupby('vehicle_id')[value_cols].ffill() #Forward fill last known value, but only for the same vehicle
    df = df.fillna(df.median(numeric_only=True)) #Fill with median rather than mean to reduce the influence of outliers
    df = df.fillna(0) #Last resort fill with 0
    
    return df

In [None]:
#Clean the data
print('Cleaning the training data')
print(f'Number of missing values before cleaning: {readoutsTrain.isnull().sum().sum()}')
readoutsTrain = fill_missing_values(readoutsTrain)
print(f'Number of missing values after cleaning: {readoutsTrain.isnull().sum().sum()}')
#readoutsValidation = fill_missing_values(readoutsValidation)

In [None]:
readoutsTrain.head()

# Class labels
Create the class labels for our training set. There are multiple ways to do this. Let's start by denoting the labels as they are in the validation set.

In [None]:
# We want to create labels for the training data based on the time to event data
# Labels in validation set are denoted by 0, 1, 2, 3, 4 where they are related to readouts within a time window of: (more than 48), (48 to 24), (24 to 12), (12 to 6), and (6 to 0) time_step before the failure, respectively. 
# If no failure is reported and the number of time steps left is 48 or fewer, we don't know whether or when a failure will happen, so we label the readout as -1.

def get_class_label(row):
    #classes denoted by 0, 1, 2, 3, 4 where they are related to readouts within a time window of: (more than 48), (48 to 24), (24 to 12), (12 to 6), and (6 to 0) time_step before the failure, respectively
    if row['time_to_potential_event'] > 48:
        return 0 #No failure within 48 time steps
    elif row['time_to_potential_event'] > 24 and row['in_study_repair'] == 1:
        return 1 #Failure within 48 to 24 time steps
    elif row['time_to_potential_event'] > 12 and row['in_study_repair'] == 1:
        return 2 #Failure within 24 to 12 time steps
    elif row['time_to_potential_event'] > 6 and row['in_study_repair'] == 1:
        return 3 #Failure within 12 to 6 time steps
    elif row['time_to_potential_event'] > 0 and row['in_study_repair'] == 1:
        return 4 #Failure within 6 to 0 time steps
    else:
        return -1 #No failure reported, but within 48 time steps from the end of the study, don't know if it will fail or not
    
def add_class_labels(tte, readouts):
    # Join the readouts and the time to event data
    df = pd.merge(readouts, tte, on = 'vehicle_id', how='left').copy() #Use the tte argument, not the global tteTrain

    #Calculate the time to a failure event
    df['time_to_potential_event'] = df['length_of_study_time_step'] - df['time_step']

    df['class_label'] = df.apply(get_class_label, axis=1)

    return df



In [None]:
#Merge the time to event data with the readouts data and figure out which class they belong to
#Later we will need to remove the columns: length_of_study_time_step, in_study_repair, time_to_potential_event, class_label and remove any rows with class label -1
dfTrain = add_class_labels(tteTrain, readoutsTrain)

In [None]:
dfTrain.head()

In [None]:
dfTrain.tail()
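
The row-wise `df.apply(get_class_label, axis=1)` call above runs a Python function once per readout, which can be slow on a table of this size. A vectorized alternative is sketched below with `np.select`, which picks the first condition that is true and therefore mirrors the if/elif chain. The function name `class_labels_vectorized` is made up for illustration, and the thresholds are copied from `get_class_label`.

```python
import numpy as np


def class_labels_vectorized(time_to_event, repaired):
    """Vectorized counterpart of the if/elif labelling rules.

    time_to_event: time steps left until the end of the study or the repair.
    repaired: 1 if the vehicle was repaired in the study, else 0.
    """
    t = np.asarray(time_to_event, dtype=float)
    r = np.asarray(repaired) == 1
    conditions = [
        t > 48,        # class 0, regardless of repair status
        r & (t > 24),  # class 1
        r & (t > 12),  # class 2
        r & (t > 6),   # class 3
        r & (t > 0),   # class 4
    ]
    # Rows matching no condition get -1 (outcome unknown).
    return np.select(conditions, [0, 1, 2, 3, 4], default=-1)
```

It could be used as `class_labels_vectorized(dfTrain['time_to_potential_event'], dfTrain['in_study_repair'])`, and the result can be compared with the existing `class_label` column to check that both approaches agree.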

# Feature engineering
Let's create features that can be used in an ML model.