# Data Preprocessing

1. Check for missing values: Check whether the dataset has missing values and decide how to handle them. If many values are missing, consider dropping the affected rows or imputing appropriate values.

2. Check for duplicates: Check whether the dataset contains duplicate rows and remove them if necessary.

3. Data type conversion: Check that each column has an appropriate data type. For example, the 'UDI' column should be an integer, while the 'Product ID' column should be categorical.

4. Check for outliers: Check for outliers, especially in the continuous variables such as 'Air temperature', 'Process temperature', 'Rotational speed', 'Torque', and 'Tool wear'. Depending on the context, consider removing or adjusting them.

5. Feature engineering: Create new features that may be relevant for predictive maintenance. For example, the difference between 'Process temperature' and 'Air temperature' may be an important indicator of machine failure.

6. Categorical encoding: Convert categorical variables such as 'Type' to numerical values using label, ordinal, or one-hot encoding, depending on the algorithm you plan to use.

7. Feature scaling: Normalize or standardize the continuous variables so that they have similar ranges. This helps many machine learning algorithms, particularly gradient-based ones, converge faster.

8. Balance the dataset: Check whether the dataset is balanced with respect to the 'Machine failure' label. If there are far more non-failure instances than failure instances, consider oversampling or undersampling to balance it.

9. Save the preprocessed dataset: Save the preprocessed dataset in a suitable format, such as CSV or Parquet, for future use.
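
The outlier check and the temperature-difference feature described in the list above are not carried out later in this notebook, so here is a minimal sketch of what they might look like. It assumes the raw column names of the dataset. The small `demo` frame (with a deliberately exaggerated last torque value), the helper name `iqr_outlier_mask`, the 1.5 × IQR rule, and the new column name `Temperature difference [K]` are illustrative choices, not part of the original pipeline.

```python
import pandas as pd

# Tiny illustrative frame; the last torque value is made up to act as an outlier.
demo = pd.DataFrame({
    "Air temperature [K]": [298.1, 298.2, 298.1, 298.2, 298.2],
    "Process temperature [K]": [308.6, 308.7, 308.5, 308.6, 308.7],
    "Torque [Nm]": [42.8, 46.3, 49.4, 39.5, 120.0],
})

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Boolean mask marking rows whose torque lies outside the IQR fences.
torque_outliers = iqr_outlier_mask(demo["Torque [Nm]"])

# Engineered feature: how much hotter the process is than the surrounding air.
demo["Temperature difference [K]"] = (
    demo["Process temperature [K]"] - demo["Air temperature [K]"]
)
```

Whether flagged rows should be removed, clipped, or kept depends on the context; in a failure dataset, extreme readings may be exactly the cases of interest.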

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('../data/raw/data.csv')
df

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,M24855,M,298.8,308.4,1604,29.5,14,0,0,0,0,0,0
9996,9997,H39410,H,298.9,308.4,1632,31.8,17,0,0,0,0,0,0
9997,9998,M24857,M,299.0,308.6,1645,33.4,22,0,0,0,0,0,0
9998,9999,H39412,H,299.0,308.7,1408,48.5,25,0,0,0,0,0,0


## Missing Values

In [3]:
df.isna().sum()

UDI                        0
Product ID                 0
Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
Machine failure            0
TWF                        0
HDF                        0
PWF                        0
OSF                        0
RNF                        0
dtype: int64
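
No column has missing values, so nothing needs to be handled here. For reference, had there been gaps, the two options mentioned in the overview (dropping rows or imputing) might look like the sketch below. The `toy` frame and the choice of median and mode as fill values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric gap and one categorical gap.
toy = pd.DataFrame({
    "Torque [Nm]": [42.8, np.nan, 49.4, 39.5],
    "Type": ["M", "L", None, "L"],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute numeric gaps with the median and categorical gaps with the mode.
imputed = toy.copy()
imputed["Torque [Nm]"] = imputed["Torque [Nm]"].fillna(imputed["Torque [Nm]"].median())
imputed["Type"] = imputed["Type"].fillna(imputed["Type"].mode().iloc[0])
```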

## Check for duplicates

In [4]:
df.duplicated().sum()

0

## Data type conversion

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9), object(2)
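
The overview suggests treating string columns as categorical. A minimal sketch of such a conversion on a made-up frame follows. The notebook itself keeps the `object` dtype and later drops or encodes these columns, so this step is optional; the names `toy`, `typed`, and `type_categories` are illustrative.

```python
import pandas as pd

# Made-up frame with the same column names as the dataset.
toy = pd.DataFrame({
    "UDI": [1, 2, 3, 4],
    "Type": ["M", "L", "L", "H"],
})

# Convert the string column to pandas' categorical dtype; other columns are untouched.
typed = toy.astype({"Type": "category"})
type_categories = list(typed["Type"].cat.categories)
```

Note that the categories are sorted alphabetically by default, which is not the L < M < H quality order, so a plain `category` conversion does not by itself provide an ordinal encoding.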

# Feature Engineering
## Creating type of failure feature

In [6]:
# Build a single failure-type column from the five indicator columns.
# If several flags are set on one row, the first match in this order wins,
# which mirrors an if/elif chain over TWF, HDF, PWF, OSF, RNF.
failure_cols = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF']
conditions = [df[col] == 1 for col in failure_cols]
# Rows with no flag set fall through to the default label.
df['type_of_failure'] = np.select(conditions, failure_cols, default='No failure')
df.drop(failure_cols, axis = 1, inplace = True)
df.head()

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,type_of_failure
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,No failure
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,No failure
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,No failure
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,No failure
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,No failure


In [7]:
df

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,type_of_failure
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,No failure
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,No failure
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,No failure
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,No failure
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,No failure
...,...,...,...,...,...,...,...,...,...,...
9995,9996,M24855,M,298.8,308.4,1604,29.5,14,0,No failure
9996,9997,H39410,H,298.9,308.4,1632,31.8,17,0,No failure
9997,9998,M24857,M,299.0,308.6,1645,33.4,22,0,No failure
9998,9999,H39412,H,299.0,308.7,1408,48.5,25,0,No failure


In [8]:
df.drop(['UDI', 'Product ID'], axis = 1, inplace = True)
df.head()

Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,type_of_failure
0,M,298.1,308.6,1551,42.8,0,0,No failure
1,L,298.2,308.7,1408,46.3,3,0,No failure
2,L,298.1,308.5,1498,49.4,5,0,No failure
3,L,298.2,308.6,1433,39.5,7,0,No failure
4,L,298.2,308.7,1408,40.0,9,0,No failure


## Converting kelvin to celsius

In [9]:
df['Air temperature [C]'] = df['Air temperature [K]'] - 273.15
df['Process temperature [C]'] = df['Process temperature [K]'] - 273.15
df.drop(['Air temperature [K]','Process temperature [K]'], axis = 1, inplace = True)
df.head()

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,type_of_failure,Air temperature [C],Process temperature [C]
0,M,1551,42.8,0,0,No failure,24.95,35.45
1,L,1408,46.3,3,0,No failure,25.05,35.55
2,L,1498,49.4,5,0,No failure,24.95,35.35
3,L,1433,39.5,7,0,No failure,25.05,35.45
4,L,1408,40.0,9,0,No failure,25.05,35.55


# Categorical Encoding

## Ordinal Encoding
The Type feature is ordinal: it denotes product quality, increasing from L (low) through M (medium) to H (high), so we encode it in that order.

In [12]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['L', 'M', 'H']])
df['Type'] = encoder.fit_transform(df[['Type']])
df.head()

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,type_of_failure,Air temperature [C],Process temperature [C]
0,1.0,1551,42.8,0,0,No failure,24.95,35.45
1,0.0,1408,46.3,3,0,No failure,25.05,35.55
2,0.0,1498,49.4,5,0,No failure,24.95,35.35
3,0.0,1433,39.5,7,0,No failure,25.05,35.45
4,0.0,1408,40.0,9,0,No failure,25.05,35.55


In [13]:
df['Type'].value_counts()

Type
0.0    6000
1.0    2997
2.0    1003
Name: count, dtype: int64
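
As a cross-check of the encoding above, the same L < M < H mapping can be written as a plain dictionary. This is a sketch on a made-up series, not a replacement for `OrdinalEncoder`; the names `type_order`, `types`, and `encoded` are illustrative.

```python
import pandas as pd

# Same order as categories=[['L', 'M', 'H']] passed to OrdinalEncoder.
type_order = {"L": 0.0, "M": 1.0, "H": 2.0}

# Made-up sample of Type values.
types = pd.Series(["M", "L", "L", "H"], name="Type")

# Map each quality letter to its rank.
encoded = types.map(type_order)
```

One difference in behaviour: `Series.map` silently turns an unseen category into `NaN`, whereas `OrdinalEncoder` raises an error for unknown categories by default.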

## Label Encoding
Our target variable is type_of_failure. Because it is the label to be predicted rather than an input feature, we encode it with LabelEncoder.

In [16]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['type_of_failure'] = encoder.fit_transform(df['type_of_failure'])
df.head()

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,type_of_failure,Air temperature [C],Process temperature [C]
0,1.0,1551,42.8,0,0,1,24.95,35.45
1,0.0,1408,46.3,3,0,1,25.05,35.55
2,0.0,1498,49.4,5,0,1,24.95,35.35
3,0.0,1433,39.5,7,0,1,25.05,35.45
4,0.0,1408,40.0,9,0,1,25.05,35.55


In [17]:
df['type_of_failure'].value_counts()

type_of_failure
1    9652
0     115
3      91
2      78
5      46
4      18
Name: count, dtype: int64

In [24]:
encoder.classes_

array(['HDF', 'No failure', 'OSF', 'PWF', 'RNF', 'TWF'], dtype=object)

In [25]:
print("Class to code mapping:", dict(zip(encoder.classes_, range(len(encoder.classes_)))))

Class to code mapping: {'HDF': 0, 'No failure': 1, 'OSF': 2, 'PWF': 3, 'RNF': 4, 'TWF': 5}
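
Model predictions will come back as these integer codes, so it helps to be able to decode them. The sketch below reverses the mapping using the class list printed above; the `codes` series is made up for illustration.

```python
import pandas as pd

# Class order copied from encoder.classes_ above; position in the list equals the code.
classes = ["HDF", "No failure", "OSF", "PWF", "RNF", "TWF"]
code_to_class = dict(enumerate(classes))

# Made-up integer predictions to decode.
codes = pd.Series([1, 0, 5, 3])
decoded = codes.map(code_to_class)
```

With the fitted encoder still in memory, `encoder.inverse_transform(codes)` should give the same labels; the dictionary version is useful once only the saved CSV is available.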


## Feature Scaling

In [29]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

cols = ['Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'Air temperature [C]', 'Process temperature [C]']
df_scaled = scaler.fit_transform(df[cols])
df_scaled = pd.DataFrame(df_scaled)
df_scaled.columns = cols
df_scaled


Unnamed: 0,Rotational speed [rpm],Torque [Nm],Tool wear [min],Air temperature [C],Process temperature [C]
0,0.222934,0.535714,0.000000,0.304348,0.358025
1,0.139697,0.583791,0.011858,0.315217,0.370370
2,0.192084,0.626374,0.019763,0.304348,0.345679
3,0.154249,0.490385,0.027668,0.315217,0.358025
4,0.139697,0.497253,0.035573,0.315217,0.370370
...,...,...,...,...,...
9995,0.253783,0.353022,0.055336,0.380435,0.333333
9996,0.270081,0.384615,0.067194,0.391304,0.333333
9997,0.277648,0.406593,0.086957,0.402174,0.358025
9998,0.139697,0.614011,0.098814,0.402174,0.370370


In [30]:
df.drop(cols, axis = 1, inplace = True)
df = pd.concat([df, df_scaled], axis = 1)
df.head()

Unnamed: 0,Type,Machine failure,type_of_failure,Rotational speed [rpm],Torque [Nm],Tool wear [min],Air temperature [C],Process temperature [C]
0,1.0,0,1,0.222934,0.535714,0.0,0.304348,0.358025
1,0.0,0,1,0.139697,0.583791,0.011858,0.315217,0.37037
2,0.0,0,1,0.192084,0.626374,0.019763,0.304348,0.345679
3,0.0,0,1,0.154249,0.490385,0.027668,0.315217,0.358025
4,0.0,0,1,0.139697,0.497253,0.035573,0.315217,0.37037
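
With its default range of 0 to 1, `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below applies that formula by hand to a made-up four-row frame. The scaled values differ from the output above because the minimum and maximum are taken over these four rows only.

```python
import pandas as pd

# Made-up subset using two of the scaled column names.
toy = pd.DataFrame({
    "Rotational speed [rpm]": [1551, 1408, 1498, 1433],
    "Torque [Nm]": [42.8, 46.3, 49.4, 39.5],
})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
```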


## Oversampling

The classes are heavily imbalanced (9652 'No failure' rows against 18 to 115 rows for each failure type), so we use SMOTE (Synthetic Minority Over-sampling Technique) to address the imbalance.

In [31]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='auto')
X = df.drop('type_of_failure', axis = 1)
Y= df['type_of_failure']

In [32]:
X_resampled, y_resampled = smote.fit_resample(X,Y)
df_sampled = pd.concat([X_resampled, y_resampled], axis = 1)
df_sampled.head()

Unnamed: 0,Type,Machine failure,Rotational speed [rpm],Torque [Nm],Tool wear [min],Air temperature [C],Process temperature [C],type_of_failure
0,1.0,0,0.222934,0.535714,0.0,0.304348,0.358025,1
1,0.0,0,0.139697,0.583791,0.011858,0.315217,0.37037,1
2,0.0,0,0.192084,0.626374,0.019763,0.304348,0.345679,1
3,0.0,0,0.154249,0.490385,0.027668,0.315217,0.358025,1
4,0.0,0,0.139697,0.497253,0.035573,0.315217,0.37037,1


In [33]:
df_sampled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57912 entries, 0 to 57911
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Type                     57912 non-null  float64
 1   Machine failure          57912 non-null  int64  
 2   Rotational speed [rpm]   57912 non-null  float64
 3   Torque [Nm]              57912 non-null  float64
 4   Tool wear [min]          57912 non-null  float64
 5   Air temperature [C]      57912 non-null  float64
 6   Process temperature [C]  57912 non-null  float64
 7   type_of_failure          57912 non-null  int32  
dtypes: float64(6), int32(1), int64(1)
memory usage: 3.3 MB


In [35]:
df_sampled['type_of_failure'].value_counts()

type_of_failure
1    9652
3    9652
5    9652
2    9652
4    9652
0    9652
Name: count, dtype: int64
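
With `sampling_strategy='auto'`, SMOTE oversamples every class except the majority until it matches the majority count, which is why all six classes end up with 9652 rows (6 × 9652 = 57912). The sketch below illustrates the core interpolation idea in plain NumPy. It is a simplified illustration, not imblearn's implementation; the `minority` points, the `synthesize` helper, and `k = 2` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in two scaled features.
minority = np.array([[0.10, 0.20], [0.15, 0.25], [0.30, 0.10], [0.20, 0.40]])

def synthesize(points: np.ndarray, n_new: int, k: int = 2) -> np.ndarray:
    """Create n_new points by interpolating toward one of the k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))                 # pick a random minority point
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]       # skip the point itself (distance 0)
        j = rng.choice(neighbours)                    # pick one of its nearest neighbours
        lam = rng.random()                            # interpolation factor in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6)
```

Each synthetic point lies on the line segment between two real minority points, so no new region of feature space is invented. If a held-out test set is used later, a common practice is to oversample only the training split so that synthetic points do not leak into the evaluation.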

In [36]:
df_sampled.to_csv('../data/processed/data_processed.csv', index = False)