# **Business Problem Understanding:**

## **Objective:**
The goal of this predictive maintenance project is to develop a machine learning model that can effectively predict machine failures based on a synthetic dataset provided. The dataset contains information about various features related to the manufacturing process, and the target variable includes two aspects: whether a machine failure occurred (binary classification) and the type of failure if it occurred (multi-class classification).

## Dataset Description
The dataset consists of 10,000 data points. The dataset description cites 14 features per observation, while the CSV file loaded in this notebook has 10 columns. The main fields are:
1. UDI: Unique identifier ranging from 1 to 10,000.
2. Product ID: A letter L, M, or H for the low, medium, or high quality variant, followed by a variant-specific serial number. The letter is also stored in the separate `Type` column.
3. Air Temperature [K]: Generated using a random walk process, later normalized to a standard deviation of 2 K around 300 K.
4. Process Temperature [K]: Generated using a random walk process, normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
5. Rotational Speed [rpm]: Calculated from a power of 2860 W, overlaid with normally distributed noise.
6. Torque [Nm]: Normally distributed around 40 Nm with a standard deviation of 10 Nm and no negative values.
7. Tool Wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the tool used in the process.
8. Machine Failure Label: Indicates whether the machine failed in a particular data point for any of the specified failure modes.
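
The generation recipe described in the list above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Gaussian steps of the random walk, the seed, the helper name `normalized_random_walk`, and the use of clipping to avoid negative torque are guesses, not the dataset authors' actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n = 10_000


def normalized_random_walk(n_steps, target_std, rng):
    """Random walk rescaled to zero mean and the requested standard deviation."""
    walk = np.cumsum(rng.normal(size=n_steps))  # assumption: Gaussian steps
    return (walk - walk.mean()) / walk.std() * target_std


# Air temperature: random walk with a standard deviation of 2 K around 300 K
air_temp = 300.0 + normalized_random_walk(n, 2.0, rng)

# Process temperature: random walk with std 1 K, added to air temperature plus 10 K
process_temp = air_temp + 10.0 + normalized_random_walk(n, 1.0, rng)

# Torque: normal around 40 Nm with std 10 Nm; clipping at zero is an assumption
torque = np.clip(rng.normal(loc=40.0, scale=10.0, size=n), 0.0, None)

print(air_temp.mean(), air_temp.std())
```

The rotational speed is left out on purpose, because the description does not say how the 2860 W figure is converted into rpm.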

## Targets
1. **Failure or Not**: Binary classification indicating whether a machine failure occurred.
2. **Failure Type**: Multi-class classification specifying the type of failure if it occurred.

## Challenge
Predictive maintenance is crucial for minimizing downtime and reducing operational costs in the manufacturing industry. Developing an accurate model to predict failures can enable proactive maintenance interventions, optimizing resource allocation and ensuring continuous production.

## Approach
The project will involve data preprocessing, exploratory data analysis, feature engineering, and the development of machine learning models for both binary and multi-class classifications. The dataset's synthetic nature reflects real-world predictive maintenance scenarios, allowing the model to learn from various failure modes and their associated features.

## Acknowledgements
The dataset is sourced from UCI's AI4I 2020 Predictive Maintenance Dataset, which serves as a valuable resource for developing and testing predictive maintenance models.


[Dataset Kaggle link](https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification)

# **Index:**

| S.No | Section                          |
|------|----------------------------------|
| 1.   | Importing Libraries              |
| 2.   | Data Acquisition                 |
| 3.   | Exploratory Data Analysis        |
| 4.   | Feature Engineering              |
| 5.   | Machine Learning (Predictive Analytics) |
| 6.   | Model Prediction                 |
| 7.   | Model Evaluation                 |



## **1. Importing Libraries:**

In [1]:
# Import necessary libraries
import os
import pickle
import pandas as pd
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LinearRegression, LogisticRegression

## **2. Data Acquisition:**

In [3]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv("D:\\Feynn-Labs-Project-1\\project1\\predictive_maintenance.csv")

# Display the first few rows of the DataFrame
df.head()

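|   | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type |
|---|-----|------------|------|---------------------|-------------------------|------------------------|-------------|-----------------|--------|--------------|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure |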


## **3. Exploratory Data Analysis:**

In [4]:
# Count how often each value occurs in the 'Failure Type' column of df
df['Failure Type'].value_counts()

Failure Type
No Failure                  9652
Heat Dissipation Failure     112
Power Failure                 95
Overstrain Failure            78
Tool Wear Failure             45
Random Failures               18
Name: count, dtype: int64

##### **Observation: The `Failure Type` column is heavily imbalanced: 9,652 of the 10,000 rows are `No Failure`, and the rarest class (`Random Failures`) has only 18 rows.**
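
To put a number on the imbalance, the counts from the output above can be turned into class shares and a ratio between the largest and smallest class. The following is a small sketch; the names `failure_counts`, `shares`, and `imbalance_ratio` are introduced here only for illustration.

```python
# Counts copied from the value_counts() output above
failure_counts = {
    "No Failure": 9652,
    "Heat Dissipation Failure": 112,
    "Power Failure": 95,
    "Overstrain Failure": 78,
    "Tool Wear Failure": 45,
    "Random Failures": 18,
}

total = sum(failure_counts.values())

# Share of each class in percent
shares = {name: 100 * count / total for name, count in failure_counts.items()}

# Ratio between the most and the least frequent class
imbalance_ratio = max(failure_counts.values()) / min(failure_counts.values())

for name, share in shares.items():
    print(f"{name:<26}{share:6.2f}%")
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A classifier that always predicts `No Failure` would already be about 96.5% accurate on this data, which is why accuracy alone is a weak yardstick here.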

In [5]:
# Count the occurrences of each unique value in the 'Type' column of the DataFrame 'df'
# and display the result as a series of value counts.
df.Type.value_counts()

Type
L    6000
M    2997
H    1003
Name: count, dtype: int64

The `Type` column takes three values (L, M, H), so it is one-hot encoded in the next cell: each category becomes its own binary column (`Type_H`, `Type_L`, `Type_M`) that the machine learning models can use directly.

In [6]:
# Encode categorical variable 'Type' using one-hot encoding and create dummy columns
df1 = pd.get_dummies(df, columns=['Type'])

# Convert the dummy columns 'Type_H', 'Type_L', 'Type_M' to integer type
df1[['Type_H', 'Type_L', 'Type_M']] = df1[['Type_H', 'Type_L', 'Type_M']].astype(int)

# Display the first few rows of the modified DataFrame
df1.head()

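|   | UDI | Product ID | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type | Type_H | Type_L | Type_M |
|---|-----|------------|---------------------|-------------------------|------------------------|-------------|-----------------|--------|--------------|--------|--------|--------|
| 0 | 1 | M14860 | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure | 0 | 0 | 1 |
| 1 | 2 | L47181 | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure | 0 | 1 | 0 |
| 2 | 3 | L47182 | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure | 0 | 1 | 0 |
| 3 | 4 | L47183 | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure | 0 | 1 | 0 |
| 4 | 5 | L47184 | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure | 0 | 1 | 0 |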


In [7]:
# Obtain unique values in the 'Target' column of DataFrame df1
df1.Target.unique()

array([0, 1], dtype=int64)

#### **Explanation:**

The `Target` column in DataFrame `df1` contains binary values, where:
- `1` indicates machine failure
- `0` indicates no failure



In [8]:
# Filter rows in DataFrame df1 where the 'Target' column has a value of 1
# Select the 'Failure Type' column from the filtered DataFrame
# Count the occurrences of each unique value in the 'Failure Type' column
df1[df1.Target == 1]['Failure Type'].value_counts()

Failure Type
Heat Dissipation Failure    112
Power Failure                95
Overstrain Failure           78
Tool Wear Failure            45
No Failure                    9
Name: count, dtype: int64
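
The output above lists nine rows with `Target` equal to 1 but a `Failure Type` of `No Failure`, and the 18 `Random Failures` rows from the earlier count do not appear among the failed rows at all, which suggests they carry `Target` 0. A cross-tabulation is one way to surface such mismatches between the two labels. The sketch below uses a small hypothetical frame `toy` rather than the real data, so the numbers it prints are not results from this dataset.

```python
import pandas as pd

# Hypothetical miniature frame with the same two label columns as df1
toy = pd.DataFrame({
    "Target": [0, 0, 0, 1, 1, 1, 0],
    "Failure Type": [
        "No Failure", "No Failure", "Random Failures",
        "Power Failure", "Tool Wear Failure", "No Failure", "No Failure",
    ],
})

# Rows: failure type, columns: binary target
table = pd.crosstab(toy["Failure Type"], toy["Target"])
print(table)

# Rows where the two labels disagree with each other
failed_without_type = (toy["Target"] == 1) & (toy["Failure Type"] == "No Failure")
typed_without_failure = (toy["Target"] == 0) & (toy["Failure Type"] != "No Failure")
inconsistent = toy[failed_without_type | typed_without_failure]
print(len(inconsistent))
```

Whether such rows should be dropped, relabelled, or kept is a modelling decision that this notebook does not make; the sketch only shows how to find them.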

In [9]:
# Display the columns of the DataFrame 'df1'
df1.columns

Index(['UDI', 'Product ID', 'Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'Target',
       'Failure Type', 'Type_H', 'Type_L', 'Type_M'],
      dtype='object')

#### **Column Check:**

The column listing confirms that the one-hot encoding worked: `Type` has been replaced by the dummy columns `Type_H`, `Type_L`, and `Type_M`.


## **4. Feature Engineering:**

In [10]:
# Selecting specific columns from DataFrame df1 to create the feature matrix X
X = df1[['Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]',
       'Type_H', 'Type_L', 'Type_M']]

# Assigning the 'Target' column from DataFrame df1 to the target variable y
y = df1.Target

##### Because the target classes are imbalanced, SMOTE from the `imblearn` package is used to oversample the minority class and balance the dataset.


In [11]:
# Apply SMOTE (Synthetic Minority Over-sampling Technique) to handle imbalanced classes
# Create a SMOTE object
smote = SMOTE()

# Resample the feature set (X) and target variable (y) using SMOTE
X_smote, y_smote = smote.fit_resample(X, y)

In [12]:
# Split the dataset into training and testing sets using the train_test_split function
# X_smote: Features data after applying SMOTE (Synthetic Minority Over-sampling Technique)
# y_smote: Target variable data after applying SMOTE
# test_size=0.2: Allocating 20% of the data for testing, and the remaining 80% for training
# x_train: Training set features
# x_test: Testing set features
# y_train: Training set target variable
# y_test: Testing set target variable
x_train, x_test, y_train , y_test = train_test_split(X_smote, y_smote, test_size=0.2)
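
The cells above oversample first and split afterwards, so the test set also contains synthetic rows. A common alternative is to split first and oversample only the training part, which keeps the test set made of real observations. The sketch below illustrates that order on synthetic data from `make_classification`. It uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE; with `imblearn`, the same pattern would call `SMOTE().fit_resample` on the training split only. All variable names and the `random_state` values are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data standing in for X and y (roughly 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# 1) Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 2) Oversample the minority class in the training part only
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([np.zeros(len(majority)), np.ones(len(minority_up))])

print(np.bincount(y_tr_bal.astype(int)))  # training classes are now balanced
print(np.bincount(y_te))                  # test set keeps the original ratio
```

Scores measured this way are usually lower than scores on an oversampled test set, but they are closer to what the model would see in production.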

## **5. Machine Learning (Predictive Analytics):**

In [13]:
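# List of machine learning models to be evaluated.
# Note: LinearRegression is a regressor, so its .score() returns R^2 rather than
# accuracy; its row in the results is not directly comparable to the classifiers.
models = [LinearRegression, LogisticRegression,
          DecisionTreeClassifier, RandomForestClassifier,
          KNeighborsClassifier, GaussianNB,
          MultinomialNB, SVC]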

# Corresponding names for the models
names = ['LinearRegression', 'LogisticRegression',
         'DecisionTreeClassifier', 'RandomForestClassifier',
         'KNeighborsClassifier', 'GaussianNB',
         'MultinomialNB', 'SVC']

# List to store model names and their corresponding accuracy scores
data = []

# Loop through each model and evaluate its performance
for name, model in zip(names, models):
    # Display the current model being processed
    print(name)
    
    # Initialize the model
    m = model()
    
    # Train the model on the training data
    m.fit(x_train, y_train)
    
    # Evaluate the model's performance on the test data and calculate accuracy score
    score = m.score(x_test, y_test)
    
    # Append model name and accuracy score to the data list
    data.append([name, score])


LinearRegression
LogisticRegression


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


DecisionTreeClassifier
RandomForestClassifier
KNeighborsClassifier
GaussianNB
MultinomialNB
SVC
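
The convergence warning in the output above recommends either raising `max_iter` or scaling the data. One way to do both is to wrap `LogisticRegression` in a pipeline with `StandardScaler`. The sketch below runs on synthetic data whose columns are deliberately put on very different scales; the value `max_iter=1000`, the scale factors, and the variable names are arbitrary choices for the example, not settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with columns on very different scales
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 1.0]) + 300.0

# Standardize every feature, then fit logistic regression with a higher cap
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# Number of solver iterations actually used (well below the cap if it converged)
n_iter_used = int(scaled_logreg.named_steps["logisticregression"].n_iter_[0])
train_accuracy = scaled_logreg.score(X_demo, y_demo)
print(n_iter_used, train_accuracy)
```

Scaling would likely help `KNeighborsClassifier` and `SVC` as well, since both depend on distances between feature vectors.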


In [14]:
# Sort the 'data' list by score (the second element of each entry) in descending order
data.sort(key=lambda x: x[1], reverse=True)

# Create a Pandas DataFrame from the sorted 'data' with columns 'Model name' and 'Score'
pd.DataFrame(data, columns=['Model name', 'Score'])

|   | Model name | Score |
|---|------------|-------|
| 0 | RandomForestClassifier | 0.980854 |
| 1 | DecisionTreeClassifier | 0.974386 |
| 2 | KNeighborsClassifier | 0.941785 |
| 3 | LogisticRegression | 0.88564 |
| 4 | SVC | 0.832083 |
| 5 | GaussianNB | 0.763519 |
| 6 | MultinomialNB | 0.642432 |
| 7 | LinearRegression | 0.596257 |
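
The scores above are accuracies (R^2 for `LinearRegression`). For a failure-detection task it is often more informative to look at the confusion matrix, precision, and recall as well. The sketch below uses two short hypothetical label lists, `y_true` and `y_pred`, so the numbers are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = failure, 0 = no failure)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Precision: share of predicted failures that were real failures
precision = precision_score(y_true, y_pred)

# Recall: share of real failures that the model caught
recall = recall_score(y_true, y_pred)

print(cm)
print(precision, recall)
```

In this toy case the accuracy is 0.7, yet only half of the real failures are detected, which is the kind of gap that accuracy alone hides.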


##### **Training the RandomForestClassifier with an increased number of n_estimators and utilizing the entire dataset for training**

RandomForestClassifier has the highest score in the table above, so it is retrained with more trees on the full SMOTE-resampled dataset rather than on the training split only.

### **Random Forest Classifier Algorithm:**

In [15]:
# Create a RandomForestClassifier with 300 decision trees
best_model = RandomForestClassifier(n_estimators=300)

# Train (fit) the RandomForestClassifier on the full SMOTE-resampled dataset
# (training and test portions together)
best_model.fit(X_smote, y_smote)

##### Creating a function that takes X values as input and returns predictions for failure.


In [16]:
# Function definition for 'is_failure'
def is_failure(x):
    # Create dummy variables for the 'Type' column using one-hot encoding
    df1 = pd.get_dummies(x, columns=['Type'])

    # Convert the one-hot encoded columns 'Type_H', 'Type_L', 'Type_M' to integer type
    df1[['Type_H', 'Type_L', 'Type_M']] = df1[['Type_H', 'Type_L', 'Type_M']].astype(int)
    
    # Return the predictions using the 'best_model' on the modified DataFrame 'df1'
    return best_model.predict(df1)
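
One caveat with the function above: `pd.get_dummies` only creates columns for the categories present in its input, so a batch that happens to contain no `H` rows would lack `Type_H` and the column selection would raise a `KeyError`. A possible safeguard is to reindex the encoded frame to a fixed column list. The helper `encode_type` and the constant `FEATURE_COLUMNS` below are hypothetical names introduced for this sketch.

```python
import pandas as pd

# Column order used for the feature matrix X earlier in the notebook
FEATURE_COLUMNS = [
    "Air temperature [K]", "Process temperature [K]",
    "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]",
    "Type_H", "Type_L", "Type_M",
]


def encode_type(x):
    """One-hot encode 'Type' and guarantee the full training column layout."""
    encoded = pd.get_dummies(x, columns=["Type"])
    # Missing dummy columns are added and filled with 0; the order is fixed
    encoded = encoded.reindex(columns=FEATURE_COLUMNS, fill_value=0)
    return encoded.astype({"Type_H": int, "Type_L": int, "Type_M": int})


# A single row of type L: plain get_dummies would only create 'Type_L'
single = pd.DataFrame({
    "Type": ["L"],
    "Air temperature [K]": [298.2],
    "Process temperature [K]": [308.7],
    "Rotational speed [rpm]": [1408],
    "Torque [Nm]": [46.3],
    "Tool wear [min]": [3],
})
encoded_single = encode_type(single)
print(encoded_single)
```

With this helper, a prediction function could pass `encode_type(x)` to `best_model.predict` and handle single rows as well as full batches.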

In [17]:
df.columns

Index(['UDI', 'Product ID', 'Type', 'Air temperature [K]',
       'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
       'Tool wear [min]', 'Target', 'Failure Type'],
      dtype='object')

In [18]:
# Extracting a subset of features from the DataFrame 'df'
# Selecting columns: 'Type', 'Air temperature [K]', 'Process temperature [K]', 
# 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]'
x = df[['Type', 'Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']]

# Printing the shape of the extracted feature subset
print(x.shape)

# Calculating the accuracy score by comparing the predictions obtained from 'is_failure(x)' 
# with the actual values in the 'Target' column of the DataFrame 'df'
accuracy_score(is_failure(x), df.Target)

(10000, 6)


1.0

> Testing the model on the original data, without oversampling:

The cell above selects the six raw input columns (`Type`, `Air temperature [K]`, `Process temperature [K]`, `Rotational speed [rpm]`, `Torque [Nm]`, and `Tool wear [min]`) from `df`, prints the shape of that subset, and compares the predictions of `is_failure(x)` with the `Target` column.

Note that `best_model` was fitted on `X_smote`, which contains all 10,000 original rows plus the synthetic ones. The accuracy of 1.0 is therefore measured on data the model has already seen and says little about performance on new data.


### **Saving the Model:**

In [19]:
# Create the 'models' directory if it does not exist yet
os.makedirs('models', exist_ok=True)

# Save the best_model using pickle in the 'models' directory
with open('models/is_failure.pkl', 'wb') as f:
    pickle.dump(best_model, f)

# **Failure Type:**

In [20]:
# Selecting specific columns as features (independent variables) from the DataFrame df1
X = df1[['Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]',
       'Type_H', 'Type_L', 'Type_M']]

# Selecting the 'Failure Type' column as the target variable (dependent variable) from the DataFrame df1
y = df1['Failure Type']


The feature matrix `X` is the same as for the binary task. The `Failure Type` labels in `y` are strings, so the next cell converts them to integers before training.

In [21]:
# Create a dictionary for label encoding, where each unique value in 'y' is assigned a numerical label
labelEncoding = {j: i for i, j in enumerate(y.unique())}

# Create an inverse dictionary to map numerical labels back to their original values
inverse = {j: i for i, j in labelEncoding.items()}

# Apply label encoding to the 'y' variable using the created mapping
y = y.map(labelEncoding)

In [22]:
labelEncoding

{'No Failure': 0,
 'Power Failure': 1,
 'Tool Wear Failure': 2,
 'Overstrain Failure': 3,
 'Random Failures': 4,
 'Heat Dissipation Failure': 5}

In [23]:
inverse

{0: 'No Failure',
 1: 'Power Failure',
 2: 'Tool Wear Failure',
 3: 'Overstrain Failure',
 4: 'Random Failures',
 5: 'Heat Dissipation Failure'}
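
The dictionary-based encoding above assigns codes in order of first appearance and can be reversed with the inverse dictionary. The sketch below repeats the same construction on a small hypothetical label column to show the round trip, and also what happens when a label was never seen. The names `labels`, `encoding`, and `decoding` are used only in this example.

```python
import pandas as pd

# Small hypothetical label column
labels = pd.Series(
    ["No Failure", "Power Failure", "No Failure", "Tool Wear Failure"]
)

# Same construction as above: codes follow the order of first appearance
encoding = {name: code for code, name in enumerate(labels.unique())}
decoding = {code: name for name, code in encoding.items()}

codes = labels.map(encoding)    # strings -> integers
restored = codes.map(decoding)  # integers -> strings

# A label that was never seen maps to NaN instead of raising an error
unseen = pd.Series(["Overstrain Failure"]).map(encoding)

print(codes.tolist())
print(restored.tolist())
print(unseen.isna().all())
```

Because unseen labels silently become `NaN`, it may be worth checking for missing values after mapping when new data is encoded.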

In [24]:
y.map(inverse)

0       No Failure
1       No Failure
2       No Failure
3       No Failure
4       No Failure
           ...    
9995    No Failure
9996    No Failure
9997    No Failure
9998    No Failure
9999    No Failure
Name: Failure Type, Length: 10000, dtype: object

#### **Handling Unbalanced Failure Types:**

The failure types are heavily imbalanced, so the minority classes are oversampled before training.

SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic samples for the minority failure types, giving every failure class a more balanced representation in the training data.

This step is crucial to prevent bias in the machine learning model and enhance its ability to accurately predict and generalize across different failure types.
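
The core idea behind SMOTE is to place each synthetic sample on the line segment between a minority sample and one of its nearest minority-class neighbours. The NumPy function below is a simplified sketch of that interpolation step, written for illustration; it is not the `imblearn` implementation, and the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)

    # Pairwise Euclidean distances; a point is never its own neighbour
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)  # random base points
    pick = rng.integers(0, k, size=n_new)              # random neighbour slot
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))                       # position on the segment

    return minority[base] + gap * (minority[partner] - minority[base])


# Hypothetical minority-class points in two dimensions
minority_points = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
synthetic = smote_like_samples(minority_points, n_new=6, k=2)
print(synthetic.shape)
```

Since every synthetic point is a blend of two real minority points, it always lies within the range spanned by the real minority samples.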


In [25]:
# Create a SMOTE (Synthetic Minority Over-sampling Technique) object
smote = SMOTE()

# Apply SMOTE to the feature set (X) and target variable (y)
X_smote, y_smote = smote.fit_resample(X, y)

In [26]:
# Split the dataset into training and testing sets using the train_test_split function
# X_smote and y_smote are assumed to be the features and target variable after applying SMOTE

# x_train: Training set features
# x_test: Testing set features
# y_train: Training set target variable
# y_test: Testing set target variable

# The test_size parameter is set to 0.2, indicating that 20% of the data will be used for testing,
# and the remaining 80% will be used for training.

x_train, x_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.2)

In [27]:
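# List of machine learning models to be evaluated.
# Note: LinearRegression is a regressor, so its .score() returns R^2 rather than
# accuracy; its row in the results is not directly comparable to the classifiers.
models = [LinearRegression, LogisticRegression,
          DecisionTreeClassifier, RandomForestClassifier,
          KNeighborsClassifier, GaussianNB,
          MultinomialNB]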

# Corresponding names of the models for identification
names = ['LinearRegression', 'LogisticRegression',
          'DecisionTreeClassifier', 'RandomForestClassifier',
          'KNeighborsClassifier', 'GaussianNB',
          'MultinomialNB']

# Container to store model names and their corresponding scores
data = []

# Iterate over the models and evaluate their performance
for name, model in zip(names, models):
    print(name)  # Display the name of the current model
    m = model()  # Create an instance of the model
    m.fit(x_train, y_train)  # Train the model with training data
    score = m.score(x_test, y_test)  # Evaluate the model on the test data
    data.append([name, score])  # Store the model name and score in the data list


LinearRegression
LogisticRegression


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


DecisionTreeClassifier
RandomForestClassifier
KNeighborsClassifier
GaussianNB
MultinomialNB


In [28]:
# Sort the 'data' list by score (the second element of each entry),
# highest score first
data.sort(key=lambda x: x[1], reverse=True)

# Create a Pandas DataFrame from the sorted 'data' list
# Specify column names as 'Model name' and 'Score'
pd.DataFrame(data, columns=['Model name', 'Score'])

|   | Model name | Score |
|---|------------|-------|
| 0 | RandomForestClassifier | 0.99482 |
| 1 | DecisionTreeClassifier | 0.98964 |
| 2 | KNeighborsClassifier | 0.948804 |
| 3 | LogisticRegression | 0.768281 |
| 4 | GaussianNB | 0.702581 |
| 5 | MultinomialNB | 0.523699 |
| 6 | LinearRegression | 0.316707 |


#### Training RandomForestClassifier with Increased n_estimators

RandomForestClassifier has the highest score in the table above. It is therefore retrained with a larger `n_estimators`, this time on the entire resampled dataset rather than on the training split only.
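
Refitting on all available data leaves no held-out rows for evaluation. One built-in way to keep a rough estimate is the out-of-bag score of a random forest, which evaluates each tree on the rows left out of its bootstrap sample. The sketch below shows this on synthetic three-class data; `oob_score=True`, the `random_state` values, and the data itself are assumptions for the example and are not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical three-class data standing in for the resampled features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Same number of trees as in the notebook, plus out-of-bag scoring
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)

# Accuracy estimated from rows each tree did not see during training
print(forest.oob_score_)
```

On SMOTE-resampled data this estimate is likely still optimistic, because synthetic rows closely resemble the real rows they were interpolated from.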


In [29]:
# Instantiate a RandomForestClassifier with 300 decision trees
best_model = RandomForestClassifier(n_estimators=300)

# Train the RandomForestClassifier on the full SMOTE-resampled dataset
# (training and test portions together)
best_model.fit(X_smote, y_smote)

#### Creating a function that takes input values (X) and returns predictions for the 'failure_type'.


In [30]:
# Define a function 'failure_type' that takes a DataFrame 'x' as input
def failure_type(x):
    
    # Use 'pd.get_dummies' to one-hot encode the 'Type' column in the DataFrame 'x'
    df1 = pd.get_dummies(x, columns=['Type'])
    
    # Convert the one-hot encoded columns 'Type_H', 'Type_L', 'Type_M' to integers
    df1[['Type_H', 'Type_L', 'Type_M']] = df1[['Type_H', 'Type_L', 'Type_M']].astype(int)
    
    # Use the 'best_model' to predict the failure type for the preprocessed DataFrame 'df1'
    return best_model.predict(df1)

In [31]:
# Extracting specific columns from the DataFrame 'df' and creating a new DataFrame 'x'
x = df[['Type', 'Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']]

# Displaying the shape of the DataFrame 'x'
print(x.shape)

# Making predictions using the 'failure_type' function on the extracted features 'x'
prediction = failure_type(x)

# Calculating and printing the accuracy score using the true 'Failure Type' values from 'df'
# and the predicted values obtained from the 'failure_type' function
accuracy_score(df['Failure Type'].map(labelEncoding), prediction)

(10000, 6)


1.0

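* The cell above tests the model on the original data, without oversampling. All of these rows were part of the data used to fit `best_model`, so the accuracy of 1.0 reflects training data rather than unseen data.
* The next cell saves the failure_type model.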

In [32]:
# Save the trained machine learning model to a file using pickle
# The file is named 'failure_type.pkl' and is stored in the 'models' directory
with open('models/failure_type.pkl', 'wb') as f:
    # Use pickle to dump the best_model object into the file
    pickle.dump(best_model, f)

Saving the inverse dictionary for converting integer values of predicted outcomes back into their corresponding string representations.


In [33]:
# Save the 'inverse' object using pickle for later use in the 'models' directory
# 'wb' mode is used for writing in binary format
with open('models/encoding.pkl', 'wb') as f:
    # Dump the 'inverse' object into the file handle 'f'
    pickle.dump(inverse, f)
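
For completeness, the saved mapping can be read back with `pickle.load` and used to decode integer predictions. The sketch below performs the round trip in a temporary directory so that it does not touch the real `models` folder; `inverse_demo`, `predicted_codes`, and the temporary path are placeholders for this example.

```python
import os
import pickle
import tempfile

# Same mapping as the 'inverse' dictionary shown earlier
inverse_demo = {
    0: "No Failure", 1: "Power Failure", 2: "Tool Wear Failure",
    3: "Overstrain Failure", 4: "Random Failures", 5: "Heat Dissipation Failure",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "encoding.pkl")

    # Write the mapping in binary mode
    with open(path, "wb") as f:
        pickle.dump(inverse_demo, f)

    # Read it back, also in binary mode
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Hypothetical model output: integer class labels
predicted_codes = [0, 5, 1]
decoded = [loaded[code] for code in predicted_codes]
print(decoded)
```

The saved model files can be loaded the same way. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.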

In [34]:
# Wrap the predicted integer labels in a pandas Series and map each one back to
# its failure-type name using the 'inverse' dictionary
pd.Series(prediction).map(inverse)

0       No Failure
1       No Failure
2       No Failure
3       No Failure
4       No Failure
           ...    
9995    No Failure
9996    No Failure
9997    No Failure
9998    No Failure
9999    No Failure
Length: 10000, dtype: object

> **By employing the provided code, we can transform the predicted values into the corresponding failure_type using the 'inverse' mapping.**
