# Predictive Maintenance 

### Student: Rodolfo Lerma

This assignment covers the topic of predictive maintenance. Predictive maintenance problems address predicting when a machine will need maintenance before it breaks down. The problem arises anywhere machines require regular maintenance, for example in manufacturing, fleet operations, and train maintenance.

This assignment uses the [Predictive Maintenance Dataset](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset). The dataset consists of 10 000 data points stored as rows with 14 features in columns (the copy loaded in this notebook has 9 columns). The 'Machine failure' label indicates whether the machine failed at that particular data point.

# Learning Objectives
- Perform model tuning based on hyperparameters.
- Select the best model after attempting multiple models.
- Perform recursive feature elimination, producing a statistically significant improvement over a model without feature selection.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split

ai4i2020 = pd.read_csv('ai4i2020.csv')
print(ai4i2020.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  object 
 4   Process temperature [K]  10000 non-null  object 
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 703.2+ KB
None


In [2]:
ai4i2020.head(10)

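| | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 |
| 5 | 6 | M14865 | M | 298.1 | 308.6 | 1425 | 41.9 | 11 | 0 |
| 6 | 7 | L47186 | L | 298.1 | 308.6 | 1558 | 42.4 | 14 | 0 |
| 7 | 8 | L47187 | L | 298.1 | 308.6 | 1527 | 40.2 | 16 | 0 |
| 8 | 9 | M14868 | M | 298.3 | 308.7 | 1667 | 28.6 | 18 | 0 |
| 9 | 10 | M14869 | M | 298.5 | 309.0 | 1741 | 28.0 | 21 | 0 |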


In [3]:
ai4i2020.shape

(10000, 9)

In [4]:
ai4i2020.dtypes

UDI                          int64
Product ID                  object
Type                        object
Air temperature [K]         object
Process temperature [K]     object
Rotational speed [rpm]       int64
Torque [Nm]                float64
Tool wear [min]              int64
Machine failure              int64
dtype: object

## Question 1.1:  Write a command that will calculate the number of unique values for each feature in the training data.

In [5]:
columns_names = ai4i2020.columns.to_list()

for i in columns_names:
    unique_values = ai4i2020[i].value_counts(dropna=False)
    print(' ')
    print('****************')
    print(' ')
    print(unique_values)

 
****************
 
2049    1
8865    1
6806    1
4759    1
8857    1
       ..
9526    1
5432    1
7481    1
1338    1
2047    1
Name: UDI, Length: 10000, dtype: int64
 
****************
 
L48931    1
L50527    1
M22467    1
L53732    1
L49937    1
         ..
M21134    1
L56472    1
L50967    1
M16245    1
L49482    1
Name: Product ID, Length: 10000, dtype: int64
 
****************
 
L    6000
M    2997
H    1003
Name: Type, dtype: int64
 
****************
 
300.7    279
298.9    231
297.4    230
300.5    229
298.8    227
        ... 
304.4      7
296        6
295.4      3
295.3      3
304.5      1
Name: Air temperature [K], Length: 93, dtype: int64
 
****************
 
310.6    317
310.8    273
310.7    266
308.6    265
310.5    263
        ... 
306.9      4
306.8      4
305.8      3
305.7      2
313.8      2
Name: Process temperature [K], Length: 82, dtype: int64
 
****************
 
1452    48
1435    43
1447    42
1469    40
1479    40
        ..
2165     1
2133     1
2117     1

In [6]:
print('Number of Unique Values per Column')
print(' ')
for i in columns_names:
    unique = ai4i2020[i].nunique()
    print(i)
    print(unique)
    print(' ')
    print('************')

Number of Unique Values per Column
 
UDI
10000
 
************
Product ID
10000
 
************
Type
3
 
************
Air temperature [K]
93
 
************
Process temperature [K]
82
 
************
Rotational speed [rpm]
941
 
************
Torque [Nm]
577
 
************
Tool wear [min]
246
 
************
Machine failure
2
 
************
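
The per-column loop above can also be collapsed into a single call. The following is a minimal sketch, assuming a small hypothetical frame `demo` (illustrative values only) in place of `ai4i2020`:

```python
import pandas as pd

# Hypothetical stand-in for ai4i2020; the values are illustrative only
demo = pd.DataFrame({
    'Type': ['M', 'L', 'L', 'H'],
    'Tool wear [min]': [0, 3, 3, 7],
    'Machine failure': [0, 0, 0, 1],
})

# DataFrame.nunique() returns a Series with one unique-value count per column
unique_counts = demo.nunique()
print(unique_counts)
```

On the real data the equivalent call would be `ai4i2020.nunique()`.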


## Question 1.2: Determine if the data contains any missing values, and replace the values with np.nan. Missing values would be '?'.

In [7]:
ai4i2020.replace('?', np.nan, inplace=True)

In [8]:
ai4i2020.isnull().sum()  # count NaN values per column (two columns contain missing values)

UDI                          0
Product ID                   0
Type                         0
Air temperature [K]        140
Process temperature [K]    183
Rotational speed [rpm]       0
Torque [Nm]                  0
Tool wear [min]              0
Machine failure              0
dtype: int64

## Question 1.3: Replace all missing values with the mean. Change column types to numeric.

In [9]:
# Convert the temperature columns to numeric and replace missing values with the column mean
sample_columns = ['Air temperature [K]', 'Process temperature [K]']
for i in sample_columns:
    ai4i2020[i] = pd.to_numeric(ai4i2020[i])
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    ai4i2020[i] = ai4i2020[i].fillna(ai4i2020[i].mean())
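
As an alternative sketch of the same step, `pd.to_numeric(..., errors='coerce')` turns any non-numeric entry (including '?') into NaN during conversion, so the replacement and the type conversion happen together. The frame `demo` and its values are hypothetical and only stand in for one temperature column:

```python
import pandas as pd

# Hypothetical column with one missing value marked as '?'
demo = pd.DataFrame({'Air temperature [K]': ['298.1', '?', '298.3', '298.5']})

col = 'Air temperature [K]'
# Coerce non-numeric strings such as '?' to NaN while converting to float
demo[col] = pd.to_numeric(demo[col], errors='coerce')
n_missing = int(demo[col].isnull().sum())
# Fill NaN with the column mean (NaN entries are skipped when the mean is computed)
demo[col] = demo[col].fillna(demo[col].mean())
print(n_missing, demo[col].tolist())
```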

## Question 1.4: Drop UDI and 'Product ID' from the data

In [10]:
ai4i2020.drop(columns=['UDI', 'Product ID'], inplace=True)
ai4i2020.head()

| | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure |
|---|---|---|---|---|---|---|---|
| 0 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 |
| 1 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 |
| 2 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 |
| 3 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 |
| 4 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 |


## Question 2.2: Apply [One-Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the data. Make sure to fit on the training data and transform both the training and test data.

In [11]:
from sklearn.preprocessing import OneHotEncoder

# scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False
hot_encoder = OneHotEncoder(sparse_output=False)
x = ai4i2020['Type'].values.reshape(-1, 1)
ai4i2020_hot = hot_encoder.fit_transform(x)

In [12]:
column_names = hot_encoder.categories_

In [28]:
# Add one 0/1 column per category of 'Type' (H, L, M)
for i, category in enumerate(column_names[0]):
    ai4i2020[category] = ai4i2020_hot[:, i].tolist()

In [31]:
ai4i2020.drop(columns=['Type'], inplace = True)

## Question 2.1: Split the data into training and testing taking into consideration 'Machine failure' as the target (y)

In [15]:
from sklearn.preprocessing import OneHotEncoder
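
One possible way to combine the split asked for here with the fit-on-train requirement of Question 2.2 is sketched below. It is not the notebook's own solution: the frame `demo`, the helper `encode_type`, and the `test_size` and `random_state` values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the cleaned frame, before 'Type' is encoded
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Type': rng.choice(['L', 'M', 'H'], size=n),
    'Torque [Nm]': rng.normal(40, 10, size=n),
    'Machine failure': np.r_[np.ones(20, dtype=int), np.zeros(n - 20, dtype=int)],
})

X = demo.drop(columns=['Machine failure'])
y = demo['Machine failure']

# stratify=y keeps the failure rate similar in both splits;
# test_size and random_state are arbitrary choices, not given by the assignment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

def encode_type(frame, encoder):
    """Replace 'Type' with one column per category learned by the encoder."""
    encoded = encoder.transform(frame[['Type']]).toarray()
    encoded = pd.DataFrame(encoded, columns=encoder.categories_[0], index=frame.index)
    return pd.concat([frame.drop(columns=['Type']), encoded], axis=1)

# Fit on the training split only, then transform both splits with the same encoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train[['Type']])
X_train_enc = encode_type(X_train, encoder)
X_test_enc = encode_type(X_test, encoder)
print(X_train_enc.shape, X_test_enc.shape)
```

Fitting the encoder on the training split alone means the test split never influences which categories the model sees, which is the point of the "fit on training, transform both" instruction.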


## Question 2.3: Apply [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to the training data since there is class imbalance.

In [16]:
from imblearn.over_sampling import SMOTE


ModuleNotFoundError: No module named 'imblearn'
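
The error above indicates that the imbalanced-learn package is not installed in this environment. Once it is installed (`pip install imbalanced-learn`), the usual call is `SMOTE().fit_resample(X_train, y_train)`, applied to the training split only. To illustrate the idea without that dependency, here is a simplified SMOTE-style sketch built on scikit-learn's `NearestNeighbors`. The function `smote_like_oversample` is hypothetical, handles only a binary target, and is not imbalanced-learn's implementation; the data are synthetic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X, y, minority_label=1, k=5, random_state=0):
    """Simplified SMOTE-style oversampling for a binary target (illustrative only).

    New minority rows are interpolated between a minority sample and one of
    its k nearest minority neighbours until both classes have equal counts.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    if len(X_min) < 2:
        raise ValueError('need at least two minority samples to interpolate')
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # Ask for k + 1 neighbours because each point is returned as its own neighbour
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_res = np.vstack([X, X_new])
    y_res = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_res, y_res

# Synthetic imbalanced training data: 90 majority rows, 10 minority rows
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y_train = np.r_[np.zeros(90, dtype=int), np.ones(10, dtype=int)]

X_res, y_res = smote_like_oversample(X_train, y_train)
print(np.bincount(y_train), np.bincount(y_res))
```

Only the training data is resampled; the test split should keep its original class balance so that the evaluation reflects real conditions.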

## Question 3.1: Train five machine learning models: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier), based on the training data, and evaluate their performance on the test dataset. Use default hyperparameter values.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


In [None]:
# Build the models with default hyperparameters (they can be built combined or separately)

models = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machine': SVC(),
    'K-NN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'XGBoost': XGBClassifier(),
}
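
The notebook stops before fitting the models. A minimal sketch of a train-and-evaluate loop over such a dictionary is shown below, assuming synthetic data from `make_classification` in place of the prepared features. To keep the sketch runnable with scikit-learn alone it covers only the four scikit-learn classifiers; `XGBClassifier()` could be added to the dictionary in the same way when xgboost is installed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared features and target
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Default hyperparameters, as the question asks
models = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machine': SVC(),
    'K-NN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Keep a dictionary version of the report so the models can be compared later
    reports[name] = classification_report(y_test, y_pred, output_dict=True,
                                          zero_division=0)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))
```

With a class imbalance like the one in this dataset, the per-class recall and F1 in the report are usually more informative than overall accuracy.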


## Question 3.2: Perform recursive feature elimination (3 features) on the dataset using a logistic regression classifier with max_iter=1000, random_state=5. Is there any difference in the results? Explain.

In [None]:
from sklearn.feature_selection import RFE
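
A minimal sketch of the requested recursive feature elimination follows, using the stated `max_iter=1000` and `random_state=5`. The data are synthetic (`make_classification`), so the selected features and scores say nothing about the real dataset; the baseline model is included only to show how the two results could be compared.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 features, of which 3 carry signal
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# RFE repeatedly fits the estimator and drops the weakest feature until 3 remain
estimator = LogisticRegression(max_iter=1000, random_state=5)
selector = RFE(estimator, n_features_to_select=3)
selector.fit(X_train, y_train)

selected = selector.support_  # boolean mask of the kept features
# score() reduces X to the selected features and scores the refitted estimator
rfe_accuracy = selector.score(X_test, y_test)

# Baseline: the same classifier trained on all features
baseline = LogisticRegression(max_iter=1000, random_state=5).fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

print(selected, selector.ranking_)
print(baseline_accuracy, rfe_accuracy)
```

On the real data, `selector.transform(X_train)` would give the reduced feature matrix for use with the other classifiers.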


## Q.4. Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include:
What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and to allot points for your effort in thinking, planning, and making connections to real-world work.