# Sensorless Drive Diagnosis Project

#### Purpose of the project
Build a multi-class classification model from numerical attributes.

#### About the dataset
Dataset used in the analysis: Sensorless Drive Diagnosis dataset

https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis 

The dataset contains features extracted from electric current drive signals.

There are 48 continuous predictive features. The target feature contains 11 classes.






SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. Sensorless Drive Diagnosis is a multi-class classification problem: the goal is to predict one of several possible outcomes.

INTRODUCTION: The dataset contains features extracted from electric current drive signals. The drive has both intact and defective components, which results in 11 classes, each representing a different condition. Each condition was measured several times under 12 different operating conditions, that is, at different speeds, load moments, and load forces.

In this iteration, we will establish the baseline accuracy measurement for comparison with future rounds of modeling.

ANALYSIS: In the baseline round, the machine learning algorithms achieved an average accuracy of 84.65%. Two algorithms (Random Forest and Extra Trees) achieved the top accuracy after the first round of modeling. After a series of tuning trials, Extra Trees produced the best overall result, with an accuracy of 99.95% on the training data. With the optimized parameters applied, Extra Trees scored 99.97% on the testing dataset, slightly higher than its training result.

CONCLUSION: For this iteration, the Extra Trees algorithm achieved the best overall training and validation results. For this dataset, Extra Trees could be considered for further modeling.
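
The baseline round described above compares several algorithms before any tuning. A minimal sketch of such a comparison, assuming 3-fold cross-validated accuracy as the metric and a synthetic stand-in for the real data (`X_demo`, `y_demo`, the candidate list, and all hyperparameters are illustrative assumptions, not the project's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in shaped like the real data: 48 numeric features, 11 classes.
X_demo, y_demo = make_classification(
    n_samples=1100, n_features=48, n_informative=20, n_redundant=0,
    n_classes=11, n_clusters_per_class=1, random_state=42)

# Hypothetical candidate list; the project may have used other algorithms.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=2000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=42),
}

# Mean cross-validated accuracy per candidate serves as the baseline.
baseline_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="accuracy")
    baseline_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f}")
```

The scores printed here come from synthetic data and say nothing about the figures quoted in the summary; only the structure of the comparison carries over.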

### Load Libraries And Modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# sns.set() resets an earlier set_style() call, so style, rc, and font scale are set together
sns.set(style='darkgrid', rc={'axes.facecolor': '0.9', 'figure.figsize': (35, 20)}, font_scale=2.1)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, ParameterGrid, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import recall_score, f1_score, fbeta_score, r2_score, roc_auc_score, roc_curve, auc, cohen_kappa_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score
from sklearn.pipeline import Pipeline
import xgboost as xgb

### Load Dataset

In [2]:
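# Whitespace-delimited file with no header row; sep=r"\s+" replaces the deprecated delim_whitespace=True
df = pd.read_csv("sensorless_drive_diagnosis.txt", sep=r"\s+", header=None)
# Name the feature columns feat0, feat1, ... and call the last column 'class'
header_names = ['feat' + str(i) for i in range(df.shape[1]-1)]
header_names.append('class')
# Assign the names directly; set_axis(..., inplace=True) is not available in pandas 2.x
df.columns = header_names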
display(df.head())
display(df.tail())

Unnamed: 0,feat0,feat1,feat2,feat3,feat4,feat5,feat6,feat7,feat8,feat9,...,feat39,feat40,feat41,feat42,feat43,feat44,feat45,feat46,feat47,class
0,-3.0146e-07,8.2603e-06,-1.2e-05,-2e-06,-1.4386e-06,-2.1e-05,0.031718,0.03171,0.031721,-0.032963,...,-0.63308,2.9646,8.1198,-1.4961,-1.4961,-1.4961,-1.4996,-1.4996,-1.4996,1
1,2.9132e-06,-5.2477e-06,3e-06,-6e-06,2.7789e-06,-4e-06,0.030804,0.03081,0.030806,-0.03352,...,-0.59314,7.6252,6.169,-1.4967,-1.4967,-1.4967,-1.5005,-1.5005,-1.5005,1
2,-2.9517e-06,-3.184e-06,-1.6e-05,-1e-06,-1.5753e-06,1.7e-05,0.032877,0.03288,0.032896,-0.029834,...,-0.63252,2.7784,5.3017,-1.4983,-1.4983,-1.4982,-1.4985,-1.4985,-1.4985,1
3,-1.3226e-06,8.8201e-06,-1.6e-05,-5e-06,-7.2829e-07,4e-06,0.02941,0.029401,0.029417,-0.030156,...,-0.62289,6.5534,6.2606,-1.4963,-1.4963,-1.4963,-1.4975,-1.4975,-1.4976,1
4,-6.8366e-08,5.6663e-07,-2.6e-05,-6e-06,-7.9406e-07,1.3e-05,0.030119,0.030119,0.030145,-0.031393,...,-0.6301,4.5155,9.5231,-1.4958,-1.4958,-1.4958,-1.4959,-1.4959,-1.4959,1


Unnamed: 0,feat0,feat1,feat2,feat3,feat4,feat5,feat6,feat7,feat8,feat9,...,feat39,feat40,feat41,feat42,feat43,feat44,feat45,feat46,feat47,class
58504,-1e-05,2e-06,-2.1e-05,2.1e-05,-6e-06,-9.8e-05,-0.083417,-0.083419,-0.083398,-0.18234,...,-0.52907,1.4641,7.0032,-1.5024,-1.5025,-1.5023,-1.4933,-1.4933,-1.4933,11
58505,-1.1e-05,2e-05,3.1e-05,-1.8e-05,-0.000106,0.000292,-0.085131,-0.085151,-0.085182,-0.18432,...,-0.51971,3.3275,2.3072,-1.5024,-1.5025,-1.5024,-1.4925,-1.4925,-1.4926,11
58506,-6e-06,1.9e-05,-0.000102,-3e-06,4e-06,0.000117,-0.081989,-0.082008,-0.081906,-0.18614,...,-0.51103,20.925,9.0437,-1.5035,-1.5035,-1.5039,-1.4911,-1.4912,-1.491,11
58507,-4e-06,3.4e-05,-0.000442,5e-06,7e-06,8.7e-05,-0.0815,-0.081534,-0.081093,-0.18363,...,-0.52033,1.389,10.743,-1.5029,-1.5029,-1.503,-1.4932,-1.4932,-1.4931,11
58508,-9e-06,5.2e-05,7.2e-05,1e-05,4e-06,-3.2e-05,-0.083034,-0.083086,-0.083159,-0.18589,...,-0.50974,1.6026,4.5773,-1.5039,-1.504,-1.5036,-1.4945,-1.4946,-1.4943,11
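
The column names can also be supplied while reading, which avoids renaming afterwards. A minimal sketch, assuming a tiny in-memory sample in place of the real file (`sample_text` and its three features are invented for illustration):

```python
import io

import pandas as pd

# Two made-up rows: three whitespace-separated features followed by a class label.
sample_text = "0.1 0.2 0.3 1\n0.4 0.5 0.6 2\n"
n_features = 3

# Build the names up front and pass them to read_csv via `names`.
names = [f"feat{i}" for i in range(n_features)] + ["class"]
demo_df = pd.read_csv(io.StringIO(sample_text), sep=r"\s+", header=None, names=names)
print(demo_df)
```

For the real file, `n_features` would be 48 and the `StringIO` object would be replaced by the file path.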


### Basic EDA

In [3]:
display(df.describe())

Unnamed: 0,feat0,feat1,feat2,feat3,feat4,feat5,feat6,feat7,feat8,feat9,...,feat39,feat40,feat41,feat42,feat43,feat44,feat45,feat46,feat47,class
count,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,...,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0,58509.0
mean,-3e-06,1.439648e-06,1.412013e-06,-1e-06,1.351239e-06,-2.654483e-07,0.001915,0.001913,0.001912,-0.011897,...,-0.397757,7.293781,8.273772,-1.500887,-1.500912,-1.500805,-1.497771,-1.497794,-1.497686,6.0
std,7.2e-05,5.555429e-05,0.0002353009,6.3e-05,5.660943e-05,0.0002261907,0.036468,0.036465,0.03647,0.066482,...,25.018728,12.451781,6.565952,0.003657,0.003668,0.003632,0.003163,0.003163,0.003175,3.162305
min,-0.013721,-0.0054144,-0.01358,-0.012787,-0.0083559,-0.0097413,-0.13989,-0.13594,-0.13086,-0.21864,...,-0.90235,-0.59683,0.32066,-1.5255,-1.5262,-1.5237,-1.5214,-1.5232,-1.5213,1.0
25%,-7e-06,-1.4444e-05,-7.2396e-05,-5e-06,-1.4753e-05,-7.3791e-05,-0.019927,-0.019951,-0.019925,-0.032144,...,-0.71547,1.4503,4.4363,-1.5033,-1.5034,-1.5032,-1.4996,-1.4996,-1.4995,3.0
50%,-3e-06,8.8046e-07,5.1377e-07,-1e-06,7.5402e-07,-1.6593e-07,0.013226,0.01323,0.013247,-0.015566,...,-0.66171,3.3013,6.4791,-1.5003,-1.5003,-1.5003,-1.4981,-1.4981,-1.498,6.0
75%,2e-06,1.8777e-05,7.52e-05,4e-06,1.9062e-05,7.1386e-05,0.02477,0.024776,0.024777,0.020614,...,-0.57398,8.2885,9.8575,-1.4982,-1.4982,-1.4982,-1.4962,-1.4963,-1.4962,9.0
max,0.005784,0.0045253,0.0052377,0.001453,0.00082451,0.0027536,0.069125,0.06913,0.069131,0.35258,...,3670.8,889.93,153.15,-1.4576,-1.4561,-1.4555,-1.3372,-1.3372,-1.3371,11.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58509 entries, 0 to 58508
Data columns (total 49 columns):
feat0     58509 non-null float64
feat1     58509 non-null float64
feat2     58509 non-null float64
feat3     58509 non-null float64
feat4     58509 non-null float64
feat5     58509 non-null float64
feat6     58509 non-null float64
feat7     58509 non-null float64
feat8     58509 non-null float64
feat9     58509 non-null float64
feat10    58509 non-null float64
feat11    58509 non-null float64
feat12    58509 non-null float64
feat13    58509 non-null float64
feat14    58509 non-null float64
feat15    58509 non-null float64
feat16    58509 non-null float64
feat17    58509 non-null float64
feat18    58509 non-null float64
feat19    58509 non-null float64
feat20    58509 non-null float64
feat21    58509 non-null float64
feat22    58509 non-null float64
feat23    58509 non-null float64
feat24    58509 non-null float64
feat25    58509 non-null float64
feat26    58509 non-null float64


In [5]:
df.isna().sum().sum()

0

In [6]:
df.groupby('class').size()

class
1     5319
2     5319
3     5319
4     5319
5     5319
6     5319
7     5319
8     5319
9     5319
10    5319
11    5319
dtype: int64

In [3]:
X = df.drop(['class'], axis=1)
y = df['class']

### Splitting The Data 

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
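
The split above shuffles the rows but does not stratify them. Since every class has the same number of rows (5,319), passing `stratify=y` would keep that balance in both subsets. A small sketch on toy labels (`X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 11 classes labelled 1..11 with 50 rows each.
y_toy = np.repeat(np.arange(1, 12), 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Count the test rows per class (index 0 is unused because labels start at 1).
test_counts = np.bincount(y_te)[1:]
print(test_counts)
```

Switching the real split to a stratified one would change which rows land in the test set, so the accuracy figures reported below would shift slightly.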

In [5]:
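from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training split only, then apply the same transform to the test split
scaler = StandardScaler()
# Keep both splits as DataFrames so the feature names match at fit and predict time
scaled_X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
scaled_X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)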
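
Scaling and model fitting can also be bundled in a `Pipeline` (already imported at the top), so the scaler is always fitted on exactly the data the model is trained on. A minimal sketch, assuming synthetic data and a plain logistic regression as the final step (the step names `scaler` and `clf` are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# fit() scales the training data and trains the model in one call;
# score() reuses the training-set statistics to scale the test data.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print(f"test accuracy: {demo_accuracy:.4f}")
```

A pipeline like this can be passed directly to `cross_val_score` or `GridSearchCV`, which then refit the scaler inside each fold.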

In [34]:
# SGDClassifier defaults to hinge loss (a linear SVM); here it is trained with an L1 penalty
clf = SGDClassifier(alpha=0.0001, max_iter=10000, penalty='l1')
clf.fit(scaled_X_train, y_train)
y_pred_train = clf.predict(scaled_X_train)
y_pred_test = clf.predict(scaled_X_test)

In [35]:
accuracy_score(y_test, y_pred_test)

0.6236540762262861

In [36]:
accuracy_score(y_train, y_pred_train)

0.6090328369688295
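
A single accuracy figure does not show which classes the linear model confuses. A per-class breakdown could be obtained with `confusion_matrix` and `classification_report`; the sketch below assumes synthetic three-class data and a fixed `random_state`, neither of which comes from the original cells:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would use the scaled splits.
X_demo, y_demo = make_classification(
    n_samples=900, n_features=12, n_informative=8, n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

scaler = StandardScaler().fit(X_tr)
clf_demo = SGDClassifier(alpha=0.0001, max_iter=10000, penalty="l1", random_state=1)
clf_demo.fit(scaler.transform(X_tr), y_tr)
y_hat = clf_demo.predict(scaler.transform(X_te))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)
# Precision, recall, and F1 per class.
print(classification_report(y_te, y_hat, zero_division=0))
```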

In [21]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty='l1', solver='saga', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)

In [22]:
accuracy_score(y_test, y_pred_test)

0.9092462826867203

In [23]:
accuracy_score(y_train, y_pred_train)

0.9106116606490482

In [12]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty='l2', solver='newton-cg', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)

In [13]:
accuracy_score(y_test, y_pred_test)

0.9175354640232439

In [14]:
accuracy_score(y_train, y_pred_train)

0.9204392505394492

In [15]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)

In [16]:
accuracy_score(y_test, y_pred_test)

0.9175354640232439

In [17]:
accuracy_score(y_train, y_pred_train)

0.9204392505394492

In [18]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty='l2', solver='sag', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)

In [19]:
accuracy_score(y_train, y_pred_train)

0.9127053645822206

In [20]:
accuracy_score(y_test, y_pred_test)

0.9120663134506922
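
The cells above repeat the same fit-and-score steps for each solver. A loop keeps the comparison in one place; this sketch assumes synthetic data and leaves `penalty` at its default, which is L2 (the names `solver_results`, `X_tr_s`, and `X_te_s` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, scaled the same way as in the notebook.
X_demo, y_demo = make_classification(
    n_samples=800, n_features=10, n_informative=6, n_classes=3, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Fit one model per solver and record (train accuracy, test accuracy).
solver_results = {}
for solver in ["newton-cg", "lbfgs", "sag", "saga"]:
    model = LogisticRegression(solver=solver, max_iter=5000, random_state=7)
    model.fit(X_tr_s, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr_s))
    test_acc = accuracy_score(y_te, model.predict(X_te_s))
    solver_results[solver] = (train_acc, test_acc)
    print(f"{solver}: train={train_acc:.4f} test={test_acc:.4f}")
```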

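DEFAULT PENALTY

The cells below omit the `penalty` argument, so `LogisticRegression` applies its default L2 penalty rather than no penalty. This is why the newton-cg, lbfgs, and sag scores are identical to the explicit `penalty='l2'` runs.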

In [26]:

log = LogisticRegression(solver='saga', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

0.9102271027837716
0.9086480943428473


In [31]:
log = LogisticRegression(solver='saga', max_iter=10000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

0.910205738457923
0.9086480943428473


In [28]:

log = LogisticRegression(solver='newton-cg', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

0.9204392505394492
0.9175354640232439


In [29]:

log = LogisticRegression(solver='lbfgs', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

0.9204392505394492
0.9175354640232439


In [30]:
log = LogisticRegression(solver='sag', max_iter=100000)
log.fit(scaled_X_train, y_train)
y_pred_train = log.predict(scaled_X_train)
y_pred_test = log.predict(scaled_X_test)
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

0.9127053645822206
0.9120663134506922
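
Every cell in this section leaves `penalty` unset, and scikit-learn then regularizes with L2 at `C=1.0`. One way to approximate a fit without a penalty that does not depend on the library version is a very large `C` (the inverse regularization strength). Recent releases also accept `penalty=None`, but the accepted spelling has changed between versions. A sketch on synthetic data, with `C=1e6` as an arbitrary "large" value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Noisy synthetic data (flip_y adds label noise so the classes are not separable).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3,
    flip_y=0.1, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Default settings: L2 penalty with C=1.0.
default_fit = LogisticRegression(solver="lbfgs", max_iter=5000).fit(X_demo, y_demo)
# Very large C: the L2 term becomes negligible, approximating an unpenalized fit.
weak_penalty_fit = LogisticRegression(
    solver="lbfgs", C=1e6, max_iter=5000).fit(X_demo, y_demo)

# Weaker regularization lets the coefficients grow.
default_norm = np.linalg.norm(default_fit.coef_)
weak_norm = np.linalg.norm(weak_penalty_fit.coef_)
print(f"coef norm, C=1.0: {default_norm:.3f}")
print(f"coef norm, C=1e6: {weak_norm:.3f}")
```

Comparing accuracy between the two settings on the real data would show whether the default penalty affects the scores reported above.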
