## Hands-on 3C 
#### Build 3 classification models to classify faulty steel plates. The dataset describes steel plates and their faults: it has 27 features, and each sample is labeled with one of 7 fault types.

In [1]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# Load the required libraries
from pandas import read_csv
from sklearn.model_selection import train_test_split as split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

To do:
- Load the dataset from "steel_faults.csv"

In [3]:
df = read_csv("steel_faults.csv")

To do: 
- Print 5 random data samples from the dataset

In [4]:
df.sample(5)

|  | X_Minimum | X_Maximum | Y_Minimum | Y_Maximum | Pixels_Areas | X_Perimeter | Y_Perimeter | Sum_of_Luminosity | Minimum_of_Luminosity | Maximum_of_Luminosity | ... | Edges_X_Index | Edges_Y_Index | Outside_Global_Index | LogOfAreas | Log_X_Index | Log_Y_Index | Orientation_Index | Luminosity_Index | SigmoidOfAreas | Fault |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1817 | 1170 | 1176 | 745366 | 745380 | 68 | 15 | 14 | 7859 | 106 | 127 | ... | 0.4 | 1.0 | 1.0 | 1.8325 | 0.7782 | 1.1461 | 0.5714 | -0.0971 | 0.1892 | Other_Faults |
| 672 | 39 | 210 | 2449833 | 2449893 | 5816 | 273 | 132 | 618320 | 51 | 127 | ... | 0.6264 | 0.4545 | 0.0 | 3.7646 | 2.233 | 1.7781 | -0.6491 | -0.1694 | 1.0 | K_Scatch |
| 690 | 41 | 218 | 2840169 | 2840236 | 6503 | 264 | 135 | 677593 | 40 | 124 | ... | 0.6704 | 0.4963 | 0.0 | 3.8131 | 2.248 | 1.8261 | -0.6215 | -0.186 | 1.0 | K_Scatch |
| 1363 | 867 | 1104 | 949655 | 949669 | 1695 | 247 | 106 | 197365 | 103 | 132 | ... | 0.9595 | 0.1321 | 0.0 | 3.2292 | 2.3747 | 1.1461 | -0.9409 | -0.0903 | 1.0 | Other_Faults |
| 1791 | 209 | 259 | 9649727 | 9649771 | 1182 | 87 | 71 | 130201 | 96 | 127 | ... | 0.5747 | 0.6197 | 0.0 | 3.0726 | 1.699 | 1.6435 | -0.12 | -0.1394 | 1.0 | Other_Faults |


To do:
- Separate the dataset into features (X) and targets (y)

In [5]:
X = df.drop(columns=['Fault'])
y = df['Fault']
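
Before comparing classifiers, it can help to check how the samples are spread across the fault classes, because accuracy is harder to interpret when a few classes dominate. The sketch below uses a small hypothetical label series (`y_demo`, not taken from `steel_faults.csv`); on the real data the same call would be applied to `y`.

```python
import pandas as pd

# Hypothetical toy labels standing in for the real 'Fault' column.
y_demo = pd.Series(
    ["Other_Faults", "K_Scatch", "K_Scatch", "Other_Faults", "Other_Faults"],
    name="Fault",
)

# Fraction of samples per class, largest class first.
class_share = y_demo.value_counts(normalize=True)
print(class_share)
```

If the real class shares turn out to be very uneven, a stratified splitter such as `StratifiedKFold` may be worth considering in place of `KFold`.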

To do: 
- Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on this dataset using 5-fold cross validation.

In [6]:
# Use spot-checking to quickly evaluate the performance of 3 classifiers
models = {}
models['knn'] = KNeighborsClassifier()
models['dtc'] = DecisionTreeClassifier(random_state=42)
models['lgr'] = LogisticRegression()

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for m in models:
    scores = cross_val_score(models[m], X, y, cv=kf, n_jobs=-1)
    print(f"{m}: {scores.mean():.3%}, {scores.std():.3%}")

knn: 46.368%, 3.167%
dtc: 70.736%, 2.008%
lgr: 45.646%, 2.028%
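
One plausible reason k-Nearest Neighbors and Logistic Regression trail the decision tree here is the very different scales of the features: coordinates such as `Y_Minimum` run into the millions, while the index features lie roughly between -1 and 1. The sketch below illustrates the effect on Euclidean distance using hypothetical toy values (not rows from the dataset); `first_feature_share` is a helper introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy samples (not rows from steel_faults.csv):
# column 0 mimics a large pixel coordinate, column 1 a 0-to-1 index.
samples = np.array([
    [745366.0, 0.40],
    [745380.0, 0.95],
    [949655.0, 0.41],
])

def first_feature_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 0."""
    squared_diff = (a - b) ** 2
    return float(squared_diff[0] / squared_diff.sum())

# On raw values the large-scale column dominates the distance.
raw_share = first_feature_share(samples[0], samples[1])

# After standardization each column has zero mean and unit variance.
scaled = StandardScaler().fit_transform(samples)
scaled_share = first_feature_share(scaled[0], scaled[1])

print(f"Share of distance from column 0, raw:    {raw_share:.4f}")
print(f"Share of distance from column 0, scaled: {scaled_share:.4f}")
```

A decision tree splits on one feature at a time using thresholds, so it is largely unaffected by differences in feature scale.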


To do: 
- Perform feature scaling using StandardScaler. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on the scaled features using 5-fold cross validation.

In [7]:
scl = StandardScaler()
Xs = scl.fit_transform(X)

print("After feature scaling")
for m in models:
    scores = cross_val_score(models[m], Xs, y, cv=kf, n_jobs=-1)
    print(f"{m}: {scores.mean():.3%}, {scores.std():.3%}")

After feature scaling
knn: 73.983%, 2.111%
dtc: 70.684%, 2.033%
lgr: 71.302%, 3.089%
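
Note that the cell above fits the scaler on all samples before cross-validation, so every validation fold has already influenced the scaling statistics. A common way to avoid this is to wrap the scaler and the classifier in a pipeline, so the scaler is refit on the training part of each fold. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the steel plate data, and `max_iter=1000` is an illustrative choice that is not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the steel plate data (27 features, 7 classes).
X_demo, y_demo = make_classification(
    n_samples=350, n_features=27, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)

# Each pipeline refits StandardScaler on the training folds only.
pipelines = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "lgr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = {}
for name, pipe in pipelines.items():
    pipeline_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo)
    print(f"{name}: {pipeline_scores[name].mean():.3%}, {pipeline_scores[name].std():.3%}")
```

On the real data, the same pipelines could be passed to `cross_val_score` together with the unscaled `X` and `y`.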


To do: 
- Use Principal Component Analysis (PCA) to reduce the dimensionality of the scaled features to 13 components. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross validation.

In [10]:
pca = PCA(n_components=13)
Xsr = pca.fit_transform(Xs)

print("After dimensionality reduction")
for m in models:
    scores = cross_val_score(models[m], Xsr, y, cv=kf, n_jobs=-1)
    print(f"{m}: {scores.mean():.3%}, {scores.std():.3%}")

After dimensionality reduction
knn: 73.725%, 1.527%
dtc: 65.174%, 3.083%
lgr: 70.632%, 3.342%
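
To judge whether 13 components is a reasonable choice, one option is to look at how much of the total variance the retained components explain. The sketch below does this on synthetic data from `make_classification` (a stand-in, since the real values are not reproduced here); on the notebook's data the same attribute would be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of features (27) as the dataset.
X_demo, _ = make_classification(
    n_samples=300, n_features=27, n_informative=10, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA and accumulate the per-component explained variance ratios.
pca_demo = PCA(n_components=13).fit(X_scaled)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print(f"Variance retained by 13 components: {cumulative[-1]:.1%}")
```

If the retained share is low, that could be one explanation for a drop in accuracy after dimensionality reduction, although other factors may also be involved: for example, principal components are rotated combinations of the original features, which can make axis-aligned splits in a decision tree less effective.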
