## Hands-on 3C
#### Build three classification models to classify faulty steel plates. The dataset describes steel plates and their faults: each sample has 27 features and is labeled with one of 7 fault types.

In [7]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [8]:
# Load the required libraries
import sys
from pandas import read_csv
from sklearn.model_selection import train_test_split as split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [9]:
# Load the dataset
# Determine the environment
is_colab = 'google.colab' in sys.modules

# Execute code conditionally
if is_colab:
    # Code for Google Colab environment
    df = read_csv("https://raw.githubusercontent.com/wooihaw/datasets/main/steel_faults.csv")
else:
    # Code for local Jupyter Notebook environment
    df = read_csv("steel_faults.csv")

**To do:**
- Print 5 random data samples from the dataset

In [10]:
df.sample(5)

| | X_Minimum | X_Maximum | Y_Minimum | Y_Maximum | Pixels_Areas | X_Perimeter | Y_Perimeter | Sum_of_Luminosity | Minimum_of_Luminosity | Maximum_of_Luminosity | ... | Edges_X_Index | Edges_Y_Index | Outside_Global_Index | LogOfAreas | Log_X_Index | Log_Y_Index | Orientation_Index | Luminosity_Index | SigmoidOfAreas | Fault |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 255 | 357 | 370 | 1483879 | 1483901 | 130 | 26 | 25 | 13185 | 90 | 117 | ... | 0.5 | 0.88 | 1.0 | 2.1139 | 1.1139 | 1.3424 | 0.4091 | -0.2076 | 0.4729 | Z_Scratch |
| 1907 | 122 | 146 | 203252 | 203261 | 95 | 33 | 22 | 12566 | 124 | 143 | ... | 0.7273 | 0.4091 | 0.0 | 1.9777 | 1.3802 | 0.9542 | -0.625 | 0.0334 | 0.3601 | Other_Faults |
| 1580 | 0 | 14 | 990037 | 990193 | 1183 | 80 | 162 | 91488 | 57 | 101 | ... | 0.175 | 0.963 | 1.0 | 3.073 | 1.1461 | 2.1931 | 0.9103 | -0.3958 | 1.0 | Other_Faults |
| 1533 | 1641 | 1666 | 1606625 | 1606671 | 652 | 49 | 48 | 56018 | 73 | 108 | ... | 0.5102 | 0.9583 | 1.0 | 2.8142 | 1.3979 | 1.6628 | 0.4565 | -0.3288 | 0.9965 | Other_Faults |
| 349 | 853 | 861 | 31984 | 31995 | 66 | 9 | 11 | 12872 | 178 | 207 | ... | 0.8889 | 1.0 | 1.0 | 1.8195 | 0.9031 | 1.0414 | 0.2727 | 0.5237 | 0.1934 | K_Scatch |


**To do:**
- Separate the dataset into features (X) and targets (y)

In [12]:
X = df.drop(columns=['Fault'])
y = df['Fault']
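
After the split, it can help to confirm that the label column is gone from the features and to look at how the classes are distributed. The sketch below uses a tiny hand-made DataFrame as a stand-in (the values and the names `toy`, `X_toy`, and `y_toy` are illustrative assumptions, not rows from the real dataset); the same calls should apply to `X` and `y` above.

```python
import pandas as pd

# Hypothetical miniature table, only to illustrate the checks (not real data)
toy = pd.DataFrame({
    "X_Minimum": [10, 20, 30, 40, 50],
    "Pixels_Areas": [100, 200, 300, 400, 500],
    "Fault": ["Z_Scratch", "Other_Faults", "Other_Faults", "Other_Faults", "K_Scatch"],
})

# Same split as in the cell above: everything except the label is a feature
X_toy = toy.drop(columns=["Fault"])
y_toy = toy["Fault"]

print(X_toy.shape)           # (rows, number of feature columns)
print(y_toy.value_counts())  # samples per fault type, most frequent first
```

A strongly uneven `value_counts()` result would suggest reading plain accuracy with some care, since a model can score well by favoring the majority class.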

**To do:**
- Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on this dataset using 5-fold cross-validation.

In [13]:
# Use spot-checking to quickly compare different machine learning algorithms
models = {}
models['knn'] = KNeighborsClassifier()
models['lgr'] = LogisticRegression()
models['dtc'] = DecisionTreeClassifier(random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for m in models:
  scores = cross_val_score(models[m], X, y, cv=kf, n_jobs=-1)
  print(f'{m}, {scores.mean():.3%}')

knn, 46.368%
lgr, 45.646%
dtc, 70.736%
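
The cell above reports only the mean accuracy and uses `KFold`, which shuffles samples without regard to class. As a possible variation, the sketch below uses `StratifiedKFold` so that each fold keeps roughly the same class proportions, and prints the spread across folds alongside the mean. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), so its numbers say nothing about the steel-plates dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 27 features and 7 classes, mirroring the dataset's shape only
X_demo, y_demo = make_classification(
    n_samples=350, n_features=27, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)

# Stratified folds preserve the class proportions in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=skf
)

# One score per fold; the standard deviation shows how stable the estimate is
print(f"dtc, {demo_scores.mean():.3%} +/- {demo_scores.std():.3%}")
```

When two models differ by less than the fold-to-fold spread, the ranking between them is probably not reliable.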


**To do:**
- Perform feature scaling using `StandardScaler`. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on the scaled features using 5-fold cross-validation.

In [14]:
scl = StandardScaler()
Xs = scl.fit_transform(X)
print("Accuracy with feature scaling")
for m in models:
  scores = cross_val_score(models[m], Xs, y, cv=kf, n_jobs=-1)
  print(f'{m}, {scores.mean():.3%}')

Accuracy with feature scaling
knn, 73.983%
lgr, 71.302%
dtc, 70.684%
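
In the cell above, the scaler is fitted on all samples before cross-validation, so the validation folds contribute to the scaling statistics. A common way to avoid this is to wrap the scaler and the model in a pipeline, so the scaler is refitted on the training part of each fold. The following is a minimal sketch on synthetic data (the data and the names `knn_pipe` and `pipe_scores` are assumptions for illustration); with the real data, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the steel-plates dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)
# Inflate one column to mimic features measured on very different scales
X_demo[:, 0] *= 1e6

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is fitted inside each training fold, never on the validation fold
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe_scores = cross_val_score(knn_pipe, X_demo, y_demo, cv=kf_demo)

print(f"knn (scaled inside CV), {pipe_scores.mean():.3%}")
```

Distance-based models such as k-Nearest Neighbors are sensitive to feature scale, which is consistent with the jump in kNN accuracy reported above, while the decision tree is essentially unaffected.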


**To do:**
- Use Principal Component Analysis (PCA) to reduce the dimensionality of the scaled features to 13. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross-validation.

In [15]:
pca = PCA(n_components=13)
Xsr = pca.fit_transform(Xs)
print("Accuracy with feature scaling and dimensionality reduction")
for m in models:
  scores = cross_val_score(models[m], Xsr, y, cv=kf, n_jobs=-1)
  print(f'{m}, {scores.mean():.3%}')

Accuracy with feature scaling and dimensionality reduction
knn, 73.725%
lgr, 70.632%
dtc, 65.174%
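
The exercise fixes the number of components at 13. One way to judge how much information such a reduction keeps is the cumulative explained variance ratio of the fitted PCA. The sketch below shows the idea on synthetic data with 27 features (an assumption made to keep the example self-contained; `pca_demo` and `cum_var` are illustrative names), so the printed percentages do not describe the steel-plates data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 27 features (not the steel-plates dataset)
X_demo, _ = make_classification(
    n_samples=300, n_features=27, n_informative=10, random_state=42
)

# PCA is scale-sensitive, so standardize first, as in the cells above
Xs_demo = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=13).fit(Xs_demo)

# Running total of the variance fraction captured by the first k components
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)
for k, v in enumerate(cum_var, start=1):
    print(f"{k:2d} components: {v:.1%} of variance")
```

Applied to the real data, the same inspection of `pca.explained_variance_ratio_` could help explain why the decision tree loses accuracy after the reduction while kNN and logistic regression change little.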
