<a href="https://colab.research.google.com/github/wooihaw/pai_may2024/blob/main/Part_3/handson_3/handson_3c.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hands-on 3C
#### Build 3 classification models to classify faulty steel plates. The dataset contains information about steel plates and their faults. There are 27 features and the data samples have been classified into 7 different types of steel plate faults.

In [1]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# Load the required libraries
import sys
from pandas import read_csv
from sklearn.model_selection import train_test_split as split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [3]:
# Load the dataset
# Determine the environment
is_colab = 'google.colab' in sys.modules

# Execute code conditionally
if is_colab:
    # Code for Google Colab environment
    df = read_csv("https://raw.githubusercontent.com/wooihaw/datasets/main/steel_faults.csv")
else:
    # Code for local Jupyter Notebook environment
    df = read_csv("steel_faults.csv")

**To do:**
- Print 5 random data samples from the dataset

In [4]:
df.sample(5)

| | X_Minimum | X_Maximum | Y_Minimum | Y_Maximum | Pixels_Areas | X_Perimeter | Y_Perimeter | Sum_of_Luminosity | Minimum_of_Luminosity | Maximum_of_Luminosity | ... | Edges_X_Index | Edges_Y_Index | Outside_Global_Index | LogOfAreas | Log_X_Index | Log_Y_Index | Orientation_Index | Luminosity_Index | SigmoidOfAreas | Fault |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 961 | 143 | 151 | 3986260 | 3986268 | 51 | 10 | 8 | 5716 | 102 | 125 | ... | 0.8 | 1.0 | 0.5 | 1.7076 | 0.9031 | 0.9031 | 0.0 | -0.1244 | 0.1696 | Bumps |
| 1752 | 151 | 161 | 2368336 | 2368345 | 59 | 11 | 9 | 5637 | 86 | 103 | ... | 0.9091 | 1.0 | 0.0 | 1.7708 | 1.0 | 0.9542 | -0.1 | -0.2536 | 0.1954 | Other_Faults |
| 19 | 1601 | 1613 | 21349 | 21376 | 209 | 15 | 27 | 24807 | 96 | 141 | ... | 0.8 | 1.0 | 1.0 | 2.3201 | 1.0792 | 1.4314 | 0.5556 | -0.0727 | 0.5362 | Pastry |
| 842 | 625 | 635 | 1981388 | 1981483 | 571 | 68 | 95 | 64701 | 98 | 135 | ... | 0.1471 | 1.0 | 1.0 | 2.7566 | 1.0 | 1.9777 | 0.8947 | -0.1147 | 0.9869 | Dirtiness |
| 679 | 41 | 218 | 2698232 | 2698294 | 6327 | 275 | 148 | 660291 | 40 | 126 | ... | 0.6436 | 0.4189 | 0.0 | 3.8012 | 2.248 | 1.7924 | -0.6497 | -0.1847 | 1.0 | K_Scatch |


**To do:**
- Separate the dataset into features (X) and targets (y)

In [5]:
X = df.drop(columns=["Fault"])
y = df["Fault"]
print(X.shape, y.shape)

(1941, 27) (1941,)


**To do:**
- Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on this dataset using 5-fold cross-validation.

In [6]:
# Use spot-checking to quickly evaluate multiple ML algorithms
models = {}
models["knn"] = KNeighborsClassifier()
models["lgr"] = LogisticRegression()
models["dtc"] = DecisionTreeClassifier()

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for k in models:
    scores = cross_val_score(models[k], X, y, cv=kf, n_jobs=-1)
    print(f"{k}: {scores.mean():.3%}, {scores.std():.3%}")

knn: 46.368%, 3.167%
lgr: 45.646%, 2.028%
dtc: 71.251%, 0.813%
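
One likely reason (not verified here) that k-Nearest Neighbors and Logistic Regression trail the Decision Tree on the raw features is feature scale: in the sampled rows, `Y_Minimum` runs into the millions while index features such as `Edges_X_Index` stay between 0 and 1. The sketch below uses only the standard library and a few toy rows (hypothetical values loosely modelled on the sample output) to show how a large-scale feature dominates a Euclidean distance until the columns are standardized. The helper names `euclidean` and `standardize` are illustrative, not part of the notebook.

```python
import math
from statistics import mean, pstdev

# Toy rows (hypothetical values): [large-scale feature, small-scale feature]
rows = [
    [3986260.0, 0.80],
    [2368336.0, 0.91],
    [21349.0, 0.15],
]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(data):
    """Scale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*data))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in data]

raw_dist = euclidean(rows[0], rows[1])
scaled = standardize(rows)
scaled_dist = euclidean(scaled[0], scaled[1])

# Share of the squared distance contributed by the small-scale feature
raw_share = (rows[0][1] - rows[1][1]) ** 2 / raw_dist ** 2
scaled_share = (scaled[0][1] - scaled[1][1]) ** 2 / scaled_dist ** 2
print(f"raw share: {raw_share:.2e}, scaled share: {scaled_share:.2%}")
```

On the raw rows the small-scale feature contributes essentially nothing to the distance. After standardization it contributes a visible share. Decision trees split on one feature at a time, so they are largely insensitive to this effect.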


**To do:**
- Perform feature scaling using the standard scaler. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on the scaled features using 5-fold cross-validation.

In [7]:
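# Standardize every feature to zero mean and unit variance
# (the scaler is fit on the full dataset, before cross-validation)
scl = StandardScaler()
Xs = scl.fit_transform(X)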

print("After feature scaling")
for k in models:
    scores = cross_val_score(models[k], Xs, y, cv=kf, n_jobs=-1)
    print(f"{k}: {scores.mean():.3%}, {scores.std():.3%}")

After feature scaling
knn: 73.983%, 2.111%
lgr: 71.302%, 3.089%
dtc: 69.911%, 1.592%
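
Fitting the scaler on the full dataset before cross-validation lets each validation fold influence the scaling statistics. A common alternative is to wrap the scaler and the model in a pipeline so the scaler is refit on the training folds only. The sketch below shows the pattern on synthetic data from `make_classification` (an assumption made so the example runs without the CSV file). The names `X_demo`, `y_demo`, `kf_demo` and `pipe` are illustrative, and the scores it prints are unrelated to the steel plates results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the steel plates data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is fit inside each training fold, then applied to the held-out fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo)
print(f"knn (pipeline): {pipe_scores.mean():.3%}, {pipe_scores.std():.3%}")
```

In the notebook, the same `pipe` could be passed to `cross_val_score` together with `X`, `y` and `kf` in place of the synthetic data.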


**To do:**
- Use Principal Component Analysis (PCA) to reduce the dimensionality of the scaled features to 13. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross-validation.

In [8]:
# Project the scaled features onto the first 13 principal components
pca = PCA(n_components=13)
Xsr = pca.fit_transform(Xs)

print("After dimensionality reduction")
for k in models:
    scores = cross_val_score(models[k], Xsr, y, cv=kf, n_jobs=-1)
    print(f"{k}: {scores.mean():.3%}, {scores.std():.3%}")

After dimensionality reduction
knn: 73.725%, 1.527%
lgr: 70.632%, 3.342%
dtc: 64.556%, 3.702%
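
The exercise fixes the number of components at 13. One way to sanity-check such a choice is to look at the cumulative explained variance of a full PCA fit. The sketch below does this on synthetic 27-feature data from `make_classification` (an assumption, so the example runs without the CSV file). The 95% threshold and the names `X_demo`, `Xs_demo`, `pca_full`, `cum_var` and `n_95` are illustrative choices, not values taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of features as the dataset (assumption)
X_demo, _ = make_classification(
    n_samples=300, n_features=27, n_informative=10, random_state=0
)
Xs_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(Xs_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(f"components needed for 95% variance: {n_95}")
```

In the notebook, replacing `Xs_demo` with `Xs` would show how much variance the 13 retained components capture via `cum_var[12]`.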
