<a href="https://colab.research.google.com/github/wooihaw/pai_july2024/blob/main/Part_3/handson_3/handson_3c.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hands-on 3C
#### Build three classification models to classify faulty steel plates. The dataset describes steel plates and their faults: each sample has 27 features and is labeled with one of 7 fault types.

In [1]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# Load the required libraries
import sys
from pandas import read_csv
from sklearn.model_selection import train_test_split as split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [3]:
# Load the dataset
# Determine the environment
is_colab = 'google.colab' in sys.modules

# Execute code conditionally
if is_colab:
    # Code for Google Colab environment
    df = read_csv("https://raw.githubusercontent.com/wooihaw/datasets/main/steel_faults.csv")
else:
    # Code for local Jupyter Notebook environment
    df = read_csv("steel_faults.csv")

**To do:**
- Print 5 random data samples from the dataset

In [4]:
df.sample(5)

| | X_Minimum | X_Maximum | Y_Minimum | Y_Maximum | Pixels_Areas | X_Perimeter | Y_Perimeter | Sum_of_Luminosity | Minimum_of_Luminosity | Maximum_of_Luminosity | ... | Edges_X_Index | Edges_Y_Index | Outside_Global_Index | LogOfAreas | Log_X_Index | Log_Y_Index | Orientation_Index | Luminosity_Index | SigmoidOfAreas | Fault |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1774 | 897 | 925 | 1410869 | 1410896 | 361 | 51 | 35 | 46442 | 121 | 143 | ... | 0.549 | 0.7714 | 0.0 | 2.5575 | 1.4472 | 1.4314 | -0.0357 | 0.0051 | 0.9537 | Other_Faults |
| 1494 | 243 | 254 | 205684 | 205730 | 305 | 37 | 46 | 14320 | 7 | 86 | ... | 0.2973 | 1.0 | 1.0 | 2.4843 | 1.0414 | 1.6628 | 0.7609 | -0.6332 | 0.7955 | Other_Faults |
| 966 | 251 | 260 | 961360 | 961371 | 67 | 11 | 11 | 7550 | 99 | 125 | ... | 0.8182 | 1.0 | 1.0 | 1.8261 | 0.9542 | 1.0414 | 0.1818 | -0.1196 | 0.2051 | Bumps |
| 1796 | 927 | 936 | 1971777 | 1971784 | 35 | 10 | 8 | 4277 | 117 | 127 | ... | 0.9 | 0.875 | 0.0 | 1.5441 | 0.9542 | 0.8451 | -0.2222 | -0.0453 | 0.1687 | Other_Faults |
| 1726 | 20 | 31 | 1048964 | 1048986 | 158 | 18 | 22 | 17923 | 106 | 125 | ... | 0.6111 | 1.0 | 1.0 | 2.1987 | 1.0414 | 1.3424 | 0.5 | -0.1138 | 0.4009 | Other_Faults |


**To do:**
- Separate the dataset into features (X) and targets (y)

In [5]:
X = df.drop(columns=["Fault"])
y = df["Fault"]
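
After the split, it can be worth confirming that the target column is no longer among the features and checking how the classes are distributed, since accuracy is harder to interpret when one class dominates. The sketch below runs the same two lines on a toy DataFrame (`toy`, `X_toy` and `y_toy` are hypothetical names; the four rows reuse a few values from the sample printed earlier, not the full dataset).

```python
import pandas as pd

# Toy stand-in for df: two feature columns and the target, four rows only.
toy = pd.DataFrame({
    "X_Minimum": [897, 243, 251, 927],
    "X_Maximum": [925, 254, 260, 936],
    "Fault": ["Other_Faults", "Other_Faults", "Bumps", "Other_Faults"],
})

# Same separation as in the notebook cell.
X_toy = toy.drop(columns=["Fault"])
y_toy = toy["Fault"]

# Sanity checks: feature matrix shape and relative class frequencies.
print(X_toy.shape)
print(y_toy.value_counts(normalize=True))
```

On the real data, the equivalent checks would be `X.shape` and `y.value_counts()`.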

**To do:**
- Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on this dataset using 5-fold cross-validation.

In [6]:
# Use spot-checking to quickly evaluate the performance of different models
models = {}
models["knn"] = KNeighborsClassifier()
models["dtc"] = DecisionTreeClassifier(random_state=42)
models["lgr"] = LogisticRegression()

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for n in models:
    scores = cross_val_score(models[n], X, y, cv=kf, n_jobs=-1)
    print(f"{n}: {scores.mean():.3%}, {scores.std():.3%}")

knn: 46.368%, 3.167%
dtc: 70.736%, 2.008%
lgr: 45.594%, 2.025%
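
A likely reason for the low k-NN score is that the features differ in scale by several orders of magnitude, so Euclidean distance is dominated by the largest ones. The sketch below illustrates this with two columns (`Y_Minimum` and `Edges_X_Index`) taken from the rows labeled 1774 and 1494 in the sample output; it is an illustration on two rows, not an analysis of the full dataset.

```python
import numpy as np

# Two sampled rows restricted to two columns: [Y_Minimum, Edges_X_Index]
a = np.array([1410869.0, 0.549])   # row 1774
b = np.array([205684.0, 0.2973])   # row 1494

# Per-feature contribution to the squared Euclidean distance.
sq_diff = (a - b) ** 2
share = sq_diff / sq_diff.sum()

print(f"Share from Y_Minimum:     {share[0]:.6f}")
print(f"Share from Edges_X_Index: {share[1]:.2e}")
```

Decision trees split on one feature at a time, so they are largely insensitive to this effect, which is consistent with the tree scoring well on the unscaled data.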


**To do:**
- Perform feature scaling using `StandardScaler`. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on the scaled features using 5-fold cross-validation.

In [7]:
# Standardize each feature to zero mean and unit variance
scl = StandardScaler()
Xs = scl.fit_transform(X)

for n in models:
    scores = cross_val_score(models[n], Xs, y, cv=kf, n_jobs=-1)
    print(f"{n}: {scores.mean():.3%}, {scores.std():.3%}")

knn: 73.983%, 2.111%
dtc: 70.684%, 2.033%
lgr: 71.302%, 3.089%
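
One caveat with the cell above: `scl.fit_transform(X)` is fitted on the whole dataset before cross-validation, so every test fold contributes to the scaling statistics. A common way to avoid this is to wrap the scaler and the model in a pipeline, so the scaler is refitted on the training folds only. The sketch below shows the pattern on synthetic stand-in data generated with `make_classification` (27 features, 7 classes); `X_demo`, `y_demo`, `pipe` and `kf_demo` are hypothetical names, and the scores it prints say nothing about the steel plates dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, NOT the steel plates dataset.
X_demo, y_demo = make_classification(
    n_samples=700, n_features=27, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)

# The scaler is fitted inside each training fold, then applied to the test fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo)
print(f"knn (scaled inside CV): {pipe_scores.mean():.3%}, {pipe_scores.std():.3%}")
```

With the notebook's own variables, the equivalent call would pass the pipeline together with the unscaled `X` and `y` to `cross_val_score`.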


**To do:**
- Use Principal Component Analysis (PCA) to reduce the dimensionality of the scaled features to 13. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross-validation.

In [9]:
# Project the scaled features onto the first 13 principal components
pca = PCA(n_components=13)
Xsr = pca.fit_transform(Xs)

for n in models:
    scores = cross_val_score(models[n], Xsr, y, cv=kf, n_jobs=-1)
    print(f"{n}: {scores.mean():.3%}, {scores.std():.3%}")

knn: 73.725%, 1.527%
dtc: 65.174%, 3.083%
lgr: 70.683%, 3.247%
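
To judge whether 13 components is a reasonable choice, one option is to look at how much variance the retained components explain. The sketch below computes the cumulative explained variance ratio from a fitted `PCA`; it uses synthetic stand-in data from `make_classification` (`X_demo`, `Xs_demo`, `pca_demo` and `cum_var` are hypothetical names), so the printed percentages do not describe the steel plates dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, NOT the steel plates dataset.
X_demo, _ = make_classification(
    n_samples=700, n_features=27, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=42,
)
Xs_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with 13 components and accumulate the variance each one explains.
pca_demo = PCA(n_components=13).fit(Xs_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

for k, v in enumerate(cum_var, start=1):
    print(f"{k:2d} components: {v:.1%} of variance")
```

In the notebook, the same quantity is available as `pca.explained_variance_ratio_` after the `fit_transform` call.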
