<a href="https://colab.research.google.com/github/wooihaw/ml_dl_comparison/blob/main/ml_shape_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Shape classification using machine learning
In this hands-on session, we go through the machine learning workflow (data preparation, training, testing, tuning and deployment) for a model that classifies circles, squares and triangles.

In [None]:
!pip install gdown
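# Each "!" command runs in its own subshell, so "!cd" has no lasting effect; use the %cd magic instead
%cd /content/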
!gdown 1Hr3CAZPjFUkrLGmA070BMS7J6UL7YWi_ -O three_shapes_filled.zip
!unzip /content/three_shapes_filled.zip > /dev/null && echo "A total of $(find /content/three_shapes_filled/ -type f | wc -l) files have been successfully extracted."

In [None]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np
from skimage.io import imread
from random import randrange

filelist = []
labels = []
for root, dirs, files in os.walk('three_shapes_filled/'):
    print(f'Folder: {root}, sub-folders: {dirs}, number of files: {len(files)}')
    if len(files) == 0:
        continue
    filelist.extend([os.path.join(root, f) for f in files])
    # The folder name is the class label (avoids shadowing the built-in dir())
    label = os.path.basename(root)
    labels.extend([label] * len(files))


In [None]:
# Read each image as grayscale, flatten it to a 1D array and stack the rows into one 2D array
# (building the array once avoids re-copying it on every np.append call)
images = np.stack([imread(f, as_gray=True).ravel() for f in filelist])

In [None]:
X = images
y = np.array(labels)
print(X.shape, y.shape)

In [None]:
indices = [randrange(len(X)) for i in range(15)]
fig, axes = plt.subplots(3, 5)
for ax, image, label in zip(axes.ravel(), X[indices], y[indices]):
    ax.set_axis_off()
    ax.set_title(label)
    ax.imshow(image.reshape(32, 32), cmap='gray', interpolation='nearest')

- To do:
  - Split the dataset into training and testing sets.
  - Train a kNN model (using default settings).
  - Evaluate its performance using the testing set and print the score.
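
One possible way to approach the steps above is sketched below. It is a minimal sketch, not the reference solution: the helper name `fit_knn_baseline`, the 80/20 stratified split and the random seed are assumptions, and the stand-in data (three well-separated synthetic classes named after the shapes) only exists so the block runs on its own. In the notebook, pass the real `X` and `y` instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_knn_baseline(X, y, test_size=0.2, seed=42):
    """Split (X, y) with stratification, fit a default kNN and score it on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    knn = KNeighborsClassifier().fit(X_train, y_train)  # default settings (n_neighbors=5)
    return knn, (X_train, X_test, y_train, y_test), knn.score(X_test, y_test)


# Stand-in data (assumption): 40 flattened 32x32 "images" per class, each class with its own mean
rng = np.random.default_rng(0)
demo_y = np.repeat(['circle', 'square', 'triangle'], 40)
centers = {'circle': 0.2, 'square': 0.5, 'triangle': 0.8}
demo_X = np.array([rng.normal(centers[c], 0.05, 1024) for c in demo_y])

knn, splits, score = fit_knn_baseline(demo_X, demo_y)
print(f'kNN test accuracy: {score:.3f}')
```

Stratifying the split keeps the class proportions the same in the training and testing sets, which matters when the folders do not hold equal numbers of images.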

In [None]:
from sklearn.model_selection import train_test_split as split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

- To do:
  - Construct a pipeline with three steps: scaling ('scl'), dimensionality reduction using PCA ('dr') and classification ('clf').
  - Use grid search to find the best option for each step as follows:
    - scaling: no scaling, MinMaxScaler, StandardScaler and RobustScaler.
    - dimensionality reduction with PCA: test with 50 to 150 principal components (in steps of 10).
    - classification: test with kNN, logistic regression, decision tree, random forest, gradient boosted trees, extreme gradient boosting (XGBoost), support vector machine and multilayer perceptron.
    - Use default settings for all the classifiers.
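
The pipeline and grid described above could be set up roughly as follows. This is a sketch under stated assumptions: `build_search` is a hypothetical helper, the 5-fold shuffled `StratifiedKFold` and its seed are my own choices, and `XGBClassifier` is left out of the default list because recent xgboost releases appear to expect integer-encoded labels rather than folder-name strings (encode `y` first, for example with `LabelEncoder`, before adding it).

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_search(n_components=range(50, 151, 10), classifiers=None, n_splits=5, n_jobs=None):
    """Return an unfitted GridSearchCV over the scaler, the PCA size and the classifier."""
    if classifiers is None:
        # All classifiers use their default settings
        classifiers = [
            KNeighborsClassifier(),
            LogisticRegression(),
            DecisionTreeClassifier(),
            RandomForestClassifier(),
            GradientBoostingClassifier(),
            SVC(),
            MLPClassifier(),
        ]
    # The initial steps are placeholders; the grid swaps in the candidates
    pipe = Pipeline([('scl', 'passthrough'), ('dr', PCA()), ('clf', KNeighborsClassifier())])
    param_grid = {
        'scl': ['passthrough', MinMaxScaler(), StandardScaler(), RobustScaler()],  # 'passthrough' = no scaling
        'dr__n_components': list(n_components),  # 50, 60, ..., 150 by default
        'clf': classifiers,
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return GridSearchCV(pipe, param_grid, cv=cv, n_jobs=n_jobs)
```

In the notebook this would be used as `search = build_search(n_jobs=-1)` followed by `search.fit(X_train, y_train)`. The full grid has 4 x 11 x 7 = 308 combinations, each fitted once per fold, so expect it to take a while. PCA also needs at least as many training samples and features as the largest component count in every fold.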

- To do:
  - Store the best model as 'model'.
  - Evaluate the best model using the testing set and print the score.
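
A fitted `GridSearchCV` exposes the winning pipeline as `best_estimator_`, so storing and scoring the best model might look like the sketch below. The stand-in data and the deliberately tiny grid are assumptions made only to keep the block self-contained and quick; in the notebook, `search` would be the grid search fitted on the real training set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): three well-separated synthetic classes with 64 features each
rng = np.random.default_rng(0)
y = np.repeat(['circle', 'square', 'triangle'], 40)
means = {'circle': 0.2, 'square': 0.5, 'triangle': 0.8}
X = np.array([rng.normal(means[c], 0.05, 64) for c in y])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Tiny illustrative grid, not the one from the exercise
pipe = Pipeline([('scl', StandardScaler()), ('dr', PCA()), ('clf', KNeighborsClassifier())])
search = GridSearchCV(pipe, {'dr__n_components': [5, 10]}, cv=3).fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refitted on the whole training set
model = search.best_estimator_
print('Best parameters:', search.best_params_)
print(f'Cross-validation accuracy: {search.best_score_:.3f}')
print(f'Test accuracy: {model.score(X_test, y_test):.3f}')
```

Comparing the cross-validation accuracy with the test accuracy gives a quick check on whether the selected pipeline generalizes beyond the folds it was tuned on.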

- Randomly select 15 test samples and predict their classes.
- Plot the test data and the predicted labels.

In [None]:
ind_test = [randrange(len(X_test)) for i in range(15)]
y_pred = model.predict(X_test[ind_test])

fig, axes = plt.subplots(3, 5)
for ax, image, label in zip(axes.ravel(), X_test[ind_test], y_pred):
    ax.set_axis_off()
    ax.set_title(label)
    ax.imshow(image.reshape(32, 32), cmap='gray', interpolation='nearest')

- To do:
  - Print the classification report.
  - Plot the confusion matrix.
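
Both items can be produced with `sklearn.metrics`. The sketch below computes the report and the confusion matrix and wraps the matrix in a `ConfusionMatrixDisplay`. The stand-in data and the plain kNN model are assumptions that keep the block runnable on its own; in the notebook, use the tuned `model` together with the real `X_test` and `y_test`.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and model (assumption); skip this part in the notebook
rng = np.random.default_rng(0)
y = np.repeat(['circle', 'square', 'triangle'], 40)
means = {'circle': 0.2, 'square': 0.5, 'triangle': 0.8}
X = np.array([rng.normal(means[c], 0.05, 64) for c in y])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model = KNeighborsClassifier().fit(X_train, y_train)

# Per-class precision, recall and F1 on the testing set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes, both in model.classes_ order
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
print(cm)
```

Calling `disp.plot()` in a notebook cell draws the matrix with matplotlib. Off-diagonal cells show which shapes the model confuses with each other.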