# TDT4173: Machine Learning and Case-Based Reasoning - Assignment 5
### Author: Vittorio Triassi 

For this assignment we decided to implement every task using `scikit-learn`. The main reason is that we had already used this library to implement a few models in the previous assignment, so it seemed a good idea to stick with it here as well. Several other libraries could carry out the same tasks; with a better understanding of their architectures and syntax, it would have been nice to try them too.

In [1]:
#!pip install scikit-image
import numpy as np
import os
import skimage.io
from random import randint
from sklearn import svm
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

In [2]:
def load_examples(test_size = 0.2):

    # The dataset holds 7112 images of 20 x 20 pixels, flattened to 400 features each
    examples = np.zeros(shape = (7112, 400), dtype = int)
    labels = np.zeros(shape = (7112), dtype = int)

    i = 0
    # The class index follows alphabetical order: 'a' -> 0, ..., 'z' -> 25
    for index, letter in enumerate('abcdefghijklmnopqrstuvwxyz'):
        for img_name in os.listdir(os.path.join('chars74k-lite', letter)):
            path = os.path.join('chars74k-lite', letter, img_name)
            examples[i] = np.array(skimage.io.imread(path)).flatten()
            labels[i] = index
            i += 1

    train_X, test_X, train_y, test_y = train_test_split(examples, labels, test_size = test_size, random_state = 42)
    print("Training examples: " + str(len(train_X)) + " / Test examples: " + str(len(test_X)))

    return train_X, test_X, train_y, test_y

In [3]:
train_X, test_X, train_y, test_y = load_examples(test_size = 0.2)

Training examples: 5689 / Test examples: 1423


## Feature Engineering

For the feature engineering part, we decided to standardize the features and to apply principal component analysis (PCA). We chose standardization because many machine learning estimators assume standardized input. Standardizing a dataset means removing the mean of each feature and scaling it to unit variance; in our task, we used `StandardScaler` from `sklearn.preprocessing` to center and scale each feature. Scaling is also good practice because a feature whose variance is much larger than that of the others might dominate the objective function, leaving the estimator unable to learn from the remaining features as expected.

The second technique is PCA, which uses the Singular Value Decomposition (SVD) to project high-dimensional data onto a lower-dimensional representation. The input has to be centered before the SVD is applied; `PCA` from `sklearn.decomposition`, which we used in our task, performs this centering itself.
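To make the effect of standardization concrete, here is a minimal sketch on synthetic data (the three columns and their scales are made up for illustration and are not taken from the character images). It checks that, after `StandardScaler`, every feature has a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for real features: 3 columns with very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 50.0, -5.0], scale=[1.0, 100.0, 0.01], size=(200, 3))

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# After standardization each column has mean ~0 and standard deviation ~1
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

Without this step, the second column would contribute almost all of the variance seen by a distance-based or kernel-based estimator.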

In [4]:
def standardize_features(train_X, test_X):
    
    scaler = StandardScaler()
    scaler.fit(train_X)

    train_X = scaler.transform(train_X)
    test_X = scaler.transform(test_X)

    return train_X, test_X

In [5]:
def principal_component_analysis(train_X, test_X, n_components = 1):

    pca = PCA(n_components)
    pca.fit(train_X)
    pca_train = pca.transform(train_X)
    pca_test = pca.transform(test_X) 

    return pca_train, pca_test

In [6]:
train_X, test_X = standardize_features(train_X, test_X)
pca_train, pca_test = principal_component_analysis(train_X, test_X, n_components = 1)
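With `n_components = 1`, every image is reduced to a single number. One way to judge how much information a projection keeps is the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses synthetic data (an assumption, not the character dataset) and also shows that passing a float between 0 and 1 as `n_components` asks `PCA` to keep as many components as are needed to explain that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 300 samples, 20 features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 20))

# Fraction of the total variance retained by a single component
pca_one = PCA(n_components=1).fit(X)
print("1 component keeps:", pca_one.explained_variance_ratio_.sum())

# A float in (0, 1) keeps enough components to reach that fraction of variance
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X)
print("components needed for 95%:", pca_95.n_components_)
X_reduced = pca_95.transform(X)
```

Running the same check on the standardized training set would show whether a single component is enough for the task at hand.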

## Character Classification

The models we used in the following task were a Support Vector Machine (SVM) and a Random Forest. An SVM tries to find the hyperplane that separates two classes of the data with the largest margin. SVMs often reach good accuracy compared to other techniques, but they can be hard to tune because the kernel and its hyperparameters all affect the result. A Random Forest is a classifier that fits a number of decision trees on subsets of the dataset and averages their results to improve the accuracy of the model and to control overfitting. In our code, we used `SVC` from `sklearn.svm` and `RandomForestClassifier` from `sklearn.ensemble`. We chose these models because we wanted to test less popular approaches to an interesting problem such as image classification, where CNNs are usually preferred over other techniques.

Of our two models, the SVM performed better, achieving an accuracy of around $80\%$ on the test data. The second model, the Random Forest, performed slightly worse, achieving around $70\%$. We can be quite satisfied with the first model, since we obtained good results without tuning its parameters much. As stated before, SVMs can be hard to tune; in our case, we simply used the default kernel (`rbf`) and tried a few values for `gamma`. Both models were run on the standardized features. Five predictions from each classifier are shown below. As we can see, the SVM is more precise in its predictions, while the RFC gets confused more often.
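Trying a few values for `gamma` by hand can also be automated with a cross-validated grid search. The following is only a sketch under stated assumptions: it uses scikit-learn's bundled digits dataset as a stand-in for the character images, and the candidate values for `gamma` and `C` are illustrative rather than the ones we actually tried.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset (8 x 8 digit images), not the chars74k-lite data
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__gamma': [0.001, 0.0035, 0.01], 'svc__C': [1, 10]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```

Keeping the scaler inside the pipeline means the validation folds never influence the scaling statistics, which mirrors fitting `StandardScaler` on the training split only.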

In [7]:
clf_SVM = svm.SVC(gamma=0.0035, probability=True)
clf_SVM.fit(train_X, train_y)
score_SVM = clf_SVM.score(test_X, test_y)
prediction_SVM = clf_SVM.predict(test_X)

In [8]:
print("SVM classification report:\n" + str(metrics.classification_report(test_y, prediction_SVM)))

SVM classification report:
              precision    recall  f1-score   support

           0       0.69      0.92      0.79       157
           1       0.77      0.33      0.47        30
           2       0.91      0.83      0.87        47
           3       0.80      0.54      0.64        52
           4       0.72      0.89      0.80       144
           5       0.93      0.54      0.68        24
           6       0.88      0.55      0.68        42
           7       0.79      0.66      0.72        41
           8       0.72      0.79      0.75        80
           9       0.77      0.43      0.56        23
          10       0.92      0.52      0.67        23
          11       0.90      0.83      0.86        53
          12       0.81      0.81      0.81        36
          13       0.78      0.90      0.84       105
          14       0.73      0.92      0.81        99
          15       0.97      0.76      0.85        38
          16       1.00      0.11      0.19        19


In [9]:
clf_RFC = RFC(n_estimators=300)
clf_RFC.fit(train_X, train_y)
score_RFC = clf_RFC.score(test_X, test_y)
prediction_RFC = clf_RFC.predict(test_X)

In [10]:
print("RFC classification report:\n" + str(metrics.classification_report(test_y, prediction_RFC)))

RFC classification report:
              precision    recall  f1-score   support

           0       0.61      0.85      0.71       157
           1       1.00      0.20      0.33        30
           2       0.73      0.79      0.76        47
           3       0.71      0.19      0.30        52
           4       0.64      0.88      0.74       144
           5       1.00      0.17      0.29        24
           6       1.00      0.40      0.58        42
           7       0.82      0.66      0.73        41
           8       0.65      0.78      0.71        80
           9       1.00      0.13      0.23        23
          10       0.90      0.39      0.55        23
          11       0.83      0.74      0.78        53
          12       0.78      0.69      0.74        36
          13       0.79      0.86      0.82       105
          14       0.55      0.92      0.69        99
          15       0.88      0.74      0.80        38
          16       0.00      0.00      0.00        19




In [11]:
print("Random predictions for SVM")
for i in range(5):
    # randint includes both endpoints, so the upper bound is the last valid index
    r = randint(0, len(test_y) - 1)
    print("Target: " + str(test_y[r]) + " / Prediction: " + str(clf_SVM.predict(test_X[r].reshape(1, -1))))
    
print("\nRandom predictions for RFC")
for i in range(5):
    r = randint(0, len(test_y) - 1)
    print("Target: " + str(test_y[r]) + " / Prediction: " + str(clf_RFC.predict(test_X[r].reshape(1, -1))))

Random predictions for SVM
Target: 6 / Prediction: [6]
Target: 6 / Prediction: [14]
Target: 13 / Prediction: [13]
Target: 3 / Prediction: [3]
Target: 21 / Prediction: [21]

Random predictions for RFC
Target: 13 / Prediction: [13]
Target: 25 / Prediction: [4]
Target: 24 / Prediction: [24]
Target: 21 / Prediction: [21]
Target: 0 / Prediction: [0]
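The integer labels printed above follow the alphabetical order used in `load_examples`, so 0 corresponds to `a` and 25 to `z`. A small helper can make such output easier to read; the name `label_to_letter` is hypothetical and not part of the original code.

```python
import string

def label_to_letter(label):
    """Map a class index (0-25) to its lowercase letter."""
    return string.ascii_lowercase[int(label)]

# Two of the misclassified examples from the output above
for target, predicted in [(6, 14), (25, 4)]:
    print("Target: " + label_to_letter(target) + " / Prediction: " + label_to_letter(predicted))
```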


## Character Detection

Several improvements are possible when addressing an object detection task. In our case, two different images were provided. In the first image, the characters can be detected even without specific modifications. In the second one, however, a few characters are rotated. Unless the detector is implemented so that it also considers rotated windows, it will not be able to detect characters at such angles. Another possible improvement concerns the scale of the objects we want to detect. In our detector we set a window size of $20 \times 20$ pixels, so characters at a different scale will be a problem and the detector will not be as effective as it is at that scale.

In [12]:
#!pip install opencv-python
import cv2
import matplotlib.pyplot as plt

image_1 = cv2.imread('./detection-images/detection-1.jpg')
image_2 = cv2.imread('./detection-images/detection-2.jpg')

#cv2.imshow('image', image_1)
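# Draw on a copy so that the original image is left untouched
tmp = image_1.copy()
stepSize = 1
(width, height) = (20, 20)
# NumPy images are indexed [row, column], i.e. [y, x]
for y in range(0, image_1.shape[0] - height + 1, stepSize):
    for x in range(0, image_1.shape[1] - width + 1, stepSize):
        window = image_1[y:y + height, x:x + width, :]

        # now we use the classifier previously defined (SVM)
        # and detect letters if there are any
        # (placeholder: the classification step is not implemented here)
        is_letter = False

        if is_letter:
            # draw rectangle on image
            cv2.rectangle(tmp, (x, y), (x + width, y + height), (255, 0, 0), 2)

# show the recognized letters (OpenCV loads BGR, matplotlib expects RGB)
plt.imshow(cv2.cvtColor(tmp, cv2.COLOR_BGR2RGB))
plt.show()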

<Figure size 640x480 with 1 Axes>
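The scale limitation discussed in this section could be addressed by running the same $20 \times 20$ window over progressively smaller copies of the image. The sketch below is one possible approach, not the detector we implemented: the helper names `sliding_windows` and `pyramid` are hypothetical, the blank test image is synthetic, and the downsampling simply keeps every second pixel, which is cruder than proper resizing.

```python
import numpy as np

def sliding_windows(image, size=20, step=4):
    """Yield (x, y, window) for every size x size patch of the image."""
    rows, cols = image.shape[:2]
    for y in range(0, rows - size + 1, step):
        for x in range(0, cols - size + 1, step):
            yield x, y, image[y:y + size, x:x + size]

def pyramid(image, factor=2, min_size=20):
    """Yield (scale, image) pairs, naively keeping every factor-th pixel."""
    scale = 1
    while min(image.shape[:2]) >= min_size:
        yield scale, image
        image = image[::factor, ::factor]
        scale *= factor

# Synthetic grayscale image standing in for a detection image
image = np.zeros((80, 120), dtype=np.uint8)

candidates = []
for scale, level in pyramid(image):
    for x, y, window in sliding_windows(level):
        # A classifier would be applied to `window` here; we only record
        # the box mapped back to the coordinates of the original image
        candidates.append((x * scale, y * scale, 20 * scale))

print(len(candidates))
```

Rotated characters could be handled in a similar spirit by also classifying rotated copies of each window, for example with `np.rot90` for multiples of 90 degrees, at the cost of more classifier calls per window.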