# **Classification with Machine Learning**

In this lesson, we learn how to solve a classification problem with Machine Learning classifiers, *i.e.* models that are able to automatically learn how to solve a problem.

**It is absolutely recommended to read the documentation relating to the functions and methods used!**
Usually, it is enough to search on Google for the name of the function (and possibly the name of the library used).

Import some libraries
In particular, `sklearn` is the library for the Machine Learning stuff!

In [None]:
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import math

### Functions and Classes
This is the class that we'll use to handle the coordinates of the dataset. We assume that we work only with 2D $(x,y)$ coordinates.

In [None]:
class Point:
    x = None
    y = None

`get_labels()` is a function that receives a name (`string`) and returns the class (`int`), following this:

*   Triangle: 0
*   Square: 1
*   Rectangle: 2
*   Rhombus: 3

Example: 0_triangle.png → 0

In [None]:
def get_labels(name):
    if 'triangle' in name:
        return 0
    elif 'square' in name:
        return 1
    elif 'rectangle' in name:
        return 2
    elif 'rhombus' in name:
        return 3
    else:
        raise NotImplementedError('Not existing class!')

`prepare_data()` is a function that prepares the data for the computation.
Specifically, it returns two lists: `coordinates` and `labels`.
In this exercise, we exclude `triangles` from the classes for simplicity.

In [None]:
def prepare_data(lines):
    labels = []
    coordinates = []

    for line in lines:
        content = line.split()

        # let's exclude triangles
        if 'triangle' not in content[0]:
            # create label
            labels.append(get_labels(content[0]))

            # coordinates
            coordinates.append([float(x) for x in content[1:]])

    return coordinates, labels
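
As a quick check of the two functions above, the sketch below runs them on three made-up lines. The line layout (a file name followed by whitespace-separated vertex coordinates) and the sample values are assumptions for illustration; the real `shapes.txt` may differ. The labels follow the mapping implemented in the `get_labels()` code.

```python
def get_labels(name):
    # same mapping as the get_labels() code cell
    if 'triangle' in name:
        return 0
    elif 'square' in name:
        return 1
    elif 'rectangle' in name:
        return 2
    elif 'rhombus' in name:
        return 3
    else:
        raise NotImplementedError('Not existing class!')


def prepare_data(lines):
    labels = []
    coordinates = []
    for line in lines:
        content = line.split()
        # triangles are skipped, as in the lesson
        if 'triangle' not in content[0]:
            labels.append(get_labels(content[0]))
            coordinates.append([float(x) for x in content[1:]])
    return coordinates, labels


# hypothetical sample lines (assumed format: name x1 y1 x2 y2 ...)
sample_lines = [
    '0_triangle.png 10 10 50 10 30 40',
    '1_square.png 10 10 50 10 50 50 10 50',
    '2_rectangle.png 10 10 90 10 90 50 10 50',
]

coordinates, labels = prepare_data(sample_lines)
print('Labels: {}'.format(labels))
print('First sample: {}'.format(coordinates[0]))
```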

### Body of the solution
Upload the file `shapes.txt`.
Open the dataset file `shapes.txt` and read the content

In [None]:
dataset_file_path = 'shapes.txt'
with open(dataset_file_path, 'r') as f:
    lines = f.readlines()
    print('Read {} lines'.format(len(lines)))

We **shuffle** the data to change the initial order.
This is important: if the file happens to be ordered by class, an unshuffled split could leave some classes out of the train or validation set.

**Tools**:
-    `np.random.shuffle()`: modify a sequence in-place by shuffling its contents.

In [None]:
print('Before shuffling: {}'.format(lines[:10]))
np.random.shuffle(lines)
print('After shuffling: {}'.format(lines[:10]))
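
Shuffling gives a different order at every run, which makes results hard to compare between runs. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for the exercise: it uses NumPy's `default_rng()` generator instead of `np.random.shuffle()`, and the placeholder `lines` list and the seed value `42` are illustrative choices.

```python
import numpy as np

# placeholder data standing in for the lines read from the dataset file
lines = ['sample_{}'.format(i) for i in range(10)]

# first shuffle with a fixed seed
rng = np.random.default_rng(seed=42)
shuffled_a = list(lines)
rng.shuffle(shuffled_a)

# second shuffle with the same seed: the order is identical
rng = np.random.default_rng(seed=42)
shuffled_b = list(lines)
rng.shuffle(shuffled_b)

print('Same order in both runs: {}'.format(shuffled_a == shuffled_b))
```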

Unlike the previous exercise, **in this case it is essential to have training, validation and test sets**.
Training data are used to train the model, while the validation split is used to assess performance.

Here, we use validation and test set as synonyms, since we do not have a real test set.

We put **20% of data in training, 20% in validation**, and the remaining **60% in the test set**.

In [None]:
trainset = lines[:int(0.2*len(lines))]
valset = lines[int(0.2*len(lines)):int(0.4*len(lines))]
testset = lines[int(0.4*len(lines)):]
print('Total: {} split in Train: {}, Val: {} and Test: {}'.format(len(lines), len(trainset), len(valset), len(testset)))

There is also another way to create the train/val/test splits.

**Tools**:

*   `train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)`: splits arrays or matrices into random train and test subsets. It is also possible to shuffle data.



In [None]:
# train_test_split must be imported first:
# from sklearn.model_selection import train_test_split
# to apply this method we need two different lists: X (data) and y (labels)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
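
Since `train_test_split()` produces only two subsets per call, one way to obtain three sets is to call it twice. The sketch below reproduces the 20% / 20% / 60% proportions used in this lesson on synthetic stand-in data; the arrays `X` and `y`, the seed and the use of `stratify` (which keeps the class proportions in every split) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data: 100 samples, 2 features, labels in {1, 2, 3}
X = np.arange(200).reshape(100, 2)
y = np.array([1, 2, 3, 1] * 25)

# first call: 20% for training, 80% left over
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.2, random_state=42, stratify=y)

# second call: 25% of the remaining 80% (i.e. 20% of the total) for validation
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.25, random_state=42, stratify=y_rest)

print('Train: {}, Val: {} and Test: {}'.format(len(X_train), len(X_val), len(X_test)))
```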

From this moment, we will have three sets: train, validation and test set.

A single datapoint belongs to only one of them: **these three sets are completely disjoint**.

It is important to keep them separated!

In [None]:
train_x, train_y = prepare_data(trainset)
val_x, val_y = prepare_data(valset)
test_x, test_y = prepare_data(testset)
print('Train: {}, Val: {} and Test: {}'.format(len(train_x), len(val_x), len(test_x)))
print('Total: {}'.format(len(train_x) + len(val_x) + len(test_x)))

### Classifier
Here, we define what classifier we are going to use to solve our classification problem. Let's use the SVM implementation of the `sklearn` library.


In [None]:
clf = svm.SVC(gamma=0.001, C=100., kernel='rbf', verbose=False, probability=False)
# alternative classifiers (import them from sklearn.ensemble / sklearn.tree first):
# clf = RandomForestClassifier()
# clf = AdaBoostClassifier()
# clf = DecisionTreeClassifier()
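
The commented-out alternatives need their own imports. The sketch below shows where they live in `sklearn` and that all of them share the same `fit()` / `score()` interface; the toy data, the dictionary `classifiers` and the `random_state` values are illustrative assumptions, not part of the lesson's dataset.

```python
from sklearn import svm
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# toy stand-in data (hypothetical): two well-separated classes in 2D
toy_x = [[0, 0], [1, 1], [0, 1], [10, 10], [11, 11], [10, 11]]
toy_y = [1, 1, 1, 2, 2, 2]

classifiers = {
    'SVM': svm.SVC(gamma=0.001, C=100., kernel='rbf'),
    'RandomForest': RandomForestClassifier(random_state=0),
    'AdaBoost': AdaBoostClassifier(random_state=0),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
}

# every classifier is trained and evaluated with the same two calls
scores = {}
for name, model in classifiers.items():
    model.fit(toy_x, toy_y)
    scores[name] = model.score(toy_x, toy_y)
    print('{}: training accuracy {:.3f}'.format(name, scores[name]))
```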

### Training
Now we are ready for the training!
With the `sklearn` library it is tremendously simple: we just need the training data (`train_x` and the related labels `train_y`) and pass them to the classifier.

**Tools**:
-   `model.fit()`: fit the provided model with training data.

In [None]:
clf.fit(train_x, train_y)

### Validation

It's time to validate the trained model, in order to find proper hyperparameters.

**Tools:**
*   `score()`: returns the mean accuracy of the model on the given data and labels.



In [None]:
print('Validation accuracy: {:.3f}'.format(clf.score(val_x, val_y)))
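
Finding proper hyperparameters usually means training several models and keeping the one with the best validation accuracy. A minimal sketch of such a search over `C` and `gamma`, assuming synthetic stand-in data in place of the real splits; the helper `make_split()` and the candidate values are hypothetical.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)


def make_split(n):
    # hypothetical helper: n samples with 8 features and labels in {1, 2, 3}
    y = rng.integers(1, 4, size=n)
    x = rng.normal(loc=y[:, None] * 50.0, scale=10.0, size=(n, 8))
    return x, y


train_x, train_y = make_split(60)
val_x, val_y = make_split(60)

candidate_C = [1., 10., 100.]
candidate_gamma = [0.0001, 0.001, 0.01]

best = None  # (validation accuracy, C, gamma)
for C in candidate_C:
    for gamma in candidate_gamma:
        clf = svm.SVC(gamma=gamma, C=C, kernel='rbf')
        clf.fit(train_x, train_y)          # train on the training set only
        acc = clf.score(val_x, val_y)      # compare on the validation set
        if best is None or acc > best[0]:
            best = (acc, C, gamma)

print('Best validation accuracy: {:.3f} (C={}, gamma={})'.format(*best))
```

The test set is not touched during this search, so it can still give an unbiased estimate afterwards.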

### Test
Now we are ready to use our classifier! The trained classifier outputs the labels (as defined above) for the classification task.

**Tools**:
  - `model.predict()`: predict the class of the given data.

In [None]:
pred_y = clf.predict(test_x)
print('Predicted {} samples: {}'.format(len(pred_y), pred_y))
print('GT {} samples: {}'.format(len(test_y), test_y))

It's time to understand how good the trained classifier is.

**Tools**:
   * `accuracy_score()`: accuracy classification score, *i.e.* the fraction of samples whose predicted label matches the ground-truth (GT) label.

In [None]:
print('Final Accuracy: {:.3f}'.format(accuracy_score(test_y, pred_y)))
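
To see what `accuracy_score()` computes, the sketch below compares it with a manual count on a small made-up example and adds a confusion matrix, which shows which classes get mixed up. The label lists are invented for illustration and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up ground-truth and predicted labels
test_y = [1, 2, 3, 1, 2, 3]
pred_y = [1, 2, 3, 2, 2, 1]

# accuracy by hand: fraction of positions where prediction and GT agree
manual_acc = np.mean(np.array(test_y) == np.array(pred_y))
acc = accuracy_score(test_y, pred_y)
print('Manual: {:.3f}, accuracy_score: {:.3f}'.format(manual_acc, acc))

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(test_y, pred_y, labels=[1, 2, 3])
print(cm)
```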

Presumably, you have obtained a lower performance than in the previous exercise (based on PR), but remember that:
- Now the model has **automatically learned** how to solve the classification problem;
- The classification problem is quite simple, since we know how to classify geometric shapes. Then, we have a good level of a-priori knowledge.

### Exercise/Homework

1) Try to obtain the highest accuracy in classification!
You can use:
- Different **classifiers**:
  *   Tree classifiers, Random Forest, AdaBoostClassifier, ...
  *   You can also install other packages (for instance, `xgboost`)

You can find a comparison of several classifiers available in the scikit-learn library here: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html.
Remember to import the classifiers from the sklearn package!

- Different **data** in input (you can provide not only the raw coordinates of the shapes, but also other values like diagonals and so on).
- Different **normalization** of data.
- Different **data splits** (you can vary the amount of samples in train, val and test sets).
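
As an illustration of the "different data in input" idea, the sketch below appends side lengths and diagonals to the raw coordinates. It assumes that each shape is given as four vertices listed in order around the shape (`x1 y1 x2 y2 x3 y3 x4 y4`); this layout and the function name `add_geometric_features()` are assumptions, so check them against the actual dataset.

```python
import numpy as np


def add_geometric_features(coords):
    # hypothetical helper: coords is a flat list of 8 values (4 ordered vertices)
    pts = np.asarray(coords, dtype=float).reshape(4, 2)
    # length of each side: vertex i to vertex i+1 (wrapping around)
    sides = [float(np.linalg.norm(pts[i] - pts[(i + 1) % 4])) for i in range(4)]
    # the two diagonals connect opposite vertices
    diagonals = [float(np.linalg.norm(pts[0] - pts[2])),
                 float(np.linalg.norm(pts[1] - pts[3]))]
    return list(coords) + sides + diagonals


square = [0, 0, 10, 0, 10, 10, 0, 10]
features = add_geometric_features(square)
print('Number of features: {}'.format(len(features)))
print('Sides: {}'.format(features[8:12]))
print('Diagonals: {}'.format(features[12:]))
```

A function like this could be called inside `prepare_data()` before appending to `coordinates`.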

**Data Normalization**
The purpose of normalization is to bring the data onto a common scale. For instance, standardization rescales the data to zero mean and unit standard deviation (here `coordinates` is assumed to be a NumPy array):

> `coordinates = (coordinates - np.mean(coordinates)) / np.std(coordinates)`

while min-max scaling maps the data into the range [0, 1]:

> `coordinates = (coordinates - np.min(coordinates)) / (np.max(coordinates)-np.min(coordinates))`

In our case, all coordinates in the Euclid dataset are already in the range [0, 224], so normalization is not strictly needed (in some cases it actually decreases the final accuracy, since normalization compresses data within a certain range, reducing the variance).
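
If you want to try normalization anyway, `sklearn.preprocessing` offers ready-made scalers. The sketch below is one possible approach, not the only one: unlike the formulas above, these scalers work per feature (column), and they are fitted on the training data only and then applied unchanged to the other splits. The small arrays are made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# made-up data: 3 training samples and 1 validation sample with 2 features
train_x = np.array([[0., 224.], [112., 0.], [224., 112.]])
val_x = np.array([[56., 56.]])

# min-max scaling: each training column is mapped to [0, 1]
minmax = MinMaxScaler().fit(train_x)
train_mm = minmax.transform(train_x)
val_mm = minmax.transform(val_x)   # reuses the min/max learned on train
print(val_mm)

# standardization: each training column gets zero mean and unit std
standard = StandardScaler().fit(train_x)
train_std = standard.transform(train_x)
print(train_std.mean(axis=0), train_std.std(axis=0))
```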

**NB** In order to obtain comparable results, do not shuffle the dataset again. Only modify the `prepare_data()` function, and/or define a new classifier, and then run a new `fit()` and `score()` procedure.

2) Obtain prediction probabilities.

**Tools:**
*   `predict_proba()`: returns the class probabilities for each data point (model must have the parameter `probability` set to `True`!)
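
A minimal sketch of `predict_proba()` with an SVM, assuming synthetic stand-in data in place of the shapes dataset. The columns of the returned array follow the order of `clf.classes_`, and each row sums to 1. Note that for `SVC` the probabilities are estimated with an internal cross-validation, so training takes longer.

```python
import numpy as np
from sklearn import svm

# synthetic stand-in data: 20 samples for each of the classes 1, 2, 3
rng = np.random.default_rng(0)
y = np.repeat([1, 2, 3], 20)
x = rng.normal(loc=y[:, None] * 50.0, scale=5.0, size=(60, 2))

# probability=True is required to enable predict_proba()
clf = svm.SVC(gamma=0.001, C=100., kernel='rbf', probability=True, random_state=0)
clf.fit(x, y)

proba = clf.predict_proba(x[:5])
print('Classes (column order): {}'.format(clf.classes_))
print(proba)
```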

3) Include also **triangles** in the classification problem.

