# **Classification with Machine Learning**

In this lesson, we learn how to solve a classification problem with an ensemble of Machine Learning classifiers, *i.e.* models that automatically learn from data how to solve a problem.

**It is strongly recommended to read the documentation of the functions and methods used!**
Usually, it is enough to search on Google for the name of the function (and, if needed, the name of the library).

Import some libraries.
In particular, `sklearn` is the library that provides the Machine Learning tools!

In [None]:
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import math

### Functions and Classes


`get_labels()` is a function that receives a name (`string`) and returns the class (`int`), according to this mapping:

*   Triangle: 0
*   Square: 1
*   Rectangle: 2
*   Rhombus: 3

Example: 0_triangle.png → 0

In [None]:
def get_labels(name):
    if 'triangle' in name:
        return 0
    elif 'square' in name:
        return 1
    elif 'rectangle' in name:
        return 2
    elif 'rhombus' in name:
        return 3
    else:
        raise NotImplementedError('Not existing class!')

`prepare_data()` is a function that prepares the data for the computation.
Specifically, it returns two lists: `coordinates` and `labels`.
In this exercise, we exclude the `triangle` class for simplicity.

In [None]:
def prepare_data(lines):
    labels = []
    coordinates = []

    for line in lines:
        content = line.split()

        # let's exclude triangles
        if 'triangle' not in content[0]:
            # create label
            labels.append(get_labels(content[0]))

            # coordinates
            coordinates.append([float(x) for x in content[1:]])

    return coordinates, labels
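
The parsing step inside `prepare_data()` can be sketched on a couple of made-up lines. The line format (a file name followed by whitespace-separated numbers), the sample values and the helper name `parse_line` are assumptions for illustration only; the real `shapes.txt` may contain a different number of coordinates per line.

```python
def parse_line(line):
    """Hypothetical helper: split one dataset line into (name, coordinates)."""
    content = line.split()  # split() with no argument also drops the trailing newline
    name = content[0]
    coordinates = [float(value) for value in content[1:]]
    return name, coordinates


# Made-up lines, only to illustrate the assumed format.
sample_lines = [
    '0_triangle.png 0.0 0.0 1.0 0.0 0.5 1.0\n',
    '1_square.png 0.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0\n',
]

for line in sample_lines:
    name, coordinates = parse_line(line)
    print(name, len(coordinates), coordinates)
```

Note that shapes with a different number of vertices would produce coordinate lists of different lengths, which most `sklearn` classifiers do not accept in the same matrix.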

### Body of the solution
Upload the file `shapes.txt`, then open it and read its content.

In [None]:
dataset_file_path = 'shapes.txt'
with open(dataset_file_path, 'r') as f:
    lines = f.readlines()
    print('Read {} lines'.format(len(lines)))

We **shuffle** the data to change the initial order.
This is important because, if the file is ordered by class, slicing it without shuffling could leave some classes out of the training or validation set.

**Tools**:
-    `np.random.shuffle()`: modify a sequence in-place by shuffling its contents.

In [None]:
print('Before shuffling: {}'.format(lines[:10]))
np.random.shuffle(lines)
print('After shuffling: {}'.format(lines[:10]))
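
Since shuffling is random, every run gives a different order and therefore different splits. A minimal sketch of a repeatable shuffle, assuming NumPy's `Generator` API is available; the seed value `42` and the toy list are arbitrary choices, not part of the lesson's dataset.

```python
import numpy as np

items = ['a', 'b', 'c', 'd', 'e', 'f']

# Two generators created with the same seed produce the same permutation.
first = list(items)
np.random.default_rng(42).shuffle(first)   # shuffles the list in place

second = list(items)
np.random.default_rng(42).shuffle(second)

print(first)
print('same order in both runs:', first == second)
print('same elements as before:', sorted(first) == items)
```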

Unlike in the previous exercise, **in this case it is essential to have training, validation and test sets**.
Training data are used to train the model, while the validation split is used to assess performance.

Here, we use validation and test set as synonyms, since we do not have a real test set.

We put **20% of the data in the training set, 20% in the validation set**, and the remaining **60% in the test set**.

In [None]:
trainset = lines[:int(0.2*len(lines))]
valset = lines[int(0.2*len(lines)):int(0.4*len(lines))]
testset = lines[int(0.4*len(lines)):]
print('Total: {} split into Train: {}, Val: {} and Test: {}'.format(len(lines), len(trainset), len(valset), len(testset)))

There is also another way to create the train/val/test splits.

**Tools**:

*   `train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)` (from `sklearn.model_selection`): splits arrays or matrices into random train and test subsets. The data are shuffled by default (`shuffle=True`).



In [None]:
# to apply this method we need two different lists: X (data) and y (labels)
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
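
Because `train_test_split()` produces only two subsets per call, a three-way split needs two calls. The following is a minimal sketch that reproduces the 20/20/60 proportions on synthetic stand-in data; the arrays `X` and `y`, the seed and the use of `stratify` are assumptions for illustration, not the lesson's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 points, 4 features, labels 1, 2, 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.arange(100) % 3 + 1

# First call: keep 20% for training; stratify preserves the class proportions.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.2, random_state=42, stratify=y)

# Second call: validation is 20% of the total, i.e. 25% of the remaining 80%.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.25, random_state=42, stratify=y_rest)

print('Train: {}, Val: {}, Test: {}'.format(len(X_train), len(X_val), len(X_test)))
```

Passing `stratify` is optional; it is used here so that every split contains all classes, which is the same concern that motivates the shuffling step.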

From now on, we have three sets: train, validation and test.

Each data point belongs to exactly one of them: **these three sets are completely disjoint**.

It is important to keep them separate!

Apply **bootstrapping** (random sampling with replacement) to build one training subset per classifier.

In [None]:
import random

set_number = 3

trainsets = []

for i in range(set_number):
  # draw with replacement: the same line can appear more than once in a subset
  trainsets.append(random.choices(trainset, k=len(trainset) // set_number))

for i in range(set_number):
  print('Subset {}: {} elements'.format(i, len(trainsets[i])))
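
To see what "with replacement" means in practice, the sketch below contrasts `random.choices()` (with replacement) and `random.sample()` (without replacement) on a toy list. The list and the seed are illustrative assumptions; also note that this sketch draws as many elements as the original list, whereas the cell above draws smaller subsets.

```python
import random

random.seed(0)  # fixed seed only to make the sketch repeatable

data = list(range(10))  # stand-in for the training lines

# Sampling WITH replacement: the same element may be drawn more than once.
bootstrap = random.choices(data, k=len(data))

# Sampling WITHOUT replacement, for contrast: every element appears exactly once.
no_replacement = random.sample(data, k=len(data))

print('bootstrap:      ', sorted(bootstrap))
print('no replacement: ', sorted(no_replacement))
print('distinct elements in the bootstrap sample:', len(set(bootstrap)))
```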

In [None]:
trains_x_y = []

for i in range(set_number):
  trains_x_y.append(prepare_data(trainsets[i]))

val_x, val_y = prepare_data(valset)
test_x, test_y = prepare_data(testset)

for i in range(set_number):
  print('Train {}: {}'.format(i, len(trains_x_y[i][0])))

print('Val: {} and Test: {}'.format(len(val_x), len(test_x)))

### Classifier
Here, we define which classifier we are going to use to solve our classification problem. Let's use the decision tree implementation of the `sklearn` library (`DecisionTreeClassifier`), with one instance per bootstrap subset.


In [None]:
from sklearn.tree import DecisionTreeClassifier

classifiers = []

for i in range(set_number):
  classifiers.append(DecisionTreeClassifier())

print('{} classifiers declared'.format(len(classifiers)))

### Training
Now we are ready for the training!
With the `sklearn` library this is tremendously simple: for each classifier, we just need to pass it the training data of its bootstrap subset together with the related labels.

**Tools**:
-   `model.fit()`: fit the provided model with training data.

In [None]:
for iteration, (clf, (train_x, train_y)) in enumerate(zip(classifiers, trains_x_y)):
  print('Training classifier {}'.format(iteration), end=", ")
  clf.fit(train_x, train_y)
  print('done!')

### Test

It's time to test the trained models: every classifier votes for a class, and the most voted class becomes the final prediction.
We skip the validation step for simplicity.



In [None]:
pred_y = []

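for x in test_x:
  # one vote counter per class (class 0, triangle, is never used here)
  votes = [0, 0, 0, 0]
  # predict() expects a 2D array with shape (n_samples, n_features)
  sample = np.array(x).reshape(1, -1)
  for clf in classifiers:
    # predict() returns an array with one label per sample: take the only one
    prediction = clf.predict(sample)[0]
    # here, votes are accumulated
    votes[int(prediction)] += 1
  # here, the most voted class is selected (ties go to the lowest label)
  pred_y.append(int(np.argmax(votes)))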

print(pred_y)
print('Final Accuracy: {:.3f}'.format(accuracy_score(test_y, pred_y)))
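
The voting loop above calls `predict()` once per sample and per classifier. A possible alternative is to predict the whole test matrix at once and count the votes with NumPy. The sketch below is self-contained and runs on synthetic two-cluster data, not on `shapes.txt`; the data, the seeds and the names `trees`, `all_preds` and `majority` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: two well-separated clusters labelled 1 and 2.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
y = np.array([1] * 50 + [2] * 50)

# Train three trees, each on its own bootstrap sample of the data.
trees = []
for _ in range(3):
    idx = rng.integers(0, len(X), size=len(X))  # indices drawn with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# One row of predictions per tree, one column per sample: shape (3, 100).
all_preds = np.array([tree.predict(X) for tree in trees])

# Count, for every class, how many trees voted for it on each sample: shape (4, 100).
n_classes = 4
votes = np.stack([(all_preds == c).sum(axis=0) for c in range(n_classes)])

# Most voted class per sample; ties go to the lowest label, as with np.argmax above.
majority = votes.argmax(axis=0)

print('Accuracy on the synthetic data: {:.3f}'.format((majority == y).mean()))
```

For real use, `sklearn.ensemble.BaggingClassifier` packages the same idea (bootstrap samples plus an aggregated prediction) in a single estimator, so the manual version is mainly useful for understanding what happens inside.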