# **Regression with Machine Learning**

In this lesson, we learn how to solve a regression problem through Machine Learning regressors.

**It is strongly recommended to read the documentation of the functions and methods used!**
Usually, it is enough to search the web for the name of the function (and, if needed, the name of the library).

Import some libraries.
In particular, `sklearn` is the library that provides the Machine Learning tools!

In [None]:
import numpy as np
from sklearn import svm
import math

### Functions and Classes
This is the class that we'll use to handle the coordinates of the dataset. We assume that we work only with 2D $(x,y)$ coordinates.

In [None]:
class Point:
    x = None
    y = None

In [None]:
def get_labels(coordinates):
    # The label of a shape is its centroid: the mean of its vertices
    c = [float(x) for x in coordinates]

    if len(c) == 6:
        # three (x, y) vertices
        centroid_x = (c[0] + c[2] + c[4]) / 3
        centroid_y = (c[1] + c[3] + c[5]) / 3
    else:
        # four (x, y) vertices
        centroid_x = (c[0] + c[2] + c[4] + c[6]) / 4
        centroid_y = (c[1] + c[3] + c[5] + c[7]) / 4

    return [centroid_x, centroid_y]
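
As a quick check of what `get_labels()` computes, the vertex mean can also be written for any number of vertices. This is only a sketch: the helper name `centroid` and the sample shapes are made up for illustration and are not part of the lesson's code.

```python
import numpy as np

def centroid(coordinates):
    """Return the mean of the vertices given as a flat [x1, y1, x2, y2, ...] list."""
    values = [float(c) for c in coordinates]
    points = np.array(values).reshape(-1, 2)  # one (x, y) row per vertex
    return points.mean(axis=0).tolist()

# Hypothetical shapes, given as strings like the fields read from a text file
square = ['0', '0', '4', '0', '4', '4', '0', '4']
triangle = ['0', '0', '3', '0', '0', '3']

print(centroid(square))    # [2.0, 2.0]
print(centroid(triangle))  # [1.0, 1.0]
```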

`prepare_data()` is a function that prepares the data for the computation.
Specifically, it returns two lists: `coordinates` and `labels`.
In this exercise, we exclude `triangles` from the classes for simplicity.

In [None]:
def prepare_data(lines):
    labels = []
    coordinates = []

    for line in lines:
        content = line.split()

        # let's exclude triangles
        if 'triangle' not in content[0]:
            # create label
            labels.append(get_labels(content[1:]))

            # coordinates
            coordinates.append([float(x) for x in content[1:]])

    return coordinates, labels

### Body of the solution
Upload the file `shapes.txt`.
Then open the dataset file `shapes.txt` and read its content.

In [None]:
dataset_file_path = 'shapes.txt'
with open(dataset_file_path, 'r') as f:
    lines = f.readlines()
    print('Read {} lines'.format(len(lines)))

We **shuffle** the data to change the initial order.
This is important so that both the training and the validation sets cover all possible ground truth values.

**Tools**:
-    `np.random.shuffle()`: modify a sequence in-place by shuffling its contents.

In [None]:
print('Before shuffling: {}'.format(lines[:10]))
np.random.shuffle(lines)
print('After shuffling: {}'.format(lines[:10]))
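
If the shuffle has to be repeatable across runs, one option is a seeded random generator. The sketch below uses NumPy's `default_rng`; the seed value and the stand-in `lines` list are arbitrary choices made for illustration.

```python
import numpy as np

# Stand-in for the lines read from the dataset file
lines = ['line {}'.format(i) for i in range(10)]

rng = np.random.default_rng(42)  # the seed value 42 is arbitrary
rng.shuffle(lines)               # shuffles the list in place

print(lines)
```

With the same seed, the same input always ends up in the same order, which makes different experiments comparable.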

**It is essential to have training, validation and test sets**.
Training data are used to train the model, while the validation split is used to assess performance.

Here, we use validation and test set as synonyms, since we do not have a real test set.

We put **60% of the data in training, 20% in validation**, and the remaining **20% in the test set**.

In [None]:
trainset = lines[:int(0.6*len(lines))]
valset = lines[int(0.6*len(lines)):int(0.8*len(lines))]
testset = lines[int(0.8*len(lines)):]
print('Total: {} split into Train: {}, Val: {} and Test: {}'.format(len(lines), len(trainset), len(valset), len(testset)))

There is also another way to create the train/val/test splits.

**Tools**:

*   `train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)`: splits arrays or matrices into random train and test subsets. It is also possible to shuffle data.



In [None]:
# to apply this method we need two different lists: X (data) and y (labels)
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
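
Since `train_test_split` produces only two subsets per call, a 60/20/20 split needs two calls. The following is a sketch on toy arrays; `X`, `y` and the intermediate names `X_rest`, `y_rest` are placeholders, not the lesson's real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real data: 100 samples, 8 coordinates and 2 targets each
X = np.arange(800, dtype=float).reshape(100, 8)
y = np.arange(200, dtype=float).reshape(100, 2)

# First split off 20% of the data as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ... then take 25% of the remaining 80% (20% of the total) as validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```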

From this moment, we will have three sets: train, validation and test set.

A single datapoint belongs to only one of them: **these three sets are completely disjoint**.

It is important to keep them separated!

In [None]:
train_x, train_y = prepare_data(trainset)
val_x, val_y = prepare_data(valset)
test_x, test_y = prepare_data(testset)
print('Train: {}, Val: {} and Test: {}'.format(len(train_x), len(val_x), len(test_x)))
print('Total: {}'.format(len(train_x) + len(val_x) + len(test_x)))

### Regressor
Here, we define which regressor we are going to use to solve our regression problem. Let's use the SVR (Support Vector Regression, the regression counterpart of SVM) implementation of the `sklearn` library.


In [None]:
from sklearn.multioutput import MultiOutputRegressor

# SVR predicts a single value, but our labels have two (x and y of the centroid):
# MultiOutputRegressor fits one SVR per output
clf = svm.SVR()
clf = MultiOutputRegressor(clf)

# from sklearn.neural_network import MLPRegressor
# clf = MLPRegressor()
# clf =
# clf =

### Training
Now we are ready for the training!
With the `sklearn` library this is tremendously simple: we just need the training data (`train_x` and the related labels `train_y`) and pass them to the regressor.

**Tools**:
-   `model.fit()`: fit the provided model with training data.

In [None]:
clf.fit(train_x, train_y)

### Validation

It's time to validate the trained model, in order to find proper hyperparameters.

**Tools:**
*   `score()`: evaluates the quality of a model’s predictions. For regressors, it returns the coefficient of determination $R^2$ (1.0 means a perfect fit).
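
For `sklearn` regressors, `score()` is the coefficient of determination $R^2$ rather than a classification accuracy. The sketch below computes it by hand, averaging uniformly over the output columns, which is assumed here to match the default behaviour for multi-output targets. The function name `r2_uniform_average` is made up for illustration, and the code assumes that no target column is constant.

```python
import numpy as np

def r2_uniform_average(y_true, y_pred):
    """R^2 per output column, then averaged over the columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)               # residual sum of squares
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)  # total sum of squares
    return float(np.mean(1.0 - ss_res / ss_tot))

y_true = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean_pred = [[2.0, 4.0]] * 3  # always predicting the column means

print(r2_uniform_average(y_true, y_true))     # 1.0
print(r2_uniform_average(y_true, mean_pred))  # 0.0
```

A model that always predicts the mean of the targets scores 0, and a model worse than that gets a negative score.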



In [None]:
print('Validation R^2 score: {:.3f}'.format(clf.score(val_x, val_y)))
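
Using the validation score to choose a hyperparameter can be sketched as a simple loop. Everything below is illustrative: the data are synthetic (random vertices with their mean as target), and the candidate values of the SVR parameter `C` are arbitrary.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 4 random vertices per sample,
# target = mean of the x coordinates and mean of the y coordinates
X = rng.uniform(0, 224, size=(200, 8))
y = np.stack([X[:, 0::2].mean(axis=1), X[:, 1::2].mean(axis=1)], axis=1)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

best_C, best_score = None, -np.inf
for C in [0.1, 1.0, 10.0, 100.0]:  # candidate values chosen arbitrarily
    model = MultiOutputRegressor(SVR(C=C))
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # R^2 on the validation split
    print('C={}: validation R^2 = {:.3f}'.format(C, score))
    if score > best_score:
        best_C, best_score = C, score

print('Best C:', best_C)
```

The test set is not touched in this loop: it is kept aside for the final evaluation of the chosen model.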

### Test
Now we are ready to use our regressor! The trained regressor outputs the labels (as defined above) for the regression task.

**Tools**:
-   `model.predict()`: predict the values.

In [None]:
pred_y = clf.predict(test_x)
print('Predicted {} samples: {}'.format(len(pred_y), pred_y[:5]))
print('GT {} samples: {}'.format(len(test_y), test_y[:5]))

In [None]:
import cv2
from google.colab.patches import cv2_imshow

for t, p, gt in zip(test_x[:4], pred_y, test_y):

  # white 800x800 canvas (OpenCV colors are in BGR order)
  white = np.full((800, 800, 3), 255, dtype=np.uint8)

  # the four vertices of the shape, as filled blue dots
  cv2.circle(white, (int(t[0]), int(t[1])), 3, (255, 0, 0), -1)
  cv2.circle(white, (int(t[2]), int(t[3])), 3, (255, 0, 0), -1)
  cv2.circle(white, (int(t[4]), int(t[5])), 3, (255, 0, 0), -1)
  cv2.circle(white, (int(t[6]), int(t[7])), 3, (255, 0, 0), -1)

  # prediction (red circle)
  print(p)
  cv2.circle(white, (int(p[0]), int(p[1])), 3, (0, 0, 255))

  # ground truth (green circle)
  print(gt)
  cv2.circle(white, (int(gt[0]), int(gt[1])), 3, (0, 255, 0))

  cv2_imshow(white)

It's time to understand how good the trained regressor is.
We need some metrics for regression!

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(test_y, pred_y, multioutput='raw_values')
print('MAE:', mae)

mse = mean_squared_error(test_y, pred_y, multioutput='raw_values')
print('MSE:', mse)
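
For reference, with `multioutput='raw_values'` the two metrics are reported separately for each output column $j$ (here the $x$ and the $y$ of the centroid). Writing $y_{ij}$ for the ground truth and $\hat{y}_{ij}$ for the prediction of sample $i$, over $n$ test samples, the standard definitions are:

$$
\mathrm{MAE}_j = \frac{1}{n}\sum_{i=1}^{n} \left| y_{ij} - \hat{y}_{ij} \right|,
\qquad
\mathrm{MSE}_j = \frac{1}{n}\sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{ij} \right)^2
$$

The MAE is expressed in the same unit as the coordinates, while the MSE is in squared units and penalizes large errors more heavily.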

### Exercise/Homework
1) Implement your own MAE, MSE, ...

2) Try to obtain the highest accuracy in regression!
You can use:
- Different **Regressors**:
  *   SVR, MLP, ...
  *   You can also install other packages (for instance, `xgboost`)

Remember to import the regressors from the `sklearn` package!

- Different **data** in input (you can provide not only the raw coordinates of the shapes, but also other values like diagonals and so on).
- Different **normalization** of data.
- Different **data splits** (you can vary the amount of samples in train, val and test sets).
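
As an example of providing extra input values, the lengths of the two diagonals can be appended to the raw coordinates. This is a sketch: the helper name `add_diagonals` is hypothetical, and it assumes that the four vertices are listed in order around the shape, which should be checked against the dataset.

```python
import math

def add_diagonals(coords):
    """Append the two diagonal lengths to a flat [x1, y1, ..., x4, y4] list."""
    x1, y1, x2, y2, x3, y3, x4, y4 = coords
    d13 = math.hypot(x3 - x1, y3 - y1)  # from vertex 1 to vertex 3
    d24 = math.hypot(x4 - x2, y4 - y2)  # from vertex 2 to vertex 4
    return list(coords) + [d13, d24]

# Hypothetical 3 x 4 rectangle: both diagonals have length 5
rectangle = [0.0, 0.0, 3.0, 0.0, 3.0, 4.0, 0.0, 4.0]
features = add_diagonals(rectangle)
print(features[-2:])  # [5.0, 5.0]
```

Such a helper would be applied to each sample inside `prepare_data()`, so that train, validation and test data all get the same extra columns.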

**Data Normalization**
The purpose of normalization is to transform the data so that they have similar distributions. For instance, standardization rescales the data to zero mean and unit standard deviation:

> `coordinates = (coordinates - np.mean(coordinates)) / np.std(coordinates)`

while min-max normalization maps the data into the range [0, 1] (a further `2 * coordinates - 1` maps them into [-1, +1]):

> `coordinates = (coordinates - np.min(coordinates)) / (np.max(coordinates)-np.min(coordinates))`

In our case, in the Euclid dataset, all coordinates are already in the range [0, 224], so normalization is not strictly needed (actually, in some cases it decreases the final accuracy, since normalization compresses the data within a certain range, reducing the variance).
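
When normalization is used, a common practice is to compute the statistics on the training data only and to reuse them for the validation and test data, so that no information leaks from the held-out sets. A minimal sketch with min-max scaling on made-up arrays:

```python
import numpy as np

# Made-up coordinates standing in for the training and validation data
train = np.array([[0.0, 112.0], [224.0, 56.0]])
val = np.array([[112.0, 224.0]])

# Statistics are taken from the training data only
lo, hi = train.min(), train.max()

train_n = (train - lo) / (hi - lo)  # rows become [0, 0.5] and [1, 0.25]
val_n = (val - lo) / (hi - lo)      # row becomes [0.5, 1]

print(train_n)
print(val_n)
```

Note that the labels are centroids in the same coordinate system, so if the inputs are rescaled, the labels have to be rescaled consistently or the predictions mapped back.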

**NB** In order to obtain comparable results, do not shuffle the dataset again. Only modify the `prepare_data()` function, and/or define a new regressor, and then run a new `fit()` and `score()` procedure.


