 # Machine Learning LAB 1
 Academic Year 2021/22, P. Zanuttigh, U. Michieli, F. Barbato, D. Shenaj, G. Rizzoli

The notebook contains some simple classification and regression tasks. Complete **all** the required code sections and answer **all** the questions.

### IMPORTANT 1: make sure to rerun all the code from the beginning to obtain the results for the final version of your notebook, since this is the way we will do it before evaluating your notebook!


### IMPORTANT 2: Place your name and ID number. Also remember to save the file as Surname_Name_LAB1.ipynb . Notebooks without a name will be discarded.

**Student name**: Alessandro Zanoli<br>
**ID Number**: 2057447


# 1) Classification of Music genre

### Dataset description

A music genre is a conventional category that identifies pieces of music as belonging to a shared tradition or set of conventions. It is to be distinguished from musical form and musical style. The features extracted from these songs can help the machine to assign them to the two genres. 

This dataset is a subset of the dataset provided [here](https://www.kaggle.com/insiyeah/musicfeatures), containing only the data regarding the classical and metal genres.

### We consider 3 features for the classification

1) **tempo**, the speed at which a passage of music is played, i.e., the beats per minute of the musical piece<br>
2) **chroma_stft**, [mean chromagram activation on Short-Time Fourier Transform](https://librosa.org/doc/0.7.0/generated/librosa.feature.chroma_stft.html)<br>
3) **spectral_centroid**, Indicates where the "center of mass" of the spectrum is located, i.e., it is the weighted average of the frequency transform<br>
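
To make the "center of mass" idea concrete, here is a minimal sketch that computes a spectral centroid for a synthetic tone with NumPy only. The helper `spectral_centroid` and the sampling rate are assumptions made for illustration; the dataset's feature was presumably averaged over short-time frames, which this whole-signal version does not reproduce.

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the whole signal (hypothetical helper)."""
    magnitudes = np.abs(np.fft.rfft(signal))                   # magnitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)  # frequency of each bin in Hz
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

sample_rate = 8000                        # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of samples
tone = np.sin(2 * np.pi * 440 * t)        # pure 440 Hz tone
centroid = spectral_centroid(tone, sample_rate)
print(f"Spectral centroid: {centroid:.1f} Hz")
```

For a pure tone all the spectral energy sits at one frequency, so the centroid coincides with it; a brighter signal with more high-frequency content would push the centroid upward.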


We first import all the packages that are needed.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv

import numpy as np
import scipy as sp
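import sklearn as sl
from scipy import stats
from sklearn import datasets
from sklearn import linear_model
from sklearn import preprocessing  # imported explicitly so that sl.preprocessing is always available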

# Perceptron
Firstly we will implement the perceptron algorithm and use it to learn a halfspace.

**TO DO** Set the random seed, you can use your ID (matricola) or any other number! Try to make various tests changing the seed.
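
As a quick illustration of what the seed does, the sketch below (with arbitrary example seeds, not the ones used in this notebook) shows that the same seed reproduces the same permutation, while a different seed generally gives a different one.

```python
import numpy as np

np.random.seed(12345)                  # arbitrary example seed
first_run = np.random.permutation(10)

np.random.seed(12345)                  # same seed again: same sequence of random numbers
second_run = np.random.permutation(10)

np.random.seed(54321)                  # a different seed
third_run = np.random.permutation(10)

print("Same seed, same permutation:", np.array_equal(first_run, second_run))
print("First run: ", first_run)
print("Third run: ", third_run)
```

This is why rerunning the whole notebook from the beginning gives the same split and the same perceptron updates every time.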

In [None]:
IDnumber = 2057447 #YOUR_ID , try also to change the seed to see the impact of random initialization on the results
np.random.seed(IDnumber)

Load the dataset and then split it into a training set and a test set (the training set is typically larger; you can use a 75% training / 25% test split) after applying a random permutation to the dataset.

A) Load dataset and perform permutation

In [None]:
# Load the dataset
filename = 'data/music.csv'
music = csv.reader(open(filename, newline='\n'), delimiter=',')

header = next(music) # skip first line
print(f"Header: {header}\n")

dataset = np.array(list(music))
print(f"Data shape: {dataset.shape}\n")
print("Dataset Example:")
print(dataset[:10,...])

X = dataset[:,:-1].astype(float) #columns 0,1,2 contain the features
Y = dataset[:,-1].astype(int)    # last column contains the labels

print(X.shape)
print(Y.shape)

                                 # for the dataset, classical--> 0, metal --> 1
Y = 2*Y-1                        # for the perceptron classical--> -1, metal-->1
m = dataset.shape[0]
print("\nNumber of samples loaded:", m)


In [None]:
# plot data in order to get a feeling for the dataset
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

colors=[
    'darkred', 
    'navy',
    'green',
    'slateblue',
    'coral'
]

ax.scatter(X[Y>=0][:,0],X[Y>=0][:,1],X[Y>=0][:,2],color=colors[0],alpha=0.5,label='Metal') #  metal
ax.scatter(X[Y<=0][:,0],X[Y<=0][:,1],X[Y<=0][:,2],color=colors[1],alpha=0.5,label='Classical') # classical
ax.legend()


plt.show() 

We are going to classify class "1" (metal) vs class "-1" (classical)

B) **TO DO** Divide the data into training set and test set (75% of the data in the first set, 25% in the second one)
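
A possible way to sketch the permute-and-split step on toy data is shown here. The helper name `split_with_min_per_class`, its parameters and the toy arrays are all hypothetical; the sketch only illustrates the idea of redrawing the permutation until **both** classes have enough training samples.

```python
import numpy as np

def split_with_min_per_class(X, Y, train_fraction=0.75, min_per_class=10, rng=None):
    """Permute and split, redrawing until both classes appear often enough in training.

    Hypothetical helper: it loops forever if the requirement cannot be met.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[0]
    m_training = int(m * train_fraction)
    while True:
        permutation = rng.permutation(m)
        X_perm, Y_perm = X[permutation], Y[permutation]
        Y_train = Y_perm[:m_training]
        enough_pos = np.sum(Y_train == 1) >= min_per_class
        enough_neg = np.sum(Y_train == -1) >= min_per_class
        if enough_pos and enough_neg:   # stop only when BOTH conditions hold
            return X_perm[:m_training], Y_train, X_perm[m_training:], Y_perm[m_training:]

rng = np.random.default_rng(0)              # toy generator, independent of the notebook seed
X_toy = rng.normal(size=(100, 3))           # 100 toy samples with 3 features
Y_toy = np.array([1] * 50 + [-1] * 50)      # balanced toy labels in {-1, 1}
X_tr, Y_tr, X_te, Y_te = split_with_min_per_class(X_toy, Y_toy, rng=rng)
print(X_tr.shape, X_te.shape)
```

Features and labels are indexed with the same permutation, so each sample keeps its own label after shuffling.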

In [None]:
# before dividing the dataset, we will normalize it using sklearn's routine
# this helps the perceptron's convergence, since the dot product will
# receive contributions from each feature with the same "importance"

X_norm=sl.preprocessing.normalize(X,axis=0)

# NOTE:
# preprocessing.normalize(X, axis=0) rescales each feature column to unit L2 norm
# (the squared entries of each column sum to one).
# This is different from StandardScaler(), which centers each feature and rescales it
# so that it has mean=0 and std=1.
# Nonetheless, it works well enough here.

print("Normalized data set (first 10 rows):")
print(X_norm[:10])

In [None]:
# Divide in training and test: make sure that your training set
# contains at least 10 elements from class 1 and at least 10 elements
# from class -1! If it does not, modify the code so to apply more random
# permutations (or the same permutation multiple times) until this happens.
# IMPORTANT: do not change the random seed.

n_classical_train=0
n_metal_train=0
while (n_classical_train < 10) or (n_metal_train < 10):  # repeat until BOTH classes have at least 10 training samples

    # the permutation is drawn inside the loop, so it is executed once
    # and then repeated as many times as necessary
    permutation = np.random.permutation(m) # random permutation

    X_norm = X_norm[permutation]
    Y = Y[permutation]


    # m_training needs to be the number of samples in the training set
    m_training = int(m*0.75)

    # m_test needs to be the number of samples in the test set
    m_test = m - m_training

    # X_training = instances for training set
    X_training = X_norm[:m_training]
    #Y_training = labels for the training set
    Y_training = Y[:m_training]

    # X_test = instances for test set
    X_test = X_norm[m_training:]
    # Y_test = labels for the test set
    Y_test = Y[m_training:]


    print(Y_training) # to make sure that Y_training contains both 1 and -1
    print(m_test)

    n_classical_train = np.sum(Y_training==-1)
    n_metal_train = np.sum(Y_training==1)
    n_classical_test=np.sum(Y_test==-1)
    n_metal_test = np.sum(Y_test==1)

    print("\nNumber of classical instances in test:", n_classical_test)
    print("Number of metal instances in test:", n_metal_test)

    print("Shape of training set: " + str(X_training.shape))
    print("Shape of test set: " + str(X_test.shape))

We add a 1 in front of each sample so that we can use a vector in homogeneous coordinates to describe all the coefficients of the model. This can be done with the function $hstack$ in $numpy$.
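
The small sketch below, on made-up numbers, shows why the extra 1 is convenient: with homogeneous coordinates the bias becomes just the first coefficient of the weight vector, and a single dot product replaces "weights times features plus bias".

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])     # two toy samples with two features each
b = 0.5                        # toy bias term
w = np.array([1.0, -1.0])      # toy weights

X_h = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend a column of ones
w_h = np.concatenate(([b], w))                  # the bias becomes the first coefficient

affine = X @ w + b             # explicit bias
homogeneous = X_h @ w_h        # bias absorbed into w_h

print(affine, homogeneous)
```

Both expressions give the same values, so the model can be described by the single vector `w_h`.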

In [None]:
# Add a 1 to each sample (homogeneous coordinates)
X_training = np.hstack((np.ones((m_training,1)),X_training))
X_test = np.hstack((np.ones((m_test,1)),X_test))

print("Training set in homogeneous coordinates:")
print(X_training[:10])

**TO DO** Now complete the function *perceptron*. Since the perceptron does not terminate if the data is not linearly separable, your implementation should return the desired output (see below) if it reached the termination condition seen in class or if a maximum number of iterations have already been run, where one iteration corresponds to one update of the perceptron weights. In case the termination is reached because the maximum number of iterations have been completed, the implementation should return **the best model** seen up to now.

The input parameters to pass are:
- $X$: the matrix of input features, one row for each sample
- $Y$: the vector of labels for the input features matrix X
- $max\_num\_iterations$: the maximum number of iterations for running the perceptron

The output values are:
- $best\_w$: the vector with the coefficients of the best model
- $best\_error$: the *fraction* of misclassified samples for the best model
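
A minimal function-style sketch matching the interface described above is given here. It is only one possible reading of the specification: the optional `rng` argument and the toy data are assumptions, and the misclassified sample used for each update is picked at random.

```python
import numpy as np

def perceptron(X, Y, max_num_iterations, rng=None):
    """Return (best_w, best_error); best_error is the fraction of misclassified samples."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])
    best_w = w.copy()
    misclassified = np.sign(X @ w) != Y        # points on the boundary count as errors
    best_error = misclassified.mean()
    for _ in range(max_num_iterations):
        if not misclassified.any():            # termination condition: no errors left
            break
        i = rng.choice(np.flatnonzero(misclassified))  # random misclassified sample
        w = w + Y[i] * X[i]                    # perceptron update
        misclassified = np.sign(X @ w) != Y
        error = misclassified.mean()
        if error < best_error:                 # keep the best model seen so far
            best_w, best_error = w.copy(), error
    return best_w, best_error

# toy linearly separable data, already in homogeneous coordinates
X_toy = np.array([[1.0,  2.0,  2.0],
                  [1.0,  3.0,  1.0],
                  [1.0, -2.0, -2.0],
                  [1.0, -1.0, -3.0]])
Y_toy = np.array([1, 1, -1, -1])
best_w, best_error = perceptron(X_toy, Y_toy, 100, rng=np.random.default_rng(0))
print(best_w, best_error)
```

Keeping a copy of the best weights matters only when the data are not linearly separable, since in that case the last iterate is not necessarily the best one.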

In [None]:
# PERCEPTRON CLASS
# implementing as a class because i find it easier to organize

class Perceptron:

    def __init__(self):
        self.w = 0
        self.best_w = self.w
        self.best_loss = -1
        

    def train(self,X,Y,Nmax,debug=False,lr=1):
        """
        Perceptron object is initialized with training.
        X = features array, shaped like (Nsamples,Nfeatures)
            needs to be expressed in homogeneous coordinates, so that
            Nfeatures has an additional 1 as first coordinate.
        Y = labels array, shaped like (Nsamples); its values are either 1 or -1
        """
        self.w = np.zeros(X.shape[1])
        self.best_w = self.w
        
        if debug:
            print("Initialized w as a zero vector")
            print(self.w)

        t = 0

        if debug: print("Calculating initial loss")
        
        loss, misclassified = self.loss(X,Y,debug)

        self.best_loss = loss
        
        if debug:
            print("Total loss = ",loss)
            print("Array of misclassifications (length = {}):".format(misclassified.shape))
            print(misclassified)

        while (t<abs(Nmax)) and (loss>0):
            t += 1


            random_index = np.random.randint(len(X[misclassified]))

            # ---------------------------------DEBUG
            # check progress every 500 steps
            if debug and t%500==0:
                print("Chosen point was number {} of misclassified samples, i.e.:".format(random_index) )
                print(X[misclassified][random_index])
                print(Y[misclassified][random_index])
                print("Total loss = ",loss)
                print("Array of misclassifications (length = {}):".format(misclassified.shape))
                print(misclassified)
            # ------------------------------------

            # ------------------ UPDATE
            self.update(X[misclassified][random_index],Y[misclassified][random_index],lr)

            loss,misclassified = self.loss(X,Y,debug)
            
            if loss<self.best_loss:
                self.best_w = self.w
                self.best_loss = loss

            #if debug: print("Current loss = ", loss)

        # in the end, keep the best w
        self.w = self.best_w

        if debug: 
            print("Best loss was ",self.best_loss)
            print(f"We got {self.best_loss / X.shape[0] :.2%} wrong!")
            print("Best w was: ",self.best_w)
            


    def classify(self,x):
        # x must be [1,feature1,feature2,...]
        # if plane vector w is "facing" the data point
        # it gets classified as 1
        # otherwise as -1
        # returns 0 if on the boundary
        return np.sign(np.dot(self.w,x))

    def update(self,x,y,lr):
        """
        Update w using data point 
        x = [1,feature1,feature2,...]
        y = label {-1,1}
        with learning rate lr
        """
        self.w = self.w + lr*y*x

    def loss(self,X,Y,debug=False):
        """
        Calculates loss using current value of w over
        a data set X,Y.
        Returns total loss (equal to the number of misclassified samples)
        And array of boolean values corresponding to misclassified samples
        """
        
        #np.apply_along_axis takes a function, an axis and an array
        misclassified = np.apply_along_axis(self.classify,1,X) != Y

        loss = int(np.sum(misclassified))  # number of True entries = number of misclassified samples
            
        return loss, misclassified



Now we use the implementation above of the perceptron to learn a model from the training data using 100 iterations and print the error of the best model we have found.

In [None]:
#now run the perceptron for 100 iterations
perceptron = Perceptron()
perceptron.train(X_training,Y_training,100)

print("Training Error of perceptron (100 iterations): {:.1%}".format(perceptron.best_loss/X_training.shape[0]))
print("Best w found is {}".format(perceptron.w))

In [None]:
# Let's plot again the data adding the plane
# corresponding to the w vector found, and see if it makes sense.

fig1 = plt.figure()
ax1 = fig1.add_subplot(projection='3d')


#plotting against normalized training datapoints
ax1.scatter(X_training[Y_training>=0][:,1],X_training[Y_training>=0][:,2],X_training[Y_training>=0][:,3],color=colors[0],alpha=0.5,label='Metal') #  metal
ax1.scatter(X_training[Y_training<=0][:,1],X_training[Y_training<=0][:,2],X_training[Y_training<=0][:,3],color=colors[1],alpha=0.5,label='Classical') # classical

x_plot = np.linspace(np.min(X_training[:,1]),np.max(X_training[:,1]),10)
y_plot = np.linspace(np.min(X_training[:,2]),np.max(X_training[:,2]),10)

# building the points of the plane
xx, yy = np.meshgrid(x_plot,y_plot)
#xx,yy = np.meshgrid(X_training[::10,1],X_training[::10,2])

# multiplying by X_training[0,0] shouldn't be necessary since its value is 1 (the homogeneous coordinate). But this works even when normalizing with the homogeneous coordinate.
zz = (-perceptron.best_w[0]*X_training[0,0] - xx*perceptron.best_w[1] - yy*perceptron.best_w[2])/perceptron.best_w[3]

ax1.plot_surface(xx,yy,zz,alpha=0.5,color=colors[2])
ax1.legend()
ax1.view_init(elev=10., azim=10)

plt.show()

**TO DO** use the best model $w\_found$ to predict the labels for the test dataset and print the fraction of misclassified samples in the test set (the test error that is an estimate of the true loss).
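
As a reference for what "fraction of misclassified samples" means, here is a tiny sketch on made-up data: predictions are the sign of the dot product, and the error is the mean of the mismatches. The weight vector and samples are toy values, not results from this notebook.

```python
import numpy as np

w = np.array([0.0, 1.0])                    # toy model in homogeneous coordinates
X_test_toy = np.array([[1.0,  2.0],
                       [1.0, -1.0],
                       [1.0,  0.5],
                       [1.0, -3.0]])        # toy test samples (first column is the 1)
Y_test_toy = np.array([1, -1, -1, -1])      # toy labels

predictions = np.sign(X_test_toy @ w)               # predicted labels in {-1, 0, 1}
test_error = np.mean(predictions != Y_test_toy)     # fraction of misclassified samples
print("Toy test error: {:.2%}".format(test_error))
```

Here only the third sample is misclassified, so the error is one out of four.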

In [None]:
#now use the w_found to make predictions on test dataset

num_errors,test_misclassified = perceptron.loss(X_test,Y_test)

true_loss_estimate = num_errors/m_test  # error rate on the test set
#NOTE: you can avoid using num_errors if you prefer, as long as true_loss_estimate is correct
print("Test Error of perceptron (100 iterations): {:.2%}".format(true_loss_estimate))

In [None]:
# again, plot the plane and data points, this time taken from test set

fig2 = plt.figure()
ax2 = fig2.add_subplot(projection='3d')


#plotting against normalized test datapoints
ax2.scatter(X_test[Y_test>=0][:,1],X_test[Y_test>=0][:,2],X_test[Y_test>=0][:,3],color=colors[4],alpha=0.5,label='Metal (test)') #  metal
ax2.scatter(X_test[Y_test<=0][:,1],X_test[Y_test<=0][:,2],X_test[Y_test<=0][:,3],color=colors[3],alpha=0.5,label='Classical (test)') # classical

x_plot = np.linspace(np.min(X_test[:,1]),np.max(X_test[:,1]),10)
y_plot = np.linspace(np.min(X_test[:,2]),np.max(X_test[:,2]),10)

# building the points of the plane
xx, yy = np.meshgrid(x_plot,y_plot)

zz = (-perceptron.best_w[0]*X_training[0,0] - xx*perceptron.best_w[1] - yy*perceptron.best_w[2])/perceptron.best_w[3]

ax2.plot_surface(xx,yy,zz,alpha=0.5,color=colors[2])
ax2.legend()
ax2.view_init(azim=26., elev=15)

zmin=np.min([np.min(X_training[:,3]),np.min(X_test[:,3])])
zmax=np.max([np.max(X_test[:,3]),np.max(X_training[:,3])])
ax2.set_zlim(zmin,zmax)

plt.show()

**TO DO** **[Answer the following]** What about the difference between the training error and the test error (in terms of fraction of misclassified samples)? Explain what you observe. [Write the answer in this cell]

**ANSWER QUESTION 1**

The permutation used to split the dataset into training and test sets directly affects both errors; in some cases it can even produce a linearly separable training set, in which case the perceptron converges and may overfit the training data. Since there is no hyperparameter to tune, the best way to tackle this is to ensure that both sets sample the underlying distribution in the same way. 
The code snippet below shows the difference between the training set and the test set, together with the plane found by the perceptron trained on the training set only.

In [None]:
# ANSWER TO QUESTION 1:
# apparently the entirety of the samples represent a trend that isn't entirely
# captured by the model trained on the training dataset only.
# we can try to visualize the difference between the training set and test set by plotting them both:

fig3 = plt.figure()
ax3 = fig3.add_subplot(projection='3d')
#plotting normalized training datapoints
ax3.scatter(X_training[Y_training>=0][:,1],X_training[Y_training>=0][:,2],X_training[Y_training>=0][:,3],color=colors[0],alpha=0.5,label='Metal (training)')
ax3.scatter(X_training[Y_training<=0][:,1],X_training[Y_training<=0][:,2],X_training[Y_training<=0][:,3],color=colors[1],alpha=0.5,label='Classical (training)')
#plotting normalized test datapoints
ax3.scatter(X_test[Y_test>=0][:,1],X_test[Y_test>=0][:,2],X_test[Y_test>=0][:,3],color=colors[4],alpha=0.5,label='Metal (test)') 
ax3.scatter(X_test[Y_test<=0][:,1],X_test[Y_test<=0][:,2],X_test[Y_test<=0][:,3],color=colors[3],alpha=0.5,label='Classical (test)')
#plotting perceptron plane
x_plot = np.linspace(np.min(X_test[:,1]),np.max(X_test[:,1]),10)
y_plot = np.linspace(np.min(X_test[:,2]),np.max(X_test[:,2]),10)
xx, yy = np.meshgrid(x_plot,y_plot)
zz = (-perceptron.best_w[0]*X_training[0,0] - xx*perceptron.best_w[1] - yy*perceptron.best_w[2])/perceptron.best_w[3]

ax3.plot_surface(xx,yy,zz,alpha=0.5,color=colors[2])
ax3.legend()
ax3.view_init(azim=26., elev=15)

ax3.set_zlim(zmin,zmax)

plt.show()



**TO DO** Copy the code from the last 2 cells above in the cell below and repeat the training with 4000 iterations. Then print the error in the training set and the estimate of the true loss obtained from the test set.

In [None]:
#now run the perceptron for 4000 iterations here!

perceptron.train(X_training,Y_training,4000)

print("Best w found: ",perceptron.best_w)

print("Training Error of perceptron (4000 iterations): {:.2%}".format(perceptron.best_loss/X_training.shape[0]))

print("Test Error of perceptron (4000 iterations): {:.2%}".format(perceptron.loss(X_test,Y_test)[0]/X_test.shape[0]))

**TO DO** [Answer the following] What about the difference between the training error and the test error (in terms of fraction of misclassified samples) when running for a larger number of iterations? Explain what you observe and compare with the previous case. [Write the answer in this cell]

**ANSWER QUESTION 2**

With a longer training run, the algorithm lowers its training loss considerably, and even reaches zero loss on the test set (using random seed 2057447); apparently, the training/test split sampled the data points uniformly enough to represent the distribution of the whole set, which allows the algorithm to find a predictor that generalizes well beyond the training set.

# Logistic Regression
Now we use logistic regression, exploiting the implementation in Scikit-learn, to predict labels. We will also plot the decision region of logistic regression.

We first load the dataset again.

In [None]:
# Load the dataset
filename = 'data/music.csv'
music = csv.reader(open(filename, newline='\n'), delimiter=',')

header = next(music) # skip first line
print(f"Header: {header}\n")

dataset = np.array(list(music))
print(f"Data shape: {dataset.shape}\n")
print("Dataset Example:")
print(dataset[:10,...])

X = dataset[:,:-1].astype(float) # columns 0,1,2 contain the features
Y = dataset[:,-1].astype(int)    # last column contains the labels

Y = 2*Y-1                        # as for the perceptron: classical--> -1, metal-->1
m = dataset.shape[0]
print("\nNumber of samples loaded:", m)

**TO DO** As for the previous part, divide the data into training and test (75%-25%) and add a 1 as first component to each sample.

In [None]:
#Divide in training and test: make sure that your training set
#contains at least 10 elements from class 1 and at least 10 elements
#from class -1! If it does not, modify the code so to apply more random
#permutations (or the same permutation multiple times) until this happens.
#IMPORTANT: do not change the random seed.

# normalize for easier convergence, NOT ADDING HOMOGENEOUS COORDINATE since the logreg class already fits the intercept
X_norm = sl.preprocessing.normalize(X,axis=0)

print(X_norm[:10])

n_classical_train=0
n_metal_train=0
while (n_classical_train < 10) or (n_metal_train < 10):  # repeat until BOTH classes have at least 10 training samples

    # the permutation is drawn inside the loop, so it is executed once
    # and then repeated as many times as necessary
    permutation = np.random.permutation(m) # random permutation

    X_norm = X_norm[permutation]
    Y = Y[permutation]


    # m_training needs to be the number of samples in the training set
    m_training = int(m*0.75)

    # m_test needs to be the number of samples in the test set
    m_test = m - m_training

    # X_training = instances for training set
    X_training = X_norm[:m_training]
    #Y_training = labels for the training set
    Y_training = Y[:m_training]

    # X_test = instances for test set
    X_test = X_norm[m_training:]
    # Y_test = labels for the test set
    Y_test = Y[m_training:]


    print(Y_training) # to make sure that Y_training contains both 1 and -1
    print(m_test)

    n_classical_train = np.sum(Y_training==-1)
    n_metal_train = np.sum(Y_training==1)
    n_classical_test=np.sum(Y_test==-1)
    n_metal_test = np.sum(Y_test==1)

    print("\nNumber of classical instances in test:", n_classical_test)
    print("Number of metal instances in test:", n_metal_test)

    print("Shape of training set: " + str(X_training.shape))
    print("Shape of test set: " + str(X_test.shape))

To define a logistic regression model in Scikit-learn use the instruction

$linear\_model.LogisticRegression(C=1e5)$

($C$ is a parameter related to *regularization*, a technique that
we will see later in the course. Setting it to a high value is almost
equivalent to ignoring regularization, so the instruction above corresponds to the
logistic regression you have seen in class.)

To learn the model you need to use the $fit(...)$ instruction and to predict you need to use the $predict(...)$ function. See the Scikit-learn documentation for how to use it.
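
A minimal usage sketch of the `fit` / `predict` pattern on a made-up one-feature dataset follows; the toy arrays and variable names are assumptions for illustration and are unrelated to the music data.

```python
import numpy as np
from sklearn import linear_model

# toy data: one feature, two well-separated groups labeled -1 and 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
Y_toy = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

logreg_toy = linear_model.LogisticRegression(C=1e5)  # large C: very weak regularization
logreg_toy.fit(X_toy, Y_toy)                         # learn the model
Y_pred = logreg_toy.predict(X_toy)                   # predicted labels, same values as in Y_toy

train_error = np.mean(Y_pred != Y_toy)               # fraction of misclassified samples
print("Toy training error: {:.2%}".format(train_error))
```

Note that no column of ones is added here: by default `LogisticRegression` fits the intercept itself.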

**TO DO** Define the logistic regression model, then learn the model using the training set and predict on the test set. Then print the fraction of samples misclassified in the training set and in the test set.

In [None]:
#part on logistic regression for 2 classes
logreg = linear_model.LogisticRegression(C=1e5) #a large C disables regularization

#learn from training set

logreg.fit(X_training,Y_training)

#predict on training set

Y_training_predicted = logreg.predict(X_training)

#print the error rate = fraction of misclassified samples
error_rate_training = len(Y_training_predicted[Y_training_predicted != Y_training])/X_training.shape[0]

print("Error rate on training set: {:.2%}".format(error_rate_training))

#predict on test set

Y_test_predicted = logreg.predict(X_test)

#print the error rate = fraction of misclassified samples
error_rate_test = len(Y_test_predicted[Y_test_predicted!=Y_test])/X_test.shape[0]

print("Error rate on test set: {:.2%}".format(error_rate_test))

**TO DO** Now pick two features and restrict the dataset to include only those two features, whose indices are specified in the $features$ vector below. Then split into training and test. Which features are you going to select?

In [None]:
# to make the plot we need to reduce the data to 2D, so we choose two features
features_list = ['tempo', 'chroma_stft', 'spectral_centroid']
index_feature1 =  1
index_feature2 =  2
features = [index_feature1, index_feature2] 

feature_name0 = features_list[features[0]]
feature_name1 = features_list[features[1]]

X_reduced = X_norm[:,features]

print(X_reduced[:5])

n_classical_train=0
n_metal_train=0
while (n_classical_train < 10) or (n_metal_train < 10):  # repeat until BOTH classes have at least 10 training samples
    # the permutation is drawn inside the loop, so it is executed once
    # and then repeated as many times as necessary
    permutation = np.random.permutation(m) # random permutation

    X_reduced = X_reduced[permutation]
    Y = Y[permutation]


    # m_training needs to be the number of samples in the training set
    m_training = int(m*0.75)

    # m_test needs to be the number of samples in the test set
    m_test = m - m_training

    # X_training = instances for training set
    X_training_reduced = X_reduced[:m_training]
    #Y_training = labels for the training set
    Y_training = Y[:m_training]

    # X_test = instances for test set
    X_test_reduced = X_reduced[m_training:]
    # Y_test = labels for the test set
    Y_test = Y[m_training:]


    print(Y_training) # to make sure that Y_training contains both 1 and -1
    print(m_test)

    n_classical_train = np.sum(Y_training==-1)
    n_metal_train = np.sum(Y_training==1)
    n_classical_test=np.sum(Y_test==-1)
    n_metal_test = np.sum(Y_test==1)

    print("\nNumber of classical instances in test:", n_classical_test)
    print("Number of metal instances in test:", n_metal_test)

    print("Shape of training set: " + str(X_training_reduced.shape))
    print("Shape of test set: " + str(X_test_reduced.shape))

Now learn a model using the training data and measure the performances.

In [None]:
# learning from training data
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X_training_reduced,Y_training)
Y_training_predicted_reduced = logreg.predict(X_training_reduced)
# error rate = fraction of misclassified samples, so divide by the size of the set being evaluated
error_rate_training = len(Y_training_predicted_reduced[Y_training_predicted_reduced!=Y_training])/X_training_reduced.shape[0]
print("Error rate on training set: {:.2%}".format(error_rate_training))

# predict on the test set and print the error rate = fraction of misclassified samples
Y_test_predicted_reduced=logreg.predict(X_test_reduced)
error_rate_test = len(Y_test_predicted_reduced[Y_test_predicted_reduced!=Y_test])/X_test_reduced.shape[0]

print("Error rate on test set: {:.2%}".format(error_rate_test))

**TO DO** [Answer the following] Which features did you select and why? Compare the performance with that of the case with all 3 features and comment on the results. [Write the answer in this cell]

**ANSWER QUESTION 3**

I chose <code>chroma_stft</code> and <code>spectral_centroid</code>. As soon as the data is inspected through a 3d plot (as previously done in the Perceptron exercise), it is easy to see that the feature most strongly related to the genre is the mean chromagram activation, since the samples tend to be separated along it: classical music tends to cluster near the lower end of the <code>chroma_stft</code> range, while metal music tends to cluster toward the upper end of the axis. A similar relation also shows along the z-axis of the 3d plot, which is <code>spectral_centroid</code>; metal music "clouds" higher.

If everything is ok, the code below uses the model in $logreg$ to plot the decision region for the two features chosen above, with colors denoting the predicted value. It also plots the points (with correct labels) in the training set. It makes a similar plot for the test set.

In [None]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].

# NOTICE: This visualization code has been developed for a "standard" solution of the notebook, 
# it could be necessary to make some fixes to adapt to your implementation

# changed step size since I normalized
# also x_min and x_max

h = .002 # step size in the mesh
#x_min, x_max = X_reduced[:, 0].min() - .5, X_reduced[:, 0].max() + .5
#y_min, y_max = X_reduced[:, 1].min() - .5, X_reduced[:, 1].max() + .5

x_min, x_max = X_reduced[:, 0].min(), X_reduced[:, 0].max()
y_min, y_max = X_reduced[:, 1].min(), X_reduced[:, 1].max()

xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)

plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X_training_reduced[:, 0], X_training_reduced[:, 1], c=Y_training, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel(feature_name0)
plt.ylabel(feature_name1)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title('Training set')

plt.show()

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the test points 
plt.scatter(X_test_reduced[:, 0], X_test_reduced[:, 1], c=Y_test, edgecolors='k', cmap=plt.cm.Paired, marker='s')
plt.xlabel(feature_name0)
plt.ylabel(feature_name1)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title('Test set')

plt.show()

# 2) Linear Regression on the Boston House Price dataset

### Dataset description: 

The Boston House Price Dataset involves the prediction of a house price in thousands of dollars given details about the house and its neighborhood.

The dataset contains a total of 500 observations, which relate 13 input features to an output variable (house price).

The variable names are as follows:

CRIM: per capita crime rate by town.

ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS: proportion of nonretail business acres per town.

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

NOX: nitric oxides concentration (parts per 10 million).

RM: average number of rooms per dwelling.

AGE: proportion of owner-occupied units built prior to 1940.

DIS: weighted distances to five Boston employment centers.

RAD: index of accessibility to radial highways.

TAX: full-value property-tax rate per $10,000.

PTRATIO: pupil-teacher ratio by town.

B: $1000(Bk - 0.63)^2$ where $Bk$ is the proportion of blacks by town.

LSTAT: % lower status of the population.

MEDV: Median value of owner-occupied homes in $1000s.
    

In [None]:
#needed if you get the IPython/javascript error on the in-line plots
%matplotlib nbagg  

import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from scipy import stats

In [None]:
#Import Data: Load the data from a .csv file

filename = "data/house.csv"
Data = np.genfromtxt(filename, delimiter=';',skip_header=1)

#A quick overview of data, to inspect the data you can use the method describe()

dataDescription = stats.describe(Data)
print(dataDescription)
print ("Shape of data array: " + str(Data.shape))


#for a more interesting visualization: use pandas!

import pandas as pd
pd.DataFrame(Data).describe()

# Split data in training and test sets



Given $m$ total data, denote with $m_{t}$ the part used for training. Keep $m_t$ data as training data, and the remaining $m_{test}:= m-m_{t}$ as test data. For instance one can take $m_t=0.7m$ of the data as training and $m_{test}=0.3m$ as testing. Let us define

$\bullet$ $S_{t}$ the training data set

$\bullet$ $S_{test}$ the testing data set


The reason for this splitting is as follows:

TRAINING DATA: The training data are used to compute the empirical loss
$$
L_S(h) = \frac{1}{m_t} \sum_{z_i \in S_{t}} \ell(h,z_i)
$$
which is used to estimate $h$ in a given model class ${\cal H}$.
i.e. 
$$
\hat{h} = {\rm arg\; min}_{h \in {\cal H}} \, L_S(h)
$$

TESTING DATA: The test data set can be used to estimate the performance of the final estimated model
$\hat h$ using:
$$
L_{{\cal D}}(\hat h) \simeq \frac{1}{m_{test}} \sum_{ z_i \in S_{test}} \ell(\hat h,z_i)
$$
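
The two averages above can be sketched numerically as follows, using the squared loss and a fixed toy predictor. The function name `empirical_loss`, the predictor `h` and all the numbers are made up for illustration.

```python
import numpy as np

def empirical_loss(h, X, Y):
    """Average squared loss of predictor h over the samples (X, Y)."""
    return np.mean((Y - h(X)) ** 2)

def h(X):
    """Toy predictor: h(x) = 2x + 1 on the single feature."""
    return 2.0 * X[:, 0] + 1.0

X_train_toy = np.array([[0.0], [1.0], [2.0]])
Y_train_toy = np.array([1.0, 3.0, 5.0])     # lies exactly on the line
X_test_toy = np.array([[3.0], [4.0]])
Y_test_toy = np.array([8.0, 8.0])           # does not lie on the line

train_loss = empirical_loss(h, X_train_toy, Y_train_toy)  # average over the training set
test_loss = empirical_loss(h, X_test_toy, Y_test_toy)     # estimate of the true loss
print(train_loss, test_loss)
```

In this toy case the training loss is zero while the test loss is not, which is exactly the situation the held-out test set is meant to reveal.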


**TO DO**: split the data in training and test sets (70%-30%)

In [None]:
#get number of total samples
num_total_samples = Data.shape[0]

print ("Total number of samples: ", num_total_samples)

m_t = int(num_total_samples*.7)

print ("Cardinality of Training Set: ", m_t)

#shuffle the data
np.random.shuffle(Data)

#training data 

X_training = Data[:m_t,:13]
Y_training = Data[:m_t,13]
print ("Training input data size: ", X_training.shape)
print ("Training output data size: ", Y_training.shape)

#test data, to be used to estimate the true loss of the final model(s)
X_test = Data[m_t:,:13]
Y_test = Data[m_t:,13]
print ("Test input data size: ", X_test.shape)
print ("Test output data size: ", Y_test.shape)

# Data Normalization
It is common practice in Statistics and Machine Learning to scale the data (= each variable) so that it is centered (zero mean) and has standard deviation equal to 1. This helps in terms of numerical conditioning of the (inverse) problems of estimating the model (the coefficients of the linear regression in this case), as well as to give the same scale to all the coefficients.
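
To make the scaling explicit, the sketch below standardizes a toy matrix by hand with NumPy: the mean and standard deviation are estimated on the training data only and then reused on the test data. The arrays are made up; the population standard deviation (`ddof=0`) is used, which is also what `StandardScaler` uses.

```python
import numpy as np

X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0]])   # toy training inputs on very different scales
X_test_toy = np.array([[2.0, 400.0]])    # toy test input

mean = X_train_toy.mean(axis=0)          # per-feature mean, from training data only
std = X_train_toy.std(axis=0)            # per-feature population std (ddof=0)

X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std   # same transformation, no re-estimation

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled test data generally do not have exactly zero mean and unit standard deviation, because the statistics come from the training set.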

In [None]:
# scale the data

# standardize the input matrix
from sklearn import preprocessing
# the transformation is computed on training data and then used on all the 3 sets
scaler = preprocessing.StandardScaler().fit(X_training) 

np.set_printoptions(suppress=True) # print small floats in fixed-point notation instead of scientific notation
X_training = scaler.transform(X_training)
print ("Mean of the training input data:", X_training.mean(axis=0))
print ("Std of the training input data:",X_training.std(axis=0))

X_test = scaler.transform(X_test) # use the same transformation on test data
print ("Mean of the test input data:", X_test.mean(axis=0))
print ("Std of the test input data:", X_test.std(axis=0))

# Model Training 

The model is trained (= estimated) by minimizing the empirical error
$$
L_S(h) := \frac{1}{m_t} \sum_{z_i \in S_{t}} \ell(h,z_i)
$$
When the loss function is the quadratic loss
$$
\ell(h,z) := (y - h(x))^2
$$
we define the Residual Sum of Squares (RSS) as
$$
RSS(h):= \sum_{z_i \in S_{t}} \ell(h,z_i) = \sum_{z_i \in S_{t}} (y_i - h(x_i))^2
$$ 
so that the training error becomes
$$
L_S(h) = \frac{RSS(h)}{m_t}
$$

We recall that, for linear models, we have $h(x) = \langle w,x \rangle$ and the empirical error $L_S(h)$ can be written
in terms of the vector of parameters $w$ in the form
$$
L_S(w) = \frac{1}{m_t} \|Y - X w\|^2
$$
where $Y$ and $X$ are the matrices whose $i$-th rows are, respectively, the output data $y_i$ and the input vectors $x_i^\top$.
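
Setting the gradient of $L_S(w)$ to zero gives the normal equations $X^\top X w = X^\top Y$. The sketch below solves them directly and compares the result with `np.linalg.lstsq`. It uses synthetic data and assumes that $X^\top X$ is invertible; the names `w_true`, `w_normal`, `w_lstsq` are illustrative and do not appear in the lab code.

```python
import numpy as np

# Synthetic regression problem: m samples, d features
rng = np.random.default_rng(seed=0)
m, d = 100, 3
X = rng.normal(size=(m, d))
w_true = np.array([1.0, -2.0, 0.5])  # arbitrary coefficients used to generate the data
Y = X @ w_true + 0.1 * rng.normal(size=m)

# Normal equations: solve (X^T X) w = X^T Y (assumes X^T X is invertible)
w_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Same problem through the least-squares routine
w_lstsq, rss, rank, sv = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
print("Empirical risk:", np.sum((Y - X @ w_lstsq) ** 2) / m)
```

`np.linalg.lstsq` is usually preferable in practice, since it does not form $X^\top X$ explicitly and also handles rank-deficient $X$.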


 **TO DO:** compute the linear regression coefficients using np.linalg.lstsq from NumPy
 

In [None]:
#compute linear regression coefficients for training data

#add a 1 at the beginning of each sample for training, and testing (use homogeneous coordinates)
m_training = X_training.shape[0]
X_trainingH = np.hstack((np.ones((m_training,1)),X_training)) # H: in homogeneous coordinates

#print(X_trainingH.shape)
#print(X_trainingH[:10])

m_test = X_test.shape[0]
X_testH = np.hstack((np.ones((m_test,1)),X_test))  # H: in homogeneous coordinates

# Compute the least-squares coefficients using linalg.lstsq
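# rcond=None selects the machine-precision default and avoids the FutureWarning raised by older NumPy versions
w_np, RSStr_np, rank_Xtr, sv_Xtr =  np.linalg.lstsq(X_trainingH,Y_training,rcond=None)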
print("LS coefficients with numpy lstsq:", w_np)

# compute Residual sums of squares 

RSStr_hand = np.sum(((X_trainingH @ w_np.transpose()) - Y_training)**2)

print("RSS with numpy lstsq: ", RSStr_np)
print("Empirical risk with numpy lstsq:", RSStr_np/m_training)

print("RSS calculated with code: ", RSStr_hand)
print("Empirical risk with code: ", RSStr_hand/m_training)

## Data prediction 

Compute the output predictions on both the training and test sets, together with the corresponding Residual Sum of Squares (RSS).

**TO DO**: Compute these quantities on  training and test sets.

In [None]:
#compute predictions on training and test

prediction_training = X_trainingH @ w_np
prediction_test = X_testH @ w_np

#what about the loss for points in the test data?
RSS_test = np.sum((prediction_test - Y_test)**2)

print("RSS on test data:",  RSS_test)
print("Loss estimated from test data:", RSS_test/m_test)

### QUESTION 4: Comment on the results you get and on the difference between the train and test errors.

The loss estimated on the test data is only slightly higher than the loss computed on the training data. This is expected, since the model was fitted on the training data alone. The small gap indicates that the model overfits the training set only slightly and generalizes well to unseen data.

## Ordinary Least-Squares using scikit-learn
Another fast way to compute the LS estimate is through sklearn.linear_model (for this function homogeneous coordinates are not needed).
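
The relation between the two approaches can be checked with a small sketch on synthetic data (the names `X_h`, `w`, `reg`, `rss`, `tss` are illustrative assumptions). With its default `fit_intercept=True`, `LinearRegression` estimates the intercept internally, so `intercept_` and `coef_` should match the solution obtained with an explicit column of ones. Its `score` method returns the coefficient of determination $R^2 = 1 - RSS/TSS$, where $TSS$ is the total sum of squares around the mean of the outputs, so `1 - score` equals $RSS/TSS$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a non-zero intercept
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(80, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=80)

# Least squares with an explicit column of ones (homogeneous coordinates)
X_h = np.hstack((np.ones((X.shape[0], 1)), X))
w, _, _, _ = np.linalg.lstsq(X_h, y, rcond=None)

# LinearRegression fits the intercept internally
reg = LinearRegression().fit(X, y)
print(np.allclose(w[0], reg.intercept_), np.allclose(w[1:], reg.coef_))

# score returns R^2 = 1 - RSS/TSS, so 1 - score is RSS/TSS
rss = np.sum((y - reg.predict(X)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(np.isclose(1 - reg.score(X, y), rss / tss))
```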

In [None]:
from sklearn import linear_model

# build the LinearRegression() model and train it
LinReg = linear_model.LinearRegression()
LinReg.fit(X_training,Y_training)

print("Intercept:", LinReg.intercept_)
print("Least-Squares Coefficients:", LinReg.coef_)

# predict output values on training and test sets

Y_tr_predicted=LinReg.predict(X_training)
Y_te_predicted=LinReg.predict(X_test)

# score() returns the coefficient of determination R^2; 1 - R^2 is the fraction of output variance left unexplained (RSS/TSS)
print("Measure on training data:", 1-LinReg.score(X_training, Y_training))
print("Measure on test data:", 1-LinReg.score(X_test, Y_test))