# Validation
In this problem, we explore the role of the validation set in model selection. The experiment answers problems 1 to 5 of homework 7 in Caltech's Learning from Data course, which can be found here: http://work.caltech.edu/homework/hw7.pdf

### Import all necessary libraries

In [1]:
import math
import matplotlib.pyplot as plt
import numpy as np
import random
%matplotlib inline

### Non-linear transformation

In [21]:
# Apply the non-linear transformation to a data point, keeping terms 0 through `transformation`.
def transform(x, transformation=7):
    res = np.array([1, x[0], x[1], x[0]*x[0], x[1]*x[1], x[0]*x[1], abs(x[0]-x[1]), abs(x[0]+x[1])])
    return res[:transformation+1]
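
Written out, the transform implemented above maps a point $(x_1, x_2)$ to the feature vector below. The symbol $\Phi$ is notation introduced here for illustration only; calling `transform` with `transformation=k` keeps the first $k + 1$ entries, that is, terms 0 through $k$.

$$\Phi(x_1, x_2) = \left(1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2,\; |x_1 - x_2|,\; |x_1 + x_2|\right)$$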

### Linear Regression

In [58]:
# Implementation of linear regression for fitting a set of data points
class LinearRegression:
    # This method receives the data points and finds a weight vector that best fits these points.
    @staticmethod
    def fit(X, Y, k=7):
        # Transform the input data.
        X = np.array([transform(X[i], transformation=k) for i in range(X.shape[0])])
        
        # Least-squares solution of Xw = Y via the normal equations:
        # w = (X^T X)^{-1} X^T Y.
        XT = np.transpose(X)
        w = np.dot(np.dot(np.linalg.inv(np.dot(XT, X)), XT), Y)

        return w
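
The explicit inverse above works when $X^T X$ is well conditioned. As a hedged alternative, the sketch below solves the same least-squares problem with `np.linalg.lstsq`, which avoids forming the inverse. The helper name `fit_lstsq` and the toy data are assumptions made for illustration; the helper expects an already-transformed matrix and is not part of the original notebook.

```python
import numpy as np


def fit_lstsq(Z, y):
    """Least-squares weights for an already-transformed design matrix Z (hypothetical helper)."""
    w, _residuals, _rank, _singular_values = np.linalg.lstsq(Z, y, rcond=None)
    return w


# Toy data (illustrative only): y = 1 + 2 * x exactly, with a constant column in Z.
Z_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([1.0, 3.0, 5.0])

w_demo = fit_lstsq(Z_demo, y_demo)

# The normal-equation solution used in the notebook, for comparison.
w_normal = np.linalg.inv(Z_demo.T @ Z_demo) @ Z_demo.T @ y_demo
print(np.allclose(w_demo, w_normal))
```

On full-rank data such as this, both approaches should agree up to floating-point error.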

### Classification error

In [76]:
# Fraction of points in (X, Y) that sign(W . transform(x)) misclassifies.
def computeClassificationError(W, X, Y, transformation=7):
    correct = 0
    for x, y in zip(X, Y):
        x = transform(x, transformation)
        y_hat = np.sign(np.dot(W, x))
        if y_hat == y:
            correct += 1
    return 1 - correct / X.shape[0]
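
The loop above can also be written without explicit iteration. The sketch below is one possible vectorized form; the name `classification_error_vectorized` is hypothetical, and it assumes the inputs have already been passed through the non-linear transformation, which differs from the function above.

```python
import numpy as np


def classification_error_vectorized(w, Z, y):
    """Fraction of rows of Z whose predicted sign differs from the label in y."""
    y_hat = np.sign(Z @ w)
    return float(np.mean(y_hat != y))


# Toy data (illustrative only): four transformed points and their labels.
Z_demo = np.array([[1.0, 2.0], [1.0, -3.0], [1.0, 0.5], [1.0, -1.0]])
y_demo = np.array([1.0, -1.0, -1.0, -1.0])
w_demo = np.array([0.0, 1.0])

# Predictions are [1, -1, 1, -1]; only the third point is misclassified.
err_demo = classification_error_vectorized(w_demo, Z_demo, y_demo)
print(err_demo)  # 0.25
```

As in the loop version, a prediction of exactly 0 counts as an error because it matches neither label.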

### Input data

In [4]:
# Read whitespace-separated data from a file; each record is x1, x2, y.
def readData(filename):
    with open(filename) as f:
        data = np.array([float(token) for token in f.read().split()])
    data = data.reshape(-1, 3)
    X = data[:, :2]
    Y = data[:, 2]
    return [X, Y]
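
If every row of the file holds exactly three numeric columns, which is what `readData` implies but is not otherwise confirmed here, `np.loadtxt` could do the parsing. The sketch below uses a hypothetical helper name and a made-up two-row sample written to a temporary file, not the course data.

```python
import os
import tempfile

import numpy as np


def read_data_loadtxt(filename):
    """Return (X, Y) from a whitespace-separated file with columns x1, x2, y."""
    data = np.loadtxt(filename, ndmin=2)  # ndmin=2 keeps a single row two-dimensional
    return data[:, :2], data[:, 2]


# Hypothetical two-row sample mixing tabs and spaces, written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("0.5\t-0.25\t1\n  -1.0   2.0  -1\n")
    sample_path = f.name

X_demo, Y_demo = read_data_loadtxt(sample_path)
os.remove(sample_path)

print(X_demo.shape, Y_demo)
```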

Create the training and test sets.

In [77]:
train_data = '../data/hw6-in.dta.txt'
test_data = '../data/hw6-out.dta.txt'

[X_train, Y_train] = readData(train_data)
[X_test, Y_test] = readData(test_data)

### Main experiment
#### 1. Training=25 examples, Validation=10 examples
Hold out the last 10 examples of the training set as the validation set; the first 25 remain for training.

In [78]:
X_val = X_train[25:]
Y_val = Y_train[25:]
X_train = X_train[0:25]
Y_train = Y_train[0:25]

Fit five models to the training set. Model $k$ uses terms 0 through $k$ of the non-linear transformation, for $k = 3, \dots, 7$.

In [79]:
W = [LinearRegression.fit(X_train, Y_train, k=i) for i in range(3, 8)]

Compute the classification error of each model on the validation set.

In [80]:
E_val = np.zeros(5)
for i in range(3, 8):
    E_val[i-3] = computeClassificationError(W[i-3], X_val, Y_val, transformation=i)
print(E_val)

[ 0.3  0.5  0.2  0.   0.1]


So the model with $k = 6$, which uses terms 0 through 6 of the non-linear transformation, produces the smallest classification error on the validation set.
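
The selection step, picking the $k$ with the lowest validation error, can be sketched as follows. The error values are copied from the output printed above; the names `ks` and `best_k` are introduced here for illustration.

```python
import numpy as np

ks = np.arange(3, 8)                          # candidate models k = 3, ..., 7
E_val = np.array([0.3, 0.5, 0.2, 0.0, 0.1])   # validation errors printed above

# np.argmin returns the first index of the minimum, so ties go to the smaller k.
best_k = int(ks[np.argmin(E_val)])
print(best_k)  # 6
```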

Next, we compute the classification error of each model on the test set.

In [81]:
E_out = np.zeros(5)
for i in range(3, 8):
    E_out[i-3] = computeClassificationError(W[i-3], X_test, Y_test, transformation=i)
print(E_out)

[ 0.42   0.416  0.188  0.084  0.072]


The model with $k = 7$ produces the smallest classification error on the test set.

#### 2. Training=10 examples, Validation=25 examples
Now use the first 25 examples as the validation set and train on the last 10.

In [82]:
# Swap the training set and the validation set.
tmp = X_val
X_val = X_train
X_train = tmp

tmp = Y_val
Y_val = Y_train
Y_train = tmp

The experiment then proceeds in the same way as the previous one.

In [83]:
W = [LinearRegression.fit(X_train, Y_train, k=i) for i in range(3, 8)]

E_val = np.zeros(5)
for i in range(3, 8):
    E_val[i-3] = computeClassificationError(W[i-3], X_val, Y_val, transformation=i)
print('E_val = ', E_val)

E_out = np.zeros(5)
for i in range(3, 8):
    E_out[i-3] = computeClassificationError(W[i-3], X_test, Y_test, transformation=i)
print('E_out = ', E_out)

E_val =  [ 0.28  0.36  0.2   0.08  0.12]
E_out =  [ 0.396  0.388  0.284  0.192  0.196]


The model with $k = 6$ produces the smallest classification error on both the validation set and the test set.