# Machine Learning - Supervised Methods
# Is Learning Feasible?  +  The Linear Model I

## 1. Finite Hypothesis Sets

* **a) Load the "banana.csv" data set provided in the moodle course. Split it into 50% training and 50% test points. Create $M = 10$ hypotheses $H = \{h_1, \dots, h_{10}\}$ at random by sampling Gaussian weight vectors (with `numpy.random.randn`): there should be 10 random weight vectors $w_1, w_2, \dots, w_{10}$, defining the 10 hypotheses $h_i(x) = \textrm{sign}(w_i^T x)$. Define $g$ as the hypothesis with the lowest in-sample error, i.e., the error on the training set. Output this error.**

**This is a simplistic training method based on a finite hypothesis set. Although the procedure is of course a very inefficient learner, note that it is exactly compatible with the learning diagram. Training amounts to solving an optimization problem:**
$$ g = \arg\min\big\{E_{in}(h) \,\big|\, h \in H\big\} $$
**Note: The quantity called $N$ in the lecture videos is the size of the training set.**

In [64]:
import numpy as np

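def loadCSV(filename):
    """Load a CSV file whose first column is the label and whose remaining columns are the features."""
    data = np.loadtxt(filename, delimiter=',')  # loadtxt opens and closes the file itself
    X = data[:, 1:]  # before the comma: rows, after the comma: columns
    y = data[:, 0]   # first column holds the labels
    return X, y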

X, y = loadCSV("banana.csv")

In [18]:
X

array([[-0.56169354, -1.1159856 ],
       [-0.40297224, -0.48806087],
       [ 0.43471191,  1.307098  ],
       ...,
       [ 1.1302304 ,  1.4797409 ],
       [-0.00633513, -1.0014036 ],
       [ 0.55423325,  1.1879978 ]])

In [211]:
len(X) 
X.shape  # returns the dimensions of the array

(5300, 2)

In [19]:
y

array([-1.,  1.,  1., ..., -1., -1., -1.])

In [53]:
X_training = X[0:int(len(X)/2),:]  # indexing starts at 0; the end index of a slice is excluded
X_training  
len(X_training)

array([[-0.56169354, -1.1159856 ],
       [-0.40297224, -0.48806087],
       [ 0.43471191,  1.307098  ],
       ...,
       [-0.98758189, -0.19052135],
       [ 0.92763843, -0.10172408],
       [ 0.81871635,  0.54567215]])

In [58]:
X_test = X[int(len(X)/2):,:]
X_test
len(X_test)

2650

In [62]:
y_training = y[0:int(len(y)/2)]
y_test = y[int(len(y)/2):]

In [133]:
np.random.seed(0)  # fix the seed so the random hypotheses are reproducible
weights = np.random.randn(10, 2)  # one row per hypothesis: 10 weight vectors in R^2
weights
#len(weights) #10
#len(weights[0,:]) #2

array([[ 1.76405235,  0.40015721],
       [ 0.97873798,  2.2408932 ],
       [ 1.86755799, -0.97727788],
       [ 0.95008842, -0.15135721],
       [-0.10321885,  0.4105985 ],
       [ 0.14404357,  1.45427351],
       [ 0.76103773,  0.12167502],
       [ 0.44386323,  0.33367433],
       [ 1.49407907, -0.20515826],
       [ 0.3130677 , -0.85409574]])

In [205]:
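N = len(X_training)
# Sum of squared residuals of the first hypothesis (weights[0]) on the training set.
# Note: this measures (w^T x - y)^2 on the raw score, not the misclassification rate of sign(w^T x).
sq_err_h1 = 0.0
for n in range(N):
    sq_err_h1 += (np.dot(weights[0, :], X_training[n, :]) - y_training[n]) ** 2  # ** 2 squares the residual
sq_err_h1

12556.886092664019

In [204]:
E_h1 = sq_err_h1 / N  # mean squared error of the first hypothesis
E_h1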

4.738447582137366

In [210]:
sum = np.zeros(10) 
sum

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
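
The cells above evaluate a squared error for the first hypothesis only. For part a), the in-sample error of $h_i = \textrm{sign}(w_i^T x)$ is more naturally the fraction of misclassified training points. The following is a minimal sketch of that computation for all hypotheses at once, not the reference solution. It runs on synthetic stand-in data (an assumption, so that it works without "banana.csv"); the names `misclassification_errors`, `X_demo` and `y_demo` are hypothetical and would be replaced by `X_training` and `y_training`.

```python
import numpy as np

def misclassification_errors(W, X, y):
    """Fraction of misclassified points for each row w of W, with h(x) = sign(w^T x)."""
    predictions = np.sign(X @ W.T)            # shape (n_points, n_hypotheses)
    return np.mean(predictions != y[:, None], axis=0)

# Synthetic stand-in data (assumption): replace with X_training, y_training.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 2))
y_demo = np.sign(X_demo[:, 0] + 0.5 * X_demo[:, 1])

W = rng.standard_normal((10, 2))              # M = 10 random hypotheses
E_in = misclassification_errors(W, X_demo, y_demo)
best = int(np.argmin(E_in))                   # index of g, the argmin of E_in over H
print("g = h_%d with E_in = %.3f" % (best + 1, E_in[best]))
```

Broadcasting `y[:, None]` against the prediction matrix compares every hypothesis with the labels in one step, which avoids the double loop over hypotheses and data points.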

* **b) Estimate the out-of-sample error of $g$ by computing the test error. The difference between training and test error is the quantity $\epsilon$ from the lecture videos. What is the bound on the probability of exceeding this value for $M = 10$ hypotheses? Is the bound meaningful?**
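
Assuming the bound meant here is the Hoeffding inequality combined with the union bound over $M$ hypotheses, $P\big[|E_{in}(g) - E_{out}(g)| > \epsilon\big] \leq 2M e^{-2\epsilon^2 N}$, it can be evaluated with a few lines. This is a sketch under that assumption; the function name `hoeffding_union_bound` is hypothetical, and the values of $\epsilon$ below are illustrative, not gaps measured on the banana data.

```python
import math

def hoeffding_union_bound(eps, N, M):
    """Upper bound 2 * M * exp(-2 * eps^2 * N) on P[|E_in - E_out| > eps]."""
    return 2 * M * math.exp(-2 * eps**2 * N)

N = 2650  # training set size after the 50/50 split of the 5300 points
for eps in (0.01, 0.05, 0.1):      # illustrative gaps, not measured values
    bound = hoeffding_union_bound(eps, N, M=10)
    # A bound of 1 or more carries no information about a probability.
    print(eps, bound, "meaningful" if bound < 1 else "vacuous")
```

Because $\epsilon$ enters squared in the exponent, a small observed gap can make the bound exceed 1 even for moderately large $N$, in which case it is not meaningful.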

* **c) Verify empirically that the error on the training set is close to the error on the test set for *all* hypotheses in the set $H$. Output the maximal gap between training and test error.**

* **d) Increase the number of random hypotheses to M = 1000 and check whether the gap between training and test error increases. Try to interpret the result. In order to come to a conclusion, try replacing the "banana" data set with the "mushroom" data set, and repeat the experiment multiple times.**
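
One possible way to set up the experiments of c) and d) is sketched below: draw $M$ random hypotheses, compute training and test misclassification rates for each, and report the largest absolute gap, repeated several times for $M = 10$ and $M = 1000$. The data here is a synthetic stand-in (an assumption), and `max_gap` is a hypothetical helper name; the real banana or mushroom split would be passed in instead.

```python
import numpy as np

def max_gap(M, X_train, y_train, X_test, y_test, rng):
    """Largest |E_train(h) - E_test(h)| over M random hypotheses sign(w^T x)."""
    W = rng.standard_normal((M, X_train.shape[1]))
    err_train = np.mean(np.sign(X_train @ W.T) != y_train[:, None], axis=0)
    err_test = np.mean(np.sign(X_test @ W.T) != y_test[:, None], axis=0)
    return float(np.max(np.abs(err_train - err_test)))

# Synthetic stand-in data (assumption); swap in the banana or mushroom split.
rng = np.random.default_rng(0)
X_all = rng.standard_normal((2000, 2))
y_all = np.sign(X_all[:, 0] * X_all[:, 1] + 0.3 * rng.standard_normal(2000))
half = len(X_all) // 2
X_tr, y_tr = X_all[:half], y_all[:half]
X_te, y_te = X_all[half:], y_all[half:]

for M in (10, 1000):
    # Repeat with fresh random hypotheses to see how much the maximal gap fluctuates.
    gaps = [max_gap(M, X_tr, y_tr, X_te, y_te, rng) for _ in range(20)]
    print("M = %4d: mean max gap over 20 repetitions = %.4f" % (M, np.mean(gaps)))
```

Averaging over repetitions matters because a single draw of hypotheses can give a misleadingly small or large maximal gap.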

## 2. Linear Regression

* **a) Apply the *sklearn.linear_model.LinearRegression* method to the "housing" data set provided in the moodle course. This is a regression problem with real-valued labels. Split the data set 50/50 into training and test. Train a linear regression model and output the mean squared error on the training set and the test set.**
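
A minimal sketch of this workflow, assuming the data is available as a feature matrix and a label vector. Since the layout of the "housing" file is not shown here, the example uses synthetic regression data (an assumption); `X_reg` and `y_reg` are hypothetical names to be replaced by the loaded housing features and labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression data (assumption); replace with the housing features and labels.
rng = np.random.default_rng(0)
X_reg = rng.standard_normal((400, 5))
y_reg = X_reg @ np.array([1.5, -2.0, 0.0, 0.7, 3.0]) + 0.1 * rng.standard_normal(400)

# 50/50 split into training and test points
half = len(X_reg) // 2
X_tr, X_te = X_reg[:half], X_reg[half:]
y_tr, y_te = y_reg[:half], y_reg[half:]

model = LinearRegression().fit(X_tr, y_tr)
mse_train = mean_squared_error(y_tr, model.predict(X_tr))
mse_test = mean_squared_error(y_te, model.predict(X_te))
print("train MSE = %.4f, test MSE = %.4f" % (mse_train, mse_test))
```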

* **b) Train a linear regression model on the "mushroom" data set, using 50% for training and the other 50% for testing. Note that we are applying a regression technique to a classification problem. Compare training and test error (misclassification rate) to the Perceptron.**

* **c) Train a linear regression model on the "banana" data set. Output training and test error (misclassification rate).**
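
For b) and c), one common way to use a regression model as a classifier is to fit it to the $\pm 1$ labels and take the sign of its real-valued output. The sketch below illustrates this together with a Perceptron for comparison. It assumes labels in $\{-1, +1\}$ and uses synthetic stand-in data (an assumption); `misclassification_rate` and `to_labels` are hypothetical helper names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Perceptron

def misclassification_rate(y_true, y_pred):
    """Fraction of points whose predicted label differs from the true label."""
    return float(np.mean(y_true != y_pred))

def to_labels(scores):
    """Threshold real-valued regression outputs at 0 to obtain labels in {-1, +1}."""
    return np.where(scores >= 0, 1.0, -1.0)

# Synthetic stand-in data (assumption); replace with the mushroom or banana data.
rng = np.random.default_rng(0)
X_cls = rng.standard_normal((600, 4))
y_cls = np.where(X_cls @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.2 > 0, 1.0, -1.0)
half = len(X_cls) // 2
X_tr, X_te = X_cls[:half], X_cls[half:]
y_tr, y_te = y_cls[:half], y_cls[half:]

reg = LinearRegression().fit(X_tr, y_tr)
reg_train = misclassification_rate(y_tr, to_labels(reg.predict(X_tr)))
reg_test = misclassification_rate(y_te, to_labels(reg.predict(X_te)))

per = Perceptron(random_state=0).fit(X_tr, y_tr)
per_train = misclassification_rate(y_tr, per.predict(X_tr))
per_test = misclassification_rate(y_te, per.predict(X_te))

print("regression: train %.3f, test %.3f" % (reg_train, reg_test))
print("perceptron: train %.3f, test %.3f" % (per_train, per_test))
```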

* **d) Add polynomial features of the form $x_1^a x_2^b$ for the banana data set, for degrees $a + b \leq 7$. Apply the class *sklearn.preprocessing.PolynomialFeatures*. Train the linear regression again and check training and test error. How do the errors change?**
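
The feature expansion can be chained with the regression in a pipeline. The following sketch compares a plain linear fit with a degree-7 polynomial fit on synthetic two-dimensional data with a circular class boundary (an assumption standing in for the banana data); `error_rate` is a hypothetical helper. With two input features, `PolynomialFeatures(degree=7)` produces all monomials $x_1^a x_2^b$ with $a + b \leq 7$, including the constant term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def error_rate(model, X, y):
    """Misclassification rate when regression outputs are thresholded at 0."""
    return float(np.mean(np.where(model.predict(X) >= 0, 1.0, -1.0) != y))

# Synthetic data with a circular boundary (assumption); replace with the banana split.
rng = np.random.default_rng(0)
X_nl = rng.uniform(-1.5, 1.5, size=(1000, 2))
y_nl = np.where(np.sum(X_nl**2, axis=1) < 1.0, 1.0, -1.0)
half = len(X_nl) // 2
X_tr, X_te = X_nl[:half], X_nl[half:]
y_tr, y_te = y_nl[:half], y_nl[half:]

linear = LinearRegression().fit(X_tr, y_tr)
poly = make_pipeline(PolynomialFeatures(degree=7), LinearRegression()).fit(X_tr, y_tr)

# Number of generated monomials, including the constant feature
n_features = poly.named_steps["polynomialfeatures"].n_output_features_
print("polynomial features:", n_features)
print("linear:     train %.3f, test %.3f" % (error_rate(linear, X_tr, y_tr), error_rate(linear, X_te, y_te)))
print("polynomial: train %.3f, test %.3f" % (error_rate(poly, X_tr, y_tr), error_rate(poly, X_te, y_te)))
```

Wrapping both steps in one pipeline ensures that the same feature expansion is applied to the training and the test points.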