### Theoretical methods of deep learning: Homework assignment 4
Submit your solution by uploading it to Canvas, **by Friday, November 30th, 12:00**

**The task.** Perform an experimental study of the convergence of gradient descent for a basic model, and give a theoretical interpretation of the results.
* Consider random training sets consisting of $N=20$ points $(\mathbf x_n, y_n)$, where $\mathbf x_n\in \mathbb R^d$ and $y_n\in \mathbb R$. Generate each $\mathbf x_n$ and $y_n$ independently from the standard normal distribution. Fit this training data with a network that has at least two hidden layers, using the standard quadratic loss.
* For $d=15$, choose a network architecture (layer sizes, activation functions, ...) and training parameters (weight initialization, learning rate, number of GD steps, ...) so that the network reliably learns the training data (say, with a final loss below $10^{-8}$ for 80% of random training sets). Motivate your choice and compare it with other choices.
* What happens to training if the input dimension $d$ is significantly decreased (say, to $d=5$ or $d=2$)? Does performance improve or deteriorate, and why?
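One way to approach the last question is to repeat the same training procedure for several input dimensions and record how often the loss threshold is reached. The sketch below is only an illustration under stated assumptions: it uses full-batch gradient descent (not the per-sample updates used later in this notebook), the names `make_data`, `train_full_batch` and `sweep_dims` are hypothetical helpers, and the learning rate, layer sizes and step counts are untuned placeholders.

```python
import torch
import torch.nn as nn


def make_data(n_points, dim, generator=None):
    """Draw inputs and targets independently from a standard normal."""
    x = torch.randn(n_points, dim, generator=generator)
    y = torch.randn(n_points, 1, generator=generator)
    return x, y


def train_full_batch(dim, hidden_dims=(15, 15), n_points=20,
                     lr=0.01, n_steps=2000, seed=0):
    """Train a ReLU network with full-batch GD; return the final MSE."""
    x, y = make_data(n_points, dim, torch.Generator().manual_seed(seed))
    torch.manual_seed(seed)  # makes the weight initialization reproducible

    layers, inp = [], dim
    for h in hidden_dims:
        layers += [nn.Linear(inp, h), nn.ReLU()]
        inp = h
    layers.append(nn.Linear(inp, 1))
    net = nn.Sequential(*layers)

    # SGD applied to the whole training set at once is plain gradient descent.
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        return loss_fn(net(x), y).item()


def sweep_dims(dims=(15, 5, 2), n_trials=5, threshold=1e-8, **train_kwargs):
    """Map each input dimension to the fraction of trials below threshold."""
    result = {}
    for dim in dims:
        losses = [train_full_batch(dim, seed=s, **train_kwargs)
                  for s in range(n_trials)]
        result[dim] = sum(l < threshold for l in losses) / len(losses)
    return result


if __name__ == "__main__":
    # Small settings so the demo finishes quickly; they are not tuned.
    for dim, frac in sweep_dims(n_trials=3, n_steps=1000).items():
        print(f"d={dim}: fraction of runs below threshold = {frac:.2f}")
```

Seeding each trial separately keeps the comparison across values of $d$ reproducible. With settings this small, the printed fractions say little on their own; a real study would need more steps and more trials.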

In [67]:
import numpy as np
import torch
from torch.autograd import Variable  # explicit import instead of a wildcard
from torch.optim import SGD
import torch.nn as nn

In [40]:

N = 20  # number of training points
d = 15  # input dimension
X = np.random.normal(size=(N, d))
Y = np.random.normal(size=(N))
Y

array([-0.19764356,  1.67595248, -0.50506144,  0.66762533, -0.91967509,
       -0.09560613, -0.33479748,  0.30792044,  0.42676831, -1.30765899,
        1.06058597,  1.31657895, -0.66373204,  2.02305011, -0.71936897,
        0.67108124,  0.6541714 , -0.6180394 , -0.8654515 ,  0.24384127])

In [179]:
class Net(nn.Module):

    def __init__(self, input_dim, hidden_dims):
        super(Net, self).__init__()
        layers = []
        
        inp = input_dim
        for i in range(len(hidden_dims)):
            layers.append(nn.Linear(inp, hidden_dims[i]))
            layers.append(nn.ReLU())
            inp = hidden_dims[i]
            
        self.net = nn.Sequential(*layers)
        self.out = nn.Linear(inp, 1)
        
    def forward(self, x):
        x = self.out(self.net(x))
        return x
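As a rough aid for the theoretical interpretation, it can help to compare the number of trainable parameters with the number of training points $N=20$. The sketch below counts weights and biases for a fully connected network with the same layer layout as `Net` (hidden layers followed by a single output unit), using the hidden sizes `[15, 15]` from this notebook. `count_parameters` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
def count_parameters(input_dim, hidden_dims, output_dim=1):
    """Number of weights and biases in a fully connected network."""
    sizes = [input_dim, *hidden_dims, output_dim]
    # Each linear layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


for dim in (15, 5, 2):
    print(f"d={dim}: {count_parameters(dim, [15, 15])} parameters")
# d=15: 496 parameters
# d=5: 346 parameters
# d=2: 301 parameters
```

In all three cases the parameter count is far larger than $N=20$. Only the first layer shrinks when $d$ decreases.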

In [203]:
N = 20
d = 15

hidden_dims = [15, 15]

n_epoches = 500
n_iterations = 20

lf = nn.MSELoss()
final_losses = []  # collected across all random training sets

for i in range(n_iterations):
    # Fresh random training set and freshly initialized network
    X = np.random.normal(size=(N, d))
    Y = np.random.normal(size=(N))
    net = Net(input_dim=d, hidden_dims=hidden_dims)
    optimizer = SGD(net.parameters(), lr=0.01)

    for ep in range(n_epoches):
        # One epoch: a per-sample SGD step on each of the N points
        for k in range(N):
            optimizer.zero_grad()
            out = net(torch.FloatTensor(X[k]))
            loss = lf(out, torch.FloatTensor([Y[k]]))
            loss.backward()
            optimizer.step()

    # Mean squared error over the whole training set after training
    with torch.no_grad():
        out = net(torch.FloatTensor(X))
        final_loss = lf(out, torch.FloatTensor(Y).unsqueeze(1))
    final_losses.append(final_loss.item())

    print(final_loss.item())

print(f"\nMaximal loss for {n_iterations} iterations: {max(final_losses)}")
    

1.0984269049885143e-14
1.455016594947671e-14
1.1744771652346785e-14
1.3789664194048014e-14
7.40102406850086e-15
2.070565940925917e-14
2.4183432695084293e-14
2.120525977034049e-14
1.3751673072297764e-14
1.8209912405558158e-14

Maximal loss for 10 iterations: 2.4183432695084293e-14
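
The ten per-run losses printed above can be checked directly against the assignment's criterion (final loss below $10^{-8}$ for at least 80% of random training sets). The snippet below is a minimal sketch: `success_rate` is a hypothetical helper, and `recorded_losses` is copied by hand from the printed output.

```python
def success_rate(losses, threshold=1e-8):
    """Fraction of runs whose final loss is strictly below `threshold`."""
    if not losses:
        raise ValueError("losses must be non-empty")
    return sum(loss < threshold for loss in losses) / len(losses)


# Final losses copied from the output of the training cell.
recorded_losses = [
    1.0984269049885143e-14,
    1.455016594947671e-14,
    1.1744771652346785e-14,
    1.3789664194048014e-14,
    7.40102406850086e-15,
    2.070565940925917e-14,
    2.4183432695084293e-14,
    2.120525977034049e-14,
    1.3751673072297764e-14,
    1.8209912405558158e-14,
]

rate = success_rate(recorded_losses)
print(f"success rate: {rate:.0%}, worst loss: {max(recorded_losses):.3e}")
print("criterion met" if rate >= 0.8 else "criterion not met")
```

All ten recorded runs fall well below the threshold, so this sample satisfies the 80% requirement for $d=15$.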
