### Theoretical methods of deep learning: Homework assignment 4
Submit your solution by uploading it to Canvas **by Friday, November 30th, 12:00**

**The task.** Perform an experimental study of the convergence of gradient descent for a basic model, and give a theoretical interpretation of the results.
* Consider random training sets consisting of $N=20$ points $(\mathbf x_n, y_n),$ where $\mathbf x_n\in \mathbb R^d, y_n\in \mathbb R.$ Generate each $\mathbf x_n$ and $y_n$ independently from the standard normal distribution. Fit this training data with a network that has at least two hidden layers, using the standard quadratic loss.
* For $d=15$, choose a network architecture (layer sizes, activation functions, etc.) and training parameters (weight initialization, learning rate, number of GD steps, etc.) so that the network reliably learns the training data (say, with a final loss below $10^{-8}$ for 80% of random training sets). Motivate your choice and compare it to other choices.
* What happens to training if the input dimension $d$ is significantly decreased (say to $d=5$ or $d=2$)? Does performance improve or deteriorate, and why?
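
One quantity that may help with the interpretation asked for in the last item is how the number of trainable parameters compares to the number of training points $N$ as $d$ changes. The sketch below counts the weights and biases of a fully connected network. The helper `count_params` and the hidden width of 20 are assumptions made for illustration and are not prescribed by the task.

```python
def count_params(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the layer widths from input to output,
    e.g. [15, 20, 20, 1] for d=15 and two hidden layers of width 20.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


N = 20  # number of training points, from the task statement
for d in (15, 5, 2):
    # Hypothetical architecture: two hidden layers of width 20, scalar output.
    sizes = [d, 20, 20, 1]
    print(f"d={d}: {count_params(sizes)} parameters for {N} training points")
# d=15: 761 parameters for 20 training points
# d=5: 561 parameters for 20 training points
# d=2: 501 parameters for 20 training points
```

Only the first layer depends on $d$, so the total count changes little. This suggests that any difference in trainability between $d=15$ and $d=2$ is more likely to come from the geometry of the inputs than from the raw parameter count.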

In [11]:
import torch
import numpy as np

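# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# N is the number of training points; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
# Note: the task statement asks for N = 20 and at least two hidden layers;
# this run uses N = 2 and a single hidden layer.
N, D_in, H, D_out = 2, 15, 20, 1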

x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
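best_loss = float('inf')
reached_target = False  # set once the loss first drops below 1e-8
losses = []
for t in range(20000):
    # Forward pass on the full training set (plain gradient descent).
    y_pred = model(x)

    loss = loss_fn(y_pred, y)
    if (t+1) % 10000 == 0:
        print(t+1, loss.item())

    # Checkpoint the model whenever the loss improves.
    if best_loss > loss:
        best_loss = loss
        torch.save(model, 'checkpoint.pth')

        if loss < 1e-8 and not reached_target:
            print(t+1, 'yeah!')
            reached_target = True
    # loss.item() replaces np.float(loss); np.float was removed in NumPy 1.24.
    losses.append(loss.item())
    model.zero_grad()
    loss.backward()

    # Manual gradient-descent update of every parameter.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
best_loss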

8172 yeah!
10000 1.0969536390348367e-09
20000 8.006395546544809e-11


tensor(8.0064e-11, device='cuda:0', grad_fn=<MseLossBackward>)
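
The run above covers a single training set, whereas the task asks how often training succeeds over many random training sets. A minimal sketch of such a measurement follows. The function names `final_loss` and `success_rate`, the two-hidden-layer ReLU architecture, and the default hyperparameters are all assumptions chosen for illustration, not tuned values. It runs on the CPU to stay self-contained.

```python
import torch


def final_loss(d, n_points=20, hidden=20, lr=1e-3, steps=2000, seed=0):
    """Train on one random training set with full-batch GD; return the final loss."""
    gen = torch.Generator().manual_seed(seed)
    x = torch.randn(n_points, d, generator=gen)
    y = torch.randn(n_points, 1, generator=gen)

    torch.manual_seed(seed)  # makes the default weight initialization reproducible
    model = torch.nn.Sequential(
        torch.nn.Linear(d, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, 1),
    )
    loss_fn = torch.nn.MSELoss(reduction='sum')

    for _ in range(steps):
        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for param in model.parameters():
                param -= lr * param.grad

    # Evaluate once more so the returned value reflects the last update.
    with torch.no_grad():
        return loss_fn(model(x), y).item()


def success_rate(d, trials=10, threshold=1e-8, **kwargs):
    """Fraction of random training sets whose final loss falls below threshold."""
    losses = [final_loss(d, seed=s, **kwargs) for s in range(trials)]
    return sum(l < threshold for l in losses) / trials
```

A call such as `success_rate(15, trials=10)` could then be compared with `success_rate(5, trials=10)` and `success_rate(2, trials=10)`. The learning rate and step count would likely need adjusting before any of them reaches the $10^{-8}$ threshold.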

In [19]:
for param in model.named_parameters():
    print (param)

('0.weight', Parameter containing:
tensor([[-0.1343, -0.0941, -0.1888, -0.0057, -0.1569,  0.2298,  0.1400, -0.1596,
          0.2001,  0.2407, -0.2562, -0.1972, -0.0859,  0.1719, -0.2038],
        [ 0.1916, -0.0721,  0.2328,  0.0365,  0.0556, -0.1187, -0.2511,  0.0860,
         -0.2202, -0.2031,  0.0941,  0.1990,  0.0980, -0.2640,  0.0781],
        [ 0.2047, -0.1008,  0.0717, -0.1971,  0.0374,  0.2363,  0.1327,  0.1662,
          0.0894,  0.1361,  0.0982, -0.1304,  0.2489, -0.0380, -0.0799],
        [-0.1456,  0.0885,  0.0340,  0.0188, -0.1517,  0.1518,  0.2300,  0.1990,
         -0.2420, -0.1546,  0.2124, -0.2139,  0.2429, -0.0556,  0.0115],
        [-0.1334,  0.2327, -0.2299, -0.1789,  0.0156,  0.0745,  0.1018,  0.0544,
          0.1151, -0.1036,  0.0960, -0.0988,  0.2374,  0.1305, -0.2428],
        [ 0.0494,  0.2092,  0.2298, -0.1681, -0.1857,  0.1741,  0.2256,  0.2436,
         -0.2265,  0.1110, -0.0342, -0.1509,  0.2262,  0.1394,  0.2178],
        [ 0.0017,  0.1444,  0.0470,  0.25

In [18]:
model.parameters()

<generator object Module.parameters at 0x7f4c6420cf68>