### Theoretical methods of deep learning: Homework assignment 4
Submit solution by uploading to canvas, **by Friday, November 30th, 12:00**

**The task.** Perform an experimental study of convergence of gradient descent for a basic model, and give some theoretical interpretation to the results.
* Consider random training sets consisting of $N=20$ points $(\mathbf x_n, y_n),$ where $\mathbf x_n\in \mathbb R^d, y_n\in \mathbb R.$ Generate each $\mathbf x_n$ and $y_n$ independently from the standard normal distribution. Consider fitting this training data with a network that has at least two hidden layers, using the standard quadratic loss.
* For $d=15$, choose a network architecture (sizes of the layers, activation functions, ...) and training parameters (weight initialization, learning rate, number of GD steps, ...) so that the network reliably learns the training data (say, with the final loss below $10^{-8}$ for 80% of random training sets). Motivate your choice and compare it to other choices.
* What happens with training if the input dimension $d$ is significantly decreased (say to $d=5$ or $d=2$)? Does performance improve or deteriorate, and why?

Motivation for architecture:
- Linear + ReLU layers are a flexible choice (flexible enough to overfit the training set, which is the goal here) and have been the most popular choice of layer in recent years.
- Layer sizes should not fall below the input size or the output size, to avoid introducing a bottleneck (full memorization is the intent). Hidden size = input size often works well in practice (the theorem on deep narrow networks does not apply here, since the depth is limited).
- The learning rate (0.01 in the code below) is a common default that has been found empirically to work well on a wide range of tasks.
- The number of GD steps should be as large as needed to reach the desired accuracy on the training set (in this specific case of deliberate overfitting). It is limited only by running time, which should definitely not exceed the time left until the homework deadline.
- Weight initialization is uniform on an interval that depends on the layer size (the default for `nn.Linear`). This kind of scaling was introduced to avoid vanishing/exploding gradients with saturating activation functions. With ReLU it matters less, but it is still a common heuristic.
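
The last point can be checked directly. The following is a minimal sketch, assuming that the default `nn.Linear` initialization in recent PyTorch versions draws both weights and biases uniformly from $[-1/\sqrt{n_{in}}, 1/\sqrt{n_{in}}]$, where $n_{in}$ is the number of inputs of the layer. The layer sizes are illustrative and not prescribed by the task.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 15, 45  # illustrative sizes (assumption, not from the task)
layer = nn.Linear(fan_in, fan_out)

# Assumption: default nn.Linear init is uniform on
# [-1/sqrt(fan_in), 1/sqrt(fan_in)] for both weights and biases.
bound = 1.0 / math.sqrt(fan_in)

max_abs_weight = layer.weight.abs().max().item()
max_abs_bias = layer.bias.abs().max().item()
print(f"bound = {bound:.4f}")
print(f"max |weight| = {max_abs_weight:.4f}, max |bias| = {max_abs_bias:.4f}")
```

If the assumption holds, both printed maxima stay below the bound, and the bound shrinks as the layer gets wider.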

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable  # explicit import instead of a wildcard
from torch.nn import init
from torch.optim import SGD
from tqdm import tqdm_notebook, tnrange
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
class Net(nn.Module):
    """Fully connected ReLU network with a scalar linear output."""

    def __init__(self, input_dim, hidden_dims):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            linear = nn.Linear(prev_dim, hidden_dim)
            # Optional small-normal initialization (disabled; PyTorch defaults are used):
            # init.normal_(linear.weight, std=0.0001)
            # init.normal_(linear.bias, std=0.0001)
            layers.append(linear)
            layers.append(nn.ReLU())
            prev_dim = hidden_dim

        self.net = nn.Sequential(*layers)  # hidden layers
        self.out = nn.Linear(hidden_dims[-1], 1)  # output layer

    def forward(self, x):
        return self.out(self.net(x))

In [12]:
n_epochs = 100  # passes over the training set per trial
n_trials = 15   # number of independent random training sets

N = 20

d = 2
# for d in range(2, 15):
hidden_dim = 45
hidden_dims = [hidden_dim] * 2

outcomes = []
for trial in range(n_trials):
    # Random training set: standard normal inputs and targets.
    X = np.random.normal(size=(N, d))
    Y = np.random.normal(size=(N))
    X_t = torch.tensor(X, dtype=torch.float32)
    Y_t = torch.tensor(Y, dtype=torch.float32).view(N, 1)
    net = Net(input_dim=d, hidden_dims=hidden_dims)

    optimizer = SGD(net.parameters(), lr=0.01)
    lf = nn.MSELoss()

    x_losses = []  # loss on the full training set after each epoch
    for ep in tnrange(n_epochs):
        # Per-sample updates: one SGD step for each training point.
        for k in range(N):
            optimizer.zero_grad()
            loss = lf(net(X_t[k:k + 1]), Y_t[k:k + 1])
            loss.backward()
            optimizer.step()

        with torch.no_grad():
            x_losses.append(lf(net(X_t), Y_t).item())

    # Loss on the full training set at the end of the trial.
    with torch.no_grad():
        final_loss = lf(net(X_t), Y_t)

    outcome_val = final_loss.data.numpy()
    print(outcome_val)
#     plt.semilogy(x_losses)
#     plt.suptitle(f'{d}')
#     plt.show()

    # Success criterion from the task: final loss below 1e-8 (10e-8 would be 1e-7).
    outcomes.append(outcome_val < 1e-8)

0.56798476
0.163595
0.36478776
0.53142613
0.20229197
0.6466337
0.16524045
0.3159274
0.35452053
0.4477297
0.21858501
0.86649925
0.06990448
0.18400669
0.34196514
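
The printed losses can be compared with the target from the task. A minimal sketch, using the 15 final losses copied from the output above (the run with $d=2$); the names `final_losses`, `threshold` and `success_rate` are introduced here for illustration only.

```python
# Final losses copied from the printed output of the run above (d = 2).
final_losses = [0.56798476, 0.163595, 0.36478776, 0.53142613, 0.20229197,
                0.6466337, 0.16524045, 0.3159274, 0.35452053, 0.4477297,
                0.21858501, 0.86649925, 0.06990448, 0.18400669, 0.34196514]

threshold = 1e-8  # target loss from the task statement
successes = [loss < threshold for loss in final_losses]
success_rate = sum(successes) / len(successes)

print(f"{sum(successes)} of {len(successes)} runs below {threshold:g}")
print(f"success rate: {success_rate:.0%} (the task asks for 80%)")
print(f"best final loss: {min(final_losses)}")
```

With these settings none of the 15 runs comes close to the target: the best final loss is about 0.07.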


In [10]:
final_loss.data.numpy()

array(0.8988382, dtype=float32)
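
The task speaks of gradient descent, whereas the training cell above performs one update per training point. For comparison, here is a minimal sketch of a full-batch variant, where each step uses the loss over all $N$ points. The helper `train_full_batch` and its default values (`lr=0.01`, `n_steps=200`) are assumptions made for illustration; they are not tuned and are not claimed to reach the $10^{-8}$ target.

```python
import torch
import torch.nn as nn


def train_full_batch(X, Y, hidden_dim=45, lr=0.01, n_steps=200):
    """Hypothetical helper: full-batch GD on a ReLU net with two hidden layers."""
    d = X.shape[1]
    net = nn.Sequential(
        nn.Linear(d, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, 1),
    )
    # SGD applied to the whole training set is plain gradient descent.
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    losses = []
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = loss_fn(net(X), Y)  # quadratic loss over all N points
        loss.backward()
        optimizer.step()           # one GD step per pass over the data
        losses.append(loss.item())
    return net, losses


torch.manual_seed(0)
N, d = 20, 15
X = torch.randn(N, d)  # standard normal inputs
Y = torch.randn(N, 1)  # standard normal targets
net, losses = train_full_batch(X, Y)
print(f"loss at step 1: {losses[0]:.4f}, at step {len(losses)}: {losses[-1]:.4f}")
```

Plotting `losses` on a logarithmic axis (as in the commented-out `plt.semilogy` call above) shows how quickly the loss decays for a given learning rate and width.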