
In this notebook, we will introduce PyTorch, talk about its important concepts and features, and eventually train an MNIST classifier using what we have learned. 

## What is PyTorch?

1. A Python GPU-accelerated tensor library (NumPy, but faster)
2. Differentiable Programming with dynamic computation graphs
3. Flexible and efficient **neural network** library
4. Python-first framework (easy to integrate with other Python libraries, debug, and extend)
  + Quick conversion from & to NumPy array, integration with other Python libs.
  + Your favorite Python debugger.
  + Adding custom ops with Python/C++ extensions.
  + Running in a pure C++ environment with the C++ API.

In [1]:
# install basic image libs
# (the version specifier is quoted so the shell does not treat `>=` as a redirect)
!pip install 'Pillow>=5.0.0'
!pip install -U image

# install torch and torchvision (a utility library for computer vision that provides many public datasets and pre-trained models)
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'
!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.1.0-{platform}-linux_x86_64.whl torchvision

Requirement already up-to-date: image in /usr/local/lib/python3.6/dist-packages (1.5.27)


## GPU-accelerated Tensor Library

A Tensor is a multi-dimensional array.

In [0]:
import torch

In [3]:
# Create a 3x5 matrix filled with zeros

x = torch.zeros(3, 5)
print(x)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])


In [5]:
# Create a 3x5 matrix filled with random values

y = torch.randn(3, 5)
print(y)

tensor([[-0.5828,  1.1052,  1.1359, -0.3108,  0.7542],
        [-0.1063,  0.7949, -0.0160, -0.3285,  0.0887],
        [ 2.1026, -0.8372,  0.8150,  0.7257,  0.4800]])


In [6]:
# Shape manipulations

print('\n.t()  (transpose): ')
print(y.t())

print('.reshape(5, 3): ')
print(y.reshape(5, 3))


.t()  (transpose): 
tensor([[-0.5828, -0.1063,  2.1026],
        [ 1.1052,  0.7949, -0.8372],
        [ 1.1359, -0.0160,  0.8150],
        [-0.3108, -0.3285,  0.7257],
        [ 0.7542,  0.0887,  0.4800]])
.reshape(5, 3): 
tensor([[-0.5828,  1.1052,  1.1359],
        [-0.3108,  0.7542, -0.1063],
        [ 0.7949, -0.0160, -0.3285],
        [ 0.0887,  2.1026, -0.8372],
        [ 0.8150,  0.7257,  0.4800]])


In [7]:
# Slicing

print(y[1:])

print(y[1:, ::2])

tensor([[-0.1063,  0.7949, -0.0160, -0.3285,  0.0887],
        [ 2.1026, -0.8372,  0.8150,  0.7257,  0.4800]])
tensor([[-0.1063, -0.0160,  0.0887],
        [ 2.1026,  0.8150,  0.4800]])


In [8]:
# Basic arithmetic

print(x + 2)

tensor([[2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.]])


In [10]:
print(y * (x + 2))

tensor([[-1.1656,  2.2105,  2.2717, -0.6215,  1.5084],
        [-0.2125,  1.5897, -0.0319, -0.6570,  0.1774],
        [ 4.2051, -1.6745,  1.6301,  1.4514,  0.9600]])


In [11]:
print((y * (x + 2)).exp())

tensor([[ 0.3117,  9.1201,  9.6959,  0.5371,  4.5196],
        [ 0.8085,  4.9024,  0.9686,  0.5184,  1.1941],
        [67.0280,  0.1874,  5.1043,  4.2693,  2.6117]])


#### GPU Acceleration

Everything above can also be run on a GPU.

First, let us create a [`torch.device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device) object representing a GPU device.

In [0]:
cuda0 = torch.device('cuda:0')  # pick the GPU at index 0

In [14]:
# Move a tensor from CPU to GPU
# NOTE: the first time you access a GPU, a context is created so this may take a
# few seconds. But subsequent uses will be fast.

cuda_y = y.to(cuda0)
print(cuda_y)

tensor([[-0.5828,  1.1052,  1.1359, -0.3108,  0.7542],
        [-0.1063,  0.7949, -0.0160, -0.3285,  0.0887],
        [ 2.1026, -0.8372,  0.8150,  0.7257,  0.4800]], device='cuda:0')


In [15]:
# Or directly creating a tensor on GPU

cuda_x = torch.zeros(3, 5, device=cuda0)
print(cuda_x)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]], device='cuda:0')


In [16]:
# All functions and methods work on GPU tensors

print((cuda_y * (cuda_x + 2)).exp())  # values match the CPU results above!

tensor([[ 0.3117,  9.1201,  9.6959,  0.5371,  4.5196],
        [ 0.8085,  4.9024,  0.9686,  0.5184,  1.1941],
        [67.0280,  0.1874,  5.1043,  4.2693,  2.6117]], device='cuda:0')
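
The cells above assume that a CUDA device is present. A minimal sketch of a device-agnostic variant follows; the fallback to the CPU via `torch.cuda.is_available()` is an assumption added here and is not part of the original notebook.

```python
import torch

# Assumption: fall back to the CPU when no CUDA device is visible
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

y = torch.randn(3, 5)
x = torch.zeros(3, 5, device=device)   # created directly on the chosen device

cpu_result = (y * 2).exp()
dev_result = (y.to(device) * (x + 2)).exp()

# Bring the device result back to the CPU before comparing
print(dev_result.device)
print(torch.allclose(cpu_result, dev_result.cpu(), rtol=1e-4))
```

Writing code against a `device` variable like this lets the same notebook run on machines with or without a GPU.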


### NumPy Bridge

Converting a `torch.Tensor` to a `np.ndarray` and vice versa is a breeze.

The `torch.Tensor` and `np.ndarray` will share their underlying memory locations (if the `torch.Tensor` is on CPU and `dtype` is the same), and changing one will change the other.

In [0]:
import numpy as np

In [18]:
# Converting a tensor to an array

x = torch.randn(5)
print(x)

# use `my_tensor.numpy()`
x_np = x.numpy()
print(x_np)

# or `np.asarray`

x_np = np.asarray(x)
print(x_np)

tensor([-0.7144, -0.5864, -0.0497, -0.3571,  0.6125])
[-0.7144131  -0.5864114  -0.04972229 -0.35709804  0.6124573 ]
[-0.7144131  -0.5864114  -0.04972229 -0.35709804  0.6124573 ]


In [19]:
# in-place changes on one affect the other

x[0] = -1
print(x)
print(x_np)

tensor([-1.0000, -0.5864, -0.0497, -0.3571,  0.6125])
[-1.         -0.5864114  -0.04972229 -0.35709804  0.6124573 ]


In [20]:
# Converting an array to a tensor

a = np.random.randn(3, 4)

a_pt = torch.as_tensor(a)
print(a_pt)

tensor([[ 0.8211, -0.6568, -0.3037, -0.3949],
        [-0.8449,  1.1088, -1.2807, -0.6494],
        [ 0.6119,  0.7163,  0.1532, -2.3002]], dtype=torch.float64)


In [21]:
# the resulting CPU Tensor shares memory with the array!

a_pt[0] = -1
print(a)

[[-1.         -1.         -1.         -1.        ]
 [-0.84490724  1.10883538 -1.28072948 -0.64943817]
 [ 0.61190272  0.71631575  0.1532047  -2.30022874]]


In [22]:
# But if we change dtype and/or device at the same time, a copy is made

a_half_pt = torch.as_tensor(a, dtype=torch.float16, device=cuda0)
a_half_pt[0] = 9
print(a_half_pt)

print(a)  # original array is not affected

tensor([[ 9.0000,  9.0000,  9.0000,  9.0000],
        [-0.8447,  1.1084, -1.2803, -0.6494],
        [ 0.6118,  0.7163,  0.1532, -2.3008]], device='cuda:0',
       dtype=torch.float16)
[[-1.         -1.         -1.         -1.        ]
 [-0.84490724  1.10883538 -1.28072948 -0.64943817]
 [ 0.61190272  0.71631575  0.1532047  -2.30022874]]
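
When sharing is not wanted, an explicit copy avoids surprises. The sketch below contrasts `torch.as_tensor` (which shares memory for a CPU array of the same dtype) with `torch.tensor` (which copies); the array `a` here is a fresh illustrative array, not the one from the cells above.

```python
import numpy as np
import torch

a = np.zeros(3)

shared = torch.as_tensor(a)   # shares memory with `a` (same dtype, CPU)
copied = torch.tensor(a)      # always copies the data

a[0] = 7.0                    # modify the NumPy array in place

print(shared[0].item())       # the shared tensor sees the change
print(copied[0].item())       # the copy does not
```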


## Differentiable Programming with Dynamic Computation Graphs

Gradient-based optimization is an essential part of the modern deep learning frenzy. PyTorch uses [reverse-mode automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to efficiently compute gradients through any computations done on tensors.

### Dynamic vs. Static

A neural network is essentially a sequence of mathematical operations on tensors, which build up a computation graph.

Frameworks such as TensorFlow (in its original graph mode), Theano, Caffe and CNTK have a static view of the world: one has to build the network's graph up front and then reuse the same structure again and again. Changing the way the network behaves means rebuilding the graph from scratch.

PyTorch instead records the graph on the fly as operations run and differentiates it with reverse-mode auto-differentiation, which allows you to change the way your network behaves arbitrarily from one iteration to the next.
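
As a small sketch of what "dynamic" buys you, the toy function below uses an ordinary Python `while` loop whose trip count depends on the data, and autograd still differentiates through whatever path was actually taken. The function `forward` and its stopping rule are hypothetical, chosen only for illustration.

```python
import torch

def forward(x, w):
    # Hypothetical toy "network": how many times `w` is applied depends on the data
    h = x
    steps = 0
    while h.norm() < 10 and steps < 5:   # ordinary Python control flow
        h = h * w
        steps += 1
    return h.sum(), steps

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor([1.0, 2.0])

out, steps = forward(x, w)
out.backward()

# With these inputs the loop runs 3 times, so out = 3 * w**3 and d(out)/dw = 9 * w**2
print(steps, out.item(), w.grad.item())
```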


### Dynamic computation graphs

When you create a tensor with its `requires_grad` flag set to `True`, the [`autograd`](https://pytorch.org/docs/stable/autograd.html) engine considers it as a **leaf** node of the computation graph. As you compute with it, the graph is dynamically expanded. When you ask for gradients (e.g., via `tensor.backward()`), the `autograd` engine traces backwards through the graph, and automatically computes the gradients for you.

![alt text](https://github.com/pytorch/pytorch/raw/master/docs/source/_static/img/dynamic_graph.gif)
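
To make the idea concrete, here is a toy scalar version of the mechanism in plain Python. It is only a sketch of the general technique (graph recorded during the forward computation, chain rule applied backwards); the class `Value` is hypothetical and is not how PyTorch's `autograd` is implemented.

```python
class Value:
    """A scalar node in a dynamically built computation graph (toy illustration)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the graph topologically, then apply the chain rule from the
        # output back to the leaves, accumulating into `.grad`
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


a = Value(3.0)        # leaf
b = Value(4.0)        # leaf
c = a * b + a         # the graph grows as the expression is evaluated
c.backward()
print(c.data, a.grad, b.grad)   # c = 15, dc/da = b + 1 = 5, dc/db = a = 3
```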


**Let's see this in action!**

In [23]:
# Now, we want tensors with `requires_grad=True`

a = torch.ones(3, 5, requires_grad=True)  # tensor of all ones
print(a)  # notice that the `requires_grad` flag is on!

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]], requires_grad=True)


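Why will the gradient of the sum be 1 for every element? Because, for

$$A = \begin{bmatrix} A_{11} & \cdots & A_{15} \\ \vdots & \ddots & \vdots \\ A_{31} & \cdots & A_{35} \end{bmatrix},$$

each entry appears in the sum exactly once:

$$\frac{\partial}{\partial A_{ij}} \sum_{k,l} A_{kl} = 1.$$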

In [24]:
# Currently `a` has no gradients

print(a.grad)

None


In [25]:
# Let's compute the gradient wrt the sum

s = a.sum()
print('sum of a is', s)

sum of a is tensor(15., grad_fn=<SumBackward0>)


In [26]:
# Notice the `grad_fn` of `s`. It represents the function used to propagate
# gradients from `s` to previous nodes of the graph (`a` in this case).

s.backward()  # compute gradient!
print(a.grad)

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])


In [0]:
# Yay! Indeed d \sum_a / d a_ij = 1

In [28]:
# Gradients are automatically **accumulated**

a.sum().backward()
print(a.grad)  # now the new gradients are added to the old ones

# Don't worry, we have easy ways to clear the gradients too. 
# We will talk about those later!

tensor([[2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.],
        [2., 2., 2., 2., 2.]])
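
For a plain tensor, one way to clear the accumulated gradient is to zero `.grad` in place. The sketch below uses a fresh tensor rather than the `a` from the cells above.

```python
import torch

a = torch.ones(3, 5, requires_grad=True)

a.sum().backward()
a.sum().backward()
after_two = a.grad[0, 0].item()    # two backward passes have accumulated
print(after_two)

a.grad.zero_()                     # reset the accumulated gradient in place
a.sum().backward()
after_reset = a.grad[0, 0].item()  # only the latest backward pass remains
print(after_reset)
```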


In [29]:
# Now let's do something slightly fancier, on GPU!

a = torch.ones(3, 4, device=cuda0, requires_grad=True)
b = torch.randn(4, 4, device=cuda0, requires_grad=True)

result = (torch.mm(a, b.t().exp()) * 0.5).rfft(2).sum() * b.prod() - b.mean()
print('this complicated chain of operations gives....')
print(result)

this complicated chain of operations gives....
tensor(0.2051, device='cuda:0', grad_fn=<SubBackward0>)


In [30]:
result.backward()
print('\ngradient wrt a is')
print(a.grad)
print('\ngradient wrt b is')
print(b.grad)


gradient wrt a is
tensor([[-2.6964e-06, -1.5461e-04, -3.5010e-05, -1.5756e-04],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]], device='cuda:0')

gradient wrt b is
tensor([[-0.0621, -0.0632, -0.0623, -0.0660],
        [-0.0627, -0.0653, -0.0622, -0.0603],
        [-0.0619, -0.0619, -0.0603, -0.0628],
        [-0.0644, -0.0633, -0.0622, -0.0622]], device='cuda:0')


In [32]:
#########################
#                       #
#        Exercise       #
#                       #
#########################


# Compute z as indicated below, and the gradients of z wrt a and b.

a = torch.linspace(-3, 3, 10, dtype=torch.float32, requires_grad=True)
b = torch.logspace(0.2, 2, 10, requires_grad=True)

z = torch.log(b.sum() / torch.exp(a).sum()) - b.mean()
print(z)

z.backward()
print(a.grad)
print(b.grad)

tensor(-24.9533, grad_fn=<SubBackward0>)
tensor([-0.0012, -0.0024, -0.0046, -0.0089, -0.0174, -0.0339, -0.0659, -0.1284,
        -0.2501, -0.4872])
tensor([-0.0963, -0.0963, -0.0963, -0.0963, -0.0963, -0.0963, -0.0963, -0.0963,
        -0.0963, -0.0963])


Compute 

$$z = \log \left( \frac{1}{\sum_i \exp(a_i)} \sum_j b_j \right) - \frac{1}{\lvert \mathbf{b} \rvert} \sum_k b_k,$$

and then the gradients of $z$ w.r.t. $\mathbf{a}$ and $\mathbf{b}$.

For the formula above, they should look like:

```
# Gradient wrt a
tensor([-0.0012, -0.0024, -0.0046, -0.0089, -0.0174, -0.0339, -0.0659, -0.1284,
        -0.2501, -0.4872])

# Gradient wrt b
tensor([-0.0963, -0.0963, -0.0963, -0.0963, -0.0963, -0.0963, -0.0963, -0.0963,
        -0.0963, -0.0963])
```
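
As a sanity check on the formula for $z$, the gradients can also be worked out by hand. Writing $z = \log \sum_j b_j - \log \sum_i \exp(a_i) - \frac{1}{\lvert \mathbf{b} \rvert} \sum_k b_k$ gives

$$\frac{\partial z}{\partial a_i} = -\frac{\exp(a_i)}{\sum_{i'} \exp(a_{i'})}, \qquad \frac{\partial z}{\partial b_j} = \frac{1}{\sum_{j'} b_{j'}} - \frac{1}{\lvert \mathbf{b} \rvert}.$$

The following sketch evaluates these closed forms with the standard library only, rebuilding the `linspace` and `logspace` points by hand (in double precision, so the last digits may differ slightly from the float32 tensors).

```python
import math

n = 10
a = [-3 + 6 * i / (n - 1) for i in range(n)]              # points of torch.linspace(-3, 3, 10)
b = [10 ** (0.2 + 1.8 * i / (n - 1)) for i in range(n)]   # points of torch.logspace(0.2, 2, 10)

sum_exp_a = sum(math.exp(v) for v in a)
sum_b = sum(b)

z = math.log(sum_b / sum_exp_a) - sum_b / n
grad_a = [-math.exp(v) / sum_exp_a for v in a]   # dz/da_i = -softmax(a)_i
grad_b = [1.0 / sum_b - 1.0 / n for _ in b]      # dz/db_j is the same for every j

print(round(z, 4))
print([round(g, 4) for g in grad_a])
print([round(g, 4) for g in grad_b])
```

The gradient with respect to $\mathbf{a}$ is a negated softmax, so its entries sum to $-1$, and the gradient with respect to $\mathbf{b}$ is the same constant in every position.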

#### Manipulating the `requires_grad` flag

In [33]:
# Other than setting it at creation time, you can change this flag in place
# with `my_tensor.requires_grad_()`, or by setting the `requires_grad`
# attribute directly.

x = torch.randn(1, 4, 5)
print(x)
print('x does not track gradients')

tensor([[[ 1.5907, -1.2147, -1.0973, -0.9304,  0.0787],
         [-1.0679, -0.3764,  0.7608, -0.1658,  0.7324],
         [-0.6533,  2.1273,  0.1848, -1.7720,  1.2010],
         [-0.0467,  0.0066, -0.1344, -0.0926,  1.3753]]])
x does not track gradients


In [34]:
x.requires_grad_()
print(x)
print('x now **does** track gradients')

tensor([[[ 1.5907, -1.2147, -1.0973, -0.9304,  0.0787],
         [-1.0679, -0.3764,  0.7608, -0.1658,  0.7324],
         [-0.6533,  2.1273,  0.1848, -1.7720,  1.2010],
         [-0.0467,  0.0066, -0.1344, -0.0926,  1.3753]]], requires_grad=True)
x now **does** track gradients


## Flexible and Efficient Neural Network Library

The [`torch.nn`](https://pytorch.org/docs/stable/nn.html) and [`torch.optim`](https://pytorch.org/docs/stable/optim.html) packages provide many efficient implementations of neural network components:
  + Affine layers and [activation functions](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity)
  + Normalization methods
  + [Initialization schemes](https://pytorch.org/docs/stable/nn.html#torch-nn-init)
  + [Loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions)
  + [Embeddings](https://pytorch.org/docs/stable/nn.html#sparse-layers)
  + [Distributed and Multi-GPU training](https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed)
  + [Gradient-based optimizers](https://pytorch.org/docs/stable/optim.html)
  + [Learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)
  + etc.

In [0]:
import torch.nn as nn
import torch.nn.functional as F

#### `torch.nn` Layers

We will use the [fully connected linear layer (`nn.Linear`)](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear) as an example. 

An fc layer performs an affine transform with a 2D weight parameter $\mathbf{w}$ of shape `(out_features, in_features)` and a 1D bias parameter $\mathbf{b}$. For a row-vector input $\mathbf{x}$:

$$ f(\mathbf{x}) = \mathbf{x} \mathbf{w}^\mathrm{T} + \mathbf{b}.$$

In [36]:
fc = nn.Linear(in_features=8, out_features=8)
print(fc)

Linear(in_features=8, out_features=8, bias=True)


In [37]:
# It has two parameters, the weight and the bias

for name, p in fc.named_parameters():
    print('param name: {}\t shape: {}'.format(name, p.shape))

param name: weight	 shape: torch.Size([8, 8])
param name: bias	 shape: torch.Size([8])
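
The weight is stored with shape `(out_features, in_features)`. A short sketch, using a non-square layer so that the two dimensions can be told apart, checks that the layer output agrees with the affine transform written out by hand (the sizes 8, 4 and 64 are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

fc = nn.Linear(in_features=8, out_features=4)
x = torch.randn(64, 8)                      # batch of 64 row vectors

manual = x @ fc.weight.t() + fc.bias        # (64, 8) @ (8, 4) + (4,)

print(fc.weight.shape, manual.shape)
print(torch.allclose(fc(x), manual, atol=1e-5))
```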


In [38]:
# These parameters by default have `requires_grad=True`, so they will collect gradients!

print(fc.bias)

Parameter containing:
tensor([-0.1421, -0.2153,  0.0239,  0.1992, -0.3260,  0.1663,  0.2311,  0.2320],
       requires_grad=True)


In [0]:
# Let's construct an input tensor with 2 dimensions:
#   - batch dimension of size 64
#   - 8 features

x = torch.randn(64, 8)

In [41]:
# Pass it through the fc layer

result = fc(x)
print(result.shape)

# Why does the `result` have shape [64, 8]?
#   - batch dimension of size 64
#   - 8 output features

torch.Size([64, 8])


In [42]:
# Even though the input `x` has `requires_grad=False`, the fc layer's
# weight and bias parameters have `requires_grad=True`. So the result also
# requires gradient, with a `grad_fn` to compute the backward pass for
# the affine transform.
print(result.requires_grad)
print(result.grad_fn)  # It says `AddmmBackward` because the fc layer performs a matmul and an addition

True
<AddmmBackward object at 0x7f8526386e10>


In [0]:
# Say (arbitrarily) we want the layer to behave like the cosine function (yes I know it is impossible)

target = x.cos()

In [44]:
# Let's try MSE loss

loss = F.mse_loss(result, target)
print(loss)

tensor(0.8142, grad_fn=<MseLossBackward>)


In [45]:
# Compute gradients

loss.backward()
print(fc.bias.grad)

tensor([-0.1908, -0.2228, -0.1108, -0.1024, -0.2257, -0.1332, -0.0776, -0.1101])


In [46]:
# We can manually perform SGD via a loop

print('bias before SGD', fc.bias)

lr = 0.1
with torch.no_grad():  
    # this context manager tells PyTorch that we don't want ops inside to be 
    # tracked by autograd!
    for p in fc.parameters():
        p -= lr * p.grad
        
print('bias after SGD', fc.bias)

bias before SGD Parameter containing:
tensor([-0.1421, -0.2153,  0.0239,  0.1992, -0.3260,  0.1663,  0.2311,  0.2320],
       requires_grad=True)
bias after SGD Parameter containing:
tensor([-0.1230, -0.1930,  0.0350,  0.2094, -0.3034,  0.1796,  0.2388,  0.2430],
       requires_grad=True)


#### `torch.optim` optimizers

More easily, we can use the provided [`torch.optim`](https://pytorch.org/docs/stable/optim.html#torch.optim) optimizers. Let's use the [`torch.optim.SGD`](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD) optimizer for example!
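
For plain SGD without momentum or weight decay, one optimizer step should match the manual update `p -= lr * p.grad`. The sketch below checks this on two identical copies of a layer; the names `fc_manual` and `fc_optim` are introduced here only for the comparison, and everything runs on the CPU.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
fc_manual = nn.Linear(8, 8)
fc_optim = copy.deepcopy(fc_manual)          # identical starting parameters

x = torch.randn(64, 8)
target = x.cos()
lr = 0.1

# Manual update
F.mse_loss(fc_manual(x), target).backward()
with torch.no_grad():
    for p in fc_manual.parameters():
        p -= lr * p.grad

# The same step through the optimizer
optimizer = torch.optim.SGD(fc_optim.parameters(), lr=lr)
optimizer.zero_grad()
F.mse_loss(fc_optim(x), target).backward()
optimizer.step()

print(torch.allclose(fc_manual.weight, fc_optim.weight, atol=1e-6))
print(torch.allclose(fc_manual.bias, fc_optim.bias, atol=1e-6))
```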

In [0]:
# Let's optimize for 5000 iterations

# First, put the layer on GPU so things run faster

fc = fc.to(cuda0)

In [0]:
# Construct an optimizer

optim = torch.optim.SGD(fc.parameters(), lr=0.1)

In [49]:
# training loop

batch_size = 256

for ii in range(5000):
    # clear gradients accumulated on the parameters
    optim.zero_grad()
    
    # get an input (say we only care about inputs sampled from N(0, I))
    x = torch.randn(batch_size, 8, device=cuda0)  # this has to be on GPU too
    
    # target is the cos(x)
    target = x.cos()
    
    # forward pass
    result = fc(x)
    
    # compute loss
    loss = F.mse_loss(result, target)
    
    # compute gradients
    loss.backward()
    
    # let the optimizer do its work; the parameters will be updated in this call
    optim.step()
    
    # add some printing
    if ii % 500 == 0:
        print('iteration {}\tloss {:.5f}'.format(ii, loss))


iteration 0	loss 0.83702
iteration 500	loss 0.19215
iteration 1000	loss 0.20539
iteration 1500	loss 0.19414
iteration 2000	loss 0.19394
iteration 2500	loss 0.19764
iteration 3000	loss 0.19878
iteration 3500	loss 0.19556
iteration 4000	loss 0.20001
iteration 4500	loss 0.19454


### Building Deep Neural Networks

A single `nn.Linear` layer didn't do very well! The MSE loss above is still pretty large.

But this is expected as it is simply a linear transformation and thus has limited expressive power. Let's replace it with a deep network and see how it works!
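
The plateau near 0.2 can be explained with a short calculation. For $x \sim N(0, 1)$ we have $\mathbb{E}[\cos x] = e^{-1/2}$ and $\mathbb{E}[x \cos x] = 0$, and different input features are independent, so the best affine predictor of $\cos(x_i)$ is simply the constant $e^{-1/2}$. Its error is the variance of $\cos x$:

$$\mathrm{Var}(\cos x) = \frac{1 + e^{-2}}{2} - e^{-1} \approx 0.1998.$$

The sketch below evaluates this closed form and compares it with a Monte Carlo estimate, using only the standard library (the sample size and seed are arbitrary choices).

```python
import math
import random

# Closed form for x ~ N(0, 1)
best_bias = math.exp(-0.5)                                 # E[cos x]
best_mse = (1 + math.exp(-2)) / 2 - math.exp(-1)           # Var(cos x)

# Monte Carlo estimate of the error of the constant predictor
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]
mc_mse = sum((math.cos(s) - best_bias) ** 2 for s in samples) / len(samples)

print(round(best_bias, 4), round(best_mse, 4))
print(mc_mse)
```

This is consistent with the losses of roughly 0.19 to 0.20 logged by the single-layer training loop: no choice of weights for one affine layer can do better on this task.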

For simplicity, we will use the following feedforward network architecture (from top to bottom):

```
        [Input]
           ||
[Fully-Connected 8 -> 32]
           ||
    [ReLU activation]
           ||
[Fully-Connected 32 -> 32]
           ||
    [ReLU activation]
           ||
[Fully-Connected 32 -> 8]
           ||
        [Output]
```

In PyTorch, a model is represented by a [`nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) object. The `nn.Linear` layer we looked at above is also an instance of it:

In [0]:
assert isinstance(nn.Linear(8, 8), nn.Module)

Now that we want to build a deep network, we can compose the needed layers together by writing a custom `nn.Module` ourselves.

In [0]:
class MyNet(nn.Module):  # subclass nn.Module
    def __init__(self):
        super(MyNet, self).__init__()
        
        # We need 3 fully-connected layers!
        # Simply assigning them as attributes will
        # make sure that PyTorch keeps track of them.
        
        # 8 => 32
        self.fc1 = nn.Linear(8, 32)
        # 32 => 32
        self.fc2 = nn.Linear(32, 32)
        # 32 => 8
        self.fc3 = nn.Linear(32, 8)
        
        
    # We also need to define a `forward()` method that details
    # what should happen when this module is used.
    def forward(self, x):
        x = self.fc1(x)
        x = x.relu()
        x = self.fc2(x)
        x = x.relu()
        return self.fc3(x)
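
The remark about assigning layers as attributes matters in practice: only submodules that `nn.Module` knows about show up in `parameters()` and therefore reach the optimizer. A minimal sketch, using a hypothetical `TinyNet` that deliberately hides one layer inside a plain Python list:

```python
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical module, used only for this illustration
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 32)       # registered: assigned as an attribute
        self.fc2 = nn.Linear(32, 8)       # registered
        self.hidden = [nn.Linear(8, 8)]   # NOT registered: inside a plain list

    def forward(self, x):
        return self.fc2(self.fc1(x).relu())

net = TinyNet()
names = [name for name, _ in net.named_parameters()]
n_params = sum(p.numel() for p in net.parameters())

print(names)      # only fc1 and fc2 appear
print(n_params)   # (8*32 + 32) + (32*8 + 8) = 552
```

When a list of layers is needed, `nn.ModuleList` is the container that registers its contents.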

In [0]:
# Okay! Now we are ready to use this deep network! 

# Construct a network and move to GPU
net = MyNet().to(cuda0)

# Construct an optimizer
optim = torch.optim.SGD(net.parameters(), lr=0.1)

In [53]:
# The same training loop, but now using a deep network!

batch_size = 256

for ii in range(5000):
    # clear gradients accumulated on the parameters
    optim.zero_grad()
    
    # get an input (say we only care about inputs sampled from N(0, I))
    x = torch.randn(batch_size, 8, device=cuda0)  # this has to be on GPU too
    
    # target is the cos(x)
    target = x.cos()
    
    # forward pass
    result = net(x)
    
    # compute loss
    loss = F.mse_loss(result, target)
    
    # compute gradients
    loss.backward()
    
    # let the optimizer do its work; the parameters will be updated in this call
    optim.step()
    
    # add some printing
    if ii % 500 == 0:
        print('iteration {}\tloss {:.5f}'.format(ii, loss))


iteration 0	loss 0.61165
iteration 500	loss 0.17693
iteration 1000	loss 0.10772
iteration 1500	loss 0.06474
iteration 2000	loss 0.03068
iteration 2500	loss 0.02252
iteration 3000	loss 0.00968
iteration 3500	loss 0.00805
iteration 4000	loss 0.00754
iteration 4500	loss 0.00620


The network did so much better than a single fully-connected layer!

#### `nn.Module` Containers

`torch.nn` also provides many other [`nn.Module` containers](https://pytorch.org/docs/stable/nn.html#containers) for easily building complex networks. E.g., [`nn.Sequential`](https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential) executes a list of submodules sequentially, passing each output to the next's input. 

Using `nn.Sequential`, the above network can be equivalently written as:

In [0]:
net = nn.Sequential(
    nn.Linear(8, 32),
    nn.ReLU(),               # This nn.Module does the ReLU activation on its input
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
).to(cuda0)
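
A quick sketch of what "executes sequentially" means: iterating over the container and applying each submodule by hand gives the same result as calling the container. A smaller CPU-only network is used here for illustration.

```python
import torch
import torch.nn as nn

small_net = nn.Sequential(
    nn.Linear(8, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
)

x = torch.randn(4, 8)

# Apply the submodules one by one; nn.Sequential supports iteration and indexing
h = x
for layer in small_net:
    h = layer(h)

print(torch.equal(small_net(x), h))
print(small_net[0])
```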

In [0]:
#########################
#                       #
#        Exercise       #
#                       #
#########################

Perform the same regression task (i.e., modeling $f(x) = \cos(x)$), but with the following modifications:

+ Use one *more* hidden layer
+ Use the `tanh` activation function (see [`my_tensor.tanh()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tanh))
+ Use a batch size of 128
+ Use the [`torch.optim.Adam`](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) optimizer
+ Use the [L1 loss](https://pytorch.org/docs/stable/nn.html#torch.nn.L1Loss) function
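
As a reminder of how the L1 loss differs from the MSE loss used so far, here is a tiny hand-checkable comparison using the functional forms `F.l1_loss` and `F.mse_loss` (the numbers are made up for illustration):

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([1.0, 2.0, 4.0])
target = torch.tensor([1.0, 0.0, 1.0])

l1 = F.l1_loss(pred, target)     # mean(|pred - target|)   = (0 + 2 + 3) / 3
mse = F.mse_loss(pred, target)   # mean((pred - target)^2) = (0 + 4 + 9) / 3

print(l1.item(), mse.item())
```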


The following code skeleton is provided. Fill in the places marked with `FIXME!!!`.

In [0]:
class MyDeeperNet(nn.Module):
    def __init__(self):
        super(MyDeeperNet, self).__init__()
        
        # We need 4 fully-connected layers now! 
        # Each should have 32 output features, except for the last one, which outputs 8 values.
        
        # FIXME!!!
         
        # Alternatively, you can use an `nn.Sequential` to implement this.
        
        
    def forward(self, x):
        # FIXME!!!
        # Remember to use the `tanh` activation function
        pass
        
        
# Construct our new awesome deeper network and move to GPU
deeper_net =  None  # FIXME!!!

# Construct an Adam optimizer
deeper_net_optim = None  # FIXME!!!


# Training loop

batch_size = None  # FIXME!!! Batch size of 128

for ii in range(5000):
    # clear gradients accumulated on the parameters
    deeper_net_optim.zero_grad()
    
    # get an input (say we only care about inputs sampled from N(0, I))
    x = torch.randn(batch_size, 8, device=cuda0)  # this has to be on GPU too
    
    # target is the cos(x)
    target = x.cos()
    
    # forward pass
    result = deeper_net(x)
    
    # compute loss
    loss = None  # FIXME!!! Now with L1 loss
    
    # compute gradients
    loss.backward()
    
    # let the optimizer do its work; the parameters will be updated in this call
    deeper_net_optim.step()
    
    # add some printing
    if ii % 500 == 0:
        print('iteration {}\tloss {:.5f}'.format(ii, loss))
