## Optimizers 
Optimizers update the model parameters to minimize the model's loss.
 
* Stochastic Gradient Descent (SGD):
    * Use a minibatch of data for the forward and backward passes
    * The gradient of the minibatch loss is computed w.r.t. the current model parameters
* Parameters are updated according to:
    * w(i+1) = w(i) + lr*(-grad(loss))
    * lr = learning rate (the size of the step the optimizer takes; it directly affects how fast training converges, and a value that is too large can overshoot the minimum)
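
The update rule above can be sketched in plain Python, without MXNet. This is only an illustration: the helper `sgd_step`, the bias-free linear model, and the toy data are assumptions made for the example, not part of the notebook.

```python
# Minimal sketch of one SGD step for a linear model y_hat = w . x (no bias),
# using the squared-error loss 0.5 * (y_hat - y)^2 averaged over the minibatch.
# sgd_step and the toy data below are hypothetical, for illustration only.

def sgd_step(w, batch_x, batch_y, lr):
    """Return the weights after one SGD update on a single minibatch."""
    n = len(batch_x)
    grad = [0.0] * len(w)
    for x, y in zip(batch_x, batch_y):
        y_hat = sum(wj * xj for wj, xj in zip(w, x))  # forward pass
        err = y_hat - y
        for j, xj in enumerate(x):                     # backward pass
            grad[j] += err * xj / n                    # mean gradient over the batch
    # w(i+1) = w(i) + lr * (-grad(loss))
    return [wj - lr * gj for wj, gj in zip(w, grad)]


w = [0.0, 0.0]
batch_x = [[1.0, 0.0], [0.0, 2.0]]
batch_y = [1.0, 2.0]
w_new = sgd_step(w, batch_x, batch_y, lr=0.5)
print(w_new)  # [0.25, 1.0]
```

Each weight moves against its gradient, and the learning rate scales how far it moves.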

In [1]:
import mxnet as mx
from mxnet import nd, autograd, optimizer, gluon

In [2]:
net = gluon.nn.Dense(1)
net

Dense(None -> 1, linear)

In [3]:
net.initialize()

In [5]:
batch_size = 8
x = nd.random.uniform(shape=(batch_size, 4))
y = nd.random.uniform(shape=(batch_size,))
loss = gluon.loss.L2Loss()

In [7]:
def forward_backward():
    with autograd.record():
        l = loss(net(x), y)
    l.backward()
    
forward_backward()

In [9]:
trainer = gluon.Trainer(net.collect_params(), optimizer='sgd', optimizer_params={'learning_rate':1})

In [10]:
curr_weight = net.weight.data().copy()
curr_weight


[[-0.0196689   0.01582889 -0.00881553  0.0563288 ]]
<NDArray 1x4 @cpu(0)>

In [11]:
# Update the model parameters using the gradients computed by forward_backward()
trainer.step(batch_size)  # batch_size rescales the gradient by 1/batch_size, so the step does not grow with the batch

In [12]:
# Explicit SGD update for comparison: w - lr * grad / batch_size, with lr = 1
print(curr_weight - net.weight.grad() * 1 / batch_size)


[[0.2073231  0.2075868  0.370947   0.36136353]]
<NDArray 1x4 @cpu(0)>
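
The division by `batch_size` matters because calling `backward()` on a vector of per-sample losses accumulates the sum of the per-sample gradients. A small plain-Python sketch of the effect, where `summed_grad` and the scalar toy data are assumptions for illustration:

```python
# Sketch: a summed gradient grows with the batch size, while the gradient
# divided by the batch size does not. summed_grad is a hypothetical helper.

def summed_grad(w, xs, ys):
    """Gradient of sum_i 0.5 * (w * x_i - y_i)^2 with respect to a scalar w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


w = 0.5
xs, ys = [1.0, 2.0], [1.0, 1.0]

small = summed_grad(w, xs, ys)          # batch of 2
large = summed_grad(w, xs * 4, ys * 4)  # the same samples repeated 4 times, batch of 8

print(small, large)                                # the summed gradient is 4x larger
print(small / len(xs), large / (4 * len(xs)))      # the normalized gradients agree
```

With the normalization, the same learning rate gives the same step for both batches.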


In [14]:
# Switch to Adam by passing an optimizer instance instead of a name string
option = optimizer.Adam(learning_rate=1)
trainer = gluon.Trainer(net.collect_params(), option)

In [15]:
forward_backward()

In [16]:
trainer.step(batch_size)

In [18]:
net.weight.data()


[[-0.7926822  -0.79241884 -0.62905896 -0.6386422 ]]
<NDArray 1x4 @cpu(0)>
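
Each weight above differs from the SGD-updated value printed earlier by almost exactly 1, which is the learning rate. That is what the textbook Adam rule predicts for its first step: after bias correction the step is roughly `lr * sign(grad)`, whatever the gradient's magnitude. The sketch below uses plain Python; `adam_first_step`, the commonly used default hyperparameters, and the sample gradients are assumptions, and a library implementation may arrange the same computation differently.

```python
import math

# Sketch of the first Adam update (t = 1) for a single scalar weight.
# adam_first_step, the defaults, and the gradients tried below are hypothetical.

def adam_first_step(w, g, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply the first Adam update to weight w given gradient g."""
    m = (1 - beta1) * g        # first-moment estimate, starting from 0
    v = (1 - beta2) * g * g    # second-moment estimate, starting from 0
    m_hat = m / (1 - beta1)    # bias correction at t = 1 recovers g
    v_hat = v / (1 - beta2)    # bias correction at t = 1 recovers g^2
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients of very different sizes all move the weight by about lr = 1.
for g in (0.003, 0.4, 25.0):
    print(g, adam_first_step(0.2, g, lr=1.0))
```

Later steps depend on the running moment estimates, so this equal-sized move is specific to the first update.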

In [19]:
trainer.learning_rate

1

In [20]:
trainer.set_learning_rate(0.1) # change learning rate to 0.1