# Manual: How to set up learning hyper-parameters
 


This manual explains how to set the learning rate, momentum and other hyper-parameters for the learners (optimizers) supported in CNTK:

* [AdaDelta](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.adadelta)
* [AdaGrad](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.adagrad)
* [FSAdaGrad](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.fsadagrad)
* [Adam](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.adam)
* [MomentumSGD](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.momentum_sgd)
* [Nesterov](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.nesterov)
* [RMSProp](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.rmsprop)
* [SGD](https://cntk.ai/pythondocs/cntk.learners.html#cntk.learners.sgd)

Additional details regarding the learners and how to use them in training can be found at:

* For details on learning rate schedules, please see [learning_rate_schedule](https://cntk.ai/pythondocs/cntk.learners.html?highlight=learning_rate_schedule#cntk.learners.learning_rate_schedule); for details on momentum schedules, please see [momentum_schedule](https://cntk.ai/pythondocs/cntk.learners.html?highlight=learning_rate_schedule#cntk.learners.momentum_schedule).

* For how to use the learners to train a model, please refer to the [Manual on how to train using declarative and imperative API](https://github.com/Microsoft/CNTK/blob/master/Manual/Manual_How_to_train_using_declarative_and_imperative_API.ipynb).

Prepare a simple 1-layer neural network classification model for demonstration purposes:

```python
import cntk as C
x = C.input_variable(shape=(4,4))
label = C.input_variable(2)
model = C.layers.Dense(2, activation=C.sigmoid)(x)
```

## Simple hyper-parameter set-up



### Passing learning rates, momentum and other hyper-parameters directly as arguments to the learners

* The simplest way to set up a learner with a specified learning rate is as follows (here `cntk_learner` is a placeholder for any of the learner functions listed above, such as `sgd` or `adam`):

```python
    C.cntk_learner(parameters=model.parameters, lr=<int or float>, minibatch_size=<None or int>, ...other parameters)
```

* The simplest way to set up a learner with a specified learning rate and momentum is as follows:

```python
    C.cntk_learner(parameters=model.parameters, lr=<int or float>, momentum=<int or float>, minibatch_size=<None or int>, ...other parameters)
```

* The simplest way to set up a learner with a specified learning rate, momentum and variance_momentum (i.e. squared gradient momentum) in Adam and the CNTK-specific FSAdaGrad is as follows:

```python
    C.cntk_learner(parameters=model.parameters, lr=<int or float>, momentum=<int or float>, variance_momentum=<int or float>, minibatch_size=<None or int>, ...other parameters)
```

In the above, each CNTK learner takes a *minibatch_size* parameter. This is because CNTK can automatically adapt a learning hyper-parameter tuned for a specific minibatch size to variable minibatch sizes. To deal with variable-length sequences in text and other applications, CNTK allows variable minibatch sizes. Variable minibatch sizes also simplify the design and implementation of efficient distributed training and data randomization. However, it is well known that the learning rate and other hyper-parameters need to be adjusted for different minibatch sizes. This is why CNTK learners need such a design-time minibatch size: it tells them how to scale the hyper-parameters automatically.
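As a rough illustration of why a reference minibatch size is needed, the sketch below rescales a learning rate and a momentum value tuned for one minibatch size to another. The linear rule for the learning rate and the exponent rule for the momentum are assumptions made for this illustration, and the helper names `scale_learning_rate` and `scale_momentum` are hypothetical; they are not CNTK APIs and may differ from CNTK's internal scaling.

```python
# Illustrative sketch only: the scaling rules below are assumptions,
# not CNTK's actual implementation.

def scale_learning_rate(lr, ref_minibatch_size, actual_minibatch_size):
    """Linearly rescale a learning rate tuned for ref_minibatch_size."""
    return lr * actual_minibatch_size / ref_minibatch_size


def scale_momentum(momentum, ref_minibatch_size, actual_minibatch_size):
    """Rescale momentum so that the decay per sample stays the same."""
    return momentum ** (actual_minibatch_size / ref_minibatch_size)


# Hyper-parameters tuned for a minibatch size of 32, applied to other sizes.
for actual in (16, 32, 64):
    lr = scale_learning_rate(0.4, 32, actual)
    momentum = scale_momentum(0.9, 32, actual)
    print(actual, lr, momentum)
```

Under these assumed rules, a minibatch twice as large as the reference gets twice the learning rate and the square of the momentum, while a minibatch of the reference size leaves both values unchanged.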

Please see below a list of concrete examples of all supported learners using this simple set-up approach:

```python
mysgd = C.sgd(parameters=model.parameters, lr=0.4, minibatch_size=32)

mymomentum = C.momentum_sgd(parameters=model.parameters, lr=0.4, momentum=0.9, minibatch_size=32)

myadadelta = C.adadelta(parameters=model.parameters, lr=0.4, minibatch_size=32)

myadam = C.adam(parameters=model.parameters, lr=0.4, momentum=0.9, variance_momentum=0.9, minibatch_size=32)

myadagrad = C.adagrad(parameters=model.parameters, lr=0.4, minibatch_size=32)

myfsadagrad = C.fsadagrad(parameters=model.parameters, lr=0.4, momentum=0.9, variance_momentum=0.9, minibatch_size=32)

mynesterov = C.nesterov(parameters=model.parameters, lr=0.4, momentum=0.9, minibatch_size=32)

myrmsprop = C.rmsprop(parameters=model.parameters, lr=0.4, gamma=0.5, inc=1.2, dec=0.7, max=10, min=1e-8, minibatch_size=32)
```

### Passing a schedule of learning rates, momentum and other hyper-parameters directly as arguments to the learners

* The simplest way to set up a learner with a specified learning rate schedule is as follows:

```python
    C.cntk_learner(parameters=model.parameters, lr=<list of int or float>, minibatch_size=<None or int>, epoch_size=<None or int>, ...other parameters)
```

* The simplest way to set up a learner with a specified learning rate schedule and momentum is as follows:

```python
    C.cntk_learner(parameters=model.parameters, lr=<list of int or float>, momentum=<int or float>, minibatch_size=<None or int>, epoch_size=<None or int>, ...other parameters)
```

* The simplest way to set up a learner with a specified learning rate schedule, momentum and variance_momentum (i.e. squared gradient momentum) in Adam and the CNTK-specific FSAdaGrad is as follows:

```python
    C.cntk_learner(parameters=model.parameters, lr=<list of int or float>, momentum=<int or float>, variance_momentum=<int or float>, minibatch_size=<None or int>, epoch_size=<None or int>, ...other parameters)
```

For how a list of hyper-parameter values (e.g. learning rates) and the epoch size specification create a hyper-parameter schedule, please see [learning_rate_schedule](https://cntk.ai/pythondocs/cntk.learners.html?highlight=learning_rate_schedule#cntk.learners.learning_rate_schedule) for details.
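To make the idea concrete, here is a minimal sketch of how a list of values and an epoch size could map to a schedule. It assumes that each list entry applies for `epoch_size` samples and that the last entry is reused once the list is exhausted. The function `schedule_value` is a hypothetical helper written for this illustration, not a CNTK API; the linked documentation remains the authoritative description.

```python
# Illustrative sketch only: assumes one list entry per epoch_size samples,
# with the last entry reused for all remaining samples.

def schedule_value(values, epoch_size, samples_seen):
    """Return the entry of `values` that applies after `samples_seen` samples."""
    index = min(samples_seen // epoch_size, len(values) - 1)
    return values[index]


lr_values = [0.4, 0.1, 0.001]
for samples_seen in (0, 511, 512, 1024, 5000):
    print(samples_seen, schedule_value(lr_values, 512, samples_seen))
```

With `epoch_size=512`, this sketch returns 0.4 for the first 512 samples, 0.1 for the next 512, and 0.001 from then on.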

Concrete examples of all supported learners using this schedule-based set-up approach are as follows:

```python
mysgd = C.sgd(parameters=model.parameters, lr=[0.4, 0.1, 0.001], minibatch_size=32, epoch_size=512)

mymomentum = C.momentum_sgd(parameters=model.parameters, lr=[0.4, 0.1, 0.001], momentum=[0.9],
                            minibatch_size=32, epoch_size=512)

myadadelta = C.adadelta(parameters=model.parameters, lr=[0.4, 0.1, 0.001],
                        minibatch_size=32, epoch_size=512)

myadam = C.adam(parameters=model.parameters, lr=[0.4], momentum=[0.9, 0.1, 0.001], variance_momentum=[0.9],
                minibatch_size=32, epoch_size=512)

myadagrad = C.adagrad(parameters=model.parameters, lr=[0.4, 0.1, 0.001], minibatch_size=32, epoch_size=512)

myfsadagrad = C.fsadagrad(parameters=model.parameters, lr=[0.4, 0.1, 0.001], momentum=[0.9], variance_momentum=[0.9],
                          minibatch_size=32, epoch_size=512)

mynesterov = C.nesterov(parameters=model.parameters, lr=[0.4, 0.1, 0.001], momentum=[0.9],
                        minibatch_size=32, epoch_size=512)

myrmsprop = C.rmsprop(parameters=model.parameters, lr=[0.4, 0.1, 0.001], gamma=0.5, inc=1.2, dec=0.7, max=10, min=1e-8,
                      minibatch_size=32, epoch_size=512)
```

## Handling complex combinations: hyper-parameters designed for different minibatch sizes, with schedules based on different epoch sizes

Using C.adam as an example, we can set up the learner with hyper-parameters that are each designed for a different minibatch size and epoch size:

```python
lr = C.learning_parameter_schedule([0.4, 0.1, 0.001], minibatch_size = 8, epoch_size = 128)
momentum = C.learning_parameter_schedule([0.92, 0.91, 0.9], minibatch_size = 32, epoch_size = 64)
# Momentum values must lie in [0, 1); the last entry is 0.96 rather than 9.96.
var_momentum = C.learning_parameter_schedule([0.99, 0.96, 0.96], minibatch_size = 64, epoch_size = 256)
myadam = C.adam(parameters=model.parameters, lr=lr, momentum=momentum, variance_momentum=var_momentum)
```
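Because each schedule has its own epoch size, the hyper-parameters advance through their lists at different rates. The sketch below illustrates this for the learning rate and momentum lists used above (the variance momentum would behave the same way). It assumes each entry applies for one epoch's worth of samples and that the last entry is reused afterwards; `schedule_value` and `values_at` are hypothetical helpers for this illustration, not CNTK APIs.

```python
# Illustrative sketch only: assumes one list entry per epoch_size samples,
# with the last entry reused for all remaining samples.

def schedule_value(values, epoch_size, samples_seen):
    """Return the entry of `values` that applies after `samples_seen` samples."""
    index = min(samples_seen // epoch_size, len(values) - 1)
    return values[index]


# (values, epoch_size) per hyper-parameter; each advances independently.
schedules = {
    "lr": ([0.4, 0.1, 0.001], 128),
    "momentum": ([0.92, 0.91, 0.9], 64),
}


def values_at(samples_seen):
    """Look up every hyper-parameter at the same point in training."""
    return {
        name: schedule_value(values, epoch_size, samples_seen)
        for name, (values, epoch_size) in schedules.items()
    }


for samples_seen in (0, 100, 200):
    print(samples_seen, values_at(samples_seen))
```

After 100 samples, for instance, the momentum in this sketch has already moved to its second entry while the learning rate is still on its first.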

## Applying hyper-parameters as they are over all minibatches (possibly with different minibatch sizes)

We can set up the hyper-parameters to be applied to every minibatch as they are, without any scaling. Please note that this assumes all minibatches are of roughly the same size. This can be done by simply setting minibatch_size to 0, as follows, taking Adam as an example:

```python
myadam = C.adam(parameters=model.parameters, lr=[0.4], momentum=[0.9, 0.1, 0.001], variance_momentum=[0.9],
                minibatch_size=0, epoch_size=512)
```

## The complete story

Finally let us wrap up this manual with a very common use case which sets the same minibatch size for both minibatch source and the learner: 
```python
import cntk as C

x = C.input_variable(shape=(4,4))
label = C.input_variable(2)
model = C.layers.Dense(2, activation=C.sigmoid)(x)
loss = C.squared_error(model, label)
err = C.classification_error(model, label)

minibatch_size = 32
# input_file, streams, input_map and max_samples are placeholders:
# define them for your own dataset before running this snippet.
mb_source = C.io.MinibatchSource(C.io.CTFDeserializer(input_file, streams))

myadam = C.adam(parameters=model.parameters, lr=[0.4], momentum=[0.9, 0.1, 0.001], variance_momentum=[0.9],
                minibatch_size=minibatch_size, epoch_size=512)
# Pass the learner created above to the trainer.
trainer = C.train.Trainer(model, (loss, err), myadam, C.logging.ProgressPrinter(freq=10))

train = True  # train_minibatch returns False once no more updates are made
while train and trainer.total_number_of_samples_seen < max_samples:
    data = mb_source.next_minibatch(minibatch_size, input_map)
    train = trainer.train_minibatch(data)
```