# Cyclical Learning Rates for Training Neural Networks

Link to the [article](https://arxiv.org/pdf/1506.01186.pdf)

## 1. Introduction

Cyclical learning rates (CLR) let the learning rate vary cyclically between two boundary values instead of decreasing it monotonically. The paper reports that this improves metrics such as classification accuracy, often in fewer iterations, compared with a fixed or monotonically decreasing learning rate. Additionally, CLR requires essentially no additional computation.

The benefits of this method are demonstrated on the CIFAR-10, CIFAR-100 and ImageNet datasets with multiple neural networks.

An example:
Classification accuracy on CIFAR-10 with the same model architecture:
- Traditional learning rate: 81.4%, 70000 iterations
- CLR: 81.4%, 25000 iterations

## 2. Related work

Adaptive learning rate methods can be considered competitors to CLR in terms of performance, but they have a much higher computational cost. Although the two approaches are fundamentally different, CLR can be combined with adaptive learning rates.

## 3. Optimal Learning Rates

### 3.1. Cyclical Learning Rates

The premise is that increasing the learning rate may have a negative effect in the short term but a beneficial effect in the long term, because a larger learning rate makes it easier to traverse saddle points. The learning rate is therefore allowed to vary between two set boundaries, `base_lr` and `max_lr`.

There are multiple ways the learning rate can vary between two values:
- Triangular - the lr varies linearly between the boundaries
- Welch window - the lr varies parabolically between the boundaries
- Hann window - the lr varies sinusoidally between the boundaries
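
To make the triangular shape concrete, here is a minimal Python sketch that follows the formula given in the paper. The function name `triangular_lr` and the argument names are illustrative choices; `step_size` is the number of iterations in half a cycle.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate of the triangular policy at a given iteration."""
    # 1-based index of the current cycle; one cycle spans 2 * step_size iterations.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    # Linear interpolation between the two boundaries.
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

The rate starts at `base_lr`, reaches `max_lr` after `step_size` iterations, and is back at `base_lr` after `2 * step_size` iterations.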

There are also two policies that change the boundaries dynamically:
- Triangular2 - the lr varies linearly between the boundaries, but the learning rate difference is halved at the end of each cycle
- Exp_range - the lr varies between boundaries that decline by an exponential factor of $\gamma^{\text{iteration}}$
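
The two dynamic policies can be sketched as a scaling factor applied to the triangular wave. This is a minimal sketch, not the paper's reference code: the function name `cyclical_lr` and its arguments are illustrative, and it assumes that only the amplitude above `base_lr` shrinks while `base_lr` itself stays fixed, which is one common interpretation.

```python
import math


def cyclical_lr(iteration, base_lr, max_lr, step_size,
                policy="triangular", gamma=1.0):
    """Learning rate under the triangular, triangular2 or exp_range policy."""
    # Position inside the triangular wave (step_size = half a cycle).
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)

    if policy == "triangular2":
        # Halve the learning rate difference after every completed cycle.
        scale = 1.0 / (2 ** (cycle - 1))
    elif policy == "exp_range":
        # Shrink the difference by gamma ** iteration (gamma slightly below 1).
        scale = gamma ** iteration
    elif policy == "triangular":
        scale = 1.0
    else:
        raise ValueError(f"unknown policy: {policy}")

    # Assumption: base_lr stays fixed and only the amplitude is scaled.
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale
```

With `policy="triangular2"`, the peak of the second cycle lies halfway between `base_lr` and `max_lr`, the peak of the third at a quarter of the original difference, and so on.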

### 3.2. How to estimate a good value for the cycle length

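The cycle length is twice the `stepsize`, the number of iterations in half a cycle. Setting the stepsize to 2-10 times the number of iterations in an epoch is often a good choice. The paper shows that replacing each step of a constant learning rate with at least 3 cycles trains the network most of the way, and that 4 or more cycles yields even better results. It is recommended that you stop training at the end of a cycle, when the learning rate is at the lower boundary.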
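
As a worked example of this rule, the sketch below derives a stepsize range from the dataset size and batch size. The helper name `step_size_range` is hypothetical, and the batch size of 100 in the usage example is an assumed value chosen for illustration.

```python
import math


def step_size_range(num_train_samples, batch_size, low_factor=2, high_factor=10):
    """Return the (smallest, largest) suggested stepsize in iterations."""
    # One epoch is the number of mini-batches needed to see every sample once.
    iterations_per_epoch = math.ceil(num_train_samples / batch_size)
    return low_factor * iterations_per_epoch, high_factor * iterations_per_epoch


# CIFAR-10 has 50000 training images; a batch size of 100 gives 500 iterations per epoch.
low, high = step_size_range(50000, 100)
print(low, high)  # prints: 1000 5000
```

Any stepsize in that range, for instance 2000 iterations, gives a cycle of twice that length.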

### 3.3. How to estimate reasonable minimum and maximum boundary values

The "LR range test":
- choose a low and a high learning rate value for the test
- run your model for several epochs while letting the lr increase linearly from the low value to the high value
- plot the accuracy (or any other metric) on the Y axis and the learning rate on the X axis
- locate the point where the accuracy starts to increase and the point where it slows down, becomes ragged, or starts to fall
- set `base_lr` to the first point and `max_lr` to the second or, alternatively, take `max_lr` from the plot and set `base_lr` to $\frac{1}{3}$ or $\frac{1}{4}$ of `max_lr`
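
The data collection part of the range test can be sketched as follows. In the paper, the test ramps the learning rate up linearly from a low to a high value. This is only a sketch under assumptions: `lr_range_test` is an illustrative name, and `evaluate_at_lr` is a hypothetical callback, supplied by the user, that trains briefly at the given learning rate and returns the metric to plot.

```python
def lr_range_test(evaluate_at_lr, low_lr, high_lr, num_steps):
    """Return (learning_rate, metric) pairs for a linearly increasing lr.

    `evaluate_at_lr` is a hypothetical user-supplied callback: it should run
    some training at the given learning rate and return the metric
    (for example, validation accuracy).
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    increment = (high_lr - low_lr) / (num_steps - 1)
    history = []
    for step in range(num_steps):
        lr = low_lr + step * increment  # linear ramp from low_lr to high_lr
        history.append((lr, evaluate_at_lr(lr)))
    return history
```

Plotting the second element of each pair against the first gives the curve from which the two boundary points are read off.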

## 4. Experiments

This technique was tested on several datasets:

### 4.1. CIFAR-10 and CIFAR-100 Datasets

- Caffe CIFAR-10 architecture, Triangular2 policy - reduced the number of iterations needed to reach the same accuracy (81.4%) from 70000 to 25000
- Exp_range policy - outperformed standard exponential decay in both accuracy and efficiency
- Applying CLR to ResNets, Stochastic Depth networks and DenseNets improved accuracy compared to their standard learning rate technique counterparts

### 4.2. ImageNet Dataset

- AlexNet showed improvements in accuracy with fewer iterations needed
- Improved performance of GoogLeNet/Inception

These experiments indicate that CLR is effective for image classification tasks, matching or exceeding the accuracy of traditional learning rate schedules with reduced training time.

## 5. Conclusion

The experimental results confirm the benefits of CLR: it improves performance, requires no additional computational power, and its boundaries are easy to find.