Efficient On-device Training via Gradient Filtering

Yuedong Yang, Guihong Li, Radu Marculescu

This is the official repo for the paper Efficient On-device Training via Gradient Filtering, accepted at CVPR 2023.

Abstract

Despite its importance for federated learning, continuous learning and many other applications, on-device training remains an open problem for EdgeAI. The problem stems from the large number of operations (e.g., floating point multiplications and additions) and memory consumption required during training by the back-propagation algorithm. Consequently, in this paper, we propose a new gradient filtering approach which enables on-device CNN model training. More precisely, our approach creates a special structure with fewer unique elements in the gradient map, thus significantly reducing the computational complexity and memory consumption of back propagation during training. Extensive experiments on image classification and semantic segmentation with multiple CNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and wide applicability of our approach. For example, compared to SOTA, we achieve up to 19 $\times$ speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Finally, our method is easy to implement and deploy; over 20 $\times$ speedup and 90% energy savings have been observed compared to highly optimized baselines in MKLDNN and CUDNN on NVIDIA Jetson Nano. Consequently, our approach opens up a new direction of research with a huge potential for on-device training.

Features

Reduce Computation and Memory Complexity for Backpropagation via Gradient Filter

Because of its high computation and memory complexity, backpropagation (BP) is the key bottleneck for CNN training. Our method reduces this complexity by introducing the gradient filter (highlighted in red in the bottom figure). The gradient filter approximates the gradient map with one that has fewer unique elements and a special structure. As a result, the BP operations for a convolution layer can be greatly simplified, which saves both computation and memory.
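To make the idea of "fewer unique elements" concrete, here is a minimal NumPy sketch. It assumes the approximation is a patch-wise average over non-overlapping r x r spatial patches, which this README does not spell out. The function name `gradient_filter` is hypothetical and is not the API used in this repo.

```python
import numpy as np

def gradient_filter(grad, r=2):
    """Replace every r x r spatial patch of `grad` with its mean value.

    Assumption for this sketch: grad has shape (N, C, H, W) and
    H and W are divisible by r.
    """
    n, c, h, w = grad.shape
    if h % r or w % r:
        raise ValueError("H and W must be divisible by r in this sketch")
    # Split each spatial axis into (patch index, position inside the patch).
    patches = grad.reshape(n, c, h // r, r, w // r, r)
    # Keep a single value (the mean) per patch.
    means = patches.mean(axis=(3, 5), keepdims=True)
    # Broadcast the mean back over the patch to restore the original shape.
    return np.broadcast_to(means, patches.shape).reshape(n, c, h, w)

grad = np.arange(16, dtype=np.float64).reshape(1, 1, 4, 4)
filtered = gradient_filter(grad, r=2)
print(filtered[0, 0])
print("unique values:", np.unique(grad).size, "->", np.unique(filtered).size)
```

In this toy case the 4 x 4 map goes from 16 unique values to 4, one per patch. A structure like this is what would allow the later BP computations to work on one value per patch instead of one value per element.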

Over 10 $\times$ Speedup with Marginal Accuracy Loss

Our method achieves significant speedup on both edge devices (Raspberry Pi 3 and NVIDIA Jetson Nano) and desktop devices with marginal accuracy loss.

Environment Setup

Create and activate conda virtual environment

```
conda create -n gradfilt python=3.8
conda activate gradfilt
```

Install PyTorch 1.13.1

Here we assume a system with an x86_64 CPU, an NVIDIA GPU with CUDA 11.7, and Ubuntu 20.04. For systems with different configurations, please refer to PyTorch's official installation guide.
```
conda install pytorch==1.13.1 torchvision==0.14.1 pytorch-cuda=11.7 -c pytorch -c nvidia
```

Install dependencies for the classification task

```
pip install "jsonargparse[signatures]" pytorch_lightning==1.6.5 torchmetrics==0.9.2 pretrainedmodels
git clone https://github.com/mit-han-lab/mcunet.git
cd mcunet
git checkout be404ea0dbb7402783e1c825425ac257ed35c5fc
python setup.py install
cd ..
```

Install dependencies for semantic segmentation
```
pip install openmim
mim install mmcv-full==1.6.1
cd segmentation/mmsegmentation
pip install -e .
```
The installation process for MMCV can be very slow if there is no pre-compiled mmcv-full package. In that case, please refer to the MMCV installation guide and build MMCV from source.

Install dependencies for latency test

The latency test depends on OneDNN (a.k.a. MKLDNN) v2.6 and CUDNN v8. OneDNN can be installed via conda:
```
conda install -c conda-forge onednn==2.6
```
For CUDNN installation, please refer to the CUDNN website.

Build latency test

```
cd latency
mkdir build
cd build
cmake ..
make
```

Extract pretrained / calibrated model checkpoints and reference experiment logs

Download link:
Extract checkpoints:
```
cd classification
tar xvf <path to cls_pretrained_ckpts.tar.gz>
cd ..
cd segmentation
tar xvf <path to seg_calib_ckpt.tar.gz>
cd ..
```
Extract reference experiment logs anywhere you like.

Setup datasets

Classification:
- CIFAR10/100 will be downloaded automatically under classification/data/cifar[10|100]
- Download ImageNet and place the train, val folders under classification/data/imagenet
Semantic segmentation: Please refer to Prepare Datasets for Cityscapes and Pascal VOC12 Aug.
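
Before launching an ImageNet classification run, it can help to confirm that the folders are where the steps above expect them. The following is a small sketch, not part of this repo; the helper name `missing_imagenet_dirs` is hypothetical, and it assumes you run it from the repository root.

```python
from pathlib import Path

def missing_imagenet_dirs(root="classification/data/imagenet"):
    """Return the expected ImageNet sub-folders that do not exist under `root`."""
    root = Path(root)
    # The README asks for `train` and `val` folders under the ImageNet root.
    return [str(root / name) for name in ("train", "val")
            if not (root / name).is_dir()]

missing = missing_imagenet_dirs()
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("ImageNet layout looks complete.")
```

The check only looks at the two top-level folders; it does not verify the images or class sub-folders inside them.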

Experiments for Accuracy Evaluation

We provide example scripts for launching and reproducing our experimental results under classification/scripts and segmentation/scripts.

For example, to train MobileNet-V2 on the CIFAR100 dataset using our gradient filter with a patch size of $2\times 2$, run:

```
cd classification
bash scripts/mbv2/mbv2_cifar100_r2.sh
```

Results are stored under classification/runs.

Our experimental framework is config-based, so you can try different experimental setups by simply changing config files under classification/configs and segmentation/configs.

We provide the configs and launching scripts for:

[Classification] MobileNet-V2: configs, scripts
[Classification] MCUNet: configs, scripts
[Classification] ResNet-18: configs, scripts
[Classification] ResNet-34: configs, scripts
[Segmentation] DeepLabV3-ResNet18: configs, scripts
[Segmentation] DeepLabV3-MobileNetV2: configs
[Segmentation] FCN-ResNet18: configs
[Segmentation] PSPNet-ResNet18: configs
[Segmentation] PSPNet-MobileNetV2: configs
[Segmentation] UPerNet-ResNet18: configs

Launching scripts for other segmentation models can be adapted from DeepLabV3-ResNet18 by replacing the config file and checkpoints.

Experiments for Latency Evaluation

We provide an example implementation of our gradient filter with OneDNN and CUDNN under latency. To run the latency test, simply launch run_test.sh under the latency folder. Results are saved in cpu.csv and gpu.csv.
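
To take a quick look at the saved results, a generic CSV reader is enough. The sketch below makes two assumptions that this README does not confirm: that the two files end up in the latency folder, and that each file starts with a header row. It makes no assumption about the column names and simply prints whatever it finds.

```python
import csv
from pathlib import Path

def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by its header row."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

for name in ("cpu.csv", "gpu.csv"):
    # Assumed location; adjust if run_test.sh writes the files elsewhere.
    path = Path("latency") / name
    if not path.is_file():
        print(f"{path} not found; run the latency test first")
        continue
    rows = load_rows(path)
    columns = list(rows[0].keys()) if rows else []
    print(f"{path}: {len(rows)} rows, columns = {columns}")
```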

Cite

@inproceedings{yang2023efficient,
  title={Efficient On-device Training via Gradient Filtering},
  author={Yang, Yuedong and Li, Guihong and Marculescu, Radu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3811--3820},
  year={2023}
}

Acknowledgement

Code for the classification backbones is adopted from segmentation models. Experiments for semantic segmentation are developed based on MMSegmentation.
