Sparsity Runtime Integration with TF/TFLite for Latency Improvements #173

Open
alanchiao opened this issue Dec 6, 2019 · 29 comments
Labels: feature request, technique:pruning (Regarding tfmot.sparsity.keras APIs and docs)

@alanchiao

As suggested here, model pruning currently only provides benefits in model compression/size reduction. Further framework support is necessary to provide latency improvements in TF/TFLite.
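
For context, here is a minimal sketch of the current pruning workflow (assuming an existing Keras `model` and training data `x_train`/`y_train`; the 50% sparsity target is illustrative, not a recommendation):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap an existing Keras model so that low-magnitude weights are zeroed out
# during training (constant 50% sparsity, chosen only for illustration).
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.5, begin_step=0),
}
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, **pruning_params)

model_for_pruning.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])
# UpdatePruningStep keeps the pruning masks updated every training step.
model_for_pruning.fit(
    x_train, y_train, epochs=2,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export. The zeros remain in the weights,
# but the .tflite file is not smaller until it is compressed, and there is
# no latency benefit without runtime support for sparse execution.
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
tflite_model = tf.lite.TFLiteConverter.from_keras_model(final_model).convert()
```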

@sujoyrc commented Mar 30, 2020

When do you think this will be included in a TensorFlow/TFLite release? Is there a targeted timeline? Based on this, we are planning to do internal development if it is not expected within this year (2020).

@raziel commented Mar 30, 2020

Hi.
We're expecting a Q2/Q3 release date, though full TFLite kernel support will be an ongoing process after that (i.e. not all TFLite kernels will have sparse execution support).

Also, we're hoping the current working-from-home situation won't affect things further.

Thanks

@sujoyrc commented Mar 31, 2020

Thank you

@raziel closed this as completed Apr 3, 2020
@sujoyrc commented Apr 3, 2020

Why is this closed? Will this be integrated in the next version?

@alanchiao (Author)

Reopened. It will not necessarily be integrated in the next release.

@alanchiao reopened this Apr 7, 2020
@shariq-audiofocus

Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.

Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

@paulaksm

> Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.
>
> Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

Same here! Currently my *.tflite model and its sparse counterpart have the same storage requirements.

If TFLite could detect the zeros and change their type to uint8, this would make a huge difference in model size (MBs).

@gordinmitya

@paulaksm @shariq-audiofocus
Have you tried structural pruning instead?
If you are only worried about storage, why not consider gzip-compressing the model file (like cryptopp gzip)?
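
For what it's worth, a minimal sketch of that idea using Python's standard gzip module (the file names are hypothetical): because a pruned model is mostly zero weights, generic compression recovers most of the size benefit even though the .tflite file itself stays the same size on disk.

```python
import gzip
import os
import shutil

# Hypothetical paths to a dense baseline and its pruned counterpart.
for path in ('model_dense.tflite', 'model_pruned.tflite'):
    # Write a gzip-compressed copy next to the original file.
    with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)
    print(path,
          'raw:', os.path.getsize(path), 'bytes,',
          'gzipped:', os.path.getsize(path + '.gz'), 'bytes')

# The pruned file typically compresses far better, since long runs of zero
# weights are highly compressible; the dense file barely shrinks.
```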

@shariq-audiofocus

@gordinmitya Thanks, I hadn't heard of structural pruning; it seems like that could lead to smaller tflite binaries if it eliminates entire filters. Is structural pruning on the model-optimization roadmap?

Re: storage - I'm not worried about offline storage. I'm worried about latency & power usage during inference on tiny edge devices (probably MCUs). ARM is developing processors [1] that can do online decompression of weights on-the-fly during inference. This is interesting because now you can fit larger models in memory by utilizing their compression technique. If the model fits in memory (SRAM) you get lower latency & power usage. I'm wondering if the model-optimization & TFLite team are thinking about this or if it's outside their scope.

[1] https://www.theregister.com/2020/02/10/arm_cortex_m_ai_accelerator/ - "To fit this all into a small memory and silicon footprint, the microNPU can decompress trained INT8 models on the fly for inference."

@willbattel (Contributor)

Structural pruning is really important to my team, too. The current zero-weight pruning for compression is nice but we're far more interested in reduced file sizes to be able to fit models into SRAM instead of DRAM.

I'm hopeful that this library will eventually support structural pruning, but so far I haven't seen any mention of it.

@edumotya

Any updates on this? Can we expect latency improvements for our pruned models?

@pedroska777

Can you estimate a release date for the inference-time optimization?

@liyunlu0618 (Contributor)

Sorry for keeping you waiting. We're actively working on making the initial release of sparse inference support in TFLite. It's hard to give an exact date but hopefully before Q3 ends. Thanks for your patience!

@liyunlu0618 (Contributor)

A spoiler:
https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/examples/sparsity/keras/mnist/mnist_e2e.py

Please note that we're still finalizing the API. The workflow in the released version may look different.
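
In the spirit of the linked example, a rough sketch of block-sparse pruning (assuming an existing Keras `model`; the sparsity target and the (4, 1) block shape here are illustrative, and the finalized API may differ):

```python
import tensorflow_model_optimization as tfmot

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.75, begin_step=0),
    # Prune in blocks so that groups of weights become zero together; this
    # block structure is what the sparse TFLite kernels can exploit with SIMD.
    'block_size': (4, 1),
}
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, **pruning_params)
```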

@ghost commented Aug 28, 2020

@liyunlu0618 I'm looking at your approach right now and trying to implement it. Does this latency-improved inference also work for Conv filters and not only Dense filters (how would one do it for Conv filters)? Also, why is the block [4,1] exactly? How does that ensure inference-time improvements? Thanks!

@liyunlu0618 (Contributor)

For the Conv op we only support these hosted models at the moment:
https://github.com/google-research/google-research/tree/master/fastconvnets

We need the block config to use SIMD instructions on the Arm NEON architecture. Feel free to check out the kernel here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/neon_tensor_utils.cc#L1962-L1990

@js14083 commented Dec 29, 2020

Hi, are there updates on this?

@alanchiao pinned this issue Jan 30, 2021
@dathudeptrai

@alanchiao any update on progress?

@liyunlu0618 (Contributor) commented Feb 24, 2021

This is currently available as an experimental feature in TFLite.

For sparse CNNs, it needs to run with the XNNPack delegate. Please refer to this.

For sparse RNNs and transformers, TFLite has built-in support. This has a few examples.
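
As a rough sketch of the conversion side (assuming a pruned and stripped Keras model `final_model`; the sparsity flag is experimental and may change):

```python
import tensorflow as tf

# Convert the pruned model, asking the converter to store sparse tensors in a
# compact format that the sparse kernels / XNNPack delegate can consume.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
tflite_model = converter.convert()

with open('sparse_model.tflite', 'wb') as f:
    f.write(tflite_model)

# Run inference; for sparse CNNs the latency benefit comes from the XNNPack
# delegate, which must be enabled in the interpreter build you use.
interpreter = tf.lite.Interpreter(model_path='sparse_model.tflite')
interpreter.allocate_tensors()
```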

We'll have formal blog posts/docs soon. In the meantime, if you could provide more details on your use case, I can suggest how to apply this optimization accordingly. Key points that are helpful:

  1. Model type and key operators
  2. Hardware backend you're targeting
  3. Whether to combine with quantization
  4. Target performance/accuracy numbers

@dathudeptrai

@liyunlu0618 thanks for the information, I will play around with it a bit :D. Do you know when the documentation will be finished?

@aa12356jm

mark

@eejlny commented Mar 1, 2021

Hello,

I was wondering if there is any intention of adding structural pruning support for conv layers (in addition to dense layers)? Is this possible to do, or does some fundamental issue prohibit it? Thanks

@shariq-audiofocus

@liyunlu0618 - My use case:

  1. Online, Streaming, Speech-Enhancement-like Task. Input Audio -> Dense -> LSTM -> Dense -> Output Audio. During training the Dense layers are actually CONV layers but I don't think that matters. Current model is ~8MB after int8 quantization, would like < ~4MB with sparsity/pruning features.
  2. Now: processor on an iPhone 11, or possibly edge TPU (Coral Dev Board). Later (2022): Syntiant's NDP120 or NDP500 chip [1].
  3. Yes need quantization + compression via pruning.
  4. Last time I checked quantization had minimal or no effect, 8dB -> 7.9dB. Hoping for similar results with 50% sparsity/structured pruning compression.

[1] https://www.syntiant.com/ndp120

@willbattel (Contributor)

Any chance we will get support for pruned CNNs on other TFLite delegates? We rely on the NNAPI and CoreML delegates for quick and efficient inference on Android and iOS, respectively, but so far it looks like XNNPack is the only supported delegate.

@STAROFWIND

I have the same issue here. After pruning, I got the same model size and the same inference time. Even after I convert to tflite, it only runs on the CPU, so the inference time is still not good. XNNPack doesn't support my network. So, could you tell me what I can do next to improve the inference time with my pruned model? Thank you so much!

@zoythum commented Nov 16, 2022

Is there any update on this topic? What's the correct way to improve the inference time of a model with pruning?

@sampathrajapaksha

It seems there is still no proper solution for improving the inference time of a pruned model.

@shariq-audiofocus

@sampathrajapaksha - We've found the best approach is to do knowledge distillation (KD) to shrink your model and therefore improve inference time. This paper has some good ideas: https://arxiv.org/pdf/1910.01108.pdf and shows you can do it with minimal performance degradation. We're still experimenting, but this seems to be a better path forward than relying on pruning optimizations.
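
For anyone following along, a minimal sketch of that kind of distillation loop (the `teacher` and `student` Keras models, temperature, and loss weighting are hypothetical placeholders, not the recipe from the paper):

```python
import tensorflow as tf

temperature = 3.0  # softens the teacher's logits; value chosen for illustration
alpha = 0.5        # balance between hard-label loss and distillation loss
kld = tf.keras.losses.KLDivergence()
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    # The student learns from both the ground-truth labels and the teacher's
    # softened output distribution.
    teacher_logits = teacher(x, training=False)
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        hard_loss = ce(y, student_logits)
        soft_loss = kld(
            tf.nn.softmax(teacher_logits / temperature),
            tf.nn.softmax(student_logits / temperature)) * temperature ** 2
        loss = alpha * hard_loss + (1.0 - alpha) * soft_loss
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```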

@sampathrajapaksha

@shariq-audiofocus Thank you very much for sharing this with me. My use case is quite similar to yours. I'll read this and see how I can apply this to reduce inference time
