Sparsity Runtime Integration with TF/TFLite for Latency Improvements #173

Open
alanchiao opened this issue Dec 6, 2019 · 29 comments
Labels: feature request, technique:pruning (Regarding tfmot.sparsity.keras APIs and docs)

@alanchiao

As suggested here, model pruning currently only provides benefits in model compression/size reduction. Further framework support is necessary to provide latency improvements in TF/TFLite.
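
For context, here is a minimal sketch of the current pruning workflow (assuming an existing Keras `model` and training data `x_train`/`y_train`; the 50% sparsity target is illustrative, not a recommendation):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap an existing Keras model so that low-magnitude weights are zeroed out
# during training (constant 50% sparsity, chosen only for illustration).
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.5, begin_step=0),
}
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, **pruning_params)

model_for_pruning.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])
# UpdatePruningStep keeps the pruning masks updated every training step.
model_for_pruning.fit(
    x_train, y_train, epochs=2,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export. The zeros remain in the weights,
# but the .tflite file is not smaller until it is compressed, and there is
# no latency benefit without runtime support for sparse execution.
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
tflite_model = tf.lite.TFLiteConverter.from_keras_model(final_model).convert()
```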

@sujoyrc commented Mar 30, 2020

When do you think this will be included in a TensorFlow/TFLite release? Is there a targeted timeline? Based on this, we are planning to do internal development if it is not expected within this year (2020).

@raziel commented Mar 30, 2020

Hi.
We're expecting a Q2/Q3 release date, though full TFLite kernel support will be an ongoing process after that (i.e. not all TFLite kernels will have sparse execution support).

Also, we're hoping the current working-from-home situation won't affect things further.

Thanks

@sujoyrc commented Mar 31, 2020

Thank you

@raziel closed this as completed Apr 3, 2020
@sujoyrc commented Apr 3, 2020

Why is this closed? Will this be integrated in the next version?

@alanchiao (Author)

Reopened. It will not necessarily be integrated in the next release.

@alanchiao reopened this Apr 7, 2020
@shariq-audiofocus

Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.

Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

@paulaksm

> Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.
>
> Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

Same here! Currently my *.tflite model and its sparse counterpart have the same storage requirements.

If TFLite could detect the zeros and change their type to uint8, this would make a huge difference in model size (MBs).

@gordinmitya

@paulaksm @shariq-audiofocus
Have you tried structural pruning instead?
If you are only worried about storage, why not consider gzip-compressing the model file (like cryptopp gzip)?
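
For what it's worth, a minimal sketch of that idea using Python's standard gzip module (the file names are hypothetical): because a pruned model is mostly zero weights, generic compression recovers most of the size benefit even though the .tflite file itself stays the same size on disk.

```python
import gzip
import os
import shutil

# Hypothetical paths to a dense baseline and its pruned counterpart.
for path in ('model_dense.tflite', 'model_pruned.tflite'):
    # Write a gzip-compressed copy next to the original file.
    with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)
    print(path,
          'raw:', os.path.getsize(path), 'bytes,',
          'gzipped:', os.path.getsize(path + '.gz'), 'bytes')

# The pruned file typically compresses far better, since long runs of zero
# weights are highly compressible; the dense file barely shrinks.
```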

@shariq-audiofocus

@gordinmitya Thanks, I hadn't heard of structural pruning; it seems like that could lead to smaller tflite binaries if it eliminates entire filters. Is structural pruning on the model-optimization roadmap?

Re: storage - I'm not worried about offline storage. I'm worried about latency & power usage during inference on tiny edge devices (probably MCUs). ARM is developing processors [1] that can do online decompression of weights on-the-fly during inference. This is interesting because now you can fit larger models in memory by utilizing their compression technique. If the model fits in memory (SRAM) you get lower latency & power usage. I'm wondering if the model-optimization & TFLite team are thinking about this or if it's outside their scope.

[1] https://www.theregister.com/2020/02/10/arm_cortex_m_ai_accelerator/ - "To fit this all into a small memory and silicon footprint, the microNPU can decompress trained INT8 models on the fly for inference."

@willbattel (Contributor)

Structural pruning is really important to my team, too. The current zero-weight pruning for compression is nice but we're far more interested in reduced file sizes to be able to fit models into SRAM instead of DRAM.

I'm hopeful that this library will eventually support structural pruning, but so far I haven't seen any mention of it.

@edumotya

Any updates on this? Can we expect latency improvements for our pruned models?

@pedroska777

Can you estimate a release date for the inference-time optimization?

@liyunlu0618 (Contributor)

Sorry for keeping you waiting. We're actively working on making the initial release of sparse inference support in TFLite. It's hard to give an exact date but hopefully before Q3 ends. Thanks for your patience!

@liyunlu0618 (Contributor)

A spoiler:
https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/examples/sparsity/keras/mnist/mnist_e2e.py

Please note that we're still finalizing the API. The workflow in the released version may look different.
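
In the spirit of the linked example, a rough sketch of block-sparse pruning (assuming an existing Keras `model`; the sparsity target and the (4, 1) block shape here are illustrative, and the finalized API may differ):

```python
import tensorflow_model_optimization as tfmot

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.75, begin_step=0),
    # Prune in blocks so that groups of weights become zero together; this
    # block structure is what the sparse TFLite kernels can exploit with SIMD.
    'block_size': (4, 1),
}
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, **pruning_params)
```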

@ghost commented Aug 28, 2020

@liyunlu0618 I'm looking at your approach right now and trying to implement it. Does this latency-improved inference also work for Conv filters and not only Dense filters (how would one do it for Conv filters)? Also, why is the block [4,1] exactly? How does that ensure inference-time improvements? Thanks!

@liyunlu0618 (Contributor)

For the Conv op we only support these hosted models at the moment:
https://github.com/google-research/google-research/tree/master/fastconvnets

We need the block config to use SIMD instructions on the Arm NEON architecture. Feel free to check out the kernel here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/neon_tensor_utils.cc#L1962-L1990

@js14083 commented Dec 29, 2020

Hi, are there updates on this?

@alanchiao pinned this issue Jan 30, 2021
@dathudeptrai

@alanchiao any update on progress?

@liyunlu0618 (Contributor) commented Feb 24, 2021

This is currently available as an experimental feature in TFLite.

For sparse CNNs, it needs to run with the XNNPack delegate. Please refer to this.

For sparse RNNs and transformers, TFLite has built-in support. This has a few examples.
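
As a rough sketch of the conversion side (assuming a pruned and stripped Keras model `final_model`; the sparsity flag is experimental and may change):

```python
import tensorflow as tf

# Convert the pruned model, asking the converter to store sparse tensors in a
# compact format that the sparse kernels / XNNPack delegate can consume.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
tflite_model = converter.convert()

with open('sparse_model.tflite', 'wb') as f:
    f.write(tflite_model)

# Run inference; for sparse CNNs the latency benefit comes from the XNNPack
# delegate, which must be enabled in the interpreter build you use.
interpreter = tf.lite.Interpreter(model_path='sparse_model.tflite')
interpreter.allocate_tensors()
```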

We'll have formal blog posts/docs soon. In the meantime, if you could provide more details on your use case, I can suggest how to apply this optimization accordingly. Key points that are helpful:

  1. Model type and key operators
  2. Hardware backend you're targeting
  3. Whether to combine with quantization
  4. Target performance/accuracy numbers

@dathudeptrai

@liyunlu0618 thanks for the information, I will play around with it a bit :D. Do you know when the documentation will be finished?

@aa12356jm

mark

@eejlny commented Mar 1, 2021

Hello,

I was wondering if there is any intention of adding structural pruning support for conv layers (in addition to dense layers)? Is this possible to do, or does some fundamental issue prohibit it? Thanks

@shariq-audiofocus

@liyunlu0618 - My use case:

  1. Online, Streaming, Speech-Enhancement-like Task. Input Audio -> Dense -> LSTM -> Dense -> Output Audio. During training the Dense layers are actually CONV layers but I don't think that matters. Current model is ~8MB after int8 quantization, would like < ~4MB with sparsity/pruning features.
  2. Now: processor on an iPhone 11, or possibly edge TPU (Coral Dev Board). Later (2022): Syntiant's NDP120 or NDP500 chip [1].
  3. Yes need quantization + compression via pruning.
  4. Last time I checked quantization had minimal or no effect, 8dB -> 7.9dB. Hoping for similar results with 50% sparsity/structured pruning compression.

[1] https://www.syntiant.com/ndp120

@willbattel (Contributor)

Any chance we will get support for pruned CNNs on other TFLite delegates? We rely on the NNAPI and CoreML delegates for quick and efficient inference on Android and iOS, respectively, but so far it looks like XNNPack is the only supported delegate.

@STAROFWIND

I have the same issue here. After pruning, I got the same model size and the same inference time. Even after I convert to tflite, it only runs on the CPU, so the inference time is still not good. XNNPack doesn't support my network. So, could you tell me what I can do next to improve the inference time with my pruned model? Thank you so much!

@zoythum commented Nov 16, 2022

Is there any update on this topic? What's the correct way to improve the inference time of a model with pruning?

@sampathrajapaksha

It seems there is still no proper solution for improving the inference time of a pruned model.

@shariq-audiofocus

@sampathrajapaksha - We've found the best approach is to do knowledge distillation (KD) to shrink your model and therefore improve inference time. This paper has some good ideas: https://arxiv.org/pdf/1910.01108.pdf and shows you can do it with minimal performance degradation. We're still experimenting, but this seems to be a better path forward than relying on pruning optimizations.
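
For anyone following along, a minimal sketch of that kind of distillation loop (the `teacher` and `student` Keras models, temperature, and loss weighting are hypothetical placeholders, not the recipe from the paper):

```python
import tensorflow as tf

temperature = 3.0  # softens the teacher's logits; value chosen for illustration
alpha = 0.5        # balance between hard-label loss and distillation loss
kld = tf.keras.losses.KLDivergence()
ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    # The student learns from both the ground-truth labels and the teacher's
    # softened output distribution.
    teacher_logits = teacher(x, training=False)
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        hard_loss = ce(y, student_logits)
        soft_loss = kld(
            tf.nn.softmax(teacher_logits / temperature),
            tf.nn.softmax(student_logits / temperature)) * temperature ** 2
        loss = alpha * hard_loss + (1.0 - alpha) * soft_loss
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```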

@sampathrajapaksha

@shariq-audiofocus Thank you very much for sharing this with me. My use case is quite similar to yours. I'll read this and see how I can apply this to reduce inference time
