Add Anti Zero-Drift functionality for Sparsity-Aware clustering #520
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with "@googlebot I signed it!".
What to do if you already signed the CLA: Individual signers | Corporate signers
ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
@alanchiao, could you please check the CLA bot again? Matteo has contributed to TF before so the check should be passing.
Regarding the CLA, it seemed like there was an issue on our side. Now sorted.
@googlebot I signed it!
CLAs look good, thanks! ℹ️ Googlers: Go here for more info.
It's clear that this PR does enable us to preserve sparsity, which is nice. Related to #513: as an example, for pruning + clustering on MobileNet V2 with sparsity-awareness, we see a 7.44% accuracy drop to get to 2.32 MB, or a 6.45% drop to get to 2.71 MB. However, with just clustering, there are results with a ~3% accuracy drop to get to 2.98 MB and a <1% drop to get to 3.13 MB. It may be interesting to pick an approximate target model size based on what would enable you to fit in a smaller memory and compare all three. Until there is evidence, I suggest creating an experimental API (e.g. experimental.cluster_weights) which we can encourage people to run experiments through, to try to create better compressed models at the same accuracy. I imagine the results also don't factor in Ruomei's PR with better gradients.
To elaborate on the discussion in the call: for the approach of creating something under clustering.experimental, it should just require a couple of lines of changes on top of what you have currently. The starting point is the nice property that the path of the public API is independent of where the code itself lives (the definition of the tfmot.clustering.keras module is here), so the hypothetical change of moving the existing code would not change the public path. With the actual code outside of core/api, we can add a thin layer by renaming the current cluster_weights fn to _cluster_weights with all of the flags, and then make experimental.cluster_weights and cluster_weights just call into that. It's probably better to have experimental.cluster_weights live under python/core/clustering/keras/experimental also, but you do have flexibility. This approach is very safe and equally maintainable imo. Alternatively, it could be that just adding something like experimental_preserve_sparsity=False to the current stable cluster_weights would also work. Independently, we can publicize this in a page under tensorflow.org/model_optimization focused on combining techniques, with a link to a GitHub issue where we encourage people to try it out / share results and model files, and see what the response is.
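To make the layering concrete, here is a minimal sketch of that thin-wrapper approach; the stubbed bodies and the `experimental_cluster_weights` name are illustrative assumptions (in the real package the experimental variant would live under python/core/clustering/keras/experimental and be re-exported as experimental.cluster_weights), not the merged implementation:

```python
# Hypothetical sketch of the thin-wrapper layering (simplified signatures,
# stubbed bodies); not the actual tfmot code.

def _cluster_weights(to_cluster, number_of_clusters, cluster_centroids_init,
                     preserve_sparsity=False, **kwargs):
  """Private entry point that carries every flag, including preserve_sparsity."""
  # The existing ClusterWeights wrapping logic would live here.
  return {
      'to_cluster': to_cluster,
      'number_of_clusters': number_of_clusters,
      'cluster_centroids_init': cluster_centroids_init,
      'preserve_sparsity': preserve_sparsity,
      **kwargs,
  }


def cluster_weights(to_cluster, number_of_clusters, cluster_centroids_init,
                    **kwargs):
  """Stable API: unchanged signature, sparsity preservation stays off."""
  return _cluster_weights(to_cluster, number_of_clusters,
                          cluster_centroids_init, **kwargs)


def experimental_cluster_weights(to_cluster, number_of_clusters,
                                 cluster_centroids_init,
                                 preserve_sparsity=False, **kwargs):
  """Experimental API: exposes preserve_sparsity on top of the same core."""
  return _cluster_weights(to_cluster, number_of_clusters,
                          cluster_centroids_init,
                          preserve_sparsity=preserve_sparsity, **kwargs)
```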
Thanks Alan. Agree. We will enable this feature via an experimental API and reach out to the community to have a go at combining two optimization techniques. In the meantime, we will also continue such experiments ourselves.
Force-pushed from 996c0a9 to 5aa8748.
I've updated the changes to make the new feature experimental, as per the suggestions above.
def cluster_weights(to_cluster,
                    number_of_clusters,
                    cluster_centroids_init,
                    preserve_sparsity=False,
I'm wondering: since preserve_sparsity is available through experimental.cluster_weights, here preserve_sparsity should be fixed inside the function (not available as a function arg).
That's actually a mistake! In fact, we don't have it in our internal master. I've now removed preserve_sparsity from the method's signature.
Force-pushed from 5aa8748 to 3236ce4.
Looks good to me. Just a minor thing to consider.
def testSparsityIsPreservedDuringTraining(self):
  """Verifies that training a clustered model does not destroy the sparsity of the weights."""
  original_model = keras.Sequential([
      layers.Dense(5, input_shape=(5,)),
Are we guaranteed to have zeros in this model? It may be beneficial to make initial weights deterministic in this test, with the known number and location of zeros.
Fixed in the latest revision. I've set the random seed to a value that guarantees that some of the weights are zero for that test (there are now 5 zero weights at each test run).
Thanks!
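For reference, a minimal sketch of the deterministic-weights alternative suggested above; the layer shape and zero positions are illustrative assumptions, not the test as merged:

```python
# Hedged sketch: pin the Dense kernel explicitly so the number and location
# of zeros are known, instead of relying on a random seed.
import numpy as np
from tensorflow import keras

original_model = keras.Sequential([
    keras.layers.Dense(5, input_shape=(5,)),
])

# Deterministic kernel with exactly 5 zeros on the diagonal.
kernel = np.ones((5, 5), dtype=np.float32)
np.fill_diagonal(kernel, 0.0)
bias = np.zeros((5,), dtype=np.float32)
original_model.layers[0].set_weights([kernel, bias])

# The test can now assert the zero count/positions before and after
# training the clustered model with sparsity preservation enabled.
assert int(np.sum(original_model.layers[0].get_weights()[0] == 0.0)) == 5
```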
Force-pushed from 63cb72e to 9936522.
@nutsiepully, @alanchiao, can you please help progress this PR?
* Implemented the zero-centroid initialization for all clustering methods
* Implemented the sparsity masks for forward and backward propagation
* Added preserve_sparsity class member to ClusterWeights to make sparsity preservation optional for all clustering methods
* Refactored AbstractCentroidsInitialisation to include zero-centroid initialization for all init types
* Added unit tests around the new changes
…rimental)
* Created new experimental API for sparsity-aware clustering
* Kept the original API implementation
* Moved the new feature to a new experimental package, making the original implementation private
* Updated the unit tests accordingly
* Created init and BUILD files for the new experimental package
…rimental)
* Fixed the signature of the public cluster_weights method
…rimental)
* Set the random seed in the sparsity preservation test to a specific value to make sure that some of the weights are null
Force-pushed from 9936522 to a12e20e.
@nutsiepully, any comments on this change?
@alanchiao, @nutsiepully, please help progress this change. We are happy with it on our side. The remaining bit is to align on how this API should be exposed. Following the comments above, we have exposed this API as experimental. If there are no further comments at the moment, it would be good to merge it and proceed as discussed.
Looks good to me given exposure as an experimental API (tfmot.clustering.keras.experimental.cluster_weights). The most important thing right now is to gauge community interest in this feature by explicitly soliciting for it in an issue, and also seeing whether any models produced become useful enough on the accuracy/size curve that they'd practically be used. If the interest isn't large enough, we can remove it given the experimental nature; if it is, we can finalize how it's exposed. In other threads, @nutsiepully had suggested exposing features like this not as part of the individual clustering / quantization APIs, but rather as a module targeted towards different forms of combining techniques. Suppose we had this module (e.g. tfmot.cascade or tfmot.combine or tfmot.optimize); the main thing that's weird is that for certain use cases, users may end up having to use APIs from both that module and the individual technique modules.
Thanks Alan.
Yes, it would be good to put all techniques that attempt to combine others in one place. However, I agree that "mixed" API exposure seems weird, and this particular change demonstrates it well. We will come back to this question a bit later. It would be good to come up with a more elegant solution for it, if possible.
Hi @alanchiao, there is an internal check that failed (the "feedback/copybara" one); it seems to be about the last commit I pushed, but I'm not sure what the problem is. Is it an internal Google procedure, or is there anything I can do to help fix it? Thanks
Hi @nutsiepully, I'm forwarding you the request I made to Alan some days ago, as he has now left the team.
Hi @daverim, just a friendly nudge. These changes have been sitting here untouched for quite a while now; do you think they need any more work, or can they be merged? To recap: the idea was to keep this feature aside in an "experimental" sub-directory (as it is now) for users to try it out. If the feedback is positive, we can then consider adding it to the main API. Thanks and regards
The changes:
* Added a preserve_sparsity class member to ClusterWeights to make sparsity preservation optional for all clustering methods
* Refactored AbstractCentroidsInitialisation to include zero-centroid initialization for all init types
Description: this PR introduces a feature called Sparsity-Aware Clustering.
In the model optimization workflow, clustering typically follows pruning. Therefore, the clustering operation needs to ensure that the level of sparsity is not destroyed during the fine-tuning process.
To avoid sparsity degradation, we introduce two new operations:
* Sparsity masks: initialized when the clustering wrapper is created (right after pruning) and kept constant during re-training; they are applied when the weights are updated during both the forward and backward passes.
* Sparsity-aware centroid initialization: one centroid is explicitly set to zero to preserve sparsity during clustering, while the remaining centroids are proportionally allocated to the negative and positive intervals using the selected initialization method (linear, density-based, etc.).
The idea is that the zero centroid will be assigned to the weights that were set to zero by pruning, while the sparsity masks keep those zero weights constant throughout re-training.
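As a rough illustration of the mask mechanics described above (the variable names, shapes, and free-standing function are assumptions for the sketch, not the wrapper's actual code), the mask is built once from the pruned weights and then multiplied into the clustered weights, so the same zeroing also reaches the gradients during back-propagation:

```python
import tensorflow as tf

# Build the mask once, right after pruning, and keep it constant.
pruned_weights = tf.constant([[0.0, 1.2], [-0.7, 0.0]])
sparsity_mask = tf.cast(tf.math.not_equal(pruned_weights, 0.0),
                        pruned_weights.dtype)

def apply_sparsity_mask(clustered_weights):
  # Element-wise multiply keeps pruned positions at zero; because the
  # multiplication is part of the graph, gradients at those positions are
  # zeroed as well during back-propagation.
  return clustered_weights * sparsity_mask

print(apply_sparsity_mask(tf.constant([[0.1, 1.3], [-0.6, 0.2]])))
```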
Implementation details:
A new boolean parameter called "preserve_sparsity" has been added to the clustering API to enable/disable sparsity preservation. Setting it to 'True' enables both the new zero-centroid initialization and the application of the new sparsity masks during forward and backward propagation.
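For illustration, a call through the experimental exposure might look like the sketch below; the import path follows the thread (tfmot.clustering.keras.experimental.cluster_weights), while the stand-in model and argument values are assumptions rather than a guaranteed current API:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Stand-in for a model that has already been pruned (e.g. to 50% sparsity).
pruned_model = tf.keras.Sequential([tf.keras.layers.Dense(5, input_shape=(5,))])

cluster_weights = tfmot.clustering.keras.experimental.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

clustered_model = cluster_weights(
    pruned_model,
    number_of_clusters=8,
    cluster_centroids_init=CentroidInitialization.LINEAR,
    preserve_sparsity=True)  # zero-centroid init + sparsity masks
```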
When the cluster wrapper is applied to a layer, a new sanity check is performed: if sparsity preservation is enabled, the minimum allowed number of clusters is 2 (instead of 1), so that there is at least one non-zero centroid in addition to the newly reserved zero centroid.
If sparsity preservation is enabled, a different approach is used to initialize the centroids: one centroid is always set to zero, and the remaining centroids are proportionally allocated among negative and positive values, depending on the initial weight distribution of the layer.
For example, if the selected number of clusters is 32, the chosen initialization strategy is 'linear', and 40% of the weights are negative while the rest are positive, the centroids will be initialized as follows: 12 centroids linearly distributed between the minimum weight value and zero (excluded), followed by the zero centroid, followed by the remaining 19 positive centroids linearly distributed between zero (excluded) and the maximum weight value.
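A hedged NumPy sketch of that allocation, under the assumption of simple linear spacing on each side of zero (illustrative only, not the PR's AbstractCentroidsInitialisation code):

```python
import numpy as np

def sparsity_aware_linear_centroids(weights, number_of_clusters):
  """Reserve one zero centroid; split the rest by the sign distribution."""
  w = weights[weights != 0.0]
  negative_fraction = float(np.mean(w < 0.0))
  remaining = number_of_clusters - 1          # one centroid is pinned to zero
  num_negative = int(round(remaining * negative_fraction))
  num_positive = remaining - num_negative
  # Linearly spaced centroids on each side of zero, excluding zero itself.
  negative = np.linspace(w.min(), 0.0, num_negative, endpoint=False)
  positive = np.linspace(w.max(), 0.0, num_positive, endpoint=False)[::-1]
  return np.concatenate([negative, [0.0], positive])

# ~40% negative weights and 32 clusters -> 12 negative, 1 zero, 19 positive.
rng = np.random.default_rng(0)
weights = np.concatenate([rng.uniform(-1.0, 0.0, 40), rng.uniform(0.0, 1.0, 60)])
centroids = sparsity_aware_linear_centroids(weights, 32)
print(len(centroids), int(np.sum(centroids < 0)), int(np.sum(centroids > 0)))
```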
After the centroids have been initialized and the weights have been clustered, the sparsity masks are generated based on the locations of the zero-valued clustered weights.
The sparsity masks are then stored in the wrapper to be used later during the re-training process.
Experimental results
MobileNet v2: 50% pruning + 32 clusters (all but depthwise conv2d + linear centroid init + NO sparsity-aware clustering)
Accuracy and size results:
Sparsity and clustering results:
Notes: Notice how the sparsity is destroyed by clustering
MobileNet v2: 50% pruning + 32 clusters (all but depthwise conv2d + linear centroid init + sparsity-aware clustering)
Accuracy and size results:
Sparsity and clustering results:
Notes: Unlike the previous case, notice how the sparsity is preserved during clustering, while keeping the same number of clusters
MobileNet v2: 50% pruning + 32 clusters (all but depthwise conv2d + kmeans++ centroid init (#443) + NO sparsity-aware clustering)
Accuracy and size results:
Sparsity results:
Notes: Notice how the sparsity is destroyed by clustering
MobileNet v2: 50% pruning + 32 clusters (all but depthwise conv2d + kmeans++ centroid init (#443) + sparsity-aware clustering)
Accuracy and size results:
Sparsity results:
Notes: Unlike the previous case, notice how the sparsity is preserved during clustering, while keeping the same number of clusters
MobileNet v1: 50% pruning + 64 clusters (all but depthwise conv2d + linear centroid init + sparsity-aware clustering)
Accuracy and size results:
Sparsity and clustering results:
MobileNet v1: 50% pruning + 64 clusters (all but depthwise conv2d + kmeans++ centroid init (#443) + sparsity-aware clustering)
Accuracy and size results:
Sparsity and clustering results:
Notes: Notice how the sparsity is preserved during clustering
DS-CNN-L: 50% pruning + 32 clusters (linear centroid init + sparsity-aware clustering)
Accuracy and size results:
Sparsity and clustering results:
DS-CNN-L: 50% pruning + 32 clusters (kmeans++ centroid init (#443) + sparsity-aware clustering)
Accuracy and size results:
Sparsity and clustering results:
Notes: Notice how the sparsity is preserved during clustering