
@jackd jackd commented Jan 21, 2020

This PR is mostly focused on adding a 'sparse_matmul' implementation of feature_steered_convolution, which is faster than the base implementation ('gather_sum') in certain circumstances.

Differences can be seen in the benchmarking script.

cd benchmarks
python graph_convolution_benchmark.py --backward --sort
# Wall time (ms): 1173.0027198791504
# Memory (Mb):  5350.4373207092285
python graph_convolution_benchmark.py --backward --sort --sparse
# Wall time (ms): 885.279655456543
# Memory (Mb):  4426.739643096924

Benefits are similar, though reduced, when using the --jit option.

Note these benefits are not across-the-board, hence the choice of implementation is left to the user. For example, smaller networks are faster using gather_sum. Verify with --num_layers=2 or similar.
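For intuition, the two aggregation strategies can be contrasted on a toy graph. This is a plain NumPy stand-in with hypothetical names, not the code in this PR; the real implementations operate on TF sparse neighborhood tensors, and the dense matrix below stands in for a true sparse matmul:

```python
import numpy as np

def gather_sum(x, edges, weights, num_vertices):
    """Aggregate neighbor features by gathering source rows, then
    scatter-adding them into target rows (gather + segment-sum style).

    x: (V, F) vertex features; edges: (E, 2) (target, source) pairs;
    weights: (E,) edge weights.
    """
    out = np.zeros((num_vertices, x.shape[1]))
    gathered = weights[:, None] * x[edges[:, 1]]  # (E, F) weighted source rows
    np.add.at(out, edges[:, 0], gathered)         # scatter-add into targets
    return out

def sparse_matmul(x, edges, weights, num_vertices):
    """Same aggregation expressed as a matrix product A @ x, where A holds
    the edge weights. (Dense here for clarity; assumes no duplicate edges.)"""
    a = np.zeros((num_vertices, num_vertices))
    a[edges[:, 0], edges[:, 1]] = weights
    return a @ x

x = np.arange(8.0).reshape(4, 2)
edges = np.array([[0, 1], [1, 0], [2, 3], [3, 2]])
w = np.ones(4)
assert np.allclose(gather_sum(x, edges, w, 4), sparse_matmul(x, edges, w, 4))
```

Both routes compute the same result; which is faster depends on graph size, sparsity, and whether the work can be fused (e.g. under jit), which is why the PR leaves the choice to the user.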

Other minor changes:

  • numerically stable softmax implementation (this is faster than tf.nn.softmax when using JIT)
  • documentation: updated to reflect that feature_steered_convolution can take non-square neighborhood tensors.
  • layers: added get_config implementations and tf.keras.*.get calls in constructors, allowing Layer.from_config to work as intended.
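The softmax change is the standard max-subtraction trick; a minimal NumPy sketch (hypothetical `stable_softmax` name, not the code in this PR):

```python
import numpy as np

def stable_softmax(logits, axis=-1):
    # Subtracting the per-row max leaves the result unchanged (softmax is
    # shift-invariant) but keeps exp() from overflowing on large logits.
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# A naive exp(logits) would overflow to inf here; the shifted version is finite.
p = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
assert np.all(np.isfinite(p))
```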

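The get_config/from_config round trip that the layer changes enable looks roughly like this. This is a framework-free sketch with hypothetical names; tf.keras uses the same pattern, with tf.keras.activations.get and friends resolving string names to callables in the constructor:

```python
# Toy registry standing in for tf.keras.activations.get (hypothetical).
ACTIVATIONS = {"relu": lambda x: max(x, 0.0), "linear": lambda x: x}

class ToyLayer:
    def __init__(self, units, activation="linear"):
        self.units = units
        self.activation_name = activation            # keep the serializable name
        self.activation = ACTIVATIONS[activation]    # 'get'-style lookup

    def get_config(self):
        # Only JSON-serializable values, so the layer can be rebuilt later.
        return {"units": self.units, "activation": self.activation_name}

    @classmethod
    def from_config(cls, config):
        return cls(**config)

layer = ToyLayer(8, activation="relu")
clone = ToyLayer.from_config(layer.get_config())
assert clone.units == layer.units
```

The point of the PR change is that accepting string names in the constructor (and resolving them with the `get` helpers) makes the config round trip lossless.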
jackd commented Jan 21, 2020

@julienvalentin @amakadia this supersedes PR 101.

@jackd jackd requested a review from amakadia January 21, 2020 07:46
@amakadia

Thanks for making this change. I'm testing it now and should be able to respond soon.

@amakadia

Hi Jack,
This change looks great. Can you share some benchmark results from your trials? I'm specifically interested in timings and memory allocations on a GPU, as I want to check them against some of the numbers I'm seeing on my end.
Thanks!

@avneesh-sud

Hi Jack,

Could you also test this change on the mesh-segmentation model shared in the mesh-segmentation-demo.ipynb colab? We want to verify consistency with the existing implementation as a baseline.

jackd commented Feb 28, 2020

@avneesh-sud sorry about the delay on this - ECCV deadline is taking priority right now.

@avneesh-sud

> @avneesh-sud sorry about the delay on this - ECCV deadline is taking priority right now.

@jackd - hope you are safe and recovered from ECCV deadline. Checking in if you have bandwidth to benchmark the changes and test with the segmentation model now?

jackd commented Apr 19, 2020

@avneesh-sud - sorry about the delay, rolled straight into a new role after ECCV. I've had another look and... things aren't as straightforward as I remember. I'm running experiments on my laptop now with a 1050 Ti (previous results were on a desktop / 1070), and having done some more digging on experiments with a full forward/backward pass and sorted inputs, I'm observing:

  • sparse implementation is much better than standard implementation without jit - faster and consumes less memory
  • sparse impl. doesn't see much improvement with jit
  • standard impl. gets much better with jit
  • standard impl. is slightly faster than sparse impl. with jit and advertises significantly lower memory usage
  • with jit, standard impl. still OOMs earlier than sparse impl., despite advertising significantly lower memory usage for runs without OOMing.

Standard implementation OOMs with default num_vertices, so maybe I'm just not hitting the cross-over point I advertised above?

I've never really trusted the memory usage values that come out of jitted code, and this experiment hasn't given me any more faith. I don't know of any other profiling tools for this - happy to take suggestions though.

Will hopefully have a chance to play with the model tomorrow :)

jackd commented Apr 20, 2020

For the models from the notebook, the sparse implementation (proposed in this PR) takes ~170ms per step, vs 215ms for the original implementation on my laptop with 1050Ti. Not much observed difference in memory usage based on random polling of nvidia-smi output.

In terms of model accuracy, there are tests in place showing the results are the same, but I wanted to do up a keras model anyway, so I put together this gist. I got a bit carried away changing stuff and it's entirely possible I stuffed something up, but overall loss numbers look roughly equivalent to the notebook file. Unsurprisingly, the results using the different layer implementations are very similar. Somewhat surprising was that my sparse keras implementation in tf 2.1 was ~10% slower than the notebook version in tf 1.15, whereas my keras implementation using the default layer implementation was ~30% slower than the notebook equivalent. Again, not really an apples-to-apples comparison because I changed a fair bit (though I'd hoped my changes would make things faster if anything). Maybe I screwed something up...

[image: tb-sparse (TensorBoard screenshot)]
