For the implementation:
I have several questions about the motivation and use cases for this class:
Could you provide examples of scenarios where this class improves performance, ideally with a comparison against the default Python GC?
To my understanding, activation CUDA memory should be released promptly during backward as we walk the computational graph. Will GarbageCollection affect how that CUDA memory is released?
What are the tradeoffs of disabling automatic GC (gc.disable())?
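To make the last two questions concrete, here is a minimal sketch of the behavior I have in mind, using only the standard library. `Buffer` is a hypothetical stand-in for an object that owns a large allocation; it is not a class from this PR. In CPython, my understanding is that objects without reference cycles are still freed by reference counting when automatic GC is disabled, while objects caught in a cycle stay alive until an explicit `gc.collect()`.

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for an object that owns a large allocation."""


def make_plain():
    # No cycle: the object is freed by reference counting on return.
    buf = Buffer()
    return weakref.ref(buf)


def make_cycle():
    # Two objects referencing each other: refcounts never reach zero.
    a, b = Buffer(), Buffer()
    a.other, b.other = b, a
    return weakref.ref(a)


gc.disable()  # turn off automatic cyclic collection
try:
    plain_ref = make_plain()
    cycle_ref = make_cycle()

    plain_freed = plain_ref() is None            # freed without the cyclic GC
    cycle_alive_before = cycle_ref() is not None  # still alive, GC is off

    gc.collect()  # manual collection breaks the cycle
    cycle_freed_after = cycle_ref() is None
finally:
    gc.enable()

print(plain_freed, cycle_alive_before, cycle_freed_after)
```

If that matches the intent here, the tradeoff seems to be that anything held in a reference cycle lingers until the next manual collection, so it would help to document how often this class calls `gc.collect()` and why.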