🐛 Describe the bug
By default, libraries that hook into the frame evaluation API are encouraged to store compiled products in co_extra. However, storing arbitrary Python objects on code objects is a recipe for reference cycles, because most people assume that holding a reference to a code object is "safe" (unlikely to result in a cycle); e.g., that it is safe to store a CapturedTraceback on a compiled code object. I recently ran into this on the stack at #107457, where I wanted to store debugging tracebacks on internal data structures in PT2. I only noticed because we had two bugs in GC traverse/clear, which resulted in code objects being leaked entirely.
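To make the shape of the cycle concrete, here is a minimal sketch in pure Python. Everything in it is a hypothetical stand-in: `FakeCode` models a code object with a `co_extra` slot (the real slot is only reachable from C), and `FakeTraceback` and `CompiledProduct` model a captured traceback and a cached compiled product. It is not the PT2 object graph, just the same pattern: code -> co_extra -> product -> traceback -> code.

```python
import gc
import weakref


class FakeCode:
    """Hypothetical stand-in for a code object; co_extra models the C-level slot."""

    def __init__(self, name):
        self.name = name
        self.co_extra = None


class FakeTraceback:
    """Hypothetical stand-in for a captured traceback that keeps code objects alive."""

    def __init__(self, codes):
        self.codes = list(codes)


class CompiledProduct:
    """Hypothetical stand-in for a cached compiled product with a debug traceback."""

    def __init__(self, traceback):
        self.traceback = traceback


def build_cycle():
    code = FakeCode("f")
    tb = FakeTraceback([code])            # traceback -> code
    code.co_extra = CompiledProduct(tb)   # code -> product -> traceback
    return weakref.ref(code)              # all strong local refs die on return


def demo():
    """Return (alive after refcounting only, alive after a cyclic GC pass)."""
    was_enabled = gc.isenabled()
    gc.disable()  # make sure only reference counting runs at first
    try:
        ref = build_cycle()
        alive_after_refcount = ref() is not None
        gc.collect()  # the cycle collector has to traverse and clear the cycle
        alive_after_gc = ref() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_refcount, alive_after_gc
```

In this sketch, reference counting alone cannot free the stand-in code object; only the cycle collector can, which is why a bug in traverse/clear on the real objects turns the cycle into a permanent leak.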
I'm not sure how to 100% solve this problem. Some options:
- Stop storing compiled products in co_extra and store them somewhere else. However, if you torch.compile a global function that takes only Tensor inputs, there is nowhere else to store it; the code object really is your only logical choice.
- Stop storing so much stuff in our compiled products. This might be plausible, but we'll need to take a close look at the object graph of our compiled code and see what can be dropped and what can't.
- Something else?
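As one illustration of the second option, debugging information can sometimes be kept as plain text instead of live objects. The sketch below uses the standard library's `traceback.format_stack()`, which returns a list of strings, so the stored stack holds no references to frames or code objects. `LeanCompiledProduct` and `make_product` are hypothetical names for this example only; whether PT2's debugging tracebacks can be reduced this way is an open question, not something this sketch establishes.

```python
import traceback


def capture_stack_as_text():
    # format_stack() returns a list of plain strings describing the current
    # call stack; strings cannot participate in reference cycles.
    return traceback.format_stack()


class LeanCompiledProduct:
    """Hypothetical compiled product that keeps only a textual creation stack."""

    def __init__(self):
        self.created_at = capture_stack_as_text()


def make_product():
    return LeanCompiledProduct()
```

The trade-off is that the stack is formatted eagerly, which costs time at capture and loses the ability to inspect frames later.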
cc @msaroufim @wconstab @bdhirsh @anijain2305 @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @chenyang78 @aakhundov
Versions
main