Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Aliasing of distinct objects due to garbage collection during a dump #60

artm opened this Issue Apr 17, 2012 · 4 comments


2 participants

artm commented Apr 17, 2012

Garbage collection during dumping may cause object_id's to be reused. This results in psych treating unrelated objects as the same and linking them using yaml aliases. This only happens in some really rare conditions. I have a made up an example to reproduce the problem, but it actually happened to be in real life code.

The conditions are:

  • custom encode_with() are called several times during serialization of a large data structure
  • encode_with() passes temporary data to psych via coder
  • the whole data structure being serialized is large / complex enough that garbage collection happens between calls to encode_with()
  • a little bit of bad luck (this made it almost impossible for me to debug, until I read the sources of psych and understood what's going on)

what happens is - encode_with generates some data and passes it to psych, psych registers that data using object_id as a key, but doesn't hold on to it, so eventually it is garbage collected and its object_id is returned to the unused state. Next call to encode_with generates new temporary data and might reuse the object_id, passes that to psych and psych links the two distinct objects using yaml alias syntax because they are "the same" as far as psych's concerned.

the simplest solution is probably disabling garbage collection for the duration of a dump, next simplest is holding on to the "registered" objects (also for the duration of the dump).

artm commented Apr 17, 2012

made up example to reproduce the problem:



tenderlove commented Apr 25, 2012

Urgh. I don't think we should shut off GC during dumps. I think it would be better if I make the dumping code generate a unique id for an object as it's dumped rather than using the object id.

artm commented Apr 25, 2012

well, actually, if you used objects as hash keys when registering, they would not be collected and their IDs would be unique and you'd be able to use them for aliasing.


tenderlove commented Apr 27, 2012

Ya, definitely. Technically we only need to hold references to coders when an object implements a custom emit function. The rest of the object graph should be stored on the stack, so it won't get collected. This patch fixes the test case you gave me:

diff --git a/lib/psych/visitors/yaml_tree.rb b/lib/psych/visitors/yaml_tree.rb
index 80af046..646fed7 100644
--- a/lib/psych/visitors/yaml_tree.rb
+++ b/lib/psych/visitors/yaml_tree.rb
@@ -20,6 +20,7 @@ module Psych
         @st       = {}
         @ss       = ss
         @options  = options
+        @coders   = []

         @dispatch_cache = Hash.new do |h,klass|
           method = "visit_#{(klass.name || '').split('::').join('_')}"
@@ -406,6 +407,7 @@ module Psych

       def dump_coder o
+        @coders << o
         tag = Psych.dump_tags[o.class]
         unless tag
           klass = o.class == Object ? nil : o.class.name

I think I made a mistake by depending on object id though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment