Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Aliasing of distinct objects due to garbage collection during a dump #60

Closed
artm opened this Issue · 4 comments

2 participants

@artm

Garbage collection during dumping may cause object_id's to be reused. This results in psych treating unrelated objects as the same and linking them using yaml aliases. This only happens in some really rare conditions. I have a made up an example to reproduce the problem, but it actually happened to be in real life code.

The conditions are:

  • custom encode_with() are called several times during serialization of a large data structure
  • encode_with() passes temporary data to psych via coder
  • the whole data structure being serialized is large / complex enough that garbage collection happens between calls to encode_with()
  • a little bit of bad luck (this made it almost impossible for me to debug, until I read the sources of psych and understood what's going on)

what happens is - encode_with generates some data and passes it to psych, psych registers that data using object_id as a key, but doesn't hold on to it, so eventually it is garbage collected and its object_id is returned to the unused state. Next call to encode_with generates new temporary data and might reuse the object_id, passes that to psych and psych links the two distinct objects using yaml alias syntax because they are "the same" as far as psych's concerned.

the simplest solution is probably disabling garbage collection for the duration of a dump, next simplest is holding on to the "registered" objects (also for the duration of the dump).

@artm

made up example to reproduce the problem:

https://gist.github.com/2405586

@tenderlove
Owner

Urgh. I don't think we should shut off GC during dumps. I think it would be better if I make the dumping code generate a unique id for an object as it's dumped rather than using the object id.

@artm

well, actually, if you used objects as hash keys when registering, they would not be collected and their IDs would be unique and you'd be able to use them for aliasing.

@tenderlove
Owner

Ya, definitely. Technically we only need to hold references to coders when an object implements a custom emit function. The rest of the object graph should be stored on the stack, so it won't get collected. This patch fixes the test case you gave me:

diff --git a/lib/psych/visitors/yaml_tree.rb b/lib/psych/visitors/yaml_tree.rb
index 80af046..646fed7 100644
--- a/lib/psych/visitors/yaml_tree.rb
+++ b/lib/psych/visitors/yaml_tree.rb
@@ -20,6 +20,7 @@ module Psych
         @st       = {}
         @ss       = ss
         @options  = options
+        @coders   = []

         @dispatch_cache = Hash.new do |h,klass|
           method = "visit_#{(klass.name || '').split('::').join('_')}"
@@ -406,6 +407,7 @@ module Psych
       end

       def dump_coder o
+        @coders << o
         tag = Psych.dump_tags[o.class]
         unless tag
           klass = o.class == Object ? nil : o.class.name

I think I made a mistake by depending on object id though.

@tenderlove tenderlove closed this issue from a commit
@tenderlove * ext/psych/lib/psych/visitors/yaml_tree.rb: keep a reference to
  custom coders so that GC does not impact dumped yaml reference ids.

Fixes #60
1606060
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.