New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-dirty serialized columns are being saved anyway #8328

Closed
elado opened this Issue Nov 26, 2012 · 63 comments

Comments

Projects
None yet
@elado

elado commented Nov 26, 2012

New Rails 3.2.9 app with one model that has a serialized column:

rails new ar_serialized_dirty_test
cd ar_serialized_dirty_test

rails g model Product name:string data:text
rake db:migrate

# product.rb

class Product < ActiveRecord::Base
  serialize :data
end

On every save, this column is being updated, regardless if it was changed or not.

> p = Product.create(name: "A")
  SQL (0.6ms)  INSERT INTO "products" ("created_at", "data", "name", "updated_at") VALUES (?, ?, ?, ?)  [["created_at", Mon, 26 Nov 2012 23:08:26 UTC +00:00], ["data", nil], ["name", "A"], ["updated_at", Mon, 26 Nov 2012 23:08:26 UTC +00:00]]
 => #<Product id: 1, name: "A", data: nil, created_at: "2012-11-26 23:08:26", updated_at: "2012-11-26 23:08:26"> 
> p.save!
   (0.3ms)  UPDATE "products" SET "updated_at" = '2012-11-26 23:08:34.660484', "data" = '---\n...\n' WHERE "products"."id" = 1
 => true 
> p.save!
   (0.3ms)  UPDATE "products" SET "updated_at" = '2012-11-26 23:08:45.777419', "data" = '---\n...\n' WHERE "products"."id" = 1
 => true 
> p.data_changed?
 => false 
> p.save!
   (0.3ms)  UPDATE "products" SET "updated_at" = '2012-11-26 23:13:20.839794', "data" = '---\n...\n' WHERE "products"."id" = 1
@felipecsl

This comment has been minimized.

felipecsl commented Nov 26, 2012

+1 for that

@senny

This comment has been minimized.

Member

senny commented Nov 27, 2012

This is the intended behavior. It was added with this commit 144e869 by @jonleighton . There is even an explantation why it behaves the way it does:

# Serialized attributes should always be written in case they've been
# changed in place.
def keys_for_partial_write
  changed | (attributes.keys & self.class.serialized_attributes.keys)
end

The problem is that serialized data-structures can be changed without using the ActiveRecord setters. Then there is no way to detect that a change actually happend. To work around this problem serialized attributes are written all the time.

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Nov 27, 2012

@senny right. Closing it.

@elado thank you for reporting.

@elado

This comment has been minimized.

elado commented Nov 28, 2012

Thanks.
Indeed, looks like _changed? method doesn't return true if I change a value within a value, p.data[:x] = 1

What about keeping the original serialized value after fetching and compare it to the current serialized value before saving?

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Nov 28, 2012

@elado I think this is a unnecessary overhead, much more than writing the serialized attribute every time.

@unorthodoxgeek

This comment has been minimized.

unorthodoxgeek commented Nov 28, 2012

@rafaelfranca think large scale, 1M+ writes per minute to the RDB. on the ruby side, that wouldn't take too much time, but it can bring a really strong SQL machine down on its knees.

there are less expensive ways of doing this than comparing, for instance, extend the serialized object to have a changed attribute accessor and when a setter method is called on it, it will also set that value to true.

@elado

This comment has been minimized.

elado commented Nov 28, 2012

@rafaelfranca, as @unorthodoxgeek said, it's definitely a performance issue.

Instead of an expensive DB access for no reason, Rails could either serialize the initial value after fetching and compare to it to the current serialized value for dirty check, or have #eql? method in the serializer class (the one that has #load/#dump) that'll compare things by content.

I believe usually the serialized attribute is a Hash or an Array, so #eql? shouldn't be that expensive.

Let's remember this code overhead (not performance overhead) only happens when there are serialized attributes, and current state is that a query is executed every time and this is far longer operation.

Thanks

@senny

This comment has been minimized.

Member

senny commented Nov 28, 2012

@elado If you can come up with a good PR I think there is nothing that will prevent it from getting merged. As the current behavior is known and expected we don't treat this as a bug though. Feel free to use the Rails Core GoogleGroup to discuss the idea and the change you want to make. Of course you can also just hack up the PR and start a discussion with it.

@rafaelfranca

This comment has been minimized.

Member

rafaelfranca commented Nov 30, 2012

@elado @unorthodoxgeek I see. It make sense.

Mind to open a pull request?

I'm reopening this issue.

@rafaelfranca rafaelfranca reopened this Nov 30, 2012

@jonleighton

This comment has been minimized.

Member

jonleighton commented Nov 30, 2012

@tenderlove has talked to me about using Object#hash to determine which attributes have changed. That seems like to way to go to me. Then we don't need special handling for serialized attributes.

@al2o3cr

This comment has been minimized.

Contributor

al2o3cr commented Dec 2, 2012

The challenge here is how to "save the original" for comparison purposes - just stashing it someplace isn't sufficient. For instance, if you've got a Hash with Array values:

irb: h = { a: [1,2,3], b: [4,5,6] }
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h2 = h.deep_dup
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h2[:a][0] = -1
===> -1
irb: h
===> {:a=>[-1, 2, 3], :b=>[4, 5, 6]}

Neither dup nor clone will do the right thing here either - there is one reliable way to do it, but it's (somewhat) expensive and fails if an object can't be Marshalled:

irb: h = { a: [1,2,3], b: [4,5,6] }
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h_copy = Marshal.load(Marshal.dump(h))
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}
irb: h_copy[:a][0] = -1
===> -1
irb: h
===> {:a=>[1, 2, 3], :b=>[4, 5, 6]}

There's also the side-issue that such copying would have to be done pessimistically (every time an object is loaded from the DB) or users would likely encounter bugs if they mutate the serialized attribute in-place (since calling some_serialized_attribute_will_change! isn't currently required by the "save serialized things every time" strategy). In particular, users may wind up with silent data loss...

@unorthodoxgeek

This comment has been minimized.

unorthodoxgeek commented Dec 2, 2012

what about marshalling it and saving an md5 of the output string? should be
a fairly good way of comparing two hard-to-compare objects...

@ragalie

This comment has been minimized.

ragalie commented Dec 3, 2012

This isn't the ideal solution, but why not note when the serialized attribute is read/written to and only save it in that case. If the attribute hasn't been read it couldn't have been modified.

For applications where the serialized attribute is rarely accessed this would cut down on extraneous SQL calls while avoiding some of complexities in the proposals that try to compare versions of the serialized attribute.

@ndbroadbent

This comment has been minimized.

Contributor

ndbroadbent commented Dec 5, 2012

I agree with @elado's initial suggestion of storing the original unserialized string, but maybe as a MD5 hash. If you have:

class Product < ActiveRecord::Base
  serialize :categories, Array
end

then the original string's MD5 hash could be stored somewhere like @product.unserialized_categories_hash or @product.unserialized_attribute_hashes['categories'], before being parsed.

That would make it easy to detect changes properly, even if the attribute is modified in place.

@darthdeus

This comment has been minimized.

darthdeus commented Jan 4, 2013

+1

@njakobsen

This comment has been minimized.

Contributor

njakobsen commented Feb 23, 2013

+1

@tenderlove via @jonleighton had the answer. Every object has a .hash method. Call that to see if the serialized hash is different. If so, mark as dirty.

@neerajdotname

This comment has been minimized.

Member

neerajdotname commented Apr 22, 2013

@jonleighton any thoughts on how to retrieve old value so that when p.data_changed? is invoked we can do the Object#hash matching. Looking at the code I could not find any clean way to retrieve the old value when the value is changed in-memory.

@venkatesanmurali

This comment has been minimized.

venkatesanmurali commented Jun 10, 2013

So how do we handle the following scenario?

  1. I modify some data on a serialized data. (say, i delete some values in the data array).
  2. I would want to save the data only if its changed. Why should i unnecessarily update a row if there is no value changed. (i m on a loop and only some data will be changed based on condition)
@letronje

This comment has been minimized.

letronje commented Jun 14, 2013

Is there a way to disable this at a per-model level ?

@njakobsen

This comment has been minimized.

Contributor

njakobsen commented Aug 22, 2013

@senny Referring back to your code quote from @jonleighton, I'm not sure why serialized data is so special just because it can be changed in-place. A string can also be changed in place without the changes being detected.

user = User.first
user.first_name.replace('Changed')
user.first_name #=> "Changed"
user.changes #=> {}
@al2o3cr

This comment has been minimized.

Contributor

al2o3cr commented Aug 22, 2013

@njakobsen - sure, but the situation with serialized fields is different (IMO) for two reasons:

  • in-place modifications are far more likely for serialized data; the first_name.replace example is far less common than something like user.some_serialized_field[:foo] = 'bar'
  • users who are performing in-place modifications on String columns are either already calling user.first_name_will_change! (in the example above) beforehand or are already experiencing bugs. Users who are modifying serialized data in-place are not, since that data is always saved.
@myitcv

This comment has been minimized.

myitcv commented Aug 22, 2013

Supporting in-place changes feels like it is beyond the scope of this discussion, no? Not least because trying to support it (and being consistent) would require that a shadow .clone always be held for every attribute, serializable or not. There's a clear cost associated with that. Or the ambiguity that is created around saying 'you can modify some attributes in place but not others' if you only support it for serializable attributes - a recipe for more bugs perhaps?

As @jonleighton pointed out, using the .hash 'interface' at least gives a clean, standardised way for comparing equality when it comes to setting an attribute. It would also push the task of .clone (or similar) onto the caller, but this seems like a fair compromise if speed/efficiency is the goal.

So +1 for the use of .hash in determining whether an attribute has changed when it is set via = etc. This would then allow accurate tracking of whether the attribute is dirty or not and therefore short-circuiting of database writes if the attribute hasn't changed.

And then another +1 for spelling out in the docs that in-place changes won't be picked up (particularly here) - I got bitten by this because I (wrongly?) wrapped my obj.save with if obj.changed?

@njakobsen

This comment has been minimized.

Contributor

njakobsen commented Aug 22, 2013

@myitcv If the .hash implementation is used, there will be no need to document gotchas with in-place changes since they would be detected.

@jonashuckestein

This comment has been minimized.

jonashuckestein commented Jan 22, 2014

@kbrock storing a hash or even a clone of the serialized value on initialization is fine, if the developer has explicitly opted into the behavior and knows how to turn it off. Another performance implication of dirty tracking serialized columns is the comparison/hash calculation on each save, which you can't avoid anyway. This is likely another reason the current behavior was implemented.

I'll take some time tomorrow to gist this

@mrrooijen

This comment has been minimized.

mrrooijen commented May 17, 2014

What's the status on this issue? This is a terrible issue. I see that @kbrock got #13799 pulled into master, but has anything else changed since then that remedies this? I'm running into race conditions where my worker processes keep reverting user configuration because it keeps saving the CLEAN state of that configuration (serialized attribute) in the background right after a user updated the configuration through the web interface. This bug basically causes data (i.e. user input) to be lost.

@sgrif

This comment has been minimized.

Member

sgrif commented May 31, 2014

This is doable after the refactoring I've done recently. I'll be submitting a fix some time this week.

@sgrif

This comment has been minimized.

Member

sgrif commented Jun 1, 2014

This issue is fixed in #15458

@kbrock

This comment has been minimized.

Contributor

kbrock commented Jun 16, 2014

thanks @sgrif

@Fryguy

This comment has been minimized.

Contributor

Fryguy commented Jun 16, 2014

Great work @sgrif ... Glad to see this got merged in.

EDIT: Should really say thanks as well to the many others who contributed to this long-standing issue.

@schmurfy

This comment has been minimized.

schmurfy commented Jun 17, 2014

Thanks !
long standing is quite an understatement ;)
glad to see it fixed.

@felixbuenemann

This comment has been minimized.

Contributor

felixbuenemann commented Aug 26, 2014

This is currently broken with serialized binary columns. The changes hash will be populated after load, so it will still always be saved. See #16701 for details.
Update: It's fixed now, thanks @sgrif.

@steakchaser

This comment has been minimized.

steakchaser commented Jan 5, 2015

Rails 4.2.0 still seeing this issue when parent -> child with accepts_nested_attributes_for and child record has serialize field with type specified as Array.

If I omit the class_name param from serialize, it works fine. If present, saving the parent causes a save on each child.

@sgrif

This comment has been minimized.

Member

sgrif commented Jan 5, 2015

@steakchaser Can you please open a new issue, with a reproducible script to demonstrate the issue, using this template? Ping me on the issue, and I'll look into it.

@Ajaxy

This comment has been minimized.

Ajaxy commented Feb 17, 2015

Which rails version should this be fixed in? Still facing in 4.1.6 and 4.2.0

@felixbuenemann

This comment has been minimized.

Contributor

felixbuenemann commented Feb 17, 2015

Should be in 4.2.0, if you're still having problems open a new issue and maybe add a reference to this ticket.

@sgrif

This comment has been minimized.

Member

sgrif commented Feb 17, 2015

There's a few edge cases not handled in 4.2.0 which are fixed on 4.2.1. Please check on the 4-2-stable branch before opening a new issue, and provide a reproduction script using our template here: https://github.com/rails/rails/blob/master/guides/bug_report_templates/active_record_master.rb

@sgrif

This comment has been minimized.

Member

sgrif commented Feb 17, 2015

(In a new issue, not as a comment on this one)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment