Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching bug: Post content replaced by another post #839

Closed
sudara opened this issue Oct 20, 2019 · 37 comments · Fixed by #854
Closed

Caching bug: Post content replaced by another post #839

sudara opened this issue Oct 20, 2019 · 37 comments · Fixed by #854
Labels
Milestone

Comments

@sudara
Copy link
Contributor

sudara commented Oct 20, 2019

Really strange (and bad!) bug here on Rails 6 with Thredded 0.16.13

I figured I would first open this issue in case others have ran into it, then investigate cause.

In some cases, it seems that every single post in a topic has its content replaced by a the last post.

Perhaps it’s caching related?

65352820-dcd11c00-dbeb-11e9-964c-bc02fcd324bc

See: sudara/alonetone#780

@glebm
Copy link
Collaborator

glebm commented Oct 20, 2019

Certainly looks caching-related. What's your cache backend?

@sudara
Copy link
Contributor Author

sudara commented Oct 20, 2019

:mem_cache_store in production: https://github.com/sudara/alonetone/blob/9bf6ab5d8219158967c2ff5308c05650678d3726/config/environments/production.rb#L64

I’ll be able to dig in a bit later today, just wanted to get an issue in, in case others are hitting the same problem.

@sudara
Copy link
Contributor Author

sudara commented Oct 21, 2019

irb(main):014:0> topic.posts.map {|p| p.content.last(100)}
=> ["one. It's been a huge effort.\r\n\r\nAnd yes, it looks like a new dark theme will rise from the ashes...", " page or something there'd be more than enough folks on here willing to chip in. Just a thought.\r\n\r\n", "reon — I'm hoping it will make sense after we finally switch over and things are humming and happy! ", " troubles ad efforts! alonetone always was and is a safe haven for my sounds. and will be I hope :))"]

Looks like the content is still there in the db.

Updating one piece of content does seem to break its cache, but only its cache.

However I'm not sure why the cache key would be the same for all the posts in a topic, that's next to figure out I guess. I can't reproduce locally so far....

@sudara
Copy link
Contributor Author

sudara commented Oct 21, 2019

Alonetone is on Dalli 2.7.10 which looks to be the same as Thredded.

@sudara
Copy link
Contributor Author

sudara commented Oct 21, 2019

Running Rails.cache.clear fixes the problem on production, which is a temporary fix, not very sustainable. I will post back here if/when I see the problem again. Perhaps it had to do with the fact that we were heavily moderating and deleting users.

The caching logic is fairly custom and specific (vs. out of the box rails cache helpers in the view) so I guess it's not surprising that there may be edge cases.

@glebm
Copy link
Collaborator

glebm commented Oct 21, 2019

Here is how Thredded generates cache keys:

key = ActiveSupport::Cache.expand_cache_key(
view_context.cache_fragment_name(item, virtual_path: template.virtual_path), :views
)

This cache key should contain a fingerprint of the partial, the post ID, and the updated_at timestamp of the post.

@sudara
Copy link
Contributor Author

sudara commented Oct 31, 2019

Going to close this for now, assuming there was some cross-version cache key issue. I'll reopen if we run into it again!

@sudara sudara closed this as completed Oct 31, 2019
@sudara
Copy link
Contributor Author

sudara commented Nov 14, 2019

This has recurred again and is making some of our threads unusable :/

It's only happened on one reply in the latest case:

alonetone | How can I delete songs from my playlist? 2019-11-14 15-57-26

@sudara sudara reopened this Nov 14, 2019
@sudara sudara changed the title Topic posts’ content all suddenly replaced by 1 post Caching bug: Posts content replaced by another post Nov 14, 2019
@sudara
Copy link
Contributor Author

sudara commented Nov 14, 2019

This cache key should contain a fingerprint of the partial, the post ID, and the updated_at timestamp of the post.

I'm going to try and track down if the problem is on writing or reading. Logically, the options are

  1. On write, the same key is being used for multiple posts
  2. On write, the same content is being written into multiple post keys
  3. On read, the wrong key is being used

@sudara sudara changed the title Caching bug: Posts content replaced by another post Caching bug: Post content replaced by another post Nov 14, 2019
@fosterfarrell9
Copy link
Contributor

Any news on this? We're running into the same problem.

@andreibondarev
Copy link
Contributor

I've seen this issue before as well. But we don't have Thredded running in Prod yet.

@sudara
Copy link
Contributor Author

sudara commented Dec 10, 2019

@fosterfarrell9 @andreibondarev Thanks for reporting. I never got around to completely untangling the caching logic, though I did do some preliminary logging/inspection of the code there.

I think the next step is to find out how (what conditions) reliably reproduce the problem. Let me know if you have any ideas!

@drusepth
Copy link

drusepth commented Jan 1, 2020

I've also been seeing this pretty frequently lately around the same time I upgraded to Rails 6 (coincidentally, with also a couple thredded minor version bumps). I've got some time to dig in and try and diagnose what's going on, but I'm not sure where to start yet.

@sudara I see you have three logical options of what might be causing it above; are those all still possibilities after your preliminary inspections, or are any more/less likely now?

@fosterfarrell9
Copy link
Contributor

fosterfarrell9 commented Jan 4, 2020

Could this possibly be related to the introduction of collection cache versioning which is default in Rails 6.0+? ActiveSupport::Cache.expand_cache_key resolves to cache_key. This caused some other caching problems e.g. this Rails bug, which has been adressed in here.

@glebm glebm added the bug label Jan 4, 2020
glebm added a commit that referenced this issue Jan 4, 2020
glebm added a commit that referenced this issue Jan 4, 2020
Use `combined_fragment_cache_key` instead of eagerly converting cache keys to strings.
This may be slightly slower but is easier to maintain.

Refs #839
glebm added a commit that referenced this issue Jan 4, 2020
Use `combined_fragment_cache_key` instead of the older methods.
This may be slightly slower but is easier to maintain going forward.

Refs #839
glebm added a commit that referenced this issue Jan 4, 2020
Use `combined_fragment_cache_key` instead of the older methods.
This may be slightly slower but is easier to maintain going forward.

Refs #839
@glebm
Copy link
Collaborator

glebm commented Jan 4, 2020

Thanks @fosterfarrell9!

If you're affected by this bug, please try #852 for a few days to see if it fixes the issue.

@fosterfarrell9
Copy link
Contributor

Thanks @glebm ! Have #852 running in production now, will report back in a couple of days.

@drusepth
Copy link

drusepth commented Jan 8, 2020

(I also have #852 running in production and will report back if any users report the issue persists.)

@sudara
Copy link
Contributor Author

sudara commented Jan 9, 2020

Back from vacation and pushed #852 into production just now. Thanks @glebm!

@drusepth
Copy link

drusepth commented Jan 9, 2020

It looks like I'm still seeing the issue in production. At first I thought perhaps it was because I didn't manually bust any caches (and assumed some threads might still be affected if they were still cached), but I've already seen it a few times on new threads (since pushing it live).

I'd be interested to see if things are fixed (or at least better) for others, though. I'll definitely give it more time while I keep digging in.

@sudara
Copy link
Contributor Author

sudara commented Jan 10, 2020

I didn't bust cache on my deploy and see the issue post-deploy. I'll bust cache now and report if we see the issue recur.

glebm added a commit that referenced this issue Jan 13, 2020
glebm added a commit that referenced this issue Jan 13, 2020
Use `combined_fragment_cache_key` instead of the older methods.
This may be slightly slower but is easier to maintain going forward.

Refs #839
@glebm
Copy link
Collaborator

glebm commented Jan 13, 2020

Is everyone affected using dalli? Dalli is not tested with Rails 6 yet, does the issue disappear if you switch to mem_cache_store?

Also finding out which one of the 3 things mentioned in #839 (comment) is happening would be very helpful for debugging the issue.

@sudara
Copy link
Contributor Author

sudara commented Jan 17, 2020

https://github.com/sudara/alonetone/blob/master/config/environments/production.rb#L64

alonetone is using mem_cache_store

finding out which one of the 3 things mentioned

The prerequisite to determining that is reproduction. The problem happens to us on production, but I haven't yet reproduced it on purpose. Has anyone else?

I will dig back in with the goal to reproduce, but it would be helpful to understand more about the purpose of the custom caching code and why/what part of rails view caching wasn't enough.

I'm not seeing any specs for CollectionToStringsWithCacheRenderer but it seems like thredded takes over the responsibility of collection view rendering, and therefore has to take over the responsibility of collection caching?

The only place that calls the renderer is the render_posts helper, but I'm not sure what its purpose is. Why don't the views that call the helper use normal rails view caching? My guess is something something permissions?

@sudara
Copy link
Contributor Author

sudara commented Jan 17, 2020

partial: content_partial, collection: posts, as: :post, expires_in: 1.week,

Also, what's the reasoning for expires_in: 1.week, was there some feeling that the cache will get stale and/or we can't rely on key based expiration? (again, just trying to understand intention!)

@glebm
Copy link
Collaborator

glebm commented Jan 18, 2020

@sudara We want to render post contents with cache for multiple posts quickly, with a single call to get all the cached posts, and concurrent rendering of the uncached posts.

We only want to cache post contents and not the other stuff around the post, such as the post actions because they are user-specific.

With the methods Rails provides ouit-of-the-box, you either have to cache everything (not what we want), or cache the partials individually (no get_multi).

We render uncached posts using multiple threads because rendering can be slow, e.g. onebox rendering does network I/O.

I always set some cache expiration on everything, so that the cache does not end up with unused data in it long term: the cache key is fingerprinted based on the partial's code, can change across Rails versions, etc.

If the issue is visible in prod, perhaps you can inspect the contents of the production cache?

Another thing to try is to make post rendering single-threaded like this:

# config/initializers/thredded.rb
Thredded::CollectionToStringsWithCacheRenderer.render_threads = 1

@drusepth
Copy link

drusepth commented Feb 12, 2020

I had a PM thread that I could reliably reproduce this on tonight (where every refresh showed a different comment repeated over and over in the thread, refreshing ~100 times without not seeing the bug). I set Thredded::CollectionToStringsWithCacheRenderer.render_threads = 1 and the issue immediately disappeared.

Just to make sure it wasn't a fluke related to the deploy, I set render_threads back to the default 50 and deployed it again and was able to reproduce the bug in that thread 100% of the time again (with ~50 refreshes before I was satisfied).

While a "hey it works for me!" isn't super useful, limiting the threads did indeed seem to accurately fix the issue for me in this instance, which might help a bit in the detective work of tracking down what's happening here.

@sudara
Copy link
Contributor Author

sudara commented Feb 13, 2020

Gleb, thanks for the detail!

not the other stuff around the post, such as the post actions because they are user-specific.

I'm almost tempted to look at whether the actions displaying in the UI could be managed with a bit of js sprinkles, allowing normal rails collection caching — but it sounds like there would still be some performance issue with onebox, etc...

@drusepth Thanks for sharing!

where every refresh showed a different comment repeated over and over in the thread

That's interesting. It sounds like the issue is on read, assuming that cache isn't being broken or rewritten between refreshes.

I set Thredded::CollectionToStringsWithCacheRenderer.render_threads = 1 and the issue immediately disappeared.

Also super interesting. I wonder if the answer is in the difference between these two code branches:

if num_threads == 1
render_partials_serial(view_context, collection, opts)
else
collection.each_slice(collection.size / num_threads).map do |slice|
Thread.start { render_partials_serial(view_context.dup, slice, opts) }
end.flat_map(&:value)

However, I'm still a bit confused by the render_collection_to_strings_with_cache method in general — it seems to be reading and writing to cache on every render, which is throwing me for a loop and making me less sure the problem is on read....

@glebm
Copy link
Collaborator

glebm commented Feb 13, 2020

Thanks for trying out num_threads = 1!

Writing to cache and reading from it happens in a single thread.
Not quite sure what the problem is but it likely isn't in the caching but is in the rendering itself.

@glebm
Copy link
Collaborator

glebm commented Feb 13, 2020

Reproduced locally in development mode

@glebm
Copy link
Collaborator

glebm commented Feb 13, 2020

The issue is indeed during rendering, not caching

@glebm
Copy link
Collaborator

glebm commented Feb 13, 2020

By the time we get to format_content, we have the same contents input.

@glebm
Copy link
Collaborator

glebm commented Feb 13, 2020

The @locals instance variable in ActionView::PartialRenderer appears to be the same object.

@glebm
Copy link
Collaborator

glebm commented Feb 13, 2020

Figured out the root cause and a solution

@glebm
Copy link
Collaborator

glebm commented Feb 13, 2020

v0.16.16 released with a fix!

@sudara
Copy link
Contributor Author

sudara commented Feb 17, 2020

@glebm Just saw this!! Amazing, thanks!

@MakeSoftwareGreat
Copy link

I'm experiencing this exact issue in Thredded 1.0.1. Could there have been a regression?

@timdiggins
Copy link
Collaborator

@MakeSoftwareGreat I don't think it's exactly the same issue as this one, as the underlying problem was fixed by a very specific change (mutation of locals by rails method requiring thedded to provide a duplicate) , but it could be an identical symptom with a new (or even very similar) cause. Can you start a new issue with the details of your thredded and rails versions? Also if it was working before, what version it was.

@timdiggins
Copy link
Collaborator

FYI Anyone still observing this issue, should not that another issue was solved which caused behaviour like this (now with tests) and has now been fixed in v1.1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants