Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#961 - Negative performance impact in the User model #962

Conversation

jasonbahl
Copy link
Collaborator

@chriszarate reported a performance issue on sites with larger data sets within the User model.

This provides a more efficient method of checking if a User has published posts, which makes them non-private entities. This PR provides a filter for plugins/themes to bypass the check altogether, and also makes the check much more efficient by caching the check and making the actual query more efficient than the currently used count_user_posts query.


closes #961

- This provides a more efficient method of checking if a User has published posts, which makes them non-private entities. This PR provides a filter for plugins/themes to bypass the check altogether, and also makes the check much more efficient by caching the check and making the actual query more efficient than the currently used `count_user_posts` query.
@jasonbahl jasonbahl added Type: Enhancement New feature or request Status: In Review Needs to be reviewed by a project maintainer before merge impact: med minor perf improvements, fix broad userbase issues labels Sep 20, 2019
@jasonbahl jasonbahl self-assigned this Sep 20, 2019
@codecov
Copy link

codecov bot commented Sep 20, 2019

Codecov Report

Merging #962 into develop will increase coverage by 0.04%.
The diff coverage is 85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop    #962      +/-   ##
==========================================
+ Coverage    61.85%   61.9%   +0.04%     
==========================================
  Files          135     135              
  Lines         8196    8213      +17     
==========================================
+ Hits          5070    5084      +14     
- Misses        3126    3129       +3
Impacted Files Coverage Δ
src/Model/User.php 80.95% <85%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a878d8...32be1af. Read the comment docs.

Copy link
Member

@CodeProKid CodeProKid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got some questions here @jasonbahl

* to see if the user has published posts.
* @param User $this Instance of the User Model
*/
$is_private = apply_filters( 'graphql_user_model_is_private', null, $this );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of already handled here -> https://github.com/wp-graphql/wp-graphql/blob/develop/src/Model/Model.php#L222 though the query will still run, so maybe we just need to rework that lower level filter, but I think we should keep all of the Model Layer hooks & filters at the abstract level so they are consistent across all types.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CodeProKid If we were to use the existing graphql_data_is_private filter, it still executes this is_private method, which will execute the query to check for published posts, even if a filter were applied to override the results.

Perhaps we could do what I'm doing here in the Model.php file though, to centralize it better.

Basically, I want to provide an "escape hatch" for folks to completely override the automated checks we're doing on their behalf.

For example, if you just have a hard rule that ALL USERS should be considered private, always, then you could avoid the lookup for published posts altogether and just return true with no overhead.

If you were to do that with the current filter, and just return__true it still executes the query.

I didn't think it would, but upon testing and actually monitoring, it indeed does. 🤷‍♂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's what I'm proposing. Move the existing filter in Model.php to before we call $this->is_private(); and only call $this->is_private(); if the filter is empty, otherwise use the results of the filter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 we would use this filter and it would solve the entire perf problem for us

*/
$post_types = \WPGraphQL::get_allowed_post_types();
unset( $post_types['revision'] );
unset( $post_types['attachment'] );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand revisions, but why unset attachments? I think it might be better to do an array intersect on all public post types maybe?

This gets pretty complicated actually, as we technically shouldn't expose any authors that don't have published posts exposed in GraphQL. So if someone goes and locks down the Pages type, and a user only has posts in that post type, we technically shouldn't expose them. In a headless context, there would be no other way to enumerate that user. 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attachments don't get the publish status, just inherit.

So, it's a waste of energy to check for publish attachments, as there won't be any.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see. Would be good to add an inline comment about this for context, since this seems super arbitrary.

*
* @param int @userid ID of the User
* @param mixed|string|array $post_type The post type(s) to check for published posts in
* @param boolean $public_only Whether to count only publicly published posts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spacing is messed up here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I used the composer run fix-cs command here, so it should be formatted according to the WP code standards...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh! Spacing is because of @userid should have been $userid

*
* We only care if there's at least 1 published post, not the total.
*/
if ( false === ( $count = wp_cache_get( $cache_key, $cache_group ) ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This falls apart if someone un-publishes the post. We would have to have some code hooked into transition post status to bust this cache.

Also, where does this cache get set? I only see the get call.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya. . .that's what I was thinking too. 😬

Where's DFM Transients when you need them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, where does this cache get set? I only see the get call.

uhhh. . .looks like I caught a case of the "moving too fasts" 🤢

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will also get stuck in a bad state if an author has 0 published posts, and then publishes a post, as it will still be cached as 0. So you'll need to bust the cache in that instance as well.

We definitely want to cache 0 as a legitimate result since querying for published posts for a user that has none will do a full table scan to result in the 0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would consider calling to wpcom_vip_count_user_posts behind a function_exists check. that function basically implements what you've done here but i believe they handle the cache busting:

https://github.com/svn2github/wordpress-vip-plugins/blob/master/vip-do-not-include-on-wpcom/wpcom-caching.php#L410

i don't have a strong opinion on needing to bust the cache, i'd probably lean towards not busting it and just documenting the delay.

can i ask what the threat model is here? why exactly do users without published posts have an expectation of privacy? the whole reason that this method is unperformant is that it always runs, even when the user connected to a post (in which case, they obviously have pubbed a post).

is there a way to make this run conditional depending on the $source object?

@jasonbahl
Copy link
Collaborator Author

Closing this, per discussion here: #961 (comment)

@jasonbahl
Copy link
Collaborator Author

jasonbahl commented Oct 10, 2019

can i ask what the threat model is here? why exactly do users without published posts have an expectation of privacy? the whole reason that this method is unperformant is that it always runs, even when the user connected to a post (in which case, they obviously have pubbed a post).

@chriszarate WordPress treats users as private entities by default until they publish posts. If you install WordPress and create a new user, that user is not discoverable (by a non authenticated user) in any way until they have published a post. We're trying to respect existing access control patterns that WordPress has set.

@jasonbahl jasonbahl closed this Oct 18, 2019
chriszarate added a commit to Quartz/wp-graphql that referenced this pull request Jan 17, 2021
There has been a long-standing performance issue with the is_private
method of the User model, because it attempts to count the number of
published posts for each user, a potentially very expensive operation.
And since the operation is uncached, this can very quickly overwhelm the
database.

More discussion of the issue:
wp-graphql#961

At Quartz, this has prevented us from upgrading WPGraphQL, and it
appears to have affected other users as well. A few different PRs exist
to address this issue:

wp-graphql#962
wp-graphql#972
wp-graphql#1677

This PR is my attempt to fix this issue the way that Jason and others
have described as ideal: using DataLoader.

Instead of creating a new loader (e.g., "UserVisibilityLoader"), I am
including this in UserLoader. The reason is that they seem tightly
coupled and making them separate would just introduce unnecessary
complexity. You'd need to remember to use UserVisibilityLoader
everywhere you used UserLoader or risk losing the performance benefit
it provides.

My fix simply plugs into the existing "loadKeys" implementation and
produces a single query that determines visibility for each user. I'd
love feedback on the efficiency of this query; I hope that this is
good enough to sidestep the need for caching. If its not possible to
generate a single efficient query, generating an extra query for each
user along the lines of `SELECT id FROM wp_posts WHERE post_author = X
LIMIT 1` seems like the next best approach.

Finally, we need a way to provide the visibility to the User model.
Luckily, WP_User is not a final class and provides a setter:

https://developer.wordpress.org/reference/classes/wp_user/__set/

This allows us to set an "is_private" property on the WP_User instance,
which is then available to the model via "$this->data". This seems to
be done widely, but we could implement a wrapper around WP_User if we
wanted to avoid mutating WP_User instances.

From there, it's a very simple matter of having the User model check
this property.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
impact: med minor perf improvements, fix broad userbase issues Status: In Review Needs to be reviewed by a project maintainer before merge Type: Enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Negative performance impact in the User Model
3 participants