renumbering categories should not affect existing resources #308

Closed
kaos opened this Issue Apr 16, 2012 · 20 comments

4 participants

@kaos
Zotonic member

When adding new categories, they may be renumbered.

Existing resources may be under another category during the renumbering process, which may be rather lengthy when there are a lot of resources.

See comments of #307 for more info.

@arjan
Zotonic member

We were just having an offline mail discussion about this with @mworrell @mmzeeman
Let's continue it here :)

@mworrell
Zotonic member

I will translate my offline Dutch answer to English here :)


We use the pivoted category number for matching subtrees in the category tree.
To keep the subtrees matching we need to renumber the category_nr whenever the category tree changes.

An alternative would be to join the category table to the rsc table whenever we need to filter on category.
(Except when we filter on a specific category)

To overcome the category-renumbering-problem there are two possibilities:

  1. Number the categories in such a way that we can always insert between two existing categories.
    This can be done by using strings or numbers with big increments.
    Note: This doesn't solve the problem of moving a category, then all items within and below that category still need to be renumbered.

  2. Check if the join isn't too bad after all.
    We have only a limited number of categories.
    And the queries mostly don't sort on category but only filter on category.
    It could well be that the filtering by PostgreSQL is as fast with a in (...) lookup as with a range query.
    Only counting the number of rscs inside a category (sub)tree will be a bit slower, but that isn't a very common operation on most sites.

Maybe someone can run some SQL EXPLAIN queries to see the difference?

@mmzeeman
Zotonic member
@arjan arjan was assigned Sep 21, 2012
@arjan
Zotonic member

Yes but still it leaves your site in a wrong state for some time. Pivoting is pretty slow.

Im prioritizing this as it is impossible right now to do "hot" upgrades of sites if a new category needs to be added.

@kaos
Zotonic member

I'm curious about this, so I'm looking into it. So far, I've only found one function (in m_category) that depends on the pivoting of category numbers from the nested set used, and that is for getting last_modified for resources under a certain category tree. I've not yet searched in other modules that might reference the pivot_category_nr though.
However, while examining the lft and rght values in my category table, I find it odd that only lft is a nice sequential list, while rght has duplicates and holes in it... bug?

@kaos
Zotonic member

Running against a default blog site (in other words, very few resources, don't know what impact this has on the numbers):

EXPLAIN (ANALYZE off, VERBOSE off, COSTS on, BUFFERS off )select max(modified) from dev_blog.rsc where pivot_category_nr >= 2 and pivot_category_nr <= 4;
"Result (cost=4.90..4.91 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Limit (cost=0.00..4.90 rows=1 width=8)"
" -> Index Scan Backward using fki_rsc_modified on rsc (cost=0.00..29.39 rows=6 width=8)"
" Index Cond: (modified IS NOT NULL)"
" Filter: ((pivot_category_nr >= 2) AND (pivot_category_nr <= 4))"

@kaos
Zotonic member

EXPLAIN (ANALYZE off, VERBOSE off, COSTS on, BUFFERS off )select max(modified) from dev_blog.rsc r join dev_blog.category c on c.id = r.category_id where c.nr >= 2 and c.nr <= 4;
"Aggregate (cost=7.19..7.20 rows=1 width=8)"
" -> Hash Join (cost=1.40..7.18 rows=4 width=8)"
" Hash Cond: (r.category_id = c.id)"
" -> Seq Scan on rsc r (cost=0.00..5.54 rows=54 width=12)"
" -> Hash (cost=1.38..1.38 rows=2 width=4)"
" -> Seq Scan on category c (cost=0.00..1.38 rows=2 width=4)"
" Filter: ((nr >= 2) AND (nr <= 4))"

So, a bit more costly, but then again, we can cache the result, so how bad is this in practice?

@kaos
Zotonic member

Ah, ok. Now I see how left and right are calculated. No bug, then. Just different than that illustrated on wikipedia for nested set models.

@mworrell
Zotonic member

I am a bit worried about the join. Maybe we can recheck on bigger sites, I can rerun your analyze queries on maxclass or Mediafonds.

In practice the category table will always be in memory, so the impact should be quite constant. Unless we have a better index on the rsc table, combining the category and the modification time. It is a good idea to recheck the used keys anyway....

@kaos
Zotonic member

I see how the confusion arise during renumbering, since the pivot_category_nr is being used to find resources belonging to a certain category tree... which will may be wrong when there's a pending/ongoing renumbering.

I do see the benefit of this, pity that the numbers change as soon as the structure of the category tree does too...

@kaos
Zotonic member

Ah, right. Now I see, the join will cause the db to look up the category for every resources. I guess that's going to be bad when there's a lot of resources to go through, even with few categories...

@kaos
Zotonic member

I have a simple proposal to islolate the issue to do as little harm as possible. And that is to offset the left and right counts by id, possibly shifted left a few bytes or so to leave some room, and then start the left count from 1 for each root node.
This way, when a new category is added, it will only affect the tree to which it has been added, so any resource not in that particular category tree will not be affected at all.
Also, if the new category is added to the bottom of the tree, it won't affect the categories above it in the tree, hence the existing resources in that same tree won't be affected in that case either*. It will not always be possible to put the new category at the end, in case there are multiple levels and the new one needs to go at a level further up the tree, but in most cases I don't think the tree is so complex that it would cause any grief.
*well, the existing resources may have their right value affected, but nothing too weird should happen until it has been adjusted.

@kaos
Zotonic member

Ok, re-read Marc's suggestions further up. pt 1 is similar to my sugestion I think, apart from that I think I've taken it one step further by decoupling the trees from each other. Also, when moving a category, I think it could be acceptable for it to take a few minutes to propagate properly. If not, I think we need to make bigger changes (i.e. not using nested set, but a variant of it) I've found some references about other ways to number the nodes to avoid this very issue. I'll dig into that a bit more and see if it makes any sense...

@kaos
Zotonic member

Another option is to first look up the category id's from the category range, and then filter the rsc's based on that, as in:

EXPLAIN (ANALYZE off, VERBOSE off, COSTS on, BUFFERS off )select max(modified) from dev_blog.rsc where category_id in (select id from dev_blog.category where nr >= 2 and nr <= 4);
"Aggregate  (cost=7.20..7.21 rows=1 width=8)"
"  ->  Hash Semi Join  (cost=1.40..7.18 rows=9 width=8)"
"        Hash Cond: (rsc.category_id = category.id)"
"        ->  Seq Scan on rsc  (cost=0.00..5.54 rows=54 width=12)"
"        ->  Hash  (cost=1.38..1.38 rows=2 width=4)"
"              ->  Seq Scan on category  (cost=0.00..1.38 rows=2 width=4)"
"                    Filter: ((nr >= 2) AND (nr <= 4))"

This will avoid scanning the category table for each resource, at least :)

I also note that we could store an additional value in both tables, say category_pivot_seq which is incremented each time we recalculate the categories, and can thus see if the pivot_category_nr is up-to-date or not, using the queries we have today most of the time, and only some other fallback during re-pivoting... Hmm... although, it might cost more to just determine if the pivot is up-to-date than to go the alternative route to begin with... ?

@arjan
Zotonic member

Maybe we can have a "category dirty" flag which, if set, lets the query builder use the "inefficient but correct" queries while the category tree is rebuilding?

@kaos
Zotonic member

sounds more optimal than my idea, but with the same end-result, so 👍 :)

@arjan
Zotonic member

alright, I'm implementing this

@mworrell
Zotonic member

I thought of the same 👍

I also like the the idea of separate trees.

@arjan
Zotonic member

I am now implementing it not using separate trees, but just doing the category joins while the category tree is marked "dirty".

@mworrell
Zotonic member

Which is a good start.

After this we can do either the separate trees or the "number with holes so that we don't need to renumber" fix.

@arjan arjan added a commit that closed this issue Dec 10, 2012
@arjan arjan z_search: Use joins for search while the category tree is renumbering
Although using joins is considerably slower, this way, the search
results are still correct when the database is updating itself.

Added public function m_category:is_tree_dirty/1 which can be used to
check if the current category tree is dirty or now.

Fixes #308
eb0ae0a
@arjan arjan closed this in eb0ae0a Dec 10, 2012
@rpip rpip pushed a commit to rpip/zotonic that referenced this issue Aug 12, 2013
@arjan arjan z_search: Use joins for search while the category tree is renumbering
Although using joins is considerably slower, this way, the search
results are still correct when the database is updating itself.

Added public function m_category:is_tree_dirty/1 which can be used to
check if the current category tree is dirty or now.

Fixes #308
0d3b7db
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment