New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new commands ZDIFF and ZDIFFSTORE #7961
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see t_set.c saying this:
/* With algorithm 1 it is better to order the sets to subtract
* by decreasing size, so that we are more likely to find
* duplicated elements ASAP. */
qsort(sets+1,setnum-1,sizeof(robj*),
qsortCompareSetsByRevCardinality);
maybe we need to do the same in zset?
i don't see any way to reduce the "fragility", i suppose you mean it's a lot of code and repeated twice? i guess the way to combat the fragility is to make sure there's a good test coverage for it.
i'll try to comment on that commit.
It's the same algorithm used in SDIFF, modified to work with zsets in zdiff/zdiffstore.
Two things I currently need advisement:
|
src/t_zset.c
Outdated
de = dictUnlink(dstzset->dict,tmp); | ||
if (de != NULL) { | ||
/* Get the score in order to delete from the skiplist later. */ | ||
score = *(double*)dictGetVal(de); | ||
|
||
/* Delete from the hash table and later from the skiplist. | ||
* Note that the order is important: deleting from the skiplist | ||
* actually releases the SDS string representing the element, | ||
* which is shared between the skiplist and the hash table, so | ||
* we need to delete from the skiplist as the final step. */ | ||
dictFreeUnlinkedEntry(dstzset->dict,de); | ||
|
||
/* Delete from skiplist. */ | ||
int retval = zslDelete(dstzset->zsl,score,tmp,NULL); | ||
serverAssert(retval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with your suggestion, let's move this into a common function that serves both this one and zsetDel, and returns a value so the caller knows if the entry was deleted and can carry out the other tasks forthat if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I'm not sure the name of the new function is really appropriate though, could use some help with it. I called it zsetslDel
, seeing as it is a function to delete an element from a zset encoded as a skiplist (plus dict).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't have any definite answer. i think i would like to avoid creating a new type of prefix.
i would suggest to just use a zset
prefix, and good top comment.
but zsetDel
is taken 8-).
maybe call it zsetRemoveFromSkiplist
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
If any other set is equal to the first set in zdiff, we're subtracting the set from itself, so we already know we'll have an empty set as a result.
In addition to addressing everything pointed out, I pushed a minor optimization, which could also be done in the SDIFF command too (but in another pull requests, maybe?). |
Added some more tests, including a copy of the "SDIFF fuzzing" test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felipou while looking at the complexity formulas with @inbaryuval (conclusions tomorrow) we noticed something to improve about the use of zslInsert (see new comments).
|
||
memset(&zval, 0, sizeof(zval)); | ||
zuiInitIterator(&src[0]); | ||
while (zuiNext(&src[0],&zval)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going over the list backwards will make zslInsert more efficient, see:
Lines 855 to 862 in d8fbd3a
/* We save the skiplist elements from the greatest to the smallest | |
* (that's trivial since the elements are already ordered in the | |
* skiplist): this improves the load process, since the next loaded | |
* element will always be the smaller, so adding to the skiplist | |
* will always immediately stop at the head, making the insertion | |
* O(1) instead of O(log(N)). */ | |
zskiplistNode *zn = zsl->tail; | |
while (zn != NULL) { |
alternatively, maybe we can improve zslInsert to be optimized for this case too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I just analyzed the code, and it's a bit more complicated than I thought. The referenced code already knows that it's a skiplist. In the case of zdiff, we have to cover all cases since the src
values can have any encoding. I've checked that zuiInitIterator
and zuiNext
are only used in zinter/zunion/zdiff, so I thought about changing then to zuiInitBackIterator
and zuiPrev
. I'll have to study the data structure a bit more to understand how to do that, but shouldn't be too hard. I'll try to finish this tomorrow.
zuiInitIterator(&src[j]); | ||
while (zuiNext(&src[j],&zval)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, going backwards will make zslInsert faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I had an extra concern, which is the impact that this will have on zsetslDel
. Should we iterate forward when deleting the elements? This would be simple, we could just move loop to inside the if that checks if it's the first set or not, iterating backwards in one case, and forward in the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean zslDelete
? looking at it, and considering we can't know the order of elements between efferent sets, i'm not sure if it has any principal difference.
maybe for the (possibly common case) that we have just two sets, and the majority of the elements are being deleted, it's best to iterate from head to tail. (constantly deleting the first element and no need for long searches).
maybe instead of changing the ZDIFF code, it's better to try to have zslInsert detect an insertion to the tail and make it efficient for that case too.
this way no one else will have to worry to scan insertions in the efficient direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a better idea indeed. I'll try to look into it, but I'll have to better understand the skiplist data structure, because I read the zslInsert code and couldn't see an easy way to optimize for that case. The only idea I had was to add a "hint" so that it assumes the element is greater than all, but I don't think it's possible to just reverse the loops, since it would change the way the "rank" array in the function is calculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just ocurred to me: since this is an optimization that affects other commands, maybe it should be in a different pull request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make it in a different PR. this one is already going too long, and since the change isn't gonna affect the API, and might even be in a different are in the code with implications to other parts of redis...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felipou will you have time to try to handle that soon?
i wanna make ZDIFF efficient for 6.2 one way or another (we have about 2 weeks before an RC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to come up with something today, at least an initial PR that we can build on top of, that way if I don't have much time in the next weeks someone can pick up where I left. But I believe I can get something working by the next weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just reviewed the insert algorithm for skip lists, and if I understand correctly, there is no way to efficiently insert backwards. To insert, we need to update the forward pointers in all levels, so we need to traverse the list anyway. The backwards pointer is only in the first level, and is only an optimization for ZREVRANGE to traverse the list backwards. If we created backward pointers for all levels, we would end up having to traverse the list both ways when inserting (and deleting and updating).
I'll try to come up with the backward equivalents of zuiInitIterator and zuiNext instead, so that we can modify the ZDIFF code directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just finished this PR: #8105
one other random thought (maybe for another PR), maybe when ZINTER / ZUNION / ZDIFF are used without WITHSCORES, we want to avoid building the skiplist altogether? i suppose users expect the result to be sorted even if provided without scores. but maybe some use cases don't need that and will benefit from not paying that cost? this concern is new (introduced when #7794 added ZUNION / ZINTER) p.s. i guess the WITHSCORES argument should be rejected as invalid syntax when dstkey is non-NULL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felipou i got feedback from my "math doctor" and tried to suggest the changes that are both correct, and hopefully easy to understand.
please merge them, or let me know if you see any problem.
other than that there's the reverse iteration issue i posted yesterday that needs to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API LGTM
@felipou what's left is two small changes:
|
Sorry for taking so long, I'll go over everything now. |
This check was copied from the zunion/zinter commands code, but is not needed for zdiff since it doesn't have the WEIGHTS argument.
c7132a0
Ok, will do that! |
Review suggestions by oranagra Co-authored-by: Oran Agra <oran@redislabs.com>
Just merged the suggested comments, and added the rejection of WITHSCORES in the case of the STORE commands. The only thing remaining is the optimization of zslInsert (or making the insert loop backwards on zdiff), I'll focus on this now. |
@redis/core-team this one is ready to be merged, please approve. besides that it makes sure the new WITHSCORES argument that was added for ZUNION isn't considered valid for ZUNIONSTORE |
Merged. Thank you @felipou ! |
- Add ZDIFF and ZDIFFSTORE which work similarly to SDIFF and SDIFFSTORE - Make sure the new WITHSCORES argument that was added for ZUNION isn't considered valid for ZUNIONSTORE Co-authored-by: Oran Agra <oran@redislabs.com>
Related to issue #446.
Simply adds the new commands ZDIFF and ZDIFFSTORE. For now, there are 3 things missing:
Side question: I noticed that the ZINTER and ZUNION commands are missing from the help.c file (only the "STORE" version is there), is that as expected, or should I fix it (here or in another PR)? I added both the ZDIFF and ZDIFFSTORE here.