Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new commands ZDIFF and ZDIFFSTORE #7961

Merged
merged 25 commits into from Nov 15, 2020
Merged

Conversation

felipou
Copy link
Contributor

@felipou felipou commented Oct 25, 2020

Related to issue #446.

Simply adds the new commands ZDIFF and ZDIFFSTORE. For now, there are 3 things missing:

  • More tests (for now I just used the two basic tests from the reference PR)
  • Add documentation with a PR for redis-doc
  • Include "algorithm 2", as in the SDIFF command. I've started something here. It works, but it doesn't look very good to me right now, it seems rather "fragile". But if this is wanted, I can add it to this PR and work on it.

Side question: I noticed that the ZINTER and ZUNION commands are missing from the help.c file (only the "STORE" version is there), is that as expected, or should I fix it (here or in another PR)? I added both the ZDIFF and ZDIFFSTORE here.

@felipou felipou marked this pull request as draft October 25, 2020 17:59
@felipou felipou mentioned this pull request Oct 25, 2020
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see t_set.c saying this:

            /* With algorithm 1 it is better to order the sets to subtract
             * by decreasing size, so that we are more likely to find
             * duplicated elements ASAP. */
            qsort(sets+1,setnum-1,sizeof(robj*),
                qsortCompareSetsByRevCardinality);

maybe we need to do the same in zset?

i don't see any way to reduce the "fragility", i suppose you mean it's a lot of code and repeated twice? i guess the way to combat the fragility is to make sure there's a good test coverage for it.
i'll try to comment on that commit.

redis.conf Outdated Show resolved Hide resolved
src/help.h Outdated Show resolved Hide resolved
@oranagra oranagra added this to In progress in 6.2 Oct 27, 2020
@oranagra oranagra linked an issue Oct 27, 2020 that may be closed by this pull request
@felipou
Copy link
Contributor Author

felipou commented Oct 28, 2020

Just pushed some changes, mainly:

  • Adding the commit with algorithm 2 from the other branch.
  • Modifying the search for the max element on algorithm 2, as suggested by @oranagra.
  • Changes to the things @oranagra pointed out.

Still missing more tests, and the PR on redis-doc.

@felipou
Copy link
Contributor Author

felipou commented Oct 28, 2020

Two things I currently need advisement:

  • Should I move dictGetMaxElementLength that I created for algorithm 2 to the dict.c file?
  • On algorithm 2, inside the else block where I remove the elements from the dstzset, I just copied that code from a part of zsetDel, including the comments. I should probably create a function for that, right?

redis.conf Outdated Show resolved Hide resolved
src/t_zset.c Outdated Show resolved Hide resolved
src/t_zset.c Outdated Show resolved Hide resolved
src/t_zset.c Show resolved Hide resolved
src/t_zset.c Outdated
Comment on lines 2275 to 2289
de = dictUnlink(dstzset->dict,tmp);
if (de != NULL) {
/* Get the score in order to delete from the skiplist later. */
score = *(double*)dictGetVal(de);

/* Delete from the hash table and later from the skiplist.
* Note that the order is important: deleting from the skiplist
* actually releases the SDS string representing the element,
* which is shared between the skiplist and the hash table, so
* we need to delete from the skiplist as the final step. */
dictFreeUnlinkedEntry(dstzset->dict,de);

/* Delete from skiplist. */
int retval = zslDelete(dstzset->zsl,score,tmp,NULL);
serverAssert(retval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with your suggestion, let's move this into a common function that serves both this one and zsetDel, and returns a value so the caller knows if the entry was deleted and can carry out the other tasks forthat if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I'm not sure the name of the new function is really appropriate though, could use some help with it. I called it zsetslDel, seeing as it is a function to delete an element from a zset encoded as a skiplist (plus dict).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't have any definite answer. i think i would like to avoid creating a new type of prefix.
i would suggest to just use a zset prefix, and good top comment.
but zsetDel is taken 8-).
maybe call it zsetRemoveFromSkiplist ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@felipou
Copy link
Contributor Author

felipou commented Oct 31, 2020

In addition to addressing everything pointed out, I pushed a minor optimization, which could also be done in the SDIFF command too (but in another pull requests, maybe?).

@felipou
Copy link
Contributor Author

felipou commented Oct 31, 2020

Added some more tests, including a copy of the "SDIFF fuzzing" test.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipou while looking at the complexity formulas with @inbaryuval (conclusions tomorrow) we noticed something to improve about the use of zslInsert (see new comments).


memset(&zval, 0, sizeof(zval));
zuiInitIterator(&src[0]);
while (zuiNext(&src[0],&zval)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going over the list backwards will make zslInsert more efficient, see:

redis/src/rdb.c

Lines 855 to 862 in d8fbd3a

/* We save the skiplist elements from the greatest to the smallest
* (that's trivial since the elements are already ordered in the
* skiplist): this improves the load process, since the next loaded
* element will always be the smaller, so adding to the skiplist
* will always immediately stop at the head, making the insertion
* O(1) instead of O(log(N)). */
zskiplistNode *zn = zsl->tail;
while (zn != NULL) {

alternatively, maybe we can improve zslInsert to be optimized for this case too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I just analyzed the code, and it's a bit more complicated than I thought. The referenced code already knows that it's a skiplist. In the case of zdiff, we have to cover all cases since the src values can have any encoding. I've checked that zuiInitIterator and zuiNext are only used in zinter/zunion/zdiff, so I thought about changing then to zuiInitBackIterator and zuiPrev. I'll have to study the data structure a bit more to understand how to do that, but shouldn't be too hard. I'll try to finish this tomorrow.

Comment on lines +2297 to +2298
zuiInitIterator(&src[j]);
while (zuiNext(&src[j],&zval)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, going backwards will make zslInsert faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I had an extra concern, which is the impact that this will have on zsetslDel. Should we iterate forward when deleting the elements? This would be simple, we could just move loop to inside the if that checks if it's the first set or not, iterating backwards in one case, and forward in the other.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean zslDelete? looking at it, and considering we can't know the order of elements between efferent sets, i'm not sure if it has any principal difference.

maybe for the (possibly common case) that we have just two sets, and the majority of the elements are being deleted, it's best to iterate from head to tail. (constantly deleting the first element and no need for long searches).

maybe instead of changing the ZDIFF code, it's better to try to have zslInsert detect an insertion to the tail and make it efficient for that case too.
this way no one else will have to worry to scan insertions in the efficient direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a better idea indeed. I'll try to look into it, but I'll have to better understand the skiplist data structure, because I read the zslInsert code and couldn't see an easy way to optimize for that case. The only idea I had was to add a "hint" so that it assumes the element is greater than all, but I don't think it's possible to just reverse the loops, since it would change the way the "rank" array in the function is calculated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just ocurred to me: since this is an optimization that affects other commands, maybe it should be in a different pull request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make it in a different PR. this one is already going too long, and since the change isn't gonna affect the API, and might even be in a different are in the code with implications to other parts of redis...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipou will you have time to try to handle that soon?
i wanna make ZDIFF efficient for 6.2 one way or another (we have about 2 weeks before an RC).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to come up with something today, at least an initial PR that we can build on top of, that way if I don't have much time in the next weeks someone can pick up where I left. But I believe I can get something working by the next weekend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just reviewed the insert algorithm for skip lists, and if I understand correctly, there is no way to efficiently insert backwards. To insert, we need to update the forward pointers in all levels, so we need to traverse the list anyway. The backwards pointer is only in the first level, and is only an optimization for ZREVRANGE to traverse the list backwards. If we created backward pointers for all levels, we would end up having to traverse the list both ways when inserting (and deleting and updating).

I'll try to come up with the backward equivalents of zuiInitIterator and zuiNext instead, so that we can modify the ZDIFF code directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just finished this PR: #8105

src/t_zset.c Outdated Show resolved Hide resolved
@oranagra
Copy link
Member

oranagra commented Nov 4, 2020

one other random thought (maybe for another PR), maybe when ZINTER / ZUNION / ZDIFF are used without WITHSCORES, we want to avoid building the skiplist altogether? i suppose users expect the result to be sorted even if provided without scores. but maybe some use cases don't need that and will benefit from not paying that cost?

this concern is new (introduced when #7794 added ZUNION / ZINTER)
@yangbodong22011 @itamarhaber do you have any feedback on that?

p.s. i guess the WITHSCORES argument should be rejected as invalid syntax when dstkey is non-NULL.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipou i got feedback from my "math doctor" and tried to suggest the changes that are both correct, and hopefully easy to understand.
please merge them, or let me know if you see any problem.

other than that there's the reverse iteration issue i posted yesterday that needs to be changed.

src/t_zset.c Outdated Show resolved Hide resolved
src/t_zset.c Outdated Show resolved Hide resolved
madolson
madolson previously approved these changes Nov 9, 2020
Copy link
Contributor

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API LGTM

yossigo
yossigo previously approved these changes Nov 10, 2020
@oranagra
Copy link
Member

@felipou what's left is two small changes:

  1. changing the order of iteration to be more optimal for insertion
  2. accepting my suggested comments about complexity (if you concur)
    can you push this through or rather i handle it?

@felipou
Copy link
Contributor Author

felipou commented Nov 11, 2020

Sorry for taking so long, I'll go over everything now.

This check was copied from the zunion/zinter commands code, but is
not needed for zdiff since it doesn't have the WEIGHTS argument.
@felipou
Copy link
Contributor Author

felipou commented Nov 11, 2020

p.s. i guess the WITHSCORES argument should be rejected as invalid syntax when dstkey is non-NULL.

Ok, will do that!

@felipou
Copy link
Contributor Author

felipou commented Nov 13, 2020

Just merged the suggested comments, and added the rejection of WITHSCORES in the case of the STORE commands.

The only thing remaining is the optimization of zslInsert (or making the insert loop backwards on zdiff), I'll focus on this now.

@oranagra
Copy link
Member

@redis/core-team this one is ready to be merged, please approve.
it adds ZDIFF and ZDIFFSTORE which work similarly to SDIFF and SDIFFSTORE.

besides that it makes sure the new WITHSCORES argument that was added for ZUNION isn't considered valid for ZUNIONSTORE

@oranagra oranagra merged commit d8fd48c into redis:unstable Nov 15, 2020
@oranagra
Copy link
Member

Merged. Thank you @felipou !

@oranagra oranagra moved this from In progress to Done in 6.2 Nov 15, 2020
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Nov 19, 2020
- Add ZDIFF and ZDIFFSTORE which work similarly to SDIFF and SDIFFSTORE
- Make sure the new WITHSCORES argument that was added for ZUNION isn't considered valid for ZUNIONSTORE

Co-authored-by: Oran Agra <oran@redislabs.com>
@oranagra oranagra mentioned this pull request Jan 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus state:needs-doc-pr requires a PR to redis-doc repository
Projects
No open projects
6.2
  
Done
Development

Successfully merging this pull request may close these issues.

zdiffstore
5 participants