Investigate unique secondary index possibilities #1716

Closed
coffeemug opened this Issue Nov 27, 2013 · 30 comments

@coffeemug
Member

Many people want Rethink to support a unique secondary index feature (see https://news.ycombinator.com/item?id=6804275 for the discussion). Currently people have to resort to code like this:

// insert the user only if no existing row has this email --
// note that the read (isEmpty) and the write (insert) are separate operations
r.branch(r.table("users").getAll(email, {index: "email"}).isEmpty(),
         r.table("users").insert(user),
         {})

The main problem with this code is of course that it isn't atomic -- there is no guarantee that the result is unique. This would also be an issue with any reasonably doable implementation of a unique index feature. However, this may be an acceptable tradeoff, or there may be some other creative solution here. We should brainstorm and see if we can come up with some options.
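To make the race concrete, here is a minimal sketch in plain JavaScript, with an in-memory array standing in for the table (illustrative only, not real ReQL): two clients both run the emptiness check before either insert lands, so both inserts go through.

```javascript
// Illustrative stand-in for r.table("users"): a plain in-memory array.
const users = [];

function isEmailTaken(email) {
  return users.some(u => u.email === email);
}

// Both clients check first (simulating the gap between the
// getAll(...).isEmpty() read and the insert write)...
const clientACheck = isEmailTaken('a@example.com'); // false
const clientBCheck = isEmailTaken('a@example.com'); // false

// ...and both inserts then succeed, because each saw the email as free.
if (!clientACheck) users.push({ id: 1, email: 'a@example.com' });
if (!clientBCheck) users.push({ id: 2, email: 'a@example.com' });

// users now contains two rows with the same "unique" email.
```
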

@neumino
Member
neumino commented Dec 9, 2013

I think people using an ORM/ODM are used to defining relations like has_one. While this works fine with a primary key, it currently does not with secondary indexes.

People developing ORMs can hack around it by adding an r.branch, but it's not super pretty and not perfectly safe.

@coffeemug
Member

We could easily add sugar to make the unsafe code pretty, but I don't think we can easily add a safe alternative. I have to research the tradeoffs of how other systems do it; there might be something we can easily do that would be acceptable to users.

@jdoliner
Contributor
jdoliner commented Dec 9, 2013

There really isn't an actual way to do this. I think the right choice here is to make the primary key functional and make the limits on btree key sizes less restrictive. Then, if you want to keep a field unique, you just make it part of the primary key.

@neumino
Member
neumino commented Dec 9, 2013

Hmm, so I thought Mongo had this feature, but only in the single-machine case. They don't support unique indexes on sharded collections.
http://docs.mongodb.org/manual/tutorial/enforce-unique-keys-for-sharded-collections/

@coffeemug
Member

Ok, since there is no safe way to do it, and the state of the art is that other systems don't do it either, is everyone ok with closing this issue? We can point users who ask about unique indexes here, so they can see why we won't do it.

@jdoliner
Contributor
jdoliner commented Dec 9, 2013

I'm kind of opposed. I think we should just have this issue be about having functional primary keys, which isn't too hard and, I think, actually lets users do what they want to do here. Users don't actually want unique secondary indexes; they want multiple fields in their documents to be unique, which is a feature we can actually give them.

@neumino
Member
neumino commented Dec 9, 2013

What would a functional primary key look like for the user?

@jdoliner
Contributor
jdoliner commented Dec 9, 2013

    table_create(..., primary_key = function)

@jdoliner
Contributor
jdoliner commented Dec 9, 2013

It might be nice to add some sugar for unique keys, something like table_create(..., unique = ["foo", "bar"]), but you get the basic idea, I hope.
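As a rough illustration of how such sugar could desugar into a key function (all names here are hypothetical; nothing like `desugarUnique` exists in RethinkDB), a sketch in plain JavaScript:

```javascript
// Hypothetical sketch: how a `unique: ["foo", "bar"]` option might desugar
// into a functional primary key.
function desugarUnique(fields) {
  // The generated key function concatenates the listed fields into one
  // compound key, so two documents collide only if *all* listed fields match.
  return doc => fields.map(f => doc[f]);
}

const keyFn = desugarUnique(['foo', 'bar']);
keyFn({ foo: 1, bar: 2, baz: 3 }); // -> [1, 2]
```

Note that this makes the *combination* of fields unique, not each field on its own, which is exactly the flaw raised in the next comment.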

@AtnNn
Member
AtnNn commented Dec 9, 2013

@jdoliner How would you give a unique restriction to each key returned by the function?

With r.table_create('foo', primary_key=[r.row['a'], r.row['b']]), it seems that neither a nor b would be unique.

@jdoliner
Contributor
jdoliner commented Dec 9, 2013

@AtnNn >.< whoops, I didn't think this through enough. I guess it would have to be some sort of multi-index if we wanted to actually enforce uniqueness. Alright, I'm cool with closing this; I don't think there's an obvious solution.

I do still think that functional primary keys might be a nice feature by themselves.

@neumino
Member
neumino commented Dec 9, 2013

I'm ok closing it.

We should still write about it, explain why we won't provide such a feature, and mention that unique secondary indexes in Mongo work only for non-sharded collections, to avoid any confusion.

@coffeemug
Member

Ok, closing. I agree with expanding primary keys, but we can open a separate issue for it if/when we do it.

@coffeemug coffeemug closed this Dec 10, 2013
@faxal
faxal commented Dec 10, 2013

So there is no way to enforce uniqueness except via the primary key? Or is there a workaround?

For example, what can I do if I want to store a unique URL and a unique slug for that URL in the same table?

I could make a separate table to store slugs, but that's not very elegant.

@coffeemug
Member

Correct, there is no general way to enforce uniqueness except via the primary key. This is a general property of distributed systems: unique secondary indexes would require a distributed locking protocol, which would slow things down immensely.

@dessalines

The ability to create multiple unique indexes is something all SQL systems have. Why was this issue closed? There's a lot of demand for this.

@neumino
Member
neumino commented Mar 31, 2014

@tchoulihan -- Thanks for the feedback. It was closed because in a sharded/distributed environment, such a feature is usually not provided.
But I'm not an expert in all SQL databases, so I may have missed some. Do you have a SQL system in mind that has unique indexes on a sharded table?

I tried to look on Google, but from what I've found, unique indexes can only be defined per table (if not partitioned) or per partition (in which case you cannot enforce a unique index across your whole table).

@coffeemug
Member

@tchoulihan -- to add to @neumino's comment, it turns out that unique indexes are nearly impossible to implement in a distributed system while still maintaining performance. If you look at other NoSQL systems, this functionality is rarely if ever implemented, and practically never used, for this reason. So we decided not to implement it in RethinkDB, because the laws of physics prevent us from doing it effectively.

(I agree it's inconvenient; we'll continue thinking about possible options to see if we can do better)

@juancampa

@coffeemug, just out of curiosity: what's special about primary indexes that secondary indexes don't have?

@larkost
Collaborator
larkost commented Jan 26, 2016

@juancampa: as @coffeemug hinted at above, there is a much larger potential cost to enforcing uniqueness on a secondary index. This is because which server (or more specifically, which shard) manages the data is decided based on the primary key. So a record with a primary key of 4 will always go to the same server, and when you go to add a second record with that same primary key, that single server knows it already has the record and can immediately take action.

In contrast, a secondary index has no bearing on where the record is stored. So when a new record comes in, the server storing it would have to ask every other primary shard-holder whether it holds a record with the same secondary-index value. Not only is there a cost in time in making all of those requests, but there is also a large complexity cost in guaranteeing that this works reliably in a complex environment. This complexity is the reason you will probably not find distributed systems that offer unique secondary indexes.
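The routing argument can be sketched with a toy hash router in plain JavaScript (a hash is used here only to keep the example short; RethinkDB actually shards by primary-key ranges, but any deterministic partition makes the same point):

```javascript
// Toy router: deterministically maps a primary key to one shard.
function shardFor(primaryKey, numShards) {
  let hash = 0;
  for (const ch of String(primaryKey)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % numShards;
}

// Every insert of primary key 4 lands on the same shard, so that one shard
// can detect a duplicate locally. A secondary-index value provides no such
// routing, so every shard would have to be consulted on every insert.
shardFor(4, 3) === shardFor(4, 3); // always true
```
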

@juancampa

So a record with a primary key of 4 will always go to the same server

Thanks @larkost. Makes sense.

Now, this SO answer suggests that one way to guarantee uniqueness is to create another table, but I just realized that you end up with data split across shards -- i.e. the user data is in one shard but the user's email might be in a different one (different PKs).

Is there a best practice to this problem or is this just an inherent problem of distributed DBs which must be handled at the application layer?

@coffeemug
Member

This is an inherent problem in distributed DBs (you can't have a unique secondary index without dramatically slowing everything down). The best way around it is to use compound primary keys and make one of the components the piece of data you'd normally store in a secondary index. This is fairly counterintuitive to a lot of people, so a long-term solution might be to allow unique secondary indexes and make them part of the primary key under the hood, but that solution has a bunch of problems. We'll try to look into it and come up with something viable in 2016.

@juancampa

The best way to get around it is to use compound primary keys and make one of the components the bit of data you'd normally store in a secondary index

Hmm, this option sounds worth exploring (let's call it option 3). Let's say I go ahead and create a table user with primary key [r.row('id'), r.row('email')]. How would I then write a robust query that checks whether the email exists and inserts a new user if it doesn't? Using this approach, don't I need to know the id of the existing user to check whether it exists?

So far the three options are:

  1. Use a single table: consistency is compromised due to lack of atomicity.
  2. Use two tables: performance is compromised due to join required and data being spread across shards. Lack of atomicity might also be a problem but a minor/recoverable one IMO.
  3. Use a single table with a compound index: ?

Thanks for the feedback btw. You guys are awesome

@coffeemug
Member

Well, in this case you'd just make email the primary key and drop the id entirely. But in the more general case, where you can't drop the second field, it would have to be a field you already know (or can query). Then you insert both fields, and if the insert fails you know the row already exists. (Again, not ideal, but a reasonable workaround in many cases.)
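A sketch of this pattern in plain JavaScript, with an in-memory `Map` standing in for the table: the uniqueness check moves into the insert itself, so there is no separate racy read. (A real ReQL insert similarly reports a duplicate primary key via `errors`/`first_error` in its result.)

```javascript
// In-memory stand-in for a table keyed by primary key.
const table = new Map(); // primary key -> document

function insert(doc) {
  // The insert itself enforces uniqueness of the primary key --
  // no check-then-insert gap for a concurrent writer to slip through.
  if (table.has(doc.id)) {
    return { inserted: 0, errors: 1, first_error: 'Duplicate primary key' };
  }
  table.set(doc.id, doc);
  return { inserted: 1, errors: 0 };
}

insert({ id: 'a@example.com', name: 'Alice' }); // succeeds: inserted 1
insert({ id: 'a@example.com', name: 'Eve' });   // rejected: duplicate key
```
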

@jab
Contributor
jab commented Jan 27, 2016

Well, in this case you'd just create a primary key with email and drop the id entirely

The advantage of also having an id field is that it allows users to change their email address, while other records that refer to user records can keep referring to them by the same id without having to be updated.

@thelinuxlich

This approach fails if the unique id needs to be updatable :(

@devinivy

How might this relate to DynamoDB's handling of global secondary indexes? (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html)

Edit: ah, of course, those indexes don't enforce uniqueness...

@danielmewes
Member

@devinivy Yeah, from the document you linked:

In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.

Non-unique global secondary indexes are already supported by RethinkDB.

@titoBouzout
titoBouzout commented May 29, 2016 edited

I've been reading about RethinkDB for years, and intensively over the last few days, since I finally had the opportunity to scale a project and decided to use it. I was puzzled for long hours, even losing sleep, figuring out how to keep both a unique username and a unique email for a user. It seems I've reached the bottom of the ocean in this thread, and it's not possible. I may or may not regret this in the future, but I want to offer a far-from-ideal workaround for people who really want to use RethinkDB (which seems really nice) and also need to solve this problem: just enforce it in the application:

    // `taken` is a plain object used as a set of claimed usernames and emails
    if (taken[email] !== undefined || taken[username] !== undefined)
      throw Error('username or email already taken!');
    else
      taken[email] = taken[username] = null;

I believe this works around the situation for most websites. With multiple application instances, keeping this in sync could get really complicated; in that case I wish you success with the project, and you will probably find a way to do it right. There's also room for improvement with some clever short hash.

EDIT: It seems that storing 500,000 usernames and emails of average length in a plain JS object takes around 200 MB, so it's pretty cheap. I tested a hash function to make the strings smaller, but I noticed that rehashing 1,000,000 items (in case of a restart) could take more than half an hour. So I'm going to store the strings as-is, which takes seconds.
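For what it's worth, the same guard can be written with `Set`s, which avoids accidental collisions with `Object.prototype` property names (e.g. a user named "constructor"); the caveats above still apply -- it only works on a single application instance and must be rebuilt on restart.

```javascript
// Application-level uniqueness guard using Sets instead of a plain object.
const takenEmails = new Set();
const takenUsernames = new Set();

function claim(email, username) {
  // Check both values before mutating, so a failed claim reserves nothing.
  if (takenEmails.has(email) || takenUsernames.has(username)) {
    throw new Error('username or email already taken!');
  }
  takenEmails.add(email);
  takenUsernames.add(username);
}
```
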

@jedwards1211
jedwards1211 commented Sep 7, 2016 edited

@neumino Maybe this is new in Mongo 3.0 or 3.2, but I just discovered that it's not entirely true that you can't have unique secondary indexes in sharded Mongo collections. If the shard key is a prefix of the index, Mongo can guarantee uniqueness, because guaranteeing uniqueness within an individual chunk is easy, and the chunks are guaranteed to be disjoint.
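A toy sketch of why the prefix rule works (the `chunkFor` router below is hypothetical, not Mongo's actual implementation): if the unique index starts with the shard key, equal index values imply equal shard keys, so both documents are routed to the same disjoint chunk, where the duplicate is caught locally.

```javascript
// Toy router: deterministically maps a shard key to one chunk; any
// deterministic partition into disjoint ranges makes the same point.
function chunkFor(shardKey, numChunks) {
  let hash = 0;
  for (const ch of String(shardKey)) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % numChunks;
}

// Unique index on [shardKey, email]: equal index values imply equal shard
// keys, hence the same chunk -- so a local per-chunk check suffices.
const a = { shardKey: 'tenant-1', email: 'x@example.com' };
const b = { shardKey: 'tenant-1', email: 'x@example.com' };
chunkFor(a.shardKey, 8) === chunkFor(b.shardKey, 8); // true
```
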
