
Support multiple pkey autogen schemes #117

Open
coffeemug opened this issue Nov 28, 2012 · 29 comments

@coffeemug (Contributor) commented Nov 28, 2012

The uuid key autogen scheme we use isn't enough. We should allow picking an autogen scheme at table creation. Here are some alternative schemes we should support:

  1. Approximately chronological (e.g. non-colliding timestamps, roughly ordered)
  2. Approximately ordered (e.g. incrementing from zero, roughly ordered)
@pzol commented May 30, 2013

You're probably aware of how MongoDB creates its ids? It's similar to Twitter's Snowflake. The embedded timestamp is a great bonus.

http://docs.mongodb.org/manual/reference/object-id/
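For context, the ObjectId layout linked above can be sketched in a few lines. This is an illustrative approximation (a timestamp prefix, a per-process random block, and a counter), not MongoDB's official implementation:

```python
import os
import struct
import time

# Rough sketch of MongoDB's ObjectId layout: a 4-byte big-endian Unix
# timestamp, 5 random bytes fixed per process, and a 3-byte counter.
_random5 = os.urandom(5)
_counter = int.from_bytes(os.urandom(3), "big")

def objectid_like() -> str:
    global _counter
    _counter = (_counter + 1) % (1 << 24)
    raw = struct.pack(">I", int(time.time())) + _random5 + _counter.to_bytes(3, "big")
    return raw.hex()  # 24 hex chars, like a MongoDB ObjectId
```

Because the timestamp comes first, ids sort roughly by creation time, which is the "hidden timestamp bonus" mentioned above.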

@coffeemug (Contributor, Author) commented May 30, 2013

Yes, the snowflake approach would probably be the first one we implement once we add another autogeneration scheme (I also suspect that coupled with the current scheme, it will be enough for most people).

@sandstrom commented Jul 31, 2013

I agree, Snowflake or something similar to MongoDB would be neat. Although not too different from UUIDs, they're shorter (nicer in URLs) and a bit easier to pass around.

@coffeemug (Contributor, Author) commented Aug 6, 2013

For those watching this issue, this almost certainly won't happen until Rethink hits 2.0 -- sorry. Between UUID and upcoming date support, people should be able to find workarounds for almost every use case. While additional generation schemes are really nice, unfortunately there are much more important things we have to do first.

If you run into a use case where this is a real showstopper, please let us know and we'll see if we can reprioritize.

@bryanhelmig commented Nov 4, 2013

👍 for this.

The UUIDv1 scheme could work, although the MAC address encoding would probably be replaced with a random seed.
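A rough sketch of that idea using Python's standard uuid module; the random node value and the multicast-bit convention follow RFC 4122's recommendation for non-MAC node ids:

```python
import random
import uuid

# Sketch of the suggestion above: UUIDv1's time-ordered layout, but with the
# MAC-address node field replaced by a random 48-bit value. Setting the
# multicast bit marks the node id as random, per RFC 4122.
node = random.getrandbits(48) | (1 << 40)
u = uuid.uuid1(node=node)

# The version and the custom node are preserved in the generated UUID:
assert u.version == 1
assert u.node == node
```

The embedded 60-bit timestamp is what keeps UUIDv1 ids roughly time-ordered.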

coffeemug referenced this issue Mar 5, 2014: r.uniqueId() #2063 (Closed)

@sandstrom commented Mar 11, 2014

Two thoughts, regarding migrations from other databases to Rethink and string representation.

When moving from MongoDB/MySQL to Rethink it's nice to use a uniform id format. It would be good if the Rethink ID function could both generate them randomly (normal method of operation) or based on a seed (which would be the old id from another structure).

For example:

rdbUuidFunction('mysql_table_name-1') => '1u7v7tzap6ua18f6lyr' // mysql to rethink
rdbUuidFunction('5318cd64421aa9f2e800002d') => '40aubkd55r8yhoo2qr' // mongodb to rethink

This use-case would be only for migrations, and it would always generate the same output for one input string.

It would leave it to the developer to ensure that the input is unique (e.g. a database-table-id combination for MySQL, and another unique id for MongoDB).


I know nothing about database design and choice of id algorithms, but having an id whose string representation is short (and URL-compatible) is neat, i.e. [a-z0-9] (alphanumeric) is better than [a-f0-9] (hex).
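As an illustration of a seeded, short-id scheme like the hypothetical rdbUuidFunction above; the function name, alphabet, and length here are all made up for the sketch:

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # [a-z0-9], URL-friendly

def seeded_id(seed: str, length: int = 19) -> str:
    # Hash the seed, then base36-encode the digest, so the same input always
    # yields the same short id. Uniqueness of the input is the caller's
    # responsibility, as noted above.
    n = int.from_bytes(hashlib.sha256(seed.encode()).digest(), "big")
    chars = []
    for _ in range(length):
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(chars)

# Deterministic: the property the migration use case needs.
assert seeded_id("mysql_table_name-1") == seeded_id("mysql_table_name-1")
assert seeded_id("mysql_table_name-1") != seeded_id("mysql_table_name-2")
```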

@coffeemug (Contributor, Author) commented Mar 11, 2014

@sandstrom -- in case of a migration, why not just use the original ID as the new RethinkDB ID? (unless you want IDs for old data and new data to look consistent, which is the only consideration I can think of at the moment)

@sandstrom commented Mar 12, 2014

@coffeemug The thing I thought of was consistency. Maybe it's just my OCD :) (but I think others may find it useful too). And for some types of uuid generation it would not be difficult to include.

coffeemug modified the milestones: 1.13-polish, backlog, subsequent (Mar 26, 2014)

@mike-marcacci commented Apr 10, 2014

@sandstrom, not sure that's a good idea – you're going to completely break relationships between records that use the ID field. This is really something that would need to be scripted for your particular situation.

@sandstrom commented Apr 10, 2014

@mike-marcacci Yes, foreign keys must be updated too, naturally.

@coffeemug (Contributor, Author) commented Aug 14, 2014

Note, this is related to #2063.

@coffeemug (Contributor, Author) commented Aug 20, 2014

Also somewhat related to #2920.

@danielmewes (Member) commented Aug 20, 2014

Another one we should add is a "packed" UUID. It should be stored in the shortest possible way, e.g. storing 7 bits of the UUID per character in a string.
That would make storing small documents a lot more space-efficient, and would also speed up access and reduce I/O costs.
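To get a feel for the savings, compare a UUID's canonical 36-character form with denser printable encodings. This only illustrates the size difference, and is not a claim about RethinkDB's on-disk format:

```python
import base64
import uuid

# A UUID is 128 bits: 36 chars in canonical hex-with-hyphens form,
# 22 chars in unpadded URL-safe base64 (6 bits/char), 20 chars in base85.
u = uuid.uuid4()
packed64 = base64.urlsafe_b64encode(u.bytes).rstrip(b"=")
packed85 = base64.b85encode(u.bytes)

assert len(str(u)) == 36
assert len(packed64) == 22
assert len(packed85) == 20
# The packing is lossless:
assert uuid.UUID(bytes=base64.b85decode(packed85)) == u
```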

coffeemug modified the milestones: reql-discussion, subsequent (Sep 4, 2014)

@neumino (Member) commented Sep 8, 2014

What's the scope of this issue?
And what's the use case?

People are used to having incremental integers from MySQL, but that's mostly because they don't have partitions.

If users want primary keys that are sorted, they could use r.now(), right? If they're worried about collisions, they could also do r.now().add(r.uuid()).

If users want more human-readable ids, maybe we should have shorter UUIDs that we can convert to base 10, like Google Plus has?

Should we wait for hash shards before having incremental UUIDs?

@mlucy (Member) commented Sep 10, 2014

I think we should allow multiple pkey autogeneration schemes as follows: insert takes an optarg pkey_gen that can be either a string specifying a preset strategy or a function of 0 arguments that's called once per row (so someone could write r.now().coerce_to('string') + r.uuid() if they wanted).

@coffeemug -- could you come up with a list of pkey generation schemes people actually want?

mlucy added the tp:active label (Sep 10, 2014)

@coffeemug (Contributor, Author) commented Sep 11, 2014

I had a slightly different proposal that I think would be better. r.uuid would take an optarg with the generation scheme. I think there should be two -- the current one (random), and a semi-monotonically-increasing one (essentially time.now() plus a random value tacked on at the end):

> r.uuid(scheme='random').run()
'e0f75d10-d770-4434-ae3e-23194ffbe581'

> r.uuid(scheme='ordered').run()
'2014-09-17;05:15:59.123;e0f7'

The users could specify the uuid generation scheme on table creation:

r.table_create('foo', id_scheme='ordered')

@mlucy pointed out in person that this would work way better after we have hash sharding (because if we don't and the user picks 'ordered', all the writes will be routed to one shard), so I think we should wait until hash sharding is in to do this (see #364).

Moving this out of the discussion period until we start working on hash sharding; please complain if you want this back in now.
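A minimal sketch of what the proposed 'ordered' scheme could look like, assuming a millisecond timestamp prefix and a short random suffix as in the example above; the separator and suffix length are illustrative, not a committed format:

```python
import datetime
import secrets

def ordered_id() -> str:
    # A wall-clock prefix so ids sort roughly by creation time, plus a
    # 4-hex-digit random suffix to avoid same-millisecond collisions.
    now = datetime.datetime.now(datetime.timezone.utc)
    return now.strftime("%Y-%m-%d;%H:%M:%S.%f")[:-3] + ";" + secrets.token_hex(2)

a, b = ordered_id(), ordered_id()
# The 23-char timestamp prefixes sort in insertion order (given one clock):
assert a[:23] <= b[:23]
```

As the comment notes, such lexically increasing keys concentrate writes on one shard under range sharding, which is why this was deferred until hash sharding.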

coffeemug modified the milestones: subsequent, reql-discussion (Sep 11, 2014)

coffeemug removed the tp:active label (Sep 11, 2014)

@thelinuxlich commented Aug 16, 2015

A good pkey autogen scheme: https://github.com/ericelliott/cuid

@mbrevda commented Sep 26, 2015

Perhaps considering additional flexibility would be useful. For example, we don't like the hyphens in our UUIDs. It would be nice if it were possible to remove them.

@coffeemug (Contributor, Author) commented Sep 26, 2015

👍

@deividasjackus commented Mar 9, 2016

Hopefully this proves to be of some value... I find it to be a pretty elegant solution: Sharding & IDs at Instagram

@xfg commented Jul 16, 2016

+1 for human-readable ids

@RubenKelevra commented Aug 2, 2017

Wondering why we don't have a simple increasing int. I know a global lock on the database might hurt performance, but it would be useful for some application types.

So we would need to run something like this atomically:

r.db('nodes').table('basicdata').insert({'id': r.db('nodes').table('basicdata').max('id').getField('id')+1 })

@thelinuxlich commented Aug 2, 2017

The problem with autoincrementing ids is that RethinkDB uses range sharding, so you would need to run rebalance() on the table often.

@RubenKelevra commented Aug 2, 2017

@thelinuxlich are you sure this is really an issue? I've imported a table with an auto-incrementing id from a MySQL server, and this is the data distribution:

[screenshot: shard data distribution in the RethinkDB web UI]

@thelinuxlich commented Aug 2, 2017

I'm sure it's an issue.

@RubenKelevra commented Aug 2, 2017

@thelinuxlich forgive me the noobish question, but why is my table in this case perfectly balanced?

@thelinuxlich commented Aug 2, 2017

I don't have the technical knowledge to explain it, but I've tested inserting millions of records and the final result is unbalanced.

@mike-marcacci commented Aug 2, 2017

@RubenKelevra are you sure your primary keys are numbers and not in fact strings? Range-sorting the string representations of auto-incremented ints would balance, although not in the order you'd expect of numbers.

Also, how did you do the import? If you used a bulk utility, it’s possible it ran a rebalance for you; also, if you changed the sharding or replication strategies after your initial import or added/removed nodes, you likely triggered a rebalance.
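The lexicographic-versus-numeric point above is easy to demonstrate:

```python
# If "numeric" keys are actually strings, range order is lexicographic, which
# spreads auto-incremented values across the key space quite differently than
# numeric order would: "10" sorts before "2".
nums = [1, 2, 10, 20, 100]
assert sorted(nums) == [1, 2, 10, 20, 100]
assert sorted(str(n) for n in nums) == ["1", "10", "100", "2", "20"]
```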

@RubenKelevra commented Aug 3, 2017

@mike-marcacci yes - type is int.

Imported via the web UI with a simple r....insert({...}) of a JSON dump of a MySQL database query.

Because my export contained only strings, I've bulk-converted it with sed to get an int as the primary key, floats as floats, etc.

I haven't changed the geometry of the database table since the import.
