Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spike] Add Support for Partitioned Databases #214

Closed
jdmintz opened this issue Sep 23, 2019 · 18 comments

Comments

@jdmintz
Copy link

@jdmintz jdmintz commented Sep 23, 2019

Both Cloudant and CouchDB support partitioned databases which make querying less expensive - computationally in CouchDB, monetarily on Cloudant.

https://cloud.ibm.com/docs/services/Cloudant/guides?topic=cloudant-database-partitioning

Please consider adding this support for Loopback users.

(Updated by JannyHou):

Design thought see #214 (comment) and reference in #214 (comment)

Proposal and follow up stories see #214 (comment), we can break down the implementation into 6 stories accordingly.

Acceptance Criteria

  • Investigate the work required to support the partitioned database in Cloudant. It could be, but not limited to, one or more of the following:
    • update to @cloudant/cloudant driver (mentioned above)
    • user can utilize the partitioned database feature by changing some configurations
    • some changes need to be done within the loopback-connector-cloudant connector to support that.
  • Create follow up tasks to implement this if changes are required on the LB side
    • If applies, make sure we include documentation to let our users know how to enable the connector to support partitioned database.

Reference

@dhmlau

This comment has been minimized.

Copy link
Contributor

@dhmlau dhmlau commented Sep 23, 2019

@raymondfeng

This comment has been minimized.

Copy link
Member

@raymondfeng raymondfeng commented Sep 23, 2019

See cloudant/nodejs-cloudant@d400375

@jannyHou I think we need to switch to @cloudant/cloudant as the driver now.

@dhmlau

This comment has been minimized.

Copy link
Contributor

@dhmlau dhmlau commented Sep 24, 2019

Gathering information from @jannyHou and @raymondfeng, we'll need to have further investigation to see what's the work involved. It might involve one or more of the following:

  • update to @cloudant/cloudant driver (mentioned above)
  • user can utilize the partitioned database feature by changing some configurations
  • some changes need to be done within the loopback-connector-cloudant connector to support that.

I'd like to make this task as a spike first.

@dhmlau dhmlau changed the title Add Support for Partitioned Databases [Spike] Add Support for Partitioned Databases Sep 24, 2019
@agnes512

This comment has been minimized.

Copy link
Contributor

@agnes512 agnes512 commented Sep 24, 2019

Discussion from the estimation meeting:

  1. check partition key settings
  2. investigate how to pass partition key to drivers
  3. notice that couchDB connector has the most implementations but CloudantDB connector only has a few
@dhmlau dhmlau added the p1 label Sep 25, 2019
@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Sep 30, 2019

Did a quick research, we can do the following to support partitioned database:

  • Upgrade the driver to be cloudant/nodejs-cloudant
  • Expose partition related APIs from cloudant/nodejs-cloudant so that users can leverage native APIs to create db/query records in a particular partition/etc...
  • Make sure the database is created to support partition. (Either manually or by calling LB APIs.)
  • Modify the index to distinguish between partitioned and global search
    • A comparison between partitioned search vs global search see comment
    • The index for 'loopback__model__name' should still be global.
    • When people define the index, they need to configure the partitioned as true/false, default to false to support global search
    • If an index is partitioned:
      • Do we have to specify the index name in the query? Please note the mongo query system will automatically search for the best matched global index for a global query.
        • Yes
      • Does partitioned query support the same syntax of the global mongo query system?
        • Yes
  • Insert proper id field in pattern <partition_name>: id
    • Problem: For a non-partitioned db, db.insert() will give the document a random string as _id if it’s missing in the payload. While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo. Let's figure out a better UX for users to provide the id part.
    • Proposal: If _id is provided then we honor the entire string. If _id is missing but partition is provided, we generate a uuid and append it after the partition. If neither of them provided the request will be rejected by the cloudant service.

Will update more finding and PoC asap.

Some notes:

  • I was thinking of having a config to enable modelName as the default partition key, but after reading topic "a good partition key", I realized it won't be a good practice. Like new orders are better to have the value of userId as the their partition key instead of order. Therefore we should only honor the partition key from the request itself.
@agnes512 agnes512 mentioned this issue Sep 30, 2019
18 of 25 tasks complete
@emonddr emonddr added this to the Oct 2019 milestone milestone Oct 1, 2019
@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Oct 2, 2019

A summary of partitioned index and query:

Partitioned Search

use case 1

partitioned index defined
partitionedFind()

✔️ correct result is returned
✔️ index is used

use case 2

partitioned index defined
partitionedFind()
sort with a field in index

✔️ correct result is returned
✔️ index is used

use case 3

partitioned index defined
partitionedFind()
sort with a field NOT in index

✖️ error - index not found

use case 4

partitioned index defined
partitionedFind()
sort with a field NOT in index

✖️ error - index not found

use case 5

partitioned index defined
partitionedFind()

✔️ correct result is returned
✔️ index is used

use case 6

global index defined
partitionedFind()

✔️ correct result is returned
✖️ no matched index to optimize the performance

Global Search

use case 1

partitioned index defined
find()

✔️ correct result is returned
✖️ no matched index to optimize the performance

use case 2

global index defined
find()

✔️ correct result is returned
✔️ index is used

Advanced Query

use case 1

partitioned index defined
partitionedFind()
advanced query

✔️ $regex
✔️ nested property (e.g. address.city)
✔️ array search (e.g. $elemMatch)

@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Oct 2, 2019

Proposal for Partitioned Support

Design thought see #214 (comment) and #214 (comment)

Example: Order model attached to Cloudant datasource.

  • Update driver to https://github.com/cloudant/nodejs-cloudant

  • Find a new db (support partition)for test

    • The docker one is retired and don't support partition
    • Check the couchdb2 image as well, and replace it with a new one if necessary. (since 95% of the code are written in loopback-connector-couchdb2)
  • The global search and index stay the same

    • By default Order.find() still invokes db.find() as it is.
    • By default an index is created with {partitioned: false} as global.
    • When people define the index in the model definition, they can configure the partitioned as true/false.
  • (UX)Optimize query with partition

    • We need to think of a signature to distinguish a partitioned query from a global one
    • Proposal 1: passes in an option in the find method, like Order.find({<query_object>}, {partitionKey: '<name_of_key>'})
      • In Cloudant.prototype.find(), if options.partitionKey is provided, then invoke db.partitionedFind('<name_of_key>', query)
      • pro: it won't be mixed with other query properties.
      • con: request won't have options, you can only call function with options from code.
    • Proposal 2: provide the partitionKey in query, like Order.find({partitionKey: 'akey', somefield: 'somevalue'})
      • In Cloudant.prototype.find(), if partitionKey is detected in the query, then invoke db.partitionedFind('<name_of_key>', queryWithPartitionKeyExcluded)
      • pro: you can provide the partitionKey in the request
      • con: it's mixed with other properties, if the document just have a field called partitionKey, then the query will be broken...Maybe we can name it as lb_partition_key as a preserved field name as a solution?
    • Proposal 3: We just define a new connector method that wraps db.partitionedFind()
      • Just like how we wrap viewDocs and support the view search.
      • pro: the simplest solution
      • con: no model api to use, you have to call it from connector(personally I wouldn't recommand this proposal)
  • (UX)Insert when _id is missing

    • Problem: For a non-partitioned db, db.insert() will give the document a random string as _id if it’s missing in the payload. While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo. Let's figure out a better UX for users to provide the id part.
    • Proposal:
      • If _id is provided then we honor the entire string.
      • If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.
      • If neither of them provided the request will be rejected by the cloudant service.
  • Document how to call other partition APIs from connector

    • There are a brunch of other parition APIs like db.partitionInfo(), db.partitioned.List(), db.partitionedSearch(). People can just get the driver instance from the connector instance and execute these native driver APIs.
@jdmintz

This comment has been minimized.

Copy link
Author

@jdmintz jdmintz commented Oct 2, 2019

Awesome progress here. CouchDB 2.x doesn't support Partition Querying yet. It will be available in 3.0 (being wrapped up soon-ish according to the email group)

Can confirm that the Cloudant Developer edition was retired in favor of CouchDB containers.

@bajtos

This comment has been minimized.

Copy link
Member

@bajtos bajtos commented Oct 3, 2019

Re proposal 2:

con: it's mixed with other properties, if the document just have a field called partitionKey, then the query will be broken...Maybe we can name it as lb_partition_key as a preserved field name as a solution?

Please note the filter argument of find method has properties where, include, skip, limit, etc. Model properties are nested under the where field. There is no need to worry about partitionKey clashing with model properties.

I don't know anything about Partition Queries in Cloudant/CouchDB. Purely from the LoopBack server & client perspective, I like the proposal 2 most.

Are there any security implications to be aware of? Can the partitionKey property be exploited by a malicious client?

For a non-partitioned db, db.insert() will give the document a random string as _id if it’s missing in the payload. While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo. Let's figure out a better UX for users to provide the id part.

In LB3 days, we have offline-sync feature, where data is created on the client first (including an autogenerated uuid/v4 value, and then synced with the server. Can we use this feature for Cloudant/CouchDB too?

  • If _id is provided then we honor the entire string.
  • If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.
  • If neither of them provided the request will be rejected by the cloudant service.

Sounds good to me. Personally, I'd ask developers to configure the id property as follows:

{ 
  type: 'string',
  id: true, 
  defaultFn: 'uuidv4'
}

In my limited understanding of the problem domain, this may be all that's needed to make things work with Cloudant:

  • If the client provides id, then we honor the value
  • If the client did not provide it, then we generate a unique one (irrespectively of partitioning setup)
@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Oct 4, 2019

@bajtos Thank you for the detailed review and giving feedback!

Please note the filter argument of find method has properties where, include, skip, limit, etc. Model properties are nested under the where field. There is no need to worry about partitionKey clashing with model properties.

Good point! If unknown filter properties are not removed from the request(IIRC they aren't) this will definitely be a decisive reason to choose proposal 2.

Are there any security implications to be aware of? Can the partitionKey property be exploited by a malicious client?

My understanding is, partitionKey is similar to the path parameter in an url, so it's ok to make it public.

E.g. "GET /users/{id}/orders" VS "get all the orders with partitionKey equals to someUserId"
I can double check this.

I'd ask developers to configure the id property as follows:

{ 
  type: 'string',
  id: true, 
  defaultFn: 'uuidv4'
}

Sound good 👍

If the client did not provide it, then we generate a unique one (irrespectively of partitioning setup)

The pattern for _id in a partitioned db is partitionKey: id, uuidv4 can only generate id.
And partitioned database does NOT allow inserting a document without the partitionKey as prefix, that's why user will have to provide at least the partition key part, or the full _id.

@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Oct 7, 2019

Follow up stories created:

@emonddr

This comment has been minimized.

Copy link

@emonddr emonddr commented Oct 7, 2019

Design thought see #214 (comment) and #214 (comment)

same link both times.

@emonddr

This comment has been minimized.

Copy link

@emonddr emonddr commented Oct 7, 2019

In Cloudant.prototype.find(), if partitionKey is detected in the query, then invoke db.partitionedFind('<name_of_key>', queryWithPartitionKeyExcluded)

if partition key is provided, shouldn't it be queryWithPartitionKeyIncluded ?

@emonddr

This comment has been minimized.

Copy link

@emonddr emonddr commented Oct 7, 2019

While for partitioned database, _id is a must provide field, the random string need to be generated using 'uuid/v4', see some examples in the driver repo

For a partitioned database, isn't the partition key used as the id value?

If so, I am confused about this statement:

If the id part is missing but partitionKey is provided, we generate a uuid and append it after the partition.

@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Oct 7, 2019

@emonddr Thank you for the reiview!

Design thought see #214 (comment) and #214 (comment)
same link both times.

If you click on the links they have different anchors, I typed the full address, the names are auto generated(converted) by github.

if partition key is provided, shouldn't it be queryWithPartitionKeyIncluded ?

Ah let me explain, db.partitionedFind() takes in two arguments, the first is the partitionKey as a string, the second is the rest of the query(that why I put it as query with partition key EXCLUDED)

For a partitioned database, isn't the partition key used as the id value?

Not really, the pattern of an _id in the partitioned db is <partition_name>: id, it consists of two parts:

  • partition name, or say partition key
  • the unique id

Does the id related proposal make more sense now?

@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Oct 7, 2019

Had a chat with @raymondfeng , here is a summary:

model definition

Ideally, a partition key would map to a model property, like model User has countryCode as its partition key. This would be consistent with the behavior in loopback-connector-cassandra. E.g.

customers = db.define('customers', {
  userId: {type: Number, id: true},
  countryCode: {type: String, isPartitionKey: true},
  name: String,
  zipCode: Number,
});

CRUD APIs

create(user)

compose _id as <countryCode>: userId then create document by calling db.insert()

find(user, options)

  • first try get the partition key from options like {partitonKey: 'US'}
  • if not found, try get it from query like {where: {name: 'somename', countryCode: 'US'}}
  • then invoke db.partitionedFind('US', {selector: {name: 'somename'}})

findById

connector parses the provided id, if it's in pattern partitonKey: id, then invoke db.partitionedFind()

use case 1
findById('fdb2ff86-78c1-47bb-bc63-f239db06c578')

  • invokes db.find()
    • partitioned db will reject the request
    • still a valid case for non-partitioned db

use case 2
findById('USA: fdb2ff86-78c1-47bb-bc63-f239db06c578')

  • invokes partitionedFind()

Follow-up stories

Stage one

Stay the same

Stage two

Developers still need to provide a full _id with partition key and uuid when create a model instance.
We support read partition key from query options
And optimize findById

  • #220 read partition key from options
  • #217 read partitionKey from model property

Stage three

  • #221 optimize findById
  • #218 compose id
@jannyHou

This comment has been minimized.

Copy link
Contributor

@jannyHou jannyHou commented Oct 9, 2019

See epic #219
Closing this spike

@jannyHou jannyHou closed this Oct 9, 2019
@dhmlau

This comment has been minimized.

Copy link
Contributor

@dhmlau dhmlau commented Oct 15, 2019

MVP - epic #219
Post MVP (enhancement) - epic #222

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.