Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk Model Management #57

Closed
davemoore- opened this issue Jan 22, 2021 · 10 comments · Fixed by #82
Closed

Bulk Model Management #57

davemoore- opened this issue Jan 22, 2021 · 10 comments · Fixed by #82
Assignees
Labels
enhancement New feature or request release:1.8.0

Comments

@davemoore-
Copy link
Member

Implement an API endpoint to submit multiple entity model management operations in bulk, borrowing the functionality for bulk operations introduced in #50. This will enable a more efficient handling of multiple entity model management operations. One envisioned implementation is an entity model management user interface that provides checkboxes to delete multiple models in one request.

Proposed syntax

POST /_zentity/models/_bulk[?PARAMS]
{ PARAMS }
{ PAYLOAD }
...
POST /_zentity/models/ENTITY_TYPE/_bulk[?PARAMS]
{ PARAMS }
{ PAYLOAD }
...

Accepted operations

The bulk endpoint would support operations that create, update, or delete entity models. Unlike the implementation in #50, this will require a field in PARAMS that indicates which action is to be executed. See the Elasticsearch Bulk API implementation for reference, which requires this convention, too.

@davemoore- davemoore- added the enhancement New feature or request label Jan 22, 2021
@davemoore- davemoore- mentioned this issue Jan 22, 2021
12 tasks
@davemoore- davemoore- mentioned this issue Feb 20, 2021
9 tasks
@davemoore- davemoore- self-assigned this Feb 22, 2021
@davemoore-
Copy link
Member Author

PR #82 pending review.

I'm not implementing the second half of the proposal (see below).

POST /_zentity/models/ENTITY_TYPE/_bulk[?PARAMS]
{ PARAMS }
{ PAYLOAD }
...

It occurred to me that there isn't a compelling use case for multiple bulk operations on a single entity type. Two creates would cause a conflict, two updates would be redundant because each one requires the full document, a create followed by an update would be redundant because updates create the document if it doesn't exist, and two deletes would be redundant.

@austince
Copy link
Contributor

austince commented Mar 1, 2021

Makes a lot of sense to not have bulk endpoints on a single entity_type scope. I'm sorry I didn't comment earlier, but I think it's a bit odd that the nested PARAMS structure is different than the bulk resolution endpoints. Why not encoding the action as a separate param? I think it might follow the same thinking that people will never be doing multiple operations on a single entity type.

POST /_zentity/models/ENTITY_TYPE/_bulk[?PARAMS]
{ action: "create", "entity_type": "entity_a" }
{ PAYLOAD }
{ action: "delete", "entity_type": "entity_b" }
{ PAYLOAD }

@davemoore- davemoore- reopened this Mar 1, 2021
@davemoore-
Copy link
Member Author

My rationale was to keep the syntax consistent with the Elasticsearch Bulk API, which uses the command ("index", "create", "update", "delete") as a top-level field in both the request and response.

But this is a good time to consider the what the syntax really should be, while it's still pre-release. Let's think.

There are remaining differences between the zentity Bulk APIs and the Elasticsearch Bulk API:

  • Elasticsearch requires a trailing newline (\n). zentity requires no trailing newline, and to me it feels odd to require it.
  • Elasticsearch expects no payload after a "delete" operation. zentity expects pairs of lines for all operations, and to me having inconsistent requirements could be confusing.

Since there are differences, and since Elasticsearch client libraries hide the details of the Bulk API, then to me it raises the question, How similar do the APIs really need to be?

What you propose looks good. I like the consistency.

Going further, what if the concept of pairs was omitted altogether, and replaced with a syntax where one line is one request, containing the params and the body? Examples for both Bulk APIs below:

Bulk Model - Request

POST _zentity/models/_bulk
{ action: ..., entity_type: ..., body: ENTITY_MODEL }
{ action: ..., entity_type: ..., body: ENTITY_MODEL }
{ action: ..., entity_type: ..., body: ENTITY_MODEL }
...

Bulk Model - Response - Same approach as Bulk Resolution: Each response item is simply the response of an individual operation. Don't nested the item under an "action" field.

Bulk Resolution - Request

POST _zentity/resolution/_bulk
{ entity_type: ..., body: RESOLUTION_PAYLOAD }
{ entity_type: ..., body: RESOLUTION_PAYLOAD }
{ entity_type: ..., body: RESOLUTION_PAYLOAD }
...
POST _zentity/resolution/{entity_type}/_bulk
{ body: RESOLUTION_PAYLOAD }
{ body: RESOLUTION_PAYLOAD }
{ body: RESOLUTION_PAYLOAD }
...

Bulk Resolution - Response - No change from current implementation.

What is your reaction on this implementation?

@davemoore-
Copy link
Member Author

My own thinking on the one-line-per-request implementation above is:

  • It may be more efficient to process and easier for the client implement.
  • It may be more difficult to read the raw request if that's important.
  • Mixing params and body on one line might be confusing, or it might not. Would be good to get your opinion.
  • It avoids requiring a line with an empty body for "delete" operations, if we believe that would be confusing.

@austince
Copy link
Contributor

austince commented Mar 2, 2021

Hmm, after your nice examples and reading more about the ES bulk API, I'm actually thinking keeping what is currently implemented is best. I would be interested to know how the ES bulk API responds if multiple actions are given.

While I think the one-line-per-request is slightly easier to understand and use, I don't think it's worth such a big deviation from ES conventions over. Requiring a body after delete makes sense to me though, and I think that's minor enough that we should keep it different.

The most important things, imho, for the bulk formats is that they are supported by current ES tools like Kibana and other clients. The JS client has easy support for encoding bulk formats into NDJSON, so introducing a secondary "bulk" format might only complicate user implementations. I'm not sure, in that case, if it would be easier for the client to implement. I also think separating the params from the body does make the relationship between the bulk and non-bulk formats easier to see.

What are you leaning towards?

@davemoore-
Copy link
Member Author

davemoore- commented Mar 2, 2021

Having slept on it, I agree with your current thoughts. For me it comes down to a consistent experience. The use of pairs will be a familiar syntax for many users. The remaining differences with the Elasticsearch Bulk API are so small that I doubt many people will care or even notice. I find the pairs more readable, too, because there's a clear separation of the operation and the contents, whereas the single line approach will make some pretty dense lines.

It's worth noting that any of these propositions would work in Kibana Dev Tools, and probably any other client that can submit NDJSON, as long as this endpoint consumes NDJSON.

Re: your question on how the Elasticsearch Bulk API responds to multiple actions, it's essentially the same as the current Bulk Models API: each action is executed in series, and the responses of each action are returned as an array in the order they were executed (and each response is nested under a key with the name of the action, e.g. "index" or "delete"):

Elasticsearch Bulk API - Request

POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }

Elasticsearch Bulk API - Response

{
  "took": 30,
  "errors": false,
  "items": [{
      "index": {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 201,
        "_seq_no": 0,
        "_primary_term": 1
      }
    },
    {
      "delete": {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_version": 1,
        "result": "not_found",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 404,
        "_seq_no": 1,
        "_primary_term": 2
      }
    }
  ]
}

@austince
Copy link
Contributor

austince commented Mar 3, 2021

I was thinking more along the lines of submitting multiple actions in the same header object, like:

POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_id" : "2" } }
{ "field1" : "value1" }

POST _bulk
{ "index" : { "_index" : "test", "_id" : "2" }, "delete" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "1" } }

It is actually seems to just ignore the extra delete action, with no warnings whatsoever. But anyway, I agree the minor differences will be essentially unnoticable.

@davemoore-
Copy link
Member Author

Ah I get it now. I've only ever seen the Elasticsearch Bulk API used with one action per header object and hadn't considered what would happen if multiple were given. Ignoring the extra command makes sense though a warning would have been useful I think. I believe the intent, at least, was to have one action per header object.

@austince
Copy link
Contributor

austince commented Mar 3, 2021

Yeah, I don't think we need to do anything necessarily, but just interesting to note. Summing up, I agree with all your points and think the current interface is great 🙂 What's your feeling on the 1.8.0 release?

@davemoore-
Copy link
Member Author

Feeling good! Both bulk features work well, and I took care of some housekeeping that needed to happen. New documentation is nearly done. I might even publish the release tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request release:1.8.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants