Add endpoint to write multiple docs #1043

EricBorczuk · 2021-06-21T15:00:27Z

Add an endpoint to allow writing multiple documents in a single request.

Data sent to this endpoint is expected to be in JSON lines format (1 document per line).

Additionally, you can supply an id-path query parameter to use the value at a particular path in each document as that document's key in the database. For example, if all your documents have an id field, you can set id-path=id and the value of id becomes the document's key. Any valid path syntax is accepted (globs are not allowed), e.g. id-path=user.emails.[0].id

If id-path is omitted, a random UUID is assigned to every document, and the response lists the created IDs in the same order as the documents were supplied.
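To make the key-assignment rule in the description concrete, here is a minimal sketch. It is not the code from this PR: the class `IdPathSketch`, the methods `resolve` and `assignIds`, and the use of plain `Map`/`List` values in place of parsed JSON are all assumptions made for illustration. It only shows the two behaviours described: the key is read from the id-path when one is given, otherwise a random UUID is used, and the returned IDs keep the order of the input documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

public class IdPathSketch {

  /** Walks a dotted path such as "user.emails.[0].id" through nested maps and lists. */
  public static Object resolve(Object doc, String idPath) {
    Object current = doc;
    for (String segment : idPath.split("\\.")) {
      if (segment.startsWith("[") && segment.endsWith("]")) {
        // Array segment, e.g. "[0]": index into a list.
        int index = Integer.parseInt(segment.substring(1, segment.length() - 1));
        current = ((List<?>) current).get(index);
      } else {
        // Field segment: look the name up in a map.
        current = ((Map<?, ?>) current).get(segment);
      }
      if (current == null) {
        throw new IllegalArgumentException("no value at id-path " + idPath);
      }
    }
    return current;
  }

  /** Returns one key per document, in the same order the documents were supplied. */
  public static List<String> assignIds(List<Map<String, Object>> docs, Optional<String> idPath) {
    List<String> ids = new ArrayList<>();
    for (Map<String, Object> doc : docs) {
      String id = idPath.isPresent()
          ? String.valueOf(resolve(doc, idPath.get()))
          : UUID.randomUUID().toString();
      ids.add(id);
    }
    return ids;
  }
}
```

With `Optional.of("id")`, the documents `{"id": 1}` and `{"id": 2}` are keyed `"1"` and `"2"`; with `Optional.empty()`, each call produces fresh UUIDs.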

ivansenic

A few small fixes needed, imo. 👍

ivansenic · 2021-06-22T09:05:59Z

restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java

+      @ApiParam(
+              value = "A JSON Lines payload where each line is a document to write",
+              required = true)
+          InputStream payload,


Can we add a @NonNull now that we have everything validated?

What's the error response from dropwizard look like for that annotation?

it's generated in ViolationExceptionMapper, but it requires a message, please do:

@NotNull(message = "payload must not be null")

ivansenic · 2021-06-22T09:07:47Z

restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java

+    DocumentDB db = dbFactory.getDocDataStoreForToken(authToken, headers);
+    JsonSurfer surfer = JsonSurferGson.INSTANCE;
+
+    boolean created = db.maybeCreateTable(keyspace, collection);


Can we extract this into a private method? Aren't there other places that use the same logic as well? What about inserting a single document?

ivansenic · 2021-06-22T09:12:24Z

restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java

+
+          String docId = UUID.randomUUID().toString();
+          if (idPath.isPresent()) {
+            String docsPath = convertToJsonPtr(idPath.get());


Everything related to converting the idPath to the doc path can be moved out of the loop. The result is constant, so this way you are not doing the same thing over and over again.
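A small sketch of the hoisting suggested above. The PR's `convertToJsonPtr` is not shown in this thread, so `toJsonPointer` below is a hypothetical stand-in (it ignores JSON Pointer escaping of `~` and `/`), and `HoistedIdPath` and `describeLookups` are illustrative names only. The point is simply that the conversion runs once, before the loop, and the loop reuses the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class HoistedIdPath {

  /** Hypothetical stand-in for convertToJsonPtr: "a.[0].b" becomes "/a/0/b". */
  public static String toJsonPointer(String idPath) {
    StringBuilder pointer = new StringBuilder();
    for (String segment : idPath.split("\\.")) {
      boolean isIndex = segment.startsWith("[") && segment.endsWith("]");
      pointer.append('/')
          .append(isIndex ? segment.substring(1, segment.length() - 1) : segment);
    }
    return pointer.toString();
  }

  /** Pairs every document with the pointer that would be used to look up its key. */
  public static List<String> describeLookups(List<String> docs, Optional<String> idPath) {
    // Converted once, before the loop: the result is the same for every document.
    Optional<String> docsPath = idPath.map(HoistedIdPath::toJsonPointer);
    List<String> lookups = new ArrayList<>();
    for (String doc : docs) {
      // Inside the loop only the already-converted value is consulted.
      lookups.add(docsPath.map(p -> doc + " @ " + p).orElse(doc + " @ random UUID"));
    }
    return lookups;
  }
}
```

Keeping the converted value in an `Optional` also lets the loop branch on `docsPath.isPresent()` rather than on the raw query parameter.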

testing/src/main/java/io/stargate/it/http/docsapi/BaseDocumentApiV2Test.java

dimas-b · 2021-06-22T16:50:29Z

restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java

+    try (BufferedReader reader = new BufferedReader(new InputStreamReader(payload, "UTF-8"))) {
+      Iterator<String> iter = reader.lines().iterator();
+      ExecutionContext context = ExecutionContext.NOOP_CONTEXT;
+      String docsPath = convertToJsonPtr(idPath.get());
+      int chunkIndex = 0;
+      final int CHUNK_SIZE = 250;
+      while (iter.hasNext()) {
+        chunkIndex++;
+        Map<String, String> docsInChunk = new LinkedHashMap<>();
+        while (docsInChunk.size() < CHUNK_SIZE && iter.hasNext()) {
+          String doc = iter.next();
+          JsonNode json = mapper.readTree(doc);


Why do we require one doc per line? Is it not possible to parse JSON input in a stream fashion?

Technically, the input isn't JSON, since it's actually multiple delimited JSON objects (JSON lines format). I'll look into possible methods to stream JSON lines, but Jackson doesn't have anything for JSON lines afaik

Another option would be to make the input valid JSON (i.e. require that it's a JSON array) and then streaming probably becomes possible/easy

Yeah, I guess it may depend on how the library works with input streams. Please post your findings :)

Requiring input to be a JSON array is not unreasonable from my POV.

Perhaps an option to import and/or export in either format?

So using a JsonParser (Jackson's impl for streaming JSON) is feasible, but only with valid JSON, so I am going to stick with expecting array data over the wire. Each object in the array can then be parsed one by one using an ObjectMapper, which won't pull much into memory at once 👍 impl is here

restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java

restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java

dougwettlaufer · 2021-06-23T18:02:12Z

restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java

+  @ApiOperation(
+      value = "Write multiple documents in one request",
+      notes =
+          "Auto-generates an ID for the newly created document if an idPath is not provided as a query parameter. When an idPath is provided, this operation is idempotent.",


this operation is idempotent

As in a NOOP, or is it an upsert?

It will be an upsert - if you provide an idPath and do two back-to-back requests, the second request will update instead of inserting (this avoids read-before-write)
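The upsert behaviour described in this thread can be illustrated with an in-memory stand-in. This is only a sketch under assumptions: `UpsertSketch` and `writeAll` are hypothetical names, a `LinkedHashMap` plays the role of the collection, and the key function is passed in by the caller. It shows why repeating a request is idempotent when keys come from the documents, and why it is not when keys are random.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

public class UpsertSketch {
  // In-memory stand-in for the collection: document key -> document body.
  private final Map<String, String> store = new LinkedHashMap<>();

  /** Writes every document without reading first; an existing key is overwritten. */
  public void writeAll(List<String> docs, Function<String, String> keyOf) {
    for (String doc : docs) {
      store.put(keyOf.apply(doc), doc);
    }
  }

  public int size() {
    return store.size();
  }

  public static void main(String[] args) {
    List<String> docs = List.of("1:alice", "2:bob");

    // Keys taken from the documents (here: the text before the colon).
    UpsertSketch keyed = new UpsertSketch();
    Function<String, String> fromDoc = d -> d.substring(0, d.indexOf(':'));
    keyed.writeAll(docs, fromDoc);
    keyed.writeAll(docs, fromDoc); // second request updates the same two keys
    System.out.println(keyed.size());

    // Random keys: every request inserts new documents.
    UpsertSketch random = new UpsertSketch();
    Function<String, String> uuid = d -> UUID.randomUUID().toString();
    random.writeAll(docs, uuid);
    random.writeAll(docs, uuid);
    System.out.println(random.size());
  }
}
```

In the first case the store still holds two documents after the repeated request; in the second it holds four.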

dimas-b

LGTM overall 👍 Just some minor, but critical tweaks remain to be done, I think.

dimas-b · 2021-06-23T22:19:43Z

restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java

+        docs.put(docId, json.toString());
+      }
+
+      // Write the chunk of (at most) 250 documents by firing a single batch with the row inserts


I thought we removed "chunks" 🤔 is this a leftover comment?

restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java

dougwettlaufer · 2021-06-24T15:21:40Z

@EricBorczuk would you mind writing up a short example and filing an issue in stargate/docs so we don't forget to document this?

EricBorczuk · 2021-06-24T15:28:54Z

Done stargate/docs#132

ivansenic

I think my comments are resolved now..

ivansenic · 2021-06-25T12:39:23Z

restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java

+      @ApiParam(
+              value = "A JSON Lines payload where each line is a document to write",
+              required = true)
+          InputStream payload,


it's generated in ViolationExceptionMapper, but it requires a message, please do:

@NotNull(message = "payload must not be null")

ivansenic · 2021-06-25T12:42:10Z

restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java

+      while (jsonParser.nextToken() != JsonToken.END_ARRAY) {
+        JsonNode json = mapper.readTree(jsonParser);
+        String docId;
+        if (idPath.isPresent()) {


shouldn't this be docsPath.isPresent()?

technically it is logically equivalent, but yea that would make way more sense to read :) I'll change

Wondering if we could hit a JsonToken.END_ARRAY inside a document itself and lose track with the streaming parser here?

[ { "id":1, "sample":[1,2,3] }, { "id":2, "sample":[4,5,6] } ]

I've tested this out, and the parser actually works properly! This is because mapper.readTree(...) on the parser reads the next full object.

See https://github.com/stargate/stargate/pull/1043/files#diff-585290b1d8a4cd9f6b594dd2e6a35102c4ab7c86988b9df67452177d38c73868R1, which is used in a few tests
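The reason a nested `]` does not end the outer loop can be shown without Jackson. The sketch below is a dependency-free stand-in, not the PR's implementation and not how Jackson works internally: `TopLevelArraySplitter` is a hypothetical class that consumes one complete top-level object at a time by tracking nesting depth, which is the role `mapper.readTree(jsonParser)` plays in the loop under review. The outer loop only ever sees a closing bracket between elements. It is deliberately lenient (it does not validate the JSON) and assumes every element is an object.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TopLevelArraySplitter {

  /** Reads a JSON array of objects and returns each top-level object as raw text. */
  public static List<String> splitDocuments(Reader in) {
    try {
      List<String> docs = new ArrayList<>();
      if (nextNonWhitespace(in) != '[') {
        throw new IllegalArgumentException("payload must be a JSON array");
      }
      while (true) {
        int c = nextNonWhitespace(in);
        if (c == ']') {
          return docs; // the outer end of array, only ever seen between elements
        }
        if (c == ',') {
          continue; // separator between documents
        }
        if (c != '{') {
          throw new IllegalArgumentException("expected a JSON object in the array");
        }
        docs.add(readObject(in)); // consumes the whole document, nested arrays included
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Called after the opening brace; reads until the matching closing brace. */
  private static String readObject(Reader in) throws IOException {
    StringBuilder sb = new StringBuilder("{");
    int depth = 1;
    boolean inString = false;
    boolean escaped = false;
    while (depth > 0) {
      int c = in.read();
      if (c == -1) {
        throw new IllegalArgumentException("unterminated document");
      }
      sb.append((char) c);
      if (inString) {
        // Brackets inside string values must not affect the depth.
        if (escaped) {
          escaped = false;
        } else if (c == '\\') {
          escaped = true;
        } else if (c == '"') {
          inString = false;
        }
      } else if (c == '"') {
        inString = true;
      } else if (c == '{' || c == '[') {
        depth++;
      } else if (c == '}' || c == ']') {
        depth--;
      }
    }
    return sb.toString();
  }

  private static int nextNonWhitespace(Reader in) throws IOException {
    int c;
    do {
      c = in.read();
    } while (c != -1 && Character.isWhitespace(c));
    return c;
  }
}
```

Fed the sample payload from this thread through a `StringReader`, it yields two documents, each still containing its own `sample` array.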

EricBorczuk requested review from dimas-b, dougwettlaufer, ivansenic, mpenick, olim7t and tomekl007 as code owners June 21, 2021 15:00

EricBorczuk force-pushed the multi-write-doc branch 7 times, most recently from 89c26d0 to b48224e Compare June 22, 2021 01:01

ivansenic suggested changes Jun 22, 2021

dimas-b reviewed Jun 22, 2021

dimas-b reviewed Jun 22, 2021

dimas-b reviewed Jun 22, 2021

dimas-b reviewed Jun 23, 2021

dimas-b reviewed Jun 23, 2021

dougwettlaufer reviewed Jun 23, 2021

EricBorczuk force-pushed the multi-write-doc branch from 2f11bf2 to 5d22887 Compare June 23, 2021 19:16

EricBorczuk requested review from ivansenic, dougwettlaufer and dimas-b June 23, 2021 19:17

EricBorczuk force-pushed the multi-write-doc branch 2 times, most recently from 1712c6c to 2304be2 Compare June 23, 2021 21:42

dimas-b suggested changes Jun 23, 2021

EricBorczuk requested a review from dimas-b June 23, 2021 22:33

dougwettlaufer reviewed Jun 24, 2021

dougwettlaufer reviewed Jun 24, 2021

dougwettlaufer approved these changes Jun 24, 2021

EricBorczuk mentioned this pull request Jun 24, 2021

Add the new docs "batch write" endpoint stargate/docs#132

Closed

dimas-b approved these changes Jun 24, 2021

EricBorczuk added 7 commits June 24, 2021 16:15

Batch writing works!

56bef83

Add some int tests

0862dee

Add unit test

a173ffa

Review comments

7f856d4

Review changes, use streaming and JSON array

1f496f2

Review changes

9303881

update swagger

749b479

EricBorczuk force-pushed the multi-write-doc branch from 1347ce0 to 3d6ffe7 Compare June 24, 2021 20:20

ivansenic approved these changes Jun 25, 2021

Rebase, unglob imports

718269c

EricBorczuk force-pushed the multi-write-doc branch from 3d6ffe7 to 718269c Compare June 25, 2021 15:41

EricBorczuk merged commit 72eef47 into master Jun 25, 2021

EricBorczuk deleted the multi-write-doc branch June 25, 2021 18:38
