Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add endpoint to write multiple docs #1043

Merged
merged 8 commits into from Jun 25, 2021
Merged

Add endpoint to write multiple docs #1043

merged 8 commits into from Jun 25, 2021

Conversation

EricBorczuk
Copy link
Collaborator

@EricBorczuk EricBorczuk commented Jun 21, 2021

Add an endpoint to allow writing multiple documents in a single request.

Data sent to this endpoint is expected to be in JSON lines format (1 document per line).

Additionally, you can supply an id-path query parameter to use the value at a particular path in each document as the document's key in the database, so if all your documents have an id field, you could set id-path=id and treat the value in id as the document's key. You can also use any valid path syntax (globs not allowed), e.g. id-path=user.emails.[0].id

If id-path is excluded, random UUID's will be assigned to every document, and the response will have the ID's created corresponding in the same order as the documents were supplied in.

Copy link
Contributor

@ivansenic ivansenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small fixes needed imo.. 👍

@ApiParam(
value = "A JSON Lines payload where each line is a document to write",
required = true)
InputStream payload,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a @NonNull now when we have everything validated..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the error response from dropwizard look like for that annotation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's generated in ViolationExceptionMapper, but it requires a message, please do:

@NotNull(message = "payload must not be null")

DocumentDB db = dbFactory.getDocDataStoreForToken(authToken, headers);
JsonSurfer surfer = JsonSurferGson.INSTANCE;

boolean created = db.maybeCreateTable(keyspace, collection);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we extract this to some private method, isn't there some other places that uses the same as well? what about inserting the single document?


String docId = UUID.randomUUID().toString();
if (idPath.isPresent()) {
String docsPath = convertToJsonPtr(idPath.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so everything related to the idPath conversion to the doc path can be moved out of the loop, this way you are not doing the same thing over and over again as this seems to be a constant..

Comment on lines 431 to 442
try (BufferedReader reader = new BufferedReader(new InputStreamReader(payload, "UTF-8"))) {
Iterator<String> iter = reader.lines().iterator();
ExecutionContext context = ExecutionContext.NOOP_CONTEXT;
String docsPath = convertToJsonPtr(idPath.get());
int chunkIndex = 0;
final int CHUNK_SIZE = 250;
while (iter.hasNext()) {
chunkIndex++;
Map<String, String> docsInChunk = new LinkedHashMap<>();
while (docsInChunk.size() < CHUNK_SIZE && iter.hasNext()) {
String doc = iter.next();
JsonNode json = mapper.readTree(doc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we require one doc per line? Is it not possible to parse JSON input in a stream fashion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the input isn't JSON, since it's actually multiple delimited JSON objects (JSON lines format). I'll look into possible methods to stream JSON lines, but Jackson doesn't have anything for JSON lines afaik

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to make the input valid JSON (i.e. require that it's a JSON array) and then streaming probably becomes possible/easy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess it may depend on how the library works with input streams. Please post your findings :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requiring input to be a JSON array is not unreasonable from my POV.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps an option to import and/or export in either format?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So using a JsonParser (Jackson's impl for streaming JSON) is feasible, but only with valid JSON, so I am going to stick with expecting array data over the wire; then, each object in the array can get parsed one by one using an ObjectMapper, which won't pull in much into memory at once 👍 impl is here

@ApiOperation(
value = "Write multiple documents in one request",
notes =
"Auto-generates an ID for the newly created document if an idPath is not provided as a query parameter. When an idPath is provided, this operation is idempotent.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this operation is idempotent

As in a NOOP or it's an upsert?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be an upsert - if you provide an idPath and do two back-to-back requests, the second request will update instead of inserting (this avoids read-before-write)

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall 👍 Just some minor, but critical tweaks remain to be done, I think.

docs.put(docId, json.toString());
}

// Write the chunk of (at most) 250 documents by firing a single batch with the row inserts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we removed "chunks" 🤔 is this a leftover comment?

@EricBorczuk EricBorczuk requested a review from dimas-b June 23, 2021 22:33
@dougwettlaufer
Copy link
Contributor

@EricBorczuk would you mind writing up a short example and filing an issue in stargate/docs so we don't forget to document this?

@EricBorczuk
Copy link
Collaborator Author

Done stargate/docs#132

Copy link
Contributor

@ivansenic ivansenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my comments are resolved now..

@ApiParam(
value = "A JSON Lines payload where each line is a document to write",
required = true)
InputStream payload,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's generated in ViolationExceptionMapper, but it requires a message, please do:

@NotNull(message = "payload must not be null")

while (jsonParser.nextToken() != JsonToken.END_ARRAY) {
JsonNode json = mapper.readTree(jsonParser);
String docId;
if (idPath.isPresent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be docsPath.isPresent()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically it is logically equivalent, but yea that would make way more sense to read :) I'll change

Copy link
Contributor

@clun clun Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we could hit a JsonToken.END_ARRAY inside a document itself and loosing track with the streaming parser here?

[
 { "id":1, "sample":[1,2,3] },
 { "id":2, "sample":[4,5,6] }
]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this out, and the parser actually works properly! This is because mapper.readTree(...) on the parser reads the next full object.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EricBorczuk EricBorczuk merged commit 72eef47 into master Jun 25, 2021
@EricBorczuk EricBorczuk deleted the multi-write-doc branch June 25, 2021 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants