New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add endpoint to write multiple docs #1043
Conversation
89c26d0
to
b48224e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few small fixes needed imo.. 👍
@ApiParam( | ||
value = "A JSON Lines payload where each line is a document to write", | ||
required = true) | ||
InputStream payload, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a @NonNull
now when we have everything validated..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the error response from dropwizard look like for that annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's generated in ViolationExceptionMapper
, but it requires a message, please do:
@NotNull(message = "payload must not be null")
DocumentDB db = dbFactory.getDocDataStoreForToken(authToken, headers); | ||
JsonSurfer surfer = JsonSurferGson.INSTANCE; | ||
|
||
boolean created = db.maybeCreateTable(keyspace, collection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we extract this to some private method, isn't there some other places that uses the same as well? what about inserting the single document?
|
||
String docId = UUID.randomUUID().toString(); | ||
if (idPath.isPresent()) { | ||
String docsPath = convertToJsonPtr(idPath.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so everything related to the idPath
conversion to the doc path can be moved out of the loop, this way you are not doing the same thing over and over again as this seems to be a constant..
testing/src/main/java/io/stargate/it/http/docsapi/BaseDocumentApiV2Test.java
Show resolved
Hide resolved
try (BufferedReader reader = new BufferedReader(new InputStreamReader(payload, "UTF-8"))) { | ||
Iterator<String> iter = reader.lines().iterator(); | ||
ExecutionContext context = ExecutionContext.NOOP_CONTEXT; | ||
String docsPath = convertToJsonPtr(idPath.get()); | ||
int chunkIndex = 0; | ||
final int CHUNK_SIZE = 250; | ||
while (iter.hasNext()) { | ||
chunkIndex++; | ||
Map<String, String> docsInChunk = new LinkedHashMap<>(); | ||
while (docsInChunk.size() < CHUNK_SIZE && iter.hasNext()) { | ||
String doc = iter.next(); | ||
JsonNode json = mapper.readTree(doc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we require one doc per line? Is it not possible to parse JSON input in a stream fashion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, the input isn't JSON, since it's actually multiple delimited JSON objects (JSON lines format). I'll look into possible methods to stream JSON lines, but Jackson doesn't have anything for JSON lines afaik
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to make the input valid JSON (i.e. require that it's a JSON array) and then streaming probably becomes possible/easy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess it may depend on how the library works with input streams. Please post your findings :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requiring input to be a JSON array is not unreasonable from my POV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps an option to import and/or export in either format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So using a JsonParser (Jackson's impl for streaming JSON) is feasible, but only with valid JSON, so I am going to stick with expecting array data over the wire; then, each object in the array can get parsed one by one using an ObjectMapper, which won't pull in much into memory at once 👍 impl is here
restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java
Outdated
Show resolved
Hide resolved
restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java
Outdated
Show resolved
Hide resolved
restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java
Outdated
Show resolved
Hide resolved
restapi/src/main/java/io/stargate/web/docsapi/service/DocumentService.java
Outdated
Show resolved
Hide resolved
restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java
Outdated
Show resolved
Hide resolved
@ApiOperation( | ||
value = "Write multiple documents in one request", | ||
notes = | ||
"Auto-generates an ID for the newly created document if an idPath is not provided as a query parameter. When an idPath is provided, this operation is idempotent.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this operation is idempotent
As in a NOOP or it's an upsert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be an upsert - if you provide an idPath
and do two back-to-back requests, the second request will update instead of inserting (this avoids read-before-write)
2f11bf2
to
5d22887
Compare
1712c6c
to
2304be2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall 👍 Just some minor, but critical tweaks remain to be done, I think.
docs.put(docId, json.toString()); | ||
} | ||
|
||
// Write the chunk of (at most) 250 documents by firing a single batch with the row inserts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we removed "chunks" 🤔 is this a leftover comment?
restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java
Outdated
Show resolved
Hide resolved
restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java
Outdated
Show resolved
Hide resolved
restapi/src/main/java/io/stargate/web/docsapi/resources/DocumentResourceV2.java
Outdated
Show resolved
Hide resolved
@EricBorczuk would you mind writing up a short example and filing an issue in stargate/docs so we don't forget to document this? |
Done stargate/docs#132 |
1347ce0
to
3d6ffe7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my comments are resolved now..
@ApiParam( | ||
value = "A JSON Lines payload where each line is a document to write", | ||
required = true) | ||
InputStream payload, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's generated in ViolationExceptionMapper
, but it requires a message, please do:
@NotNull(message = "payload must not be null")
while (jsonParser.nextToken() != JsonToken.END_ARRAY) { | ||
JsonNode json = mapper.readTree(jsonParser); | ||
String docId; | ||
if (idPath.isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be docsPath.isPresent()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically it is logically equivalent, but yea that would make way more sense to read :) I'll change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we could hit a JsonToken.END_ARRAY
inside a document itself and loosing track with the streaming parser here?
[
{ "id":1, "sample":[1,2,3] },
{ "id":2, "sample":[4,5,6] }
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this out, and the parser actually works properly! This is because mapper.readTree(...) on the parser reads the next full object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/stargate/stargate/pull/1043/files#diff-585290b1d8a4cd9f6b594dd2e6a35102c4ab7c86988b9df67452177d38c73868R1, which is used in a few tests
3d6ffe7
to
718269c
Compare
Add an endpoint to allow writing multiple documents in a single request.
Data sent to this endpoint is expected to be in JSON lines format (1 document per line).
Additionally, you can supply an
id-path
query parameter to use the value at a particular path in each document as the document's key in the database, so if all your documents have anid
field, you could setid-path=id
and treat the value inid
as the document's key. You can also use any valid path syntax (globs not allowed), e.g.id-path=user.emails.[0].id
If
id-path
is excluded, random UUID's will be assigned to every document, and the response will have the ID's created corresponding in the same order as the documents were supplied in.