Add support for scheduling Data Update jobs #891
Conversation
Adds the documentation for the `datasources.update_data` method added in PR tableau#891.
@jharris126, @jonas-eckhardt, @FrancoisZim
This was really intuitive for me. It took me only 15 minutes to refactor an existing use case I had, including setting up a new venv and installing other deps, and the refactor cut the number of lines of code needed in half. See my only recommendation about `request_id`.
```python
@api(version="3.13")
def update_data(self, datasource_or_connection_item, *, request_id, actions, payload=None):
```
Can you add some documentation about the parameters and what this method does + links to the documentation of the API?
see #893
How does this work? Do we split the documentation from the code or is this somehow merged?
for TSC, documentation and code are separated.
Not sure why, but that's how it is...
(Personally, I would prefer if functions would be documented in the source code itself, and the docs were generated from the source code, but that's not how TSC is currently handling its documentation)
Should this use `BackgroundJobItem` instead of `JobItem`?
I've been trying to work out whether the `update_data` method should wait for completion or return the `JobItem`, and it looks like returning the `JobItem` is more consistent, as this matches the other async method here (`refresh`).
To make this easier to use, would you be open to adding a convenience method to the Jobs endpoint to wait for job completion? E.g.:
```python
@api(version="3.13")
def wait_for_async_job(self, async_job_id, timeout=None):
    ASYNC_JOB_POLL_INTERVAL = 30  # seconds between polls
    completed_at = None
    jobinfo = None
    wait_start_time = time.time()
    while completed_at is None:
        time.sleep(ASYNC_JOB_POLL_INTERVAL)
        jobinfo = self.get_by_id(async_job_id)
        completed_at = jobinfo.completed_at
        if completed_at is None:
            logger.debug("Job {} ... progress={}".format(async_job_id, jobinfo.progress))
            if timeout is not None:
                wait_elapsed_time = time.time() - wait_start_time  # seconds
                if wait_elapsed_time > timeout:
                    raise TimeoutError(
                        f"Timeout after {wait_elapsed_time} seconds waiting for asynchronous job: {async_job_id}"
                    )
    finish_code = jobinfo.finish_code
    logger.info("Job {} Completed: Finish Code: {} - Notes: {}".format(async_job_id, finish_code, jobinfo.notes))
    return finish_code
```
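The polling pattern proposed above can be tried out standalone. The sketch below stubs out `get_by_id` with a fake job store (an assumption for illustration only; the real method would query the server) to show how the poll/timeout loop behaves:

```python
import time

# Stand-in for the Jobs endpoint: reports the job as completed only after a
# fixed number of polls. Purely illustrative; not part of TSC.
class FakeJobs:
    def __init__(self, completes_after_polls):
        self._polls_left = completes_after_polls

    def get_by_id(self, job_id):
        self._polls_left -= 1
        done = self._polls_left <= 0
        return {
            "completed_at": "2021-01-01T00:00:00Z" if done else None,
            "finish_code": 0 if done else None,
        }

def wait_for_async_job(jobs, job_id, poll_interval=0.01, timeout=None):
    # Poll until the job reports a completion timestamp, or the timeout
    # (in seconds) elapses.
    start = time.time()
    while True:
        time.sleep(poll_interval)
        jobinfo = jobs.get_by_id(job_id)
        if jobinfo["completed_at"] is not None:
            return jobinfo["finish_code"]
        if timeout is not None and time.time() - start > timeout:
            raise TimeoutError(f"Timed out waiting for job {job_id}")

finish_code = wait_for_async_job(FakeJobs(completes_after_polls=3), "job-1")
print(finish_code)  # 0
```

With a short `poll_interval` and a job that never completes, the same loop raises `TimeoutError`, which is the behavior the `timeout` parameter is meant to provide.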
Do you think that the need to write the JSON for the "actions" parameter makes this harder to use than it needs to be? I think we should keep this method so that you have the flexibility of writing your own 'actions' JSON, but also create some convenience methods to pass the options as parameters, e.g.:
```python
def update_datasource(
    self, datasource_or_connection_item, *,
    payload,
    matching_columns=(("ROWID", "ROWID"),),
    source_schema="Extract",
    source_table="Extract",
    target_schema="Extract",
    target_table="Extract",
):
    # Build the match condition from the (source, target) column pairs
    match_conditions_args = []
    for match_pair in matching_columns:
        match_conditions_args.append(
            {
                "op": "eq",
                "source-col": match_pair[0],
                "target-col": match_pair[1],
            }
        )
    if len(match_conditions_args) > 1:
        match_conditions_json = {"op": "and", "args": match_conditions_args}
    else:
        match_conditions_json = match_conditions_args[0]
    json_request = [
        # UPDATE action
        {
            "action": "update",
            "source-schema": source_schema,
            "source-table": source_table,
            "target-schema": target_schema,
            "target-table": target_table,
            "condition": match_conditions_json,
        },
    ]
    request_id = str(uuid.uuid4())
    return self.update_data(
        datasource_or_connection_item,
        request_id=request_id,
        actions=json_request,
        payload=payload,
    )
```
i.e. hide some of the internals and allow an update by just calling:
```python
server.datasources.update_datasource(
    datasource_id,
    payload=path_to_database,
    matching_columns=[["keycol1", "keycol1"], ["keycol2", "keycol2"]],
)
```
> adding a convenience method to the Jobs endpoint to wait for job completion

Implemented a proposal in #903. Please provide feedback on the proposed `wait_for_job` in that other PR.

> create some convenience methods to pass the options as parameters

I would prefer not to do so. We have a documented and powerful "actions" language, i.e. the JSON structures. We shouldn't invent yet another "wrapper mini-language" here in this Python library.
Common cases (such as a short-hand notation for `matching_columns` which doesn't need explicit `op: eq` elements) should rather be added on the server-side. That way, all clients (also JavaScript clients, etc.) can benefit from it.
(Actually, there is a proposal for a `matching_columns` short-hand syntax. It's just that we never implemented it... @jonas-eckhardt do you think we should re-prioritize that convenience feature on the server-side?)
(force-pushed from eec2f88 to 7129b24)
This commit adds support for the `datasources/<id>/data` endpoint, through which one can schedule jobs to update the data within a published live-to-Hyper datasource on the server.

The new `datasources.update_data` expects the arguments:
* a datasource or a connection: If the datasource only contains a single connection, the datasource is sufficient to identify which Hyper file should be updated. Otherwise, for datasources with multiple connections, the connection has to be provided. This distinction happens on the server, so the client library only needs to provide a way to specify either of the two.
* a `request_id`, which will be used to ensure idempotency on the server. This parameter is simply passed as an HTTP header.
* an `actions` list, specifying how exactly the data on the server should be modified. We expect the caller to provide a list following the structure documented in the REST API documentation. TSC does not validate this object and simply passes it through to the server.
* an optional `payload` file: For actions like `insert`, one can provide a Hyper file which contains the newly inserted tuples or other payload data. TSC will upload this file to the server and then hand it over to the update-API endpoint.

Besides the addition of `datasources.update_data` itself, this commit also adds some infrastructure changes, e.g., to enable sending PATCH requests and HTTP headers.

The documentation for the REST endpoint can be found at https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_how_to_update_data_to_hyper.htm
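As a sketch of how a caller might put these arguments together: the snippet below builds an `actions` list following the structure documented for the REST endpoint and a fresh `request_id`. The schema/table names and the `my_upload.hyper` filename are illustrative assumptions, not values mandated by the API, and the actual server call is shown only as a comment.

```python
import uuid

# An "actions" list as documented for the REST endpoint; TSC passes this
# structure through to the server without validating it.
actions = [
    {
        "action": "insert",
        "source-schema": "Extract",   # illustrative; taken from the payload file
        "source-table": "Extract",
        "target-schema": "Extract",   # illustrative; schema/table in the datasource
        "target-table": "Extract",
    },
]

# A fresh request_id makes the call idempotent on the server: retrying with
# the same id will not apply the update twice.
request_id = str(uuid.uuid4())

print(actions[0]["action"])

# With a signed-in `server` object and a datasource item, the call would
# look roughly like this (not executed here):
# job = server.datasources.update_data(
#     datasource_item,
#     request_id=request_id,
#     actions=actions,
#     payload="my_upload.hyper",  # Hyper file containing the rows to insert
# )
```

Since the method returns a job item rather than waiting, the caller can then poll the Jobs endpoint to find out when the scheduled update has actually been applied.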
(force-pushed from 7129b24 to b8d4a01)
@t8y8 @bcantoni @jacalata I think this PR is ready for code review now. Could I get a review from one of you before I merge this? Also, @FrancoisZim brought up a question above which I wasn't able to answer since I don't really understand the difference between `JobItem` and `BackgroundJobItem`.
Mm, I'm not familiar with the job vs BackgroundJob items, but I think it's safe enough to use JobItem. BackgroundJobItem doesn't appear to be used anywhere.
To test this proposed interface locally, you can install my branch using
You can find the documentation for this method at https://vogelsgesang.github.io/server-client-python/docs/api-ref#datasourceupdate_data