Conversation

vogelsgesang
Collaborator

@vogelsgesang vogelsgesang commented Sep 18, 2021

To test this proposed interface locally, you can install my branch using

 pip install git+https://github.com/vogelsgesang/server-client-python.git@update-jobs

You can find the documentation for this method at https://vogelsgesang.github.io/server-client-python/docs/api-ref#datasourceupdate_data

Commit Description

This commit adds support for the `datasources/<id>/data` endpoint
through which one can schedule jobs to update the data within a
published live-to-Hyper datasource on the server.

The new `datasources.update_data` method expects the following arguments:

  • a datasource or a connection: If the datasource only contains a
    single connection, the datasource is sufficient to identify which
    Hyper file should be updated. For datasources with multiple
    connections, the connection has to be provided instead. This
    distinction happens on the server, so the client library only needs
    to provide a way to specify either of the two.
  • a `request_id` which will be used to ensure idempotency on the server.
    This parameter is simply passed as an HTTP header.
  • an `actions` list, specifying how exactly the data on the server
    should be modified. We expect the caller to provide a list following
    the structure documented in the REST API documentation. TSC does
    not validate this object and simply passes it through to the server.
  • an optional `payload` file: For actions like `insert`, one can
    provide a Hyper file which contains the newly inserted tuples or
    other payload data. TSC will upload this file to the server and then
    hand it over to the update-API endpoint.

Besides the addition of `datasources.update_data` itself, this
commit also adds some infrastructure changes, e.g., to enable sending
PATCH requests and HTTP headers.

The documentation for the REST endpoint can be found at
https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_how_to_update_data_to_hyper.htm
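For illustration, the arguments described above can be combined into a short usage sketch. The `server` and `datasource` objects, the `new_rows.hyper` file name, and the concrete action values are illustrative assumptions, not part of this PR; the `actions` structure follows the REST API documentation linked above:

```python
import uuid

# An "actions" list following the structure from the REST API documentation;
# TSC passes it through to the server without validating it.
actions = [
    {
        "action": "insert",
        "source-schema": "Extract",
        "source-table": "Extract",
        "target-schema": "Extract",
        "target-table": "Extract",
    }
]

# A fresh UUID works well as the request_id: retrying the call with the
# same id stays idempotent on the server side.
request_id = str(uuid.uuid4())

# Assuming `server` is a signed-in TSC.Server and `datasource` was fetched,
# e.g., via server.datasources.get_by_id(...):
# job = server.datasources.update_data(
#     datasource, request_id=request_id, actions=actions, payload="new_rows.hyper"
# )
```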

@vogelsgesang vogelsgesang marked this pull request as draft September 18, 2021 16:03
vogelsgesang added a commit to vogelsgesang/server-client-python that referenced this pull request Sep 18, 2021
Adds the documentation for the `datasources.update_data` method
added in PR tableau#891.
@vogelsgesang
Collaborator Author

@jharris126, @jonas-eckhardt, @FrancoisZim
I would love to hear your feedback on the interfaces proposed in this commit. Currently, we can still change it without breaking compatibility... So now is the time to chime in with any feedback you might have ;)

@jharris126
Contributor

jharris126 commented Sep 18, 2021

This was really intuitive for me. It took me only 15 minutes to refactor an existing use case I had, including setting up a new venv and installing other deps, and the refactor cut the number of lines of code needed in half. See my only recommendation, about request_id.

@api(version="3.13")
def update_data(self, datasource_or_connection_item, *, request_id, actions, payload=None):


Can you add some documentation about the parameters and what this method does + links to the documentation of the API?

Collaborator Author


see #893


How does this work? Do we split the documentation from the code or is this somehow merged?

Collaborator Author


for TSC, documentation and code are separated.
Not sure why, but that's how it is...
(Personally, I would prefer if functions would be documented in the source code itself, and the docs were generated from the source code, but that's not how TSC is currently handling its documentation)


Should this use BackgroundJobItem instead of JobItem?

Collaborator Author


not sure what the difference between JobItem and BackgroundJobItem is.
@t8y8, @bcantoni which one should I use here?


I've been trying to work out whether the update_data method should wait for completion or return the JobItem, and it looks like returning the JobItem is more consistent, since this matches the other async method here (refresh).

To make this easier to use, would you be open to adding a convenience method to the Jobs endpoint that waits for job completion? For example:

@api(version="3.13")
def wait_for_async_job(self, async_job_id, timeout=None):
    ASYNC_JOB_POLL_INTERVAL = 30
    completed_at = None
    finish_code = None
    jobinfo = None

    wait_start_time = time.time()

    while completed_at is None:
        time.sleep(ASYNC_JOB_POLL_INTERVAL)
        jobinfo = self.get_by_id(async_job_id)
        completed_at = jobinfo.completed_at
        if completed_at is None:
            logger.debug("Job {} ... progress={}".format(async_job_id, jobinfo.progress))
            if timeout is not None:
                wait_elapsed_time = time.time() - wait_start_time  # seconds
                if wait_elapsed_time > timeout:
                    raise TimeoutError(f"Timeout after {wait_elapsed_time} seconds waiting for asynchronous job: {async_job_id}")
        else:
            finish_code = jobinfo.finish_code

    logger.info("Job {} Completed: Finish Code: {} - Notes: {}".format(async_job_id, finish_code, jobinfo.notes))
    return finish_code


Do you think that the need to write the JSON for the "actions" parameter makes this harder to use than it needs to be? I think we should keep this method so that callers retain the flexibility of writing their own 'actions' JSON, but also create some convenience methods that pass the options as parameters, e.g.:

def update_datasource(
    self, datasource_or_connection_item, *, 
    payload, 
    matching_columns=[['ROWID','ROWID']], 
    source_schema="Extract", 
    source_table="Extract", 
    target_schema="Extract", 
    target_table="Extract"
):
    match_conditions_args = []
    for match_pair in matching_columns:
        match_conditions_args.append(
            {
                "op": "eq",
                "source-col": match_pair[0],
                "target-col": match_pair[1],
            }
        )
    if len(match_conditions_args) > 1:
        match_conditions_json = {"op": "and", "args": match_conditions_args}
    else:
        match_conditions_json = match_conditions_args[0]

    json_request = [
        # UPDATE action
        {
            "action": "update",
            "source-schema": source_schema,
            "source-table": source_table,
            "target-schema": target_schema,
            "target-table": target_table,
            "condition": match_conditions_json,
        },
    ]

    request_id = str(uuid.uuid4())
    return self.update_data(datasource_or_connection_item, request_id=request_id, actions=json_request, payload=payload)

i.e. hide some of the internals and allow an update by just calling:
server.datasources.update_datasource(datasource_id, payload=path_to_database, matching_columns=[[keycol1, keycol1], [keycol2, keycol2]])
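The condition-building half of this suggestion is pure data and can be exercised without a server; here is a self-contained version of that logic (the function name and the column pairs are illustrative only):

```python
def build_match_conditions(matching_columns):
    """Translate [source_col, target_col] pairs into the REST API's
    condition JSON: a bare "eq" op for one pair, an "and" of "eq" ops
    for several."""
    args = [
        {"op": "eq", "source-col": source, "target-col": target}
        for source, target in matching_columns
    ]
    if len(args) > 1:
        return {"op": "and", "args": args}
    return args[0]

# A single key column collapses to a bare "eq" condition:
single = build_match_conditions([["ROWID", "ROWID"]])
# Several key columns are combined with "and":
combined = build_match_conditions([["id", "id"], ["region", "region"]])
```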

Collaborator Author


> adding a convenience method to the Jobs endpoint to wait for job completion

Implemented a proposal in #903. Please provide feedback on the proposed wait_for_job in that other PR.

> create some convenience methods to pass the options as parameters

I would prefer not to do so. We have a documented and powerful "actions" language, i.e. the JSON structures. We shouldn't invent yet another "wrapper mini-language" here in this Python library.

Common cases (such as a short-hand notation for matching_columns which doesn't need explicit op: eq elements) should rather be added on the server-side. That way, all clients (also JavaScript clients, etc.) can benefit from it.

(actually, there is a proposal for a matching_columns short-hand syntax. It's just that we never implemented it... @jonas-eckhardt do you think we should re-prioritize that convenience feature on the server-side?)

@vogelsgesang vogelsgesang changed the base branch from master to development September 24, 2021 00:23
@vogelsgesang vogelsgesang marked this pull request as ready for review September 24, 2021 00:24
@vogelsgesang
Collaborator Author

@t8y8 @bcantoni @jacalata I think this PR is ready for code review now.
The feedback received so far makes me confident that the proposed interface is reasonable and helpful.
Successful test case run: https://github.com/vogelsgesang/server-client-python/actions/runs/1268009840

Could I get a review from one of you before I merge this?

Also, @FrancoisZim brought up the question

> Should this use BackgroundJobItem instead of JobItem?

above, which I wasn't able to answer since I don't really understand the difference between JobItem and BackgroundJobItem. Maybe one of you is able to answer that.

Contributor

@jacalata jacalata left a comment


Mm, I'm not familiar with the job vs BackgroundJob items, but I think it's safe enough to use JobItem. BackgroundJobItem doesn't appear to be used anywhere.

@vogelsgesang vogelsgesang merged commit 7c03396 into tableau:development Sep 29, 2021