control-service: data job synchronizer error handling #2742

Merged

Conversation

mivanov1988
Contributor

@mivanov1988 mivanov1988 commented Oct 2, 2023

Why
As part of VEP-2272, we need to introduce a process for synchronizing data jobs from the database to Kubernetes. In the event of a data job deployment failure (due to user or platform errors), the current implementation attempts to deploy the data job during every synchronization cycle.

What
We have added a data job deployment status field to the desired_data_job_deployment table. This status is used to determine whether the deployment failed in the previous cycle and should be skipped in the current one. The failed deployment status can be reset by invoking the updateDeployment API, even without a job code change (this will be implemented as part of https://github.com/vmware/versatile-data-kit/pull/2731/files).
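
As an illustration of the skip-and-reset behavior described above, here is a minimal, self-contained sketch. It is not the control-service code: the class `SynchronizerSketch`, the nested `DesiredJobDeployment` type, the `NONE` and `SUCCESS` statuses, and the methods `shouldSkip`, `jobsToDeploy`, and `resetStatus` are hypothetical names chosen for the example. Only `USER_ERROR` and `PLATFORM_ERROR` come from this PR.

```java
import java.util.ArrayList;
import java.util.List;

final class SynchronizerSketch {

    // Assumption: NONE and SUCCESS are illustrative; this PR only shows the two error statuses.
    enum DeploymentStatus { NONE, SUCCESS, USER_ERROR, PLATFORM_ERROR }

    // Hypothetical stand-in for one row of the desired_data_job_deployment table.
    static final class DesiredJobDeployment {
        private final String dataJobName;
        private DeploymentStatus status;

        DesiredJobDeployment(String dataJobName, DeploymentStatus status) {
            this.dataJobName = dataJobName;
            this.status = status;
        }

        String getDataJobName() { return dataJobName; }
        DeploymentStatus getStatus() { return status; }
        void setStatus(DeploymentStatus status) { this.status = status; }
    }

    // A deployment that failed in a previous cycle is not retried.
    static boolean shouldSkip(DesiredJobDeployment deployment) {
        return deployment.getStatus() == DeploymentStatus.USER_ERROR
                || deployment.getStatus() == DeploymentStatus.PLATFORM_ERROR;
    }

    // Returns the names of the jobs that one synchronization cycle would try to deploy.
    static List<String> jobsToDeploy(List<DesiredJobDeployment> desired) {
        List<String> names = new ArrayList<>();
        for (DesiredJobDeployment deployment : desired) {
            if (!shouldSkip(deployment)) {
                names.add(deployment.getDataJobName());
            }
        }
        return names;
    }

    // Models the reset that an update request would trigger (hypothetical).
    static void resetStatus(DesiredJobDeployment deployment) {
        deployment.setStatus(DeploymentStatus.NONE);
    }
}
```

In this sketch a failed job stays out of every cycle until `resetStatus` is called, which mirrors the intent of letting an update request clear the failed status.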

This is only the third phase of the implementation. The following enhancements are planned for future PRs:

  • We will annotate the method DataJobsSynchronizer.synchronizeDataJobs() with @Scheduled.
  • Improved exception handling will be integrated.
  • More tests will be included in subsequent updates.
  • ThreadPool configuration will be tuned and exposed through the application.properties.

Testing done
Unit tests

Signed-off-by: Miroslav Ivanov <miroslavi@vmware.com>

@mivanov1988 mivanov1988 changed the title from "Person/miroslavi/data job synchronizer error handling" to "control-service: data job synchronizer error handling" Oct 2, 2023
@mivanov1988 mivanov1988 force-pushed the person/miroslavi/data-job-synchronizer-error-handling branch from 8964d74 to 4f9ef0e Compare October 2, 2023 14:35
mivanov1988 and others added 2 commits October 2, 2023 21:51
Why
As part of VEP-2272, we need to introduce a process for synchronizing data jobs from the database to Kubernetes.

The initial implementation depends on the data_job_deployment table for both reading and writing operations. Because the deployment process runs asynchronously and takes an unpredictable amount of time, users may believe that their deployment has completed when it has not.
To address this issue, we'll be introducing a new table called "desired_data_job_deployment" and renaming the current one to "actual_data_job_deployment". The read operations will read from "actual_data_job_deployment" while the write operations will write to "desired_data_job_deployment".
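
The read/write split described above can be sketched with two in-memory maps standing in for the two tables. This is only an illustration under stated assumptions: the class `DeploymentTablesSketch` and its methods `requestDeployment`, `readDeployment`, and `synchronize` are hypothetical, and a single version string replaces the real deployment record.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class DeploymentTablesSketch {

    // Stand-in for "desired_data_job_deployment": what users asked for.
    private final Map<String, String> desired = new HashMap<>();

    // Stand-in for "actual_data_job_deployment": what is really deployed.
    private final Map<String, String> actual = new HashMap<>();

    // Write operations only touch the desired state.
    void requestDeployment(String jobName, String jobVersion) {
        desired.put(jobName, jobVersion);
    }

    // Read operations only see the actual state.
    Optional<String> readDeployment(String jobName) {
        return Optional.ofNullable(actual.get(jobName));
    }

    // One synchronization cycle: bring the actual state in line with the desired state.
    void synchronize() {
        for (Map.Entry<String, String> entry : desired.entrySet()) {
            if (!entry.getValue().equals(actual.get(entry.getKey()))) {
                // A real synchronizer would deploy to Kubernetes here and
                // record the result only after the deployment succeeds.
                actual.put(entry.getKey(), entry.getValue());
            }
        }
    }
}
```

With this split, a read issued right after a write does not report the new deployment until a synchronization cycle has actually applied it.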

What
We've implemented new deployment tables and modified the deployment logic to operate in conjunction with these new tables.

Note that this is only the second phase of the implementation. The following enhancements are planned for future PRs:

  • We will annotate the method DataJobsSynchronizer.synchronizeDataJobs() with @Scheduled.
  • Improved exception handling will be integrated.
  • More tests will be included in subsequent updates.
  • ThreadPool configuration will be tuned and exposed through the application.properties.

Testing Done
Integration tests

Signed-off-by: Miroslav Ivanov <miroslavi@vmware.com>
@mivanov1988 mivanov1988 force-pushed the person/miroslavi/data-job-synchronizer-error-handling branch from 3094e98 to 5f90aa1 Compare October 2, 2023 18:53
Why
As part of VEP-2272, we need to introduce a process for synchronizing data jobs from the database to Kubernetes. In the event of a data job deployment failure (due to user or platform errors), the current implementation attempts to deploy the data job during every synchronization cycle.

What
We have added a data job deployment status field to the desired_data_job_deployment table. This status is used to determine whether the deployment failed in the previous cycle and should be skipped in the current one.

Testing done
Unit tests

Signed-off-by: Miroslav Ivanov <miroslavi@vmware.com>
@mivanov1988 mivanov1988 force-pushed the person/miroslavi/data-job-synchronizer-error-handling branch from 1c40bb8 to ab3c4b2 Compare October 3, 2023 07:14
@mivanov1988 mivanov1988 enabled auto-merge (squash) October 3, 2023 07:17
Collaborator

@dakodakov dakodakov left a comment

LGTM

@mivanov1988 mivanov1988 merged commit 87e3c70 into main Oct 3, 2023
3 checks passed
@mivanov1988 mivanov1988 deleted the person/miroslavi/data-job-synchronizer-error-handling branch October 3, 2023 11:59
if (DeploymentStatus.USER_ERROR.equals(desiredJobDeployment.getStatus())
    || DeploymentStatus.PLATFORM_ERROR.equals(desiredJobDeployment.getStatus())) {
  log.debug(
      "Skipping the data job [job_name={}] deployment due to the previously failed deployment"
Collaborator

Is this to prevent re-tries?

Contributor Author

Yes


5 participants