
Fixed data job status in case of OOM #1586

Merged
mivanov1988 merged 7 commits into main from person/miroslavi/fix-oom-data-job-status on Feb 3, 2023

Conversation

mivanov1988

Why

Currently, when a Data Job fails because of OOM, the Control Service first marks the execution as User Error, but after some time it overrides the status to Platform Error. This is caused by K8S Job events that arrive after the data job has already completed: they do not contain enough data, so the Control Service treats them as Platform Errors.

Log of the data job execution:

Jan 27 14:38:38 tpcs-dep-68f96bc974-qhqf4 [thread-5] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:33:24 tpcs-dep-68f96bc974-8jz5h [thread-2] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
source: 10.216.162.116 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-8jz5h

Jan 27 14:32:29 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=OOMKilled)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:32:28 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=OOMKilled)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:29:48 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=null, succeeded=null, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=null, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:29:48 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=null, succeeded=null, opId=01674829787797, startTime=null, endTime=null, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
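The excerpt above (newest entries first) shows that the event carrying containerTerminationReason=OOMKilled is followed by later events in which the field is null. Below is a minimal sketch of the kind of status mapping that would produce this override, assuming a simplified classification; ExecutionStatus, StatusMappingSketch, and map are illustrative names, not the actual Control Service code.

```java
// Hypothetical illustration of the override described above; the real logic
// lives in DataJobMonitor/KubernetesService and may differ in detail.
enum ExecutionStatus { SUCCEEDED, USER_ERROR, PLATFORM_ERROR }

final class StatusMappingSketch {
    static ExecutionStatus map(Boolean succeeded, String containerTerminationReason) {
        if (Boolean.TRUE.equals(succeeded)) {
            return ExecutionStatus.SUCCEEDED;
        }
        // Events observed while the Pod status is still available carry
        // containerTerminationReason=OOMKilled, so the failure is attributed
        // to the job exceeding its memory limit (User Error).
        if ("OOMKilled".equals(containerTerminationReason)) {
            return ExecutionStatus.USER_ERROR;
        }
        // Events that arrive after completion have containerTerminationReason=null;
        // there is not enough data to blame the job, so the status falls back to
        // Platform Error and overrides the previously stored User Error.
        return ExecutionStatus.PLATFORM_ERROR;
    }
}
```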

What

Implemented logic which omits K8S Job events that arrive after the Data Job has already completed.
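A minimal sketch of the idea, assuming a simplified persistence-time guard; LateEventFilterSketch, shouldStore, and the status strings are illustrative, not the actual DataJobMonitor API.

```java
import java.util.Optional;
import java.util.Set;

// Hedged sketch: skip events for executions that already have a terminal status.
final class LateEventFilterSketch {
    private static final Set<String> FINAL_STATUSES =
            Set.of("SUCCEEDED", "USER_ERROR", "PLATFORM_ERROR");

    /** Returns true if the incoming K8S Job event should be persisted. */
    static boolean shouldStore(Optional<String> persistedStatus) {
        // If the execution has already reached a terminal status, later events
        // (which typically lack containerTerminationReason) must not override it.
        return persistedStatus.map(s -> !FINAL_STATUSES.contains(s)).orElse(true);
    }
}
```

Placing the guard next to the persistence call means every event goes through the same check, regardless of which watcher thread or Control Service replica observed it.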

Testing done

Introduced an integration test.

Signed-off-by: Miroslav Ivanov miroslavi@vmware.com

@mivanov1988 mivanov1988 enabled auto-merge (squash) February 3, 2023 12:04
@mivanov1988 mivanov1988 merged commit 764c31f into main Feb 3, 2023
@mivanov1988 mivanov1988 deleted the person/miroslavi/fix-oom-data-job-status branch February 3, 2023 13:09
mivanov1988 added a commit that referenced this pull request May 22, 2023
Why
We recently got the following feedback from our internal client:
A data job was listed as successful even though it hit the 12-hour limit and was killed. The logs do not show that either: the last entry in the log just shows the last object that was sent for ingestion, and there is no summary of the data job.

The problem was introduced by the fix in #1586.

When the job hits the 12-hour limit, the K8S Pod is terminated and we construct a partial JobExecutionStatus, which enters the following if statement and returns Optional.empty() instead of the constructed object.

https://github.com/vmware/versatile-data-kit/blob/4763ba877f43b270fbd4770bc1533216f7c5d618/projects/control-service/projects/pipelines_control_service/src/main/java/com/vmware/taurus/service/KubernetesService.java#L1656

As a result, the job execution becomes stuck in the Running status until it is detected by the emergency logic, which marks such executions as successful because they no longer have any associated Pods.

What
Added validation for an already completed job in a more appropriate place.
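A hedged sketch of the reshuffled validation, assuming a simplified split between the reading and persistence layers; all names below are illustrative and not the actual KubernetesService code.

```java
import java.util.Optional;
import java.util.Set;

final class CompletedJobValidationSketch {

    record JobExecution(String executionId, String status) {}

    private static final Set<String> FINAL_STATUSES =
            Set.of("SUCCEEDED", "USER_ERROR", "PLATFORM_ERROR");

    // Reading layer: always return the constructed status, even when it is only
    // partially populated (e.g. the Pod was killed at the 12-hour limit), so the
    // execution can leave the Running state.
    static Optional<JobExecution> readExecutionStatus(JobExecution constructed) {
        return Optional.of(constructed); // no early Optional.empty() here any more
    }

    // Persistence layer: only here do we check whether the stored execution has
    // already completed and should not be overridden by the incoming event.
    static boolean shouldUpdate(Optional<String> persistedStatus) {
        return persistedStatus.map(s -> !FINAL_STATUSES.contains(s)).orElse(true);
    }
}
```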

Testing Done
Added an integration test

Signed-off-by: Miroslav Ivanov miroslavi@vmware.com
mivanov1988 added a commit that referenced this pull request May 23, 2023
mivanov1988 added a commit that referenced this pull request May 25, 2023