
Fixed data job status in case of OOM #1586

Merged
mivanov1988 merged 7 commits into main from person/miroslavi/fix-oom-data-job-status on Feb 3, 2023

Conversation

mivanov1988

Why

Currently, when a Data Job fails because of OOM, the Control Service first marks the execution as User Error, but after some time it overrides the status to Platform Error. This is caused by K8S Job events that arrive after the data job has already completed: they do not contain enough data, so the Control Service treats them as Platform Errors.

Log of the data job execution:

Jan 27 14:38:38 tpcs-dep-68f96bc974-qhqf4 [thread-5] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:33:24 tpcs-dep-68f96bc974-8jz5h [thread-2] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
source: 10.216.162.116 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-8jz5h

Jan 27 14:32:29 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=OOMKilled)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:32:28 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=BackoffLimitExceeded, succeeded=false, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=2023-01-27T14:32:28Z, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=OOMKilled)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:29:48 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=null, succeeded=null, opId=01674829787797, startTime=2023-01-27T14:29:48Z, endTime=null, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
source: 10.216.162.76 event_type: v4_254d2c9c facility: user priority: debug hostname: tpcs-dep-68f96bc974-qhqf4

Jan 27 14:29:48 tpcs-dep-68f96bc974-qhqf4 [thread-1] com.vmware.taurus.service.monitoring.DataJobMonitor Storing Data Job execution status: KubernetesService.JobExecution(executionId=miroslavi-test44-1674829787, executionType=manual, jobName=miroslavi-test44, podTerminationMessage=null, jobTerminationReason=null, succeeded=null, opId=01674829787797, startTime=null, endTime=null, jobVersion=355c5077da74ecf3623e03b7cd17996faf9702b5, jobSchedule=11 23 5 8 1, resourcesCpuRequest=0.5, resourcesCpuLimit=1.0, resourcesMemoryRequest=1000, resourcesMemoryLimit=4000, deployedDate=2023-01-27T14:28:37.088538Z, deployedBy=miroslavi, containerTerminationReason=null)
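The excerpt above (newest entries first) shows that the event carrying containerTerminationReason=OOMKilled is followed by later events in which the field is null. Below is a minimal sketch of the kind of status mapping that would produce this override, assuming a simplified classification; ExecutionStatus, StatusMappingSketch, and map are illustrative names, not the actual Control Service code.

```java
// Hypothetical illustration of the override described above; the real logic
// lives in DataJobMonitor/KubernetesService and may differ in detail.
enum ExecutionStatus { SUCCEEDED, USER_ERROR, PLATFORM_ERROR }

final class StatusMappingSketch {
    static ExecutionStatus map(Boolean succeeded, String containerTerminationReason) {
        if (Boolean.TRUE.equals(succeeded)) {
            return ExecutionStatus.SUCCEEDED;
        }
        // Events observed while the Pod status is still available carry
        // containerTerminationReason=OOMKilled, so the failure is attributed
        // to the job exceeding its memory limit (User Error).
        if ("OOMKilled".equals(containerTerminationReason)) {
            return ExecutionStatus.USER_ERROR;
        }
        // Events that arrive after completion have containerTerminationReason=null;
        // there is not enough data to blame the job, so the status falls back to
        // Platform Error and overrides the previously stored User Error.
        return ExecutionStatus.PLATFORM_ERROR;
    }
}
```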

What

Implemented logic which omits K8S Job events that arrive after the Data Job has already completed.
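A minimal sketch of the idea, assuming a simplified persistence-time guard; LateEventFilterSketch, shouldStore, and the status strings are illustrative, not the actual DataJobMonitor API.

```java
import java.util.Optional;
import java.util.Set;

// Hedged sketch: skip events for executions that already have a terminal status.
final class LateEventFilterSketch {
    private static final Set<String> FINAL_STATUSES =
            Set.of("SUCCEEDED", "USER_ERROR", "PLATFORM_ERROR");

    /** Returns true if the incoming K8S Job event should be persisted. */
    static boolean shouldStore(Optional<String> persistedStatus) {
        // If the execution has already reached a terminal status, later events
        // (which typically lack containerTerminationReason) must not override it.
        return persistedStatus.map(s -> !FINAL_STATUSES.contains(s)).orElse(true);
    }
}
```

Placing the guard next to the persistence call means every event goes through the same check, regardless of which watcher thread or Control Service replica observed it.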

Testing done

Introduced an integration test.

Signed-off-by: Miroslav Ivanov miroslavi@vmware.com

@mivanov1988 mivanov1988 enabled auto-merge (squash) February 3, 2023 12:04
@mivanov1988 mivanov1988 merged commit 764c31f into main Feb 3, 2023
@mivanov1988 mivanov1988 deleted the person/miroslavi/fix-oom-data-job-status branch February 3, 2023 13:09
mivanov1988 added a commit that referenced this pull request May 22, 2023
Why
We recently got the following feedback from our internal client:
A data job was listed as successful even though it hit the 12-hour limit and was killed. The logs do not show that either: the last entry in the log just shows the last object that was sent for ingestion, and there is no summary of the data job.

The problem was introduced by the fix in #1586.

When the job hits the 12-hour limit, the K8S Pod is terminated and we construct a partial JobExecutionStatus, which enters the following if statement and returns Optional.empty() instead of the constructed object.

https://github.com/vmware/versatile-data-kit/blob/4763ba877f43b270fbd4770bc1533216f7c5d618/projects/control-service/projects/pipelines_control_service/src/main/java/com/vmware/taurus/service/KubernetesService.java#L1656

As a result, the job execution becomes stuck in the Running status until it is detected by the emergency logic, which marks such executions as successful because they no longer have any associated Pods.

What
Added validation for an already completed job in a more appropriate place.
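A hedged sketch of the reshuffled validation, assuming a simplified split between the reading and persistence layers; all names below are illustrative and not the actual KubernetesService code.

```java
import java.util.Optional;
import java.util.Set;

final class CompletedJobValidationSketch {

    record JobExecution(String executionId, String status) {}

    private static final Set<String> FINAL_STATUSES =
            Set.of("SUCCEEDED", "USER_ERROR", "PLATFORM_ERROR");

    // Reading layer: always return the constructed status, even when it is only
    // partially populated (e.g. the Pod was killed at the 12-hour limit), so the
    // execution can leave the Running state.
    static Optional<JobExecution> readExecutionStatus(JobExecution constructed) {
        return Optional.of(constructed); // no early Optional.empty() here any more
    }

    // Persistence layer: only here do we check whether the stored execution has
    // already completed and should not be overridden by the incoming event.
    static boolean shouldUpdate(Optional<String> persistedStatus) {
        return persistedStatus.map(s -> !FINAL_STATUSES.contains(s)).orElse(true);
    }
}
```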

Testing Done
Added an integration test

Signed-off-by: Miroslav Ivanov miroslavi@vmware.com
mivanov1988 added a commit that referenced this pull request May 23, 2023
mivanov1988 added a commit that referenced this pull request May 25, 2023