k8s-job-reaper

A simple tool to clean up old Job resources in Kubernetes

Motivation

The TTL-after-finished feature, which can automatically clean up Job resources in Kubernetes after a configured TTL, is still alpha (behind the TTLAfterFinished feature gate) and is weakly supported in managed Kubernetes offerings. For example, it is not supported at all in EKS. As a result, completed Job resources can quickly pile up and waste cluster resources.

This tool aims to deliver the same functionality via a script that looks for an annotation on Job resources called ttl.

Note that setting restartPolicy: OnFailure is another possible cleanup strategy, but, as documented in the Kubernetes Job documentation, it deletes the underlying pod (including its logs) immediately after Job completion. It is therefore not a viable approach for many use cases.

Example

apiVersion: batch/v1
kind: Job
metadata:
  generateName: example-job-ttl-
  annotations:
    ttl: "2 hours"
spec:
  template:
    spec:
      containers:
      - name: example
        image: centos
        command: ["sleep", "90"]
      restartPolicy: Never
  backoffLimit: 0

The ttl annotation accepts any value supported by GNU date's relative-date syntax, e.g. "2 hours", "1 day", or "30 minutes".
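
Because the reaper is a shell script, such values can be checked directly with GNU date. Here is a minimal sketch of that kind of expiry check (not the actual implementation; the timestamp and TTL values are illustrative):

completion="2021-01-27T15:43:10Z"   # illustrative Job completion time
ttl="2 hours"                       # value of the ttl annotation

# GNU date understands "<timestamp> + <relative offset>" expressions.
expires_at=$(date -d "$completion + $ttl" +%s)

if [ "$(date +%s)" -ge "$expires_at" ]; then
  echo "TTL expired; the reaper would delete this Job"
fi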

Note that this example Job is deployed with kubectl create rather than kubectl apply due to its usage of generateName.
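
For example, assuming the manifest above is saved as example-job.yaml (a hypothetical filename):

kubectl create -f example-job.yaml

kubectl apply would fail here because generateName leaves the resource without the fixed name that apply requires.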

Deployment

Prerequisites

  • docker
  • kubectl

Deploying this tool is as simple as running:

./build.sh [IMAGE_URL]

where [IMAGE_URL] is the full URL of the container image you want to build/push/deploy. For example, if your container registry is hosted on gcr.io/acme-123, you may run:

./build.sh gcr.io/acme-123/k8s-job-reaper
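
The script is roughly equivalent to the following (a simplified sketch, not the actual contents of build.sh; the image placeholder and its sed substitution are assumptions):

IMAGE_URL="$1"

# Build the reaper image and push it to your registry.
docker build -t "$IMAGE_URL" .
docker push "$IMAGE_URL"

# Point the CronJob manifest at the pushed image and deploy it.
sed "s|IMAGE_PLACEHOLDER|$IMAGE_URL|" cronjob.yaml | kubectl apply -f -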

Configuration

This tool supports the following configuration options.

Field | Location | Description | Default
----- | -------- | ----------- | -------
DEFAULT_TTL | Environment variable in cronjob.yaml | An optional global default TTL for completed Jobs. A ttl annotation in your manifest takes precedence over it. If a ttl annotation is set and the Job has not completed within that span, the reaper waits for both the TTL to expire and the Job to succeed before deleting it. | ""
DEFAULT_TTL_FAILED | Environment variable in cronjob.yaml | An optional global default TTL for uncompleted/failed Jobs (DEFAULT_TTL must also be set for this to take effect). | ""
NS_BLACKLIST | Environment variable in cronjob.yaml | A space-delimited list of Kubernetes Namespaces to ignore when looking for Jobs. | "kube-system"
schedule | Field in cronjob.yaml | The cron schedule on which to look for Jobs to delete. | "0 */1 * * *" (once an hour)
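
For reference, the environment variables live on the reaper container in cronjob.yaml. A trimmed sketch of what that can look like (the manifest in this repo may be structured differently; the CronJob name and namespace here follow the kube-system job-reaper pods shown in the troubleshooting section):

apiVersion: batch/v1beta1  # batch/v1 on Kubernetes 1.21+
kind: CronJob
metadata:
  name: job-reaper
  namespace: kube-system
spec:
  schedule: "0 */1 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: job-reaper
            image: gcr.io/acme-123/k8s-job-reaper
            env:
            - name: DEFAULT_TTL
              value: "12 hours"
            - name: DEFAULT_TTL_FAILED
              value: "12 hours"
            - name: NS_BLACKLIST
              value: "kube-system"
          restartPolicy: Never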

Use-case scenarios

The scenarios below reference the following Job fields:

  • .status.active: the number of actively running pods (integer)
  • .status.succeeded: the number of pods that reached phase Succeeded
  • .status.failed: the number of pods that reached phase Failed
  • .metadata.creationTimestamp: an RFC 3339 timestamp for when the object was created
  • .status.completionTime: the time at which the Job completed, in RFC 3339 form and in UTC

NOTE: as deployed, there will always be a DEFAULT_TTL and a DEFAULT_TTL_FAILED of 12 hours.

NS_BLACKLIST | DEFAULT_TTL | DEFAULT_TTL_FAILED | ttl annotation | .status.active | .status.succeeded | Result
------------ | ----------- | ------------------ | -------------- | -------------- | ----------------- | ------
set | n/a | n/a | n/a | n/a | n/a | No Job in a blacklisted namespace is deleted, since NS_BLACKLIST is set.
not set | set | set | set | 0 | 1 | The Job is deleted when it reaches the ttl annotation limit or DEFAULT_TTL, whichever comes first, measured from completionTime.
not set | set | set | not set | 0 | 0 | The Job is deleted based on DEFAULT_TTL_FAILED, measured from creationTimestamp.
not set | set | set | set | 0 | 0 | The Job is deleted based on DEFAULT_TTL_FAILED, measured from creationTimestamp.

NOTE: if .status.active is 1 in the scenarios above, the Job will never be deleted; its active pods can include pending, evicted, or orphaned pods.
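
Putting the scenarios together, the core reaping pass boils down to something like the following (a simplified sketch that ignores NS_BLACKLIST and DEFAULT_TTL_FAILED for brevity; it assumes GNU date and jq and is not the actual script):

kubectl get jobs --all-namespaces -o json \
  | jq -r '.items[]
      | select(.status.active == null)
      | [.metadata.namespace, .metadata.name,
         (.metadata.annotations.ttl // env.DEFAULT_TTL // ""),
         (.status.completionTime // .metadata.creationTimestamp)]
      | @tsv' \
  | while IFS=$'\t' read -r ns name ttl ts; do
      [ -z "$ttl" ] && continue                      # no TTL anywhere: keep the Job
      if [ "$(date +%s)" -ge "$(date -d "$ts + $ttl" +%s)" ]; then
        kubectl delete job "$name" -n "$ns"          # TTL expired: reap it
      fi
    done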

How to troubleshoot

To inspect the metadata and status fields of Jobs for the scenarios above:

kubectl get jobs --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata | has("ownerReferences") | not)
      | [.metadata.name, .metadata.namespace, .metadata.creationTimestamp,
         .status.completionTime, .metadata.annotations.ttl,
         .status.active, .status.succeeded]
      | @csv' \
  | sed 's/"//g'

How to look at logs

You can view the logs of the reaper pod using the commands below. Note that your pod name will differ; select the latest one.

kubectl get pods -n kube-system
kubectl logs job-reaper-1611779520-sh8qg -n kube-system
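
To pick up the newest reaper pod automatically instead of copying its name (assuming the pods are named job-reaper-* as above):

kubectl logs -n kube-system "$(kubectl get pods -n kube-system -o name \
    --sort-by=.metadata.creationTimestamp | grep job-reaper | tail -n 1)"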

You will see output similar to the following:

starting reaper with:
  DEFAULT_TTL: 1 minutes
  DEFAULT_TTL_FAILED: 12 hours
  NS_BLACKLIST:
Finished job kube-system/example-job-nottl expired (at 2021-01-27T16:43:10Z) due to global TTL(1 minutes), deleting
job.batch "example-job-nottl" deleted
reaper finished
