Panic because concurrent access of JobInfo.TaskMinAvailable #3007

Closed
Tongruizhe opened this issue Jul 30, 2023 · 1 comment · Fixed by #3008
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


Tongruizhe commented Jul 30, 2023

What happened:
The Volcano scheduler panics in CheckTaskValid because of concurrent access to JobInfo.TaskMinAvailable.

2023-06-17T00:58:17+08:00 fatal error: concurrent map iteration and map write
2023-06-17T00:58:17+08:00
2023-06-17T00:58:17+08:00 goroutine 492 [running]:
2023-06-17T00:58:17+08:00 volcano.sh/volcano/pkg/scheduler/api.(*JobInfo).CheckTaskValid(0xc00c04f3b0)
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/pkg/scheduler/api/job_info.go:720 +0x274
2023-06-17T00:58:17+08:00 volcano.sh/volcano/pkg/scheduler/plugins/gang.(*gangPlugin).OnSessionOpen.func1({0x1a8d0a0?, 0xc00c04f3b0?})
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/pkg/scheduler/plugins/gang/gang.go:61 +0x65
2023-06-17T00:58:17+08:00 volcano.sh/volcano/pkg/scheduler/framework.(*Session).JobValid(0xc00dd36b40, {0x1a8d0a0, 0xc00c04f3b0})
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:378 +0x159
2023-06-17T00:58:17+08:00 volcano.sh/volcano/pkg/scheduler/actions/backfill.(*Action).Execute(0x1af03597, 0xc00dd36b40)
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/pkg/scheduler/actions/backfill/backfill.go:51 +0x171
2023-06-17T00:58:17+08:00 volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc000348d20)
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:116 +0x348
2023-06-17T00:58:17+08:00 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc007dc7f00?)
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
2023-06-17T00:58:17+08:00 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x1d6ca20, 0xc004a135c0}, 0x1, 0x0003acba0)
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
2023-06-17T00:58:17+08:00 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
2023-06-17T00:58:17+08:00 k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x25
2023-06-17T00:58:17+08:00 created by volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run
2023-06-17T00:58:17+08:00 /go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:91 +0x1aa

What you expected to happen:
No panic occurs.

How to reproduce it (as minimally and precisely as possible):
This is a concurrency bug, so there is no guaranteed way to reproduce it. One necessary precondition is that the gang plugin is enabled, because this plugin calls the CheckTaskValid function.

The cause is in the following code:

// Clone is used to clone a jobInfo object
func (ji *JobInfo) Clone() *JobInfo {
	info := &JobInfo{
		UID:       ji.UID,
		Name:      ji.Name,
		Namespace: ji.Namespace,
		Queue:     ji.Queue,
		Priority:  ji.Priority,

		MinAvailable:   ji.MinAvailable,
		WaitingTime:    ji.WaitingTime,
		JobFitErrors:   ji.JobFitErrors,
		NodesFitErrors: make(map[TaskID]*FitErrors),
		Allocated:      EmptyResource(),
		TotalRequest:   EmptyResource(),

		PodGroup: ji.PodGroup.Clone(),

		TaskStatusIndex:       map[TaskStatus]tasksMap{},
		TaskMinAvailable:      ji.TaskMinAvailable,
		TaskMinAvailableTotal: ji.TaskMinAvailableTotal,
		Tasks:                 tasksMap{},
		Preemptable:           ji.Preemptable,
		RevocableZone:         ji.RevocableZone,
		Budget:                ji.Budget.Clone(),
	}

	ji.CreationTimestamp.DeepCopyInto(&info.CreationTimestamp)

	for _, task := range ji.Tasks {
		info.AddTaskInfo(task.Clone())
	}

	return info
}

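When a JobInfo is cloned, TaskMinAvailable is shallow-copied: the clone and the original share the same underlying map. Therefore, multiple goroutines may access the same map, and one goroutine iterating over it in CheckTaskValid while another writes to it triggers the fatal error above.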
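
To illustrate the aliasing, here is a minimal, self-contained sketch. It is not the scheduler's code and not necessarily the change made in the linked fix: `jobInfo`, `shallowClone`, and `deepClone` are hypothetical names, and the value type `int32` for the map is an assumption. It shows that assigning a map copies only a reference, and that allocating a new map and copying the entries gives the clone its own storage.

```go
package main

import "fmt"

// TaskID and jobInfo are simplified, hypothetical stand-ins for the
// scheduler's types; only the field relevant to this issue is modeled.
// The int32 value type is an assumption made for this sketch.
type TaskID string

type jobInfo struct {
	TaskMinAvailable map[TaskID]int32
}

// shallowClone mirrors the pattern in Clone above: the map value is
// assigned directly, so clone and original share one underlying map.
func (ji *jobInfo) shallowClone() *jobInfo {
	return &jobInfo{TaskMinAvailable: ji.TaskMinAvailable}
}

// deepClone allocates a new map and copies every entry, so later writes
// to the clone's map never touch the original's map.
func (ji *jobInfo) deepClone() *jobInfo {
	m := make(map[TaskID]int32, len(ji.TaskMinAvailable))
	for k, v := range ji.TaskMinAvailable {
		m[k] = v
	}
	return &jobInfo{TaskMinAvailable: m}
}

func main() {
	orig := &jobInfo{TaskMinAvailable: map[TaskID]int32{"worker": 2}}

	// A write through the shallow clone is visible in the original.
	shallow := orig.shallowClone()
	shallow.TaskMinAvailable["ps"] = 1
	fmt.Println(len(orig.TaskMinAvailable)) // prints 2

	// A write through the deep clone leaves the original unchanged.
	deep := orig.deepClone()
	deep.TaskMinAvailable["master"] = 1
	fmt.Println(len(orig.TaskMinAvailable)) // prints 2
	fmt.Println(len(deep.TaskMinAvailable)) // prints 3
}
```

A copy like this only isolates the clone after it has been made. The copy loop itself still reads the source map, so it needs the same protection against concurrent writers as any other read.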
Anything else we need to know?:

Environment:

  • Volcano Version: v1.7.0
  • Kubernetes version (use kubectl version): v1.25.0
Tongruizhe added the kind/bug label Jul 30, 2023
Tongruizhe commented

This issue can be assigned to me.
